[Rebuild] Idea for ODbL transition strategy

Dermot McNally dermotm at gmail.com
Tue Feb 7 18:48:59 GMT 2012


Hi folks,

I've been thinking some more about the mechanism Frederik suggests
(I've just left the whole mail quoted for context), and I like it more
and more. In particular, it seems to offer us opportunities to exploit
the required database schema changes to facilitate aspects of our
remapping and even perhaps to refine our approach to the trickier
parts of our go/no-go rules such as how to handle splits and merges.

So having regard to the mechanism as outlined, suppose we do the following:

* Apply the required database schema changes ASAP

* Ensure that "customers" of the database, particularly the API, will
tolerate both the schema changes and the fact that the new columns
will, for a period, not be guaranteed to be populated.

* In a slight break with Fred's approach, do _not_ (yet) assume that the
current version is always marked as clean (no licence transition has
taken place yet). Instead, let the API keep serving up whatever is in
there, clean or not (in effect, keep doing what we have been doing).

* Initially on a selective basis (by bbox? by editing user?), _do_
process the history records for objects and mark them with the statuses
Frederik suggests. For kicks, we could probably adapt our rails code to
report on the status of each historical version.

* In some cases, the processing of history will demonstrate a fully
clean object (let's assume for now that splits are not dealt with
explicitly; if they are, things get, hmm, tricky). In such cases, _do_
mark the current version as clean. This is nice for a few reasons:

  - Can front-load the processing for the bulk of all data to long
before the actual change

  - Can help create much better stats and visualisations (cleanmap,
badmap) than we have today, and they would be official

  - Could probably allow us, for objects with some badness in their
history, to experiment with different cleaning logics - for this, we
might want to introduce some extra status options ("harmless", or
whatever) to allow reversal of any experiments we might do.

* Ideally, we will quickly firm up on the remaining open issues for
our switchover logic. Once we do, we can, slowly, without excessive DB
load, process every object in the database and assign every version
and the current version a status. This gives us an even better set of
stats and visualisations, and the tools will from this point on show
exactly what will and won't survive the change. (unsolved issue: This
requires us to decide how to cope with potentially-cleaning changes to
unclean objects, something Fred's approach didn't have to consider).

* With sufficient thought, and assuming the issues already identified
can be managed, this would allow for an April transition based solely
on the rules we apply when interacting with the DB, the actual DB
changes having taken place beforehand.
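
To make the history-processing step a bit more concrete, here is a
minimal Python sketch of how one object's versions might be classified.
It assumes a deliberately simplified rule: a version is "ccbysa" if its
author is a decliner, or if it follows such a version and no later edit
has cleaned the object. The real go/no-go rules (splits, merges,
partial cleaning) are still open, and all names here (Version, cleaned,
classify_history, fully_clean) are hypothetical, not anything in our
rails code.

```python
from dataclasses import dataclass
from typing import List, Set

NORMAL = "normal"
CCBYSA = "ccbysa"


@dataclass
class Version:
    number: int
    uid: int
    # Hypothetical flag: this edit removed all decliner content
    # (for example, a remap that ends up tagged odbl=clean).
    cleaned: bool = False


def classify_history(history: List[Version], decliners: Set[int]) -> List[str]:
    """Return one status per version, in ascending version order."""
    statuses = []
    tainted = False
    for v in sorted(history, key=lambda v: v.number):
        if v.uid in decliners:
            tainted = True       # a decliner's edit taints this version
        elif v.cleaned:
            tainted = False      # an explicit cleaning edit ends the taint
        statuses.append(CCBYSA if tainted else NORMAL)
    return statuses


def fully_clean(statuses: List[str]) -> bool:
    """True only when no version in the history is problematic."""
    return all(s == NORMAL for s in statuses)
```

Only objects for which fully_clean() holds would have their current
version marked clean up front; anything with badness in its history
stays untouched until we agree on the cleaning logic.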



One parting comment, about a "feature" that seems to be present even
in Frederik's original mail - if, after the licence switchover, a user
suddenly agrees to the change, reprocessing the object history _could_
take place to attempt to recover tags (or even whole objects) lost in
the change. Clearly, with possible post-April remapping, that could be
dangerous, but perhaps we might contemplate tools, particularly for
territories where a lot is lost and remapping is known not to have
happened to any great extent.

Is this half-way practical?

Dermot



On 31 January 2012 21:44, Frederik Ramm <frederik at remote.org> wrote:
> Hi,
>
>   here's a sketch for how we could do "this whole thing". I'm not saying
> this is the best way forward but it seems to me that this is relatively
> smooth and pragmatic.
>
> It is based on an idea floated in DWG by Matt Amos; at the time called the
> "Interdiction API".
>
>
> 1 DATA STRUCTURES
> -----------------
>
> As you know, we have a set of "current" tables where the latest version of
> each object resides, and a set of what I call "history" tables, where all
> versions of each object are stored, including the current version. (In fact
> the "current" tables are named "current_nodes" etc, and the "history" tables
> are named just "nodes", not "history_nodes", but I call them "history" to
> make the distinction clear.)
>
> I propose to amend the history tables (nodes, ways, and relations) by adding
> one column that I will provisionally call "release_status".
>
> The "release_status" can have three values:
>
> 1 - normal. Can be released freely.
> 2 - ccbysa. Can be released only through a special API call that asks for
> CC-BY-SA data.
> 3 - suppressed. Can be released only to administrators. This is for future
> use.
>
> The latest version of any object - the one in the "current" table as well as
> the version with the highest version number in the history table - must
> *always* have a release_status of "normal".
>
> Adding this column to the ~ 2 billion rows in these tables would increase
> the database size by 2 GB plus a little overhead (when using a PostgreSQL
> type of "char" - double that when using "smallint"). As we currently use
> about 2.5 TB of disk space, the increase in size would be a negligible 0.1%.
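
The size estimate is easy to check. The sketch below just redoes the
arithmetic from the figures quoted above, assuming decimal units
(1 GB = 10^9 bytes) and ignoring per-row overhead and alignment
padding; the function names are made up for illustration.

```python
ROWS = 2_000_000_000        # "~ 2 billion rows", as quoted
CURRENT_DB_BYTES = 2.5e12   # "about 2.5 TB", as quoted


def added_bytes(rows: int, bytes_per_value: int) -> int:
    """Raw bytes added by one extra fixed-width column."""
    return rows * bytes_per_value


def relative_increase(rows: int, bytes_per_value: int,
                      current_bytes: float = CURRENT_DB_BYTES) -> float:
    """Added bytes as a fraction of the current database size."""
    return added_bytes(rows, bytes_per_value) / current_bytes


# PostgreSQL "char" is 1 byte wide, smallint is 2 bytes wide.
for type_name, width in (('"char"', 1), ("smallint", 2)):
    extra_gb = added_bytes(ROWS, width) / 1e9
    share = relative_increase(ROWS, width)
    print(f"{type_name}: +{extra_gb:.0f} GB, {share:.2%} of current size")
```

That works out to about 0.08% for "char" and 0.16% for smallint, which
is consistent with the "negligible 0.1%" above.
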
>
>
> 2 DATA INPUT AND OUTPUT
> -----------------------
>
> 2.1 Rails API/cgimap
>
> 2.1.1 Reading
>
> The most-used API read calls - the "map" call for reading data in a bounding
> box and the REST object access calls (/node/1 etc.) - would not have to be
> changed since they operate on the current tables only, and by definition the
> current tables only ever contain data in release_status "normal".
>
> Accesses to changesets would not have to be changed.
>
> Those calls that access object history - either specific versions like
> /node/1/1 or full history like /node/1/history - will have to be modified to
> omit data for versions that are not of release_status "normal". These
> versions could either be omitted completely, or they could be replaced by an
> object stub that still has the metadata but nothing else:
>
> <node id="1" version="1" user="fred" uid="999" changeset="1" lat="59"
> lon="8">
> <tag k="foo" v="bar" />
> </node>
> <node id="1" version="2" user="phil" uid="998" changeset="2"
> release_status="ccbysa" />
> <node id="1" version="3" user="joe" uid="997" changeset="3" lat="58" lon="7">
> <tag k="foo" v="bar" />
> </node>
>
> Potentially, we could add a special history call that people can use to
> access CC-BY-SA licensed old versions, e.g.
> /retrieve-cc-by-sa-licensed-data/node/1/2 would return
>
> <!-- this data is licensed under CC-BY-SA 2.0 only -->
> <node id="1" version="2" user="phil" uid="998" changeset="2" lat="56" lon="3">
> <tag k="foo" v="bar" />
> </node>
>
> This call would return nothing for a version flagged with release_status
> "normal" (since we don't want to hand out CC-BY-SA licensed versions of our
> standard database).
>
> (Later, we would perhaps add an option for specially authorized users to
> access "suppressed" objects too; that would be used for cases where data was
> removed due to a copyright infringement.)
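
The two read paths described above could look roughly like this. It is
only a sketch using Python's standard xml.etree.ElementTree, not the
rails or cgimap code; the function names and the choice of which
attributes count as "metadata" are my assumptions.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Attributes assumed to survive in a stub (metadata only, no lat/lon/tags).
METADATA = ("id", "version", "user", "uid", "changeset")


def public_view(element: ET.Element, release_status: str) -> ET.Element:
    """What the normal history calls would return for one version."""
    if release_status == "normal":
        return element
    # Build a stub: same element name, metadata attributes only,
    # plus the status so clients can see why the data is missing.
    stub = ET.Element(
        element.tag,
        {k: element.attrib[k] for k in METADATA if k in element.attrib},
    )
    stub.set("release_status", release_status)
    return stub


def ccbysa_view(element: ET.Element, release_status: str) -> Optional[ET.Element]:
    """What the special CC-BY-SA call would return: only 'ccbysa' versions."""
    return element if release_status == "ccbysa" else None
```

For example, public_view() applied to version 2 of node 1 with status
"ccbysa" gives a childless <node> element carrying id, version, user,
uid, changeset and release_status, while ccbysa_view() on a "normal"
version gives nothing at all.
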
>
> 2.1.2 Writing
>
> Whenever a new version is uploaded to OSM, everything works normally. The
> new version replaces the old version in the current tables, and is added
> into the history tables. The release_status of anything newly added is
> always "normal".
>
> 2.2 Osmosis (diffs)
>
> Diff creation happens normally. I believe that no changes to Osmosis would
> be required. Technically, Osmosis is capable of extracting old versions from
> the database and therefore would need to know about release_status, but in
> practice this operation is not supported on our database.
>
> 2.3 Planet dump and History planet
>
> The planet dump works off the current tables and therefore needs no changes
> at all. The history planet dump requires the same changes as discussed in
> 2.1.1, namely dropping those historic versions that have a release_status
> other than "normal", or replacing them with stubs.
>
>
> 3 GETTING THERE
> ---------------
>
> When the license change is executed, two things will have to be done.
>
> First, every single object we now have in our database must be put into a
> state where it is ODbL compatible. This means it must either be remapped to
> a state where it contains no contribution from a decliner (unless it is
> tagged odbl=clean), or it must be deleted. (Every "deleted" object is
> considered compatible with ODbL.)
>
> This can be achieved using normal API calls. Minus a few details, this
> basically means that we work on the area with standard mechanisms until the
> OSMI algorithm has nothing left to complain about.
>
> We can hope that a lot of this will have been done by normal mappers when
> the day comes, but where it has not been done, a bot can be written that
> simply evaluates the history for each problematic object and decides which
> action has to be taken to make this object compatible. Such a bot could be
> test-run on a small area even in today's production system.
>
> The second thing that needs to be done is setting the "release_status" flag
> to "ccbysa" on all historic versions of an object that are problematic. This
> is not just all versions created by a decliner (!) but also all versions
> derived from such versions, until such time as nothing of the decliner's
> data remains. This second step cannot be done through the API; but we can
> prepare the list of affected objects/versions based on a full history
> planet, out in the open, and then effectively run a long list of "update"
> statements. Since these will all be "update ... set release_status='ccbysa'
> where ...", they are reversible (by simply resetting release_status for
> everything).
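
As a rough illustration of that second step, the sketch below applies a
prepared list of (object id, version) pairs and shows the "reset
everything" reversal. It uses Python's sqlite3 module purely as a
stand-in for PostgreSQL so that it can be run anywhere; the id column
names (node_id, way_id, relation_id) and the function names are
assumptions, not taken from the real schema.

```python
import sqlite3
from typing import Iterable, Tuple

# Assumed history tables and their id columns (hypothetical names).
HISTORY_TABLES = {"nodes": "node_id", "ways": "way_id", "relations": "relation_id"}


def flag_ccbysa(conn: sqlite3.Connection, table: str,
                versions: Iterable[Tuple[int, int]]) -> None:
    """Mark the listed (object id, version) pairs as 'ccbysa'."""
    id_col = HISTORY_TABLES[table]  # KeyError on anything but a known table
    conn.executemany(
        f"UPDATE {table} SET release_status = 'ccbysa' "
        f"WHERE {id_col} = ? AND version = ?",
        list(versions),
    )


def reset_all(conn: sqlite3.Connection, table: str) -> None:
    """Undo the flagging by resetting every row to 'normal'."""
    HISTORY_TABLES[table]  # same guard against arbitrary table names
    conn.execute(f"UPDATE {table} SET release_status = 'normal'")
```

The list passed to flag_ccbysa() is the part that would be prepared in
the open from a full history planet. Note that a blanket reset_all()
would also clear any "suppressed" rows once that status is in use.
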
>
>
> 4 EVALUATION OF THIS APPROACH
> -----------------------------
>
> This approach is probably the least invasive of all possible approaches; it
> can be run on the live database (no dump/reimport), it keeps all the IDs in
> place, it changes the behaviour of the system only minimally, and it has the
> advantageous side effect of being usable for copyright infringement cases as
> well.
>
> This approach does not lead to a "pure" ODbL database; the database still
> contains CC-BY-SA elements but since we never publish that combined
> database, this should be legally clean.
>
> Non-relicensed data does not "vanish" in this approach; it leaves "scars" in
> the object history that are visible, and it can still be retrieved if
> someone explicitly asks for it. This can be considered a good or a bad
> thing. The history of changesets remains fully intact.
>
> In contrast to other solutions, this approach will not release storage space
> used for CC-BY-SA data, and will not re-number objects.
>
>
>
> --
> Frederik Ramm  ##  eMail frederik at remote.org  ##  N49°00'09" E008°23'33"



-- 
--------------------------------------
Igaühel on siin oma laul
ja ma oma ei leiagi üles


