[Rebuild] Idea for ODbL transition strategy

Frederik Ramm frederik at remote.org
Tue Jan 31 21:44:23 GMT 2012


Hi,

    here's a sketch for how we could do "this whole thing". I'm not 
saying this is the best way forward but it seems to me that this is 
relatively smooth and pragmatic.

It is based on an idea floated in DWG by Matt Amos; at the time called 
the "Interdiction API".


1 DATA STRUCTURES
-----------------

As you know, we have a set of "current" tables where the latest version 
of each object resides, and a set of what I call "history" tables, where 
all versions of each object are stored, including the current version. 
(In fact the "current" tables are named "current_nodes" etc, and the 
"history" tables are named just "nodes", not "history_nodes", but I call 
them "history" to make the distinction clear.)

I propose to amend the history tables (nodes, ways, and relations) by 
adding one column that I will provisionally call "release_status".

The "release_status" can have three values:

1 - normal. Can be released freely.
2 - ccbysa. Can be released only through a special API call that asks 
for CC-BY-SA data.
3 - suppressed. Can be released only to administrators. This is for 
future use.

The latest version of any object - the one in the "current" table as 
well as the version with the highest version number in the history table 
- must *always* have a release_status of "normal".
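
As a rough illustration of the three values and of the "latest version 
is always normal" rule, here is a minimal Python sketch. The names 
ReleaseStatus and latest_is_normal are made up for this example and do 
not exist in the Rails port or in cgimap.

```python
from enum import IntEnum


class ReleaseStatus(IntEnum):
    """Hypothetical encoding of the proposed release_status column."""
    NORMAL = 1      # can be released freely
    CCBYSA = 2      # only through the special CC-BY-SA call
    SUPPRESSED = 3  # administrators only, for future use


def latest_is_normal(history):
    """Check the invariant for one object.

    history is a list of (version, ReleaseStatus) tuples; the entry with
    the highest version number must have status NORMAL.
    """
    if not history:
        return True
    latest = max(history, key=lambda row: row[0])
    return latest[1] == ReleaseStatus.NORMAL
```

A check like this could be run over a history dump after flagging, to 
confirm that no object ends on a non-normal version.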

Adding this column to the ~ 2 billion rows in these tables would 
increase the database size by 2 GB plus a little overhead (when using a 
PostgreSQL type of "char" - double that when using "smallint"). As we 
currently use about 2.5 TB of disk space, the increase in size would be 
a negligible 0.1%.
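
The arithmetic behind that estimate, as a small Python sketch. It 
assumes the round figures given above (2 billion rows, 2.5 TB) and counts 
only the raw column bytes (1 for "char", 2 for "smallint"); per-row 
alignment and index overhead are ignored.

```python
ROWS = 2_000_000_000   # approximate number of history rows (from the text)
DB_BYTES = 2.5e12      # approximate current database size, 2.5 TB


def added_bytes(rows, bytes_per_value):
    """Raw storage added by one extra fixed-width column."""
    return rows * bytes_per_value


def percent_increase(rows, bytes_per_value, db_bytes=DB_BYTES):
    """Added storage as a percentage of the current database size."""
    return 100 * added_bytes(rows, bytes_per_value) / db_bytes


char_gb = added_bytes(ROWS, 1) / 1e9       # 2.0 GB with "char"
smallint_gb = added_bytes(ROWS, 2) / 1e9   # 4.0 GB with "smallint"
char_pct = percent_increase(ROWS, 1)       # about 0.08, i.e. roughly 0.1%
```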


2 DATA INPUT AND OUTPUT
-----------------------

2.1 Rails API/cgimap

2.1.1 Reading

The most-used API read calls - the "map" call for reading data in a 
bounding box and the REST object access calls (/node/1 etc.) - would not 
have to be changed since they operate on the current tables only, and by 
definition the current tables only ever contain data in release_status 
"normal".

Accesses to changesets would not have to be changed.

Those calls that access object history - either specific versions like 
/node/1/1 or full history like /node/1/history - will have to be 
modified to omit data for versions that are not of release_status 
"normal". These versions could either be omitted completely, or they 
could be replaced by an object stub that still has the metadata but 
nothing else:

<node id="1" version="1" user="fred" uid="999" changeset="1" lat="59" 
lon="8">
<tag k="foo" v="bar" />
</node>
<node id="1" version="2" user="phil" uid="998" changeset="2" 
release_status="ccbysa" />
<node id="1" version="3" user="joe" uid="997" changeset="3" lat="58" 
lon="7">
<tag k="foo" v="bar" />
</node>
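
The stub variant could be produced along these lines. This is only a 
sketch in Python using the standard library's ElementTree; the function 
names and the dict layout of a version are assumptions made for the 
example, not the actual API code.

```python
import xml.etree.ElementTree as ET

META_KEYS = ("id", "version", "user", "uid", "changeset")


def render_version(v):
    """Render one node version; non-normal versions become metadata stubs."""
    attrs = {k: str(v[k]) for k in META_KEYS}
    if v["release_status"] == "normal":
        attrs["lat"] = str(v["lat"])
        attrs["lon"] = str(v["lon"])
        node = ET.Element("node", attrs)
        for key, value in v["tags"].items():
            ET.SubElement(node, "tag", {"k": key, "v": value})
    else:
        # Stub: keep the metadata, drop position and tags.
        attrs["release_status"] = v["release_status"]
        node = ET.Element("node", attrs)
    return node


def render_history(versions):
    """Render all versions of one node, oldest first, as an XML string."""
    root = ET.Element("osm")
    for v in sorted(versions, key=lambda v: v["version"]):
        root.append(render_version(v))
    return ET.tostring(root, encoding="unicode")
```

Omitting the versions completely instead would just mean skipping them 
in render_history.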

Potentially, we could add a special history call that people can use to 
access CC-BY-SA licensed old versions, e.g. 
/retrieve-cc-by-sa-licensed-data/node/1/2 would return

<!-- this data is licensed under CC-BY-SA 2.0 only -->
<node id="1" version="2" user="phil" uid="998" changeset="2" lat="56" 
lon="3">
<tag k="foo" v="bar" />
</node>

This call would return nothing for a version flagged with release_status 
"normal" (since we don't want to hand out CC-BY-SA licensed versions of 
our standard database).
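
The selection rule for such a call might look like the following Python 
sketch. The function name and the dict layout are hypothetical; treating 
"suppressed" versions like "normal" ones here (returning nothing) is my 
reading of the status list in section 1.

```python
def retrieve_ccbysa(history, version):
    """Return the requested version only if it is flagged "ccbysa".

    history is a list of dicts with at least "version" and
    "release_status". Returns None for "normal" and "suppressed"
    versions and for versions that do not exist.
    """
    for row in history:
        if row["version"] == version:
            if row["release_status"] == "ccbysa":
                # Attach the licence so the caller can emit the comment line.
                return dict(row, license="CC-BY-SA 2.0")
            return None
    return None
```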

(Later, we would perhaps add an option for specially authorized users to 
access "suppressed" objects too; that would be used for cases where data 
was removed due to a copyright infringement.)

2.1.2 Writing

Whenever a new version is uploaded to OSM, everything works normally. 
The new version replaces the old version in the current tables, and is 
added into the history tables. The release_status of anything newly 
added is always "normal".
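
To make the write path concrete, here is a toy model in Python with an 
in-memory SQLite database. The two tables are heavily simplified 
stand-ins for "current_nodes" and "nodes" (the real schema has many more 
columns), and upload_node is a made-up helper, not the Rails code.

```python
import sqlite3

SCHEMA = """
CREATE TABLE current_nodes (
    id INTEGER PRIMARY KEY, version INTEGER, lat REAL, lon REAL);
CREATE TABLE nodes (
    id INTEGER, version INTEGER, lat REAL, lon REAL,
    release_status TEXT NOT NULL DEFAULT 'normal',
    PRIMARY KEY (id, version));
"""


def open_db():
    """Create an empty in-memory database with the simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def upload_node(conn, node_id, lat, lon):
    """Store a new version: replace the current row, append to history."""
    row = conn.execute(
        "SELECT version FROM current_nodes WHERE id = ?", (node_id,)
    ).fetchone()
    version = row[0] + 1 if row else 1
    with conn:  # one transaction for both tables
        conn.execute(
            "INSERT OR REPLACE INTO current_nodes (id, version, lat, lon) "
            "VALUES (?, ?, ?, ?)", (node_id, version, lat, lon))
        # Anything newly added is always "normal".
        conn.execute(
            "INSERT INTO nodes (id, version, lat, lon, release_status) "
            "VALUES (?, ?, ?, ?, 'normal')", (node_id, version, lat, lon))
    return version
```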

2.2 Osmosis (diffs)

Diff creation happens normally. I believe that no changes to Osmosis 
would be required. Technically, Osmosis is capable of extracting old 
versions from the database and therefore would need to know about 
release_status, but in practice this operation is not supported on our 
database.

2.3 Planet dump and History planet

The planet dump works off the current tables and therefore needs no 
changes at all. The history planet dump requires the same changes as 
discussed in 2.1.1, namely dropping those historic versions that have a 
release status other than "normal", or replacing them with stubs.


3 GETTING THERE
---------------

When the license change is executed, two things will have to be done.

First, every single object we now have in our database must be put into 
a state where it is ODbL compatible. This means that, unless it is 
tagged odbl=clean, it must either be remapped to a state where it 
contains no contribution from a decliner, or it must be deleted. (Every 
"deleted" object is considered compatible with ODbL.)

This can be achieved using normal API calls. Minus a few details, this 
basically means that we work on the area with standard mechanisms until 
the OSMI algorithm has nothing left to complain about.

We can hope that a lot of this will have been done by normal mappers 
when the day comes, but where it has not been done, a bot can be written 
that simply evaluates the history for each problematic object and 
decides which action has to be taken to make this object compatible. 
Such a bot could be test-run on a small area even in today's production 
system.

The second thing that needs to be done is setting the "release_status" 
flag to "ccbysa" on all historic versions of an object that are 
problematic. This is not just all versions created by a decliner (!) but 
also all versions derived from such versions, until such time as nothing 
of the decliner's data remains. This second step cannot be done through 
the API; but we can prepare the list of affected objects/versions based 
on a full history planet, out in the open, and then effectively run a 
long list of "update" statements. Since these will all be "update ... 
set release_status='ccbysa' where ...", they are reversible (by simply 
resetting release_status for everything).
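
The following Python sketch shows the shape of that second step on a toy 
model. The "problematic" rule used here is a deliberately crude 
assumption (a version is flagged if a decliner made it, or if it still 
carries a tag key/value pair first introduced by a decliner); it ignores 
geometry and is not the OSMI algorithm. The table layout and all function 
names are made up for the example.

```python
import sqlite3


def problematic_versions(versions, decliners):
    """Toy rule: return the version numbers that would be flagged.

    versions: list of dicts with "version", "uid" and "tags", oldest first.
    decliners: set of uids that have not agreed to the new licence.
    """
    tainted = set()   # (key, value) pairs introduced by a decliner
    prev_tags = {}
    flagged = []
    for v in versions:
        tags = v["tags"]
        new_pairs = {(k, val) for k, val in tags.items()
                     if prev_tags.get(k) != val}
        if v["uid"] in decliners:
            tainted |= new_pairs
        tainted &= set(tags.items())  # pairs that are gone stop mattering
        if v["uid"] in decliners or tainted:
            flagged.append(v["version"])
        prev_tags = tags
    return flagged


def apply_flags(conn, node_id, versions):
    """Run the prepared list of update statements for one node."""
    with conn:
        conn.executemany(
            "UPDATE nodes SET release_status = 'ccbysa' "
            "WHERE id = ? AND version = ?",
            [(node_id, v) for v in versions])


def reset_flags(conn):
    """Undo the flagging by resetting release_status everywhere."""
    with conn:
        conn.execute("UPDATE nodes SET release_status = 'normal' "
                     "WHERE release_status = 'ccbysa'")


def make_demo_db(node_id, version_numbers):
    """In-memory stand-in for the history table, all rows "normal"."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE nodes (id INTEGER, version INTEGER, "
                 "release_status TEXT NOT NULL DEFAULT 'normal')")
    with conn:
        conn.executemany("INSERT INTO nodes (id, version) VALUES (?, ?)",
                         [(node_id, n) for n in version_numbers])
    return conn
```

Because the list is computed outside the database, it can be published 
and reviewed before any update statement is run.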


4 EVALUATION OF THIS APPROACH
-----------------------------

This approach is probably the least invasive of all possible approaches; 
it can be run on the live database (no dump/reimport), it keeps all the 
IDs in place, it changes the behaviour of the system only minimally, and 
it has the advantageous side effect of being usable for copyright 
infringement cases as well.

This approach does not lead to a "pure" ODbL database; the database 
still contains CC-BY-SA elements, but since we never publish that 
combined database, this should be legally clean.

Non-relicensed data does not "vanish" in this approach; it leaves 
"scars" in the object history that are visible, and it can still be 
retrieved if someone explicitly asks for it. This can be considered a 
good or a bad thing. The history of changesets remains fully intact.

In contrast to other solutions, this approach will not release storage 
space used for CC-BY-SA data, and will not re-number objects.



-- 
Frederik Ramm  ##  eMail frederik at remote.org  ##  N49°00'09" E008°23'33"


