[Rebuild] Idea for ODbL transition strategy
Frederik Ramm
frederik at remote.org
Tue Jan 31 21:44:23 GMT 2012
Hi,
here's a sketch for how we could do "this whole thing". I'm not
saying this is the best way forward but it seems to me that this is
relatively smooth and pragmatic.
It is based on an idea floated in DWG by Matt Amos, at the time called
the "Interdiction API".
1 DATA STRUCTURES
-----------------
As you know, we have a set of "current" tables where the latest version
of each object resides, and a set of what I call "history" tables, where
all versions of each object are stored, including the current version.
(In fact the "current" tables are named "current_nodes" etc, and the
"history" tables are named just "nodes", not "history_nodes", but I call
them "history" to make the distinction clear.)
I propose to amend the history tables (nodes, ways, and relations) by
adding one column that I will provisionally call "release_status".
The "release_status" can have three values:
1 - normal. Can be released freely.
2 - ccbysa. Can be released only through a special API call that asks
for CC-BY-SA data.
3 - suppressed. Can be released only to administrators. This is for
future use.
The latest version of any object - the one in the "current" table as
well as the version with the highest version number in the history table
- must *always* have a release_status of "normal".
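To make the rule concrete, here is a minimal sketch in Python. The names ReleaseStatus and latest_is_normal are made up for illustration and are not part of any existing code; only the three values and their numbers come from the list above.

```python
from enum import IntEnum


class ReleaseStatus(IntEnum):
    """The three proposed values; the numbers follow the list above."""
    NORMAL = 1      # can be released freely
    CCBYSA = 2      # only through a special CC-BY-SA API call
    SUPPRESSED = 3  # administrators only (future use)


def latest_is_normal(history):
    """Check the "latest version is always normal" rule for one object.

    `history` is a list of (version, ReleaseStatus) pairs, a hypothetical
    in-memory stand-in for that object's rows in a history table.
    """
    if not history:
        return True  # nothing stored, nothing to violate
    _, latest_status = max(history, key=lambda row: row[0])
    return latest_status is ReleaseStatus.NORMAL
```

A check like this could be run over the history tables after any bulk update to confirm that no object's newest version was flagged by mistake.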
Adding this column to the ~ 2 billion rows in these tables would
increase the database size by 2 GB plus a little overhead (when using a
PostgreSQL type of "char" - double that when using "smallint"). As we
currently use about 2.5 TB of disk space, the increase in size would be
a negligible 0.1%.
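The estimate can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch only: the row count and current database size are the rough figures quoted above, and per-row overhead such as alignment padding is ignored.

```python
ROWS = 2_000_000_000        # approx. rows in the history tables (figure from the text)
CURRENT_DB_BYTES = 2.5e12   # approx. 2.5 TB currently in use (figure from the text)
BYTES_CHAR = 1              # PostgreSQL "char" stores a single byte
BYTES_SMALLINT = 2          # PostgreSQL smallint stores two bytes


def added_size(bytes_per_row):
    """Return (added gigabytes, relative increase in percent), ignoring overhead."""
    added = ROWS * bytes_per_row
    return added / 1e9, 100 * added / CURRENT_DB_BYTES


for name, width in (('"char"', BYTES_CHAR), ("smallint", BYTES_SMALLINT)):
    gigabytes, percent = added_size(width)
    print(f"{name}: about {gigabytes:.0f} GB, about {percent:.2f}% of the database")
```

With these inputs the "char" case comes to 2 GB, or 0.08% of 2.5 TB, which rounds to the 0.1% quoted above; smallint doubles both numbers.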
2 DATA INPUT AND OUTPUT
-----------------------
2.1 Rails API/cgimap
2.1.1 Reading
The most-used API read calls - the "map" call for reading data in a
bounding box and the REST object access calls (/node/1 etc.) - would not
have to be changed since they operate on the current tables only, and by
definition the current tables only ever contain data in release_status
"normal".
Accesses to changesets would not have to be changed.
Those calls that access object history - either specific versions like
/node/1/1 or full history like /node/1/history - will have to be
modified to omit data for versions that are not of release_status
"normal". These versions could either be omitted completely, or they
could be replaced by an object stub that still has the metadata but
nothing else:
<node id="1" version="1" user="fred" uid="999" changeset="1" lat="59"
lon="8">
<tag k="foo" v="bar" />
</node>
<node id="1" version="2" user="phil" uid="998" changeset="2"
release_status="ccbysa" />
<node id="1" version="3" user="joe" uid="997" changeset="3" lat="58" lon="7">
<tag k="foo" v="bar" />
</node>
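The two options (omit, or replace with a stub) could be sketched as follows. This is an illustration in Python, not the Rails or cgimap code; the function name public_history, the dict layout, and the key names are all assumptions made for the example.

```python
# Metadata that a stub keeps; everything else (lat, lon, tags, ...) is dropped.
META_KEYS = ("id", "version", "user", "uid", "changeset", "release_status")


def public_history(versions, use_stubs=True):
    """Filter one object's history for the normal, non-CC-BY-SA API.

    `versions` is a list of dicts, a hypothetical stand-in for history rows.
    Each dict has the META_KEYS plus payload keys such as "lat", "lon", "tags".
    """
    result = []
    for v in versions:
        if v["release_status"] == "normal":
            # Normal versions go out in full, without the internal flag.
            result.append({k: val for k, val in v.items() if k != "release_status"})
        elif use_stubs:
            # Non-normal versions keep their metadata but lose the payload.
            result.append({k: v[k] for k in META_KEYS})
        # With use_stubs=False, non-normal versions are omitted completely.
    return result
```

With stubs, the version numbers in the output stay contiguous, which may be easier on clients that expect version N to follow version N-1; omitting the versions is simpler but leaves gaps.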
Potentially, we could add a special history call that people can use to
access CC-BY-SA licensed old versions, e.g.
/retrieve-cc-by-sa-licensed-data/node/1/2 would return
<!-- this data is licensed under CC-BY-SA 2.0 only -->
<node id="1" version="2" user="phil" uid="998" changeset="2" lat="56" lon="3">
<tag k="foo" v="bar" />
</node>
This call would return nothing for a version flagged with release_status
"normal" (since we don't want to hand out CC-BY-SA licensed versions of
our standard database).
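The behaviour of such a call could look roughly like this in Python. The function name retrieve_ccbysa and the dict layout are hypothetical; the only rule taken from the text is that versions flagged "normal" are not handed out through this call.

```python
def retrieve_ccbysa(versions, version_number):
    """Return the requested version only if it is flagged "ccbysa".

    `versions` is a list of dicts with at least the keys "version" and
    "release_status" (a hypothetical stand-in for history rows). Returns
    None for "normal" versions, for "suppressed" versions, and for
    versions that do not exist.
    """
    for v in versions:
        if v["version"] == version_number:
            return v if v["release_status"] == "ccbysa" else None
    return None
```

In this sketch "suppressed" versions are also withheld, on the assumption that they stay reserved for administrators.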
(Later, we would perhaps add an option for specially authorized users to
access "suppressed" objects too; that would be used for cases where data
was removed due to a copyright infringement.)
2.1.2 Writing
Whenever a new version is uploaded to OSM, everything works normally.
The new version replaces the old version in the current tables, and is
added into the history tables. The release_status of anything newly
added is always "normal".
2.2 Osmosis (diffs)
Diff creation happens normally. I believe that no changes to Osmosis
would be required. Technically, Osmosis is capable of extracting old
versions from the database and therefore would need to know about
release_status, but in practice this operation is not supported on our
database.
2.3 Planet dump and History planet
The planet dump works off the current tables and therefore needs no
changes at all. The history planet dump requires the same changes as
discussed in 2.1.1, namely dropping those historic versions that have a
release status other than "normal", or replacing them with stubs.
3 GETTING THERE
---------------
When the license change is executed, two things will have to be done.
First, every single object we now have in our database must be put into
a state where it is ODbL compatible. This means it must either be
remapped to a state where it contains no contribution from a decliner
(unless it is tagged odbl=clean), or it must be deleted. (Every
"deleted" object is considered compatible with ODbL.)
This can be achieved using normal API calls. Minus a few details, this
basically means that we work on the area with standard mechanisms until
the OSMI algorithm has nothing left to complain about.
We can hope that a lot of this will have been done by normal mappers
when the day comes, but where it has not been done, a bot can be written
that simply evaluates the history for each problematic object and
decides which action has to be taken to make this object compatible.
Such a bot could be test-run on a small area even in today's production
system.
The second thing that needs to be done is setting the "release_status"
flag to "ccbysa" on all historic versions of an object that are
problematic. This is not just all versions created by a decliner (!) but
also all versions derived from such versions, until such time as nothing
of the decliner's data remains. This second step cannot be done through
the API; but we can prepare the list of affected objects/versions based
on a full history planet, out in the open, and then effectively run a
long list of "update" statements. Since these will all be "update ...
set release_status='ccbysa' where ...", they are reversible (by simply
resetting release_status for everything).
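To illustrate the update-and-undo idea, here is a small self-contained Python sketch. It uses an in-memory SQLite database as a stand-in for our PostgreSQL server, and a much simplified "nodes" table with only the columns needed for the example; the function names and the affected list are made up.

```python
import sqlite3


def mark_ccbysa(conn, affected):
    """Flag each listed (node_id, version) pair as "ccbysa"."""
    conn.executemany(
        "UPDATE nodes SET release_status = 'ccbysa' "
        "WHERE node_id = ? AND version = ?",
        affected,
    )
    conn.commit()


def reset_ccbysa(conn):
    """Undo the marking. Only touches "ccbysa" rows, so "suppressed" rows
    (should they exist one day) are left alone; that restriction is an
    assumption of this sketch."""
    conn.execute(
        "UPDATE nodes SET release_status = 'normal' "
        "WHERE release_status = 'ccbysa'"
    )
    conn.commit()


def demo():
    """Mark one historic version, then reset; return both table states."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE nodes ("
        " node_id INTEGER, version INTEGER,"
        " release_status TEXT NOT NULL DEFAULT 'normal',"
        " PRIMARY KEY (node_id, version))"
    )
    conn.executemany(
        "INSERT INTO nodes (node_id, version) VALUES (?, ?)",
        [(1, 1), (1, 2), (1, 3)],
    )
    query = "SELECT version, release_status FROM nodes ORDER BY version"
    mark_ccbysa(conn, [(1, 2)])  # list prepared offline from a history planet
    marked = conn.execute(query).fetchall()
    reset_ccbysa(conn)
    restored = conn.execute(query).fetchall()
    conn.close()
    return marked, restored
```

The list passed to mark_ccbysa is where all the real work lies; working out which versions still contain a decliner's data is not covered by this sketch.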
4 EVALUATION OF THIS APPROACH
-----------------------------
This approach is probably the least invasive of all possible approaches;
it can be run on the live database (no dump/reimport), it keeps all the
IDs in place, it changes the behaviour of the system only minimally, and
it has the advantageous side effect of being usable for copyright
infringement cases as well.
This approach does not lead to a "pure" ODbL database; the database
still contains CC-BY-SA elements but since we never publish that
combined database, this should be legally clean.
Non-relicensed data does not "vanish" in this approach; it leaves
"scars" in the object history that are visible, and it can still be
retrieved if someone explicitly asks for it. This can be considered a
good or a bad thing. The history of changesets remains fully intact.
In contrast to other solutions, this approach will not release storage
space used for CC-BY-SA data, and will not re-number objects.
--
Frederik Ramm ## eMail frederik at remote.org ## N49°00'09" E008°23'33"