[Rebuild] Do I win a prize if I am the first to post?

errt at gmx.de errt at gmx.de
Wed Jan 11 20:06:19 GMT 2012


So, I'll give this another try; hopefully this goes to the whole list now.

Hi everyone,
thanks Frederik for that initial impulse.

First of all, I'm not that deep into all the technical points, so
whatever I say might be total nonsense; I'll just write up what I think :p

That's quite a list of possible ways to handle objects edited by both
agreers and decliners, all with their benefits and problems. I numbered
them in the quote below for easier reference. My own idea would be
somewhere in between some of the ones you listed, much like what your
method [2] is about:
Let's start with a clean database and then go through all changesets, 
just applying those done by agreers. If something isn't there to be 
changed (e.g. a changeset removed a tag, but that tag hasn't been added 
to the new database, as it was entered by a decliner), just ignore that 
change.
That's probably not technically trivial, especially in cases where
nodes are added to a way, but the way no longer has the same structure
it had when the real changeset happened. Changing tags isn't easy
either, as small changes should probably not be carried over (e.g. a
decliner entered a street name and someone just added an accent or
something like that), but bigger ones should be (e.g. a decliner added
a street name and someone replaced it completely). But if we can figure
out how to solve such cases, I think this method would have some clear
benefits.
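
One way to tell a small touch-up from a real replacement might be a
string-similarity check. The sketch below is only an illustration: the
function name, the 0.5 threshold and the use of Python's difflib are all
assumptions, and whether such a measure means anything legally is a
separate question.

```python
from difflib import SequenceMatcher


def agreer_value_is_independent(decliner_value, agreer_value, threshold=0.5):
    """Guess whether an agreer's new tag value is a fresh contribution.

    Returns True when the two values are dissimilar enough that the
    agreer's value could be kept on its own, and False when the change
    looks like a small correction of the decliner's value.
    The threshold is an arbitrary placeholder, not a tested number.
    """
    matcher = SequenceMatcher(None, decliner_value.lower(), agreer_value.lower())
    return matcher.ratio() < threshold


# Adding an accent: looks like a correction of the decliner's name.
print(agreer_value_is_independent("Rue de l'Eglise", "Rue de l'Église"))
# Exchanging the name completely: looks like the agreer's own data.
print(agreer_value_is_independent("Blah Road", "Station Street"))
```

A check like this could only flag candidates; borderline cases would
still need a decision by a human or by a more careful rule.
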
First of all, it would leave us with an unbroken history. No need to
change any consumers of the data (as in [1] and [6], possibly also in
[2]), as the history appears to be continuous (and is, somewhat). The
API also doesn't need any changes, as the database won't have any holes
or objects flagged as invisible. Of course, this would falsify
history, as changesets that happened aren't included anymore and later
changesets will turn into something they never really were. But I
think we should take that step. Some decliners probably wouldn't want to
appear in the history anymore after the changeover, and some might even
reconsider their decision if not only the objects they created and the
data they brought into the project are deleted, but their name is also
removed from the history (note that this should not be the reason for
going this way, but this side effect might be considered positive).
Secondly, the history would also be clean, in that no non-relicensed
data will be in there. Nobody will be tempted to recover data from the
history that's not clean (of course, it could be recovered from the last
CC planet dump, but that's a lot more difficult), especially not new
mappers in the future who might not know much about the license change
and just discover some information in the history and recover it. If
done this way, the missing versions in the history probably won't do
much harm, as no information from them lives on. But if we sort of
merge all versions of an object into one or two versions (as in [4] and
[5]), lots of information about the data (the changeset comments, times
and creators, the source information and more) is lost, and this could
lead to problems if anything has to be traced back.

So for your example, the result would look like this:
Version 1 of way created by woodpeck
Version 2: woodpeck adds "oneway=yes" (and just "oneway=yes" no
streetname in this version, anywhere)

No data by a decliner lives on and the history is continuous and clear. 
That was an easy one, I know, other changes will be much more difficult, 
but perhaps we can find ways to deal with at least most of the problems.
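
The replay idea for this easy case can be sketched in a few lines of
Python. This is only a rough illustration under simplifying assumptions:
the Edit class, the replay function and the sample data (including the
"highway=residential" tag and the user name "decliner") are made up
here, only tags are handled, and geometry or way membership is ignored.

```python
from dataclasses import dataclass, field


@dataclass
class Edit:
    """One edit to one object, taken from a changeset (simplified)."""
    user: str
    set_tags: dict = field(default_factory=dict)  # tags added or changed
    del_tags: tuple = ()                          # tag keys removed


def replay(edits, agreers):
    """Replay edits in order, keeping only those made by agreers.

    Returns the rebuilt history as (version, user, tags) tuples.
    Version numbers are reassigned, so the rebuilt history has no holes.
    """
    tags = {}
    history = []
    for edit in edits:
        if edit.user not in agreers:
            continue  # decliner edit: never enters the new database
        new_tags = dict(tags)
        for key in edit.del_tags:
            # The tag may be missing because a decliner added it: ignore.
            new_tags.pop(key, None)
        new_tags.update(edit.set_tags)
        if history and new_tags == tags:
            continue  # the edit became a no-op after replay: no new version
        tags = new_tags
        history.append((len(history) + 1, edit.user, dict(tags)))
    return history


# Hypothetical data modelled on the example: an agreer creates the way,
# a decliner adds the name, the agreer adds oneway=yes.
edits = [
    Edit("woodpeck", set_tags={"highway": "residential"}),
    Edit("decliner", set_tags={"name": "Blah Road"}),
    Edit("woodpeck", set_tags={"oneway": "yes"}),
]
for version in replay(edits, agreers={"woodpeck"}):
    print(version)
```

The sketch applies every tag value set by an agreer in full, so it does
not yet distinguish a small correction of a decliner's tag from a
complete replacement; that would need an extra rule.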

Another problem is that it's not easy to recover data if a decliner
agrees after the change date, but I'm not sure we even want that option
(not the possibility to agree after the change date, just the ability to
recover their previously dropped data).

As I said, I'm not that deeply technically involved and all this might
be plain bullshit, but still I think this might be the best option: a
clean history, no changes to any programs consuming the data, as much
data kept as possible, and the possibility of fine-grained exceptions
to keep even more data if there are no legal problems, even if it's the
most technically challenging method.

And now some final thoughts on your list, just to classify:
There are in fact two possible ways for the change (or does anyone see
more?). Either we drop decliners' changesets and object versions, or
flag them so they won't be delivered any more, but leave all others in
their place, so information is retained but programs will need to
account for the holes in the history. Or we really rebuild the
database, so version numbers change and information is effectively
lost, not just hidden, but with the history being continuous and no
changes needed for the consumers.
So if there are just these two models (or a small number of them),
perhaps we should first decide on the general way to go and work out
the details of 'what will happen to this tag or that node in this
specific example' later. As stated above, I currently favor the latter
option of a real rebuild, but let's see what the discussion will lead to.

As for your mention of things like densifying the id space, I think
this could be a real option, especially in case of a real rebuild (as
defined above). The references at least to object versions will be
broken already, so we can just go forward and break the references to
the objects too; this wouldn't do that much more harm. Or does anyone
know of external databases having references to our objects that would
be broken but should not be? This would have to be done in a second
processing step after the actual rebuild, though, I think, as the
references to old objects that will be dropped have to be removed
first. Other changes like a world-wide deletion of created_by tags or
similar could also be done without too much more effort. We could fix
common typos or anything like that on a world-wide scale if we already
have to touch every object.
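
As a sketch of what that second step might look like for nodes and
ways, assuming a much simplified in-memory model (the densify_ids
function and the dict layout are invented for illustration and have
nothing to do with the real database schema):

```python
def densify_ids(nodes, ways):
    """Renumber node ids to 1..n and rewrite way references to match.

    nodes: dict mapping old node id -> node data
    ways:  dict mapping way id -> list of old node ids
    References to nodes that no longer exist are dropped, since they
    have no new id to point to. Way ids are left unchanged here.
    """
    new_id = {old: new for new, old in enumerate(sorted(nodes), start=1)}
    new_nodes = {new_id[old]: data for old, data in nodes.items()}
    new_ways = {
        way_id: [new_id[ref] for ref in refs if ref in new_id]
        for way_id, refs in ways.items()
    }
    return new_nodes, new_ways, new_id


# Node 77 was dropped during the rebuild, so the way loses that reference.
nodes = {10: "a", 250: "b", 9000: "c"}
ways = {1: [10, 77, 9000]}
new_nodes, new_ways, mapping = densify_ids(nodes, ways)
print(new_nodes, new_ways, mapping)
```

Returning the old-to-new mapping would at least let anyone holding
external references translate them once.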

Well, just my two cents, and probably enough for my first post on this 
list, too,
Regards,
Dominik

On 11.1.2012 01:05, Frederik Ramm wrote:
> [...]
> [1]One could think "let's just keep those versions done by agreers, 
> and drop those by decliners, and let's make a new version of all 
> objects that contains only the content not added by decliners."
>
> This would lead to a situation where some versions are missing. Parts 
> of our Rails code might have to be hardened against that - it is 
> possible that somewhere we have code that just counts versions from 1 
> to n. Also it is possible that client software out in the wild has 
> such problems, and if we decide to go this way it would be good to 
> offer something like relicensing.dev.openstreetmap.org with such a 
> "database with holes" so that clients can be tested against that.
>
> Then there is the issue that data by decliners might affect more than 
> the current version, e.g.
> [example]
>
> We would now delete version 2 from our database, so only 1,3,4 are 
> kept. But what happens to the "name=Blah Road" tag that is still 
> present in version 3?
>
> [2]We can either remove that tag from version 3, thereby falsifying
> history (making it look like the tag was never there) - probably a bad
> idea.
>
> [3]Or we can remove all versions that contain any information
> contributed by non-agreers, which might be a lot, and we would lose a
> lot of history along the way.
>
> [4]Another option is dropping the whole history for everything now,
> and start with a clean database where version 1 (or version n) is the
> current version and no other versions exist. (We could keep a
> read-only version of the last CC-BY-SA database with full Rails port
> functions on a simple server somehow, doesn't matter if it's slow -
> just so that people can still access history if they want, but that
> would all be under CC-BY-SA.)
>
> [5]Or we could opt for a limited keeping of history whereby every
> object with more than one historic version is reduced to having
> exactly two versions - v1 is the very first, and v2 is the current
> one, and everything in between is removed.
> [...]
>
> [6]My final idea is a slightly outlandish variant of the above but
> even easier: Simply make the new API return *no* pre-changeover
> versions at all, and keep all the pre-changeover versions in a special
> CC-BY-SA-only API.
> [...]


