[OSM-dev] Measuring the current state of play wrt new contributor terms
jim at cloudmade.com
Mon Aug 9 22:27:47 BST 2010
I'm working on an approach to model the impact of acceptance of the new contributor terms on the data. I wanted to kick this around the dev list and see what people think of the approach as I get it going...
It should be noted at the outset that edits by bots do not require acceptance of the terms and conditions in order to be kept in the database. Nor do large batch loads of data from a specific PD source, such as the TIGER data.
These sets of data will need to be taken out of the analysis if they are connected to contributors who have not accepted the new terms and conditions.
Any editor can be classified as:
* Accepting the terms and conditions
* Not having answered yet
* Refusing the new terms and conditions
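For the modelling, the three states above could be represented as a simple enumeration. This Python sketch is purely illustrative; the names are my own, not anything from the OSM schema:

```python
from enum import Enum

class TermsStatus(Enum):
    """Classification of an editor with respect to the new contributor terms."""
    ACCEPTED = "accepted"    # has accepted the terms and conditions
    UNDECIDED = "undecided"  # has not answered yet
    REFUSED = "refused"      # has refused the new terms and conditions
```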
Database objects have two main sets of properties:
* Geometry: for points this is the lat/long, for ways and polygons it is the set of nodes, and for relations it is the set of members.
* Attributes (tags)
Objects also have history and each historical change is linked to a user. Each historical change can impact either the geometry, the attributes or both.
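As a sketch, the two property sets plus the per-version user link might be modelled like this in Python (field names are assumptions for illustration, not the actual database schema):

```python
from dataclasses import dataclass, field

@dataclass
class Version:
    """One historical change to an object, linked to the user who made it."""
    user_id: int
    geometry: tuple  # lat/long for a node, node refs for a way, members for a relation
    tags: dict       # the attribute set as of this version

@dataclass
class OSMObject:
    object_id: int
    versions: list = field(default_factory=list)  # history, oldest first

    def geometry_changed(self, i):
        """Did version i change the geometry relative to the prior version?"""
        return i == 0 or self.versions[i].geometry != self.versions[i - 1].geometry

    def tags_changed(self, i):
        """Did version i change the attributes relative to the prior version?"""
        return i == 0 or self.versions[i].tags != self.versions[i - 1].tags
```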
If an editor has refused the terms and conditions (and they are not a bot/batch load as defined above), we need to take the changesets they have made and remove their impact from the data. This means:
* Where geometries have changed:
* If the object has a prior version, take the prior version of the geometry and then re-apply any subsequent geometry edits made after theirs
* If the object does not have a prior version, then it is probably lost to the db along with its tags.
* Where attributes have changed:
* If the specific tag deleted or changed existed in a prior version, roll back that tag to the latest prior version (which could mean re-adding deleted tags) and then roll forward subsequent edits to that tag. Other tags should be unaffected.
* If the tag was added in this edit, then it is probably lost to the db
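The tag-level rollback described above could be sketched roughly as follows in Python. The (accepted, tags) version representation is an assumption for illustration; the real processing would work against the planet history format, and the final logic is for the LWG to define:

```python
def surviving_tags(versions):
    """Compute the tag set left after removing the impact of refused edits.

    `versions` is a list of (accepted, tags) pairs, oldest first, where
    `accepted` is False for an editor who refused the terms and `tags` is
    the full tag dict as of that version. Illustrative sketch only.
    """
    result = {}   # surviving tag set
    prev = {}     # tag set of the previous version, accepted or not
    for accepted, tags in versions:
        if accepted:
            # roll forward tags this version added or changed
            for key, value in tags.items():
                if key not in prev or prev[key] != value:
                    result[key] = value
            # deletions made by an accepting editor also take effect
            for key in set(prev) - set(tags):
                result.pop(key, None)
        # a refused version is skipped entirely: its additions never land,
        # its changes are rolled back, and its deletions are re-added
        prev = tags
    return result
```

For example, a tag added by a refused editor and never touched again is lost, while a tag they merely changed reverts to the last prior value; later edits by accepting editors roll forward on top.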
This approach seems reasonable and is what I am starting to model. Of course, the actual detail of how edits by editors who do not sign up to the new terms and conditions are processed will need to be determined by the LWG; I am happy to help with the implementation of the final logic, however.
PROCESS FOR ANALYSIS:
Using the history file from 2010-08-02, with periodic diff files applied over time, plus a feed of the user IDs of those who have accepted the new terms and conditions (also updated over time), I plan to model and report on the impact on a regular basis.
The report would be along the lines of a table with one row per object type (e.g. node, way, relation), showing for each row/object type:
* # objects in db as of last update
* # tags in db as of last update
* # objects totally clean (all editors in the history have accepted)
* # objects initial editor accepted
* # objects clean so far (no editors have refused)
* # objects initial editor refused (entire object may be lost)
* # objects with partial data loss (one or more editors have refused, but not the first one)
* # tags totally clean (all editors in the history have accepted)
* # tags initial editor accepted
* # tags clean so far (no editors have refused)
* # tags initial editor refused (entire tag series may be lost)
* # tags with partial data loss (one or more editors have refused, but not the first one)
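The per-object (and per-tag) metrics above reduce to a classification over the ordered list of editor statuses in an object's history. A sketch, with illustrative status strings:

```python
def report_buckets(statuses):
    """Map an object's editor-status history (oldest first) to the report
    buckets it counts towards. Statuses are 'accepted', 'undecided' or
    'refused'; an object can fall into several buckets at once."""
    buckets = []
    if all(s == "accepted" for s in statuses):
        buckets.append("totally clean")
    if statuses[0] == "accepted":
        buckets.append("initial editor accepted")
    if all(s != "refused" for s in statuses):
        buckets.append("clean so far")
    if statuses[0] == "refused":
        buckets.append("initial editor refused")
    if statuses[0] != "refused" and "refused" in statuses[1:]:
        buckets.append("partial data loss")
    return buckets
```

The report counts would then just be tallies of these buckets over all objects (and, separately, over all tags).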
As this goes further along, we can look at what types of data are at risk and break the objects out by country... Only as needed, of course.
This will take some significant work and processing power, so I want to be sure that the methods and metrics are of use to the community and reflect our intent as a community. Hence posting it here in Dev...
Looking forward to some constructive feedback.
Jim Brown - CTO CloudMade
email: jim at cloudmade.com