[Rebuild] Strategy change for the redaction bot
gnonthgol at gmail.com
Sat Jun 2 20:14:20 BST 2012
Work on the license change bot has almost halted over the last month, with only around 10 commits and no real progress on the main algorithms. This is partly because people have been occupied with other things (work, life, mapping, final exams, etc.) and partly because the problems with the algorithms are really hard.
There are two problems that progress has halted on. If we solve them we might bump into new problems, or we might only have to write and test the glue between the bot and the database or history files, and then we are done. The two hard unsolved problems we have now are:
1) Detecting whether a change to a tag is trivial enough that it cannot be copyrighted. This includes fixing typos, expanding abbreviations and normalizing tags.
2) Removing changes in the geometry of the ways and relations.
Below I have written a bit about each problem, with suggestions for making them a bit easier.
TRIVIAL TAG CHANGE PROBLEM
The first problem is a difficult one for computers. It needs a lot of knowledge about the language used in the tags, like how to detect misplelings, abbrevs and words of order. In addition, what counts as a trivial or a significant change to a tag can be subjective, and may start big debates after the redaction bot has run. It is impossible for any human to know all types of trivial changes in all languages, and it is even harder for a computer program to get this right.
We cannot set the boundary between trivial and significant changes too far in either direction, as either way changes from non-acceptors will end up in the final result. If we class too many changes as significant, a trivial change by an acceptor will clean the tag and cause the information added by the non-acceptor to be included in the final result. On the other hand, if we class too many changes as trivial, then significant changes by non-acceptors can be marked clean and not be redacted.
What we can do is make the algorithm give one of three results: "trivial", "significant" and "don't know". In the latter case it will return a safe default that causes the bot to redact as much as possible. If we do this we might lose more information than is strictly necessary, but we would not keep any data from non-acceptors.
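To make the three-result idea concrete, here is a minimal sketch in Python. It is not code from the bot: the names Verdict, classify_tag_change and is_clean_after, the notion of normalizing (lower-casing and collapsing whitespace) and the 0.5 similarity threshold are all assumptions made for illustration. It also encodes my reading of the two failure modes described above, namely that the safe answer for "don't know" depends on who made the edit.

```python
from difflib import SequenceMatcher
from enum import Enum


class Verdict(Enum):
    TRIVIAL = "trivial"
    SIGNIFICANT = "significant"
    DONT_KNOW = "don't know"


def normalize(value: str) -> str:
    """Lower-case and collapse whitespace (an assumed notion of normalizing)."""
    return " ".join(value.lower().split())


def classify_tag_change(old: str, new: str) -> Verdict:
    """Hypothetical heuristic; the 0.5 threshold is a guess, not a tuned value."""
    if normalize(old) == normalize(new):
        # Only case or whitespace differs.
        return Verdict.TRIVIAL
    similarity = SequenceMatcher(None, normalize(old), normalize(new)).ratio()
    if similarity < 0.5:
        # The values have little in common, so treat it as new information.
        return Verdict.SIGNIFICANT
    # Could be a typo fix or an expanded abbreviation, could be a new name.
    return Verdict.DONT_KNOW


def is_clean_after(was_clean: bool, editor_accepted: bool, verdict: Verdict) -> bool:
    """Simplified taint rule; DONT_KNOW always falls on the redacting side."""
    if editor_accepted:
        # An acceptor only cleans a tainted tag with a clearly significant change.
        return was_clean or verdict is Verdict.SIGNIFICANT
    # A non-acceptor only leaves a clean tag clean with a clearly trivial change.
    return was_clean and verdict is Verdict.TRIVIAL
```

With this rule a "don't know" from an acceptor does not clean a tainted tag, and a "don't know" from a non-acceptor taints a clean one, so the uncertain cases cost us data but never keep anything from a non-acceptor.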
GEOMETRY CHANGE PROBLEM
The second problem we currently have is equally hard for computers. Geometries can be totally changed by non-acceptors (as the 'reverse way' and 'sort relation members' functions do), and it can be hard, if not impossible, to apply the changes made by acceptors to a totally different geometry. I have not seen any program that can do this. The closest program that does something similar (patch) cannot handle these cases and requires human intervention when it runs into them.
If we get this wrong we will probably mangle geometries around the world and make them look like vandals have been there and broken everything (ok, it may not be that bad). We want the final result to be good data, and not just the best license.
To make a better algorithm for this we need to look at the planet and see to what extent this is an issue. For a lot of relations it does not matter what order the members are in, or we can apply simple functions to the data to determine what order they should be in. The problem may turn out to be so small that some remapping effort, or a human doing the redactions on those objects by hand, is enough. However, if this is a big problem, we need some really, really smart algorithms that are able to read what a mapper meant to do with an edit and apply it to the clean geometry.
I would like to get some comments on the changes to the bot before I start to redesign it. We still need better algorithms for both problems, but this might remove the requirement that the algorithms have to be perfect. If we get enough people to work on the bot we can get this thing done before people get too restless to keep at bay.
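As a starting point for measuring the extent of the issue, a rough sketch like the following could label each edit to a way's node list or a relation's member list. It assumes the two versions are available as plain lists of ids; the function names and the four labels are made up for this example and are not part of the bot.

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple


def classify_order_change(old: Sequence[int], new: Sequence[int]) -> str:
    """Label how a node or member list changed between two versions."""
    old, new = list(old), list(new)
    if old == new:
        return "unchanged"
    if old == new[::-1]:
        # What a 'reverse way' style edit produces.
        return "reversed"
    if Counter(old) == Counter(new):
        # Same members, different order, e.g. after sorting relation members.
        return "reordered"
    return "content changed"


def summarize(pairs: Iterable[Tuple[Sequence[int], Sequence[int]]]) -> Counter:
    """Count how often each kind of change occurs in a set of version pairs."""
    return Counter(classify_order_change(old, new) for old, new in pairs)
```

Running summarize over the edits made by non-acceptors would give a first idea of how many objects were only reversed or reordered, and how many have real content changes that need a smarter algorithm or a human.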