[Rebuild] Updates on the redaction bot

Sat Jun 9 00:24:32 BST 2012

I am finally done with my final exams for this semester and have turned my attention to the redaction bot. You have asked for updates and now I finally have some.

We have received install instructions for Linux from Peteris Krisjanis. The redaction bot only requires Ruby >= 1.9.1 and git, both of which can be found on any modern distribution. With the new instructions it is easy to install the bot and try it out for yourself.

The Python script that Malte Kuhn submitted, which includes a better algorithm for deciding whether the difference between two strings is trivial because they differ only in abbreviations ("NE Foo st." == "North East Foo Street"), has now gotten its own branch for testing. It is still written in Python, unlike the rest of the bot, which is written in Ruby. It uses a clever algorithm that creates a set of abbreviation rules, which are then applied to one of the strings until the two strings match each other. It currently uses brute force to choose the rules, so if any of you have a better algorithm for deciding which rule to apply, it could run at a nice speed. We still only have tests and common abbreviations for English, German and Russian. We would like to have tests for more languages.
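To make the brute-force idea above concrete, here is a minimal sketch of it. This is not Malte's script: the rule table, the normalize step and the function names are all assumptions made up for illustration. It simply tries every applicable rule on one of the strings, breadth first, until it either reaches the other string or runs out of candidates.

```python
from collections import deque

# Hypothetical rule table (abbreviation -> expansion). The real script's
# rules and their format are not shown in this post.
RULES = [("ne", "north east"), ("n", "north"), ("e", "east"), ("st", "street")]


def normalize(s):
    # Lowercase, treat periods as spaces and collapse whitespace.
    return " ".join(s.lower().replace(".", " ").split())


def differ_only_by_abbreviation(a, b, rules=RULES, max_states=10000):
    """Return True if expanding abbreviations in `a` can produce `b`."""
    start, target = normalize(a), normalize(b)
    seen = {start}
    queue = deque([start])
    # Brute force: explore every string reachable by applying one rule
    # to one token, giving up after max_states candidates.
    while queue and len(seen) <= max_states:
        current = queue.popleft()
        if current == target:
            return True
        tokens = current.split()
        for i, token in enumerate(tokens):
            for abbreviation, expansion in rules:
                if token == abbreviation:
                    candidate = " ".join(tokens[:i] + [expansion] + tokens[i + 1:])
                    if candidate not in seen:
                        seen.add(candidate)
                        queue.append(candidate)
    return False


print(differ_only_by_abbreviation("NE Foo st.", "North East Foo Street"))
```

The sketch only expands abbreviations in the first string, so a caller that does not know which side is abbreviated would have to try both directions. A smarter way of picking the next rule would replace the inner two loops.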

In May Matt Amos added a new CLI client that lets you inspect the results of running the redaction bot on an object from the API. It now supports running the bot on an entire history extract as well as on a list of objects. It still only outputs raw Ruby objects, not yet an XML file that can be used by other programs.

I have been adding some optimizations to the relation code. Profiling showed that the longest common subsequence algorithm used to calculate the difference between relation geometries was slowing the bot down. I added some code that tries to find a faster solution for the trivial cases. I suspect this will have to be duplicated for ways wherever they use the same algorithm. I tried running it on monacco.osh and the run time went from 30 minutes before the optimization to 5 minutes after. There are probably some bugs in that code, and we need people to review and test it.
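For those who want to review this kind of change, here is a rough sketch of what a fast path for trivial cases can look like. It is not the code from the bot (which is Ruby), and the function name is made up. It assumes that the trivial cases are sequences that are identical or that share a long common start and end, which is one common way to speed up a longest common subsequence computation.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    # Fast path 1: identical sequences need no further work.
    if a == b:
        return len(a)

    # Fast path 2: strip the common prefix and suffix, which always
    # belong to a longest common subsequence.
    start = 0
    while start < len(a) and start < len(b) and a[start] == b[start]:
        start += 1
    end_a, end_b = len(a), len(b)
    while end_a > start and end_b > start and a[end_a - 1] == b[end_b - 1]:
        end_a -= 1
        end_b -= 1
    trimmed = start + (len(a) - end_a)
    a_mid, b_mid = a[start:end_a], b[start:end_b]
    if not a_mid or not b_mid:
        return trimmed

    # Slow path: the usual dynamic programming table, kept to two rows,
    # but only over the part that actually differs.
    prev = [0] * (len(b_mid) + 1)
    for x in a_mid:
        curr = [0]
        for j, y in enumerate(b_mid, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return trimmed + prev[-1]
```

When only a few members in the middle of a long relation have changed, the quadratic part then runs on a handful of elements instead of the whole member list.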

The last two commits in the optimization work are up for discussion. The first one says that certain values of the key 'type' on relations are not copyrightable and are therefore not affected by the redaction bot. This makes the final product a lot less mangled, and we do not have to lose a lot of multipolygons, routes and borders just because they were created by a decliner although most of the members are from acceptors. The second commit says that we do not care about the order of the members in multipolygons. This lets the bot use a much simpler algorithm for half the relations in the database. I hope a simpler algorithm means fewer bugs and faster code, although it adds a few more lines of code. There are probably more relation types where the order of the members does not matter.
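As a starting point for the discussion about member order, here is a small sketch of the idea behind the second commit. The names are hypothetical and the member representation (a tuple of member type, id and role) is an assumption, not the bot's actual data model. For relation types where order does not matter, the members are compared as a multiset, and everything else keeps the ordered comparison.

```python
from collections import Counter

# Assumption: relation types whose member order is ignored. Only
# multipolygon is named in this post; others could be added later.
ORDER_INSENSITIVE_TYPES = {"multipolygon"}


def members_equal(relation_type, old_members, new_members):
    """Compare two member lists of (member_type, ref, role) tuples."""
    if relation_type in ORDER_INSENSITIVE_TYPES:
        # Multiset comparison: same members the same number of times,
        # in any order.
        return Counter(old_members) == Counter(new_members)
    # Order matters for everything else, e.g. routes.
    return list(old_members) == list(new_members)


outer = ("way", 1, "outer")
inner = ("way", 2, "inner")
print(members_equal("multipolygon", [outer, inner], [inner, outer]))
print(members_equal("route", [outer, inner], [inner, outer]))
```

Adding another relation type to the set is then a one line change, which is why it would be useful to know which other types can safely ignore member order.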

There has been a bit of talk on the list about visualizing the redactions in JOSM. This could be done if the output from the redaction bot could be read by the WTFE layer that is used by the licensechange plugin. I do not know how this could be done. However, if we got the redaction bot to output XML, it would be possible to load that with osm2pgsql and have the cleanmap show the result of the redaction bot.

If you have any questions, ideas or patches, feel free to contact people on the mailing list or on IRC. Thank you for all your contributions.

Gnonthgol