[OSM-talk] Redaction requests - the information I need

Tue Nov 13 10:44:41 GMT 2012

The DWG periodically receives requests to redact content from the database.
As the person generally responsible for running those redactions, I thought
I'd share some detail on what information I need to redact something. This
is not going to cover the legal reasons to redact something. Those are not
handled by the DWG but by the LWG. Most redaction cases are legally pretty
simple. This is also not going to cover DMCA requests which require
"Identification of the material that is claimed to be infringing or to be
the subject of infringing activity and that is to be removed or access to
which is to be disabled, and information reasonably sufficient to permit the
service provider to locate the material."

I am using bad data as shorthand for data that needs to be redacted. I'm
also not going to cover verifying that something needs to be redacted.

The changeset redaction bot is a Ruby program that uses the same logic as
the redaction bot run on the database for the ODbL changeover. Because it
doesn't have DB access it uses API calls and does not run as quickly, so it
might not be suitable if a massive redaction ever needed to be done. Aside
from assorted configuration options, it can take a list of what to redact in
four ways:

1. A changeset. This is the most common way to call it, and it is generally
easiest to deal with a redaction request if the requester has provided a
list of changesets. I have scripts that will take a list of changesets and
call the bot for each changeset, verifying the results. The changeset is
downloaded as a .osc file and processed.

2. A .osc file. This is mainly used for changesets that are too large to
download through the API. It is parsed to get a list of objects and
versions.

3. A list of objects where bad data was added. An example entry from a list
would be "w1234v5" which indicates that version 5 of way 1234 introduced bad
data. n is used for node and r for relation. Typically it will be version 1
of an object where bad data was introduced.

4. An object and a version range. There is a special script that can be used
to redact specific objects from the database without applying the normal
logic to determine which versions to redact. This is not normally used and
has only been used about 5 times in total.
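
To illustrate the second and third input forms, here is a minimal Python
sketch of how a .osc file and entries like "w1234v5" could both be reduced to
the same list of object versions. This is not the bot's actual code (the bot
is written in Ruby); the names ObjectVersion, parse_entry and parse_osc are
hypothetical, and the sketch assumes a standard osmChange document whose
node, way and relation elements carry id and version attributes.

```python
import re
import xml.etree.ElementTree as ET
from typing import List, NamedTuple


class ObjectVersion(NamedTuple):
    """One version of one OSM object (hypothetical helper type)."""
    kind: str      # "node", "way" or "relation"
    osm_id: int
    version: int


_KINDS = {"n": "node", "w": "way", "r": "relation"}
_ENTRY = re.compile(r"([nwr])(\d+)v(\d+)")


def parse_entry(entry: str) -> ObjectVersion:
    """Parse a list entry such as "w1234v5" (way 1234, version 5)."""
    match = _ENTRY.fullmatch(entry.strip())
    if match is None:
        raise ValueError(f"not a valid object entry: {entry!r}")
    letter, osm_id, version = match.groups()
    return ObjectVersion(_KINDS[letter], int(osm_id), int(version))


def parse_osc(osc_text: str) -> List[ObjectVersion]:
    """List every object version mentioned in an osmChange document."""
    root = ET.fromstring(osc_text)
    found = []
    for action in root:              # <create>, <modify> or <delete>
        for element in action:       # <node>, <way> or <relation>
            if element.tag in ("node", "way", "relation"):
                found.append(ObjectVersion(element.tag,
                                           int(element.get("id")),
                                           int(element.get("version"))))
    return found
```

With this sketch, parse_entry("w1234v5") gives way 1234 at version 5, and
parse_osc returns one such record per object in the file.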

Once it has got a list of objects, it then proceeds to download them from
the API and use its logic to determine what needs to be removed. It then
does two things:

1. Deletes data as required using changesets. In many cases this is
essentially the same as reverting the changeset, so if someone has already
reverted it nothing is done here.

2. Uses the redaction API call to hide old versions of the objects that
cannot be shown.
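
For the second step, the following Python sketch builds the URL for hiding
one object version. It assumes the redact call has the form documented for
API 0.6, as I understand it (a POST to .../<type>/<id>/<version>/redact with
a redaction parameter); check the API documentation before relying on it.
The base URL, the function name redact_url and the redaction ID in the usage
note are assumptions for illustration, and the sketch only builds the URL
without sending anything.

```python
from urllib.parse import urlencode

# Assumed base URL; a test server would use a different host.
API_BASE = "https://api.openstreetmap.org/api/0.6"


def redact_url(kind: str, osm_id: int, version: int, redaction_id: int) -> str:
    """Build the URL that would be POSTed to hide one object version."""
    if kind not in ("node", "way", "relation"):
        raise ValueError(f"unknown object type: {kind!r}")
    query = urlencode({"redaction": redaction_id})
    return f"{API_BASE}/{kind}/{osm_id}/{version}/redact?{query}"
```

For example, redact_url("way", 1234, 5, 42) refers to version 5 of way 1234
under a made-up redaction ID of 42. The call itself needs an account with the
right permissions.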

What does this mean for you as a mapper requesting a redaction?

The preferred way to request a redaction is generally to give a list of
changesets that need to be processed. Any other request generally has to be
turned into a list of changesets. Sometimes only a small part of a changeset
needs to be redacted. In those cases a list of objects may cause less damage,
because it avoids redacting unnecessary content.

I have assorted tools including a changeset database and a pgsnapshot
database. I can turn information like "all changesets by this user with
'$foo' in the comment" into a list of changesets fairly easily. The DWG also
has experience identifying exactly what to redact, but it is preferable to
make requests in one of the above formats. Requests that need investigation
to determine what to do will take much longer.
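
As an illustration of turning a description into a list of changesets, here
is a minimal Python sketch that filters changeset records by user and comment
text. It is not how my changeset database is actually queried; the Changeset
record, its fields and the function matching_changesets are hypothetical
stand-ins for whatever metadata is available.

```python
from typing import Iterable, List, NamedTuple


class Changeset(NamedTuple):
    """Minimal changeset metadata (hypothetical record layout)."""
    changeset_id: int
    user: str
    comment: str


def matching_changesets(changesets: Iterable[Changeset],
                        user: str, text: str) -> List[int]:
    """Return the sorted IDs of changesets by `user` whose comment contains `text`."""
    return sorted(c.changeset_id for c in changesets
                  if c.user == user and text in c.comment)
```

The resulting list of IDs is in the preferred request format and can be
handed to the bot one changeset at a time.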