[OSM-dev] GDPR implementation on planet.osm.org

Frederik Ramm frederik at remote.org
Tue Jun 19 20:54:07 UTC 2018


As you probably know, the EU data protection rules compel us to be a bit
less open in handing out personal data to everyone. Following LWG's
analyses and recommendations, the OSMF has decided to implement
restrictions on publishing user names and changeset IDs.

The general plan is to allow everyone "in OSM" (i.e. with an OSM
account) to fully access all data as before (and have a policy that says
you must only use the personal data for OSM purposes), while removing
user names, user IDs, and changeset IDs from the publicly available data
(i.e. what you can get without an OSM account).
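Concretely, the stripping amounts to dropping a handful of attributes
(user, uid, changeset) from every element. As a rough illustration only
-- the real dumps will of course be processed with osmium, not with a
Python XML parser -- it would look something like this:

```python
# Sketch: remove personal data (user, uid, changeset) from OSM XML.
# Illustration only -- the real pipeline processes PBF/XML with osmium.
import xml.etree.ElementTree as ET

PERSONAL_ATTRS = ("user", "uid", "changeset")

def strip_userdata(xml_text: str) -> str:
    """Return the OSM XML with personal-data attributes removed."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        for attr in PERSONAL_ATTRS:
            elem.attrib.pop(attr, None)
    return ET.tostring(root, encoding="unicode")

snippet = (
    '<osm version="0.6">'
    '<node id="1" lat="49.0" lon="8.4" version="2" '
    'timestamp="2018-06-19T20:54:07Z" user="alice" uid="42" changeset="7"/>'
    '</osm>'
)
print(strip_userdata(snippet))
```

The element IDs, versions and timestamps stay untouched; only the
attributes that identify a person (directly or via the changeset) go.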

This requires changes to the API which I've started to sketch here:

but this message is about changes to the downloads on
planet.openstreetmap.org. Here's a three-phase plan for changing the way
we run planet.openstreetmap.org, and I would like to hear feedback about
the feasibility from users and those familiar with running the site
alike. I haven't run this by the sysadmins so if there are any bloopers
I hope they will be pointed out. (I will put this up on
https://wiki.openstreetmap.org/wiki/GDPR/Planet.osm_Migration and try to
work in any results from the discussion here, but if you're more
comfortable editing directly on the wiki, that's fine too.)


Phase 1 - Introduction of no-userdata files

This does not require changes to the core software and could start
immediately, but some scripting is required.

1a. set up a new domain for OSM internal data downloads, e.g.
"osm-internal.planet.openstreetmap.org", initially duplicating all data.

Issue: name of domain?
Issue: ironbelly disk usage is at 70%, possible to add space?

1b. modify the planetdump.erb in the planet chef cookbook to generate
versions without user information of all the weekly dumps, in addition
to the versions with user information; have the versions without user
information stored in the old "planet.openstreetmap.org" tree, and the
versions with user information in the new "osm-internal" tree.

Issue: should files have the same names on internal and public site, or
should they be called "planet-with-userdata" and "planet" or something?

1c. modify the replication.cron.erb as follows:

* have osmosis write minutely replication files to the new "internal" tree
* run a shell script after generating the replication files that will
find the newly generated file, pipe it through osmium to strip user
information, and write the result to the old "planet" tree, copying the
state.txt files as needed
* run the osmosis "merge-diff" tasks separately on both trees OR run on
internal tree only and pipe result through osmium as above
* write changeset replication XMLs to the new "internal" tree only
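The "find the newly generated file" part of the post-processing script
boils down to reading the sequence number from the state.txt that
osmosis just wrote and mapping it to the usual AAA/BBB/CCC diff path. A
sketch of that logic (the tree locations and file names beyond the
standard osmosis replication layout are assumptions for illustration):

```python
# Sketch of the post-replication step in 1c: locate the diff osmosis
# just wrote in the internal tree via state.txt, so it can be piped
# through osmium and copied to the public tree.
import re

def replication_path(sequence: int) -> str:
    """Map a sequence number to the osmosis AAA/BBB/CCC file path."""
    s = f"{sequence:09d}"
    return f"{s[0:3]}/{s[3:6]}/{s[6:9]}"

def sequence_from_state(state_txt: str) -> int:
    """Extract sequenceNumber from an osmosis state.txt."""
    m = re.search(r"^sequenceNumber=(\d+)$", state_txt, re.MULTILINE)
    if m is None:
        raise ValueError("no sequenceNumber in state.txt")
    return int(m.group(1))

state = (
    "#Tue Jun 19 20:54:07 UTC 2018\n"
    "sequenceNumber=3003951\n"
    "timestamp=2018-06-19T20\\:54\\:02Z\n"
)
seq = sequence_from_state(state)
print(replication_path(seq) + ".osc.gz")  # -> 003/003/951.osc.gz
```

The script would then run osmium over that .osc.gz and copy the
corresponding state.txt to the public tree unchanged.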

For step 1c, it might make sense to announce a maintenance window
beforehand during which the changes will be made, so that consumers who
rely on user data can stop their replication for a few hours and then
make the switch.

1d. modify the planet.openstreetmap.org index pages to point to the
internal site in case people wish to download files with user data;
place a notice on the internal site that these files contain user data.

At the end of phase 1, we will have this situation:

* new changeset diffs only on the "internal" tree
* regular diffs come in two flavours, with and without user data
* planet dumps etc. also come in two flavours
* old files are unchanged
* consumers will automatically get the stuff without user data
* consumers who need user data will have to change their URLs

Phase 2 - Cleaning out old files that contain user data

This can be done slowly in the background over the course of however
long it takes:

2a. remove all changeset dumps and changeset diffs from the public tree.
2b. run all .osc, .osm.pbf, and .osm.bz2 files on the public tree
through osmium, scrubbing user data (retaining file timestamps where
possible) and re-creating .md5 files where necessary
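Since scrubbing changes the file contents, every sidecar .md5 file has
to be rewritten too. A sketch of that part (file names here are
illustrative; the scrubbing itself would again be an osmium invocation):

```python
# Sketch of the .md5 re-creation in step 2b: after a dump has been
# scrubbed of user data its checksum changes, so the sidecar .md5
# must be rewritten in `md5sum` format ("<hex>  <filename>").
import hashlib
import os

def write_md5(path: str) -> str:
    """Recompute the MD5 of `path` and write it to `path`.md5."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    line = f"{h.hexdigest()}  {os.path.basename(path)}\n"
    with open(path + ".md5", "w") as f:
        f.write(line)
    return line

# Stand-in file for illustration; a real run would walk the public
# tree and target the scrubbed .osc / .osm.pbf / .osm.bz2 files.
with open("planet-180618.osm.pbf", "wb") as f:
    f.write(b"scrubbed dump contents\n")
print(write_md5("planet-180618.osm.pbf"), end="")
```

Writing the checksum in `md5sum` format means existing verification
scripts (`md5sum -c`) keep working unchanged.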

Phase 3 - Controlling access to files with user data

Once the parallel systems are up and running, we will want to

3a. issue guidelines about what you are allowed to do with the user data
3b. ensure that everyone who has an OSM account agrees to these
guidelines one way or the other,
3c. start requiring an OSM login for all downloads from the internal,
"with userdata" tree.

One possible technical solution for 3c is
https://github.com/geofabrik/sendfile_osm_oauth_protector which also
comes with a guide for users on how to run it in a scripted setup.
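Conceptually, the protector gates each download on a cookie that the
client obtains via an OSM OAuth login. The following is a much
simplified sketch of that gating idea only -- it is NOT the actual API
or cookie scheme of sendfile_osm_oauth_protector, and the cookie and
token names are made up:

```python
# Much-simplified sketch of the access check in 3c: a request to the
# internal tree is only served if it carries a valid login token.
# NOT the real sendfile_osm_oauth_protector scheme; names are made up.
from http.cookies import SimpleCookie

VALID_TOKENS = {"osm-session-token-example"}  # hypothetical token store

def may_download(cookie_header: str) -> bool:
    """Return True if the request cookie carries an authorized token."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get("osm_internal_auth")  # hypothetical cookie name
    return morsel is not None and morsel.value in VALID_TOKENS

print(may_download("osm_internal_auth=osm-session-token-example"))  # True
print(may_download("other=value"))  # False
```

In the real tool the token is issued after an OAuth handshake against
the OSM API, so scripted consumers authenticate once and then reuse the
cookie for subsequent downloads.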

Frederik Ramm  ##  eMail frederik at remote.org  ##  N49°00'09" E008°23'33"
