[OSM-talk] The long tail

Thu Jul 6 12:15:33 BST 2006

OSM is going through this difficult process with the OSMF and people are
losing confidence in my sanity as I haven't released the entire database.
I've tried to stay above some of the politics of this, but when Imi
thinks you're mad something is very wrong.

In a private email, Imi sent this which he's agreed I can publish:

>> Then there are the bad server performance problems.
>> Next was, that there are privacy issues.
>> Then some concerns about forking projects came up.
>> Now you say that you are threatened by the gouvernment to close down
>> the server. When asked for evidence you just answer that we all agree
>> with you anyway.

Which are the reasons he thinks I'm mad, and he also thinks that I'm
trying to control the database and not release it.

I would like to. I think the biggest growth happened for me when the
first planet.osm was released. I didn't think anybody would use it and
then all this cool stuff happened with people's uses. I was really amazed
but in retrospect it's kind of obvious that people would do cool stuff
once the data was there, that's the whole point of the project! :-)

So, I want to address these points and give you some new information.
Then I'm going to point to the license problem and suggest a solution to
all of them.

>> Then there are the bad server performance problems.

This point, I think, is that bad server performance limits the ability
to produce a planet.osm. It does. The process takes _hours_ and totally
fails if you try to do the TIGER data too. This was improved by NickW's
planet.rb script and the node table speedup, but it still needs work.
This can be fixed by having it run on a slave, which is what I'm going
to do on Saturday. People also mentioned running a clone elsewhere, but
there are:

>> Next was, that there are privacy issues.

Yes. There are deep privacy issues. I used to volunteer for an
organisation called FIPR (www.fipr.org), which at a glance is an EFF-like
organisation in the UK. I became involved because when I was an
undergraduate I set up a webcam to look over the bike rack at college.
Lots of bikes (including mine, a day after I got it) were being stolen.
The webcam allowed anyone in the department to view a live stream of the
bikes to keep an eye on them.

This is similar to the ideas of 'the transparent society' which is a
book by David Brin, IIRC. It comes from the thought that either you have
to trust specific people to watch CCTV like we do now, or we could just
totally open up CCTV to everyone. Which is better? David Brin argues
that the latter is the lesser of two evils. In fact, this is being tried
out in a housing estate in the UK.

So I set this up and then got some angry privacy emails from Ian Brown.
Ian was a PhD student at the time (I think) in the department and was
also director of FIPR. He pointed out many things like it was totally
illegal, it was against college rules, there should be a notice to say
CCTV and so on. I thought he was mad.

Through more conversations I came around and then got quite immersed in
things like privacy, copyright and data protection. When we were talking
last night about giving out the database, he was even more adamant than
I am that I shouldn't.

Why?

Before I go in to the exact pieces of data, I want to point out two
things. First, we don't have a privacy policy. I'm going to leave that
there, but it's really important. Second, retrospective privacy
invasion. What we do now may have unforeseen privacy implications later
on. For example, not many people on usenet expected a full searchable
archive like there is now. Also, it may well become possible to run
face recognition over all of those photos you're in with other people
on flickr. Those are two examples.

We have very sensitive data in OSM. Some have pointed out that they
don't care if it's released. That's cool. Opt-in privacy is what we need
though, not opt-out. By that I mean people should decide they want their
data released, not have to opt out of it being released.

So, traces. Traces tell a story about where you've been, how fast you
went and so on. They have opt-in privacy where you can make your data
public if you wish. Stripping the timestamp doesn't help much. Matt Amos
thought about this way back at the beginning of OSM because he didn't
want people to figure out where he lived. He couldn't just switch off
his GPS when he was a few minutes away as that would lead to a circle of
space without traces pointing to where he lived. He tried other methods
but eventually gave up. I thought he was mad at the time, but now I see
sense.

So let's drop the traces from the dump, people say. Ok cool.

There's the user table. Drop that people say, we don't want the user
list. Ok.

Now segments, nodes and ways. All of these have user data attached which
is sensitive. Drop that people say. Ok. Now the timestamps will give
away a lot about the user doing the editing, and you could figure out
unique users as we're not all editing at random times; many of us do it
at specific times. Drop that? Ok.

What we're left with are three or four database tables that have to have
various columns dropped from all of them.

And this is almost exactly the same as planet.osm.
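
As a rough illustration of the column dropping described above, here is
a minimal Python sketch. The column names (`user_id`, `timestamp`) and
the row layout are assumptions made up for the example; they are not
taken from the real OSM schema.

```python
# Hypothetical column names; the real OSM tables may well differ.
SENSITIVE_COLUMNS = {"user_id", "timestamp"}


def strip_sensitive(rows, sensitive=SENSITIVE_COLUMNS):
    """Return copies of the rows with the sensitive columns removed."""
    return [
        {key: value for key, value in row.items() if key not in sensitive}
        for row in rows
    ]


# Two made-up node rows, purely for illustration.
nodes = [
    {"id": 1, "lat": 51.5, "lon": -0.1, "user_id": 42,
     "timestamp": "2006-07-01 09:00:00"},
    {"id": 2, "lat": 51.6, "lon": -0.2, "user_id": 7,
     "timestamp": "2006-07-02 21:30:00"},
]

public_nodes = strip_sensitive(nodes)
```

What is left in `public_nodes` is just the current position of each
node, which is roughly what a planet.osm style dump carries anyway.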

The difference is that the history data, the bits that say 'node foo was
here and then moved to there' aren't included. I've been wrong in the
past, but isn't the entire point that the newer data is better than the
old? To me it looks like instead of distributing all the source code to
openstreetmap like we do, people are insisting I produce the database
behind subversion with all the changes for the past two years.

Once you accept that privacy is important (which I'm sure some people
won't) and that we don't have a policy... Then you end up with something
that is pretty much like planet.osm but more work to produce. But I'll
still do it, it'll be cool for the animations of progress if nothing
else!

>> Then some concerns about forking projects came up.

I think forking is a concern, I don't think it's cool. But, I'm going to
encourage it with the solution I suggest below.

>> Now you say that you are threatened by the gouvernment to close down
>> the server.

What Imi is talking about here is something I have talked a little bit
about in person with a few people and then I tried to explain in an
email to some people including Imi without telling the full story. Let
me tell the full story:

I was sent an email some time ago by someone from a yahoo webmail
account. This email included 20,000 or so postcodes with their latitude
and longitude. It was suggested that I add these to freethepostcode and
that they had been collected by a courier over many years as he too had
been frustrated by the lack of data available. I did three things.

I replied and enquired what GPS unit he'd used, how he'd collected them,
could we meet up and so on. I sent the list of postcodes to two people
who are on this list and can make themselves known if they wish. One of
them had the full postcode database, the other is good at maths and data
analysis (The first one might be too :-). I sought advice on my legal
position if I published them through the website, through a company and
so on.

I received a brief reply from the guy and he didn't really answer any
questions which made me suspicious. The two guys compared the 20,000
list I was sent to the freethepostcode list and to the real postcode
data (which I've never seen).

What they found was very interesting. The real postcode list was in OSGB
so they converted it to lat/lon and subtracted the 20,000 list from the
authoritative list to get the 'error' on each postcode. That is, if I
take a postcode reading in my front garden, but the 'real' postcode is
on the roof, then my reading will be 10 or 20 meters out or something.

When you do this with freethepostcode data you get a scatter plot that
if I remember was usually 50 meters out in a circle (some were north,
some south or east etc of the real position). The entire graph was
slightly shifted to the east also, which is an interesting topic for
another time.
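
For anyone who wants to try this kind of comparison, here is a minimal
Python sketch of the subtraction step. It is not the code the two guys
used. It assumes both lists are already in lat/lon on the same datum
(the OSGB conversion is not shown), and it uses a simple flat-earth
approximation that is only reasonable for offsets of tens of meters.
The function names and the dict layout are made up for the example.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, good enough for a sketch


def offset_meters(reported, reference):
    """Return the (east, north) offset in meters of reported from reference.

    Both arguments are (lat, lon) tuples in degrees. Uses an
    equirectangular (flat-earth) approximation around the reference.
    """
    lat_rep, lon_rep = reported
    lat_ref, lon_ref = reference
    north = math.radians(lat_rep - lat_ref) * EARTH_RADIUS_M
    east = (math.radians(lon_rep - lon_ref) * EARTH_RADIUS_M
            * math.cos(math.radians(lat_ref)))
    return east, north


def error_scatter(reported_by_postcode, reference_by_postcode):
    """Offsets for every postcode that appears in both data sets."""
    return {
        postcode: offset_meters(position, reference_by_postcode[postcode])
        for postcode, position in reported_by_postcode.items()
        if postcode in reference_by_postcode
    }
```

Plotting the east values against the north values from `error_scatter`
gives the scatter plot: a blob of a few tens of meters is what honest
GPS readings should look like.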

When the same was done with the 20,000 list, something stunning
happened. The scatter plot was a little map of the united kingdom and
the error was tiny, if I remember it was mostly less than a few meters.
Obviously this was not data that had been collected as the guy had said.
Now this little map can come from two sources, either it was watermarked
to do that (to prove it was based on a certain set of data) or it was
the real data that had undergone a small projection error when
converting real data to OSGB. I believe the jury is out on that.

Either way, it wasn't free data.

It (the 20,000 postcode data) was also all over the country when the
guy claimed to be London based (and indeed, his email came from Camden
which is also where I lived). The legal advice was that if I put the
data up then I subverted the notice on the front of freethepostcode
which reads something like 'you only upload data you collect with a
gps'.

Now as I, Steve Coast, am the publisher of the data and I own the domain
name 'freethepostcode.org', I am liable if anyone uploads copyrighted
data (and to openstreetmap by the way). My defence is that if attacked,
I could remove all the data submitted by the person who they claimed
submitted copyrighted data. You could argue the details and this indeed
was Napster's defence against the RIAA, and the RIAA then had to give
lists of all the songs they wanted taken down.

You can argue some of the finer points about does the domain name mean
ownership, or is the ISP the 'publisher' but the basics are correct.

So if I subverted that mechanism then I would not be performing due
diligence, and someone in court could prove I knew all about copyright
(thanks to working at FIPR) and that would be the end of
freethepostcode. So I didn't upload the data based on the advice and
evidence to hand. This is good because in the future if anyone tries to
take me to court over copyright infringement I can point to this and
other cases and show that I perform due diligence.

So I was not 'threatened by the government' per se. I was invited to
commit copyright infringement which may have brought freethepostcode and
openstreetmap down (they were on the same box at the time). I leave it
as an exercise to the reader to figure out the motivation behind such a
person, and who they might work for.

Based on advice and personal preference I chose to keep this story to
myself and a few other people I spoke to about it at the time (feel free
to wave, you know who you are :-). It (and other incidents) made me much
more aware of my personal liability when publishing data.

So these are my reasons for not just dumping the database and giving it
to everyone. I hope that you agree that they are legitimate. Whether or
not you agree, they mostly drop away when the foundation is set up (pls
come to the IRC meeting :-) and it (the foundation) is doing the
publishing and the privacy infringing. I should say it limits my
liability (and transfers it to others depending on how it gets set up
(which will be discussed on IRC!)), but it's a lot better than me being
personally liable.

There is a safety valve in all of this. And his name is Nick Hill. It's
not quite the three pillars model, but Nick has full access to the
database as the guy who built and helped set up the new servers. If you
think I'm totally mad, convince Nick Hill to do it. I hope that Nick
would have the same privacy and copyright concerns as I do.

In any case, hopefully after Saturday we'll have daily planet.osm dumps
(with that naming convention someone specced out) which I should have
worked harder to produce in the past. Then, we can integrate the cool
openlayers stuff crschmidt's done in to the front page and get away from
the flaky tiles we're serving, if NickW doesn't mind too much that we'd
be dropping his work.

Now I want to point to the elephant in the room. The License. I chose it
without much thought way back at the start of the project and it's
proved a problem ever since. Like privacy and copyright infringement,
it's this thing that doesn't matter to most people but does deeply to a
few. If you look on the legal list and past blog posts people on this
list have made, there are deep problems with it and the clauses to do
with derived works and attribution (which currently means giving a full
list of users, I think). I won't go in to the details but many of you
have seen the license debates.

I was talking all this over with Tom last night and he had an idea: Why
not start again? It would solve everything. Let me explain.

To some degree, all of these problems will be solved by the OSMF which
will happen anyway. A privacy policy, liability, the license... all
these things will be handled but I don't think the results will be
pretty. The distribution of the entire user list as the 'attribution' of
the data conflicts with the privacy aspect for example. I'm sure you
could think of more things.

So I'm not going to force this, but I think we as a community should
consider doing the following. Get a privacy policy put together. Get the
license right, or go public domain like freethepostcode. The public
domain bit is what I meant by 'this would encourage forking'. And when
these two things are in place, shut down openstreetmap and mail all the
users (we also have no spam policy by the way).

In this email it will explain to all 2,500 or so people that we need to
get these things in place for the project to continue. It will explain
the privacy policy we decide. It will explain the license and
attribution and so on we decide (and if we can't decide, I think we
should go public domain so people can fork if they like). It will say
that 'if you agree, click here and all your data will be in the new
openstreetmap under these terms'.

I think if we get it right, most people will be happy to allow their
data to be used under the new terms. And besides, I don't know the
statistics, but I think the key contributors, data-wise, are actually
pretty small in number and on this list.

The OSM legal list might be a place to discuss some of the finer points
and get a policy together, even if the policy is 'there is no privacy'
and the license is 'public domain' which is the absolute simplest answer
that would solve everything.

If we did this, it would solve all the pressing problems I think. It's
not something that should be done immediately, but with a bit of thought
and perhaps when the OSMF is set up so that it can make these decisions.
If we got these things right, we'd have a rock solid base for OSM to
continue in years to come.

I hope this brain dump has been useful.

have fun,

SteveC steve at asklater.com http://www.asklater.com/steve/