[OSM-talk] The long tail

Thu Jul 6 12:59:28 BST 2006

Steve

Thanks for taking the time to explain all this to the main mailing list.

For what it's worth, I agree with all the points you have made.

David

----- Original Message ----- 
From: "SteveC" <steve at asklater.com>
To: <talk at openstreetmap.org>
Sent: Thursday, July 06, 2006 12:15 PM
Subject: [OSM-talk] The long tail

> OSM is going through this difficult process with the OSMF and people are
> losing confidence in my sanity as I haven't released the entire database.
> I've tried to stay above some of the politics of this, but when Imi
> thinks you're mad something is very wrong.
> 
> In a private email, Imi sent this which he's agreed I can publish:
> 
>>> Then there are the bad server performance problems.
>>> Next was, that there are privacy issues.
>>> Then some concerns about forking projects came up.
>>> Now you say that you are threatened by the government to close down
>>> the server. When asked for evidence you just answer that we all agree
>>> with you anyway.
> 
> Which are the reasons he thinks I'm mad, and he also thinks that I'm
> trying to control the database and not release it.
> 
> I would like to. I think the biggest growth happened for me when the
> first planet.osm was released. I didn't think anybody would use it and
> then all this cool stuff happened with people's uses. I was really amazed
> but in retrospect it's kind of obvious that people would do cool stuff
> once the data was there, that's the whole point of the project! :-)
> 
> So, I want to address these points and give you some new information.
> Then I'm going to point to the license problem and suggest a solution to
> all of them.
> 
>>> Then there are the bad server performance problems.
> 
> This point, I think, is that bad server performance limits the ability
> to produce a planet.osm. It does. The process takes _hours_ and totally
> fails if you try to do the TIGER data too. This was improved by NickW's
> planet.rb script and the node table speedup, but it still needs work.
> This can be fixed by running it on a slave, which is what I'm going
> to do on Saturday. People also mentioned running a clone elsewhere, but
> there are:
> 
>>> Next was, that there are privacy issues.
> 
> Yes. There are deep privacy issues. I used to volunteer for an
> organisation called FIPR (www.fipr.org) which at a glance is an EFF-like
> organisation in the UK. I became involved because when I was an
> undergraduate I set up a webcam to look over the bike rack at college.
> Lots of bikes (including mine, a day after I got it) were being stolen.
> The webcam allowed anyone in the department to view a live stream of the
> bikes to keep an eye on them.
> 
> This is similar to the ideas of 'the transparent society' which is a
> book by David Brin, IIRC. It comes from the thought that either you have
> to trust specific people to watch CCTV like we do now, or we could just
> totally open up CCTV to everyone. Which is better? David Brin argues
> that the latter is the lesser of two evils. In fact, this is being tried
> out in a housing estate in the UK.
> 
> So I set this up and then got some angry privacy emails from Ian Brown.
> Ian was a PhD student at the time (I think) in the department and was
> also director of FIPR. He pointed out many things like it was totally
> illegal, it was against college rules, there should be a notice to say
> CCTV and so on. I thought he was mad.
> 
> Through more conversations I came around and then got quite immersed in
> things like privacy, copyright and data protection. He was even more
> adamant than I last night when we were talking about giving out the
> database, that I shouldn't.
> 
> Why?
> 
> Before I go in to the exact pieces of data, I want to point out two
> things. First, we don't have a privacy policy. I'm going to leave that
> there, but it's really important. Second, retrospective privacy
> invasion. What we do now may have unforeseen privacy implications later
> on. For example, not many people on usenet expected a full searchable
> archive like there is now. Also, it may well be possible to extract
> face recognition out of all of those people you're in photos with in
> flickr. Two examples.
> 
> We have very sensitive data in OSM. Some have pointed out that they
> don't care if it's released. That's cool. Opt-in privacy is what we need
> though, not opt-out. By that I mean people should decide they want their
> data released not have to opt-out of it being released.
> 
> So, traces. Traces tell a story about where you've been, how fast you
> went and so on. They have opt-in privacy where you can make your data
> public if you wish. Stripping the timestamp doesn't help much. Matt Amos
> thought about this way back at the beginning of OSM because he didn't
> want people to figure out where he lived. He couldn't just switch off
> his GPS when he was a few minutes away as that would lead to a circle of
> space without traces pointing to where he lived. He tried other methods
> but eventually gave up. I thought he was mad at the time, but now I see
> sense.
> 
> So let's drop the traces from the dump, people say. Ok cool.
> 
> There's the user table. Drop that people say, we don't want the user
> list. Ok.
> 
> Now segments, nodes and ways. All of these have user data attached which
> is sensitive. Drop that people say. Ok. Now the timestamps will give a
> lot about the user doing it and you could figure out unique users as
> we're not all editing at random times, many of us do it at specific
> times. Drop that? Ok.
> 
> What we're left with are three or four database tables that have to have
> various columns dropped from all of them.
> 
> And this is almost exactly the same as planet.osm.
> 
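A minimal sketch of the column-dropping described above, assuming each table is exported as a list of dicts. The field names `user`, `user_id` and `timestamp` are hypothetical and are not taken from the actual OSM schema:

```python
# Hypothetical names for the columns the post treats as sensitive.
SENSITIVE_FIELDS = frozenset({"user", "user_id", "timestamp"})


def strip_sensitive(rows, sensitive=SENSITIVE_FIELDS):
    """Return copies of the rows with the sensitive columns removed."""
    return [
        {key: value for key, value in row.items() if key not in sensitive}
        for row in rows
    ]


# Example: a made-up node row before and after stripping.
nodes = [{"id": 1, "lat": 51.5, "lon": -0.1, "user_id": 42,
          "timestamp": "2006-07-06T12:15:00"}]
public_nodes = strip_sensitive(nodes)
```

What remains after stripping is only the geometry and ids, which is the sense in which the result looks much like planet.osm.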
> The difference is that the history data, the bits that say 'node foo was
> here and then moved to there' aren't included. I've been wrong in the
> past, but isn't the entire point that the newer data is better than the
> old? To me it looks like instead of distributing all the source code to
> openstreetmap like we do, people are insisting I produce the database
> behind subversion with all the changes for the past two years.
> 
> Once you accept that privacy is important (which I'm sure some people
> won't) and that we don't have a policy... Then you end up with something
> that is pretty much like planet.osm but more work to produce. But I'll
> still do it, it'll be cool for the animations of progress if nothing
> else!
> 
>>> Then some concerns about forking projects came up.
> 
> I think forking is a concern, I don't think it's cool. But, I'm going to
> encourage it with the solution I suggest below.
> 
>>> Now you say that you are threatened by the government to close down
>>> the server.
> 
> What Imi is talking about here is something I have talked a little bit
> about in person with a few people and then I tried to explain in an
> email to some people including Imi without telling the full story. Let
> me tell the full story:
> 
> I was sent an email some time ago by someone from a Yahoo webmail
> account. This email included 20,000 or so postcodes with their latitude
> and longitude. It was suggested that I add these to freethepostcode and
> that they had been collected by a courier over many years as he too had
> been frustrated by the lack of data available. I did three things.
> 
> I replied and enquired what GPS unit he'd used, how he'd collected them,
> could we meet up and so on. I sent the list of postcodes to two people
> who are on this list and can make themselves known if they wish. One of
> them had the full postcode database, the other is good at maths and data
> analysis (The first one might be too :-). I sought advice on my legal
> position if I published them through the website, through a company and
> so on.
> 
> I received a brief reply from the guy and he didn't really answer any
> questions which made me suspicious. The two guys compared the 20,000
> list I was sent to the freethepostcode list and to the real postcode
> data (which I've never seen).
> 
> What they found was very interesting. The real postcode list was in OSGB
> so they converted it to lat/lon and subtracted the 20,000 list from the
> authoritative list to get the 'error' on each postcode. That is, if I
> take a postcode reading in my front garden, but the 'real' postcode is
> on the roof, then my reading will be 10 or 20 meters out or something.
> 
> When you do this with freethepostcode data you get a scatter plot that
> if I remember was usually 50 meters out in a circle (some were north,
> some south or east etc of the real position). The entire graph was
> slightly shifted to the east also, which is an interesting topic for
> another time.
> 
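A rough sketch of the per-postcode 'error' calculation described above, assuming both lists are already in lat/lon degrees and keyed by postcode. The function names and the sample coordinates are made up for illustration, and the flat-earth approximation is a simplification rather than the method the two analysts actually used:

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, good enough for small offsets


def offset_m(ref, obs):
    """(east, north) offset in metres of obs from ref; both are (lat, lon) in degrees."""
    ref_lat, ref_lon = ref
    obs_lat, obs_lon = obs
    north = math.radians(obs_lat - ref_lat) * EARTH_RADIUS_M
    east = (math.radians(obs_lon - ref_lon) * EARTH_RADIUS_M
            * math.cos(math.radians(ref_lat)))
    return east, north


def postcode_errors(reference, submitted):
    """Offsets for every submitted postcode that also appears in the reference list."""
    return {
        code: offset_m(reference[code], position)
        for code, position in submitted.items()
        if code in reference
    }


# Made-up example: one reading taken 0.001 degrees north of the reference point.
reference = {"XX1 1XX": (51.500, -0.100)}
submitted = {"XX1 1XX": (51.501, -0.100), "YY2 2YY": (52.0, 0.0)}
errors = postcode_errors(reference, submitted)
```

Plotting the resulting (east, north) pairs gives the scatter plot: a roughly circular cloud for independently collected readings, and something far tighter for data derived from the reference list itself.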
> When the same was done with the 20,000 list, something stunning
> happened. The scatter plot was a little map of the united kingdom and
> the error was tiny, if I remember it was mostly less than a few meters.
> Obviously this was not data that had been collected as the guy had said.
> Now this little map can come from two sources, either it was watermarked
> to do that (to prove it was based on a certain set of data) or it was
> the real data that had undergone a small projection error when
> converting real data to OSGB. I believe the jury is out on that.
> 
> Either way, it wasn't free data.
> 
> It (the 20,000 postcode data) was also all over the country when the
> guy claimed to be London based (and indeed, his email came from Camden
> which is also where I lived). The legal advice was that if I put the
> data up then I subverted the notice on the front of freethepostcode
> which reads something like 'you only upload data you collect with a
> gps'.
> 
> Now as I, Steve Coast, am the publisher of the data and I own the domain
> name 'freethepostcode.org', I am liable if anyone uploads copyrighted
> data (and to openstreetmap by the way). My defence is that if attacked,
> I could remove all the data submitted by the person who they claimed
> submitted copyrighted data. You could argue the details and this indeed
> was Napster's defence against the RIAA, and the RIAA then had to give
> lists of all the songs they wanted taken down.
> 
> You can argue some of the finer points about does the domain name mean
> ownership, or is the ISP the 'publisher' but the basics are correct.
> 
> So if I subverted that mechanism then I would not be performing due
> diligence, and someone in court could prove I knew all about copyright
> (thanks to working at FIPR) and that would be the end of
> freethepostcode. So I didn't upload the data based on the advice and
> evidence to hand. This is good because in the future if anyone tries to
> take me to court over copyright infringement I can point to this and
> other cases and show that I perform due diligence.
> 
> 
> So I was not 'threatened by the government' per se. I was invited to
> commit copyright infringement which may have brought freethepostcode and
> openstreetmap down (they were on the same box at the time). I leave it
> as an exercise to the reader to figure out the motivation behind such a
> person, and who they might work for.
> 
> Based on advice and personal preference I chose to keep this story to
> myself and a few other people I spoke to about it at the time (feel free
> to wave, you know who you are :-). It (and other incidents) made me much
> more aware of my personal liability when publishing data.
> 
> 
> So these are my reasons for not just dumping the database and giving it
> to everyone. I hope that you agree that they are legitimate. Whether or
> not you agree, they mostly drop away when the foundation is set up (pls
> come to the IRC meeting :-) and it (the foundation) is doing the
> publishing and the privacy infringing. I should say it limits my
> liability (and transfers it to others depending on how it gets set up
> (which will be discussed on IRC!)), but it's a lot better than me being
> personally liable.
> 
> There is a safety valve in all of this. And his name is Nick Hill. It's
> not quite the three pillars model, but Nick has full access to the
> database as the guy who built and helped set up the new servers. If you
> think I'm totally mad, convince Nick Hill to do it. I hope that Nick
> would have the same privacy and copyright concerns as I do.
> 
> In any case, hopefully after Saturday we'll have daily planet.osm dumps
> (with that naming convention someone specced out) which I should have
> worked harder to produce in the past. Then, we can integrate the cool
> openlayers stuff crschmidt's done in to the front page and get away from
> the flaky tiles we're serving, if NickW doesn't mind too much that we'd
> be dropping his work.
> 
> 
> Now I want to point to the elephant in the room. The License. I chose it
> without much thought way back at the start of the project and it's
> proved a problem ever since. Like privacy and copyright infringement,
> it's this thing that doesn't matter to most people but does deeply to a
> few. If you look on the legal list and past blog posts people on this
> list have made, there are deep problems with it and the clauses to do
> with derived works and attribution (which currently means giving a full
> list of users, I think). I won't go in to the details but many of you
> have seen the license debates.
> 
> 
> I was talking all this over with Tom last night and he had an idea: Why
> not start again? It would solve everything. Let me explain.
> 
> To some degree, all of these problems will be solved by the OSMF which
> will happen anyway. A privacy policy, liability, the license... all
> these things will be handled but I don't think the results will be
> pretty. The distribution of the entire user list as the 'attribution' of
> the data conflicts with the privacy aspect for example. I'm sure you
> could think of more things.
> 
> So I'm not going to force this, but I think we as a community should
> consider doing the following. Get a privacy policy put together. Get the
> license right, or go public domain like freethepostcode. The public
> domain bit is what I meant by 'this would encourage forking'. And when
> these two things are in place, shut down openstreetmap and mail all the
> users (we also have no spam policy by the way).
> 
> In this email it will explain to all 2,500 or so people that we need to
> get these things in place for the project to continue. It will explain
> the privacy policy we decide. It will explain the license and
> attribution and so on we decide (and if we can't decide, I think we
> should go public domain so people can fork if they like). It will say
> that 'if you agree, click here and all your data will be in the new
> openstreetmap under these terms'.
> 
> I think if we get it right, most people will be happy to allow their
> data to be used under the new terms. And besides, I don't know the
> statistics, but I think the key contributors, data-wise, are actually
> pretty small in number and on this list.
> 
> The OSM legal list might be a place to discuss some of the finer points
> and get a policy together, even if the policy is 'there is no privacy'
> and the license is 'public domain' which is the absolute simplest answer
> that would solve everything.
> 
> If we did this, it would solve all the pressing problems I think. It's
> not something that should be done immediately, but with a bit of thought
> and perhaps when the OSMF is set up so that it can make these decisions.
> If we got these things right, we'd have a rock solid base for OSM to
> continue in years to come.
> 
> I hope this brain dump has been useful.
> 
> have fun,
> 
> SteveC steve at asklater.com http://www.asklater.com/steve/
> 
> _______________________________________________
> talk mailing list
> talk at openstreetmap.org
> http://lists.openstreetmap.org/cgi-bin/mailman/listinfo/talk
> 
>