[OSM-talk] The long tail

Thu Jul 6 12:52:17 BST 2006

Andy waves :-) and points to the status page which now contains a list of
the main "node" contributors:

http://wiki.openstreetmap.org/index.php/Stats#Users_creating_the_most_map_nodes_with_time

Andy Robinson
Andy_J_Robinson at blueyonder.co.uk 

>-----Original Message-----
>From: talk-bounces at openstreetmap.org [mailto:talk-
>bounces at openstreetmap.org] On Behalf Of SteveC
>Sent: 06 July 2006 12:16
>To: talk at openstreetmap.org
>Subject: [OSM-talk] The long tail
>
>OSM is going through this difficult process with the OSMF and people are
>losing confidence in my sanity as I haven't released the entire database.
>I've tried to stay above some of the politics of this, but when Imi
>thinks you're mad something is very wrong.
>
>In a private email, Imi sent this which he's agreed I can publish:
>
>>> Then there are the bad server performance problems.
>>> Next was, that there are privacy issues.
>>> Then some concerns about forking projects came up.
>>> Now you say that you are threatened by the government to close down
>>> the server. When asked for evidence you just answer that we all agree
>>> with you anyway.
>
>Which are the reasons he thinks I'm mad, and he also thinks that I'm
>trying to control the database and not release it.
>
>I would like to. I think the biggest growth happened for me when the
>first planet.osm was released. I didn't think anybody would use it and
>then all this cool stuff happened with people's uses. I was really amazed
>but in retrospect it's kind of obvious that people would do cool stuff
>once the data was there, that's the whole point of the project! :-)
>
>So, I want to address these points and give you some new information.
>Then I'm going to point to the license problem and suggest a solution to
>all of them.
>
>>> Then there are the bad server performance problems.
>
>This point, I think, is that bad server performance limits the ability
>to produce a planet.osm. It does. The process takes _hours_ and totally
>fails if you try to do the TIGER data too. This was improved by NickW's
>planet.rb script and the node table speedup, but it still needs work.
>This can be fixed by running it on a slave, which is what I'm going to
>do on Saturday. People also mentioned running a clone elsewhere, but
>there are:
>
>>> Next was, that there are privacy issues.
>
>Yes. There are deep privacy issues. I used to volunteer for an
>organisation called FIPR (www.fipr.org), which at a glance is an EFF-like
>organisation in the UK. I became involved because when I was an
>undergraduate I set up a webcam to look over the bike rack at college.
>Lots of bikes (including mine, a day after I got it) were being stolen.
>The webcam allowed anyone in the department to view a live stream of the
>bikes to keep an eye on them.
>
>This is similar to the ideas of 'The Transparent Society', which is a
>book by David Brin, IIRC. It comes from the thought that either you have
>to trust specific people to watch CCTV like we do now, or we could just
>totally open up CCTV to everyone. Which is better? David Brin argues
>that the latter is the lesser of two evils. In fact, this is being tried
>out in a housing estate in the UK.
>
>So I set this up and then got some angry privacy emails from Ian Brown.
>Ian was a PhD student at the time (I think) in the department and was
>also director of FIPR. He pointed out many things like it was totally
>illegal, it was against college rules, there should be a notice to say
>CCTV and so on. I thought he was mad.
>
>Through more conversations I came around and then got quite immersed in
>things like privacy, copyright and data protection. He was even more
>adamant than I last night when we were talking about giving out the
>database, that I shouldn't.
>
>Why?
>
>Before I go in to the exact pieces of data, I want to point out two
>things. First, we don't have a privacy policy. I'm going to leave that
>there, but it's really important. Second, retrospective privacy
>invasion. What we do now may have unforeseen privacy implications later
>on. For example, not many people on Usenet expected a full searchable
>archive like there is now. Also, it may well be possible to run face
>recognition on all of those people you're in photos with on Flickr.
>Two examples.
>
>We have very sensitive data in OSM. Some have pointed out that they
>don't care if it's released. That's cool. Opt-in privacy is what we need
>though, not opt-out. By that I mean people should decide they want their
>data released, not have to opt out of it being released.
>
>So, traces. Traces tell a story about where you've been, how fast you
>went and so on. They have opt-in privacy where you can make your data
>public if you wish. Stripping the timestamp doesn't help much. Matt Amos
>thought about this way back at the beginning of OSM because he didn't
>want people to figure out where he lived. He couldn't just switch off
>his GPS when he was a few minutes away as that would lead to a circle of
>space without traces pointing to where he lived. He tried other methods
>but eventually gave up. I thought he was mad at the time, but now I see
>sense.
>
>So let's drop the traces from the dump, people say. Ok cool.
>
>There's the user table. Drop that, people say, we don't want the user
>list. Ok.
>
>Now segments, nodes and ways. All of these have user data attached which
>is sensitive. Drop that, people say. Ok. Now the timestamps will give
>away a lot about the user doing it and you could figure out unique users as
>we're not all editing at random times, many of us do it at specific
>times. Drop that? Ok.
>
>What we're left with are three or four database tables that have to have
>various columns dropped from all of them.
>
>And this is almost exactly the same as planet.osm.
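
A minimal sketch of the column stripping described above, for illustration only. The field names (`user_id`, `timestamp`) and the sample rows are assumptions; the actual OSM table layout is not given in this message.

```python
# Hypothetical column names; the real schema is not described in this thread.
SENSITIVE_FIELDS = {"user_id", "timestamp"}


def strip_sensitive(rows):
    """Return copies of the rows with the user and timestamp columns dropped."""
    return [
        {key: value for key, value in row.items() if key not in SENSITIVE_FIELDS}
        for row in rows
    ]


# Made-up sample data, not real OSM records.
nodes = [
    {"id": 1, "lat": 51.5390, "lon": -0.1426, "user_id": 7,
     "timestamp": "2006-07-01 19:02:11"},
    {"id": 2, "lat": 51.5401, "lon": -0.1440, "user_id": 7,
     "timestamp": "2006-07-01 19:03:40"},
]

public_nodes = strip_sensitive(nodes)
```

What remains in each row is just the identifier and position, which is roughly the shape of the data the message says already ships in planet.osm.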
>
>The difference is that the history data, the bits that say 'node foo was
>here and then moved to there' aren't included. I've been wrong in the
>past, but isn't the entire point that the newer data is better than the
>old? To me it looks like instead of distributing all the source code to
>openstreetmap like we do, people are insisting I produce the database
>behind Subversion with all the changes for the past two years.
>
>Once you accept that privacy is important (which I'm sure some people
>won't) and that we don't have a policy... Then you end up with something
>that is pretty much like planet.osm but more work to produce. But I'll
>still do it, it'll be cool for the animations of progress if nothing
>else!
>
>>> Then some concerns about forking projects came up.
>
>I think forking is a concern, I don't think it's cool. But, I'm going to
>encourage it with the solution I suggest below.
>
>>> Now you say that you are threatened by the government to close down
>>> the server.
>
>What Imi is talking about here is something I have talked a little bit
>about in person with a few people and then I tried to explain in an
>email to some people including Imi without telling the full story. Let
>me tell the full story:
>
>I was sent an email some time ago by someone from a Yahoo webmail
>account. This email included 20,000 or so postcodes with their latitude
>and longitude. It was suggested that I add these to freethepostcode and
>that they had been collected by a courier over many years as he too had
>been frustrated by the lack of data available. I did three things.
>
>I replied and enquired what GPS unit he'd used, how he'd collected them,
>could we meet up and so on. I sent the list of postcodes to two people
>who are on this list and can make themselves known if they wish. One of
>them had the full postcode database, the other is good at maths and data
>analysis (The first one might be too :-). I sought advice on my legal
>position if I published them through the website, through a company and
>so on.
>
>I received a brief reply from the guy and he didn't really answer any
>questions which made me suspicious. The two guys compared the 20,000
>list I was sent to the freethepostcode list and to the real postcode
>data (which I've never seen).
>
>What they found was very interesting. The real postcode list was in OSGB
>so they converted it to lat/lon and subtracted the 20,000 list from the
>authoritative list to get the 'error' on each postcode. That is, if I
>take a postcode reading in my front garden, but the 'real' postcode is
>on the roof, then my reading will be 10 or 20 meters out or something.
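
The subtraction described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the analysis that was actually run: it skips the OSGB conversion and assumes both lists are already in lat/lon, it uses a flat-earth approximation that is only reasonable over short distances, and the postcode and coordinates are invented.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, an approximation


def offset_meters(reading, reference):
    """Return the (east, north) offset in meters of reading from reference.

    Both arguments are (lat, lon) pairs in degrees.
    """
    lat_r, lon_r = reading
    lat_0, lon_0 = reference
    north = math.radians(lat_r - lat_0) * EARTH_RADIUS_M
    east = (math.radians(lon_r - lon_0) * EARTH_RADIUS_M
            * math.cos(math.radians(lat_0)))
    return east, north


def postcode_errors(readings, reference):
    """Map each postcode found in both dicts to its (east, north) error."""
    return {
        postcode: offset_meters(position, reference[postcode])
        for postcode, position in readings.items()
        if postcode in reference
    }


# Invented example data: one placeholder postcode, read slightly too far north.
reference = {"XX1 1XX": (51.5000, -0.1000)}
readings = {"XX1 1XX": (51.5001, -0.1000)}

errors = postcode_errors(readings, reference)
east, north = errors["XX1 1XX"]
```

Plotting the east values against the north values for every postcode gives the kind of scatter plot discussed next: independently collected readings should form a cloud around the origin.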
>
>When you do this with freethepostcode data you get a scatter plot that
>if I remember was usually 50 meters out in a circle (some were north,
>some south or east etc of the real position). The entire graph was
>slightly shifted to the east also, which is an interesting topic for
>another time.
>
>When the same was done with the 20,000 list, something stunning
>happened. The scatter plot was a little map of the United Kingdom and
>the error was tiny, if I remember it was mostly less than a few meters.
>Obviously this was not data that had been collected as the guy had said.
>Now this little map can come from two sources, either it was watermarked
>to do that (to prove it was based on a certain set of data) or it was
>the real data that had undergone a small projection error when
>converting real data to OSGB. I believe the jury is out on that.
>
>Either way, it wasn't free data.
>
>It (the 20,000 postcode data) was also all over the country when the
>guy claimed to be London-based (and indeed, his email came from Camden
>which is also where I lived). The legal advice was that if I put the
>data up then I subverted the notice on the front of freethepostcode
>which reads something like 'you only upload data you collect with a
>gps'.
>
>Now as I, Steve Coast, am the publisher of the data and I own the domain
>name 'freethepostcode.org', I am liable if anyone uploads copyrighted
>data (and to openstreetmap by the way). My defence is that if attacked,
>I could remove all the data submitted by the person who they claimed
>submitted copyrighted data. You could argue the details and this indeed
>was Napster's defence against the RIAA, and the RIAA then had to give
>lists of all the songs they wanted taken down.
>
>You can argue some of the finer points about does the domain name mean
>ownership, or is the ISP the 'publisher' but the basics are correct.
>
>So if I subverted that mechanism then I would not be performing due
>diligence, and someone in court could prove I knew all about copyright
>(thanks to working at FIPR) and that would be the end of
>freethepostcode. So I didn't upload the data based on the advice and
>evidence to hand. This is good because in the future if anyone tries to
>take me to court over copyright infringement I can point to this and
>other cases and show that I perform due diligence.
>
>
>So I was not 'threatened by the government' per se. I was invited to
>commit copyright infringement which may have brought freethepostcode and
>openstreetmap down (they were on the same box at the time). I leave it
>as an exercise to the reader to figure out the motivation behind such a
>person, and who they might work for.
>
>Based on advice and personal preference I chose to keep this story to
>myself and a few other people I spoke to about it at the time (feel free
>to wave, you know who you are :-). It (and other incidents) made me much
>more aware of my personal liability when publishing data.
>
>
>So these are my reasons for not just dumping the database and giving it
>to everyone. I hope that you agree that they are legitimate. Whether or
>not you agree, they mostly drop away when the foundation is set up (pls
>come to the IRC meeting :-) and it (the foundation) is doing the
>publishing and the privacy infringing. I should say it limits my
>liability (and transfers it to others depending on how it gets set up
>(which will be discussed on IRC!)), but it's a lot better than me being
>personally liable.
>
>There is a safety valve in all of this. And his name is Nick Hill. It's
>not quite the three pillars model, but Nick has full access to the
>database as the guy who built and helped set up the new servers. If you
>think I'm totally mad, convince Nick Hill to do it. I hope that Nick
>would have the same privacy and copyright concerns as I do.
>
>In any case, hopefully after Saturday we'll have daily planet.osm dumps
>(with that naming convention someone specced out) which I should have
>worked harder to produce in the past. Then, we can integrate the cool
>OpenLayers stuff crschmidt's done into the front page and get away from
>the flaky tiles we're serving, if NickW doesn't mind too much that we'd
>be dropping his work.
>
>
>Now I want to point to the elephant in the room. The License. I chose it
>without much thought way back at the start of the project and it's
>proved a problem ever since. Like privacy and copyright infringement,
>it's this thing that doesn't matter to most people but does deeply to a
>few. If you look on the legal list and past blog posts people on this
>list have made, there are deep problems with it and the clauses to do
>with derived works and attribution (which currently means giving a full
>list of users, I think). I won't go into the details but many of you
>have seen the license debates.
>
>
>I was talking all this over with Tom last night and he had an idea: Why
>not start again? It would solve everything. Let me explain.
>
>To some degree, all of these problems will be solved by the OSMF which
>will happen anyway. A privacy policy, liability, the license... all
>these things will be handled but I don't think the results will be
>pretty. The distribution of the entire user list as the 'attribution' of
>the data conflicts with the privacy aspect for example. I'm sure you
>could think of more things.
>
>So I'm not going to force this, but I think we as a community should
>consider doing the following. Get a privacy policy put together. Get the
>license right, or go public domain like freethepostcode. The public
>domain bit is what I meant by 'this would encourage forking'. And when
>these two things are in place, shut down openstreetmap and mail all the
>users (we also have no spam policy by the way).
>
>In this email it will explain to all 2,500 or so people that we need to
>get these things in place for the project to continue. It will explain
>the privacy policy we decide. It will explain the license and
>attribution and so on we decide (and if we can't decide, I think we
>should go public domain so people can fork if they like). It will say
>that 'if you agree, click here and all your data will be in the new
>openstreetmap under these terms'.
>
>I think if we get it right, most people will be happy to allow their
>data to be used under the new terms. And besides I don't know the
>statistics but I think the key contributors, data-wise, are actually
>pretty small in number and on this list.
>
>The OSM legal list might be a place to discuss some of the finer points
>and get a policy together, even if the policy is 'there is no privacy'
>and the license is 'public domain' which is the absolute simplest answer
>that would solve everything.
>
>If we did this, it would solve all the pressing problems I think. It's
>not something that should be done immediately, but with a bit of thought
>and perhaps when the OSMF is set up so that it can make these decisions.
>If we got these things right, we'd have a rock solid base for OSM to
>continue in years to come.
>
>I hope this brain dump has been useful.
>
>have fun,
>
>SteveC steve at asklater.com http://www.asklater.com/steve/