[OSM-talk] Zero tolerance on imports

Tue Feb 22 17:02:49 GMT 2011

2011/2/22 Peter Budny <peterb at gatech.edu>:
> Anders Arnholm <anders at arnholm.se> writes:
> On the contrary... the bigger the database, the more we need tools to
> help us understand and manipulate the data.  When there are only 100
> POI nodes in a city, I can easily check them all by hand.  When there
> are 100000, that's when automated or semi-automated tools are necessary.

To do what? Why would you want to check all the POIs? If I come across
something that is missing, I add it; if something is mapped that is not
there in reality and I am aware of it, I delete it (or, more often, move
it to the right position). The more the data gets used, the more errors
get found. OSM is a project with tens of thousands of contributors, but
we will need millions of users and possibly a back channel to maintain
all the data. No bot on earth can tell you if a POI is at the right
position, is well described or is there at all (in the real world).

> Sorry, I avoided your question.  As for imports: the bigger OSM gets,
> the harder it is to ensure coverage.  If I got the supposed McDonald's
> POI dataset, how would we know whether OSM already has 100% of them, or
> only 98%?

Ask McDonald's or, even better, let them check ;-) Who cares whether we
have all McDonald's, if not they themselves? Besides that, the answer to
your question is very simple: count them.

> This discussion has somehow conflated robots and tools with imports, and
> that may be partially my fault.  But if we had better tools for
> performing imports, it might be easier to stitch them together with
> existing hand-edited data, and imports wouldn't be such a destructive
> process.

While I am not generally against imports, I have become more and more
opposed to them in the past few years. The benefit of importing publicly
available data into our database is very small compared with keeping a
parallel crowd-sourced dataset (e.g. you could also compare them one
against the other to find problems in either one). I don't know about the quality
of the publicly available data in the US (I guess TIGER was a
super-neat and up-to-date dataset, but my knowledge is based on people
admiring it here on the ML), but the so-called "official" data that I
have seen is often worse than people think. Nobody can actually afford
to spend as much time on the data as we do ;-).

cheers,
Martin