[OSM-talk] import de-duplication

Robert (Jamie) Munro rjmunro at arjam.net
Thu Jul 12 14:21:32 BST 2007

Simon Hewison wrote:
> Just had a few thoughts about systems to merge in large datasets (AND, Tiger etc).
> Where there is existing data in the vicinity, we need to search both datasets 
> for attributes that match, and where there is a match, do we update OSM, or 
> leave it alone?
> For instance, if AND says the equivalent of a segment from X1,Y1 to X2,Y2 is 
> highway=motorway;oneway=true;ref=A17.. And OSM says there's a segment from 
> X1+0.000023,Y1-0.00064 to X2-0.00037,Y2+0.0000567 is highway=motorway.. Do we 
> expect it's actually the same road? If so, do we just update the tags on the 
> OSM data set.

It's harder than that because there may be 3 segments approximating a
curve in one data set, and 5 segments giving a smoother approximation in
the other data set. Also we have no way to know if the OSM data is any
good. Some OSM data is great and probably loads better than TIGER. Some
OSM data is sketchy and not really useful.
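
As a rough sketch of how to compare two ways that are cut into different
numbers of segments: measure how far each vertex of one line is from the
other line, and take the largest value. The function names and the
tolerance are made up for illustration, and the coordinates are assumed
to be already projected to metres (not raw lat/lon). It says nothing
about which data set is the better one.

```python
import math

def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b (planar coordinates)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        # Degenerate segment: both ends are the same point.
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its ends.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def point_polyline_distance(p, line):
    """Distance from p to the nearest segment of a polyline."""
    return min(point_segment_distance(p, a, b) for a, b in zip(line, line[1:]))

def max_vertex_deviation(line_a, line_b):
    """Largest distance from any vertex of either line to the other line."""
    a_to_b = max(point_polyline_distance(p, line_b) for p in line_a)
    b_to_a = max(point_polyline_distance(p, line_a) for p in line_b)
    return max(a_to_b, b_to_a)

def probably_same_road(line_a, line_b, tolerance):
    """Hypothetical test: the lines never stray further apart than tolerance."""
    return max_vertex_deviation(line_a, line_b) <= tolerance
```

A 1-segment way and a 4-segment way along the same road come out as a
match here, while a parallel road a few metres further away than the
tolerance does not.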

> The other option would be to leave a de-militarized zone for a few metres 
> surrounding the existing features in the OSM data set, and for the AND data 
> not to be automatically imported but the nearest features from the AND dataset 
> tagged 'here be dragons'; then for humans to go in and join things up on the 
> borders.

I agree that this option is better. The DMZ should probably be small,
though. Less than 100m or so. In fact, if the 2 datasets overlapped a
bit, that would probably make it easier to join them.
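
A minimal sketch of the DMZ idea, assuming ways are plain dicts with
"nodes" (lat, lon pairs) and "tags", which is an invented layout and not
any real import format. It only compares node to node, so a long
straight segment with few nodes could slip through; the "fixme" tag is
also just an assumption for how to mark the held-back ways.

```python
import math

EARTH_RADIUS_M = 6371000.0

def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))

def split_by_dmz(import_ways, osm_nodes, dmz_m=100.0):
    """Split import ways into those clear of existing data and those held back."""
    clear, held_back = [], []
    for way in import_ways:
        near = any(haversine_m(n, o) <= dmz_m
                   for n in way["nodes"] for o in osm_nodes)
        if near:
            # Hypothetical marker so a human can find and join these later.
            flagged = {**way, "tags": {**way["tags"], "fixme": "here be dragons"}}
            held_back.append(flagged)
        else:
            clear.append(way)
    return clear, held_back
```

The brute-force loop is fine for a sketch; a real import would need some
kind of spatial index to avoid comparing every node with every other.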

Maybe we can do something that carefully tries to join the borders by
looking for common tags in road names etc.
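
Something like the following could suggest joins, as a sketch only: it
pairs loose ends from the two data sets when the road names match after
trivial normalisation and the gap is small. The tuple layout, the ids
and the 100 m default are assumptions, and the coordinates are assumed
to be in metres. Real names will need fuzzier matching than this.

```python
import math

def normalise_name(name):
    """Lower-case and collapse whitespace so 'High  Street' == 'high street'."""
    return " ".join(name.lower().split())

def join_candidates(osm_ends, import_ends, max_gap_m=100.0):
    """Suggest joins between loose ends; each end is (way_id, name, (x, y))."""
    pairs = []
    for osm_id, osm_name, osm_pt in osm_ends:
        for imp_id, imp_name, imp_pt in import_ends:
            if normalise_name(osm_name) != normalise_name(imp_name):
                continue
            gap = math.dist(osm_pt, imp_pt)
            if gap <= max_gap_m:
                pairs.append((osm_id, imp_id, gap))
    # Closest pairs first, so the most likely joins are reviewed first.
    return sorted(pairs, key=lambda pair: pair[2])
```

The output is a list of suggestions for a human or a later pass to
check, not a list of joins to apply blindly.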

Personally, I believe strongly that we should get external data in the
DB as soon as possible, not review it in JOSM before it is uploaded. It
can be reviewed in Mapnik / OSMarender / Gosmore perfectly well once it
is in the db, and fixed with JOSM, Potlatch, the applet, or any other
editor.

If there is a TIGER county that has no existing OSM data in, just upload
it, and get them all uploaded ASAP, so that there is no chance of people
starting to map them from scratch, and so that people who live nearby
can see. Same with AND's data. Also if people can see "It's got near
where I live but is missing the footpath", they are more likely to click
edit and add the footpath than if all they can see is a big empty space.

Robert (Jamie) Munro