[OSM-dev] Duplicate data from Tiger import

Jon Burgess jburgess at uklinux.net
Tue Nov 28 20:01:09 GMT 2006


On Tue, 2006-11-28 at 09:48 +0000, SteveC wrote:
> * @ 28/11/06 01:21:03 AM jburgess at uklinux.net wrote:
> > On Sun, 2006-11-26 at 22:24 +0000, Jon Burgess wrote:
> > > I noticed that Mapnik was taking much longer than normal to process some
> > > areas of the US and have found that there are instances where the same
> > > data is duplicated over 100 times. For example see:
> > > 
> > > http://www.openstreetmap.org/api/0.3/map?bbox=-84.316874,39.16047,-84.315683,39.161368
> > > 
> > > If this is displayed in JOSM there are only 5 distinct nodes and yet the
> > > raw XML shows that each of the nodes, segments and ways is duplicated
> > > 102 times. 
> > > 
> > > 
> > > I don't know whether this is a problem with the original tiger data or
> > > the import process, but it looks like something needs to be done to
> > > remove the redundant data. 
> > > 
> > > 	Jon
> > > 
> > 
> > Today I tried devising an enhanced osm2pgsql.c which would exclude
> > duplicate ways while generating the SQL. I've got something which seems
> > to work and indicates that around 60% of all nodes and ways in the
> > planet-061112 are duplicates. 
> 
> How do you define a dupe?

The program filters for things which are interesting to Mapnik. Items are
considered duplicates as follows:


Nodes: identical lat & lon only (other attributes and tags are ignored).
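
A minimal sketch of the node rule, written in Python for brevity rather than
the C of osm2pgsql.c. The function name and the (id, lat, lon) tuple layout
are assumptions made for illustration, not taken from the actual program.

```python
def canonical_nodes(nodes):
    """Map each node id to the first-seen node id with the same (lat, lon).

    nodes: iterable of (node_id, lat, lon) tuples. Tags and any other
    attributes are deliberately ignored, as in the rule above.
    """
    first_at = {}  # (lat, lon) -> id of the first node seen at that position
    canon = {}     # node id -> canonical node id
    for node_id, lat, lon in nodes:
        canon[node_id] = first_at.setdefault((lat, lon), node_id)
    return canon


nodes = [(1, 0.0, 0.0), (2, 1.0, 1.0), (3, 0.0, 0.0), (4, 1.0, 1.0)]
print(canonical_nodes(nodes))  # {1: 1, 2: 2, 3: 1, 4: 2}
```

A node is a duplicate whenever its canonical id differs from its own id.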


Segments: identical to & from, OR the to/from nodes are themselves
duplicates of other nodes and there is another segment between those
duplicate nodes, e.g.
  nodeA(0,0), nodeB(1,1), nodeC(0,0), nodeD(1,1)
  segmentP(nodeA, nodeB), segmentQ(nodeC, nodeD)

Nodes A/C & B/D are dupes. Segment P/Q is therefore also a dupe, due to
the duplicate nodes.
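
The segment rule can be sketched the same way, again in Python rather than C
and with hypothetical names. Keying each segment on the coordinates of its
end nodes covers both cases at once: identical to/from, and to/from that are
duplicates of other nodes. Treating direction as significant (from, to) is an
assumption based on the wording "identical to & from".

```python
def duplicate_segments(nodes, segments):
    """Map each segment id to the first-seen equivalent segment id.

    nodes:    {node_id: (lat, lon)}
    segments: {segment_id: (from_node_id, to_node_id)}
    """
    first_seen = {}  # ((lat, lon) of from, (lat, lon) of to) -> segment id
    canon = {}       # segment id -> canonical segment id
    for seg_id, (frm, to) in segments.items():
        key = (nodes[frm], nodes[to])
        canon[seg_id] = first_seen.setdefault(key, seg_id)
    return canon


nodes = {"A": (0, 0), "B": (1, 1), "C": (0, 0), "D": (1, 1)}
segments = {"P": ("A", "B"), "Q": ("C", "D")}
print(duplicate_segments(nodes, segments))  # {'P': 'P', 'Q': 'P'}
```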


Ways: identical list of segments & certain tags are identical (the
exportTags used by Mapnik: highway, name etc). The order of the segments
does not matter. Following a similar process to that used for segments,
ways are duplicates if their segments are the same or are duplicates of
one another.
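
A sketch of the way rule under the same caveats (Python, hypothetical names).
EXPORT_TAGS here is an assumed two-tag subset standing in for the real
exportTags list, and seg_canon is assumed to be a mapping from each duplicate
segment id to its canonical segment id. A frozenset makes the comparison
independent of segment order; note that it also ignores a segment being
listed twice in one way.

```python
EXPORT_TAGS = ("highway", "name")  # assumed subset; the real list is longer


def duplicate_ways(ways, seg_canon):
    """Map each way id to the first-seen equivalent way id.

    ways:      {way_id: (list_of_segment_ids, tags_dict)}
    seg_canon: {segment_id: canonical_segment_id}; ids missing from the
               mapping are treated as their own canonical id.
    """
    first_seen = {}  # (segment set, export tag values) -> way id
    canon = {}       # way id -> canonical way id
    for way_id, (seg_ids, tags) in ways.items():
        key = (
            frozenset(seg_canon.get(s, s) for s in seg_ids),
            tuple(tags.get(t) for t in EXPORT_TAGS),
        )
        canon[way_id] = first_seen.setdefault(key, way_id)
    return canon


seg_canon = {"Q": "P", "S": "R"}
ways = {
    10: (["P", "R"], {"highway": "residential", "name": "Main St"}),
    11: (["S", "Q"], {"highway": "residential", "name": "Main St", "source": "x"}),
    12: (["P", "R"], {"highway": "residential", "name": "Other St"}),
}
print(duplicate_ways(ways, seg_canon))  # {10: 10, 11: 10, 12: 12}
```

Way 11 collapses into way 10 despite the different segment order and the
extra non-export tag; way 12 stays distinct because its name differs.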




> Where are these things, are they in the US (eg the TIGER import) or
> somewhere else?
> 

Don't know yet. I'll need to add some more stats counters to match tiger
data.

	Jon





