[OSM-dev] Duplicate data from Tiger import

Jon Burgess jburgess at uklinux.net
Tue Nov 28 01:21:03 GMT 2006


On Sun, 2006-11-26 at 22:24 +0000, Jon Burgess wrote:
> I noticed that Mapnik was taking much longer than normal to process some
> areas of the US and have found that there are instances where the same
> data is duplicated over 100 times. For example see:
> 
> http://www.openstreetmap.org/api/0.3/map?bbox=-84.316874,39.16047,-84.315683,39.161368
> 
> If this is displayed in JOSM there are only 5 distinct nodes and yet the
> raw XML shows that each of the nodes, segments and ways is duplicated
> 102 times. 
> 
> 
> I don't know whether this is a problem with the original tiger data or
> the import process, but it looks like something needs to be done to
> remove the redundant data. 
> 
> 	Jon
> 

Today I tried devising an enhanced osm2pgsql.c which would exclude
duplicate ways while generating the SQL. I've got something which seems
to work and indicates that around 60% of all nodes and ways in the
planet-061112 are duplicates. 

Once the duplicate entries are removed, the number of rows in planet_osm
drops from 3.6 million to 1.5 million, which should improve the Mapnik
rendering time.

I don't want to provide a copy of the code just yet. I'd still like to
make some improvements to it, such as reducing the 2GB memory usage. I
also want to leave Mapnik running overnight on the database to make sure
the output still looks reasonable (and to measure the effect on
rendering times).


	Jon
