[OSM-dev] Duplicate data from Tiger import

Wed Nov 29 10:45:43 GMT 2006

Hi OSM,

(I've been away from the OSM mailing list for months now -- too little
signal-to-noise, in my never humble opinion.)  Over that same time,
I've been keeping an eye on the TIGER import, handling county
prioritization requests, etc.

Unfortunately, the last rewrite I did of the TIGER import code had some
severe bugs in its disaster recovery code.  (I'm really the only set of
eyes looking at this code, so bugs are inevitable.)  These bugs have
caused the massive way duplication in the OSM database.  Specifically,
since May 09 the TIGER import process has timed out over 14 thousand
times, and some proportion of those time-outs have caused dupes.
Time-outs ("server is down") are to be expected as a project evolves,
but I quote the number to give you folks an idea of the *need* for
disaster recovery in any long time-scale OSM data-mucking process.

(For those of you who think I've been silly to use the API rather than
directly hitting MySQL, note that the problems from these 14k time-outs
would have been *worse* against the raw database.)

Over this same time period, the server on which the TIGER import is
running has had persistent connectivity problems.  Obviously I cannot
complain, since the server has been generously donated for the cause.
But this has made debugging problematic, to say the least.

In the meantime, my home computing situation has stabilized -- I now have
a flat with decent DSL and a hip Linux box.  So the server hosting
problems should go away.

In the next few days I intend to rewrite the disaster recovery process
(again) to include a database that robustly records exactly which of the
millions of TIGER ways have been imported.  This will also make
reporting on the TIGER import process much easier.  ("How far along is
my favorite county?")
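
To make the idea concrete, here is a rough sketch of the kind of ledger I
have in mind.  It is Python with SQLite purely for illustration (the real
import code is Ruby), and the table name, the column names and the
upload() callback are all made up for this example.  Each way is marked
'sent' before the API call and 'done' only after it returns, so a
time-out leaves a visible 'sent' row that has to be checked against the
server instead of being blindly re-uploaded.

```python
import sqlite3


def open_ledger(path=":memory:"):
    """Open (or create) the ledger of TIGER ways and their import state."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS tiger_way (
               tlid   INTEGER PRIMARY KEY,              -- hypothetical TIGER line id
               county TEXT NOT NULL,
               status TEXT NOT NULL DEFAULT 'pending',  -- pending | sent | done
               osm_id INTEGER                           -- id returned by the upload
           )"""
    )
    return db


def queue_way(db, tlid, county):
    """Register a way for import; queueing the same way twice is a no-op."""
    with db:
        db.execute(
            "INSERT OR IGNORE INTO tiger_way (tlid, county) VALUES (?, ?)",
            (tlid, county),
        )


def import_way(db, tlid, upload):
    """Upload one way unless the ledger already records it as done.

    `upload` is a hypothetical callable that sends the way and returns
    the new OSM id; it may raise on a time-out.
    """
    row = db.execute(
        "SELECT status FROM tiger_way WHERE tlid = ?", (tlid,)
    ).fetchone()
    if row is None:
        raise KeyError(tlid)
    if row[0] == "done":
        return False  # already imported, never send it again
    with db:
        db.execute("UPDATE tiger_way SET status = 'sent' WHERE tlid = ?", (tlid,))
    osm_id = upload(tlid)  # if this raises, the row stays 'sent'
    with db:
        db.execute(
            "UPDATE tiger_way SET status = 'done', osm_id = ? WHERE tlid = ?",
            (osm_id, tlid),
        )
    return True


def uncertain_ways(db):
    """Ways whose upload was started but never confirmed."""
    rows = db.execute(
        "SELECT tlid FROM tiger_way WHERE status = 'sent' ORDER BY tlid"
    )
    return [tlid for (tlid,) in rows]


def county_progress(db, county):
    """Return (ways done, ways total) for one county."""
    return db.execute(
        "SELECT COALESCE(SUM(status = 'done'), 0), COUNT(*) "
        "FROM tiger_way WHERE county = ?",
        (county,),
    ).fetchone()
```

With something like this, county_progress() answers the "how far along"
question directly, and uncertain_ways() lists exactly the ways that need
a look on the server before anything is retried.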

Those of you (Russ, Blars, etc.) who have requested prioritization of
specific US counties will still be first on the list.

Steve -- please drop everything done by the "ben_tiger" user, as I'll be
re-doing the import as "ben_tiger101".

If anyone would like to help writing Ruby for the TIGER import, please
contact me off-list.

		Thanks,
		Ben

On Tue, 28 Nov 06 @08:01pm, Jon Burgess wrote:
> On Tue, 2006-11-28 at 09:48 +0000, SteveC wrote:
> > * @ 28/11/06 01:21:03 AM jburgess at uklinux.net wrote:
> > > On Sun, 2006-11-26 at 22:24 +0000, Jon Burgess wrote:
> > > > I noticed that Mapnik was taking much longer than normal to process some
> > > > areas of the US and have found that there are instances where the same
> > > > data is duplicated over 100 times. For example see:
> > > > 
> > > > http://www.openstreetmap.org/api/0.3/map?bbox=-84.316874,39.16047,-84.315683,39.161368
> > > > 
> > > > If this is displayed in JOSM there are only 5 distinct nodes and yet the
> > > > raw XML shows that each of the nodes, segments and ways is duplicated
> > > > 102 times. 
> > > > 
> > > > 
> > > > I don't know whether this is a problem with the original tiger data or
> > > > the import process, but it looks like something needs to be done to
> > > > remove the redundant data. 
> > > > 
> > > > 	Jon
> > > > 
> > > 
> > > Today I tried devising an enhanced osm2pgsql.c which would exclude
> > > duplicate ways while generating the SQL. I've got something which seems
> > > to work and indicates that around 60% of all nodes and ways in the
> > > planet-061112 are duplicates. 
> > 
> > How do you define a dupe?
> 
> The program filters things which are interesting to Mapnik. Items are
> considered duplicates as follows:
> 
> 
> Nodes: identical lat & lon only (other attributes and tags are ignored).
> 
> 
> Segments: identical to & from, OR - to/from are themselves duplicates of
> other nodes and there is another segment between the duplicate nodes, 
> e.g.
>   nodeA(0,0), nodeB(1,1), nodeC(0,0), nodeD(1,1)
>   segmentP(nodeA, nodeB), segmentQ(nodeC, nodeD)
> 
> Nodes A/C and B/D are dupes. Segments P/Q are also dupes because of the
> duplicate nodes.
> 
> 
> Ways: identical list of segments & certain tags are identical (the
> exportTags used by mapnik: highway, name etc). The order of the segments
> does not matter. Following a similar process to that used by segments,
> ways are duplicates if the segments are the same or are duplicates of
> one another.
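
Read literally, the duplicate rules Jon gives above could be sketched as
follows.  This is Python for illustration only, not the osm2pgsql.c code,
and the function names and dict layouts are invented for the example.
The tag list is limited to the two tags named above ("highway, name
etc"), since the full exportTags list is not given here.  Each function
maps an id to the first-seen id it duplicates.

```python
# Only the two tags named in the thread; the real exportTags list is longer.
EXPORT_TAGS = ("highway", "name")


def canonical_nodes(nodes):
    """nodes: {node_id: (lat, lon)}.  Nodes with identical lat/lon are dupes."""
    first_at = {}
    canon = {}
    for node_id, latlon in nodes.items():
        canon[node_id] = first_at.setdefault(latlon, node_id)
    return canon


def canonical_segments(segments, node_canon):
    """segments: {seg_id: (from_node, to_node)}.

    Two segments are dupes if their from/to nodes are the same, or are
    duplicates of one another.  Direction is kept, as in "identical to & from".
    """
    first_with = {}
    canon = {}
    for seg_id, (frm, to) in segments.items():
        key = (node_canon[frm], node_canon[to])
        canon[seg_id] = first_with.setdefault(key, seg_id)
    return canon


def canonical_ways(ways, seg_canon):
    """ways: {way_id: (list_of_seg_ids, tags_dict)}.

    Two ways are dupes if they hold the same (or duplicate) segments in any
    order and agree on the export tags.  A set is used, so a segment
    repeated within one way counts once; that is a simplification.
    """
    first_with = {}
    canon = {}
    for way_id, (seg_ids, tags) in ways.items():
        key = (
            frozenset(seg_canon[s] for s in seg_ids),
            tuple(tags.get(t) for t in EXPORT_TAGS),
        )
        canon[way_id] = first_with.setdefault(key, way_id)
    return canon


# Jon's example: nodes A/C and B/D coincide, so segment Q duplicates P.
nodes = {"A": (0, 0), "B": (1, 1), "C": (0, 0), "D": (1, 1)}
segments = {"P": ("A", "B"), "Q": ("C", "D")}
node_canon = canonical_nodes(nodes)
seg_canon = canonical_segments(segments, node_canon)
```

Anything whose canonical id differs from its own id is a duplicate and
can be skipped when generating the SQL.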
> 
> 
> 
> 
> > Where are these things, are they in the US (eg the TIGER import) or
> > somewhere else?
> > 
> 
> Don't know yet. I'll need to add some more stats counters to match tiger
> data.
> 
> 	Jon