[Talk-us] talk-us-ma: Duplicate nodes in mass

Greg Troxel gdt at ir.bbn.com
Fri Aug 21 14:04:48 BST 2009


Frederik Ramm <frederik at remote.org> writes:

> There are roughly 7.1 million nodes in Massachusetts, and 270k of them 
> share the same location as another node. This is just an analysis based 
> on location, not on tags, but it can be assumed that most of those 270k 
> nodes are not intentionally duplicate. It is possible that there are 
> duplicate ways as well. But it is not a big problem, it is something 
> that could be fixed in a day.
>
> I could help with this but I would need very clear instructions what to 
> look for, and what to do. Merging nodes may lead to duplicate ways that 
> share exactly the same nodes (these can probably be removed 
> automatically), but there might also be situations where you have one 
> way that uses the nodes A, B1, C1 and one way that uses B2, C2, D (with 
> B1 being at the same location as B2, and C1 at the same location as C2), 
> so after merging nodes you'd then end up with the non-identical ways 
> A,B,C and B,C,D... all this should be considered beforehand.

I have looked a bit more and have a proposal for an automated edit.  I
am trying to have this be as narrow as possible while still making
progress.  My proposal below intends to join up roads that were cleaved
at town borders.

(There is another source of duplicate nodes, which is the open-space
database polygons.  These duplicate nodes are not problematic, partly
because one doesn't route on open space polygons, and partly because
each is in its own way, and those ways merely happen to touch.  So maybe they
should be merged at some point, but it's far less important.)

First, an example:

  http://www.openstreetmap.org/?node=70786569

At this location is also node 66355413.  This is the border of Stow and
Maynard.  In this case the road name and width change between the two
ways (which matches reality).

Each way just ends at the town border, and there is a pair of coincident
nodes.  Each node of this duplicated pair is the last node in a way, and
is in only one way.

----------------------------------------
Find the set of duplicated nodes D, where each element d is a set of
nodes at the same location.

foreach d in D (CONTINUE starts back on the next d, even if nested)

  if the number of nodes in d > 2 CONTINUE

  foreach n in d

    if n has tags other than "attribution" or "source" CONTINUE

    if n.attribution does not match
      "Office of Geographic and Environmental Information (MassGIS)"
      CONTINUE

    if n.source does not regexp-match "^massgis_import_v0\.1_[0-9]*" CONTINUE

    if n is not in exactly one way CONTINUE

    if n is not the end node in the way CONTINUE

  MERGE the two nodes in d, picking the value of source from the
  lower-numbered node.
----------------------------------------
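
For anyone who wants to dry-run the selection logic above on a local
extract, here is a rough Python sketch.  It is only an illustration under
assumptions, not a tested bot: the in-memory layout (a dict of nodes with
"lat", "lon" and "tags", and a dict of ways as lists of node ids) and the
names find_mergeable_pairs and is_eligible are made up for this example.
It assumes coincident nodes carry identical coordinate values, treats
"attribution does not match" as an exact string comparison, and
conservatively requires a node to occur exactly once across all ways.  It
only reports candidate pairs, lower-numbered node first; it does not
perform the merge.

```python
import re
from collections import defaultdict

ATTRIBUTION = "Office of Geographic and Environmental Information (MassGIS)"
SOURCE_RE = re.compile(r"^massgis_import_v0\.1_[0-9]*")
ALLOWED_KEYS = {"attribution", "source"}


def is_eligible(node_id, nodes, occurrences, ways):
    """Apply the per-node checks from the pseudocode to one node."""
    tags = nodes[node_id]["tags"]
    # No tags other than "attribution" or "source".
    if not set(tags) <= ALLOWED_KEYS:
        return False
    # Assumption: "match" means exact string equality.
    if tags.get("attribution") != ATTRIBUTION:
        return False
    if not SOURCE_RE.match(tags.get("source", "")):
        return False
    # Assumption: exactly one occurrence across all ways (so a node that
    # closes a loop in a single way is skipped as well).
    hits = occurrences.get(node_id, [])
    if len(hits) != 1:
        return False
    way_id, index = hits[0]
    # The node must be the first or last node of that way.
    return index in (0, len(ways[way_id]) - 1)


def find_mergeable_pairs(nodes, ways):
    """Return sorted (lower_id, higher_id) pairs that pass every check."""
    # Build D: group node ids by identical coordinates.
    by_location = defaultdict(list)
    for node_id, node in nodes.items():
        by_location[(node["lat"], node["lon"])].append(node_id)

    # Record where each node is used: node id -> [(way id, position), ...].
    occurrences = defaultdict(list)
    for way_id, refs in ways.items():
        for index, ref in enumerate(refs):
            occurrences[ref].append((way_id, index))

    pairs = []
    for group in by_location.values():
        # Only locations with exactly two coincident nodes are considered.
        if len(group) != 2:
            continue
        if all(is_eligible(n, nodes, occurrences, ways) for n in group):
            pairs.append((min(group), max(group)))
    return sorted(pairs)
```

Counting the length of the returned list would give the number of
locations this proposal would touch, which is the figure I am curious
about.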

I am quite confident this won't do anything harmful, and it would be
very interesting to see how many of the 270k duplicate nodes (presumably
135k or fewer locations) go away from this.

Comments/analysis very welcome.  You can easily find these by opening up
part of Massachusetts in JOSM and selecting long ways.  They typically end
at town boundaries.  Another node pair is 62178385 and 73732046.