[Talk-ca] duplicate address data

Mon Mar 30 12:31:53 UTC 2015

Hi Daniel,

thanks for the input. It helps me to understand some of the reasons for the problems I found.
Also thanks for checking the proposed algorithm.

I am still working on the code in the housenumber2 branch for mkgmap and I want to finish this first.
I probably won't find time to do much more coding before the end of summer,
so I hope that I've inspired someone to start cleaning up.
If not, maybe I'll find the time in a few months.

Gerd

From: jfd553 at hotmail.com
To: gpetermann_muenchen at hotmail.com
Subject: RE: [Talk-ca] duplicate address data
Date: Mon, 30 Mar 2015 07:03:40 -0400

Bonjour Gerd,

I used to work for Natural Resources Canada (NRCan), which produced the Canvec files (note 1). I am actually the guy who made the conversion from the government format to the .osm map format. The objective was to provide 50K topographic map data to the community in OSM format, without modifications to the original data (if possible).

Reading your emails, I understand there are three problems mixed together: initial addr interpolation, multiple/bad imports, and inconsistencies between OSM and governmental data.

Initial addr interpolation:
The interpolation lines and addresses were created from the governmental street network available at the time of conversion. There were slight changes in the algorithm used to create address interpolation between the different versions of the Canvec product; however, most of them should look similar. Errors in the original data were discovered when producing the interpolation but could not be repaired (such as road segments only a few meters long, bad addressing schemes, etc.). Such errors were exceptions, not the norm. Addresses were available only for the first/last coordinates of the original line segments, whatever the length of the segment. Sometimes this results in an address interpolation line with the same address on both ends; sometimes you will find hundreds of potential addresses between both ends. It might be helpful to know that the distance between the interpolation lines and the original street network was set to 20 m for tertiary to motorway, and 15 m for lower highway classes. This sometimes produced strange artefacts.

Multiple/bad imports:
The Canadian OSM community asked to be able to import Canvec data by layer (i.e. only the street or waterway network rather than the whole file), which explains the Canvec data model and the way contributors imported their data. Some contributors imported data layers without considering existing OSM content, which often included previously imported Canvec data. This created a lot of duplicated objects, as you have found out! In areas where the street network was well developed, some contributors imported only the address interpolation layer, which creates the third problem.

Inconsistencies in resulting OSM data:
There are inconsistencies between OSM and governmental data!-) The data model of the governmental street network differs from the OSM data model. I had to convert the data to mimic the Karlsruhe Schema. When only the address interpolation layer was imported, the geometry of the street network does not necessarily fit the geometry of the address interpolation lines. As a result, street segments may cross address interpolation lines or may lie outside the interpolation lines of that street. Street names may then differ from the street names in the address nodes. From my experience, there is no way to know which one is the actual road name.

The algorithm you proposed seems right, even though I am not sure that looking for Canvec in the source would help (point 8).

Hope it will help.

Daniel

Notes (1): Some documentation you may have already read, even if the addressing schema is not documented:
http://wiki.openstreetmap.org/wiki/CanVec
http://wiki.openstreetmap.org/wiki/CanVec:_Geometric_Model
http://wiki.openstreetmap.org/wiki/CanVec:_Transportation_(TR)

From: Gerd Petermann [mailto:gpetermann_muenchen at hotmail.com]
Sent: March-29-15 01:44
To: talk-ca at openstreetmap.org
Subject: Re: [Talk-ca] duplicate address data

Hi Stewart,

>> I don't care much about special cases.
>
>I'd say that rural addressing is between 10-20% of addresses in Ontario.
>Far from a special case.

OK. I understand that this is a problem, I just don't care about it because
I can't solve it with my knowledge.

>
>> I wanted to point out that the OSM data base for Canada contains a
>> huge amount of
>> - useless data like duplicated addr:interpolation ways including nodes
>> from different imports
>>  which IMHO should be removed ASAP
>
>Yes, I agree that there are some errors, but we can't guarantee that the
>Canvec 10 data will be much better, or that some of the older data is
>bad just because of its version. Imports work really badly in Canada, as
>our source data isn't wonderful and we don't have enough folks on the
>ground to verify.

Let's start with the simple problem first.
I don't want to replace data, I just want to remove completely obsolete
data. I don't know what the best way to do that is.
I can code a small program which scans a download from Geofabrik
with rules like this:
1) select nodes which are referenced as first or last node
in addr:interpolation ways
and which are not referenced by any other way or relation,
2) of those nodes find the ones with equal (or almost equal) coordinates and
equal tags except source=*, mark them
4) select such a pair of equal nodes, let's call them n1 and n2
5) select the addr:interpolation ways that have such marked nodes,
let's call them w1 and w2.
6) make sure that w1 and w2 have no common node
7) make sure that w1 and w2 end with another pair of marked nodes
8) if both ways have a source tag containing "CanVec", select the one
with the older version, let's call it w_older
9) make sure that none of the nodes referenced by w_older
is referenced by another way or relation
10) remove w_older and all its nodes
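
To make the rules above more concrete, here is a minimal Python sketch of them working on data that is already loaded into memory. Several things in it are assumptions and not part of the proposal: the dict layout for nodes and ways, the function names, rounding coordinates to 6 decimals as a stand-in for "almost equal", and reading "older version" as the lower number found in a source tag shaped like "CanVec 6.0 - NRCan". Pairs whose version cannot be compared are simply left alone.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations


def node_key(node, precision=6):
    """Rounded position plus all tags except source=* (rule 2)."""
    tags = tuple(sorted((k, v) for k, v in node["tags"].items() if k != "source"))
    return (round(node["lat"], precision), round(node["lon"], precision), tags)


def canvec_version(way):
    """Assumed tag format: source='CanVec 6.0 - NRCan' -> 6.0, else None."""
    m = re.search(r"CanVec\s*(\d+(?:\.\d+)?)", way["tags"].get("source", ""))
    return float(m.group(1)) if m else None


def find_obsolete(nodes, ways, relation_refs=frozenset()):
    """Return (way ids, node ids) that the rules would remove.

    nodes: {id: {"lat": float, "lon": float, "tags": dict}}
    ways:  {id: {"refs": [node ids], "tags": dict}}
    relation_refs: ids of nodes that are members of any relation
    """
    # Number of distinct ways that reference each node.
    uses = Counter(ref for w in ways.values() for ref in set(w["refs"]))

    def private(ref):
        return uses[ref] == 1 and ref not in relation_refs

    # Rules 1, 2, 4, 5, 7: group interpolation ways whose two end nodes
    # are private and match another way's end nodes in position and tags.
    groups = defaultdict(list)
    for wid, way in ways.items():
        refs = way["refs"]
        if "addr:interpolation" not in way["tags"] or len(refs) < 2:
            continue
        ends = (refs[0], refs[-1])
        if all(private(r) for r in ends):
            signature = tuple(sorted(node_key(nodes[r]) for r in ends))
            groups[signature].append(wid)

    del_ways, del_nodes = set(), set()
    for wids in groups.values():
        for a, b in combinations(sorted(wids), 2):
            if a in del_ways or b in del_ways:
                continue
            if set(ways[a]["refs"]) & set(ways[b]["refs"]):    # rule 6
                continue
            va, vb = canvec_version(ways[a]), canvec_version(ways[b])
            if va is None or vb is None or va == vb:           # rule 8
                continue
            older = a if va < vb else b
            if all(private(r) for r in ways[older]["refs"]):   # rule 9
                del_ways.add(older)                            # rule 10
                del_nodes.update(ways[older]["refs"])
    return del_ways, del_nodes
```

Whether the CanVec version in the source tag is a good criterion at all is an open question, so that part is kept in its own small function and is easy to swap for something else, for example the object timestamp.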

I think we will find thousands of ways.
I have no idea how bots work on the OSM database, but I think
this would be a task for one.
If I wrote such a program, it would produce an *.osm file
containing a lot of rows like this (or whatever is needed to delete the ways and nodes)

<?xml version='1.0' encoding='UTF-8'?>
<osm version='0.6' upload='true' generator='CanVec-Cleaner'>
  <bounds minlat='45.4333348' minlon='-76.3457702' maxlat='45.4351546' maxlon='-76.3437317' origin='CGImap 0.3.3 (11726 thorn-01.openstreetmap.org)' />
  <node id='972298820' action='delete' timestamp='2010-10-31T15:13:05Z' uid='186592' user='Johnwhelan' visible='true' version='1' changeset='6240358' lat='45.4338469' lon='-76.3437594'>  </node>
  <node id='972299268' action='delete' timestamp='2010-10-31T15:13:25Z' uid='186592' user='Johnwhelan' visible='true' version='1' changeset='6240358' lat='45.4346425' lon='-76.3457425'> </node>
  <way id='83504524' action='delete' timestamp='2010-10-31T15:19:40Z' uid='186592' user='Johnwhelan' visible='true' version='1' changeset='6240358'> </way>
</osm>
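
As a rough illustration only, a file shaped like the sample above could be written with Python's standard xml.etree.ElementTree module. The function name and the dict layout of the input are hypothetical, only a subset of the attributes from the sample is emitted, and I have not checked what an editor or upload tool really needs in order to accept the deletions.

```python
import xml.etree.ElementTree as ET


def build_delete_osm(nodes, ways, generator="CanVec-Cleaner"):
    """Build an .osm document that marks the given objects with action='delete'.

    nodes: iterable of {"id", "version", "lat", "lon"}  (hypothetical layout)
    ways:  iterable of {"id", "version"}                (hypothetical layout)
    """
    root = ET.Element("osm", version="0.6", upload="true", generator=generator)
    for n in nodes:
        ET.SubElement(root, "node", id=str(n["id"]), action="delete",
                      version=str(n["version"]),
                      lat=str(n["lat"]), lon=str(n["lon"]))
    for w in ways:
        ET.SubElement(root, "way", id=str(w["id"]), action="delete",
                      version=str(w["version"]))
    return ET.tostring(root, encoding="unicode")
```

Calling print(build_delete_osm(nodes, ways)) would give one element per object to delete; the XML declaration and the bounds element from the sample are left out of this sketch.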

>
>> - wrong data like
>> …
>>  *  addr:interpolation ways with nodes that refer to a different street
>
>Is there a way to make interpolation names change if the street name is
>edited/corrected? Unless this happens, I see these errors as inevitable.
I see no easy way to automate that. The problem is that you can't say for
sure that the road has the right name and all addr:interpolation nodes are wrong.
I guess one could try to analyse the changesets, but I have no knowledge here.

Gerd
