[Imports-us] Vermont, U.S. address import
Greg Troxel
gdt at lexort.com
Thu Oct 6 12:13:04 UTC 2022
Jared <osm at wuntu.org> writes:
> In regards to Greg's concerns:
>
> Concern 1: How will we deal with existing addresses and conflation?
> When I originally investigated importing this data last year, I quickly
> discovered the complexities involved in dealing with existing OSM data. I
> had brief conversations with the people that worked on the Maine and New
> York imports and looked at their import scripts.
> As a result, this current proposal for Vermont is much more manual. There
> are quite a few towns that have very few existing OSM addresses. I just did
> a quick query and found that there are 99 towns that have less than 100
> existing address entries in OSM. For these "low hanging fruit" I intend to
> generate a list of addresses to import, and manually remove the existing
> OSM addresses. I'll do this through visual inspection of the text, and by
> observing the data points in JOSM. This is the process I went through for
> the sample data file I included on the wiki page.
>
> As I go through the process of manually removing duplicates, I expect to
> learn patterns of identifying them, and can determine if there is a way for
> me to make this process more efficient while maintaining a high level of
> accuracy. I can update the import wiki page with further details and
> return to the import list for further feedback and approval. Otherwise, I
> would just use a manual process.
I think it's best to do this programmatically, but if you are doing it
manually for towns with < 100 existing addresses I don't object.
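If/when you do go programmatic, the core of it is a plain anti-join once
both datasets are loaded into PostGIS. A minimal sketch, with
hypothetical table and column names (vcgi_addr for the import file,
osm_addr for an extract of existing OSM addresses, both with lon/lat
geometries; adjust to whatever your loader produces):

  -- VCGI rows with no matching address already in OSM: the candidate
  -- upload set. Rows that do match get dropped, or flagged for manual
  -- review.
  SELECT s.*
  FROM vcgi_addr s
  LEFT JOIN osm_addr o
    ON  lower(o.street) = lower(s.street)
    AND o.housenumber   = s.housenumber
    AND ST_DWithin(o.geom::geography, s.geom::geography, 100)
  WHERE o.osm_id IS NULL;

The 100 m guard keeps same-named streets in other towns from matching;
tune it as you learn what the data does.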
> Concern 2: Should the "ref:vcgi:esiteid" tag be included or not?
> While not a silver bullet, I find having this unique key that connects
> a node back to the origin database helpful for building confidence when
> evaluating whether an address exists in OSM or not. If I find a node in
> OSM that has this unique esiteid, I can be confident that it already
> exists, and I can remove it from my list of items that need manual
> consideration. I personally find it helpful, and don't find it obtrusive,
> but if there are prior discussions that you can point me to, I'd be
> interested in learning more.
I don't have handy links, but my impression from reading the import list
for years is that it is broadly agreed that foreign keys don't belong.
When you are doing a new import/conflation (say in 2 years when VT
releases an update), you have to actually conflate and check. Just
because something has a key doesn't mean you can overwrite it. Some
human may have modified the data to fix it. The only automatic
overwrite that's ok is to check that the address data on a node matches
exactly the data that was imported, and that the import source is now
different.
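To make that concrete, a sketch of the only safe mechanical case, with
hypothetical table names: vcgi_2022 is what was imported, vcgi_2024 is a
future release (note the esiteid key lives in the source tables, where
it belongs, not in OSM), and osm_addr is a current extract:

  -- Nodes whose tags still exactly match the 2022 import, where the
  -- 2024 release now differs: safe to update mechanically.
  -- Everything else needs a human.
  SELECT o.osm_id, new.housenumber, new.street
  FROM osm_addr o
  JOIN vcgi_2022 old
    ON  o.housenumber = old.housenumber
    AND o.street      = old.street
    AND ST_DWithin(o.geom::geography, old.geom::geography, 5)
  JOIN vcgi_2024 new ON new.esiteid = old.esiteid
  WHERE (new.housenumber, new.street)
        IS DISTINCT FROM (old.housenumber, old.street);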
You are going to have to deal with matching addresses between the import
source and OSM programmatically, as in #1 above, once you move beyond
non-addressed towns. Once you do that, the ref won't help, as it won't
be 100% reliable. Therefore it's noise.
> Concern 3: Should the "source:VCGI/E911_address_points" be included on a
> node? Or only in a changeset comment?
> If you have links to further docs/discussions about this, I'd like to make
> sure I understand the current best practices. I agree that adding a source
> to the changeset tag makes more sense. I don't fully understand
> the implications for future updates to imported nodes.
> I have updated the import proposal wiki page by removing the source tag
> from the individual node, and adding it to the changeset tag.
This seems really well established on this list, though I don't know
where it's written down. Having vast numbers of source keys on points is
just noise, and they won't get reliably removed when the data is edited.
And there isn't anything really useful about it. Future conflation needs
to look at history to be sure the data remains exactly what was
imported; the source tag doesn't prove that the current data matches.
(For example: what if I change 5 to 7 on a house number because the
import was wrong, and don't remove the source tag, because a) I don't
really understand it and b) the rest of the fields still came from
there?) And if you are just conflating as in 'find addresses in the
dataset that aren't in OSM', then it doesn't help either.
Anyone who thinks consensus includes source tags on nodes should speak
up. I posit that almost no one who has been on the import list for a
year thinks that.
> Concern 4: The "Conflation" section of the proposal is vague, and makes it
> sound like the project could morph in potentially dangerous ways without
> approval.
> I've updated the section to read:
> **
> "For the scope of this particular import project, conflation will be
> avoided/skipped. Any preexisting addresses will be left as-is. New
> addresses will be imported as standalone nodes (not conflated with existing
> building outlines).
>
> If addresses need to be conflated, they will be dealt with in an update to
> this project, or as part of a separate project, either of which will get
> reviewed and approved."
> **
> Let me know if it still needs further clarification. Basically, my
> philosophy is to deal with the easy parts now, and anything that is more
> complicated will be dealt with in a future project.
I don't think it's ok for an import to add features that are duplicates
of existing data. You seem to agree, given your plan for towns with
fewer than 100 existing addresses.
Also conflation is itself a funny term as there are two separate issues:
1) Find subset of source dataset that is not already in OSM. Generate
OSM-format file that would be uploaded.
1A) Like (1), but find objects like house outlines and add tags to
those instead (see the sketch after this list).
2) For items in the source dataset that are already in OSM, figure out
what to do. There can be a more complicated merge where you add
some fields to partial matches.
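A sketch of (1A), again with hypothetical names, assuming an
osm_buildings polygon extract in the same SRID as the source points:

  -- Buildings containing exactly one imported address point: candidates
  -- for carrying the address tags directly on the outline.
  SELECT b.osm_id,
         min(s.housenumber) AS housenumber,
         min(s.street)      AS street
  FROM osm_buildings b
  JOIN vcgi_addr s ON ST_Contains(b.geom, s.geom)
  GROUP BY b.osm_id
  HAVING count(*) = 1;

Buildings with two or more points inside go back on the manual pile.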
I think you are saying: "importing will be nodes only, avoiding building
conflation. In this stage, addresses are only imported for towns with
<100 existing address points, and those will be manually removed from
the upload file. Thus no duplicate data will be introduced."
If so, that's ok, but I think it's good to spell things out extra
clearly, so it's evident that the things which shouldn't happen won't.
> Concern 5: Have you evaluated whether there are points in the database
> with the same location, what you are going to do about that, and why?
> I have not done an exhaustive search, but in the 55,000ish addresses I've
> added manually so far, I don't recall this being an issue with the VCGI
> data. But I've primarily focused my efforts on rural and
> residential areas where the vast majority of addresses are for single
> family dwellings, or occasionally a duplex with two distinct addresses. Let
> me know if you have suggestions about how to identify these. Is searching
> for points that share the same exact lat/long adequate? Are you aware of a
> script that already does this?
Basically you should load this into PostGIS; there are queries you can
write to find points that are very close to each other. Something like
(reusing the hypothetical vcgi_addr table from above, with esiteid as
the key and a lon/lat geometry column):

  SELECT a.esiteid, b.esiteid FROM vcgi_addr a, vcgi_addr b WHERE a.esiteid < b.esiteid
  AND ST_Distance(a.geom::geography, b.geom::geography) < 2;

to find pairs of points within 2 m of each other (the ::geography cast
makes the units meters).
In MA, we found tons of stacked points for multi-family dwellings. I
would expect the same in other states (far fewer in number in VT, I
agree).
One thing that could be done is to combine them into one OSM object
that has
unit=1;2;3;4
though I don't remember the addressing scheme consensus on that.
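For what it's worth, finding and collapsing the stacks is easy once the
data is loaded; a sketch, assuming the hypothetical vcgi_addr table
carries a unit column:

  -- Stacked points: identical coordinates and street address, differing
  -- only by unit. Emits one row per stack with 1;2;3;4-style values.
  SELECT housenumber, street,
         string_agg(unit::text, ';' ORDER BY unit) AS units
  FROM vcgi_addr
  GROUP BY geom, housenumber, street
  HAVING count(*) > 1;

Whether that single combined node is the right OSM representation is
that same consensus question.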
My point really is that you should be super clear on whether this is an
issue. I think it's fine to skip importing points that are hard; your
low-hanging fruit idea is fine, and you'll learn a lot and can then do a
2nd round.