[Imports-us] Fwd: Vermont, U.S. address import
Adam Franco
adamfranco at gmail.com
Wed Oct 12 15:58:28 UTC 2022
Hi all,
I've done some more work reviewing this import and have come up with a few
issues <https://github.com/JaredOSM/vermont-address-import/issues> that
warrant continuing to take a slow and measured approach with this
import. In particular:
- ZIP codes from E911 may be incorrect #7
<https://github.com/JaredOSM/vermont-address-import/issues/7> (see the
issue for some fun Postal Service history).
I've also put together a draft set of scripts (Add conflation scripts to
compare import to existing OSM addresses #6
<https://github.com/JaredOSM/vermont-address-import/pull/6>) that download
existing addresses from OSM, load them into a local SQLite database, and
conflate the import data with the existing data, dividing it into buckets:
"missing from OSM", "exact matches", "exact tag matches that are displaced
or otherwise should be manually reviewed", and fuzzy matches with "tag
conflicts". Fuzzy matches are made by normalizing street names and house
numbers to lower-case alphanumeric strings, and by looking for
non-matching points within 5 m of each other (this could be improved
further). These conflation scripts are *very* much a proof of concept/work
in progress, and I'm not sure the buckets are quite right to facilitate
easy manual review of conflicts. They are also pretty hacked together and
could probably be refactored into a better tool-chain that could be used
more widely. I think this gets at a few of Greg's concerns. The foreign
key wasn't needed. ;-)
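
To make the fuzzy matching concrete, the normalization is conceptually
along these lines (a simplified PHP sketch, not the actual scripts; the
function names are mine):

    <?php
    // Lower-case and keep only letters and digits, so that
    // "Main St." and "MAIN ST" both normalize to "mainst".
    function normalize(string $value): string {
        return preg_replace('/[^a-z0-9]/', '', strtolower($value));
    }

    // Great-circle distance in meters (haversine).
    function distanceMeters(float $lat1, float $lon1,
                            float $lat2, float $lon2): float {
        $dLat = deg2rad($lat2 - $lat1);
        $dLon = deg2rad($lon2 - $lon1);
        $a = sin($dLat / 2) ** 2
           + cos(deg2rad($lat1)) * cos(deg2rad($lat2)) * sin($dLon / 2) ** 2;
        return 2 * 6371000 * asin(sqrt($a));
    }

    // Candidate pairs fuzzy-match when the normalized street and house
    // number agree and the points sit within 5 m of each other.
    function fuzzyMatch(array $import, array $osm): bool {
        return normalize($import['street']) === normalize($osm['street'])
            && normalize($import['housenumber']) === normalize($osm['housenumber'])
            && distanceMeters($import['lat'], $import['lon'],
                              $osm['lat'], $osm['lon']) < 5;
    }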
As a side note, I ended up manually editing almost all of the addresses in
Burlington, expanding their addr:street suffixes (St, Rd, Ave, etc.) road
by road. It seems these may have been imported previously, because they
had uniform casing errors and unexpanded suffixes across the city.
Hopefully this will make Burlington a better conflation test going
forward.
I'll be traveling for the next week and won't be able to do much work on
cleaning up the conflation scripts or further validation for a bit, but I
wanted to get this out there for review and discussion in the meantime.
Best,
Adam
On Sun, Oct 9, 2022 at 11:29 PM Adam Franco <adamfranco at gmail.com> wrote:
> Hi Jared, I've submitted PR #2
> <https://github.com/JaredOSM/vermont-address-import/pull/2> in which I've
> updated your initial generation script to take command-line arguments so
> that it can be run against arbitrary input files, and added a second
> script `generate_all.php` which runs the script for all town input
> files and writes the results back to the draft folder.
>
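> (For anyone following along, the structure of `generate_all.php` is
> roughly the following -- a sketch, with the per-town script name being
> illustrative:)
>
>     <?php
>     // Run the per-town generation script against every town input file.
>     // "generate_import.php" stands in for the actual per-town script.
>     foreach (glob('town_e911_address_points/*.geojson') as $input) {
>         $output = 'data_files_to_import/draft/'
>             . basename($input, '.geojson') . '.osm';
>         passthru(sprintf('php generate_import.php %s %s',
>             escapeshellarg($input), escapeshellarg($output)));
>     }
>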
> These changes should make it easy to re-run the script as tweaks are
> made, which in turn makes it easier to track changes to the output and
> check for unintended consequences of the tweaks.
>
> I hope that over the next few days I will be able to more deeply review
> the addresses for additional towns that already have significant address
> coverage and look for further discrepancies with what is coming from E911.
> I'm giving myself until Wednesday to provide some more feedback. :-)
>
> On Sat, Oct 8, 2022 at 10:43 PM Jared <osm at wuntu.org> wrote:
>
>> Greg, your outline seems reasonable, but it is outside the scope of what
>> I'm looking to tackle at the moment. Let me know if you'd be interested
>> in working on a phase two, where a more sophisticated automated tool is
>> developed to handle the more complex towns.
>>
>> At the moment, I'd like to move forward with my mostly manual import of
>> the towns with fewer than 100 existing OSM addresses. I've now created
>> draft import files for 12 towns. See the data here:
>> https://github.com/JaredOSM/vermont-address-import/tree/main/data_files_to_import/draft
>> The script I have is doing a good job of expanding street names, and I
>> believe my manual review process is working.
>>
>> Those of you who have provided feedback, or anyone else: please let me
>> know if you have any remaining concerns about me proceeding with the
>> project outlined here:
>> https://wiki.openstreetmap.org/wiki/VCGI_E911_address_points_import
>>
>> Thanks,
>> Jared
>>
>>
>> On Sat, Oct 8, 2022 at 9:16 AM Greg Troxel <gdt at lexort.com> wrote:
>>
>>>
>>> Jared <osm at wuntu.org> writes:
>>>
>>> > Can you walk me through a real example so I can understand how you
>>> > would identify existing addresses?
>>> >
>>> > Let's take Addison, Vermont for example.
>>> >
>>> > The VCGI e911 dataset has 987 address points in Addison. Here's the
>>> > data file:
>>> > https://github.com/JaredOSM/vermont-address-import/blob/main/town_e911_address_points/e911_address_points_addison.geojson
>>> >
>>> > When I run an overpass query for all elements in Addison that have a
>>> > housenumber or street (https://overpass-turbo.eu/s/1mxX), I find that
>>> > there are already a total of 142 nodes and ways with address
>>> > information in OSM.
>>> >
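>>> > The query is roughly the following (the saved query behind that
>>> > link may differ slightly, e.g. in how the town area is selected):
>>> >
>>> >     [out:json][timeout:60];
>>> >     area["boundary"="administrative"]["admin_level"="8"]["name"="Addison"]->.a;
>>> >     (
>>> >       nwr["addr:housenumber"](area.a);
>>> >       nwr["addr:street"](area.a);
>>> >     );
>>> >     out tags center;
>>> >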
>>> > By looking at the overpass results, I can immediately see that 55 of
>>> > the existing OSM elements have a "ref:vcgi:esiteid" key/value pair.
>>> > Without any further queries, I have a high level of confidence that I
>>> > can remove all 55 address points from my import file, as they are not
>>> > even worth considering for an automated import. This seems like a
>>> > safe and efficient way of eliminating the chance of importing
>>> > duplicate data. Obviously the other data points need to be evaluated,
>>> > but why not remove the 55 for which I have high confidence?
>>>
>>> Were I doing this, I'd want to take each VCGI datapoint and sort it
>>> into one of:
>>>
>>> - address exists in OSM, all VCGI address fields are present and
>>> match, and location matches
>>> - address exists in OSM, not all VCGI address fields are present, and
>>> location matches
>>> - address exists in OSM, some VCGI fields do not match and location
>>> matches
>>> - address exists in OSM, location does not match (>= 5m?)
>>> - address does not exist in OSM, but a previous VCGI import added it,
>>> and then it was manually deleted (must not be re-added by an
>>> automated process! **)
>>> - address does not exist in OSM, and no OSM address point is within 10m
>>> - address does not exist in OSM, but there is an OSM address point
>>> within 10m (this is "OSM and VCGI disagree on the address of a
>>> location", or it might be "OSM has a building address and VCGI also
>>> has unit addresses", or it might be something we don't understand yet)
>>>
>>> at least. This requires looking at nodes and ways for address tags;
>>> the distances are probably not quite right, and may need to be bigger
>>> in rural areas and smaller in more urban places. Then, look at those
>>> bins, see what's in them, figure out if they are correctly sorted, and
>>> refine the rules and perhaps the categories. This is why I keep talking
>>> about programs, not using JOSM plugins. In my view, the processing
>>> method should not be constrained by what some existing tools do; it
>>> needs to adapt to the realities of the data.
>>>
>>> For the above categories, some lead to "no action". Some lead to "add
>>> fields to existing object". Some lead to "generate worklist for field
>>> verification", and perhaps to "report bug to VCGI".
>>>
>>> In the above, I'm not using a foreign key in OSM. If the data is
>>> present and matchable, great. If it doesn't match (but should have),
>>> you'll pick it up as "address point in OSM in same place but different
>>> content". But you need to pick that up for address points that
>>> *weren't* imported too, so the foreign key really doesn't help simplify
>>> the processing. And for things that really were imported before, the
>>> matching will succeed easily.
>>>
>>> ** For this, the import needs either to search history to identify
>>> things added by a previous import changeset and then removed, or to
>>> keep a record of what was imported and skip, in a new import, any
>>> records that were previously imported -- whether or not they are still
>>> present.
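>>>
>>> A minimal sketch of the record-keeping option (PHP and SQLite here
>>> just for concreteness; the table and variable names are made up):
>>>
>>>     <?php
>>>     $db = new PDO('sqlite:import_ledger.sqlite');
>>>     $db->exec('CREATE TABLE IF NOT EXISTS imported (esiteid TEXT PRIMARY KEY)');
>>>     $seen = $db->prepare('SELECT 1 FROM imported WHERE esiteid = ?');
>>>     $add  = $db->prepare('INSERT INTO imported (esiteid) VALUES (?)');
>>>     $importPoints = [/* parsed VCGI points: ['esiteid' => ..., ...] */];
>>>     foreach ($importPoints as $p) {
>>>         $seen->execute([$p['esiteid']]);
>>>         if ($seen->fetchColumn()) {
>>>             continue; // imported before: skip, even if since deleted in OSM
>>>         }
>>>         // ...process as a new candidate; once uploaded, record it:
>>>         $add->execute([$p['esiteid']]);
>>>     }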
>>>
>>> Also, it would be great to be able to identify things that were imported
>>> but are no longer in the VCGI dataset, and sort those into substantially
>>> manually modified vs not.
>>>
>>> I agree that it's reasonable to first find "address does not exist in
>>> OSM and there is no nearby object with any address in OSM" and restrict
>>> scope to that, as long as there is a near-zero chance of re-adding data
>>> that hand mappers have found to be incorrect and removed.
>>>
>>> I see doing this as looping over the import data and doing a db query
>>> for street name and number, and another for address objects near the
>>> coordinates.
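>>>
>>> Concretely, I picture something like this per import point (a sketch;
>>> the table layout is made up, and normalize() stands in for whatever
>>> cleaning gets settled on):
>>>
>>>     <?php
>>>     function normalize(string $v): string {
>>>         return preg_replace('/[^a-z0-9]/', '', strtolower($v));
>>>     }
>>>     $db = new PDO('sqlite:osm_addresses.sqlite');
>>>     $byAddr = $db->prepare('SELECT * FROM osm_addresses
>>>         WHERE street_norm = ? AND housenumber_norm = ?');
>>>     $nearby = $db->prepare('SELECT * FROM osm_addresses
>>>         WHERE lat BETWEEN ? AND ? AND lon BETWEEN ? AND ?');
>>>     $importPoints = [/* parsed VCGI points with street/housenumber/lat/lon */];
>>>     foreach ($importPoints as $p) {
>>>         $byAddr->execute([normalize($p['street']), normalize($p['housenumber'])]);
>>>         $dLat = 10 / 111320;                     // ~10 m in degrees latitude
>>>         $dLon = $dLat / cos(deg2rad($p['lat'])); // widen for longitude
>>>         $nearby->execute([$p['lat'] - $dLat, $p['lat'] + $dLat,
>>>                           $p['lon'] - $dLon, $p['lon'] + $dLon]);
>>>         // sort into the bins above based on the two result sets
>>>     }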
>>>
>>> I think this first requires data cleaning to expand acronyms and
>>> transform street names to OSM capitalization, etc. That can be done as
>>> a first-step match between the set of street names in VCGI and in OSM
>>> for a given town. This all assumes municipal boundaries are already in
>>> place, or that all address points in OSM (that are in VT) have town
>>> names. Modulo issues of addresses that are not in towns, of course, if
>>> that is possible (it's not in MA, but I know we're odd).
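>>>
>>> A first cut at the expansion could be a suffix map plus naive
>>> recapitalization (a sketch; it would mangle names like "McNeil" and
>>> still needs a review pass):
>>>
>>>     <?php
>>>     $suffixes = ['St' => 'Street', 'Rd' => 'Road', 'Ave' => 'Avenue',
>>>                  'Ln' => 'Lane', 'Dr' => 'Drive', 'Hwy' => 'Highway'];
>>>     function expandStreet(string $name, array $suffixes): string {
>>>         // Title-case each word, then expand a trailing abbreviation.
>>>         $words = array_map('ucfirst',
>>>             array_map('strtolower', explode(' ', $name)));
>>>         $last = rtrim(end($words), '.');
>>>         if (isset($suffixes[$last])) {
>>>             $words[count($words) - 1] = $suffixes[$last];
>>>         }
>>>         return implode(' ', $words);
>>>     }
>>>     // expandStreet('MAIN ST', $suffixes) => "Main Street"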
>>>