[Imports] [Imports-us] [Talk-us-newyork] Update: New York State GIS SAM Address Points Import

Skyler Hawthorne osm at dead10ck.com
Thu Feb 11 18:32:08 UTC 2021


Feb 11, 2021 13:01:15 Kevin Kenny <kevin.b.kenny at gmail.com>:

> Skyler and I both know how to do fuzzy string matching. My first implementation of Hunt-Szymanski-McIlroy was in 1977 or so, working from a preprint from McIlroy that Al Aho had sent me; this was right before the Hunt-Szymanski paper appeared in CACM. (It was being deployed to replace an earlier matching system that used Wagner-Fischer and turned out to be far too memory intensive on the limited machines of the time.)

> The discussion is more on how to define the match quality for this particular application: how much fuzz is a good idea?
> 
> Wanton deduplication is _not_ a good idea in this particular application. The import is, for instance, revealing quite a few misspelt street names in OSM, and the misspellings would be very close (in terms of Hamming or Levenshtein distance) to the correct name - but we _want_ to get the misspellings fixed.

I agree: a simple Levenshtein distance threshold would hide typos where the name is off by one letter, and those would be nice to correct. Before I continue the import, I'm going to mitigate the false positives in a more targeted way: client-side matching that strips apostrophes and ignores case. I think that will strike a better balance.
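
For illustration only, that kind of targeted matching might look something like the Python sketch below. The names normalize_street and same_street are hypothetical, and treating the typographic apostrophe (U+2019) the same as the ASCII one is an assumption on top of what is described above; this is not the actual import code.

```python
def normalize_street(name: str) -> str:
    """Hypothetical helper: strip apostrophes and fold case for comparison."""
    # Assumption: drop both the ASCII apostrophe and the typographic one (U+2019).
    for apostrophe in ("'", "\u2019"):
        name = name.replace(apostrophe, "")
    # casefold() is the str method intended for caseless comparison.
    return name.casefold()


def same_street(a: str, b: str) -> bool:
    """True only if the names are identical after normalization (no fuzz)."""
    return normalize_street(a) == normalize_street(b)


if __name__ == "__main__":
    # Differences in apostrophes and case are ignored.
    print(same_street("O'Brien Road", "OBRIEN ROAD"))
    # A transposed-letter typo is deliberately NOT treated as a match,
    # so it still surfaces for correction.
    print(same_street("Main Street", "Mian Street"))
```

Because the comparison is exact after normalization, a misspelling that is one edit away from the correct name still shows up as a mismatch instead of being silently deduplicated.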