[Imports] Address import from government open data in Serbia
Branko Kokanovic
branko at kokanovic.org
Tue Mar 28 21:18:53 UTC 2023
Hi,
I didn't know about the openrefine software; it seems like a nice thing to be aware of! However, we opted for a "full control" approach. Our algorithm for mapping "ALL CAPS" names to "All Caps" (just sharing it here, someone might find it useful) goes something like this:
* Check a curated list of overridden street names (these are names that we crowdsourced in an online spreadsheet and put in files as special cases).
* Find the street in OSM by cadastre reference (since streets are also open data). If found, we are sure that the mapping is correct.
* Normalize the "ALL CAPS" name (remove punctuation, convert to lowercase, trim...) and try to find that normalized name in OSM. If found, assume that this is the correct street name.
* Make a best effort. Keep the "First Letter" capitalized (we have a lot of streets named after people, so the first letter is mostly a capital) and keep a list of words that are exceptions ("street", "river", "valley", "brigades", "stream", "creek"...). This is highly specific to Serbian grammar rules.
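The lookup order above could be sketched roughly as follows in Python. This is only an illustration, not the code from the repository: the function names, the dictionary-based inputs, the English exception words and the rule that exception words stay lowercase unless they come first are all assumptions.

```python
import string

# Assumed exception words (English stand-ins for the Serbian generic terms).
LOWERCASE_WORDS = {"street", "river", "valley", "brigades", "stream", "creek"}


def normalize(name):
    """Strip ASCII punctuation, lowercase, and collapse whitespace."""
    no_punct = name.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.lower().split())


def best_effort(name):
    """Capitalize each word, except exception words that are not first."""
    words = name.lower().split()
    return " ".join(
        w if i > 0 and w in LOWERCASE_WORDS else w.capitalize()
        for i, w in enumerate(words)
    )


def map_street_name(raw, ref, overrides, osm_by_ref, osm_names):
    """Try each source in order of decreasing trust.

    overrides:  {raw cadastre name: curated name}
    osm_by_ref: {cadastre reference: street name found in OSM}
    osm_names:  iterable of street names present in OSM
    """
    if raw in overrides:                 # 1. curated special cases
        return overrides[raw]
    if ref in osm_by_ref:                # 2. match by cadastre reference
        return osm_by_ref[ref]
    osm_by_norm = {normalize(n): n for n in osm_names}
    norm = normalize(raw)
    if norm in osm_by_norm:              # 3. match by normalized name
        return osm_by_norm[norm]
    return best_effort(raw)              # 4. grammar-based fallback
```

In a real run the normalized OSM index would be built once rather than on every call.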
Regarding osminspector, we will surely use it during and after import.
Regarding the question of how we plan to do conflation: we also opted for a "full control" solution, which is harder but more customizable, I think. We might be wrong on this, and maybe it is over-engineered, but we think it will give us the best ratio of import quality to import speed. 2.5 million addresses is not a small number. Basically, we have a daily job, a set of pipelines[1], that downloads the cadastre data as well as a PBF from OSM, does some normalization, street name mapping and then conflation, generates HTML and importable .osm files, and uploads everything. Conflation is done by scoring street names by Levenshtein distance, house numbers numerically and geographic distance numerically, and taking a linear combination of these to get a match percentage. If the match is perfect (100%), we prepare .osm files to be loaded into JOSM (in these files, we just add "ref" to existing entities). If there is not a single address within 200 m (0% match), which is a very common case in villages today, we prepare .osm files whose addresses are added as new nodes to OSM. If there is a partial match (between 0% and 100%), we keep our hands off and leave it to a human to sort things out manually. There are import instructions in the wiki on how to handle those .osm files, and I just published an instruction video[2] (in Serbian; I will add subtitles in the coming days).
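To make the scoring idea concrete, here is a minimal Python sketch of a linear-combination match score with the three outcomes described above. It is not the pipeline's actual code: the 40/40/20 weights, the exact-match house number score, the linear distance falloff and all names are assumptions chosen for illustration; only the 200 m radius and the 100% / 0% / partial split come from the description.

```python
MAX_DISTANCE_M = 200.0
# Assumed weights, in percentage points; they sum to 100.
W_STREET, W_NUMBER, W_DISTANCE = 40, 40, 20


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def match_percent(cadastre, osm, distance_m):
    """Score two (street, housenumber) pairs on a 0..100 scale."""
    if distance_m >= MAX_DISTANCE_M:
        return 0.0
    s1, s2 = cadastre[0].casefold(), osm[0].casefold()
    street = 1.0 - levenshtein(s1, s2) / max(len(s1), len(s2), 1)
    number = 1.0 if cadastre[1].casefold() == osm[1].casefold() else 0.0
    nearness = 1.0 - distance_m / MAX_DISTANCE_M
    return W_STREET * street + W_NUMBER * number + W_DISTANCE * nearness


def classify(cadastre, candidates):
    """candidates: list of ((street, housenumber), distance_m) from OSM."""
    best = max((match_percent(cadastre, o, d) for o, d in candidates),
               default=0.0)
    if best == 100.0:
        return "add_ref"    # perfect match: add "ref" to the existing entity
    if best == 0.0:
        return "new_node"   # nothing within 200 m: create a new node
    return "manual"         # partial match: leave it to a human
```

With these assumptions, an OSM address with the same street and house number at distance 0 scores 100, while the same street with a different house number 50 m away lands in the manual bucket.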
Thanks for the great suggestions!
Branko
[1] https://gitlab.com/osm-serbia/adresniregistar/-/blob/main/Makefile
[2] https://peertube.openstreetmap.fr/w/s7tiAyeK592Btj9ficfHJH
On Tue, Mar 28, 2023, at 13:40, Cascafico Giovanni wrote:
> Hello Branko,
>
> I'd like to suggest openrefine [1] for ALLCAPS and misspelling issues. The tool can apply a sequence of regex replacements to huge lists. Besides, the replacement sequence is saved automatically and can be a resource for further imports.
>
> Like others pointed out, I found osminspector [2] a very useful tool for post-import quality assessment.
>
> I didn't understand how you plan to perform conflation. My approach would be using osm_conflator tool and audit service [3]. Basically osm_conflator works on nodes by overpass extracting a category (ie, addr:) and trying to match import candidates in a certain radius. Once a set of candidates is generated, actual conflation (audit) can be done via crowd-checking on a shared map like this [4].
>
>
>
> [1] https://openrefine.org/
> [2] https://tools.geofabrik.de/osmi/?view=addresses&lon=20.40677&lat=44.84030&zoom=12
> [3] https://wiki.openstreetmap.org/wiki/Import/Catalogue/Milan_addresses_import
> [4] http://audit.osmz.ru/map/MI-M9