[Imports] Address import from government open data in Serbia
Branko Kokanovic
branko at kokanovic.org
Sun Mar 26 20:11:55 UTC 2023
Hi all,
The Serbian government released its address registry (streets and housenumbers with point geometries) as open data from the national cadastre ("RGZ" in the rest of this mail) on 12 December 2022. We are now preparing an import into OSM and would like feedback on our approach, pointers to anything we are missing, and any questions you may have. Below are some details of what we have been working on for the past month.
tl;dr
Main wiki (en and sr): https://wiki.openstreetmap.org/wiki/Serbia/Projekti/Adresni_registar
Main topic (sr only): https://community.openstreetmap.org/t/uvoz-adresnog-registra-plan/8916
--------------------------
1. Data quality
Data from RGZ was checked by sampling and appears to be of good quality, far better than what we have today in OSM. In OSM, except for a couple of major cities (Belgrade has 80% of its addresses, for example), addresses are mostly non-existent in Serbia. There are some cases where an address exists in RGZ but the building is missing on satellite imagery (either not yet erected or already demolished); we agreed to add these addresses too. There are also cases where local shops (on the ground floor) are empty or ruined but still exist in RGZ.
In total, there are around 2.428.000 addresses in RGZ and 250.000 addresses in OSM. We think that 190.000 of the OSM addresses can be "conflated" (by adding a new tag) and 60.000 will have to be resolved case by case. The rest (>2.000.000) could be imported simply as points.
One downside is that addresses in the cadastre are given in "ALL CAPS" (e.g. data from RGZ would have something like "ABBEY ROAD" as a street name). We fixed this (more on this later).
--------------------------
2. Preparation
We have a couple of topics on the community forum, but the main thread is this one: https://community.openstreetmap.org/t/uvoz-adresnog-registra-plan/8916. We prepared a wiki page for the import here: https://wiki.openstreetmap.org/wiki/Serbia/Projekti/Adresni_registar.
As a community, we first discussed and agreed on a tagging schema for addresses. It was always more or less assumed to be the Karlsruhe schema, but this time we discussed in fine detail whether an address should be a node or a way, what to do when there are apartments and shops on the ground floor, etc. The main thread for this specific topic is here: https://community.openstreetmap.org/t/uvoz-adresnog-registra-pravila-tagovanja-adresa-u-srbiji/8915 and the final outcome is written up in detail here: https://wiki.openstreetmap.org/wiki/Serbia/Adresses.
Regarding street names that are given in "ALL CAPS": as a community, we agreed to import them with "Proper Casing", including grammatical and on-the-ground rules (punctuation, spacing, hyphens, correcting cases, plurals, stripping "street" as a suffix in <1% of cases...). We took all street names in RGZ (there are 30.000 distinct ones), put them online at https://lite.framacalc.org/tgux01sydx-9ztp and distributed the work among us to fix their naming. This work is already done and we will use this (proper) naming when doing the import. The main topic for this is here: https://community.openstreetmap.org/t/pravilno-imenovanje-ulica/96891/
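To illustrate why the names were reviewed by hand rather than converted mechanically, here is a minimal sketch of a naive first pass. The function name is hypothetical and this is not the procedure we used: it only normalises spacing and capitalises every word, so it cannot know which words should stay lower-case or apply any of the grammatical rules listed above.

```python
def naive_proper_case(name: str) -> str:
    """Naive first pass over an ALL CAPS street name (illustrative only).

    Collapses repeated whitespace and capitalises the first letter of
    every word. It knows nothing about grammar or proper nouns, so its
    output still needs human review.
    """
    words = name.lower().split()  # split() also drops extra spaces
    return " ".join(word[:1].upper() + word[1:] for word in words)


if __name__ == "__main__":
    print(naive_proper_case("ABBEY ROAD"))
    print(naive_proper_case("КНЕЗА  МИЛОША"))
```

A pass like this is only useful as a starting point for the shared spreadsheet; the final names are the manually corrected ones.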
We also created a new tile server that shows only street geometries and housenumbers based on RGZ data. It can be accessed here: https://tiles.openstreetmap.rs/rgz/{zoom}/{x}/{y}.png. It will be an invaluable help when importing addresses and when solving cases manually.
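For anyone who wants to fetch a single tile outside of an editor, the placeholders in the URL can be filled in as sketched below. This assumes the server follows the usual "slippy map" XYZ tile numbering (Web Mercator), which is an assumption and not something stated above; the function name and the example coordinates are only illustrative.

```python
import math

# URL template from this mail; {zoom}, {x} and {y} are filled in below.
TILE_URL_TEMPLATE = "https://tiles.openstreetmap.rs/rgz/{zoom}/{x}/{y}.png"


def tile_url(lat: float, lon: float, zoom: int) -> str:
    """Return the URL of the tile containing (lat, lon) at a zoom level.

    Assumes standard slippy-map tile numbering: x grows eastwards from
    longitude -180, y grows southwards from the top of the Mercator map.
    """
    n = 2 ** zoom  # number of tiles along each axis at this zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return TILE_URL_TEMPLATE.format(zoom=zoom, x=x, y=y)


if __name__ == "__main__":
    # Roughly central Belgrade; coordinates are illustrative.
    print(tile_url(44.8125, 20.4612, 18))
```

In JOSM or iD the template itself can normally be added as a custom imagery layer, so this computation is only needed for scripts.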
We also agreed on which script to use when housenumbers contain letters (e.g. "30b"). We could use either Cyrillic ("30б") or Latin ("30b"). Cyrillic is what we usually have in the "name" tag for streets, but if we used it for "addr:housenumber" too, we would run into a problem: support in geocoders for different languages in the "addr:housenumber" tag is almost non-existent. So, for purely pragmatic reasons, we opted for Latin in housenumbers (we will use "30b" instead of "30б"). This problem is not specific to Serbia, but it affects us greatly and we talked about it at length as long as 4 years ago: https://community.openstreetmap.org/t/cirilica-i-latinica-u-kucnim-brojevima/88545. Whoever wants to tackle this problem, please contact me, I will be eager to help.
We also agreed on what to do when OSM and RGZ differ in addresses: we agreed to keep the OSM value if there is a note tag, and clarified all of that in the import instructions.
Finally, we introduced a new tag to reference addresses in RGZ, "ref:RS:kucni_broj" (translated to English as "ref:RS:housenumber"), as well as a new "source=RGZ_AR_Import" for changesets (we plan to add this to https://taginfo.openstreetmap.org/projects once the import starts).
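As an illustration of the housenumber decision, converting a Cyrillic suffix to Latin could look roughly like the sketch below. The mapping is the standard Serbian Cyrillic to Latin correspondence; the function name is hypothetical, and whether the actual import scripts work this way is an assumption.

```python
# Standard Serbian Cyrillic -> Latin correspondence (lower-case letters).
_LOWER = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}

# Add upper-case variants so the original letter case is preserved.
CYRILLIC_TO_LATIN = dict(_LOWER)
CYRILLIC_TO_LATIN.update({c.upper(): l.upper() for c, l in _LOWER.items()})


def latin_housenumber(value: str) -> str:
    """Replace Serbian Cyrillic letters in a housenumber with Latin ones.

    Digits and characters that are already Latin are left untouched.
    """
    return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in value)


if __name__ == "__main__":
    print(latin_housenumber("30б"))  # 30b
```

Applying the function to a value that is already in Latin script returns it unchanged, so it is safe to run over mixed data.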
--------------------------
3. Import
There are a lot of addresses to import. We want the import to have a human in the loop, but to be as easy as possible. We created a web site to help us with this: https://openstreetmap.rs/download/ar/.
There are 3 main cases when doing import:
* Adding new addresses - it should be as easy as going to the above-mentioned web site, navigating to a municipality and settlement, and downloading an .osm file with new addresses. All .osm files are split into chunks of at most 100 addresses that can be imported at once. The web site is refreshed daily. The rest of the instructions are on the wiki page, but they boil down to: use JOSM, check geometries one by one, move addresses on top of houses on satellite imagery, check street naming and upload. We might even try to automate this once a couple of thousand addresses have been added and enough time has passed, depending on overall feedback from the community. We will use a separate bot account for this if we go down this route.
* Conflating existing addresses - it should be as easy as downloading another .osm file in which we only add the "ref:RS:kucni_broj" tag, and repeating the same procedure as above. These files are also limited to at most 100 addresses and refreshed daily. It should be noted that only addresses that match 100% (street name and housenumber are both exact, and the OSM and RGZ locations are within 200m of each other) are proposed in these .osm files.
* Fuzzy matched addresses - this is the hardest case and no automation is provided. There are "only" 60.000 of these addresses, but this will take most of the time. We plan to use the mentioned tile server with rendered addresses to aid in this, but it will still require a lot of work as there is a lot of randomness here (typos in names, old addresses, locations that are too far away...)
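The "100% match" rule and the 100-address limit described above can be sketched as follows. This is a minimal illustration, not our actual tooling: the record layout (dictionaries with "street", "housenumber", "lat" and "lon" keys) and all function names are assumptions, and the distance is computed with the haversine formula on a spherical Earth.

```python
import math

MAX_DISTANCE_M = 200          # distance threshold mentioned above
MAX_ADDRESSES_PER_FILE = 100  # file size limit mentioned above


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres (haversine, spherical Earth)."""
    radius = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(a))


def is_exact_match(osm, rgz):
    """True if street and housenumber are identical and both points are close."""
    return (osm["street"] == rgz["street"]
            and osm["housenumber"] == rgz["housenumber"]
            and distance_m(osm["lat"], osm["lon"],
                           rgz["lat"], rgz["lon"]) <= MAX_DISTANCE_M)


def split_into_files(addresses, size=MAX_ADDRESSES_PER_FILE):
    """Split a list of addresses into chunks of at most `size` items."""
    return [addresses[i:i + size] for i in range(0, len(addresses), size)]
```

Anything that fails `is_exact_match` but still looks similar (a typo in the street name, a point slightly too far away) would fall into the fuzzy-matched group that is handled manually.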
We plan to create a special tutorial video to make onboarding easier for people, same as we did for the import of administrative boundaries (https://vimeo.com/401994061), but this time on https://peertube.openstreetmap.fr (to be honest, we had that video on the fediverse at https://peertube.live/videos/watch/d5ef0a85-2578-4c7d-8430-1395e853eca7, but it is gone now...).
Overall, my personal hope is that with dozens of very active people (whom we already have in the community), a dozen more who are sporadically active, and good tooling, we can have 80% of addresses from RGZ imported into OSM by the end of 2023.
--------------------------
4. Quality assurance
As previously mentioned, we created the website https://openstreetmap.rs/download/ar/, which is refreshed daily and has a section for QA. The idea is to monitor the import as we go and detect any problems. These are some of the things we plan to check continuously:
* duplicated ref:RS:kucni_broj - in the ideal case, there will be no 2 OSM entities with the same "ref:RS:kucni_broj" tag, as there are no duplicates in RGZ either. This report should have 0 entries. The report is done and generated daily.
* addresses in buildings - as we agreed in the tagging schema, we want to place addresses on the building's way (if the building exists). However, if we required this, it would mean adding a building for each address, which would slow down the import itself. So, the plan is to proceed with the import and add addresses as nodes, and have this QA check let us know where we have addresses as nodes inside buildings. Today, we counted 76.000 address nodes inside buildings, out of which 57.500 are simple cases where a single address node inside a building can be deleted and its address moved to the building. The rest (18.000) are cases that need to be checked one by one (POIs, multiple addresses, typos...). For the simple case, we even have generated .osm files to automate the fix. One of the problems with moving these nodes is that we lose their history. We split the .osm files to 10 entities max, to make it easier for anyone to find the deep history of deleted nodes. The report is done and generated daily.
* QA on conflated addresses - once we detect the "ref:RS:kucni_broj" tag on some OSM address, we will run a couple of checks: does that reference actually exist in RGZ, is the address too far away from the RGZ data, do the street names/housenumbers match... This report will tell us all this and it should ideally have 0 entries. This report is still being worked on.
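The duplicate check from the first bullet can be sketched in a few lines. This is an illustration under assumptions, not the code behind the daily report: elements are assumed to be dictionaries with an "id" and a "tags" mapping (similar in spirit to what an Overpass JSON export provides), and the function name is hypothetical.

```python
from collections import defaultdict

REF_KEY = "ref:RS:kucni_broj"  # reference tag introduced for this import


def find_duplicate_refs(elements):
    """Return {ref value: [element ids]} for refs used by more than one element.

    `elements` is an iterable of dicts with "id" and "tags" keys
    (a hypothetical, simplified representation of OSM objects).
    """
    ids_by_ref = defaultdict(list)
    for element in elements:
        ref = element.get("tags", {}).get(REF_KEY)
        if ref is not None:
            ids_by_ref[ref].append(element["id"])
    return {ref: ids for ref, ids in ids_by_ref.items() if len(ids) > 1}
```

An empty result corresponds to the "0 entries" state that the report should ideally show.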
--------------------------
5. Licence
The license of this data is clarified as open data in https://data.gov.rs/sr/terms/. The data was released in the open on 12 December 2022 through changes in the law as defined in this PDF: https://geosrbija.rs/?mdocs-file=6186 (article 34), which (along with other RGZ documents) can be downloaded here: https://geosrbija.rs/dokumentacija/. While we have already imported some other data from the Open Data portal (GTFS, admin boundaries, national heritage...), we also wanted to add RGZ as a source to https://www.openstreetmap.org/copyright and we have a pull request for that: https://github.com/openstreetmap/openstreetmap-website/pull/3959. Afterwards, we also contacted the LWG (on 8 March 2023) for further consultation, but an answer is still pending. However, we think we are on the safe side to start the import even now. Please raise concerns if this is not the case!
Thanks, Branko