From lonvia at denofr.de Sat Apr 3 16:25:08 2021
From: lonvia at denofr.de (Sarah Hoffmann)
Date: Sat, 3 Apr 2021 17:25:08 +0100
Subject: [Geocoding] GSoC 2021 - Extracting QA reports from Nominatim
In-Reply-To:
References: <20210325111953.GA31238@denofr.de>
Message-ID: <20210403162508.GA14067@denofr.de>

Hi,

On Tue, Mar 30, 2021 at 12:49:55AM +0200, Antonin Jolivat wrote:
> > No, the tool should be regularly run against an existing database.
> > The idea is the following: we have the global instance of Nominatim
> > running at https://nominatim.openstreetmap.org. Once a night, we
> > would stop the regular update, let the tool run over the database
> > and create the error reports in the form of vector tiles that are
> > suitable for being displayed with Osmoscope, then continue updating
> > the database with the newest data from OSM.
>
> This is not completely clear to me. As I understand it, the tool
> should be run during the night before starting the process of
> updating the database with the newest data from OSM, is that correct?
> Because as you wrote it, it seems that the tool should run in the
> middle of the regular update (like stopping the update, running the
> QA analysis tool and then continuing the update).
> If it should be started alongside the update process, why shouldn't
> it run after the update process, so that the new data gets analysed
> too?

The Nominatim server runs updates not only once a day but minutely. As
soon as you upload an edit in OpenStreetMap, it gets the new data and
incorporates it into its database. And a minute later, you can search
for what you have mapped.

So the idea is that once a night, we stop that process for a bit, so
that quality control can go over the current state of the database. It
will just be a lot easier if you don't have to consider that the data
might change while the script is extracting the errors. But that's the
reason why the script has to be fast. If updates are stopped for longer
than an hour, there will be complaints from impatient mappers. ;)

> I tried to imagine a basic solution for some data issues. Could you
> tell me if these basic examples are correct (of course, this is still
> very high-level and not totally accurate):
> - For "admin boundaries with more than one member with role 'label'"
> we would need to look up the "placex" table to find relations with
> class=boundary and type=administrative. Then, for the found
> relations, we would look them up in "planet_osm_rels" and check
> their node members with role 'label'.
> - For "same Wikidata IDs appearing on more than one place node" we
> would need to look up placex and find duplicate Wikidata IDs. For
> example, I made a quick and simple SQL query for test purposes:
> SELECT * FROM placex p1 WHERE
>   (SELECT count(*) FROM placex p2
>     WHERE p1.osm_type = 'N' AND p2.osm_type = 'N'
>       AND p1.extratags->'wikidata' = p2.extratags->'wikidata') > 1
> - Others will be more or less tricky but will always need some
> lookups on the database too in order to check OSM objects.

Yes, that looks about right.
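
For the Wikidata check, by the way, a GROUP BY with HAVING is usually
easier on the query planner than a correlated subquery, and it reports
each duplicated ID only once, together with all the nodes that carry
it. A very rough, untested sketch of how that could be run from Python
(psycopg2; the database name and the placex/extratags columns are
simply the ones of a standard Nominatim setup):

# Untested sketch: list Wikidata IDs that appear on more than one
# place node, together with the OSM node ids that carry them.
import psycopg2

DUPLICATE_WIKIDATA_SQL = """
    SELECT extratags->'wikidata' AS wikidata_id,
           array_agg(osm_id) AS node_ids
      FROM placex
     WHERE osm_type = 'N'
       AND extratags->'wikidata' IS NOT NULL
     GROUP BY extratags->'wikidata'
    HAVING count(*) > 1
"""

with psycopg2.connect(dbname='nominatim') as conn:  # adjust connection details
    with conn.cursor() as cur:
        cur.execute(DUPLICATE_WIKIDATA_SQL)
        for wikidata_id, node_ids in cur:
            print(wikidata_id, 'is used by nodes', node_ids)
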
> After an initial thought, I think the main difficulties of the
> project lie in making the tool both modular and performant.
> For the modular part, it should be easy to plug a new "rule" of QA
> analysis into the tool, like "place nodes of the same type and name
> close to each other" for example.
> For the performance part, as each rule may need to look up the
> database (and often the same table), the tool will need a mechanism
> to make everything smooth by limiting lookups, caching some results
> and coordinating the execution of the rules.

Those are some interesting ideas for performance improvement. I admit
I had only thought in terms of "getting the SQL right so that it does
not run for two hours".

> The first ideas I have concerning the structure of the tool (very
> rough ones, I haven't had the time to really design it yet) are:
> - Something like a QAUnit or QARule object will define one specific
> rule (like "place nodes of the same type and name close to each
> other"), so that each rule is independent and it is easy to add a
> new one.
> - A core part where every QAUnit or QARule is plugged in (i.e.
> registered). It will be the main process responsible for executing
> each QAUnit and coordinating them.
> - I don't have a really good idea on this yet, as it is the hardest
> part, but the core (or another module linked to the core) should
> handle the most expensive processing, for example the lookup of a
> big table that is needed by multiple QAUnits. So the idea would be
> that each QAUnit requests some data, for example objects from the
> "placex" table with certain characteristics, and the core optimizes
> this by anticipating and storing that information for the plugged-in
> QAUnits.
> - When every QAUnit has been processed, a module will be responsible
> for writing a vector tile (or GeoJSON file) based on templates for
> each QAUnit (at the moment I imagine one file per rule, but maybe it
> would need to be different for some rules).
>
> If you could tell me what you think about what I said, it would help
> me to understand if I am pointing out the right issues of the
> project. I know it is still very abstract, but I want to design the
> global idea first and then dive deeper into it and design a more
> concrete POC if I am on the right track.

That goes in the right direction. If I may offer a piece of advice
though: while thinking about the design, also give some thought to how
it can be developed iteratively. The system you describe above is
pretty feature-complete, I'd say. That's cool, but also think about
the opposite: what is the minimal system you'd need to get a valid
result (say, a geojson file for one hard-coded rule)? And what are the
steps to get from there to the feature-complete version? If you plan
this way, there is a much better chance that you have a working system
in the end, even when you hit some bumps on the road.

Sarah
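
P.S. Just to illustrate what I mean by a minimal system (completely
untested, and the file and function names are made up; only the placex
columns and the PostGIS/hstore bits come from a standard Nominatim
database): a single hard-coded rule whose findings end up in one
GeoJSON file can be as small as the sketch below, and everything else
(the rule registry, the shared lookups, the vector tile output) can
then grow step by step around it.

# Untested sketch: one hard-coded QA rule, one GeoJSON output file.
import json
import psycopg2

RULE_NAME = 'duplicate_wikidata'

# Place nodes whose Wikidata ID is shared with at least one other node.
RULE_SQL = """
    SELECT ST_AsGeoJSON(centroid), osm_id, extratags->'wikidata'
      FROM placex
     WHERE osm_type = 'N'
       AND extratags->'wikidata' IN (
             SELECT extratags->'wikidata'
               FROM placex
              WHERE osm_type = 'N'
                AND extratags->'wikidata' IS NOT NULL
              GROUP BY extratags->'wikidata'
             HAVING count(*) > 1)
"""

def run_rule(conn):
    """Execute the rule and turn each result row into a GeoJSON feature."""
    with conn.cursor() as cur:
        cur.execute(RULE_SQL)
        return [{'type': 'Feature',
                 'geometry': json.loads(geometry),
                 'properties': {'osm_id': osm_id, 'wikidata': wikidata}}
                for geometry, osm_id, wikidata in cur]

def write_layer(features, filename):
    """Write one GeoJSON FeatureCollection per rule."""
    with open(filename, 'w') as fd:
        json.dump({'type': 'FeatureCollection', 'features': features}, fd)

if __name__ == '__main__':
    with psycopg2.connect(dbname='nominatim') as conn:
        write_layer(run_rule(conn), RULE_NAME + '.geojson')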