[Geocoding] GSoC 2021 - Extracting QA reports from Nominatim

Antonin Jolivat antonin.jolivat at gmail.com
Mon Mar 29 22:49:55 UTC 2021


Hi Sarah,

Thanks for your reply.
To return quickly to the first point, it really was only for theoretical
knowledge purposes. I am completely aware that it doesn't guarantee
anything and that there is a lot of competition across the overall
OpenStreetMap organization.
Anyway, thanks for your honesty on this point; it is good to know that there
is not necessarily only one student working on Nominatim.

> No, the tool should be regularly run against an existing database. The idea
> is the following: we have the global instance of Nominatim running at
> https://nominatim.openstreetmap.org. Once a night, we would stop the
> regular update, let the tool run over the database and create the error
> reports in the form of vector tiles that are suitable for being displayed
> with Osmoscope, then continue updating the database with the newest data
> from OSM.
>

This is not completely clear to me. As I understand it, the tool should be
run during the night, before the process of updating the database with the
newest data from OSM starts. Is that correct?
From the way you wrote it, it sounds as if the tool should run in the middle
of the regular update (stopping the update, running the QA analysis tool,
and then resuming the update).
If it has to be run separately from the update process, why not run it after
the update, so that the new data gets analysed too?

I am still learning the database structure and some OSM mapping conventions.
While going through the main Nominatim tables, I am also trying to write
some very basic documentation for them.
I have tried to sketch a basic solution for some of the data issues. Could
you tell me whether these examples are roughly correct (the logic is of
course still very high-level and not fully accurate)?
- For "admin boundaries with more than one member with role 'label'" we
would need to lookup "placex" table to find relation with class=boundary
and type=administrative. Then for the finded relations we would lookup
"planet_osm_rels" find each previously selected relations and check their
node members with a role label.
- For "same Wikidata IDs appearing on more than one place node" we would
need to lookup placex and finded duplicate Wikidata IDs, for example I made
a quick and simple SQL query for tests purpose:
SELECT * FROM placex p1
 WHERE p1.osm_type = 'N'
   AND p1.extratags ? 'wikidata'
   AND (SELECT count(*) FROM placex p2
         WHERE p2.osm_type = 'N'
           AND p2.extratags->'wikidata' = p1.extratags->'wikidata') > 1;
- The other checks will be more or less tricky, but they will always need
some database lookups as well in order to inspect the OSM objects.
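
To make the first check a bit more concrete, here is a very rough Python
sketch of how I imagine querying it. It assumes the legacy osm2pgsql schema
where planet_osm_rels.members is a flat text array of alternating member
references and roles; the names (LABEL_CHECK_SQL, find_label_duplicates) are
just placeholders:

import psycopg2

# Admin boundary relations that have more than one member with role 'label'.
# Assumes planet_osm_rels.members stores (member ref, role) pairs flattened
# into one text[], so the roles sit at the even positions of the array.
LABEL_CHECK_SQL = """
SELECT p.osm_id
  FROM placex p
  JOIN planet_osm_rels r ON r.id = p.osm_id
 WHERE p.osm_type = 'R'
   AND p.class = 'boundary'
   AND p.type = 'administrative'
   AND (SELECT count(*)
          FROM unnest(r.members) WITH ORDINALITY AS m(value, idx)
         WHERE idx % 2 = 0 AND m.value = 'label') > 1
"""

def find_label_duplicates(dsn):
    """Return the osm_ids of the offending boundary relations."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(LABEL_CHECK_SQL)
        return [row[0] for row in cur.fetchall()]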

After some initial thought, I think the main difficulty of the project lies
in making the tool both modular and performant.
On the modular side, it should be easy to plug a new QA analysis "rule" into
the tool, for example "place nodes of the same type and name close to each
other".
On the performance side, since each rule may need to query the database (and
often the same tables), the tool will need a mechanism to keep everything
smooth by limiting lookups, caching some results and coordinating the
execution of the rules.

The first ideas I have concerning the structure of the tool (still very
high-level, I haven't had the time to really design it yet) are:
- Something like a QAUnit or QARule object will define one specific rule
(like "place nodes of the same type and name close to each other"), so that
each rule is independent and it is easy to add new ones.
- A core part where every QAUnit or QARule will be plugged in (registered).
It will be the main process responsible for executing each QAUnit and
coordinating them.
- I don't have a really good idea on this yet, as it is the hardest part,
but the core (or another module linked to the core) should handle the most
expensive operations, for example the lookup of a big table that is needed
by multiple QAUnits. The idea would be that each QAUnit requests some data,
for example objects from the "placex" table with certain characteristics,
and the core would optimise this by anticipating and caching that data for
the plugged-in QAUnits.
- When every QAUnit has been processed, a module will be responsible for
writing a vector tile (or GeoJSON file) based on templates for each QAUnit
(at the moment I imagine one file per rule, but maybe it would need to be
different for some rules). A rough sketch of these last points follows
below.

If you could tell me what you think about this, it would help me understand
whether I am identifying the right issues of the project. I know it is still
very abstract, but I want to design the overall idea first and then dive
deeper and design a more concrete POC if I am on the right track.

Thanks for your time and have a nice day!

Regards,
Antonin
