[Geocoding] Updates on db with multiple countries
Sarah Hoffmann
lonvia at denofr.de
Wed Apr 15 06:58:52 UTC 2020
Hi Rahul,
On Tue, Apr 14, 2020 at 07:22:46PM +0000, Rahul Reddy wrote:
> I am working on the issue #1683<https://github.com/osm-search/Nominatim/issues/1683>(Update script for running updates on a database with multiple countries). This<https://gist.github.com/krahulreddy/8d08a8b2a77581810effa88d1641a571> is a modified script I used. This worked for me. But there are a few issues.
>
> 1) The sequenceNumber in import_status might not be the same for all the countries. Unless this is fixed, there might be data loss during updates. This could be fixed by changing the structure of the import_status table to allow country specific entries. (Is it a good idea?)
The previous script would just keep the sequence numbers outside the database
in a file per country. You can do that with pyosmium-get-changes, too. Have a
look at the '-f' parameter. But I don't see an issue if you keep the numbers
in a table in the database. Just create a separate table with all the info
you need, i.e. next to the sequence number, you'd also need the replication URL.
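
A minimal sketch of such a separate status table. The table and column
names are hypothetical, the URLs and sequence numbers are made-up example
values, and sqlite3 stands in for Nominatim's PostgreSQL database only so
that the example runs on its own:

```python
import sqlite3

# Hypothetical schema: one row per country with its own replication URL
# and its own sequence number.
SCHEMA = """
CREATE TABLE IF NOT EXISTS country_update_status (
    country         TEXT PRIMARY KEY,
    replication_url TEXT NOT NULL,
    sequence_id     INTEGER NOT NULL
)
"""

def set_status(conn, country, url, sequence_id):
    """Insert or overwrite the replication state of one country."""
    conn.execute(
        "INSERT OR REPLACE INTO country_update_status VALUES (?, ?, ?)",
        (country, url, sequence_id),
    )

def get_status(conn, country):
    """Return (replication_url, sequence_id), or None for an unknown country."""
    return conn.execute(
        "SELECT replication_url, sequence_id "
        "FROM country_update_status WHERE country = ?",
        (country,),
    ).fetchone()

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

# Example values only; real sequence numbers come from the replication server.
MONACO = "https://download.geofabrik.de/europe/monaco-updates"
ANDORRA = "https://download.geofabrik.de/europe/andorra-updates"
set_status(conn, "monaco", MONACO, 2500)
set_status(conn, "andorra", ANDORRA, 2498)

# After applying one diff for Monaco, only Monaco's counter moves.
set_status(conn, "monaco", MONACO, 2501)
```

Keeping one row per country means the counters can drift apart without
one country's update overwriting the state of another.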
> 2) In Setup.php, init-updates option gets the latest date from the lib function getDatabaseDate(), which returns date corresponding to the object that has the highest osm_id. This would be wrong if the latest changes include deletions. I think comparing lastimportdate in import_status with the previous approach could be a good thing. This will help avoid repeated updates on deleted nodes.
It's correct that the function looks for the highest node ID. The little trick
here is that it then looks up the date for version 1 of that object. The OSM
database assigns node IDs sequentially when new objects are created. So it is
fair to assume that the node with the highest ID in any OSM file was one of the
last ones created. Version 1 is always the 'creation' version, thus giving us
a good estimate about the date of the file. There might be some additional
deletions or modifications after that date that still made it into the file
but that is okay, because Nominatim is "replay-safe". That means you can
reapply changes to the database as long as they are still applied in order.
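
The idea of "replay-safe" can be illustrated with a toy model. This is
only a sketch of the principle, not Nominatim's actual update code; the
dictionary store and the change tuples are assumptions made for the
example:

```python
def apply_changes(state, changes):
    """Apply changes in order; each one overwrites or removes an object."""
    for action, obj_id, tags in changes:
        if action == "delete":
            state.pop(obj_id, None)  # deleting a missing object is a no-op
        else:
            state[obj_id] = tags     # "create" and "modify" both overwrite
    return state

changes = [
    ("create", 1, {"name": "A"}),
    ("create", 2, {"name": "B"}),
    ("modify", 1, {"name": "A2"}),
    ("delete", 2, None),
]

# Apply everything once.
once = apply_changes({}, changes)

# Replay the last three changes, in order, on top of the finished state.
replayed = apply_changes(dict(once), changes[1:])
```

Because every change either overwrites or removes an object, replaying a
tail of the change list in the original order ends in the same state.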
That all said, when using multiple files, I would not recommend using
Nominatim's getDatabaseDate() function because the files might be from
different dates. You should instead determine the initial sequence ID from
the input files directly. pyosmium-get-changes can do that for you. Have
a look at: https://docs.osmcode.org/pyosmium/latest/updating_osm_data.html#preparing-the-state-file
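
As a rough sketch, the per-country calls could be assembled like this.
The file names and the Geofabrik URL are placeholder assumptions, and the
commands are only built here, not executed:

```python
def init_state_command(osm_file, server_url, state_file):
    """Command that derives the start sequence ID from an OSM file.

    -O        take the start date from the newest object in the file
    --server  replication server to look the sequence ID up on
    -f        file in which the sequence ID is stored
    Without -o no diff is downloaded.
    """
    return ["pyosmium-get-changes", "-O", osm_file,
            "--server", server_url, "-f", state_file]

def fetch_diff_command(server_url, state_file, out_file):
    """Command that downloads the changes following the stored sequence ID."""
    return ["pyosmium-get-changes", "--server", server_url,
            "-f", state_file, "-o", out_file]

# Placeholder values for one country; repeat for every imported extract.
url = "https://download.geofabrik.de/europe/monaco-updates"
init_cmd = init_state_command("monaco-latest.osm.pbf", url, "monaco.state")
fetch_cmd = fetch_diff_command(url, "monaco.state", "monaco.osc.gz")

# To actually run them: subprocess.run(init_cmd, check=True)
print(" ".join(init_cmd))
print(" ".join(fetch_cmd))
```

Each country gets its own state file, so the sequence IDs stay
independent even when the extracts were downloaded on different days.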
> I also wrote a shell script to setup db with multiple countries, which can be found here<https://gist.github.com/krahulreddy/948679bae414b5bfbdbe5fe489126eea>.
Combining the files first. That's nice.
We really should get all this into the documentation eventually. My
suggestion would be to add an 'Advanced Installations' section in the
'Administration guide' and have a chapter about importing multiple
countries there. The scripts can go in the `utils/` directory.
> An alternate approach for setting up updates for multiple countries would be to modify the Replication URL constant. This could be done by editing the existing utils/update.php, or by maintaining a separate copy of utils/update.php with necessary modifications.
Interesting thought. You could actually borrow a hidden feature from
testing to make that work. There is the possibility to inject your
own settings before all the standard settings are configured. Just
set the NOMINATIM_SETTINGS environment variable to point to your
custom PHP settings file containing just the replication URL. [1]
You'd still have to modify utils/update.php to make it use a
configurable table for the update status. But that sounds okay.
Feel free to give this a try.
[1] https://github.com/osm-search/Nominatim/blob/master/settings/defaults.php#L4
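
A minimal sketch of how a wrapper script might prepare this, assuming the
replication URL constant is called CONST_Replication_Url (please check
settings/defaults.php for your version). The helper names, the file path
and the URL are hypothetical:

```python
import os

def settings_php(replication_url):
    """Contents of a tiny PHP settings file that only sets the replication URL."""
    # Assumption: the constant is named CONST_Replication_Url.
    return "<?php\n@define('CONST_Replication_Url', '%s');\n" % replication_url

def env_with_settings(settings_path, base_env=None):
    """Copy of the environment with NOMINATIM_SETTINGS pointing at the file."""
    env = dict(os.environ if base_env is None else base_env)
    env["NOMINATIM_SETTINGS"] = settings_path
    return env

# Placeholder values for one country.
php = settings_php("https://download.geofabrik.de/europe/monaco-updates")
env = env_with_settings("/tmp/monaco-settings.php", base_env={})

# A wrapper would write `php` to the settings path and then start the
# update script with `env`, for example via subprocess.run(..., env=env).
print(php)
```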
Kind regards
Sarah