[Geocoding] Nominatim script log output - how to tell progress?
Simon Nuttall
info at cyclestreets.net
Sat Jul 4 09:52:03 UTC 2015
On 4 July 2015 at 10:47, Simon Nuttall <info at cyclestreets.net> wrote:
> On 3 July 2015 at 19:20, Sarah Hoffmann <lonvia at denofr.de> wrote:
>> On Fri, Jul 03, 2015 at 07:23:54AM +0100, Simon Nuttall wrote:
>>> Now it is showing these again:
>>>
>>> Done 274 in 136 @ 2.014706 per second - Rank 26 ETA (seconds): 2467.854004
>>>
>>> Presumably this means it is now playing catchup relative to the
>>> original download data?
>>
>> I would suppose so.
>>
>>> How can I tell what date it has caught up to? (And thus get an idea of
>>> when it is likely to finish?)
>>
>> Have a look at the import_osmosis_log table. It gives you a good idea
>> of how long the batches take.
>
> Ah yes - pretty slow :-(
>
> nominatim=# select * from import_osmosis_log order by endtime desc limit 12;
>       batchend       | batchsize |      starttime      |       endtime       |   event
> ---------------------+-----------+---------------------+---------------------+-----------
>  2015-06-09 12:54:02 |  40037028 | 2015-07-04 09:30:16 | 2015-07-04 09:30:29 | osmosis
>  2015-06-09 11:55:01 |  36866133 | 2015-07-04 08:57:52 | 2015-07-04 09:30:16 | index
>  2015-06-09 11:55:01 |  36866133 | 2015-07-04 08:34:17 | 2015-07-04 08:57:52 | osm2pgsql
>  2015-06-09 11:55:01 |  36866133 | 2015-07-04 08:34:06 | 2015-07-04 08:34:17 | osmosis
>  2015-06-09 10:55:02 |  42220289 | 2015-07-04 08:06:14 | 2015-07-04 08:34:06 | index
>  2015-06-09 10:55:02 |  42220289 | 2015-07-04 07:41:23 | 2015-07-04 08:06:14 | osm2pgsql
>  2015-06-09 10:55:02 |  42220289 | 2015-07-04 07:41:11 | 2015-07-04 07:41:23 | osmosis
>  2015-06-09 09:55:02 |  34076756 | 2015-07-04 07:14:30 | 2015-07-04 07:41:11 | index
>  2015-06-09 09:55:02 |  34076756 | 2015-07-04 06:53:59 | 2015-07-04 07:14:30 | osm2pgsql
>  2015-06-09 09:55:02 |  34076756 | 2015-07-04 06:53:49 | 2015-07-04 06:53:59 | osmosis
>  2015-06-09 08:56:01 |  26087298 | 2015-07-04 06:20:20 | 2015-07-04 06:53:49 | index
>  2015-06-09 08:56:01 |  26087298 | 2015-07-04 06:07:22 | 2015-07-04 06:20:20 | osm2pgsql
>
>
>>
>>> Is it catching up by downloading minutely diffs or using larger
>>> intervals, then switching to minutely diffs when it is almost fully up
>>> to date?
>>
>> That depends on how you have configured it. If it is set to the URL
>> of the minutelies, it will use minutely diffs but accumulate them
>> into batches of the size you have configured. When it has caught up,
>> it will just accumulate the latest minutelies, so the batches become
>> smaller.
>
> Ah yes, I see the configuration.txt has:
(oops - last email was sent prematurely)
# The URL of the directory containing change files.
baseUrl=http://planet.openstreetmap.org/replication/minute
# Defines the maximum time interval in seconds to download in a single invocation.
# Setting to 0 disables this feature.
maxInterval = 3600

So at most an hour of minutely diffs is accumulated per batch - which matches
the hourly steps in batchend in the log above.
>
>
>>
>>> This phase still seems very disk-intensive - will that settle down and
>>> become much less demanding when it has eventually caught up?
>>
>> It will lessen, but there will still be IO going on. Given that your
>> initial import took about 10 times as long as the best time I've seen,
>> it will probably take a long time to catch up. You should consider
>> running with --index-instances 2 while catching up, and you should
>> really investigate where the bottleneck in the system is.
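If we try that, I assume the flag just gets passed to the standard update
loop - something like this (a sketch; I haven't yet checked the exact flags
against our installation):

  ./utils/update.php --import-osmosis-all --index-instances 2

and watching the disks with iostat -x 5 (or vmstat 5) while a batch is
indexing should show whether we are IO-bound.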
I notice that our postgresql.conf has:
work_mem = 512MB
which seems a bit small?
But this seems healthy:
maintenance_work_mem = 10GB
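(For the record, what the running server is actually using can be checked
from psql:

  nominatim=# SHOW work_mem;
  nominatim=# SHOW maintenance_work_mem;

- changes in postgresql.conf only take effect after a reload.)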
>>
>>> Can the whole installed running Nominatim be copied to another
>>> machine? And set running?
>>>
>>> Presumably this is a database dump and copy - but how practical is that?
>>
>> Yes, dump and restore is possible. You should be aware that indexes
>> are not dumped, so it still takes a day or two to restore the complete
>> database.
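For reference, I imagine the straightforward dump and restore would be along
these lines (a sketch - not tried here yet; 'nominatim' is our database name):

  pg_dump -Fc -f nominatim.dump nominatim
  # then, on the target machine:
  createdb nominatim
  pg_restore --jobs=4 -d nominatim nominatim.dump

with the index rebuild during the restore accounting for most of that day
or two.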
>>
>>> Are there alternative ideas such as replication or backup?
>>
>> For backup you can do partial dumps that contain only the tables needed
>> for querying the database. These dumps can be restored faster, but
>> they are not updateable, so they are more of an interim solution
>> to install on a spare emergency server while the main DB is reimported.
>> The dump/backup script used for the osm.org servers can be found here:
>>
>> https://github.com/openstreetmap/chef/blob/master/cookbooks/nominatim/templates/default/backup-nominatim.erb
>>
>> If you go down that road, I recommend actually trying the restore
>> at least once, so you get an idea about the time and space requirements.
>>
>> Replication is possible as well. In fact, the two osm.org servers have
>> been running as master and slave with streaming replication for about
>> two weeks now. You should disable writing logs to the database.
>> Otherwise the setup is fairly standard, largely following this
>> guide: https://wiki.postgresql.org/wiki/Streaming_Replication
We've put off trying this - for now at least.
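(When we do get to it, I gather from that guide that the slave side is
essentially a base backup from the master plus a recovery.conf along these
lines - untested, host name made up:

  standby_mode = 'on'
  primary_conninfo = 'host=master.example.org user=replication'

and the point about not writing logs to the database makes sense, since a
streaming slave is read-only.)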
>>
>>> > string(123) "INSERT INTO import_osmosis_log values
>>> > ('2015-06-08T07:58:02Z',25816916,'2015-07-03 06:07:34','2015-07-03
>>> > 06:44:10','index')"
>>> > 2015-07-03 06:44:10 Completed index step for 2015-06-08T07:58:02Z in
>>> > 36.6 minutes
>>> > 2015-07-03 06:44:10 Completed all for 2015-06-08T07:58:02Z in 58.05 minutes
>>> > 2015-07-03 06:44:10 Sleeping 0 seconds
>>> > /usr/local/bin/osmosis --read-replication-interval
>>> > workingDirectory=/home/nominatim/Nominatim/settings --simplify-change
>>> > --write-xml-change /home/nominatim/Nominatim/data/osmosischange.osc
>>> >
>>> > Which presumably means it is updating June 8th? (What else can I read
>>> > from this?)
>>
>> See above - check out the import_osmosis_log. The important thing to take
>> away is how long it takes to process each interval of data. If on average
>> the import takes longer than real time, you are in trouble.
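A rough way to check that from psql - summing the three steps per batch and
comparing the wall-clock time with the span of OSM data each batch covers
(an untested sketch; the column aliases are mine):

  nominatim=# select batchend,
                     batchend - lag(batchend) over (order by batchend) as data_covered,
                     sum(endtime - starttime) as wall_time
              from import_osmosis_log
              group by batchend
              order by batchend desc
              limit 12;

From the figures above, each batch of roughly an hour of data is taking
45-55 minutes of wall-clock time, so we are catching up - but only just.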
>>
>>> > Also, at what point is it safe to expose the Nominatim as a live service?
>>
>> As soon as the import is finished. Search queries might interfere with
>> the updates when your server gets swarmed with lots of parallel queries
>> but I doubt that you have enough traffic for that.
Yeah - shouldn't be too many - at this stage.
>> Just make sure to keep
>> the number of requests that can hit the database in parallel at a moderate
>> level. Use php-fpm with limited pools for that and experiment with the
>> limits until you get the maximum performance.
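Noted - as a starting point for that experimenting, I guess something like
this in the php-fpm pool config (e.g. pool.d/www.conf; values plucked out of
the air):

  pm = static
  pm.max_children = 10

which would cap us at 10 parallel requests hitting the database.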
>>
>> Sarah
>
>