[OSM-talk] osm2pgsql & planet: frustrations, cutoffs, and idempotence

Michal Migurski mike at stamen.com
Mon Oct 27 00:10:58 GMT 2008


Hi,

I've been trying to keep up to date with the dumps and diffs from
http://planet.openstreetmap.org/, and I'm running into a number of bugs
related to cutoff dates.

In keeping my Bay Area tiles
(http://mike.teczno.com/notes/cascadenik-openstreetmap.html) up to date,
I've been grabbing complete planet.osm dumps about once per month, and
filling in the intervening time with daily diffs. I've noticed some
misalignments between the data in the dumps and the osm2pgsql importer
that lead to unavoidable holes in the data.

It seems that they could be fixed in either osm2pgsql, the planet  
files, or both.

The final event in each weekly planet dump does not fall on an even  
day boundary. In the case of the most recent Oct. 22nd planet.osm, it  
was necessary to experiment with hourly diffs from that day to find  
that the boundary was approx. 2:00pm. Hourlies up to and including
2008102213-2008102214.osc.gz failed; hourlies after that succeeded. I
could go more granular here, checking the minute diffs as well for a  
more precise breakpoint, but it seems odd that the planet dump does  
not break cleanly on a midnight boundary so that it's possible to pick  
up the differences moving forward.
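
For reference, here is a minimal sketch of how the remaining hourly file names for a day can be generated once the cutoff hour is known. It assumes only the naming pattern visible in the file names above (YYYYMMDDHH-YYYYMMDDHH.osc.gz); the function name `hourly_diff_names` is made up for illustration and is not part of any OSM tool.

```python
from datetime import datetime, timedelta

def hourly_diff_names(start, end):
    """Yield hourly diff file names covering [start, end), one per hour.

    Assumes the YYYYMMDDHH-YYYYMMDDHH.osc.gz naming seen on the planet site.
    """
    step = timedelta(hours=1)
    t = start
    while t < end:
        yield "{:%Y%m%d%H}-{:%Y%m%d%H}.osc.gz".format(t, t + step)
        t += step

# Hourlies after the approximate 2:00pm cutoff on Oct. 22nd, up to midnight.
names = list(hourly_diff_names(datetime(2008, 10, 22, 14),
                               datetime(2008, 10, 23, 0)))
for name in names:
    print(name)
```

The first name produced is the hourly right after the last one that failed, and the last is the final hourly of the 22nd.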

osm2pgsql itself notifies the user of inconsistencies by failing. I
can see that effort has been put into making it more resilient (e.g.
http://trac.openstreetmap.org/changeset/10464). Does osm2pgsql have
something like a `--force` switch? I haven't been able to find one. In
looking at the diff files, it seems that it should be possible to
ignore possible conflicts by simply overwriting whatever's in the DB
with whatever's in the .osc file.

Finally, the boundaries between the hourlies and dailies seem  
misaligned.

After running the remaining hourlies for the 22nd, I attempted to pick  
up on the 23rd with a daily. The final hourly I used was  
2008102223-2008102300.osc.gz. It's my expectation that I should be  
able to immediately follow that with 20081023-20081024.osc.gz, but  
this led to a duplicate key violation, suggesting that there's an overlap  
between the two files. Continuing with hourlies *works*, but is  
tedious and I suspect slower than the dailies.
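
To make the expectation explicit: going by file names alone, the last hourly and the next daily should abut exactly. The sketch below checks only that, under the assumption that hourly names carry ten-digit timestamps and daily names eight-digit ones; `parse_span` and `abuts` are hypothetical helper names. It says nothing about what the files actually contain, which is where the overlap appears to be.

```python
from datetime import datetime

def parse_span(name):
    """Return (start, end) datetimes encoded in a diff file name."""
    stem = name.split(".", 1)[0]          # drop ".osc.gz"
    first, second = stem.split("-")
    # Ten digits means an hourly (YYYYMMDDHH), eight a daily (YYYYMMDD).
    fmt = "%Y%m%d%H" if len(first) == 10 else "%Y%m%d"
    return datetime.strptime(first, fmt), datetime.strptime(second, fmt)

def abuts(prev_name, next_name):
    """True if next_name starts exactly where prev_name ends, by name."""
    return parse_span(prev_name)[1] == parse_span(next_name)[0]

print(abuts("2008102223-2008102300.osc.gz", "20081023-20081024.osc.gz"))
```

By this test the two names line up, so the duplicate key violation has to come from the contents of the files rather than from the naming.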

My sense from reading other people's experiences has been that it's a  
common pattern to rely solely on the weekly planet dumps, incurring  
the substantial overhead of parsing and importing the full 5GB dump  
once every week, and then re-rendering the complete set of tiles.

My hope has been to proceed in a more incremental fashion, since this  
makes it possible to track what specific tiles need to be re-rendered  
on a near-constant schedule, based on actual content or activity, vs.  
simple cache expiration. Right now I'm doing this daily; I'd like to
do it as often as hourly.
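
As a rough illustration of what tracking specific tiles could look like, the sketch below maps the coordinates of changed nodes to tile addresses using the standard spherical-mercator tile numbering. It is an assumption that a renderer would be driven this way; `tile_for` and `dirty_tiles` are hypothetical names, and a real version would also need to handle ways that cross tile edges.

```python
import math

def tile_for(lat, lon, zoom):
    """Return the (x, y) tile containing a WGS84 point at a given zoom."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

def dirty_tiles(changed_points, zooms):
    """Return the set of (zoom, x, y) tiles touched by changed points."""
    return {(z,) + tile_for(lat, lon, z)
            for lat, lon in changed_points
            for z in zooms}

# Two made-up changed nodes in the Bay Area, at two zoom levels.
changed = [(37.7749, -122.4194), (37.8044, -122.2712)]
for tile in sorted(dirty_tiles(changed, zooms=[10, 12])):
    print(tile)
```

Only the tiles in the resulting set would need re-rendering after a diff is applied, instead of expiring the whole cache.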

I can see a few possible solutions.

The cutoff times for files on planet.openstreetmap.org could behave  
more consistently. A weekly dump should end at 11:59pm so that dailies  
can immediately pick up user activity. Hourly and daily dumps should  
be synchronized. This seems more difficult.

Or, osm2pgsql could be more fault-tolerant, so that potentially
overlapping .osm and .osc files can be safely used. As long as they  
are applied in chronological order, repetitions should be idempotent.  
Is this just a matter of futzing with the SQL commands to suppress  
index key collisions?
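
A minimal sketch of the idempotence idea, under loud assumptions: it uses SQLite from the Python standard library as a stand-in for PostgreSQL, a made-up `nodes` table rather than osm2pgsql's real schema, and a simplified tuple format rather than real .osc parsing. Creates and modifies are both treated as "write this row, replacing any existing one", so replaying an overlapping change list leaves the table unchanged.

```python
import sqlite3

def apply_changes(db, changes):
    """Apply (action, id, lat, lon) tuples so that replays are harmless."""
    for action, node_id, lat, lon in changes:
        if action == "delete":
            # Deleting a row that is already gone is a no-op.
            db.execute("DELETE FROM nodes WHERE id = ?", (node_id,))
        else:
            # "create" and "modify" both overwrite whatever is in the DB.
            db.execute(
                "INSERT OR REPLACE INTO nodes (id, lat, lon) VALUES (?, ?, ?)",
                (node_id, lat, lon),
            )
    db.commit()

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, lat REAL, lon REAL)")

changes = [
    ("create", 1, 37.80, -122.27),
    ("create", 2, 37.77, -122.42),
    ("modify", 1, 37.81, -122.27),
    ("delete", 2, None, None),
]

apply_changes(db, changes)
first = db.execute("SELECT id, lat, lon FROM nodes ORDER BY id").fetchall()

apply_changes(db, changes)  # replay the same (overlapping) changes
second = db.execute("SELECT id, lat, lon FROM nodes ORDER BY id").fetchall()

print(first == second)
```

Whether the same approach carries over to osm2pgsql depends on how its tables and derived geometries are maintained, which this sketch does not model.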

-mike.

----------------------------------------------------------------
michal migurski- mike at stamen.com
                  415.558.1610






