[OSM-talk] osm2pgsql & planet: frustrations, cutoffs, and idempotence
Michal Migurski
mike at stamen.com
Mon Oct 27 00:10:58 GMT 2008
Hi,
I've been trying to keep up to date with the dumps and diffs from
http://planet.openstreetmap.org/, and I'm running into a number of bugs
related to cutoff dates.
In keeping my Bay Area tiles
(http://mike.teczno.com/notes/cascadenik-openstreetmap.html) up to date,
I've been grabbing complete planet.osm dumps about once
per month, and filling in the intervening time with daily diffs. I've
noticed some misalignments between the data in the dumps and the
osm2pgsql importer that lead to unavoidable holes in the data.
It seems that they could be fixed in either osm2pgsql, the planet
files, or both.
The final event in each weekly planet dump does not fall on an even
day boundary. In the case of the most recent Oct. 22nd planet.osm, it
was necessary to experiment with hourly diffs from that day to find
that the boundary was approx. 2:00pm. Hourlies up to and including
2008102213-2008102214.osc.gz failed; hourlies after that succeeded. I
could go more granular here, checking the minute diffs as well for a
more precise breakpoint, but it seems odd that the planet dump does
not break cleanly on a midnight boundary so that it's possible to pick
up the differences moving forward.
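
As a rough sketch, the hourlies needed to get from a dump's cutoff to the next midnight could be enumerated like this. It assumes the file naming seen above (start hour, dash, end hour, then .osc.gz) and a cutoff that falls on a whole hour; the function name hourly_diffs is made up for illustration.

```python
from datetime import datetime, timedelta

def hourly_diffs(cutoff, until):
    """List hourly diff names covering cutoff..until (both on whole hours)."""
    names = []
    t = cutoff
    while t < until:
        nxt = t + timedelta(hours=1)
        # Same pattern as 2008102213-2008102214.osc.gz
        names.append(f"{t:%Y%m%d%H}-{nxt:%Y%m%d%H}.osc.gz")
        t = nxt
    return names

# Oct. 22nd dump, cutoff found by experiment to be about 14:00
diffs = hourly_diffs(datetime(2008, 10, 22, 14), datetime(2008, 10, 23))
print(diffs[0], diffs[-1], len(diffs))
```

This only automates the bookkeeping; the cutoff itself still has to be found by trial and error as described above.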
osm2pgsql itself notifies the user of inconsistencies by failing. I
can see that effort has been put into making it more resilient (e.g.
http://trac.openstreetmap.org/changeset/10464). Does osm2pgsql have
something like a `--force` switch? I haven't
been able to find one. Looking at the diff files, it seems that it
should be possible to ignore any conflicts by simply overwriting
whatever's in the DB with whatever's in the .osc file.
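
To illustrate the "just overwrite" idea, here is a minimal sketch that reads a diff and treats creates and modifies as the same operation. It assumes the usual osmChange layout of create, modify and delete blocks; the sample document and the name flatten_osc are invented for the example, and none of this is how osm2pgsql actually works.

```python
import xml.etree.ElementTree as ET

def flatten_osc(xml_text):
    """Reduce an osmChange document to (action, type, id) tuples,
    treating <create> and <modify> identically as 'put'."""
    ops = []
    for block in ET.fromstring(xml_text):
        action = "delete" if block.tag == "delete" else "put"
        for elem in block:
            ops.append((action, elem.tag, int(elem.get("id"))))
    return ops

# Made-up sample, not taken from a real diff
sample = """<osmChange version="0.5">
  <create><node id="1" lat="37.8" lon="-122.3"/></create>
  <modify><node id="2" lat="37.7" lon="-122.4"/></modify>
  <delete><way id="3"/></delete>
</osmChange>"""
print(flatten_osc(sample))
```

Once create and modify collapse into one "put", whether an object already exists in the DB no longer matters to the importer.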
Finally, the boundaries between the hourlies and dailies seem
misaligned.
After running the remaining hourlies for the 22nd, I attempted to pick
up on the 23rd with a daily. The final hourly I used was
2008102223-2008102300.osc.gz. It's my expectation that I should be
able to immediately follow that with 20081023-20081024.osc.gz, but
this led to a duplicate key violation, suggesting that there's an overlap
between the two files. Continuing with hourlies *works*, but is
tedious and I suspect slower than the dailies.
My sense from reading other people's experiences has been that it's a
common pattern to rely solely on the weekly planet dumps, incurring
the substantial overhead of parsing and importing the full 5GB dump
once every week, and then re-rendering the complete set of tiles.
My hope has been to proceed in a more incremental fashion, since this
makes it possible to track what specific tiles need to be re-rendered
on a near-constant schedule, based on actual content or activity, vs.
simple cache expiration. Right now I'm doing this daily; I'd like to
do it as often as hourly.
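
A sketch of the tile-tracking idea, assuming the standard spherical-mercator tile numbering used by the OSM slippy map and assuming that the coordinates of changed nodes have already been pulled out of a diff. The names tile_for and dirty_tiles are hypothetical.

```python
import math

def tile_for(lat, lon, zoom):
    """Return the (x, y) tile containing a point at the given zoom."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

def dirty_tiles(points, zoom):
    """Collapse changed (lat, lon) points into the set of tiles to re-render."""
    return {tile_for(lat, lon, zoom) for lat, lon in points}

# Two nearby edits in San Francisco land on a single zoom-10 tile
changed = [(37.7749, -122.4194), (37.7750, -122.4195)]
print(dirty_tiles(changed, 10))
```

This ignores ways that cross tile edges and labels that spill into neighbouring tiles, so a real setup would need some padding around each changed point.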
I can see a few possible solutions.
The cutoff times for files on planet.openstreetmap.org could behave
more consistently. A weekly dump should end at 11:59pm so that dailies
can immediately pick up user activity. Hourly and daily dumps should
be synchronized. This seems more difficult.
Or, osm2pgsql could be more fault-tolerant, so that potentially-
overlapping .osm and .osc files can be safely used. As long as they
are applied in chronological order, repetitions should be idempotent.
Is this just a matter of futzing with the SQL commands to suppress
index key collisions?
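
Roughly what I have in mind, sketched with Python's sqlite3 standing in for PostgreSQL. The nodes table and the tuple layout are invented for the example and are not osm2pgsql's schema. Each change deletes any existing row before inserting, so replaying an overlapping diff cannot hit a key collision.

```python
import sqlite3

def apply_ops(conn, ops):
    """Apply (action, id, lat, lon) tuples; safe to repeat."""
    with conn:  # one transaction per diff
        for action, node_id, lat, lon in ops:
            # Delete first so an existing row can never collide on the key.
            conn.execute("DELETE FROM nodes WHERE id = ?", (node_id,))
            if action != "delete":
                conn.execute(
                    "INSERT INTO nodes (id, lat, lon) VALUES (?, ?, ?)",
                    (node_id, lat, lon),
                )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, lat REAL, lon REAL)")

# Hypothetical overlap: the daily repeats everything in the hourly
hourly = [("create", 1, 37.80, -122.30), ("modify", 1, 37.81, -122.30)]
daily = hourly + [("delete", 2, None, None)]
apply_ops(conn, hourly)
apply_ops(conn, daily)
rows = conn.execute("SELECT id, lat, lon FROM nodes").fetchall()
print(rows)
```

The end state is the same whether the overlapping file is applied once or several times, provided the files go in chronological order.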
-mike.
----------------------------------------------------------------
michal migurski- mike at stamen.com
415.558.1610