[OSM-talk] osm2pgsql & planet: frustrations, cutoffs, and idempotence

Brett Henderson brett at bretth.com
Mon Oct 27 12:44:11 GMT 2008


Others have already commented on most of your points, but I'll add my 
thoughts in case there are any gaps.

Michal Migurski wrote:
> Hi,
>
> I've been trying to keep up to date with the dumps and diffs from
> http://planet.openstreetmap.org/, and I'm running into a number of bugs
> related to cutoff dates.
>
> In keeping my Bay Area tiles
> (http://mike.teczno.com/notes/cascadenik-openstreetmap.html) up to date,
> I've been grabbing complete planet.osm dumps about once  
> per month, and filling in the intervening time with daily diffs. I've  
> noticed some misalignments between the data in the dumps and the  
> osm2pgsql importer that leads to unavoidable holes in the data.
>
> It seems that they could be fixed in either osm2pgsql, the planet  
> files, or both.
>
> The final event in each weekly planet dump does not fall on an even  
> day boundary. In the case of the most recent Oct. 22nd planet.osm, it  
> was necessary to experiment with hourly diffs from that day to find  
> that the boundary was approx. 2:00pm. Hourlies up to and including  
> 2008102213-2008102214.osc.gz failed, hourlies after that succeeded. I  
> could go more granular here, checking the minute diffs as well for a  
> more precise breakpoint, but it seems odd that the planet dump does  
> not break cleanly on a midnight boundary so that it's possible to pick  
> up the differences moving forward.
>   
Yep, as others have commented, there are two table types in the osm 
database: current tables and history tables.  The planet dumper just 
reads the current tables, which is the fastest approach.  Unfortunately 
the current tables change constantly during the planet generation 
process, resulting in inconsistencies.  It is possible to produce a 
consistent snapshot by reading the history tables, and osmosis has the 
ability to do just that, but it is significantly slower.  It is also 
possible to produce a consistent snapshot by taking an inconsistent 
planet and applying changesets from a point in time prior to the planet 
dump beginning through to a point after completion.  This effectively 
produces the same result at much reduced load on the main database.
> osm2pgsql itself notifies the user of inconsistencies by failing. I  
> can see that effort has been put into making it more resilient (e.g.
> http://trac.openstreetmap.org/changeset/10464). Does osm2pgsql have
> something like a `--force` switch? I haven't  
> been able to find one. In looking at the diff files, it seems that it  
> should be possible to ignore possible conflicts by simply overwriting  
> whatever's in the DB with whatever's in the .osc file.
>   
Yes, that's true.  I can't comment on osm2pgsql but when osmosis 
processes changeset files it does exactly that.
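
A minimal sketch of that overwrite behaviour, not osmosis's actual code. It 
assumes a heavily simplified osmChange document (elements carry attributes 
only, no tags or way nodes), a made-up sample, and an in-memory dict 
standing in for the database.  The names apply_osc, store and SAMPLE_OSC 
are hypothetical.

```python
import xml.etree.ElementTree as ET


def apply_osc(store, osc_xml):
    """Apply an osmChange document to store, overwriting on conflict."""
    root = ET.fromstring(osc_xml)
    for action in root:          # <create>, <modify> or <delete>
        for elem in action:      # <node>, <way> or <relation>
            key = (elem.tag, int(elem.get("id")))
            if action.tag == "delete":
                # Deleting something that is already gone is not an error.
                store.pop(key, None)
            else:
                # create and modify are treated alike: last write wins.
                store[key] = dict(elem.attrib)
    return store


# Made-up sample data, for illustration only.
SAMPLE_OSC = """<osmChange version="0.5">
  <create><node id="1" lat="37.80" lon="-122.27"/></create>
  <modify><node id="2" lat="37.77" lon="-122.42"/></modify>
  <delete><node id="3" lat="0" lon="0"/></delete>
</osmChange>"""

store = {("node", 2): {"id": "2", "lat": "0", "lon": "0"}}
once = apply_osc(dict(store), SAMPLE_OSC)
twice = apply_osc(dict(once), SAMPLE_OSC)   # replaying the same file
```

Because nothing is rejected, replaying a file that overlaps with data 
already loaded leaves the store unchanged, provided files are applied in 
chronological order.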
> Finally, the boundaries between the hourlies and dailies seem  
> misaligned.
>   
This shouldn't be the case.
> After running the remaining hourlies for the 22nd, I attempted to pick  
> up on the 23rd with a daily. The final hourly I used was  
> 2008102223-2008102300.osc.gz. It's my expectation that I should be  
> able to immediately follow that with 20081023-20081024.osc.gz, but  
> this led to duplicate key violation suggesting that there's an overlap  
> between the two files. Continuing with hourlies *works*, but is  
> tedious and I suspect slower than the dailies.
>   
You should have been able to do what you've suggested.  If you are 
finding problems, please provide me with some example data which is 
misaligned between the two types of changesets.  I've gone to a fair bit 
of trouble to ensure that timestamp management is correct.  For example, 
all changesets and file names are using UTC even though the database 
itself is using BST.  If I've made a mistake somewhere I'd like to know 
about it.  Given that daily, hourly and minute changesets are using 
*identical* code, I find it hard to believe they're inconsistent with 
each other.
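
To illustrate the expected alignment, here is a small sketch that builds 
hourly and daily file names on UTC boundaries, following the naming seen 
in the file names quoted in this thread.  The helper names hourly_names 
and daily_name are hypothetical; this is not the code that generates the 
changesets.

```python
from datetime import date, datetime, timedelta, timezone


def hourly_names(day):
    """Hourly changeset file names covering one UTC day, in order."""
    start = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    names = []
    for hour in range(24):
        begin = start + timedelta(hours=hour)
        end = begin + timedelta(hours=1)
        names.append(f"{begin:%Y%m%d%H}-{end:%Y%m%d%H}.osc.gz")
    return names


def daily_name(day):
    """Daily changeset file name for one UTC day."""
    start = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    end = start + timedelta(days=1)
    return f"{start:%Y%m%d}-{end:%Y%m%d}.osc.gz"


hourlies = hourly_names(date(2008, 10, 22))
next_daily = daily_name(date(2008, 10, 23))
```

Under this scheme the last hourly of the 22nd ends at 2008102300, which is 
exactly where the daily for the 23rd begins, so the two should neither 
overlap nor leave a gap.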
> My sense from reading other people's experiences has been that it's a  
> common pattern to rely solely on the weekly planet dumps, incurring  
> the substantial overhead of parsing and importing the full 5GB dump  
> once every week, and then re-rendering the complete set of tiles.
>   
For a long time weekly planet dumps were the only bulk data available.  
Osmosis changesets have been on the scene for some time now though and 
are gradually being utilised by more and more clients.  As the planet 
grows, this will become more critical.  Who knows, if the kinks 
gradually get ironed out of the osm2pgsql program we may even begin to 
see the main mapnik tile generator move to using changesets.
> My hope has been to proceed in a more incremental fashion, since this  
> makes it possible to track what specific tiles need to be re-rendered  
> on a near-constant schedule, based on actual content or activity, vs.  
> simple cache expiration. Right now I'm doing this daily, I'd like to  
> do it as often as hourly.
>   
Yep, that was one of my original aims.
> I can see a few possible solutions.
>
> The cutoff times for files on planet.openstreetmap.org could behave  
> more consistently. A weekly dump should end at 11:59pm so that dailies  
> can immediately pick up user activity. Hourly and daily dumps should  
> be synchronized. This seems more difficult.
>   
You only need a single consistent snapshot to get started.  You can 
download a planet, then download the two daily changesets either side of 
the planet generation window, then use osmosis to patch the planet.  
This will give you a consistent snapshot.  Once you've imported that 
into your target database you can then start using daily changesets to 
keep up to date (or hourly or minute as appropriate).
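
As a rough sketch of the selection step, the following picks the daily 
changeset files whose combined span encloses the planet generation 
window.  The window times are assumptions for illustration (the true 
start and end of a given dump would need to be determined), and 
dailies_covering is a hypothetical helper, not part of osmosis.

```python
from datetime import datetime, timedelta, timezone


def dailies_covering(dump_start, dump_end):
    """Daily changeset names whose span encloses the dump window (UTC)."""
    day = dump_start.replace(hour=0, minute=0, second=0, microsecond=0)
    names = []
    while day < dump_end:
        nxt = day + timedelta(days=1)
        names.append(f"{day:%Y%m%d}-{nxt:%Y%m%d}.osc.gz")
        day = nxt
    return names


# Assumed window: dump began late on the 21st and finished on the 22nd.
start = datetime(2008, 10, 21, 22, 0, tzinfo=timezone.utc)
end = datetime(2008, 10, 22, 14, 0, tzinfo=timezone.utc)
files = dailies_covering(start, end)
```

The resulting files would then be applied to the downloaded planet in 
chronological order with osmosis to obtain the consistent snapshot.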

While it would be nice to have planet dumps already in consistent form, 
producing them that way adds significant overhead to the whole process.  
The inconsistency is not terribly hard to fix on the client side.
> Or, osm2pgsql could be more fault-tolerant, so that potentially- 
> overlapping .osm and .osc files can be safely used. As long as they  
> are applied in chronological order, repetitions should be idempotent.  
> Is this just a matter of futzing with the SQL commands to suppress  
> index key collisions?
>   
For the osmosis pgsql schema it has been necessary to do a check prior to 
every write so I know whether to do an insert or an update.  It adds a 
fair bit of overhead, but changesets are fairly small and can be applied 
quickly, so efficiency isn't usually an issue.
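
A minimal sketch of the check-then-write pattern, using Python's built-in 
sqlite3 module in place of PostgreSQL so that it runs standalone.  The 
nodes table and the upsert_node helper are hypothetical and are not the 
osmosis schema or code.

```python
import sqlite3


def upsert_node(conn, node_id, lat, lon):
    """Check for the row first, then UPDATE or INSERT accordingly."""
    exists = conn.execute(
        "SELECT 1 FROM nodes WHERE id = ?", (node_id,)
    ).fetchone()
    if exists:
        conn.execute(
            "UPDATE nodes SET lat = ?, lon = ? WHERE id = ?",
            (lat, lon, node_id),
        )
    else:
        conn.execute(
            "INSERT INTO nodes (id, lat, lon) VALUES (?, ?, ?)",
            (node_id, lat, lon),
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, lat REAL, lon REAL)")
upsert_node(conn, 1, 37.80, -122.27)
upsert_node(conn, 1, 37.80, -122.27)   # replayed change: no duplicate key error
upsert_node(conn, 1, 37.81, -122.27)   # later change overwrites the row
rows = conn.execute("SELECT id, lat, lon FROM nodes").fetchall()
```

The extra SELECT per write is the overhead mentioned above; in exchange, 
overlapping files can be replayed without key collisions.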

Brett




