[OSM-talk] osm2pgsql & planet: frustrations, cutoffs, and idempotence

Michal Migurski mike at stamen.com
Mon Oct 27 20:39:08 GMT 2008


>> Yep, as others have commented, there are two table types in the osm
>> database: current tables and history tables.  The planet dumper
>> just reads the current tables, which is the fastest approach.
>> Unfortunately the current tables change constantly during the
>> planet generation process, resulting in inconsistencies.  It is
>> possible to produce a consistent snapshot by reading the history
>> tables, and osmosis has the ability to do just that, but it is
>> significantly slower.  It is also possible to produce a consistent
>> snapshot by taking an inconsistent planet and applying changesets
>> from a point in time prior to the planet dump beginning through to
>> a point after completion.  This effectively produces the same
>> result at much reduced load on the main database.
>>
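
To convince myself that the last approach really converges, here is a
minimal sketch. It assumes each change carries the complete new state of
an element, and that creates/modifies are applied as upserts while
deleting a missing element is a no-op. The dict-based store and the
apply_changes name are made up for illustration; this is not how osmosis
or osm2pgsql actually store anything.

```python
def apply_changes(snapshot, changes):
    """Return a copy of snapshot with changes replayed in order.

    snapshot: dict mapping element id -> element state (hypothetical model)
    changes:  list of (action, element_id, state) tuples
    """
    result = dict(snapshot)
    for action, element_id, state in changes:
        if action in ("create", "modify"):
            # Upsert: insert or overwrite, so replaying a change that the
            # dump already contains cannot cause a duplicate key error.
            result[element_id] = state
        elif action == "delete":
            # Deleting an element that is already gone is a no-op.
            result.pop(element_id, None)
        else:
            raise ValueError(f"unknown action: {action}")
    return result


# State when the dump started, and everything that changed while it ran.
at_start = {1: "a", 2: "b"}
during_dump = [("modify", 1, "a2"), ("create", 3, "c"), ("delete", 2, None)]

# An inconsistent dump: element 1 was read before its edit, element 2
# after its deletion, element 3 after its creation.
inconsistent = {1: "a", 3: "c"}

patched = apply_changes(inconsistent, during_dump)
assert patched == apply_changes(at_start, during_dump)
# Replaying the same changes again leaves the result unchanged.
assert apply_changes(patched, during_dump) == patched
```

The point is only that the final state of each element is decided by the
last change that touches it, so it doesn't matter which half-applied
state the dump happened to catch.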

I'm liking Jochen Topf's suggestion here:

	"If the planet dump plus the diff from the same day is what everybody  
wants anyway, why not do this on the server side and hold the planet  
back after the first diff is available, run this over the planet and  
then publish that as the planet?"


>> Finally, the boundaries between the hourlies and dailies seem   
>> misaligned.
>>
>
> This shouldn't be the case.
>> After running the remaining hourlies for the 22nd, I attempted to
>> pick up on the 23rd with a daily. The final hourly I used was
>> 2008102223-2008102300.osc.gz. It's my expectation that I should be
>> able to immediately follow that with 20081023-20081024.osc.gz, but
>> this led to a duplicate key violation, suggesting that there's an
>> overlap between the two files. Continuing with hourlies *works*,
>> but is tedious and I suspect slower than the dailies.
>>
>
> You should have been able to do what you've suggested.  If you are  
> finding problems, please provide me with some example data which is  
> misaligned between the two types of changesets.

Try the two files mentioned above - that's where I saw this behavior,  
they're quite recent.

	2008102223-2008102300.osc.gz
	20081023-20081024.osc.gz
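
Going by the names alone, the two files should line up. A rough sketch
of that check, assuming the names encode start and end as YYYYMMDDHH
(hourly) or YYYYMMDD (daily) in the same time zone; parse_range and
is_contiguous are names I made up for this:

```python
from datetime import datetime


def parse_range(filename):
    """Parse 'START-END.osc.gz' into a (start, end) pair of datetimes."""
    stem = filename.split(".", 1)[0]
    start_text, end_text = stem.split("-")
    # Ten digits means an hourly stamp, eight digits a daily stamp.
    fmt = "%Y%m%d%H" if len(start_text) == 10 else "%Y%m%d"
    return datetime.strptime(start_text, fmt), datetime.strptime(end_text, fmt)


def is_contiguous(first, second):
    """True if the second file nominally starts where the first one ends."""
    return parse_range(first)[1] == parse_range(second)[0]


hourly = "2008102223-2008102300.osc.gz"
daily = "20081023-20081024.osc.gz"
assert is_contiguous(hourly, daily)
```

So if there is an overlap, it would have to be in the contents rather
than in the nominal cutoffs.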


>> My sense from reading other people's experiences has been that it's
>> a common pattern to rely solely on the weekly planet dumps,
>> incurring the substantial overhead of parsing and importing the
>> full 5GB dump once every week, and then re-rendering the complete
>> set of tiles.
>>
>
> For a long time weekly planet dumps were the only bulk data  
> available.  Osmosis changesets have been on the scene for some time  
> now though and are gradually being utilised by more and more  
> clients.  As the planet grows, this will become more critical.  Who  
> knows, if the kinks gradually get ironed out of the osm2pgsql  
> program we may even begin to see the main mapnik tile generator move  
> to using changesets.

I would love to rely on these exclusively; it's much more efficient.
But I was seeing a fair bit of information fall through the cracks,
which is why I'm re-syncing to the planet every four weeks.



>> I can see a few possible solutions.
>>
>> The cutoff times for files on planet.openstreetmap.org could
>> behave more consistently. A weekly dump should end at 11:59pm so
>> that dailies can immediately pick up user activity. Hourly and
>> daily dumps should be synchronized. This seems more difficult.
>>
>
> You only need a single consistent snapshot to get started.  You can  
> download a planet, then download the two daily changesets either  
> side of the planet generation window, then use osmosis to patch the  
> planet.  This will give you a consistent snapshot.  Once you've  
> imported that into your target database you can then start using  
> daily changesets to keep up to date (or hourly or minute as  
> appropriate).
>
> While it would be nice to have planet dumps already in consistent  
> form, it does add a significant overhead to the whole process.  It's  
> not terribly hard to fix on the client side.
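
For picking the dailies on either side of the generation window,
something like the sketch below would do. It assumes the daily naming
pattern seen above (YYYYMMDD-YYYYMMDD.osc.gz) and that I know roughly
when the dump started and finished; the times in the example are
invented, and daily_files_covering is just an illustrative name.

```python
from datetime import datetime, timedelta


def daily_files_covering(dump_start, dump_end):
    """List daily changeset names spanning the dump window, in order."""
    day = dump_start.date()
    last = dump_end.date()
    names = []
    while day <= last:
        next_day = day + timedelta(days=1)
        names.append(f"{day:%Y%m%d}-{next_day:%Y%m%d}.osc.gz")
        day = next_day
    return names


# Hypothetical dump that started in the evening and ran past midnight.
files = daily_files_covering(datetime(2008, 10, 22, 18, 0),
                             datetime(2008, 10, 23, 6, 0))
assert files == ["20081022-20081023.osc.gz", "20081023-20081024.osc.gz"]
```

Those files, applied in order to the downloaded planet, would be the
patch step described above.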

Probably what I need to do is get a fresh update of osm2pgsql. I can  
see now that the revision I'm using is older than #10464, where some  
inconsistency resilience was added.


-mike.


----------------------------------------------------------------
michal migurski- mike at stamen.com
                  415.558.1610






