[OSM-dev] Speeding up Osm2pgsql through parallelization?
kakrueger at gmail.com
Wed Sep 14 00:31:13 BST 2011
On 7/22/64 12:59 PM, Frederik Ramm wrote:
> partial answer:
> On 09/13/2011 02:07 AM, Kai Krueger wrote:
>> 2) Currently all the (diff-) import is done in a single transaction.
>> Therefore other db users (e.g. renderers) don't see any change until the
>> full transaction is committed. In order to do things in parallel,
>> however, there need to be intermediate commits.
>> The question, though, is whether this is valid. For the initial import this is
>> probably not a problem as there won't be any db users concurrently until
>> the import is complete. However, diff imports with concurrent rendering
>> is a different matter. What will committing pending ways do to
> Renderers use the geometry tables; the "pending" way is in the data
> table where it will not usually be touched by renderers. So I don't
> see a problem here. I am however not familiar with internal Postgres
> processing and I could imagine that there is a speed penalty in
> committing pending ways as opposed to resetting the pending flag in the
> same transaction where it was set.
Good point. Yes, the pending-way flag lives on the ways table and not on
the geometry tables used for rendering, so hopefully it shouldn't cause
any direct breakage of the rendering. What could happen is that you get
some temporal inconsistencies, in the sense that a single tile might
have some newer ways rendered while polygons are not showing up yet.
But that should hopefully not cause any real problems.
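
To make the visibility question concrete, here is a minimal sketch of
what intermediate commits mean for a concurrent reader. It uses Python's
built-in sqlite3 module purely as a stand-in for PostgreSQL; the table
name `ways`, the `pending` column, the batch size and the function names
are all made up for illustration and are not osm2pgsql's actual schema
or code.

```python
import os
import sqlite3
import tempfile


def import_in_batches(db_path, way_ids, batch_size):
    """Insert ways in batches, committing after each batch.

    Returns, per batch, the row count a second connection (playing the
    role of a renderer) saw just before and just after the commit.
    """
    writer = sqlite3.connect(db_path)
    reader = sqlite3.connect(db_path)
    # Hypothetical schema, for illustration only.
    writer.execute("CREATE TABLE ways (id INTEGER PRIMARY KEY, pending INTEGER)")
    writer.commit()

    seen = []
    for start in range(0, len(way_ids), batch_size):
        batch = way_ids[start:start + batch_size]
        writer.executemany(
            "INSERT INTO ways (id, pending) VALUES (?, 1)",
            [(way_id,) for way_id in batch],
        )
        # Uncommitted rows are invisible to the other connection.
        before = reader.execute("SELECT COUNT(*) FROM ways").fetchall()[0][0]
        writer.commit()
        # After the intermediate commit the reader sees the new batch.
        after = reader.execute("SELECT COUNT(*) FROM ways").fetchall()[0][0]
        seen.append((before, after))

    writer.close()
    reader.close()
    return seen


def demo():
    """Run the sketch against a throwaway database file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")
        return import_in_batches(path, list(range(1, 7)), 2)
```

With a single transaction the reader would see nothing until the very
end; with a commit per batch it sees the data grow step by step, which
is exactly where the temporal inconsistencies mentioned above could
come from.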
>> 3) Currently the string cache is not thread safe. It is possible to
>> disable it via a single preprocessor define and then parallelizing at
>> least doesn't lead to crashes, but I assume it is there for a good
>> reason. Presumably with a bit of work, it should be possible to get the
>> string cache thread safe though as well. So assuming the other two
>> points aren't show stoppers, this should be possible to fix.
> Have you considered multiprocessing (i.e. fork) instead of
> multithreading - would this perhaps make these things go away
> elegantly? Personally I abhor multithreading for the complexity it
> brings at (usually) little gain compared to simply forking a few
> worker processes but of course YMMV especially if you want tight
> communication between workers.
No, I hadn't considered multiprocessing, but again, that is a good point
worth exploring further. Currently, what I have done is tightly
integrated, with the threads sharing a loop counter, but you can
probably just split the work into independent sections per worker
process. Overall, this hopefully means it is worth exploring this avenue
further and trying to get a patch clean enough to consider applying to
osm2pgsql.
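
As a rough illustration of the "independent sections per worker process"
idea, here is a small sketch using Python's multiprocessing module
rather than osm2pgsql's C code. Instead of a shared loop counter, the
work is cut into contiguous slices up front and each worker process gets
its own slice. The names `split_ranges`, `process_slice` and `run` are
hypothetical, and the worker body is only a placeholder.

```python
import multiprocessing


def split_ranges(total, workers):
    """Split range(total) into at most `workers` contiguous (start, end) slices."""
    base, extra = divmod(total, workers)
    ranges = []
    start = 0
    for i in range(workers):
        # The first `extra` slices take one additional item each.
        end = start + base + (1 if i < extra else 0)
        if end > start:
            ranges.append((start, end))
        start = end
    return ranges


def process_slice(bounds):
    """Placeholder for processing the pending ways in [start, end)."""
    start, end = bounds
    # A real worker would presumably open its own database connection
    # here and commit its own transaction; this one only counts items.
    return end - start


def run(total, workers):
    """Process `total` items using a pool of worker processes."""
    with multiprocessing.Pool(processes=workers) as pool:
        return sum(pool.map(process_slice, split_ranges(total, workers)))
```

In a script, `run` would be called from under an
`if __name__ == "__main__":` guard. Since the slices do not overlap, the
workers need no shared state at all, which is what would sidestep the
string cache problem, at the price of each process building its own
cache.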