[OSM-dev] Speeding up Osm2pgsql through parallelization?
Kai Krueger
kakrueger at gmail.com
Tue Sep 13 01:07:47 BST 2011
Hi,
I was thinking about ways to try and speed up osm2pgsql. Currently a
good fraction of time, both in full imports and during diff-processing,
is spent in the "going over pending ways / relations" section. Therefore
speeding up that section should bring the overall time down quite a bit.
One idea for speeding up the "going over pending ways / relations"
section is to parallelize it.
Preliminary results indicate that using multiple threads can indeed
speed this section up substantially, depending on the size of the
import / db and the hardware available.
Currently, osm2pgsql fetches all ways / relations that are marked as
pending in an sql query and then linearly goes through each one
processing it. What I was thinking was to, just as before, fetch all
pending ways, but instead of going through linearly, have multiple
worker threads go through the list concurrently and process them in
parallel. If there is enough RAM to cache things, importing is CPU bound
and one can get nearly linear speed-up. If importing is I/O bound, it
might still speed things up, as more I/O requests can be submitted in
parallel, which may result in more throughput (at least on rotational
disks) due to better request ordering or, in the case of RAID, using
more spindles in parallel.
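
A minimal sketch of the worker-thread idea described above, in Python
rather than osm2pgsql's own code. The names process_pending and
process_one are hypothetical; process_one stands in for whatever work
is done per pending way or relation. The sketch only shows how a fixed
list is shared out between threads. In CPython it would not by itself
give a CPU-bound speed-up, because of the global interpreter lock.

```python
import queue
import threading


def process_pending(pending_ids, process_one, num_workers=4):
    """Process every id in pending_ids using num_workers threads.

    process_one is a hypothetical per-item callback. In a real importer
    each worker would also open its own database connection.
    """
    # Fill the queue up front: the list of pending ids is fetched once,
    # as before, and only the processing is spread across threads.
    work = queue.Queue()
    for item_id in pending_ids:
        work.put(item_id)

    results = {}
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                item_id = work.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            out = process_one(item_id)
            # Shared state is only touched while holding the lock.
            with results_lock:
                results[item_id] = out

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Pulling items from a shared queue, rather than splitting the list into
fixed slices, keeps all workers busy even when some ways or relations
take much longer to process than others.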
However, on the way to parallelize this, I hit a bunch of "road blocks".
Although I hacked around them in my initial patch, I am not sure that
was always valid, so before proceeding any further I wanted to ask
whether these ideas are valid, feasible and worth pursuing.
The (potential) road blocks I have hit so far are the following:
1) The underlying assumption is that processing the pending ways and
relations (once the normal (diff-)import of nodes ways and relations is
finished) is independent per way / relation and therefore it is valid to
process them in parallel. In particular, is this true for all of the
output modes, i.e. including Nominatim?
2) Currently all the (diff-) import is done in a single transaction.
Therefore other db users (e.g. renderers) don't see any change until the
full transaction is committed. In order to do things in parallel,
however, there needs to be intermediary commits, so that the different
worker threads (each having their own db connection) can see the first
stage of importing nodes / ways / relations. Thus, there needs to be a
commit after the stage of reading in nodes, ways and relations, but
before the stage of "going over pending ways / relations".
The question, though, is whether this is valid. For the initial import this is
probably not a problem as there won't be any db users concurrently until
the import is complete. However, diff imports with concurrent rendering
is a different matter. What will committing pending ways do to rendering?
3) Currently the string cache is not thread safe. It is possible to
disable it via a single preprocessor define and then parallelizing at
least doesn't lead to crashes, but I assume it is there for a good
reason. Presumably, with a bit of work, it should be possible to make
the string cache thread safe as well. So assuming the other two
points aren't show stoppers, this should be possible to fix.
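
As an illustration of point 3, here is a minimal sketch of a
lock-protected string cache in Python. It assumes the cache is
essentially a shared table that hands back one stored copy per distinct
string, which is an assumption and not a description of osm2pgsql's
actual implementation. The class name StringCache and its intern
method are hypothetical.

```python
import threading


class StringCache:
    """Hypothetical thread-safe cache that keeps one copy per string."""

    def __init__(self):
        self._lock = threading.Lock()
        self._strings = {}

    def intern(self, s):
        # The lookup and the insert happen under one lock, so two
        # threads can never both insert their own copy of the same
        # string.
        with self._lock:
            return self._strings.setdefault(s, s)

    def __len__(self):
        with self._lock:
            return len(self._strings)
```

A single global lock is the simplest way to make such a cache safe. If
it turned out to be a bottleneck with many workers, per-thread caches or
finer-grained locking would be the obvious alternatives to measure.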
Any thoughts on these points? Do you know of further problems with this
approach, or is it worth pursuing further and getting it into a
committable state?
Thanks,
Kai