[OSM-dev] Speeding up Osm2pgsql through parallelization?

Tue Sep 13 01:07:47 BST 2011

Hi,

I have been thinking about ways to speed up osm2pgsql. Currently a 
good fraction of the time, both in full imports and during diff 
processing, is spent in the "going over pending ways / relations" 
section, so speeding up that section should bring the overall time 
down quite a bit. One way to do that would be to parallelize it.

Preliminary results indicate that using multiple threads can indeed 
speed this section up substantially, depending on the size of the 
import / db and the hardware available.

Currently, osm2pgsql fetches all ways / relations that are marked as 
pending with an SQL query and then goes through them linearly, 
processing one at a time. My idea is to fetch all pending ways just as 
before, but instead of going through them linearly, have multiple 
worker threads go through the list concurrently and process them in 
parallel. If there is enough RAM to cache things, importing is CPU 
bound and one can get a nearly linear speedup. If importing is I/O 
bound, it might still speed things up, as more I/O requests can be 
submitted in parallel, which may result in more throughput (at least on 
rotational disks) due to better request ordering, or due to using more 
spindles in parallel in the case of RAID.
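
As a rough illustration of the scheme described above, here is a 
minimal sketch in Python (not osm2pgsql code). The names 
process_way, worker and process_pending are hypothetical, and the 
per-way work is a placeholder. In CPython the global interpreter lock 
means this only shows the structure, not the CPU-bound speedup; the 
real implementation would use native threads in C.

```python
import queue
import threading


def process_way(way_id):
    """Hypothetical placeholder for the real per-way processing."""
    return way_id * 2


def worker(pending, results, lock):
    # In the real scheme each worker would open its own db connection here.
    while True:
        try:
            way_id = pending.get_nowait()
        except queue.Empty:
            return  # no pending ways left, the worker exits
        out = process_way(way_id)
        with lock:  # protect the shared result store
            results[way_id] = out


def process_pending(pending_ids, num_workers=4):
    # Fetch the whole pending list up front, as before, then share it.
    pending = queue.Queue()
    for way_id in pending_ids:
        pending.put(way_id)
    results = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(pending, results, lock))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


results = process_pending(range(1, 101))
```

Pulling ids from a shared queue, rather than splitting the list into 
fixed chunks, keeps all workers busy even if some ways take much 
longer to process than others.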

However, on the way to parallelizing this, I hit a number of "road 
blocks". In my initial patch I hacked around them, but I am not sure 
that was always valid, so before proceeding any further I wanted to ask 
whether these ideas are valid, feasible and worth pursuing.

The (potential) road blocks I have hit so far are the following:

1) The underlying assumption is that processing the pending ways and 
relations (once the normal (diff-)import of nodes, ways and relations 
is finished) is independent per way / relation and that it is therefore 
valid to process them in parallel. In particular, is this true for all 
of the output modes, including Nominatim?

2) Currently the whole (diff-)import is done in a single transaction. 
Therefore other db users (e.g. renderers) don't see any change until 
the full transaction is committed. In order to do things in parallel, 
however, there need to be intermediate commits, so that the different 
worker threads (each having its own db connection) can see the result 
of the first stage of importing nodes / ways / relations. Thus, there 
needs to be a commit after the stage of reading in nodes, ways and 
relations, but before the stage of "going over pending ways / 
relations".

The question, though, is whether this is valid. For the initial import 
it is probably not a problem, as there won't be any concurrent db users 
until the import is complete. However, diff imports with concurrent 
rendering are a different matter. What will committing before the 
pending ways are processed do to rendering?
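
The visibility issue behind point 2 can be illustrated with a small 
sketch. It uses Python's sqlite3 module purely as a stand-in for 
PostgreSQL, and the table name pending_ways is hypothetical. It only 
shows that a second connection cannot see rows from the first 
connection's open transaction until that transaction is committed.

```python
import os
import sqlite3
import tempfile

# A file-backed database, so that two separate connections can share it.
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")

importer = sqlite3.connect(db_path)  # stands in for the main import connection
importer.execute("CREATE TABLE pending_ways (id INTEGER PRIMARY KEY)")
importer.commit()

worker = sqlite3.connect(db_path)  # stands in for one worker's own connection


def count_visible(conn):
    # fetchall() exhausts the query, so no read lock is left behind.
    return conn.execute("SELECT count(*) FROM pending_ways").fetchall()[0][0]


# Stage 1: the importer writes inside a transaction that is still open.
importer.executemany(
    "INSERT INTO pending_ways (id) VALUES (?)", [(1,), (2,), (3,)]
)
seen_before_commit = count_visible(worker)

# The intermediate commit makes stage 1 visible to the other connection.
importer.commit()
seen_after_commit = count_visible(worker)

importer.close()
worker.close()
```

The same commit that lets the workers see the first stage also makes 
it visible to every other db user, which is exactly the concern for 
concurrent rendering.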

3) Currently the string cache is not thread safe. It is possible to 
disable it via a single preprocessor define, and then parallelizing at 
least doesn't lead to crashes, but I assume the cache is there for a 
good reason. Presumably, with a bit of work, it should be possible to 
make the string cache thread safe as well. So assuming the other two 
points aren't show stoppers, this should be possible to fix.
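
One simple way to make such a cache thread safe is a single lock 
around every lookup. The sketch below is in Python and assumes the 
cache is essentially a string-interning table; the class StringCache 
and its methods are hypothetical and do not reflect how osm2pgsql's 
string cache is actually implemented.

```python
import threading


class StringCache:
    """Hypothetical interning cache guarded by a single lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._strings = {}

    def intern(self, text):
        # Return the one shared copy of text, storing it on first use.
        with self._lock:
            return self._strings.setdefault(text, text)

    def __len__(self):
        with self._lock:
            return len(self._strings)


cache = StringCache()


def worker():
    # Every thread interns the same ten distinct strings many times over.
    for i in range(1000):
        cache.intern("highway=" + str(i % 10))


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

A single lock serializes all cache accesses, so under heavy contention 
it could eat into the gains from parallelizing; per-thread caches would 
be an alternative, at the cost of more memory.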

Any thoughts on these points? Do you know of further problems with 
this approach, or is it worth pursuing further and getting it into a 
committable state?

Thanks,

Kai