[OSM-dev] Speeding up Osm2pgsql through parallelization?

Hartmut Holzgraefe hartmut.holzgraefe at gmail.com
Wed Sep 14 08:54:52 BST 2011


On 09/14/2011 01:22 AM, Kai Krueger wrote:

> It probably would be easier to use CLUSTER and then let postgresql sort
> out the rest. It would also make it easier to occasionally "re-cluster".
> I guess it wouldn't be too difficult to test it out, if only it didn't
> take so long to run those tests...
> 
> Was there a reason not to use it in the first place?

I've been wondering about that, too, but after reading the manual
section on CLUSTER all the way to the end I found that the
tmp->sort->rename way is actually recommended for tables that are
too large to fit into RAM.

See http://www.postgresql.org/docs/9.0/static/sql-cluster.html

  [CLUSTER] can be slow on large tables because the rows are fetched
  from the table in index order, and if the table is disordered, the
  entries are on random pages, so there is one disk page retrieved for
  every row moved. (PostgreSQL has a cache, but the majority of a big
  table will not fit in the cache.)

while

  [tmp->sort->rename] uses the PostgreSQL sorting code to produce the
  desired order; this is usually much faster than an index scan for
  disordered data.
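
In SQL, that tmp->sort->rename pattern boils down to roughly the
following per table (table, column and index names are illustrative,
along the lines of what osm2pgsql uses):

  CREATE TABLE tmp AS
    SELECT * FROM planet_osm_line ORDER BY way;
  DROP TABLE planet_osm_line;
  ALTER TABLE tmp RENAME TO planet_osm_line;
  CREATE INDEX planet_osm_line_index
    ON planet_osm_line USING gist (way);
  ANALYZE planet_osm_line;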

For the sorting code to perform well, work_mem is the key setting,
while the index recreation step needs maintenance_work_mem.
Currently the wiki guide on PostgreSQL configuration suggests static
values for both (and until recently did not mention work_mem at
all).
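
With static values that means fixed entries in postgresql.conf,
along these lines (the numbers are illustrative, not the wiki's
exact recommendations):

  # postgresql.conf
  work_mem = 256MB             # per sort/hash operation
  maintenance_work_mem = 1GB   # used by CREATE INDEX, VACUUM, ...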

There's also an issue with osm2pgsql not really returning its cache
memory to the operating system due to heap fragmentation.

Better results for large imports (table sizes much larger than the
RAM size) could probably be achieved by:

* making sure osm2pgsql properly returns the memory used for its
  cache to the operating system; I've got a working patch for this:

   https://github.com/hholzgra/osm2pgsql/tree/freeable_cache

* serializing the index creation and clustering steps

  running these in parallel makes sense when everything fits into
  the caches in RAM and things become CPU bound, but large imports
  will be IO bound anyway, and parallel index builds just lower the
  cache hit rate, causing even more IO load

* starting with low default work_mem and maintenance_work_mem
  settings and raising them on a per-statement level, making the
  appropriate buffer for a given operation (work_mem for ORDER BY,
  maintenance_work_mem for CREATE INDEX) as large as possible and
  then shrinking it back to its default afterwards (a minimal
  sketch follows below)
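
A minimal sketch of that per-statement approach, again with
illustrative names and sizes:

  -- keep small defaults in postgresql.conf, then per statement:
  SET work_mem = '2GB';              -- only for the big sort
  CREATE TABLE tmp AS
    SELECT * FROM planet_osm_line ORDER BY way;
  RESET work_mem;
  -- (DROP/RENAME steps as above)

  SET maintenance_work_mem = '2GB';  -- only for the index build
  CREATE INDEX planet_osm_line_index
    ON planet_osm_line USING gist (way);
  RESET maintenance_work_mem;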

Whether CLUSTER or the current approach is better/faster for our
imports needs to be benchmarked; my personal bet is that CLUSTER
wins, as our data distribution over time is not totally random,
but this really needs to be tested.

(one additional advantage of CLUSTER would be that peak disk
space usage during the operation would only be about two times
the data size instead of three times with the current approach)
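
For comparison, the CLUSTER variant would be a single statement per
table (index name again illustrative):

  CLUSTER planet_osm_line USING planet_osm_line_index;
  ANALYZE planet_osm_line;

  -- later, a plain CLUSTER re-uses the remembered index, which is
  -- what makes the occasional "re-cluster" easy:
  CLUSTER planet_osm_line;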

-- 
hartmut


