[OSM-dev] parallel and distributed bzip2

Frederik Ramm frederik at remote.org
Sat Aug 23 00:57:33 BST 2008


For those of us fighting the abysmal performance of bzip2 when
creating planet files and such, in case you hadn't heard of these:

There's "pbzip2", readily available in the Debian/Ubuntu repositories,
which gives near-linear speedup by using as many CPUs as you have. If
you are adventurous, there is even dbzip2, which can use all those
spare machines you have sitting around for distributed bzipping:
http://www.mediawiki.org/wiki/Dbzip2 (but I haven't managed to find the
latest source for this, and it is flagged "experimental").
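
As a rough sketch of how pbzip2 might be dropped into a script: the
function name compress_parallel and the file name planet.osm are made
up for illustration, and the fallback to plain bzip2 is my own
assumption about what you would want on a machine without pbzip2.

```shell
# Compress a file with pbzip2 if it is installed, else with plain bzip2.
# $1 = file to compress, $2 = number of CPUs to use (defaults to 2).
# -k keeps the input file; -p<N> is pbzip2's processor-count option.
compress_parallel() {
    file="$1"
    jobs="${2:-2}"
    if command -v pbzip2 >/dev/null 2>&1; then
        pbzip2 -p"$jobs" -k "$file"
    else
        bzip2 -k "$file"
    fi
}
```

Usage would be something like "compress_parallel planet.osm 4", which
leaves planet.osm in place and writes planet.osm.bz2 next to it.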

Further, I have found out that a block size of 200k (-2) actually gives
better compression than the much slower default of 900k. Looking at our
planet files more closely, I see that this is evidently a known fact,
since they use a 200k block size as well.
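
If you want to check the block size effect on your own data, something
like the following should do. It is only a sketch: the function name
compare_block_sizes is invented here, and the result will depend on the
input file, so treat it as a way to measure rather than as a
confirmation.

```shell
# Print the compressed size of a file at block sizes 200k (-2) and 900k (-9).
# The input file is left untouched because bzip2 -c writes to stdout.
compare_block_sizes() {
    input="$1"
    for level in 2 9; do
        # wc -c counts the compressed bytes; tr strips padding some wc add
        size=$(bzip2 -"$level" -c "$input" | wc -c | tr -d ' ')
        echo "-$level: $size bytes"
    done
}
```

Run it as "compare_block_sizes foo.osm" and compare the two numbers.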

I was a bit frustrated with pbzip2 because for my setup I need streaming
operation, and pbzip2 supports writing to stdout but not reading from
stdin.

Good old 7z to the rescue. I don't like 7z, it talks too much and has 
the feel of a DOS program, but it *can* do parallel compression *with* 
piping for bzip2 files:

% 7z a dummy -tbzip2 -si -so < foo.osm > foo.osm.bz2

Don't ask about the strange command line, I already said I don't like it ;-)

The bzip2 files created by 7z/pbzip2 are generally a little bit larger 
than when using non-parallel bzip2, but fully compatible.
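
One way to convince yourself of the compatibility is to decompress the
parallel output with plain bzip2 and compare it with the original. A
minimal sketch, with roundtrip_ok being a made-up helper name:

```shell
# Succeeds (exit status 0) if decompressing $2 with plain bzip2
# reproduces the original file $1 byte for byte.
roundtrip_ok() {
    original="$1"
    compressed="$2"
    # cmp reads the decompressed stream from stdin ("-") and stays silent
    bzip2 -dc "$compressed" | cmp -s - "$original"
}
```

For example: roundtrip_ok foo.osm foo.osm.bz2 && echo "compatible"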


Frederik Ramm  ##  eMail frederik at remote.org  ##  N49°00'09" E008°23'33"
