[OSM-dev] parallel and distributed bzip2
Frederik Ramm
frederik at remote.org
Sat Aug 23 00:57:33 BST 2008
Hi,
for those of us fighting against the abysmal performance of bzip2
when creating planet files and such, in case you hadn't heard of these:
There's "pbzip2", readily available in the Debian/Ubuntu repositories, which
gives near-linear speedup by using as many CPUs as you have. If you are
adventurous, there is even dbzip2, which can use all those spare machines
you have sitting around for distributed bzipping:
http://www.mediawiki.org/wiki/Dbzip2 (but I haven't managed to find the
latest source for it, and it is flagged "experimental").
Further, I have found out that a block size of 200k (-2) actually gives
better compression than the much slower default of 900k (-9). Examining our
planet files more closely, I see that this must already be known, since
they use a 200k block size as well.
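
A minimal sketch for checking the block-size claim on your own data, using
Python's standard bz2 module, where compresslevel N corresponds to bzip2's -N
switch (blocks of N x 100k). The function name compare_block_sizes is
hypothetical, and results will depend on the input, so treat it as a
measuring aid rather than a confirmation:

```python
import bz2


def compare_block_sizes(data: bytes, levels=(2, 9)) -> dict:
    """Return {level: compressed size in bytes} for each bzip2 level.

    Level 2 means 200k blocks (bzip2 -2), level 9 means 900k blocks
    (bzip2 -9, the default of the command-line tool).
    """
    return {
        level: len(bz2.compress(data, compresslevel=level))
        for level in levels
    }
```

Reading a representative slice of a planet file into memory and passing it to
compare_block_sizes gives the two sizes side by side.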
I was a bit frustrated with pbzip2 because for my setup I need streaming
operation, and pbzip2 supports writing to stdout but not reading from
stdin.
Good old 7z to the rescue. I don't like 7z, it talks too much and has
the feel of a DOS program, but it *can* do parallel compression *with*
piping for bzip2 files:
% 7z a dummy -tbzip2 -si -so < foo.osm > foo.osm.bz2
Don't ask about the strange command line, I already said I don't like it ;-)
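
In the command above, -si and -so make 7z read from stdin and write to
stdout. For comparison, here is a single-threaded sketch of the same
streaming pattern in Python, assuming the standard bz2 module; the name
stream_compress and the buffer size are illustrative choices, and this does
not give you the parallel speedup:

```python
import bz2


def stream_compress(src, dst, level: int = 2, bufsize: int = 1 << 20) -> None:
    """Read binary data from src until EOF and write one bzip2 stream to dst.

    Only bufsize bytes of input are held at a time, so this works on pipes.
    """
    compressor = bz2.BZ2Compressor(level)
    while True:
        buf = src.read(bufsize)
        if not buf:
            break
        # compress() may return b"" until a full block has been collected.
        dst.write(compressor.compress(buf))
    # flush() emits the final, possibly partial, block and the stream trailer.
    dst.write(compressor.flush())
```

Calling stream_compress(sys.stdin.buffer, sys.stdout.buffer) turns this into
a filter that can sit in the middle of a pipeline.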
The bzip2 files created by 7z/pbzip2 are generally a little bit larger
than when using non-parallel bzip2, but fully compatible.
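
One way to see why parallel output can be slightly larger yet still
compatible: if the input is cut into chunks and each chunk is compressed
independently, the concatenated result remains something a bzip2
decompressor can read, but every chunk pays for its own header and cannot
share context with its neighbours. The sketch below illustrates that general
idea with Python's standard library; it is not the actual pbzip2 or 7z
implementation, and CHUNK_SIZE and parallel_bzip2 are hypothetical names:

```python
import bz2
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 200 * 1000  # assumption: roughly one 200k block per chunk


def _compress_chunk(chunk: bytes) -> bytes:
    # Each chunk becomes a complete, standalone bzip2 stream.
    return bz2.compress(chunk, compresslevel=2)


def parallel_bzip2(data: bytes, workers: int = 4) -> bytes:
    """Compress data chunk by chunk in worker threads and join the streams."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    if not chunks:
        chunks = [b""]  # still produce a valid (empty) bzip2 stream
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so the streams join up correctly.
        streams = list(pool.map(_compress_chunk, chunks))
    return b"".join(streams)
```

bz2.decompress accepts such concatenated streams, so
bz2.decompress(parallel_bzip2(data)) returns the original data.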
Bye
Frederik
--
Frederik Ramm ## eMail frederik at remote.org ## N49°00'09" E008°23'33"