[OSM-dev] parallel and distributed bzip2

Ævar Arnfjörð Bjarmason avarab at gmail.com
Sat Aug 23 02:42:48 BST 2008


On Fri, Aug 22, 2008 at 11:57 PM, Frederik Ramm <frederik at remote.org> wrote:
> Hi,
>
>    for those of us fighting against the abysmal performance of bzip2
> when creating planet files and such, in case you hadn't heard of these:
>
> There's "pbzip2", readily available in Debian/Ubuntu repositories, which
> gives near-linear speedup by utilizing as many CPUs as you have, and if
> you are adventurous then even dbzip2 which is able to use all those
> spare machines you have sitting around for distributed bzipping:
> http://www.mediawiki.org/wiki/Dbzip2 (but I haven't managed to find the
> latest source for this and it is flagged "experimental").

svn co http://svn.wikimedia.org/svnroot/mediawiki/trunk/dbzip2

> Further, I have found out that a block size of 200k (-2) actually gives
> better compression than the much slower default of 900k; examining our
> planet files more closely, I see that this is evidently a known fact,
> since they also use a 200k block size.
>
> I was a bit frustrated with pbzip2 because for my setup I need streaming
> operation, and pbzip2 supports writing to stdout but not reading from
> stdin.

dbzip2 supports reading from STDIN and writing to STDOUT.
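
For reference, plain streaming bzip2 compression can also be sketched with
Python's standard bz2 module. This is only an illustration, not what dbzip2
or pbzip2 do internally: it is single-threaded, so it gives no speedup. The
function name stream_compress and the 1 MiB read buffer are my own choices;
level 2 corresponds to the 200k block size (-2) mentioned above.

```python
import bz2


def stream_compress(src, dst, level=2, bufsize=1 << 20):
    """Compress the binary stream src into dst as bzip2, chunk by chunk.

    level N selects a block size of N * 100k, like bzip2 -N.
    Only bufsize bytes of input are held in memory at a time.
    """
    comp = bz2.BZ2Compressor(level)
    while True:
        chunk = src.read(bufsize)
        if not chunk:
            break
        # compress() may return b"" until a full block is ready
        dst.write(comp.compress(chunk))
    # flush() emits whatever is still buffered plus the stream trailer
    dst.write(comp.flush())
```

Calling stream_compress(sys.stdin.buffer, sys.stdout.buffer) would turn this
into a stdin-to-stdout filter.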

> Good old 7z to the rescue. I don't like 7z, it talks too much and has
> the feel of a DOS program, but it *can* do parallel compression *with*
> piping for bzip2 files:
>
> % 7z a dummy -tbzip2 -si -so < foo.osm > foo.osm.bz2
>
> Don't ask about the strange command line, I already said I don't like it ;-)
>
> The bzip2 files created by 7z/pbzip2 are generally a little bit larger
> than when using non-parallel bzip2, but fully compatible.
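
The compatibility comes from the fact that a file made of several complete
bzip2 streams back to back is still readable by ordinary decompressors. As
far as I understand it, that is roughly how pbzip2 works: split the input,
compress each piece independently, concatenate. A minimal sketch in Python,
assuming the whole input fits in memory; parallel_bzip2, chunk_size and
workers are hypothetical names for illustration, not anything from pbzip2 or
7z. Threads are used on the assumption that the bz2 module releases the GIL
while compressing.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def parallel_bzip2(data, chunk_size=900_000, workers=4, level=2):
    """Compress data as a concatenation of independent bzip2 streams."""
    if not data:
        return bz2.compress(b"", level)
    # Cut the input into fixed-size pieces; each becomes its own stream.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so the parts line up correctly.
        parts = list(pool.map(lambda c: bz2.compress(c, level), chunks))
    return b"".join(parts)
```

bz2.decompress() in Python 3.3 and later reads such multi-stream output back
as one piece. Each stream carries its own header and trailer, which is one
reason the result tends to come out slightly larger than a single stream.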



