[OSM-dev] Compression types in PBF Format

Wed Dec 1 02:56:16 GMT 2010

On Tue, Nov 30, 2010 at 7:29 PM, Stefan de Konink <stefan at konink.de> wrote:
> Hi Scott,
>
>
> Op 01-12-10 00:41, Scott Crosby schreef:
>> The real question is does supporting bzip2/lzma offer advantages that
>> are commensurate with the added implementation complexity, not just in
>> pbf2osm but in every other reader too.
>
> If any of gzip/bzip2/lzma in general gives a better compression ratio
> (20% smaller), then that compression scheme should become the default
> format, since (sadly) PBF is turning into an 'archival' format as opposed
> to a wire format.

I don't see anything in principle that keeps pbf from being a wire format.

>
>> Would you be willing to run an experiment with LZMA? If it shaves a
>> gigabyte off of the planet, then I'd say it's worth further
>> consideration; if it shaves 100MB, then it's not. Make a case for why
>> it should be included.
>
> I completely agree. But experimenting with LZMA means first an osm2pbf
> that supports LZMA.

Or, hacking it into osmosis, which has the rest of the code already written.
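
For a first rough estimate before touching osmosis at all, something
like the following Python sketch could compare the three codecs on the
same payload. The function name and the sample data are made up for
illustration; meaningful numbers would need real uncompressed PBF block
payloads from the planet.

```python
import bz2
import lzma
import zlib


def compare_compressors(payload: bytes) -> dict:
    """Return the compressed size in bytes for each candidate codec."""
    codecs = {
        "zlib": (zlib.compress, zlib.decompress),
        "bzip2": (bz2.compress, bz2.decompress),
        "lzma": (lzma.compress, lzma.decompress),
    }
    sizes = {"raw": len(payload)}
    for name, (compress, decompress) in codecs.items():
        packed = compress(payload)
        # Sanity check: every codec must round-trip losslessly.
        assert decompress(packed) == payload
        sizes[name] = len(packed)
    return sizes


if __name__ == "__main__":
    # Hypothetical stand-in for one uncompressed block payload.
    sample = b"".join(
        b"node %d highway=residential\n" % i for i in range(20000)
    )
    for name, size in compare_compressors(sample).items():
        print(f"{name:>5}: {size} bytes")
```

Summing the per-codec sizes over all blocks of a planet file would give
the "gigabyte or 100MB" figure asked about above.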

> And currently I feel that the only 'true' tool that
> should do something like this should be named pgsql2pbf.

I expect it will be written eventually. The planet has doubled in size
in the last year.

> I honestly
> cannot find a single reason why it would be good to use the XML as
> intermediate format, except for legacy support.

It is human readable and much more corruption resistant. (Well, XML is
corruption resistant, and bzipped XML is too. Gzipped XML isn't, unless
made with the option --rsyncable.) It makes a much more secure
archival format.

> <indepth>
> And for the reader; that only was presented my flame to the osmosis
> implementation;
>
> Out of the blue the OSMOSIS implementation started to introduce -1
> userid's, this is in no place documented, neither is it a default at
> present to represent past anonymous edits with a negative userid.

Actually, osmosis used OsmUser.NONE to represent those anonymous
edits. The problem is how to represent those within the limitations of
the PBF format (see below)

> Especially since at that time the uid's couldn't be negative (by spec)
> and the format specifies 'has_uid'.

DenseInfo doesn't have a has_uid method to check, as it delta-encodes uid's.
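
To illustrate why there is nothing like has_uid to check: in a
delta-coded array every slot is just the difference from the previous
value, so there is no per-element "present" flag. A minimal Python
sketch, with made-up uids and hypothetical function names (this is not
the osmosis code):

```python
from itertools import accumulate


def delta_encode(values):
    """Store each value as the difference from the previous one."""
    deltas = []
    previous = 0
    for value in values:
        deltas.append(value - previous)
        previous = value
    return deltas


def delta_decode(deltas):
    """Rebuild the original values as a running sum of the deltas."""
    return list(accumulate(deltas))


if __name__ == "__main__":
    # Made-up uids; -1 stands in for an anonymous edit.
    uids = [1021, 1021, 75, -1, 1021]
    deltas = delta_encode(uids)
    print(deltas)
    print(delta_decode(deltas) == uids)
```

Every position has to carry some integer, so "no user" can only be
expressed by reserving a particular number for it.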

> </indepth>

>
> Since the current osmformat.proto still has a int32 for a uid, which is
> in fact always positive number in the openstreetmap database,

To be totally pedantic, the domain of UID's is either
    {set of all nonnegative integers} + 'NULL'.
OR
    {set of all positive integers} + 'NULL'.

This is an edge case, which you correctly identified, and we came up
with a resolution. There's no 'right' fix, unfortunately, within the
limitations of the PBF format. The problem is that the PBF format
cannot express NULL, meaning no such user. Unless all metadata is
stripped, it must encode a UID *number*.

Not knowing which of the domains applied, or if UID's can be negative
in legitimate circumstances, I took the easy way out. I didn't need to
care what the domain of UID's was, I just used whatever integer
osmosis returned when calling OsmUser.getUID(), which happens to be -1
for OsmUser.NONE. My mistake was assuming that this mapping was
universal in the rest of the OSM stack.
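
As a sketch of that convention (Python, hypothetical names; the only
thing taken from above is that -1 stands for "no user"), the writer and
reader both have to agree on the reserved value:

```python
from typing import Optional

# Assumption for illustration: -1 is the reserved "no user" value,
# matching what is described above for OsmUser.NONE.
NO_USER_UID = -1


def uid_to_wire(uid: Optional[int]) -> int:
    """Map an optional uid to the integer that gets written out."""
    return NO_USER_UID if uid is None else uid


def uid_from_wire(value: int) -> Optional[int]:
    """Map a stored integer back to an optional uid."""
    return None if value == NO_USER_UID else value


if __name__ == "__main__":
    for uid in (None, 0, 42):
        print(uid, "->", uid_to_wire(uid), "->", uid_from_wire(uid_to_wire(uid)))
```

A reader that does not know about the reserved value will simply see a
user with uid -1, which is the ambiguity that came up here.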

> the
> problem has been reported before. The obvious fix would have been not to
> define it at all in message Info and to use 0 in DenseInfo.

>
>
>> I do appreciate you finding the bugs and ambiguities in the spec by
>> being the first independent implementation, and I hope you will
>> consider running the LZMA experiment, but you have been rude and
>> insulting.
>
> Basically you are asking me to run tests that Jochen should have come up
> with to prove that your specification of multiple compression formats
> sucked.

I viewed it differently. He wanted to know if the specification needed
to be that complicated, and I have to admit that I did not know.
That is a legitimate question. The essence of a good design is
simplicity. Each feature should have a reason for being there, a
reason strong enough to warrant being included. Does LZMA meet that
burden of proof?

Scott