[OSM-dev] Simpler binary OSM formats

Jochen Topf jochen at remote.org
Wed Apr 29 09:18:04 UTC 2015

On Wed, Apr 29, 2015 at 01:35:29AM +0200, andrew byrd wrote:
> Over the last few years I have worked on several pieces of software that
> consume and produce the PBF format. I have always appreciated the
> advantages of PBF over XML for our use cases, but over time it became
> apparent to me that PBF is significantly more complex than would be
> necessary to meet its objectives of speed and compactness.
> Based on my observations about the effectiveness of various techniques
> used in PBF and other formats, I devised an alternative OSM
> representation that is consistently about 8% smaller than PBF but
> substantially simpler to encode and decode. This work is presented in an
> article at http://conveyal.com/blog/2015/04/27/osm-formats/. I welcome
> any comments you may have on this article or on the potential for a
> shift to simpler binary OSM formats.

I agree that the PBF format is rather complex. But it has some nice properties
we shouldn't forget. First and foremost is the block structure, which allows
generating and parsing in multiple threads. I think that's an important
optimization going forward. Not that it would be difficult to add to your
format: simply adding a length field before each of the blocks you propose and
compressing each one on its own would more or less do it. I also think it is
important to have some kind of header for storing file metadata in a flexible
way; PBF has that.
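To make that concrete, here is a minimal sketch (in Java, using only the JDK's
zlib streams) of the kind of block structure I mean. The class and method names
and the toy payloads are made up for illustration, they are not part of PBF or
of the proposed format:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class BlockedFileDemo {

    // Writer side: compress each block on its own and prefix it with its length.
    static void writeBlock(DataOutputStream out, byte[] rawBlock) throws IOException {
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(compressed)) {
            dos.write(rawBlock);
        }
        out.writeInt(compressed.size());   // length field before the block
        compressed.writeTo(out);           // independently compressed payload
    }

    // Reader side: scan the length prefixes, decompress each block in a worker thread.
    static void readBlocks(String file) throws IOException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try (DataInputStream in = new DataInputStream(new FileInputStream(file))) {
            while (true) {
                int len;
                try {
                    len = in.readInt();
                } catch (EOFException e) {
                    break;                 // no more blocks
                }
                byte[] block = new byte[len];
                in.readFully(block);
                pool.submit(() -> decodeBlock(block));  // decode off the reader thread
            }
        }
        pool.shutdown();
    }

    static void decodeBlock(byte[] compressedBlock) {
        try (InflaterInputStream iis =
                 new InflaterInputStream(new ByteArrayInputStream(compressedBlock))) {
            byte[] decoded = iis.readAllBytes();
            System.out.println("decoded a block of " + decoded.length + " bytes");
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream("blocks.bin"))) {
            writeBlock(out, "first block of encoded OSM entities".getBytes("UTF-8"));
            writeBlock(out, "second block".getBytes("UTF-8"));
        }
        readBlocks("blocks.bin");
    }
}

A reader can hop from length field to length field without decoding anything,
which is what makes it cheap to hand each block to a separate worker thread.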

Looking at your proposal, you seem to be very concerned with file size but not
so much with read/write speed. In my experience, reading and writing PBF is
always CPU-bound. Removing complexity could speed this up considerably. But
if the price is that we need zlib (de)compression, it might not be worth it,
because that is rather CPU- and memory-intensive. Currently you can save quite
a lot of CPU time if you leave the PBF blocks uncompressed. Of course the file
size goes up, but if you have the storage space that doesn't matter that much.
Another issue we have to keep in mind is memory usage. The usual compression
algorithms work better when they are run on larger pieces of data, but that
means you need a lot of memory to hold the original data and the compressed
data at the same time. This might not matter in many cases, but if you are
reading and writing lots of files at the same time and/or need your memory for
other things, too (which is usually the case), it can become important. The PBF
format, for that matter, is pretty problematic in that regard, because of the
string table and because of the inefficient way the Google Protobuf library
deals with memory management.
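Purely as an illustration of that trade-off, a block header could carry a
one-byte flag saying whether the payload is stored raw or zlib-compressed,
roughly like a PBF Blob holding either raw or zlib_data. The names below are
made up for the sketch:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.DeflaterOutputStream;

class OptionalCompressionWriter {
    static final byte RAW  = 0;   // payload stored as-is
    static final byte ZLIB = 1;   // payload is zlib-compressed

    static void writeBlock(DataOutputStream out, byte[] block, boolean compress) throws IOException {
        byte[] payload = block;
        if (compress) {
            // While compressing, the raw block and its compressed copy are both
            // held in memory at the same time -- the overhead mentioned above.
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (DeflaterOutputStream dos = new DeflaterOutputStream(bos)) {
                dos.write(block);
            }
            payload = bos.toByteArray();
        }
        out.writeByte(compress ? ZLIB : RAW);  // compression flag
        out.writeInt(payload.length);          // length field
        out.write(payload);                    // block payload
    }
}

That way a writer can skip compression entirely when CPU is scarcer than disk
space, without changing the block structure.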

About the stats in your blog post comparing the different formats: First, I'd
like to see the numbers for the whole planet. A size difference between small
extracts doesn't really matter all that much, because the absolute size is so
small. Savings on the whole planet file would be much more interesting.

Second: The XML and PBF formats usually contain the metadata that you removed
in your VEX format. Have you accounted for that in your numbers? I.e., did you
remove the metadata from XML and PBF, too? I think the numbers including the
metadata would be much more interesting. It is, of course, okay to remove that
data for your internal use if you don't need it. But if we are talking about
a possible future standard for OSM data that's used in many places, we need at
least the option of keeping that data.

Incidentally, I came up with a text format similar to yours. It is documented

Jochen Topf  jochen at remote.org  http://www.jochentopf.com/  +49-351-31778688
