[OSM-dev] New OSM binary fileformat implementation.
Scott Crosby
scrosby06 at gmail.com
Thu Apr 29 13:15:43 BST 2010
On Thu, Apr 29, 2010 at 2:15 AM, Frederik Ramm <frederik at remote.org> wrote:
> Scott,
>
>
> Scott Crosby wrote:
>
>> I would like to announce code implementing a binary OSM format that
>> supports the full semantics of the OSM XML.
>>
>
> This all sounds very interesting, and you seem to have spent a lot of
> thought on it and documented it well.
>
> If I understand it correctly, this is meant to be a replacement for the XML
> files as a "transport format" for XML data. It is not meant to offer random
> access in any way, and thus differs from other attempts at creating binary
> formats that could be used in lieu of databases, having indexes and all.
>
>
Roughly, yes, this is intended as a transport format, but the design is
flexible.
If the file is physically ordered so that blocks have strong geographic
locality and block metadata includes bounding boxes, then those bounding
boxes can be used to skip unneeded blocks. If the file is physically ordered
in type/id order (as the current planet is), and block metadata includes the
minimum and maximum id for each block, then, as before, the metadata can be
used to examine only the desired blocks. If both kinds of search are critical, then
generate two files, one with geographic locality and one sorted by type/id.
Storing 'planet-omitmeta.bin' *AND* 'planet.bin' is still cheaper than
storing 'planet-100303.osm.gz'.
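As a rough illustration of the idea, here is a minimal Java sketch of skipping fileblocks whose bounding box does not intersect a query region. The BlockMeta class and its fields are hypothetical and only stand in for whatever per-block metadata the format would carry; they are not the actual metadata layout or the osmosis API.

```java
// Hypothetical sketch: a reader consults per-block bounding-box metadata
// and decodes only the blocks that intersect the query region.
public class BlockSkipSketch {

    // Illustrative per-block metadata: bounding box plus file offset.
    static final class BlockMeta {
        final double minLon, minLat, maxLon, maxLat;
        final long offset; // byte offset of the block in the file

        BlockMeta(double minLon, double minLat,
                  double maxLon, double maxLat, long offset) {
            this.minLon = minLon; this.minLat = minLat;
            this.maxLon = maxLon; this.maxLat = maxLat;
            this.offset = offset;
        }

        // Standard axis-aligned box intersection test.
        boolean intersects(double qMinLon, double qMinLat,
                           double qMaxLon, double qMaxLat) {
            return minLon <= qMaxLon && maxLon >= qMinLon
                && minLat <= qMaxLat && maxLat >= qMinLat;
        }
    }

    public static void main(String[] args) {
        java.util.List<BlockMeta> blocks = java.util.Arrays.asList(
            new BlockMeta(-10, 40, 0, 50, 0),      // western Europe
            new BlockMeta(0, 40, 10, 50, 4096),    // central Europe
            new BlockMeta(100, -10, 110, 0, 8192)  // southeast Asia
        );
        // Query: a box around London. Only intersecting blocks are read;
        // the rest are skipped without decompressing them.
        int read = 0;
        for (BlockMeta b : blocks) {
            if (b.intersects(-1, 51, 1, 52)) {
                read++; // seek to b.offset and decode only this block
            }
        }
        System.out.println(read); // prints 2
    }
}
```

The same pattern works for the type/id ordering: replace the bounding box with the block's minimum and maximum id and the intersection test with a range check.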
As you point out, a binary format means different things to different
people. I chose this design because it would be useful as-is and could
offer future features without requiring changes to the file format. Geographic
searches and searches by type & id merely wait on code for physically
reordering the file being implemented. Adding the appropriate fields
to the metadata header is trivial in comparison. In addition, nothing in the
design precludes adding fileblocks that contain data other than OSM
entities. Fileblocks could contain an index from a node ID to the ways and
relations in which it is contained. Metadata headers on these blocks can indicate
which block contains the index entries for a particular node.
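To make that node-to-parent index concrete, here is a small Java sketch that builds such an index by inverting the way-to-node references. All names here are illustrative; the actual index fileblock encoding is not specified by the format yet.

```java
import java.util.*;

// Hypothetical sketch: building an index block that maps each node ID
// to the IDs of the ways that reference it.
public class NodeIndexSketch {
    public static void main(String[] args) {
        // Way ID -> ordered node refs, as they would appear in data blocks.
        Map<Long, long[]> ways = new LinkedHashMap<>();
        ways.put(100L, new long[]{1, 2, 3});
        ways.put(101L, new long[]{3, 4});

        // Invert the references: node ID -> list of containing way IDs.
        Map<Long, List<Long>> index = new TreeMap<>();
        for (Map.Entry<Long, long[]> e : ways.entrySet()) {
            for (long ref : e.getValue()) {
                index.computeIfAbsent(ref, k -> new ArrayList<>())
                     .add(e.getKey());
            }
        }

        // Node 3 is shared by both ways.
        System.out.println(index.get(3L)); // prints [100, 101]
    }
}
```

In a real file this map would be serialized into one or more index fileblocks, with each block's metadata header recording the range of node IDs it covers.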
My vision is that in many cases it is better to have a simple format that is
very dense and lets you skip the 95% of the data that you don't care about,
rather than design a very complex or significantly larger format (e.g., a
relational database) that lets you skip 99+% of the data that you don't
care about. The more advanced formats may return less data, but the simple
format is still 20 times less data than reading everything.
> Maybe we should be careful about naming these formats to make their purpose
> clearer. The generic "OSM binary format" seems to mean different things to
> different people. The file extension ".bin" is perhaps not the best choice.
>
> Have you considered/evaluated "Fast Infoset" and if so, what were the
> reasons against that?
>
>
>
No, I was not aware of that compressed XML design.
>> It is 5x-10x faster at
>> reading and writing and 30-50% smaller
>>
>
> The size figure is obviously compared to bz2; is the "5x-10x faster" also
> compared to bz2, and if so, compared to the native Java bz2 or the external
> C one?
>
>
For filesizes, I was comparing to bzip2. For performance, I compared against
the gzip'ed planet; I didn't have the patience to compress or decompress
that much XML in bzip2.
>> an entire planet, including
>> all metadata, can be read in about 12 minutes and written in about 50
>> minutes on a 3 year old dual-core machine.
>>
>
> How did you measure write performance decoupled from read performance?
> Surely your 3 year old dual-core machine did not have the 150 gigs of RAM
> needed to suck the entire planet into memory?
>
>
I benchmarked:
osmosis --read-bin file=planet.bin --write-null
osmosis --read-bin file=planet.bin --write-bin file=planet2.bin
And measured ~12 minutes of CPU time for the first and ~60 minutes of CPU
time for the second.
With a dual-core system, using '--b bufferCapacity=20000' gives some
concurrency and writing can be done in somewhere around 40 minutes.
> You have paid an impressive amount of attention to details in order to achieve
> the good performance and compression rates that you do. I'm slightly
> concerned about the robustness of it all - in the past, we often had planet
> files that were broken one way or the other, and it was usually possible to
> remedy this with some standard grep, sed, or dd actions - if one of your
> files ever breaks then I guess it is likely to be complete garbage ;-)
>
>
Without knowing how those prior planets were broken, I can't say whether
analogous breakage of files in my format could be repaired.
>
> Probably the most important TODO is packaging and fixing the build system.
>> I have almost no experience with ant and am unfamiliar with java
>> packaging practices, so I'd like to request help/advice on ant and
>> suggestions on
>> how to package the common parsing/serializing code so that it can be
>> re-used across different programs.
>>
>
> I suggest asking on osmosis-dev, and getting your new code into the Osmosis
> trunk quickly so people can play with it.
>
>
I think it would be prudent to get suggestions from the OSM community first.
Once the code is in osmosis, our ability to make
compatibility-breaking changes to the format will be reduced.
Scott