[OSM-dev] Simpler binary OSM formats

andrew byrd andrew at fastmail.net
Wed Apr 29 14:43:19 UTC 2015


Thanks for your comments Jochen. Responses in-line below.

On Wed, Apr 29, 2015, at 11:18, Jochen Topf wrote:
> On Wed, Apr 29, 2015 at 01:35:29AM +0200, andrew byrd wrote: ... it
> has some nice properties we shouldn't forget. First and foremost that
> is the block structure. This allows generating and parsing in multiple
> threads. I think that's an important optimization going forward.

A very good point. In any further testing I will compress each block
independently rather than compressing the entire stream.
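As a rough illustration of why per-block compression matters for
multi-threading, here is a minimal Python sketch. It is not VEX or PBF
code: the block size, the function names, and the use of zlib are all
assumptions made for the example. Because each block is compressed
independently, any block can be handed to any worker thread, and any
single block can be decompressed without touching the others.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical block size, chosen for the example only (not from PBF or VEX).
BLOCK_SIZE = 64 * 1024


def split_blocks(data, block_size=BLOCK_SIZE):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def compress_blocks(data, workers=4):
    """Compress every block on its own, spreading the work over a thread pool."""
    blocks = split_blocks(data)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so blocks stay in sequence.
        return list(pool.map(zlib.compress, blocks))


def decompress_blocks(compressed, workers=4):
    """Decompress the blocks independently and join them back together."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(zlib.decompress, compressed))
```

Compressing the whole stream at once would not allow this, since a
single deflate stream has to be read from the start.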

>  I also think it is important to have some kind of header for storing
>  file meta data in a flexible way, PBF has that.

Agreed. Any header metadata would be defined later if there is wider
interest in the format, since it represents a very small constant factor
in performance.

> Looking at your proposal you seem to be very concerned with file size
> but not so much with read/write speed. From my experience reading and
> writing PBF is always CPU bound. Removing complexity could speed this
> up considerably.

My true goal was to reduce complexity while at least maintaining the
performance characteristics of PBF. Given that goal, it is true that
there is too much emphasis on file size in the article. I will need to
do a follow-up to cover speed and memory usage.

Anecdotally, I was seeing about a 2x speedup relative to PBF writing for
a specific processing step, with both the PBF and VEX writers
implemented in C. That certainly needs to be confirmed more methodically.

The PBF parsing code produced by the Protobuf compiler seems to involve
quite a lot of dynamic memory allocations. It is straightforward to read
and write OSM data with no dynamic allocation at all (outside the
compression library), and this is one place where VEX could offer an
advantage.

> Currently you can save quite a lot of CPU time if you do not
> compress the PBF blocks but leave them uncompressed. Of course the
> file size goes up, but if you have the storage space that doesn't
> matter that much.

I have to admit I did not really consider this because I've rarely
encountered uncompressed PBF "in the wild" and have always used PBF with
compression turned on. Indeed, any future comparison should ideally
include PBF with uncompressed blocks.

With storage space and bandwidth as plentiful as they are today, it is
true that throughput should be considered as much as, or more than, file
size. Again, my true motivation is simplicity, along with whatever
performance improvements that simplicity makes possible.

I have observed that speed numbers can of course be quite different
depending on whether you handle the compression in a separate thread.

> First, I'd like to see the numbers for the whole planet. A size
> difference between small extracts doesn't really matter all that much,
> because the absolute size is so small. Savings on the whole planet
> file would be much more interesting.

A good suggestion. After I've amended the VEX format taking into account
the commentary I've received, I will perform a test on the whole planet.

> Second: The XML and PBF format usually contain the metadata that you
> removed in your VEX format. Have you accounted for that in your
> numbers? I.e. did you remove the metadata from XML and PBF, too?

Yes, I stripped the metadata from the source PBF before converting it to
all the other formats to ensure a fair comparison. These numbers are of
more interest to me in my daily applications, but since I see that there
is some interest from the wider community I agree that a future
comparison should be done including metadata.

> Incidentally I came up with a similar text format as you did. It is
> documented here:
> http://osmcode.org/libosmium/manual/libosmium-manual.html#opl-object-per-line-format

Yes, it's a very similar idea but of course mine is less complete since
it began life as debug output. One unusual ingredient of my text output
is the inclusion of nodes' complete data inline after the ways that
reference them. It's a trade-off which denormalizes intersection nodes,
but avoids keeping nodes far away from their references, avoids
repeatedly mentioning non-intersection node identifiers, and makes the
output human-readable as a series of complete way descriptions.
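To make the trade-off concrete, here is a small Python sketch of this
kind of denormalized output. The line layout ("W" and "N" prefixes,
indentation, tag syntax) and the function name are invented for
illustration and are not the actual VEX text format. Each node's data is
written directly under the way that references it, so a node shared by
two ways is written twice.

```python
def write_ways_with_inline_nodes(ways, nodes):
    """Render ways as text, with each referenced node written inline.

    ways:  dict of way_id -> (tags dict, list of node ids)
    nodes: dict of node_id -> (lat, lon)
    The output layout is hypothetical, not the real VEX text output.
    """
    lines = []
    for way_id, (tags, refs) in ways.items():
        tag_text = ",".join(f"{k}={v}" for k, v in tags.items())
        lines.append(f"W {way_id} {tag_text}")
        for ref in refs:
            lat, lon = nodes[ref]
            # The node id appears once here instead of in a separate
            # node section plus a reference from the way.
            lines.append(f"  N {ref} {lat:.7f} {lon:.7f}")
    return "\n".join(lines)


nodes = {1: (52.0, 4.0), 2: (52.1, 4.1), 3: (52.2, 4.2)}
ways = {
    10: ({"highway": "residential"}, [1, 2]),
    11: ({"highway": "path"}, [2, 3]),
}
print(write_ways_with_inline_nodes(ways, nodes))
```

In this example node 2 is an intersection and is repeated under both
ways, while nodes 1 and 3 are each written exactly once.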

-Andrew