[OSM-dev] Simpler binary OSM formats

Sun Feb 7 18:59:36 UTC 2016

Hi Dmitry,

Yes, there are similarities and I did study the o5m format before I began work on vex. The last section of my original article compares the two and gives my impressions of o5m: http://conveyal.com/blog/2015/04/27/osm-formats#comparisons-with-o5m <http://conveyal.com/blog/2015/04/27/osm-formats#comparisons-with-o5m>

In summary: o5m uses string tables with a fixed size and an LRU eviction policy. Producers and consumers must keep their string tables exactly in sync. Strings are then referenced by integers indicating how recently they were used (1 to 15000). This adds quite a bit of complexity to o5m implementations, especially considering that this eviction strategy can backfire on certain inputs leading to files that are actually bigger than a basic gzipped text representation of the same data. According to http://wiki.openstreetmap.org/wiki/Talk:O5m#Compression_Algorithms <http://wiki.openstreetmap.org/wiki/Talk:O5m#Compression_Algorithms> o5m uses string tables specifically to avoid relying on general purpose compression. I find this unnecessary considering that zlib compression is quite effective, resource efficient (with adjustable compression level), and available practically everywhere.

There are a few other unusual design decisions documented and discussed at http://wiki.openstreetmap.org/wiki/O5m <http://wiki.openstreetmap.org/wiki/O5m> and http://wiki.openstreetmap.org/wiki/Talk:O5m <http://wiki.openstreetmap.org/wiki/Talk:O5m>. For example, strings are both introduced and terminated by a null byte, and are often stored in pairs (i.e. three null bytes per string pair, one of which is at the beginning of the string).

Of course I recognize o5m's contribution to the dialog on binary formats, and we can of course learn from the o5m concept, but my conclusion is that it does not have the combination of extreme simplicity and compactness necessary to complement the existing formats.

As for fixed sized blocks in vex, I did consider that option but couldn’t come up with a compelling reason for it. I can see the case for a maximum block size (so we know what the maximum size of allocation will be), but can you give a concrete example of how fixed-size blocks would be advantageous in practice? I would be very hesitant to split any entity across multiple blocks.

-Andrew

> On 07 Feb 2016, at 09:06, Дмитрий Киселев <dmitry.v.kiselev at gmail.com> wrote:
> 
> Looks pretty similar to o5m, except tags key=value are not round-buffered.
> 
> As a further extension, it would be nice to have the ability to have blocks of fixed size. 
> Just write nodes one by one while you haven't full-fill byte buffer.
> For extremely big relations (which are larger than one block) it's possible to mark two adjacent blocks as connected, but there should be a few of them.
> 
> It would help to read write and seek over files.
> 
> 2016-02-07 3:47 GMT+05:00 Stadin, Benjamin <Benjamin.Stadin at heidelberg-mobil.com <mailto:Benjamin.Stadin at heidelberg-mobil.com>>:
> Hi Andrew,
> 
> Cap'n Proto (successor of ProtoBuffer from the guy who invented ProtoBuffer) and FlatBuffers (another ProtoBuffer succesor, by Google) have gained a lot of traction since last year. Both eliminate many if the shortcomings of the original ProtoBuffer (allow for random access, streaming,...), and improve on performance also.
> 
> https://github.com/google/flatbuffers <https://github.com/google/flatbuffers>
> 
> Here is a comparison between ProtoBuffer competitors:
> https://capnproto.org/news/2014-06-17-capnproto-flatbuffers-sbe.html <https://capnproto.org/news/2014-06-17-capnproto-flatbuffers-sbe.html>
> 
> In my opinion FlatBuffers is the most interesting. It seems to have very good language and platform support, and has quite a high adoption rate already. 
> 
> I think that it's well worth to reconsider creating an own file format and parser for several reasons. Your concept looks well thought, it should be possible to implement a lighweight parser using FlatBuffers for your data scheme. 
> 
> Regards
> Ben 
> 
> Von meinem iPad gesendet
> 
> Am 06.02.2016 um 22:37 schrieb Andrew Byrd <andrew at fastmail.net <mailto:andrew at fastmail.net>>:
> 
>> Hello OSM developers,
>> 
>> Last spring I posted an article discussing some shortcomings of the PBF format and proposing a simpler binary OSM interchange format called VEX. There was a generally positive response at the time, including helpful feedback from other developers. Since then I have revised the VEX specification as well as our implementation, and Conveyal has been using this format in our own day-to-day work.
>> 
>> I have written a new article describing of the revised format:
>> http://conveyal.com/blog/2016/02/06/vex-format-part-two <http://conveyal.com/blog/2016/02/06/vex-format-part-two>
>> 
>> The main differences are 1) it is more regular and even simpler to parse; and 2) file blocks are compressed individually, allowing parallel processing and seeking to specific entity types. It is no longer smaller than PBF, but still comparable in size.
>> 
>> Again, I would welcome any comments you may have on the revised format and the potential for a shift to simpler binary OSM formats.
>> 
>> Regards,
>> Andrew Byrd
>> 
>> 
>>> On 29 Apr 2015, at 01:35, andrew byrd <andrew at fastmail.net <mailto:andrew at fastmail.net>> wrote:
>>> 
>>> Hello OSM developers,
>>>  
>>> Over the last few years I have worked on several pieces of software that consume and produce the PBF format. I have always appreciated the advantages of PBF over XML for our use cases, but over time it became apparent to me that PBF is significantly more complex than would be necessary to meet its objectives of speed and compactness.
>>>  
>>> Based on my observations about the effectiveness of various techniques used in PBF and other formats, I devised an alternative OSM representation that is consistently about 8% smaller than PBF but substantially simpler to encode and decode. This work is presented in an article at http://conveyal.com/blog/2015/04/27/osm-formats/ <http://conveyal.com/blog/2015/04/27/osm-formats/>. I welcome any comments you may have on this article or on the potential for a shift to simpler binary OSM formats.
>>>  
>>> Regards,
>>> Andrew Byrd
>>> _______________________________________________
>>> dev mailing list
>>> dev at openstreetmap.org <mailto:dev at openstreetmap.org>
>>> https://lists.openstreetmap.org/listinfo/dev <https://lists.openstreetmap.org/listinfo/dev>
>> 
>> _______________________________________________
>> dev mailing list
>> dev at openstreetmap.org <mailto:dev at openstreetmap.org>
>> https://lists.openstreetmap.org/listinfo/dev <https://lists.openstreetmap.org/listinfo/dev>
> 
> _______________________________________________
> dev mailing list
> dev at openstreetmap.org <mailto:dev at openstreetmap.org>
> https://lists.openstreetmap.org/listinfo/dev <https://lists.openstreetmap.org/listinfo/dev>
> 
> 
> 
> 
> -- 
> Thank you for your time. Best regards.
> Dmitry.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openstreetmap.org/pipermail/dev/attachments/20160207/eb71a830/attachment.html>