[OSM-dev] Simpler binary OSM formats
andrew at fastmail.net
Mon Feb 8 11:45:19 UTC 2016
I was aware of Cap’n Proto, but thanks for pointing out FlatBuffers. I’ve studied this system and considered how it might be useful for OSM data exchange. Here are my impressions:
1. Each FlatBuffers message does indirection through a table “to allow for format evolution and optional fields”. The basic OSM data model is quite stable at this point and to my knowledge evolves only through the introduction of different tag strings. Unlike existing formats, I’d like vex to be extremely simple and non-extensible so developers can easily and completely support reading or writing it. I would hesitate to devote space in every serialized entity to unused extensibility features.
2. FlatBuffers messages use fixed-width integers throughout, for both field values and vtable entries. OSM entity IDs are now 64 bits wide. Vtable entries are 16 bits wide, and the offsets used to refer to all strings and vectors, which are “never stored in-line”, are 32 bits wide. The buffer will contain a very large proportion of zeros and repeated or unnecessary bytes (redundant fragments of coordinates and successive OSM entity references, offsets to strings and vectors). To get even remotely close to the file sizes we are accustomed to, the FlatBuffers would need to be inside compressed blocks. To achieve anything like comparable file sizes, we’d want to delta-code most numeric fields and probably apply variable-byte coding, i.e. pre-filter the data to assist the general-purpose compression in its job. However, FlatBuffers inherently does not support variable-width integers.
3. Generally speaking, I can certainly see the appeal of using code generated from a schema to support a format quickly and reliably in several languages. But one of the main difficulties I encountered with OSM PBF is that it requires the developer to mix automatically generated Protobuf code with various bits of hand-rolled code to handle the block structure, compression, delta coding, string tables, etc., which diminishes the appeal of code generation. In a well-designed format, the code to parse each individual OSM entity (or interpret it in-place) could in fact be quite simple compared to this compression and block-handling code, and I’m not sure we gain much by generating it. To achieve reasonably compact file sizes, FlatBuffers would still require mixing custom code into and around generated code. This would defeat one of my major design goals.
4. FlatBuffers allows accessing buffer contents without parsing or dynamic allocations, which is a laudable goal. However, the vex format as it is currently defined would also allow iterative access to every entity with no dynamic allocations, requiring only an initial pass over each entity to determine the offsets of tags, references, etc. before use. You could refer to this as “parsing the entity” but I expect it would have a near-zero impact on speed (and potentially zero impact considering that the data needs to be pulled into the processor cache for use anyway). Also, the file sizes we are accustomed to depend on delta coding, which is a cumulative process. While entire blocks may be skipped over, we must scan over all entities within a block to progressively decode coordinates or entity references. Random access within a block is not compatible with delta coding, nor do I see much use for it in a bulk data transfer and archiving format. So I think it’s a non-problem that we have to sequentially interpret the entities within each block.
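To make points 2 and 4 concrete, here is a minimal sketch of delta coding combined with variable-byte coding, written in Python. It is an illustration only and not the vex specification: the function names, the zigzag mapping for negative deltas, and the 7-bits-per-byte layout are all assumptions chosen for the example.

```python
from typing import Iterable, Iterator


def zigzag(n: int) -> int:
    """Map a signed integer to an unsigned one so small magnitudes stay small."""
    return 2 * n if n >= 0 else -2 * n - 1


def unzigzag(z: int) -> int:
    """Inverse of zigzag()."""
    return (z >> 1) ^ -(z & 1)


def write_varint(value: int, out: bytearray) -> None:
    """Append an unsigned integer, 7 bits per byte, high bit set on all but the last byte."""
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)


def encode_ids(ids: Iterable[int]) -> bytes:
    """Delta-code a sequence of IDs, then variable-byte code each delta."""
    out = bytearray()
    previous = 0
    for current in ids:
        write_varint(zigzag(current - previous), out)
        previous = current
    return bytes(out)


def decode_ids(data: bytes) -> Iterator[int]:
    """Decode sequentially; each value depends on the running sum of all earlier deltas."""
    previous = 0
    accumulator = 0
    shift = 0
    for byte in data:
        accumulator |= (byte & 0x7F) << shift
        if byte & 0x80:
            # Continuation bit set: more bytes belong to this integer.
            shift += 7
            continue
        previous += unzigzag(accumulator)
        yield previous
        accumulator = 0
        shift = 0


# Hypothetical node references, as might appear in one way.
refs = [4000000001, 4000000002, 4000000005, 4000000003]
encoded = encode_ids(refs)
assert list(decode_ids(encoded)) == refs
```

With these four example references the encoded form occupies 8 bytes, against 32 bytes as fixed-width 64-bit integers. The decoder also shows why random access within a block does not work: the third reference cannot be recovered without first summing the deltas before it.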
Of course I may have misunderstood something about your suggestion or the use cases you had in mind. As always I’d welcome any reactions or discussion. My intent here is not to defend a specification set in stone, but to see if there is a technical consensus on what a next generation OSM format could look like.
> On 06 Feb 2016, at 23:47, Stadin, Benjamin <Benjamin.Stadin at heidelberg-mobil.com> wrote:
> Hi Andrew,
> Cap'n Proto (successor of ProtoBuffer from the guy who invented ProtoBuffer) and FlatBuffers (another ProtoBuffer successor, by Google) have gained a lot of traction since last year. Both eliminate many of the shortcomings of the original ProtoBuffer (allow for random access, streaming,...), and improve on performance also.
> https://github.com/google/flatbuffers
> Here is a comparison between ProtoBuffer competitors:
> https://capnproto.org/news/2014-06-17-capnproto-flatbuffers-sbe.html
> In my opinion FlatBuffers is the most interesting. It seems to have very good language and platform support, and has quite a high adoption rate already.
> I think it's well worth reconsidering whether to create your own file format and parser, for several reasons. Your concept looks well thought out, and it should be possible to implement a lightweight parser using FlatBuffers for your data schema.
> Sent from my iPad
> On 06.02.2016 at 22:37, Andrew Byrd <andrew at fastmail.net> wrote:
>> Hello OSM developers,
>> Last spring I posted an article discussing some shortcomings of the PBF format and proposing a simpler binary OSM interchange format called VEX. There was a generally positive response at the time, including helpful feedback from other developers. Since then I have revised the VEX specification as well as our implementation, and Conveyal has been using this format in our own day-to-day work.
>> I have written a new article describing the revised format:
>> http://conveyal.com/blog/2016/02/06/vex-format-part-two
>> The main differences are 1) it is more regular and even simpler to parse; and 2) file blocks are compressed individually, allowing parallel processing and seeking to specific entity types. It is no longer smaller than PBF, but still comparable in size.
>> Again, I would welcome any comments you may have on the revised format and the potential for a shift to simpler binary OSM formats.
>> Andrew Byrd
>>> On 29 Apr 2015, at 01:35, andrew byrd <andrew at fastmail.net> wrote:
>>> Hello OSM developers,
>>> Over the last few years I have worked on several pieces of software that consume and produce the PBF format. I have always appreciated the advantages of PBF over XML for our use cases, but over time it became apparent to me that PBF is significantly more complex than would be necessary to meet its objectives of speed and compactness.
>>> Based on my observations about the effectiveness of various techniques used in PBF and other formats, I devised an alternative OSM representation that is consistently about 8% smaller than PBF but substantially simpler to encode and decode. This work is presented in an article at http://conveyal.com/blog/2015/04/27/osm-formats/. I welcome any comments you may have on this article or on the potential for a shift to simpler binary OSM formats.
>>> Andrew Byrd
>>> dev mailing list
>>> dev at openstreetmap.org
>>> https://lists.openstreetmap.org/listinfo/dev