[OSM-dev] Simpler binary OSM formats
andrew at fastmail.net
Mon Feb 8 09:57:15 UTC 2016
> On 07 Feb 2016, at 20:10, Дмитрий Киселев <dmitry.v.kiselev at gmail.com> wrote:
> As for fixed sized blocks in vex, I did consider that option but couldn’t come up with a compelling reason for it. I can see the case for a maximum block size (so we know what the maximum size of allocation will be), but can you give a concrete example of how fixed-size blocks would be advantageous in practice? I would be very hesitant to split any entity across multiple blocks.
> When you need relations-ways-nodes read order, blocks save you from an unnecessary read through the whole file (yes, you can skip decompression for nodes/ways, but you still must read the whole file).
Let me rephrase the question: You specifically mentioned blocks of a predetermined, fixed size. How would having fixed-size blocks (as opposed to the current variable-sized blocks) improve your ability to seek to different entity types within a file? Maybe you are thinking of doing a binary search through the file, rather than a linear search, for the blocks of interest. But that means the vex blocks would need to be a fixed size after compression, not before compression, and it seems too complex to require writers to target an exact post-compression block size.
Also, I think your observation that “you must read the whole file” when seeking ahead to another entity type requires some additional nuance. You must only read the header of each block, at which point you know how long that block is and you can seek ahead to the next block. So indeed, you’d touch at least one page or disk block per vex block. Pages are typically 4 kbytes, so if your vex blocks are a few Mbytes in size, you would only access on the order of 1/1000 of the pages while seeking ahead to a particular entity type.
To me, it seems much more advantageous to provide a table of file offsets stating where each entity type begins. This is one of the features I have considered adding to vex once the basic format is established (along with author metadata and map layers). It seems appropriate to place such a table at the end of the vex data, because this allows the writer to produce output as a stream (no backward seeks), and a reader can only make effective use of such a table if it is buffering the data and able to seek within the file anyway.
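A trailer-style offset table along these lines could look like the sketch below. The layout (9-byte entries of entity type plus offset, followed by a fixed 8-byte footer locating the table) is purely an illustrative assumption, not a vex proposal; it just shows why a trailing table costs the writer nothing while a seekable reader can find it immediately:

```python
import struct

def append_offset_table(buf, offsets):
    """Append a hypothetical trailer to already-written block data:
    one (type byte, 8-byte offset) entry per entity type, then a fixed
    8-byte footer recording where the table starts."""
    table_start = len(buf)
    out = bytes(buf)
    for etype, off in offsets.items():
        out += struct.pack(">BQ", etype, off)
    out += struct.pack(">Q", table_start)
    return out

def read_offset_table(buf):
    """Recover the table: the last 8 bytes locate it, entries are 9 bytes."""
    (table_start,) = struct.unpack(">Q", buf[-8:])
    entries = buf[table_start:-8]
    return {entries[i]: struct.unpack(">Q", entries[i + 1:i + 9])[0]
            for i in range(0, len(entries), 9)}
```

The writer only ever appends, so it remains a pure stream; the reader does one seek to the end, one small read, and then knows where every entity type begins.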
> Second example: find something by id. If you have blocks, it's easy to map a whole block into memory and do a binary search for logN block reads instead of seeking through the file all the time.
Unlike o5m, I have not included any requirement that IDs be in a particular order, which means binary searches are not always possible. I see vex as a data interchange format usable in both streaming and disk-backed contexts, not as a replacement for an indexed database table. It's an interesting idea to try to serve both purposes at once and be able to quickly seek to an ID within a flat data file, but I'm not sure whether such capabilities are worth the extra complexity. Such a binary search, especially if performed repeatedly for different entities, would touch (and decompress) a lot of disk blocks / memory pages, because the IDs you're searching through are mixed in with the rest of the data rather than kept in a separate index as they would be in a database.