[OSM-dev] [osmosis-dev] Proposal for a multithreaded PBF reader

Wed Apr 29 05:22:20 UTC 2015

On Tue, Apr 28, 2015 at 06:56:23 -0600, Martijn van Exel wrote:
> Not sure if this has been discussed recently, but we've been thinking
> about improving osmosis PBF reading performance over at Telenav. My
> colleague Jon (cc) has come up with a suggestion that I want to put
> forward for discussion. I'm posting this to both osmosis-dev as well
> as dev because it affects the PBF format definition.
> 
> When reading a large PBF resource from a random access file (as
> opposed to a stream), it might be possible to significantly increase
> throughput by reading data of the same entity type from multiple
> threads simultaneously, making use of an optional directory structure
> to locate valid blocks of nodes, ways and relations for threads to
> consume.
> 
> To support parallel access, an optional directory_offset might be
> added to the HeaderBlock:
> 
> message HeaderBlock {
>   …
>   optional int64 directory_offset;
> }
> 
> The directory_offset field would be the seek location in the file of a
> Directory message which is written at the end of the file (since the
> directory is flexible in length and all offsets are only known after
> writing all data to the PBF file). The directory itself is simply a
> list of valid read offsets for each entity type. Threads can read data
> from a given offset in the list to the next offset. The best chunk
> size for blocks in the directory can be determined through
> experimentation, although something around 1MB might be a good first
> guess.
> 
> message Directory {
>   repeated int64 node_block_offsets;
>   repeated int64 way_block_offsets;
>   repeated int64 relation_block_offsets;
> }
> 
> Before we explore this further, I'd like to know if this has been
> attempted before, and what concerns there may be.
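
For reference, the directory lookup proposed above amounts to pairing
each offset with the one that follows it. A minimal sketch in Python,
assuming the offsets of one entity type are given as a plain list;
block_ranges and end_offset are hypothetical names, and the proposal
does not say where the last block of a type ends, so that value is
passed in explicitly here:

```python
def block_ranges(offsets, end_offset):
    """Turn a list of block start offsets into (start, end) byte ranges.

    end_offset is an assumption: the position where the last block of
    this entity type stops (for example the first block of the next type).
    """
    starts = sorted(offsets)
    # Each block ends where the next one starts; the last ends at end_offset.
    ends = starts[1:] + [end_offset]
    return list(zip(starts, ends))


# Each tuple could be handed to a separate reader thread.
ranges = block_ranges([0, 100, 250], end_offset=400)
print(ranges)  # [(0, 100), (100, 250), (250, 400)]
```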

PBF files already come in blocks with a length header in front of every
block. Osmium reads this length header in one thread and then puts the
data of each block into a work queue to be parsed by as many threads as
you want. This way you already get a nice speedup without any changes to
the file format.
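
A rough sketch of that pattern in Python, not Osmium's actual code. It
assumes a simplified framing in which every block is a 4-byte big-endian
length followed by a zlib-compressed payload. Real PBF files put a
BlobHeader message after the length prefix, and that header gives the
size of the Blob that follows. The names iter_blocks, parse_block and
read_parallel are made up for this example.

```python
import io
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor


def iter_blocks(stream):
    """Yield the raw payload of each length-prefixed block, in file order."""
    while True:
        prefix = stream.read(4)
        if not prefix:
            return  # clean end of file
        if len(prefix) < 4:
            raise ValueError("truncated length prefix")
        (size,) = struct.unpack(">I", prefix)
        payload = stream.read(size)
        if len(payload) < size:
            raise ValueError("truncated block")
        yield payload


def parse_block(payload):
    """Stand-in for the expensive step: here it only decompresses."""
    return zlib.decompress(payload)


def read_parallel(stream, workers=4):
    """Frame blocks in the calling thread, parse them in a thread pool.

    Executor.map submits every block up front and returns results in
    submission order, so the output keeps file order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_block, iter_blocks(stream)))
```

Only the cheap framing step is sequential here. Because map submits all
blocks at once, a reader for planet-sized files would want a bounded
queue between the framing thread and the workers instead.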

Jochen
-- 
Jochen Topf  jochen at remote.org  http://www.jochentopf.com/  +49-351-31778688