[OSM-dev] Proposal for a multithreaded PBF reader

Martijn van Exel m at rtijn.org
Wed Apr 29 00:56:23 UTC 2015


Hi all,

Not sure if this has been discussed recently, but we've been thinking
about improving osmosis PBF reading performance over at Telenav. My
colleague Jon (cc) has come up with a suggestion that I want to put
forward for discussion. I'm posting this to both osmosis-dev and dev
because it affects the PBF format definition.

When reading a large PBF resource from a random access file (as
opposed to a stream), it might be possible to significantly increase
throughput by reading data of the same entity type from multiple
threads simultaneously, making use of an optional directory structure
to locate valid blocks of nodes, ways and relations for threads to
consume.

To support parallel access, an optional directory_offset might be
added to the HeaderBlock:

message HeaderBlock {
  …
  optional int64 directory_offset;
}

The directory_offset field would be the seek location in the file of a
Directory message which is written at the end of the file (since the
directory is flexible in length and all offsets are only known after
writing all data to the PBF file). The directory itself is simply a
list of valid read offsets for each entity type. Threads can read data
from a given offset in the list to the next offset. The best chunk
size for blocks in the directory can be determined through
experimentation, although something around 1 MB might be a good first
guess.

message Directory {
  repeated int64 node_block_offsets;
  repeated int64 way_block_offsets;
  repeated int64 relation_block_offsets;
}
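
To make the read pattern concrete, here is a minimal Python sketch of how
a reader might turn one of these offset lists into byte ranges and read
them from several threads. It is only an illustration, not osmosis code:
the helper names (byte_ranges, read_range, read_parallel) are made up, the
chunks are opaque bytes rather than real PBF blobs, and it assumes the
reader knows where the last range of an entity type ends (here, the start
of the first way chunk), which the proposal above does not yet specify.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def byte_ranges(offsets, end):
    """Pair each offset with the next one; the last range stops at `end`."""
    bounds = list(offsets) + [end]
    return list(zip(bounds[:-1], bounds[1:]))


def read_range(path, start, stop):
    """Read bytes [start, stop) through a file handle private to this call."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(stop - start)


def read_parallel(path, offsets, end, workers=4):
    """Read every range of one entity type concurrently, in file order."""
    ranges = byte_ranges(offsets, end)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda r: read_range(path, *r), ranges))


# Toy stand-in for a PBF file: three fake "node" chunks, then one "way" chunk.
chunks = [b"node-chunk-0;", b"node-chunk-1;", b"node-chunk-2;", b"way-chunk-0;"]
node_block_offsets, position = [], 0
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    for chunk in chunks:
        if chunk.startswith(b"node"):
            node_block_offsets.append(position)  # what a writer would record
        tmp.write(chunk)
        position += len(chunk)
    demo_path = tmp.name

# Assumption: the node section ends where the way section begins.
way_start = sum(len(c) for c in chunks[:3])
node_data = read_parallel(demo_path, node_block_offsets, way_start)
os.remove(demo_path)
```

Each worker opens its own file handle, so no seek position is shared
between threads. A real reader would decode the blobs found in each range
instead of returning raw bytes.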

Before we explore this further, I'd like to know if this has been
attempted before, and what concerns there may be.

Best,

Martijn


