[OSM-dev] [osmosis-dev] Proposal for a multithreaded PBF reader

Martijn van Exel m at rtijn.org
Wed Apr 29 16:55:57 UTC 2015

Hey Jochen,

That's useful, but it still doesn't let you optimize threading, because
you don't know in advance how many blocks there are. Or is there a way
to know or estimate this from the '~8k' block size (from the wiki)?

If osmosis is the reference implementation, is there a reason why it
doesn't seem to leverage this block structure to speed up reading? Or
does it?
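
On the block-count question: as far as the PBF format description goes,
every block is framed as a 4-byte big-endian BlobHeader length, the
BlobHeader itself (type in field 1, datasize in field 3), and then the
Blob. So the blocks can be counted by hopping from header to header
without decompressing anything. A minimal Python sketch under that
assumption; the function names are made up for illustration, and this is
not taken from osmosis or Osmium:

```python
import io
import struct


def read_varint(buf, pos):
    """Decode one protobuf varint from buf at pos; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_blob_header(buf):
    """Pull (type, datasize) out of a serialized BlobHeader message."""
    blob_type, datasize = None, None
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field, wire = key >> 3, key & 7
        if wire == 0:  # varint field
            value, pos = read_varint(buf, pos)
            if field == 3:
                datasize = value
        elif wire == 2:  # length-delimited field (string or bytes)
            length, pos = read_varint(buf, pos)
            if field == 1:
                blob_type = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError(f"unexpected wire type {wire}")
    if datasize is None:
        raise ValueError("BlobHeader without datasize")
    return blob_type, datasize


def scan_blobs(f):
    """Yield (type, data_offset, datasize) per block, skipping the payload."""
    while True:
        prefix = f.read(4)
        if len(prefix) < 4:
            return  # end of file
        (header_len,) = struct.unpack(">I", prefix)
        blob_type, datasize = parse_blob_header(f.read(header_len))
        data_offset = f.tell()
        yield blob_type, data_offset, datasize
        f.seek(datasize, io.SEEK_CUR)  # jump over the Blob without reading it
```

Something like `len(list(scan_blobs(open(path, "rb"))))` would then give
the block count, and the collected offsets could be split across
threads. This still needs one seek per block, so it only pays off on a
random access file, not on a stream.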

Martijn van Exel
skype: mvexel

On Tue, Apr 28, 2015 at 11:22 PM, Jochen Topf <jochen at remote.org> wrote:
> On Tue, Apr 28, 2015 at 06:56:23 -0600, Martijn van Exel wrote:
>> Not sure if this has been discussed recently, but we've been thinking
>> about improving osmosis PBF reading performance over at Telenav. My
>> colleague Jon (cc) has come up with a suggestion that I want to put
>> forward for discussion. I'm posting this to both osmosis-dev as well
>> as dev because it affects the PBF format definition.
>> When reading a large PBF resource from a random access file (as
>> opposed to a stream), it might be possible to significantly increase
>> throughput by reading data of the same entity type from multiple
>> threads simultaneously, making use of an optional directory structure
>> to locate valid blocks of nodes, ways and relations for threads to
>> consume.
>> To support parallel access, an optional directory_offset might be
>> added to the HeaderBlock:
>> message HeaderBlock {
>>   optional int64 directory_offset
>> }
>> The directory_offset field would be the seek location in the file of a
>> Directory message which is written at the end of the file (since the
>> directory is flexible in length and all offsets are only known after
>> writing all data to the PBF file). The directory itself is simply a
>> list of valid read offsets for each entity type. Threads can read data
>> from a given offset in the list to the next offset. The best chunk
>> size for blocks in the directory can be determined through
>> experimentation, although something around 1MB might be a good first
>> guess.
>> message Directory {
>>   repeated int64 node_block_offsets;
>>   repeated int64 way_block_offsets;
>>   repeated int64 relation_block_offsets;
>> }
>> Before we explore this further, I'd like to know if this has been
>> attempted before, and what concerns there may be.
> PBF files already come in blocks with a length header in front of every
> block. Osmium reads this length header in one thread and then puts the
> data of each block into a work queue to be parsed by as many threads as
> you want. This way you already get a nice speedup without any changes to
> the file format.
> Jochen
> --
> Jochen Topf  jochen at remote.org  http://www.jochentopf.com/  +49-351-31778688
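
The reader-plus-work-queue pattern described above can be sketched
roughly as follows. This is a minimal Python illustration, not Osmium's
code: it assumes a simplified stand-in framing (a 4-byte big-endian
length followed by a zlib-compressed payload) rather than the real PBF
BlobHeader/Blob layout, and `read_frames`, `parse_frame` and
`parse_parallel` are hypothetical names.

```python
import io
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor


def read_frames(f):
    """Single reader: yield the raw bytes of each length-prefixed frame."""
    while True:
        prefix = f.read(4)
        if len(prefix) < 4:
            return  # end of input
        (length,) = struct.unpack(">I", prefix)
        yield f.read(length)


def parse_frame(frame):
    """Stand-in for real block parsing: decompress and report the size."""
    return len(zlib.decompress(frame))


def parse_parallel(f, workers=4):
    """Hand frames from one reader to a pool of parsing threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map submits every frame up front and returns results in
        # input order; a bounded queue would be needed for very large files.
        return list(pool.map(parse_frame, read_frames(f)))
```

The file is only ever read sequentially by one thread, so this works on
streams as well as on random access files. In CPython the speedup
depends on the workers spending their time in code that releases the
GIL, which zlib decompression is generally expected to do.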
