[OSM-dev] Indexing of PBF files

Tue Oct 16 20:43:16 UTC 2018

On Tue, Oct 16, 2018 at 10:18:08PM +0200, William Temperley wrote:
> Requiring the sequential read makes using the pbf format difficult in data
> parallel processing.
> 
> When files are split into equal sized chunks to be processed in parallel,
> it is necessary to be able to seek to the beginning of the next block
> (blob) to begin processing there.
> 
> This is not currently possible with the pbf format, as the file _must_ be
> read sequentially to figure out where the blob ends / new one begins. With
> an index, or even just a simple delimiter it would be possible to figure
> this out in a parallel processing scenario.

Osmium can do this just fine. It has one thread reading the data
sequentially, figuring out where the blocks start and end and parceling
out the block decoding work to other threads. That is not as simple and
probably not quite as fast as with an index pointing to those blocks, but
it does work.
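
A rough sketch of that pattern in Python, not Osmium's actual code. It
assumes the published PBF framing: a 4-byte big-endian length, then a
BlobHeader message (field 1 = type, field 3 = datasize), then datasize
bytes of blob. The names read_blobs, decode_parallel and the decode
callback are made up for illustration.

```python
import struct
from concurrent.futures import ThreadPoolExecutor


def read_varint(buf, pos):
    """Decode one protobuf varint from buf starting at pos."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_blob_header(buf):
    """Return (type, datasize) from a serialized BlobHeader message."""
    pos, blob_type, datasize = 0, None, None
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field, wire = key >> 3, key & 7
        if wire == 0:                      # varint field
            value, pos = read_varint(buf, pos)
            if field == 3:
                datasize = value
        elif wire == 2:                    # length-delimited field
            length, pos = read_varint(buf, pos)
            if field == 1:
                blob_type = buf[pos:pos + length].decode("utf-8")
            pos += length                  # also skips indexdata (field 2)
        else:
            raise ValueError("unexpected wire type %d" % wire)
    return blob_type, datasize


def read_blobs(f):
    """Single sequential reader: yield (type, raw blob bytes) in file order."""
    while True:
        prefix = f.read(4)
        if len(prefix) < 4:
            return                         # end of file
        (header_len,) = struct.unpack(">I", prefix)
        blob_type, datasize = parse_blob_header(f.read(header_len))
        yield blob_type, f.read(datasize)  # raw bytes only, no decoding here


def decode_parallel(f, decode, workers=4):
    """Hand each raw blob to a worker; results come back in file order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(decode, t, data) for t, data in read_blobs(f)]
        return [fut.result() for fut in futures]
```

Only the reader touches the file, so it never needs to seek; the
expensive part (decompressing and parsing each blob) would live in the
decode callback that the workers run.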

Indexes have the drawback that you can't stream-write the data any
more: you have to go back to write the index. Or you write the index at
the end, but then you can't stream-read any more (at least not when you
want to use the index).
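
To illustrate the second case, here is a small sketch of a made-up
container with the index appended at the end. This layout is purely
hypothetical and is not part of the PBF format; the function names are
invented for the example.

```python
import io
import struct


def write_with_trailing_index(out, blobs):
    """Stream blobs to out, then append an offset table and its entry count."""
    offsets = []
    pos = 0                                # tracked by hand, so out need not be seekable
    for blob in blobs:
        offsets.append(pos)
        out.write(blob)
        pos += len(blob)
    for off in offsets:
        out.write(struct.pack(">Q", off))  # one 8-byte offset per blob
    out.write(struct.pack(">I", len(offsets)))


def read_index(f):
    """Recover the offset table; this needs a seek to the end of the file."""
    f.seek(-4, io.SEEK_END)
    (count,) = struct.unpack(">I", f.read(4))
    f.seek(-4 - 8 * count, io.SEEK_END)
    return [struct.unpack(">Q", f.read(8))[0] for _ in range(count)]
```

The writer never goes back, so it works on a pipe, but read_index has to
jump to the end before anything else, which a reader consuming a stream
cannot do.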

Jochen
-- 
Jochen Topf  jochen at remote.org  https://www.jochentopf.com/  +49-351-31778688