[OSM-dev] OSM PBF and spatial characteristics of blocks

Paul Norman penorman at mac.com
Tue Jan 5 17:09:54 UTC 2016

On 1/5/2016 8:32 AM, Stadin, Benjamin wrote:
> I’m thinking about a design for an efficient storage container for OSM 
> PBF (planet size data, minutely updates), for the purpose of TileMaker 
> as well as for an internal application.

Good to see Tilemaker (https://github.com/systemed/tilemaker) getting 
some traction.

> One thing I stumbled on is the usage of the bounding boxes within OSM 
> PBF. The documentation [1] does not clarify on the spatial 
> characteristics of the individual FileBlocks. Some questions:
>  1. Is it correct that there is exactly one HeaderBlock in a .pbf
>     file? If so, the BBOX defined within the HeaderBlock defines the
>     whole region of the .pbf export?
>  2. What are the spatial characteristics of an individual FileBlock
>     within the FileBlocks sequence? Is a FileBlock generated by any
>     kind of spatial ordering? For example, is it safe to assume that
>     all content is very dense / close to a region of the world? Or can
>     this be controlled when creating a .pbf? If there was a spatial
>     loose relationship, it would allow to relate FileBlocks to map
>     „tile“ regions (a FileBlock may obviously relate to several
>     „tiles“, but would be fine as long as the blocks relate to a
>     certain region for most of its content)
>  3. There is a commented BBOX definition within the PrimitiveBlock.
>     What remains to be done to enable this proposed BBOX extension?
>     I’d have the same question about this BBOX as with my second question.

PBFs are generally ordered by type then ID, so there is no guaranteed 
spatial clustering. There is a strong correlation between nearby IDs and 
objects being near each other, which is what makes delta encoding worthwhile.
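
To illustrate why ID-sorted data delta-encodes well, here is a minimal 
sketch in Python. It is not taken from any PBF library: the function names 
and the sample IDs are made up for illustration, and the size estimate 
assumes values are stored as zigzag varints, the way protobuf sint64 
fields are.

```python
def delta_encode(values):
    """Return the first value followed by successive differences."""
    deltas = []
    prev = 0
    for v in values:
        deltas.append(v - prev)
        prev = v
    return deltas


def delta_decode(deltas):
    """Invert delta_encode by keeping a running sum."""
    values = []
    total = 0
    for d in deltas:
        total += d
        values.append(total)
    return values


def zigzag(n):
    """Map a signed 64-bit integer to unsigned so small magnitudes stay small."""
    return (n << 1) ^ (n >> 63)


def varint_len(n):
    """Number of bytes an unsigned varint needs for n (7 payload bits per byte)."""
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length


def encoded_size(values):
    """Total bytes if each value is written as a zigzag varint."""
    return sum(varint_len(zigzag(v)) for v in values)


# Hypothetical node IDs, sorted as they would be in an ID-ordered file.
ids = [1000000001, 1000000002, 1000000005]
deltas = delta_encode(ids)

print(deltas)
print(encoded_size(ids), "bytes raw vs", encoded_size(deltas), "bytes delta-coded")
```

The same trick applies to coordinates: if consecutive nodes tend to be 
close together, the latitude and longitude differences are small and take 
fewer bytes. Nothing here implies that a block covers a bounded region, 
only that neighbouring values are often similar.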

A lot of software implicitly depends on ordering. Sorting by type is 
often a hard requirement - doing anything with ways normally requires 
having parsed all the nodes for geometries. Sorting by ID may be needed 
depending on how storage algorithms were implemented - software can 
become less efficient or break if it's expecting ordered IDs and gets 
