[OSM-dev] Storing way/relation start offset in PBF file

Tue Nov 30 23:07:26 GMT 2010

On Tue, Nov 30, 2010 at 2:26 PM, Jochen Topf <jochen at remote.org> wrote:
> Sometimes it is useful to read ways before nodes or relations before ways or
> nodes. With the XML format this is not really possible, but with the PBF
> format it could be reasonably easy if we store the offsets in the file where
> the way and relations start, respectively.
>
> If we write the offsets at the end of the file, we can still do streaming
> write. When reading from a stream you have to read everything anyway, when
> reading from a file, you can seek to the end and find out about the offsets
> and then seek there and start reading the data.
>
> Is this something we can fit in the existing extension mechanism? If not, it's
> not a big deal, but we can note it down as a possible extension in case we do
> a new version of the format someday.

Yes. It can be done. Someone requested that feature a few months ago.
Was that you?

I can't quite do what you want, but I can get very close within the
current design, in a backward and forward compatible manner. Each
fileblock in the file has a compressed 'blob', and 'indexdata'.
Indexdata is not compressed and can be read even before decompressing
the blob data. If indexdata contains a protocol buffer containing a
count of the number of each type of entity in the block, then your
read-only-ways code simply skips past any block with waycount==0. It
won't be as fast as random access, and it will burn the disk IO
bandwidth to read every byte of the file, but you avoid the
decompression and decoding time and retain full forward and backward
compatibility with existing PBF software.
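
To make the skipping idea concrete, here is a rough sketch. It does not
use the real PBF wire format or any protocol buffer definitions: the
block layout (four big-endian uint32 values followed by a
zlib-compressed blob) and the names write_block and read_way_blocks are
assumptions made up for illustration. It only shows a reader consulting
uncompressed counts and discarding blocks with waycount==0 without
decompressing them.

```python
import io
import struct
import zlib

# Toy block layout, NOT the real PBF format (assumed for illustration):
#   index: node count, way count, relation count, blob size
#          (4 x uint32, big-endian, uncompressed)
#   blob:  zlib-compressed payload
INDEX = struct.Struct(">IIII")


def write_block(out, nodes, ways, relations, payload):
    """Append one block: uncompressed counts, then the compressed blob."""
    blob = zlib.compress(payload)
    out.write(INDEX.pack(nodes, ways, relations, len(blob)))
    out.write(blob)


def read_way_blocks(stream):
    """Yield decompressed payloads of blocks whose index reports ways."""
    while True:
        raw = stream.read(INDEX.size)
        if len(raw) < INDEX.size:
            return  # end of input
        nodes, ways, relations, blob_size = INDEX.unpack(raw)
        blob = stream.read(blob_size)  # every byte is still read from disk
        if ways == 0:
            continue  # skipped: no decompression, no decoding
        yield zlib.decompress(blob)


buf = io.BytesIO()
write_block(buf, 8000, 0, 0, b"node data")
write_block(buf, 0, 500, 0, b"way data")
write_block(buf, 0, 0, 20, b"relation data")
buf.seek(0)

way_payloads = list(read_way_blocks(buf))
print(way_payloads)
```

The blob is read and thrown away rather than seeked over, so the same
loop works on a non-seekable stream. That matches the trade-off above:
the IO cost stays, only the decompression and decoding cost goes away.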

In implementation terms, you'd want to add the string 'EntityCount'
into optional_features in the OSMHeader block, to identify files
containing entity counts so that readers can use the
optimization. We'd also have to define a protocol buffer for
containing index data, and code in osmosis to optionally add these
counts.
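
On the reader side, the check could look something like the sketch
below. Only the string 'EntityCount' and the optional_features field
come from the proposal above; the function should_decode and the idea
of passing the count as None when a block carries no index data are
hypothetical. The point is that a reader falls back to decoding
everything when the feature is absent, which is what keeps old files
and old readers working.

```python
ENTITY_COUNT_FEATURE = "EntityCount"


def should_decode(optional_features, index_way_count):
    """Decide whether a ways-only reader has to decode a block.

    optional_features: strings taken from the file's header block.
    index_way_count:   way count from the block's index data, or None
                       when the block carries no count (assumed convention).
    """
    if ENTITY_COUNT_FEATURE not in optional_features:
        return True  # file makes no promise about counts: decode everything
    if index_way_count is None:
        return True  # this block has no count: decode to be safe
    return index_way_count > 0


print(should_decode([], 0))
print(should_decode(["EntityCount"], 0))
print(should_decode(["EntityCount"], 500))
```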

One question, however: why? If you're doing this a lot, why isn't it
acceptable to have 3 files, one for nodes, one for ways, and one for
relations, each in PBF format? Is this feature important enough to
enough people that files should include entity counts by default? Can
you explain why you need it?

The real problem with this proposal, as well as Christian Vetter's
topological sort proposal and Frederik Ramm's idea of including version
numbers or URLs for incremental update support, is that osmosis doesn't
have the ability to push metadata like 'this is topologically sorted'
through its pipeline, which makes PBF support for encoding metadata not
very useful.

It would be really useful if XML-v0.61 included a tag-dictionary for the
entire map for storing this kind of metadata.

Scott