[OSM-dev] Storing way/relation start offset in PBF file

Wed Dec 1 03:48:37 GMT 2010

On Tue, Nov 30, 2010 at 5:47 PM, Jochen Topf <jochen at remote.org> wrote:
> On Tue, Nov 30, 2010 at 05:07:26PM -0600, Scott Crosby wrote:
>> On Tue, Nov 30, 2010 at 2:26 PM, Jochen Topf <jochen at remote.org> wrote:

>
>> One question however. Why? If you're doing this a lot, why isn't it
>> acceptable to have 3 files? One for nodes, one for ways, and one for
>> relations, each in PBF format? Is this really a feature that is
>> important enough to enough people that files should include
>> entity counts by default? Can you explain why you need this feature anyways?
>
> There are many cases where this would come in handy, for instance if you need
> all ways with tag foo=bar including their geometries. You can read all ways
> first, filter out the ones you need then go back to the nodes and read all nodes
> that you need to build the geometries for the ways you just extracted.

I don't think adding entity counts by default makes sense. Is it worth
adding ~7 bytes to each fileblock (~700kb increased filesize) in
every file for this one use case? Adding it as a non-default
command line option is possible. Including offsets at the end of
the file is much more concise and could be enabled by default, if a
workable way can be found. (see below)

> Of course I could have three files, but I don't. All the software we have
> creates a single file. Single-file pbfs are available for download etc. If
> I have to split them up first, the whole speed advantage is gone. And handling
> single files is much easier than multiple files anyway.

There's a way you could split it up a lot faster. There's no need to
decompress each block. Just copy the blocks that contain entities that
you're interested in.
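
A rough sketch of that kind of block-level copy, in Python. It relies
only on the PBF framing (a 4-byte big-endian BlobHeader length, the
BlobHeader itself, then a Blob of `datasize` bytes), so the Blob payload
is never decompressed. The `keep` predicate and the idea of deciding
from `indexdata` are assumptions made for illustration; how you learn
which blocks hold the entities you want is left open here.

```python
import io
import struct


def read_varint(buf, pos):
    """Decode one protobuf varint from buf at pos; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_blob_header(buf):
    """Extract type (field 1), indexdata (field 2) and datasize (field 3)."""
    pos = 0
    blob_type, indexdata, datasize = None, b"", None
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field, wire = key >> 3, key & 7
        if wire == 0:  # varint field
            value, pos = read_varint(buf, pos)
            if field == 3:
                datasize = value
        elif wire == 2:  # length-delimited field
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
            if field == 1:
                blob_type = value.decode("utf-8")
            elif field == 2:
                indexdata = value
        else:
            raise ValueError("unexpected wire type %d" % wire)
    return blob_type, indexdata, datasize


def copy_blocks(src, dst, keep):
    """Copy every fileblock for which keep(blob_type, indexdata) is true.

    The Blob payload is copied as raw bytes and never decompressed.
    Returns the number of blocks copied.
    """
    copied = 0
    while True:
        prefix = src.read(4)
        if len(prefix) < 4:  # end of file
            return copied
        (header_len,) = struct.unpack(">I", prefix)
        header = src.read(header_len)
        blob_type, indexdata, datasize = parse_blob_header(header)
        if keep(blob_type, indexdata):
            dst.write(prefix + header + src.read(datasize))
            copied += 1
        else:
            src.seek(datasize, io.SEEK_CUR)  # skip the Blob untouched
```

Something like `copy_blocks(src, dst, lambda t, idx: t == "OSMHeader" or idx == b"ways")`
would keep the header block plus any block tagged as ways, under the
made-up convention that a writer puts such a tag into `indexdata`.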

>
>> The real problem with this proposal, as well as Christian Vetter's
>> topological sort proposal, or Frederik Ramm's idea of including version
>> numbers or URL's for incremental update support, is that osmosis doesn't have
>> the ability to push metadata like 'this is topologically sorted'
>> through its pipeline, making PBF support for encoding metadata not very useful.
>
> My suggestion is not really metadata that has to move through Osmosis.

True, but pushing metadata would let osmosis mark the header of the
file with optional_feature flags indicating that the file had
the index data at the end, or entity counts, and it would be more
generically useful in lots of other circumstances.

> It has
> more to do with the file serialization than with the file contents.  It's just
> something that concerns the tasks reading and writing the pbf.  It has to
> remember those offsets and add them to the end of the file. Or, as you suggest,
> add those counts when writing out the data. That's independent of everything
> else happening in Osmosis.

I considered this option, but I couldn't come up with a workable
design. First, PBF files are designed to be concatenable, a property
that I want to retain. The problem is that if I write some sort of
trailing block to the file containing extra indexing information, such
as offsets, how do I determine the length of that block? A reader needs
to know that in order to seek to the right place to read it. If you
could find a non-gross way to make this work, I'd appreciate it.
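
To make the length problem concrete, here is a sketch of one
conventional workaround: a fixed-size footer at the very end of the
file that records how long the preceding index is. None of this is
part of PBF; the `PIDX` magic, the footer layout and both function
names are made up for illustration.

```python
import io
import struct

FOOTER = struct.Struct(">4sI")  # hypothetical: 4-byte magic + 4-byte index length
MAGIC = b"PIDX"                 # made-up marker, not defined by PBF


def append_index(stream, index_bytes):
    """Append index_bytes plus a fixed-size footer recording its length."""
    stream.seek(0, io.SEEK_END)
    stream.write(index_bytes)
    stream.write(FOOTER.pack(MAGIC, len(index_bytes)))


def read_index(stream):
    """Return the trailing index bytes, or None if the file has no footer."""
    size = stream.seek(0, io.SEEK_END)
    if size < FOOTER.size:
        return None
    stream.seek(size - FOOTER.size)
    magic, length = FOOTER.unpack(stream.read(FOOTER.size))
    if magic != MAGIC or length > size - FOOTER.size:
        return None
    # The footer has a known size, so the index start can be computed.
    stream.seek(size - FOOTER.size - length)
    return stream.read(length)
```

A reader only has to look at the last 8 bytes to find the index, but
the footer stops being at the end as soon as another file is
concatenated after it, so this does not by itself preserve
concatenability.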

Scott