[OSM-dev] Storing way/relation start offset in PBF file

Tue Nov 30 23:47:16 GMT 2010

On Tue, Nov 30, 2010 at 05:07:26PM -0600, Scott Crosby wrote:
> On Tue, Nov 30, 2010 at 2:26 PM, Jochen Topf <jochen at remote.org> wrote:
> > Sometimes it is useful to read ways before nodes or relations before ways or
> > nodes. With the XML format this is not really possible, but with the PBF
> > format it could be reasonably easy if we store the offsets in the file where
> > the way and relations start, respectively.
> >
> > If we write the offsets at the end of the file, we can still do streaming
> > write. When reading from a stream you have to read everything anyway; when
> > reading from a file, you can seek to the end and find out about the offsets
> > and then seek there and start reading the data.
> >
> > Is this something we can fit in the existing extension mechanism? If not, it's
> > not a big deal, but we can note it down as possible extension in case we'll do
> > a new version of the format someday.
> 
> Yes. It can be done. Someone requested that feature a few months ago.
> Was that you?

I don't think so, but it could be, I am not sure. :-)

> I can't quite do what you want, but I can get very close within the
> current design, in a backward and forward compatible manner. Each
> fileblock in the file has a compressed 'blob', and 'indexdata'.
> Indexdata is not compressed and can be read even before decompressing
> the blob data. If indexdata contains a protocol buffer containing a
> count of the number of each type of entity in the block, then your
> read-only-ways code simply skips past any block with waycount==0. It
> won't be as fast as random access, and it will burn the disk IO
> bandwidth to read every byte of the file, but you avoid the
> decompression and decoding time and retain full forward and backward
> compatibility with existing PBF software.
> 
> In implementation terms, you'd want to add the string 'EntityCount'
> into optional_features in the OSMHeader block, to identify files
> containing entity counts so that readers can use the
> optimization. We'd also have to define a protocol buffer for
> containing index data, and code in osmosis to optionally add these
> counts.

Ok.
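
To make sure I understand the skipping idea, here is a rough sketch in Python.
It is not the real PBF encoding: the FileBlock class, the JSON-encoded
indexdata and the count field names are all stand-ins I made up for the
protocol buffer that would still have to be defined. It only shows that a
reader can look at the uncompressed counts and skip blocks without
decompressing them.

```python
import json
import zlib
from dataclasses import dataclass


@dataclass
class FileBlock:
    # Hypothetical stand-in for a PBF fileblock, not the real wire format.
    indexdata: bytes  # uncompressed; here JSON-encoded entity counts
    blob: bytes       # zlib-compressed payload


def make_block(entities):
    """Build a block from a list of (kind, id) tuples and record the counts."""
    counts = {"nodes": 0, "ways": 0, "relations": 0}
    for kind, _ in entities:
        counts[kind + "s"] += 1
    return FileBlock(
        indexdata=json.dumps(counts).encode(),
        blob=zlib.compress(json.dumps(entities).encode()),
    )


def read_ways(blocks):
    """Return (way ids, number of blobs that had to be decompressed)."""
    way_ids, decompressed = [], 0
    for block in blocks:
        counts = json.loads(block.indexdata)
        if counts["ways"] == 0:
            continue  # every block is still visited, but this blob stays compressed
        decompressed += 1
        for kind, entity_id in json.loads(zlib.decompress(block.blob)):
            if kind == "way":
                way_ids.append(entity_id)
    return way_ids, decompressed


blocks = [
    make_block([("node", 1), ("node", 2)]),
    make_block([("node", 3), ("way", 10)]),
    make_block([("way", 11), ("relation", 20)]),
]
way_ids, decompressed = read_ways(blocks)
```

With these three blocks only two blobs get decompressed, although the reader
still walks over every block, as you say.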

> One question however. Why? If you're doing this a lot, why isn't it
> acceptable to have 3 files? One for nodes, one for ways, and one for
> relations, each in PBF format? Is this really a feature that is
> important enough to enough people that files should include
> entity counts by default? Can you explain why you need this feature anyway?

There are many cases where this would come in handy, for instance if you need
all ways with tag foo=bar including their geometries. You can read all ways
first, keep the ones you need, then go back to the nodes and read all nodes
that you need to build the geometries for the ways you just extracted.

My current problem is creating proper multipolygons. For that I want to read
all relations, store the ones containing a multipolygon in memory, then read
through all the ways to match them to the multipolygons that need them. It's
much cheaper to do it this way than the other way around, because there are
far fewer multipolygon relations than ways.
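
Roughly, the two passes look like the following Python sketch. The dict layout
for relations and ways and the function names are made up for illustration;
only the type=multipolygon tag is the usual tagging convention. The point is
that only the small relation index has to stay in memory while the ways are
streamed past it.

```python
def collect_multipolygons(relations):
    """Pass 1: keep only multipolygon relations, indexed by member way id."""
    wanted = {}  # way id -> ids of the multipolygon relations that need it
    for rel in relations:
        if rel["tags"].get("type") != "multipolygon":
            continue
        for way_id in rel["way_members"]:
            wanted.setdefault(way_id, []).append(rel["id"])
    return wanted


def match_ways(ways, wanted):
    """Pass 2: stream the ways and keep only those some multipolygon needs."""
    by_relation = {}  # relation id -> ids of the member ways that were found
    for way in ways:
        for rel_id in wanted.get(way["id"], ()):
            by_relation.setdefault(rel_id, []).append(way["id"])
    return by_relation


# Illustrative data only.
relations = [
    {"id": 100, "tags": {"type": "multipolygon"}, "way_members": [1, 2]},
    {"id": 101, "tags": {"type": "route"}, "way_members": [3]},
]
ways = [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}]

wanted = collect_multipolygons(relations)
matched = match_ways(ways, wanted)
```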

Of course I could have three files, but I don't. All the software we have
creates a single file. Single-file pbfs are available for download etc. If
I have to split them up first, the whole speed advantage is gone. And handling
single files is much easier than multiple files anyway.

> The real problem with this proposal, as well as Christian Vetter's
> topological sort proposal, or Frederik Ramm's idea of including version
> numbers or URLs for incremental update support, is that osmosis doesn't have
> the ability to push metadata like 'this is topologically sorted'
> through its pipeline, making PBF support for encoding metadata not very useful.

My suggestion is not really metadata that has to move through Osmosis.  It has
more to do with the file serialization than with the file contents.  It's just
something that concerns the tasks reading and writing the pbf.  The writing
task has to remember those offsets and add them to the end of the file. Or, as
you suggest, add those counts when writing out the data. That's independent of
everything else happening in Osmosis.
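
As a sketch of the offsets variant in Python: the writer counts bytes as it
streams the sections out and appends a small fixed-size trailer, and a reader
working on a real file seeks to the end first. The trailer layout (two
big-endian 64-bit offsets) and the function names are assumptions of mine,
nothing like this exists in the format today.

```python
import io
import struct

# Hypothetical trailer: way start offset, relation start offset.
TRAILER = struct.Struct(">QQ")


def write_file(out, nodes, ways, relations):
    """Streaming write: sections in order, then the offsets trailer."""
    written = 0
    offsets = []
    for section in (nodes, ways, relations):
        offsets.append(written)  # counted by hand, so out need not be seekable
        out.write(section)
        written += len(section)
    out.write(TRAILER.pack(offsets[1], offsets[2]))


def read_section(f, name):
    """Seek to the trailer at the end, then jump straight to one section."""
    f.seek(-TRAILER.size, io.SEEK_END)
    trailer_start = f.tell()
    way_offset, relation_offset = TRAILER.unpack(f.read(TRAILER.size))
    start, end = {
        "ways": (way_offset, relation_offset),
        "relations": (relation_offset, trailer_start),
    }[name]
    f.seek(start)
    return f.read(end - start)


# Dummy byte strings stand in for the encoded node, way and relation blocks.
buf = io.BytesIO()
write_file(buf, b"NNNN", b"WWW", b"RR")
relations_data = read_section(buf, "relations")
ways_data = read_section(buf, "ways")
```

The reader can fetch the relations before the ways without touching the node
section at all.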

Jochen
-- 
Jochen Topf  jochen at remote.org  http://www.remote.org/jochen/  +49-721-388298