[OSM-dev] Indexing of PBF files

Sun Feb 13 10:25:55 UTC 2022

Hi!

I think Andrew has already explained many issues quite well. I'll just
add a few things here.

On Sun, Feb 13, 2022 at 10:17:11AM +0100, codesoap--- via dev wrote:
> Andrew Byrd <andrew at fastmail.net> wrote:
> > It's definitely worth considering the kind of capabilities you're
> > describing, but in my opinion it does not make sense at this point to
> > make incremental changes to the PBF format, which is very reliably
> > doing its job as a stable baseline for bulk data exchange. It would
> > be more appropriate in my opinion to form a working group to draft a
> > next-generation binary data exchange format that has explicitly stated
> > design goals and learns from a decade of usage.
> 
> I do understand your concerns and agree that it would be confusing to
> have two "types of PBF" (indexed and unindexed or even just partially
> indexed). However, I also think it's not too bad, since we're not
> talking about two incompatible file formats; all existing tools would
> continue working as usual, but newer tools could be more efficient by
> using the "indexdata" field.

Any use of the indexdata field would only make sense if the data inside
the PBF file is sorted in some geometric fashion, so that you can look
up a bounding box or geohash or whatever and quickly find the right
block in the PBF file and read that. But that's currently not the case.
Basically all PBF files are sorted by id and there are good reasons for
that, too. So adding some kind of indexdata would necessarily make the
files incompatible for most practical purposes. Another problem is that
the PBF file format as it exists today is so efficient because of
reasonably large internal block sizes. But the larger those blocks are,
the more data you have to unpack before finding the one object you are
interested in. So even with an index, you'd probably unpack far more
data than needed, which costs time. You can make the block sizes smaller,
but that will increase the file size and, again, make the file format
less attractive for other users.
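
To make the trade-off concrete, here is a minimal sketch on synthetic
data. Everything in it is an assumption for illustration: the nodes are
random points whose ids have nothing to do with their position, the
10-degree grid ordering is a hypothetical spatial layout, and encode()
is a zlib-compressed text stand-in, not the real PBF encoding. It
compares an id-sorted file with grid-sorted files at two block sizes
and reports how much has to be unpacked for one small bounding box and
how large the "file" gets.

```python
import random
import zlib

def make_nodes(n, seed=0):
    """Synthetic nodes (id, lon, lat); ids are unrelated to position (assumption)."""
    rng = random.Random(seed)
    return [(i, rng.uniform(-180, 180), rng.uniform(-90, 90)) for i in range(n)]

def grid_key(node, cell=10.0):
    """Sort key grouping nodes into 10-degree grid cells (hypothetical layout)."""
    _, lon, lat = node
    return (int((lon + 180) // cell), int((lat + 90) // cell))

def block_bbox(block):
    """Bounding box of a block: the per-block entry a spatial index would hold."""
    lons = [lon for _, lon, _ in block]
    lats = [lat for _, _, lat in block]
    return (min(lons), min(lats), max(lons), max(lats))

def intersects(a, b):
    """True if two (min_lon, min_lat, max_lon, max_lat) boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def encode(block):
    """Stand-in for one compressed block: plain text run through zlib."""
    text = "\n".join(f"{i} {lon:.7f} {lat:.7f}" for i, lon, lat in block)
    return zlib.compress(text.encode("ascii"))

def evaluate(nodes, block_size, query):
    """Split nodes into blocks and report the cost of answering a bbox query."""
    blocks = [nodes[i:i + block_size] for i in range(0, len(nodes), block_size)]
    hits = [b for b in blocks if intersects(block_bbox(b), query)]
    return {
        "blocks_read": len(hits),
        "nodes_unpacked": sum(len(b) for b in hits),
        "file_bytes": sum(len(encode(b)) for b in blocks),
    }

nodes = make_nodes(40000)
query = (10.0, 45.0, 15.0, 50.0)  # a small bounding box

by_id = evaluate(sorted(nodes), 8000, query)                     # sorted by id
by_cell_large = evaluate(sorted(nodes, key=grid_key), 8000, query)
by_cell_small = evaluate(sorted(nodes, key=grid_key), 250, query)

for name, result in [("id order, 8000/block", by_id),
                     ("grid order, 8000/block", by_cell_large),
                     ("grid order, 250/block", by_cell_small)]:
    print(name, result)
```

In this toy setup the id-sorted layout has to read every block, the
grid-sorted layout with large blocks still unpacks thousands of nodes
for a handful of results, and the small-block variant unpacks less but
produces a larger total size. Real OSM data and the real PBF encoding
will give different numbers.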

Also, a geometry index is only one way of organizing the data; for some
queries you would want to organize it by tags somehow. So at best you
are only providing half a solution.
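
A small sketch of why a spatial layout alone does not help tag queries.
The data is synthetic and the assumptions are mine: tags are assigned
uniformly at random, the 10-degree grid ordering is a hypothetical
layout, and tag_index is a separate inverted index built only for this
example.

```python
import random
from collections import defaultdict

rng = random.Random(1)
TAGS = ["amenity=bench", "highway=crossing", "natural=tree", "barrier=gate"]

# Synthetic nodes (lon, lat, tag); tags are assigned uniformly at random (assumption).
nodes = [(rng.uniform(-180, 180), rng.uniform(-90, 90), rng.choice(TAGS))
         for _ in range(10000)]

# Lay the nodes out spatially: group them into 10-degree grid cells.
nodes.sort(key=lambda n: (n[0] // 10, n[1] // 10))
blocks = [nodes[i:i + 500] for i in range(0, len(nodes), 500)]

# A separate inverted index: tag -> set of block numbers containing it.
tag_index = defaultdict(set)
for block_no, block in enumerate(blocks):
    for _, _, tag in block:
        tag_index[tag].add(block_no)

bench_blocks = tag_index["amenity=bench"]
print(f"{len(bench_blocks)} of {len(blocks)} blocks contain amenity=bench")
```

With the file ordered by location, a common tag shows up in every
block, so a per-block tag index cannot skip anything. Making tag
queries cheap would need a different ordering or per-object indexes,
which is a different trade-off again.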

> OSMExpress seems to solve the indexing problem, but also seems like
> overkill for small tools that only run on one's laptop. The *.osmx files
> seem to be about 10x as large as PBF files (600GB for the planet), it

That's the price you pay for useful indexes and quick access. There is
no one-size-fits-all solution here. Either you have a specific use case
in mind, in which case you can optimize for that, or you want a more
generic tool (which is probably interesting for more people), in which
case you need more RAM, processing power or whatever.

When Scott invented the PBF format, OSM was new and nobody knew where
the evolution would go, so he added that indexdata field for future
compatibility. But it turned out not to be that useful, so nobody used
it. PBF is a format for storing and moving around OSM data very
efficiently.

You wrote in an answer to Yuri's mail:
> I'm afraid, though, that I don't understand how this relates to my
> problem. I'm neither qualified nor interested in a discussion about the
> data model. What I'm interested in, is a file format for storing the
> data with an index, so that tools can work with such files efficiently
> without needing to index or transform (e.g. with osm2pgsql) it first.

And maybe that's the problem here. If you are not willing to understand the
OSM data model in detail, it is difficult to propose a better encoding.
I encourage you to delve into the details here and try something out.
Maybe you'll come up with a working solution. But having an index isn't
just a magic thing you put onto the data and it just works. All those
details matter.

Jochen
-- 
Jochen Topf  jochen at remote.org  https://www.jochentopf.com/  +49-351-31778688