[OSM-dev] Indexing of PBF files

Sun Feb 13 12:02:48 UTC 2022

Hi Jochen,
thanks for taking the time to tell me about your view on the matter!

Jochen Topf <jochen at remote.org> wrote:
> Any use of the indexdata field would only make sense if the data
> inside the PBF file is sorted in some geometric fashion [...] But
> that's currently not the case. Basically all PBF files are sorted
> by id and there are good reasons for that, too. So adding some kind
> of indexdata would necessarily make the files incompatible for most
> practical purposes.

It's probably because I'm not familiar enough with the software around
PBF files, but I'm not aware of the reasons for sorting by ID. I
thought most applications that use PBF files today just use them as a
transfer medium and don't care about the order of the OSM entities
inside, because they read the whole file anyway.

> Another problem is that the PBF file format as it exists today is
> so efficient because of reasonably large internal block sizes. But
> the larger those blocks are, the more data you have to unpack before
> finding the one object you are interested in. So even with an index,
> you'd probably unpack far more data than needed, which cost time.

Of course I would still have to decode more OSM entities than I'm
interested in, but the specification defines a maximum (uncompressed)
block size of 32 MiB. Reading a blob of this size might be slower than
would be ideal for my use case, but it's still practicable.
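
To illustrate what I have in mind, here is a rough Python sketch (not
a tested reader) that walks the blob framing of a PBF file and skips
every blob body without decompressing it. It assumes the framing and
BlobHeader field numbers as I understand them from the wiki: a 4-byte
big-endian header length, then a BlobHeader with type = 1,
indexdata = 2 and datasize = 3. The function names are my own, and
since no layout for "indexdata" is defined, the sketch only exposes
the raw bytes.

```python
import struct


def read_varint(buf, pos):
    """Decode a protobuf varint from buf at pos; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_blob_header(buf):
    """Extract type, indexdata and datasize from a serialized BlobHeader.

    Assumed field numbers: type = 1, indexdata = 2, datasize = 3.
    """
    fields = {"type": None, "indexdata": None, "datasize": None}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_no, wire_type = key >> 3, key & 7
        if wire_type == 0:  # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited (string or bytes)
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"unexpected wire type {wire_type}")
        if field_no == 1:
            fields["type"] = value.decode("utf-8")
        elif field_no == 2:
            fields["indexdata"] = value
        elif field_no == 3:
            fields["datasize"] = value
    return fields


def iter_blob_headers(stream):
    """Yield (blob_offset, header_fields) per blob, seeking past each body."""
    while True:
        raw_len = stream.read(4)
        if len(raw_len) < 4:
            return  # end of file
        (header_len,) = struct.unpack(">I", raw_len)
        header = parse_blob_header(stream.read(header_len))
        blob_offset = stream.tell()
        yield blob_offset, header
        # Skip the (possibly compressed) blob body without reading it.
        stream.seek(blob_offset + header["datasize"])
```

With something like this, a reader could inspect header["indexdata"]
for each blob and only seek to blob_offset and decompress the blobs it
actually needs. Whether that helps depends entirely on how the data is
sorted, which is exactly the point you raised.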

> Also the geometry index is only one way of organizing the data, for some
> queries you would want to organize it by tags somehow. So at best you
> are only providing half a solution.

Fair point. I would still think that a geographic index is by far the
most common use case, but I can see that choosing one index over
another is not quite "clean".

> There is no one-size-fits-all solution here.

A one-size-fits-most would be acceptable to me, but maybe I am just too
uninformed in thinking that a geographic index is the most common
requirement.

> You wrote in an answer to Yuris mail:
> > I'm afraid, though, that I don't understand how this relates to my
> > problem. I'm neither qualified nor interested in a discussion about the
> > data model. [...]
>
> And maybe that's the problem here. If you are not willing to understand the
> OSM data model in detail, it is difficult to propose a better encoding.
> I encourage you to delve into the details here and try something out.
> Maybe you come up with a working solution. But having an index isn't
> just a magic thing you put onto the data and it just works. All those
> details matter.

I'm afraid I'm not motivated enough to try and develop a new file
format. I just thought that a geographic index for PBF files might still
be on someone's to-do list, because the wiki still says that such an
index could be implemented in the future.

I can now see that a lot of the community seems to think that the
"indexdata" field was just a bad idea overall and I won't try to force
anything.

I guess my takeaway is that I shouldn't wait for indexed PBF files
to come around, and should read up on the Overpass API or something
similar instead.

- Richard