[OSM-dev] Indexing of PBF files

Mon Feb 14 09:21:48 UTC 2022

Hi Richard,

> On 13 Feb 2022, at 20:02, codesoap--- via dev <dev at openstreetmap.org> wrote:
> It's probably because I'm not familiar enough with the software around
> PBF files, but I'm not aware of the reasons for the sorting by ID. I
> thought most applications that use PBF files today just used it as a
> transfer medium and wouldn't care about the order of the OSM entities
> inside, because they read the whole file anyway.

I don’t know the original reasoning behind ordering the entities by ID, but I have always assumed it had to do with efficiently replicating databases. It should be possible to construct an index on the IDs much more efficiently if the elements are encountered in a specific order, particularly if the receiver knows they are arriving in that order. In many cases the data will also end up in the underlying data store in the order it arrives, which could have implications for memory and disk access efficiency (though that should also be true for spatial ordering).
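A rough sketch of why a known ID order makes indexing cheap, assuming a hypothetical layout in which the file is split into blocks and each block is just a list of entity IDs (this is an illustration, not the PBF block structure or any existing tool's code): if the IDs are globally sorted, remembering only the first ID of each block is enough to find the one block that can contain a given ID.

```python
from bisect import bisect_right

def build_sparse_index(blocks):
    """Record the first ID of each block.

    Only valid under the assumption that IDs are sorted across all blocks.
    """
    return [block[0] for block in blocks]

def find_block(first_ids, wanted_id):
    """Return the index of the only block that could hold wanted_id, or None."""
    pos = bisect_right(first_ids, wanted_id) - 1
    return pos if pos >= 0 else None

# Hypothetical blocks of entity IDs, sorted by ID as in a standard extract.
blocks = [[1, 4, 7], [9, 12, 15], [20, 21, 30]]
first_ids = build_sparse_index(blocks)

block_no = find_block(first_ids, 12)
found = block_no is not None and 12 in blocks[block_no]
```

With unsorted IDs the same lookup would need either a full scan or one index entry per entity, rather than one per block.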

Additionally, IDs are one-dimensional so they provide a single unambiguous order. Ordering spatially means flattening two or more dimensions into one dimension using a strategy like a space-filling curve, of which there are many. So data produced using one system’s spatial ordering would be in the wrong order for other receiving systems unless everyone could agree on a single standard system.
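To illustrate the ambiguity, here is a small sketch that flattens the same grid cells in two different ways: a Z-order (Morton) curve and a plain row-major scan. The grid, its size, and the function names are assumptions made for the example; neither ordering is something the PBF format prescribes.

```python
def morton(x, y, bits=16):
    """Interleave the bits of x and y into a single Z-order key."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

def row_major(x, y, width=4):
    """Flatten by scanning each row left to right."""
    return y * width + x

# The same eight cells of a hypothetical 4x2 grid.
cells = [(x, y) for y in range(2) for x in range(4)]

by_morton = sorted(cells, key=lambda c: morton(*c))
by_rows = sorted(cells, key=lambda c: row_major(*c))

# Same data, two incompatible "spatial" orders.
orders_differ = by_morton != by_rows
```

A receiver expecting one of these orders gains nothing from data delivered in the other, whereas sorting by ID leaves no such choice to disagree on.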

Jochen, you stated that there are good reasons why it’s standard to sort PBF files by ID. For future reference, can you confirm the reasons?

> Fair point. I would still think that a geographic index is by far the
> most common use-case, but I can see, that choosing one index over
> another is not quite "clean".

Geographic grouping is certainly useful and common, but entities have primary keys (IDs), so an index on those keys implicitly has the highest priority, especially for the highly general use case of replicating a whole database.

> I can now see that a lot of the community seems to think that the
> "indexdata" field was just a bad idea overall and I won't try to force
> anything.

I don’t think it’s a "bad idea". It’s just one that was not adopted by most users, because the format is commonly used for bulk data transfer, with the data then restructured for a specific use case in a subsequent stage of the pipeline. You could reorder the entities and produce PBF data with information in the indexdata field, and it might work for your use case. But it probably wouldn't catch on widely, because with a similar amount of processing the data could be loaded into any of the other tools mentioned, which are tailor-made for efficiently querying subsets of the data.

Regards,
Andrew