[OSM-dev] Indexing of PBF files

Sun Feb 13 09:17:11 UTC 2022

Hi Andrew,
thanks for your very thorough response!

Andrew Byrd <andrew at fastmail.net> wrote:
> But on the social side, PBF in its current form is a great success in
> that it's supported by a wide range of OSM tools, achieving a high
> degree of interoperability. Extending it with new features could undo
> the benefits of a decade of stability: near-total integration of the
> OSM toolchain achieved by all tools sharing a single stable language.

I absolutely agree that a stable file format is very important. I
am, however, not suggesting any change to the defined file format; I
should probably have made this clear in my first mail: The (optional)
"indexdata" field in the BlobHeader has been defined from the beginning.
You can find it described in the wiki [1]. The author even suggests
storing a bounding box for the blob in this field.

AFAIK the "indexdata" field is currently not used by any tool, but every
tool that has implemented a correct PBF encoder/decoder will be able to
deal with this field without any changes. Most tools will probably just
ignore it.
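
To illustrate why existing decoders should be unaffected, here is a
minimal Go sketch that encodes and decodes a BlobHeader by hand. It
assumes the three fields listed in the wiki [1] (type = 1,
indexdata = 2, datasize = 3) and the standard protobuf wire format. The
function names and the placeholder "indexdata" payload are made up for
this example; a real tool would normally use generated protobuf code.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// BlobHeader mirrors the three fields of the PBF BlobHeader message.
type BlobHeader struct {
	Type      string // field 1, required
	IndexData []byte // field 2, optional; nil when absent
	DataSize  int32  // field 3, required
}

// appendUvarint appends v in protobuf varint encoding.
func appendUvarint(buf []byte, v uint64) []byte {
	var tmp [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(tmp[:], v)
	return append(buf, tmp[:n]...)
}

// appendBytesField appends a length-delimited (wire type 2) field.
func appendBytesField(buf []byte, field int, data []byte) []byte {
	buf = appendUvarint(buf, uint64(field<<3|2))
	buf = appendUvarint(buf, uint64(len(data)))
	return append(buf, data...)
}

// encodeBlobHeader writes indexdata only when it is set.
func encodeBlobHeader(h BlobHeader) []byte {
	var buf []byte
	buf = appendBytesField(buf, 1, []byte(h.Type))
	if h.IndexData != nil {
		buf = appendBytesField(buf, 2, h.IndexData)
	}
	buf = appendUvarint(buf, uint64(3<<3)) // field 3, wire type 0 (varint)
	return appendUvarint(buf, uint64(h.DataSize))
}

// decodeBlobHeader walks the fields one by one. A decoder that does not
// care about indexdata would simply drop the "case 2" branch below and
// still read type and datasize correctly.
func decodeBlobHeader(buf []byte) (BlobHeader, error) {
	var h BlobHeader
	for len(buf) > 0 {
		key, n := binary.Uvarint(buf)
		if n <= 0 {
			return h, errors.New("bad field key")
		}
		buf = buf[n:]
		field, wire := key>>3, key&7
		switch wire {
		case 0: // varint
			v, n := binary.Uvarint(buf)
			if n <= 0 {
				return h, errors.New("bad varint")
			}
			buf = buf[n:]
			if field == 3 {
				h.DataSize = int32(v)
			}
		case 2: // length-delimited
			l, n := binary.Uvarint(buf)
			if n <= 0 || l > uint64(len(buf)-n) {
				return h, errors.New("bad length")
			}
			data := buf[n : n+int(l)]
			buf = buf[n+int(l):]
			switch field {
			case 1:
				h.Type = string(data)
			case 2:
				h.IndexData = data
			}
		default:
			return h, fmt.Errorf("unsupported wire type %d", wire)
		}
	}
	return h, nil
}

func main() {
	plain := encodeBlobHeader(BlobHeader{Type: "OSMData", DataSize: 1234})
	indexed := encodeBlobHeader(BlobHeader{
		Type:      "OSMData",
		IndexData: []byte("placeholder"), // hypothetical payload
		DataSize:  1234,
	})
	for _, raw := range [][]byte{plain, indexed} {
		h, err := decodeBlobHeader(raw)
		if err != nil {
			panic(err)
		}
		fmt.Printf("type=%s datasize=%d indexdata=%d bytes\n",
			h.Type, h.DataSize, len(h.IndexData))
	}
}
```

Both headers decode to the same type and datasize; the only difference
is whether the optional field is present.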

> It's definitely worth considering the kind of capabilities you're
> describing, but in my opinion it does not make sense at this point to
> make incremental changes to the PBF format, which is very reliably
> doing its job as a stable baseline for bulk data exchange. It would
> be more appropriate in my opinion to form a working group to draft a
> next-generation binary data exchange format that has explicitly stated
> design goals and learns from a decade of usage.

I do understand your concerns and agree that it would be confusing to
have two "types of PBF" (indexed and unindexed or even just partially
indexed). However, I also think it's not too bad, since we're not
talking about two incompatible file formats; all existing tools would
continue working as usual, but newer tools could be more efficient by
using the "indexdata" field.
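
As a rough sketch of what "more efficient" could look like in Go: a
reader checks the "indexdata" bytes of each blob against a query
bounding box and only decompresses blobs that might overlap. The 16-byte
layout used here (four big-endian int32 values in units of 1e-7 degrees)
is purely an assumption for illustration; the PBF format does not define
the contents of "indexdata", and all names below are hypothetical.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// BBox is a bounding box in units of 1e-7 degrees (hypothetical layout).
type BBox struct{ MinLon, MinLat, MaxLon, MaxLat int32 }

// encodeBBox packs the box into 16 big-endian bytes. This layout is an
// assumption made for this example, not part of any specification.
func encodeBBox(b BBox) []byte {
	buf := make([]byte, 16)
	binary.BigEndian.PutUint32(buf[0:], uint32(b.MinLon))
	binary.BigEndian.PutUint32(buf[4:], uint32(b.MinLat))
	binary.BigEndian.PutUint32(buf[8:], uint32(b.MaxLon))
	binary.BigEndian.PutUint32(buf[12:], uint32(b.MaxLat))
	return buf
}

// decodeBBox returns false if the bytes do not match the assumed layout.
func decodeBBox(buf []byte) (BBox, bool) {
	if len(buf) != 16 {
		return BBox{}, false
	}
	return BBox{
		MinLon: int32(binary.BigEndian.Uint32(buf[0:])),
		MinLat: int32(binary.BigEndian.Uint32(buf[4:])),
		MaxLon: int32(binary.BigEndian.Uint32(buf[8:])),
		MaxLat: int32(binary.BigEndian.Uint32(buf[12:])),
	}, true
}

// intersects reports whether two boxes overlap. Boxes crossing the
// antimeridian are not handled in this sketch.
func (b BBox) intersects(o BBox) bool {
	return b.MinLon <= o.MaxLon && o.MinLon <= b.MaxLon &&
		b.MinLat <= o.MaxLat && o.MinLat <= b.MaxLat
}

// mustRead reports whether a blob has to be decoded for the query.
// Blobs without usable indexdata are always read, so unindexed and
// partially indexed files still give correct results, just more slowly.
func mustRead(indexdata []byte, query BBox) bool {
	box, ok := decodeBBox(indexdata)
	if !ok {
		return true
	}
	return box.intersects(query)
}

func main() {
	query := BBox{MinLon: 70000000, MinLat: 507000000, MaxLon: 72000000, MaxLat: 508000000}
	nearby := encodeBBox(BBox{MinLon: 60000000, MinLat: 500000000, MaxLon: 80000000, MaxLat: 510000000})
	farAway := encodeBBox(BBox{MinLon: -1200000000, MinLat: 300000000, MaxLon: -1100000000, MaxLat: 400000000})

	fmt.Println("nearby blob:   ", mustRead(nearby, query))
	fmt.Println("far-away blob: ", mustRead(farAway, query))
	fmt.Println("unindexed blob:", mustRead(nil, query))
}
```

The fallback in mustRead is the important design choice: a tool written
this way treats the index as a hint only, so it keeps working on files
produced by encoders that never populate the field.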

Creating an entirely new file format where the index data is not
optional, its format clearly defined and its design more efficient
than the "indexdata" field would be more "clean", but its adoption
would probably take years, if it happens at all.

I think it's a trade-off between cleanliness and practicality. Right now
I'm leaning towards "practicality"/populating the "indexdata" field in
PBF files.

> Addressing the practical problem at hand: some of the capabilities
> you're looking for are not necessarily in scope for a bulk data
> exchange format, and may be better provided by software that loads
> and post-processes bulk data, maintaining separation of concerns.
> Loading the data into a database is a reasonable approach, but
> relational databases are very general-purpose and manipulating
> OSM data with them is often not space or time efficient. A
> database tailored to the specific characteristics and use cases
> of OSM data can do this much more efficiently, and may even be
> simplified down to an in-process, single-file indexing tool.
> Such tools already exist: in the client-server model with a
> separate process you have https://github.com/drolbr/Overpass-API,
> and in the single-file embedded indexing model you have
> https://github.com/protomaps/OSMExpress.

Thanks for pointing me towards these two projects!

OSMExpress seems to solve the indexing problem, but also seems like
overkill for small tools that only run on one's laptop. The *.osmx files
seem to be about 10x as large as PBF files (600GB for the planet), it
has a lot of big dependencies (libosmium, lmdb, s2 geometry, ...) and is
only available for Python and C++ (I want to use Go). I don't need every
last bit of performance, transactional updates or concurrent access.

The Overpass-API seems more attractive to me, since I could keep my
tool/client nice and simple. I was not aware that there are free and
publicly available Overpass-APIs, so that's a big plus.

I think I'll take a closer look at the Overpass-API, but I still view
this as a workaround, since it makes me dependent on an internet
connection and on the provider of the API (because I don't want to
burden the users of my tool with setting up their own Overpass-API).

> As I mentioned I have long intended to set up a working group on file
> formats - anyone please feel free to contact me off-list if you're
> interested and I will plan to post any updates on the osm-dev list. I
> have included some links below to past work on this subject.

I'd really like to see this happen. I'll keep watching the osm-dev
list for updates.

- Richard

[1] https://wiki.openstreetmap.org/wiki/PBF_Format#File_format