[OSM-dev] Indexing of PBF files

Andrew Byrd andrew at fastmail.net
Sun Feb 13 07:04:31 UTC 2022


Hi Richard,

For context, I have been using PBF files in my everyday work since the format emerged, and have written several pieces of software that produce and consume them. It's true that the format has some quirks and is not conducive to random access, so I've spent quite a bit of time thinking through what a next-generation binary data exchange format would look like, and have implemented prototypes and presented them at State of the Map and on mailing lists for community feedback etc.

So I'm someone who is aware of the internals of PBF and very interested in how it could evolve. And yet I would emphasize that a format or standard has both technical and social features or characteristics. On the technical side, we can think of many optimizations, simplifications, or improvements for particular use cases. But on the social side, PBF in its current form is a great success in that it's supported by a wide range of OSM tools, achieving a high degree of interoperability. Extending it with new features could undo the benefits of a decade of stability: near-total integration of the OSM toolchain achieved by all tools sharing a single stable language. This is a killer feature for any format, a characteristic not in the format itself, but distributed among every producing and consuming application and embedded in the skills of every technical person, software developer or integrator that uses PBF.

It's definitely worth considering the kind of capabilities you're describing, but in my opinion it does not make sense at this point to make incremental changes to the PBF format, which is very reliably doing its job as a stable baseline for bulk data exchange. It would be more appropriate in my opinion to form a working group to draft a next-generation binary data exchange format that has explicitly stated design goals and learns from a decade of usage. This would be a gradual process of building prototypes and getting a limited group of people to use converted data in production long enough that the design could be iteratively refined.

Addressing the practical problem at hand: some of the capabilities you're looking for are not necessarily in scope for a bulk data exchange format, and may be better provided by software that loads and post-processes bulk data, maintaining separation of concerns. Loading the data into a database is a reasonable approach, but relational databases are very general-purpose, and manipulating OSM data with them is often not space or time efficient. A database tailored to the specific characteristics and use cases of OSM data can do this much more efficiently, and may even be simplified down to an in-process, single-file indexing tool. Such tools already exist: in the client-server model with a separate process you have https://github.com/drolbr/Overpass-API, and in the single-file embedded indexing model you have https://github.com/protomaps/OSMExpress. I particularly like OSMExpress - it is reminiscent of prototypes I've built in the past, but has been carried through to production quality. It may be what you're looking for. I've discussed it at length with the author Brandon, and am convinced that it's very well thought out. You can try out the spatial indexing and extraction at https://protomaps.com/downloads/osm.

It's reasonable to consider having this kind of spatial indexing or random access as a standard feature of bulk data transfer formats, but from one perspective this just offloads the indexing workload onto the people producing the bulk data instead of those consuming it. Since not everyone needs the spatial index, and many people are just using PBF to replicate bulk data into their own relational databases, it seems reasonable for the consumer to perform whatever indexing its particular use case requires.
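To make the consumer-side option concrete, here is a toy sketch of the idea (Python, all names invented for illustration; it deliberately does not parse real PBF): one sequential pass records a small table of (byte offset, bounding box) entries per block, and later spatial queries seek only to the blocks whose extent intersects the query region.

```python
# Toy sketch of consumer-side spatial indexing over block offsets.
# In a real tool the index pass would decode each PBF block once to
# compute its bbox; here the (offset, bbox) pairs are simply given.

from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (min_lon, min_lat, max_lon, max_lat)

@dataclass
class BlockEntry:
    offset: int  # byte offset of the block in the file
    bbox: BBox   # spatial extent of the entities in the block

def intersects(a: BBox, b: BBox) -> bool:
    # Axis-aligned rectangle intersection test.
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def build_index(blocks: List[Tuple[int, BBox]]) -> List[BlockEntry]:
    return [BlockEntry(off, bb) for off, bb in blocks]

def blocks_for_query(index: List[BlockEntry], query: BBox) -> List[int]:
    # Offsets of blocks worth reading; a consumer would seek to just these.
    return [e.offset for e in index if intersects(e.bbox, query)]
```

The index is built once per downloaded file and can be persisted next to it; the point is that this cost stays with the consumer who actually needs it, rather than being baked into the exchange format.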

As I mentioned I have long intended to set up a working group on file formats - anyone please feel free to contact me off-list if you're interested and I will plan to post any updates on the osm-dev list. I have included some links below to past work on this subject. 

Regards,
Andrew

Links: I wrote two blog articles summarizing my design study. In the first [1] I focused a little too heavily on compactness, and in the second post [2] I revised my charts after correcting some overly aggressive truncation of coordinates. I hosted a workshop at SOTM 2017 on the subject of "Rethinking binary map data exchange formats" [3] and got generally positive responses. I do not think this workshop was recorded, but the slide deck is published online [4]. There were follow-up discussions on this mailing list [7] and [8]. Implementations of reading and writing the described format are available in Conveyal open source projects, and have been used in production systems at some points.

[1] https://blog.conveyal.com/simpler-openstreetmap-data-exchange-formats-6d43be5230e8
[2] https://blog.conveyal.com/simpler-openstreetmap-data-exchange-part-ii-82afa8bdb01
[3] https://2017.stateofthemap.org/2017/rethinking-binary-map-data-exchange-formats/
[4] https://speakerdeck.com/sotm2017/day3-1515-rethinking-binary-map-data-exchange-formats
[7] https://lists.openstreetmap.org/pipermail/dev/2015-April/028546.html
[8] https://lists.openstreetmap.org/pipermail/dev/2016-February/029057.html



> On 13 Feb 2022, at 06:30, codesoap--- via dev <dev at openstreetmap.org> wrote:
> 
> Nick Stallman wrote:
>> Adding an index [to PBF files] seems like a logical step which would
>> reduce processing times for many common operations drastically.
> 
> I was just thinking about the same thing and was wondering if there is
> any progress on this, now that another 3+ years have passed. I haven't
> found any resources on the topic other than this e-mail; has there been
> some discussion elsewhere that I didn't find?
> 
> I'm sad to see you only received pushback against your proposal. IMO
> the benefits of an index would far outweigh the drawback of a few
> bytes of index data for each BlobHeader. I think such an index could
> lay the foundation for a whole new class of applications; for example:
> applications that run on the end user's hardware, are easy to set up and
> use, yet performant and with low disk usage. Applications like this are
> currently just not feasible.
> 
> I'll give a real-world example: I'm the author of
> github.com/codesoap/osmf. It's a simple command line tool that I use
> to find OSM entities within a specified circle and with specified tag
> values. I use it for simple, everyday tasks, like finding a bakery,
> restaurant, pharmacy, etc. in my vicinity. I would like to just give
> the tool a PBF file, in which it looks for results. I tried doing this
> with github.com/qedus/osmpbf and sachsen-latest.osm.pbf (208MB), but
> found that this takes ~20s on my laptop. This is too slow to be
> practical, even with this rather small PBF file.
> 
> Thus, what I'm doing right now with osmf is to first import the PBF
> file into a PostgreSQL database and then query this database instead of
> the PBF file. This has several major drawbacks:
> 1. Setting up osmf is much more complicated than just downloading a PBF
>   file. This is annoying for me, but probably also scares away users.
> 2. Importing the PBF file takes a long time; ca. 25 minutes for
>   sachsen-latest.osm.pbf (208MB) on my laptop.
> 3. The database takes up much more disk space than the PBF file; ~5.7GB
>   in this case.
> 
>> An initial thought would be to sort the input pbf file by geohash so
>> each PBF Blob has its own unique geohash.
> 
> This sounds like a cool idea, but what I'm missing in this proposal
> is the ability to determine the bounding box of the blob from it.
> Without the bounding box, I'm unable to determine whether a blob can
> be discarded for my search or not. I would at least need the width and
> height (in degrees) of the blob in addition to the geohash.
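One clarifying note on the above: a geohash of a given length does determine a cell bounding box, decodable from the string alone with the standard algorithm; the real gap is that nothing forces a blob's contents to stay inside its cell, so the blob's actual extent would still need to be stored. A minimal decoder sketch (standard geohash bit interleaving, Python for illustration):

```python
# Decode a geohash string to its cell bounds. Geohash interleaves bits
# between longitude and latitude, starting with longitude, 5 bits per
# base-32 character.

GEOHASH32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_cell(gh: str):
    """Return (min_lat, max_lat, min_lon, max_lon) of the cell for `gh`."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    even = True  # True = this bit refines longitude
    for ch in gh:
        val = GEOHASH32.index(ch)
        for bit in (16, 8, 4, 2, 1):
            if even:
                mid = (lon_lo + lon_hi) / 2
                if val & bit:
                    lon_lo = mid
                else:
                    lon_hi = mid
            else:
                mid = (lat_lo + lat_hi) / 2
                if val & bit:
                    lat_lo = mid
                else:
                    lat_hi = mid
            even = not even
    return (lat_lo, lat_hi, lon_lo, lon_hi)
```

For example, the single-character geohash "u" decodes to the cell covering latitudes 45..90 and longitudes 0..45, which is why longer hashes pin down progressively smaller cells.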
> 
> However, even with a width and height, I don't think sorting the blobs
> by geohash provides any benefit. I still cannot do a binary search over
> the geohashes to find the blob(s) for my area of interest, because the
> blobs could have different (geospatial) sizes. For example, there could
> be many small blobs and then a big one spanning the whole globe,
> containing labels for all capitals or similar.
> 
> Thus I think a regular old bounding box would do the job (as suggested
> in [1]). I think this would still be OK-ish performance-wise. If we had
> a 62GB PBF file of the whole planet with an average blob size of 8MB, it
> would still "only" take ~8k seek operations to jump through all blobs.
> I don't know too much about disk performance, but if I understand IOPS
> correctly, this should take ~0.1s on a simple SSD.
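The seek-cost estimate above holds up to quick arithmetic; the IOPS figure below is an assumed ballpark for a commodity SSD, not a measurement:

```python
# Back-of-the-envelope check of the "~8k seeks in ~0.1s" estimate.
# Assumed figures: 62 GB planet file, 8 MB average blob, and roughly
# 80,000 random-read IOPS for a commodity SATA/NVMe SSD (devices vary).

planet_bytes = 62e9
blob_bytes = 8e6
ssd_iops = 80_000  # assumption, not a measurement

n_blobs = planet_bytes / blob_bytes  # = 7750 blobs, i.e. ~8k headers
seek_time_s = n_blobs / ssd_iops     # ~= 0.097 s to touch every header
```

On a spinning disk at ~100 seeks per second the same scan would take over a minute, so the estimate is sensitive to the storage assumption.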
> 
> Of course, it would be even better for performance to have a separate
> index from the blobs, but unfortunately this is not possible with the
> defined PBF file format.
> 
> Greetings,
> Richard Ulmer
> 
> 
> [1] https://wiki.openstreetmap.org/wiki/PBF_Format#File_format
> 
> 
