[OSM-dev] Indexing of PBF files

codesoap at mailbox.org
Sun Feb 13 06:49:16 UTC 2022


(resending this E-Mail because my previous one wasn't threaded correctly)

Hi,

Nick Stallman wrote:
> Adding an index [to PBF files] seems like a logical step which would
> reduce processing times for many common operations drastically.

I was just thinking about the same thing and was wondering if there has
been any progress on this, now that another 3+ years have passed. I
haven't found any resources on the topic other than this E-Mail; has
there been some discussion elsewhere that I didn't find?

I'm sad to see you only received pushback against your proposal. IMO
the benefits of an index would far outweigh the drawback of a few
bytes of indexdata in each BlobHeader. I think such an index could
lay the foundation for a whole new class of applications; for example:
applications that run on the end user's hardware, are easy to set up
and use, yet performant and with low disk usage. Applications like this
are currently just not feasible.

I'll give a real-world example: I'm the author of
github.com/codesoap/osmf. It's a simple command-line tool that I use
to find OSM entities within a specified circle and with specified tag
values. I use it for simple, everyday tasks like finding a bakery,
restaurant, pharmacy, etc. in my vicinity. I would like to just give
the tool a PBF file, in which it looks for results. I tried doing this
with github.com/qedus/osmpbf and sachsen-latest.osm.pbf (208MB), but
found that this takes ~20s on my laptop. This is too slow to be
practical, even with this rather small PBF file.
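
For reference, this is roughly what that scan looks like (a minimal
sketch following the Decoder usage from the qedus/osmpbf README; the
pharmacy filter is just an illustration). Since there is no index,
every blob has to be read and decompressed, no matter how small the
area of interest is:

    package main

    import (
    	"fmt"
    	"io"
    	"log"
    	"os"
    	"runtime"

    	"github.com/qedus/osmpbf"
    )

    func main() {
    	f, err := os.Open("sachsen-latest.osm.pbf")
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer f.Close()

    	d := osmpbf.NewDecoder(f)
    	d.SetBufferSize(osmpbf.MaxBlobSize)
    	// Decode blobs in parallel on all cores; every blob must still
    	// be read and decompressed, because nothing allows skipping.
    	if err := d.Start(runtime.GOMAXPROCS(-1)); err != nil {
    		log.Fatal(err)
    	}
    	for {
    		v, err := d.Decode()
    		if err == io.EOF {
    			break
    		} else if err != nil {
    			log.Fatal(err)
    		}
    		if n, ok := v.(*osmpbf.Node); ok && n.Tags["amenity"] == "pharmacy" {
    			fmt.Println(n.ID, n.Lat, n.Lon)
    		}
    	}
    }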

Thus, what I'm doing right now with osmf is to first import the PBF
file into a PostgreSQL database and then query this database instead of
the PBF file. This has several major drawbacks:
1. Setting up osmf is much more complicated than just downloading a PBF
   file. This is annoying for me, but probably also scares away users.
2. Importing the PBF file takes a long time; ca. 25 minutes for
   sachsen-latest.osm.pbf (208MB) on my laptop.
3. The database takes up much more disk space than the PBF file; ~5.7GB
   in this case.

> An initial thought would be to sort the input pbf file by geohash so
> each PBF Blob has its own unique geohash.

This sounds like a cool idea, but what I'm missing in this proposal
is the ability to determine the bounding box of a blob from it.
Without the bounding box, I'm unable to determine whether a blob can
be discarded for my search or not. I would at least need the width and
height (in degrees) of the blob in addition to the geohash.
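
To make that concrete, the per-blob test I'd want to run is a simple
rectangle intersection (a sketch; the bbox type and how it would be
decoded from the index are my assumptions, and I'm ignoring
antimeridian wraparound):

    // A blob can be skipped when its bounding box (however it is
    // encoded in the index) does not overlap the query's bounding box.
    type bbox struct {
    	minLon, minLat, maxLon, maxLat float64 // degrees
    }

    // overlaps reports whether two boxes intersect; blobs whose box
    // does not overlap the query box can be discarded without reading
    // their data.
    func overlaps(a, b bbox) bool {
    	return a.minLon <= b.maxLon && b.minLon <= a.maxLon &&
    		a.minLat <= b.maxLat && b.minLat <= a.maxLat
    }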

However, even with a width and height, I don't think sorting the blobs
by geohash provides any benefit. I still cannot do a binary search over
the geohashes to find the blob(s) for my area of interest, because the
blobs could have different (geospatial) sizes. For example, there could
be many small blobs and then one big blob that spans the whole globe,
containing labels for all capitals or similar.

Thus I think a regular old bounding box would do the job (as suggested
in [1]). I think this would still be OK performance-wise: a 62GB PBF
file of the whole planet with an average blob size of 8MB contains
roughly 8,000 blobs, so it would still "only" take ~8k seek operations
to jump through all of them. I don't know too much about disk
performance, but if I understand IOPS correctly, this should take ~0.1s
on a simple SSD.
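
Such a jump-through only needs the length fields the format already
defines: each BlobHeader is preceded by its 4-byte big-endian length,
and the header's datasize field gives the size of the blob data that
follows [1]. Here is a rough sketch of that walk, hand-decoding just
the datasize varint so that no blob data needs to be read or
decompressed; a bounding box check would slot in where the comment
indicates:

    package main

    import (
    	"encoding/binary"
    	"fmt"
    	"io"
    	"log"
    	"os"
    )

    // datasize extracts field 3 (datasize, a varint) from a raw
    // BlobHeader protobuf message; that is all we need to skip the blob.
    func datasize(header []byte) (int64, error) {
    	for i := 0; i < len(header); {
    		key, n := binary.Uvarint(header[i:])
    		if n <= 0 {
    			return 0, fmt.Errorf("malformed BlobHeader")
    		}
    		i += n
    		field, wire := key>>3, key&7
    		switch wire {
    		case 0: // varint (datasize)
    			v, n := binary.Uvarint(header[i:])
    			if n <= 0 {
    				return 0, fmt.Errorf("malformed varint")
    			}
    			i += n
    			if field == 3 {
    				return int64(v), nil
    			}
    		case 2: // length-delimited (type, indexdata)
    			l, n := binary.Uvarint(header[i:])
    			if n <= 0 {
    				return 0, fmt.Errorf("malformed length")
    			}
    			i += n + int(l)
    		default:
    			return 0, fmt.Errorf("unexpected wire type %d", wire)
    		}
    	}
    	return 0, fmt.Errorf("BlobHeader without datasize")
    }

    func main() {
    	f, err := os.Open(os.Args[1])
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer f.Close()
    	blobs := 0
    	for {
    		var size uint32
    		if err := binary.Read(f, binary.BigEndian, &size); err == io.EOF {
    			break
    		} else if err != nil {
    			log.Fatal(err)
    		}
    		header := make([]byte, size)
    		if _, err := io.ReadFull(f, header); err != nil {
    			log.Fatal(err)
    		}
    		ds, err := datasize(header)
    		if err != nil {
    			log.Fatal(err)
    		}
    		// With per-blob bounding boxes in the (currently unused)
    		// indexdata field, this is where irrelevant blobs would be
    		// ruled out; for now we just seek past the blob data.
    		if _, err := f.Seek(ds, io.SeekCurrent); err != nil {
    			log.Fatal(err)
    		}
    		blobs++
    	}
    	fmt.Println("scanned", blobs, "blob headers")
    }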

Of course, it would be even better for performance to have an index
that is separate from the blobs, but unfortunately this is not possible
with the defined PBF file format.

Greetings,
Richard Ulmer


[1] https://wiki.openstreetmap.org/wiki/PBF_Format#File_format


