[OSM-dev] Planet file with preprocessed lines/polygons

Jochen Topf jochen at remote.org
Mon May 15 10:08:56 UTC 2017

On Fri, May 12, 2017 at 06:18:11PM +0200, Christoph Lingg wrote:
> when it comes to read raw OSM dumps it's quite straightforward to parse nodes:
> their geometry properties can be read alongside with their tags. When it comes
> to linestrings and relations it is more complicated to access their geometry:
> the geometry of referenced nodes needs to be combined into lines and polygons.
> Also one needs to decide which linestrings/relations are actually lines and
> which are areas.
> I know how this can be done but I am wondering if there are preprocessed
> datasets around that have geometries already precomputed. That would make sense
> to me as a lot of people face the sample problem and this step is quite
> resource intense.
> A huge file containing all osm items as geojson would be my dreamcase. Does
> this exist?

I have been thinking about something like this a lot in the last months
and experimented a bit. I agree that it would be a useful thing to have
preprocessed OSM data available for download. Currently, the very basic
preprocessing that everybody has to do (assembling lines out of ways and
node locations, and assembling multipolygons out of relations, their
member ways and, again, node locations) needs about 50 GB RAM to run
efficiently. This is not something everybody has on their machines. And
on top of that comes, of course, whatever further processing the user
wants to do. Taking the first basic preprocessing step out, running it
separately and offering the result for download makes sense.
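
As a rough illustration of the line-assembly step described above, here
is a minimal Python sketch. It assumes nodes arrive as (id, lon, lat)
tuples and ways as (id, node_refs, tags) tuples; these shapes and the
function name assemble_lines are hypothetical and not taken from any
existing tool. The in-memory location lookup is the kind of structure
that grows with the number of nodes in the input.

```python
def assemble_lines(nodes, ways):
    """Yield one GeoJSON LineString feature per way.

    nodes: iterable of (node_id, lon, lat)      -- assumed input shape
    ways:  iterable of (way_id, node_refs, tags) -- assumed input shape
    """
    # Node location index: every node must be kept so ways can look it up.
    locations = {node_id: (lon, lat) for node_id, lon, lat in nodes}

    for way_id, refs, tags in ways:
        # Skip references to nodes missing from the input (e.g. in extracts).
        coords = [list(locations[ref]) for ref in refs if ref in locations]
        if len(coords) < 2:
            continue  # not enough points for a line
        yield {
            "type": "Feature",
            "id": way_id,
            "geometry": {"type": "LineString", "coordinates": coords},
            "properties": dict(tags),
        }


nodes = [(1, 13.0, 51.0), (2, 13.1, 51.1), (3, 13.2, 51.0)]
ways = [(100, [1, 2, 3], {"highway": "path"})]
lines = list(assemble_lines(nodes, ways))
```

Multipolygon assembly from relations is considerably more involved and
is not covered by this sketch.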

The biggest problem here is that there is no really suitable format. We
need a format that

* has the flexibility of the OSM data with its open tagging scheme.
  Otherwise we would have to throw away too much data that might be
  useful for some users, which would hurt adoption of such a data format. This
  excludes basically all of the known GIS formats (such as Shapefiles
  etc.) which are based on the assumption that there is a fixed list of
  layers and attributes. About the only format that somewhat fits this
  bill is GeoJSON.
* is fast to read and write. This is a problem with GeoJSON, because it
  is a rather verbose text format. In addition it has the problem that
  you can't generally read it in a streaming fashion. There is a variant
  called "GeoJSON Text Sequences" (https://tools.ietf.org/html/rfc8142)
  which solves this problem, though.
* is compact. Again, this is a problem with GeoJSON. We definitely need
  some kind of compression (gzip, bzip2 etc.) on top of GeoJSON to make
  this even remotely possible as a download format. But this makes
  creating and using those files even slower.
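
To make the streaming point concrete, here is a small Python sketch of
writing and reading GeoJSON Text Sequences. Per RFC 8142 each JSON text
is preceded by an ASCII Record Separator (0x1E) and followed by a line
feed. The function names and the sample features are made up for this
example, and the reader assumes one record per line, which holds for
files written by the writer shown here.

```python
import io
import json

RS = b"\x1e"  # ASCII Record Separator that starts each record (RFC 8142)


def write_geojson_seq(features, out):
    """Write each feature as one RS-prefixed, LF-terminated JSON text."""
    for feature in features:
        text = json.dumps(feature, separators=(",", ":"))
        out.write(RS + text.encode("utf-8") + b"\n")


def read_geojson_seq(inp):
    """Yield features one at a time without loading the whole file.

    Assumes each record sits on a single line.
    """
    for line in inp:
        record = line.strip(RS + b" \t\r\n")
        if record:
            yield json.loads(record)


features = [
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [13.74, 51.05]},
     "properties": {"amenity": "cafe"}},
    {"type": "Feature",
     "geometry": {"type": "LineString",
                  "coordinates": [[13.74, 51.05], [13.75, 51.06]]},
     "properties": {"highway": "residential"}},
]

buf = io.BytesIO()
write_geojson_seq(features, buf)
buf.seek(0)
round_trip = list(read_geojson_seq(buf))
```

Because the reader handles one record at a time, memory use stays
independent of the file size, which is not generally possible with a
single large FeatureCollection.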

And some more about the flexibility issue: This is not only about having
all tags in the resulting file. There are more issues here: For handling
polygons from closed ways we have to decide which tags actually
represent polygons and which represent linestrings. Then we need to
decide about which metadata we need in such a file. Most users will
probably not need timestamps, user names, etc. that are in every OSM
object. Do we need all the nodes that have no tags themselves and are
only used for assembling lines and polygons from ways and relations?
What about non-multipolygon relations like routes and turn restrictions?
How to represent them? A general format should probably allow different
options here. But if you want to make this available for download,
which variant will it be? Every user needs something different and we
don't know what this is. We'll probably need some kind of 80% solution
here: find a compromise format that is useful for most people; everybody
else has to create their own. This is similar to how I offer coastline
data for download at openstreetmapdata.com: there are several variants in
the most useful formats for download, and if you need more you can run the
osmcoastline program yourself using different options.
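
The polygon-or-linestring decision for closed ways mentioned above could
look roughly like the following Python sketch. The key list AREA_KEYS is
a deliberately tiny, hypothetical rule set chosen only for illustration;
real rule sets are much longer and have many exceptions. The function
name classify_closed_way is also made up for this example.

```python
# Hypothetical, incomplete list of keys that usually describe areas.
AREA_KEYS = {"building", "landuse", "leisure", "amenity"}


def classify_closed_way(node_refs, tags):
    """Return "polygon" or "linestring" for a way, based on its tags."""
    # A polygon ring needs at least three distinct nodes plus the
    # repeated first node, and must be closed.
    if len(node_refs) < 4 or node_refs[0] != node_refs[-1]:
        return "linestring"
    # An explicit area tag overrides everything else.
    if tags.get("area") == "yes":
        return "polygon"
    if tags.get("area") == "no":
        return "linestring"
    # Otherwise fall back to the (assumed) list of area-like keys.
    if any(key in tags for key in AREA_KEYS):
        return "polygon"
    return "linestring"


ring = [1, 2, 3, 1]
building = classify_closed_way(ring, {"building": "yes"})
roundabout = classify_closed_way(ring, {"highway": "residential"})
plaza = classify_closed_way(ring, {"highway": "pedestrian", "area": "yes"})
```

A closed highway (such as a roundabout) stays a linestring here, while
the same key with area=yes becomes a polygon. This is exactly the kind
of choice that different users may want to make differently.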

In all of this I am only talking about a format for transporting data.
We can think about different formats that include indexes into the data
in some way or split up the data, for instance in vector tiles. But then
the problem becomes even larger. What indexes do we need? How to handle
the splitting up of large geometries into tiles? The more "features" we
want to have the more the different use cases for the data will differ,
the more complicated it becomes. I don't believe there is such a format
that can be everything to everyone. So I am concentrating on what I
think is the next step: a flexible, fast and compact format for
transporting preprocessed OSM data.

After all this preamble, here is some concrete work: The next
osmium-tool version will contain an "export" command that can create
GeoJSON (and GeoJSON Text Sequences) files. The implementation is done,
but it has not had much testing yet. It is available in the "export" branch
(https://github.com/osmcode/osmium-tool/tree/export). Give it a try.
In the medium term I would like to have a better format than GeoJSON for
this kind of data and would love to support that in osmium, but for the
time being you can experiment with GeoJSON.

One other thing: If you have the memory (see above) to assemble lines
and (multi)polygons from OSM data and are happy with C++ it might be
better to actually assemble the geometries from OSM data every time you
use them instead of writing them to GeoJSON and reading them in again.
On my server (3.6GHz quadcore) it takes only a bit more than 20 minutes
to do this for the whole planet file. But assembling the data *and*
writing it out to disk (GeoJSON Text Sequences format and using parallel
bzip, no metadata, no untagged nodes) takes more than two hours! The end
result is a 46 GB file (current planet is 37 GB). This is because the
OSM PBF format is more efficient than GeoJSON + compression.
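
For a sense of scale, the figures above can be put side by side in a few
lines of Python. The timings are rounded from "a bit more than 20
minutes" and "more than two hours", so the time ratio is only a rough
estimate.

```python
planet_pbf_gb = 37               # current planet file (OSM PBF)
geojson_seq_bz2_gb = 46          # GeoJSON Text Sequences + bzip2
assemble_minutes = 20            # assembling geometries only (rounded)
assemble_and_write_minutes = 120 # assembling and writing out (rounded)

size_ratio = geojson_seq_bz2_gb / planet_pbf_gb
time_ratio = assemble_and_write_minutes / assemble_minutes

print(f"size: {size_ratio:.2f}x, time: roughly {time_ratio:.0f}x")
```

So the exported file is about a quarter larger than the planet file and
takes roughly six times as long to produce as assembling the geometries
alone.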

Jochen Topf  jochen at remote.org  https://www.jochentopf.com/  +49-351-31778688
