[OSM-dev] Ordering OSM-Files (was: visible-Flag in PBF)

Sun May 8 11:19:45 BST 2011

One point that came up in the PBF discussion was the questions of ordering OSM
files. Lets try to analyze that and try to separate the different issues:

1. Order of object types

OSM files contain three different kinds of objects: nodes, ways, and relations.
In the future new types of objects might be added, such as a native area type.
Not all files contain all object types, for instance relations can be missing
if there are no relations in a bbox.

It makes processing easier if all objects of one type are bunched together.
It doesn't have to be this way, but it is almost always the case currently.
Note that the PBF format is optimized for that case with its structure of
blobs.

Bunching objects of the same type together is probably always better for
applications reading the data. For some applications it might not matter, but
for others it makes processing a lot easier (because you can be sure you have
read all data of a specific type before you start processing a different type).
On the other hand writing data might be much easier when you don't have to take
this into account. You can, for instance, basically just concatenate two files
to get a new OSM file.

I think the use cases where you want to rely on the ordering are important enough
to warrant some kind of flag that promises you that the objects are bunched
together according to object type.

The next question is what the order of the object types should be.  Currently
almost all OSM files have first all nodes, then all ways, then all relations.
That order makes sense, for instance, if you want to be able to easily cut a
bbox from the file. If you wanted to assemble multipolygons, a different order
would be better (relations, ways, nodes or relations, nodes, ways). I am not
sure we can ever decide on one particular "best" order here. But maybe it
would make sense to allow different orders and note them down in the header.
This way a program could decide in the beginning how many passes it needs over
the data or maybe even use a different algorithm depending on the order.

2. Order of objects

There is another order issue. Objects themselves can be ordered or not.
Currently most OSM files have the objects (nodes, ways, and relations) each
ordered according to ID. I guess that most programs do not depend on that
order, but some might. And there might be some optimizations that can be done
if the order is known, for instance you could stop reading a file if you see
IDs larger than the largest ID you are interested in for some reason.

Other orderings instead of ID could be possible, for instance by timestamp.

I suggest we add an optional flag to the file saying "objects are ordered
according to ID".

If we have those additional flags an application can directly see whether it
can read a file (efficiently). Most applications will probably work with any
order, but in case an application can't work without that order it can
immediately give an error message and stop. The user can then use Osmosis or
some other program that can order the file in the necessary order and re-run
the application.

To optimize the case where the order is wrong for an application a bit further
I think we also want the flags outside the compressed blobs to tell us whats
inside. I agree that thats a cleaner solution than storing the offset of the
first blobs with certain object types somewhere.

Another option would be a special block that says: The following blocks are
all ob object type X until you get some special end block (or the start
block for a different object type). This would reduce the overhead on each
block slightly for the price of having an additional block type. I am not
sure which solution is best.

Jochen
-- 
Jochen Topf  jochen at remote.org  http://www.remote.org/jochen/  +49-721-388298