[OSM-dev] Ordering OSM-Files (was: visible-Flag in PBF)

Mon May 9 23:41:19 BST 2011

On Sun, May 8, 2011 at 5:19 AM, Jochen Topf <jochen at remote.org> wrote:
> One point that came up in the PBF discussion was the questions of ordering OSM
> files. Lets try to analyze that and try to separate the different issues:
>
> 1. Order of object types
>
> OSM files contain three different kinds of objects: nodes, ways, and relations.
> In the future new types of objects might be added, such as a native area type.
> Not all files contain all object types, for instance relations can be missing
> if there are no relations in a bbox.
>
> It makes processing easier if all objects of one type are bunched together.
> It doesn't have to be this way, but it is almost always the case currently.
> Note that the PBF format is optimized for that case with its structure of
> blobs.

As long as bunches of the same type are at least a few thousand nodes,
PBF will handle it OK.

>
> Bunching objects of the same type together is probably always better for
> applications reading the data. For some applications it might not matter, but
> for others it makes processing a lot easier (because you can be sure you have
> read all data of a specific type before you start processing a different type).
> On the other hand writing data might be much easier when you don't have to take
> this into account. You can, for instance, basically just concatenate two files
> to get a new OSM file.
>
> I think the use cases where you want to rely on the ordering are important enough
> to warrant some kind of flag that promises you that the objects are bunched
> together according to object type.
>

I assume you mean sorted.

> The next question is what the order of the object types should be.  Currently
> almost all OSM files have first all nodes, then all ways, then all relations.
> That order makes sense, for instance, if you want to be able to easily cut a
> bbox from the file. If you wanted to assemble multipolygons, a different order
> would be better (relations, ways, nodes or relations, nodes, ways). I am not
> sure we can ever decide on one particular "best" order here. But maybe it
> would make sense to allow different orders and note them down in the header.
> This way a program could decide in the beginning how many passes it needs over
> the data or maybe even use a different algorithm depending on the order.
>
> 2. Order of objects
>
> There is another order issue. Objects themselves can be ordered or not.
> Currently most OSM files have the objects (nodes, ways, and relations) each
> ordered according to ID. I guess that most programs do not depend on that
> order, but some might. And there might be some optimizations that can be done
> if the order is known, for instance you could stop reading a file if you see
> IDs larger than the largest ID you are interested in for some reason.
>
> Other orderings instead of ID could be possible, for instance by timestamp.
>
> I suggest we add an optional flag to the file saying "objects are ordered
> according to ID".
>
> If we have those additional flags an application can directly see whether it
> can read a file (efficiently). Most applications will probably work with any
> order, but in case an application can't work without that order it can
> immediately give an error message and stop. The user can then use Osmosis or
> some other program that can order the file in the necessary order and re-run
> the application.

Yes. How about the following optional feature flags to indicate sort order:

 Sort.NodeWayRelationThenID
     --- Because this is the default order of the planet.
 Sort.Unordered
     --- Explicitly indicate that the order is unknown.
 Sort.RelationWayNodeThenID
     --- Because this order makes multipolygon/region extraction more efficient.

What are the motivations for the other sort orders you proposed above?

>
> To optimize the case where the order is wrong for an application a bit further
> I think we also want the flags outside the compressed blobs to tell us whats
> inside. I agree that thats a cleaner solution than storing the offset of the
> first blobs with certain object types somewhere.
>

This option would be to use the indexdata field and for each block have:

message IndexData {
   optional int32 nodecount = 16
   optional int32 waycount = 17
   optional int32 relationcount = 18
}

Should this have a count of invisible entities?

> Another option would be a special block that says: The following blocks are
> all ob object type X until you get some special end block (or the start
> block for a different object type). This would reduce the overhead on each
> block slightly for the price of having an additional block type.

I think this approach is a big ugly, its also more fragile as it
doesn't allow semi-structured data. (I could imagine a file thats
partially geographically sorted by continent.) Counts work no matter
how the file is structured. However, this approach has one big
advantage: it would save about 1 megabyte/planet compared to entity
counts.

What about a third alternative: None of the above. PBF is fast enough
that you could scan the entire file and build counts for each block in
5-10 minutes, maybe faster because you don't need to fully decode a
block to get its entity count. Marker blocks only work with sorted
data. Counts inflate the planet by 1mb and don't really solve the
problem.  What makes 1MB or a marker block worth having

I'm now less convinced of the value of this feature. What is it for?

Scott