[OSM-dev] Ordering OSM-Files (was: visible-Flag in PBF)

Tue May 10 17:12:08 BST 2011

On Mon, May 09, 2011 at 05:41:19PM -0500, Scott Crosby wrote:
> Yes. How about the following optional feature flags to indicate sort order:
> 
>  Sort.NodeWayRelationThenID
>      --- Because this is the default order of the planet.
>  Sort.Unordered
>      --- Explicitly indicate that the order is unknown.
>  Sort.RelationWayNodeThenID
>      --- Because this order makes multipolygon/region extraction more efficient.
> 
> What are the motivations for the other sort orders you proposed above?

Thats the problem. I don't know yet what orders will make sense and I don't
think we should change the definitions of the PBF file each time a new order is
"invented". But defining all of them from the get-go would be too many because
its a combinatorical problem.

I am also thinking ahead to when we might have more types, for instance an area
type. How would this fit in? Do we then need Sort.NodeWayAreaRelationThenID?

Maybe we can come up with a better idea how to code this.

> > To optimize the case where the order is wrong for an application a bit further
> > I think we also want the flags outside the compressed blobs to tell us whats
> > inside. I agree that thats a cleaner solution than storing the offset of the
> > first blobs with certain object types somewhere.
> >
> 
> 
> This option would be to use the indexdata field and for each block have:
> 
> message IndexData {
>    optional int32 nodecount = 16
>    optional int32 waycount = 17
>    optional int32 relationcount = 18
> }
> 
> Should this have a count of invisible entities?

They don't need to be counted extra.

> > Another option would be a special block that says: The following blocks are
> > all ob object type X until you get some special end block (or the start
> > block for a different object type). This would reduce the overhead on each
> > block slightly for the price of having an additional block type.
> 
> I think this approach is a big ugly, its also more fragile as it
> doesn't allow semi-structured data. (I could imagine a file thats
> partially geographically sorted by continent.) Counts work no matter
> how the file is structured. However, this approach has one big
> advantage: it would save about 1 megabyte/planet compared to entity
> counts.
> 
> What about a third alternative: None of the above. PBF is fast enough
> that you could scan the entire file and build counts for each block in
> 5-10 minutes, maybe faster because you don't need to fully decode a
> block to get its entity count. Marker blocks only work with sorted
> data. Counts inflate the planet by 1mb and don't really solve the
> problem.  What makes 1MB or a marker block worth having

1MB is nothing compared to the size of the planet which grows about 1 GB
every month or two.

> I'm now less convinced of the value of this feature. What is it for?

To save the cost of the zip decompression for blocks we are not interested
in. I'll do some benchmarking to see how much this would save.

Jochen
-- 
Jochen Topf  jochen at remote.org  http://www.remote.org/jochen/  +49-721-388298