[OSM-dev] visible-Flag in PBF

Sat May 7 21:15:32 BST 2011

On Sat, May 7, 2011 at 2:06 PM, Christian Vetter <veaac.fdirct at gmail.com> wrote:
> Hi,
>
> On Sat, May 7, 2011 at 5:46 AM, Scott Crosby <scott at sacrosby.com> wrote:

>
>> One idea: Each BlobHeader includes an 'indexdata' field. We could
>> store a protocol buffer message there. It is available without parsing
>> the blob. It needs to be kept small, but adding something like:
>>
>> message IndexData {
>>   optional int32 nodecount = 16
>>   optional int32 waycount = 17
>>   optional int32 relationcount = 18
>> }

I just thought of other things:

Bounding boxes: (I avoid using min/max so that the number is more
stable across the international date line)
       optional sint64 min_lat = 7
       optional int64 lat_width = 8
       optional sint64 min_lon = 9
       optional int64 lon_width = 10

With 10m of granularity, thats about 12-20 bytes per block.

Be able to skip blocks with unwanted entity ID numbers:
       optional int64 min_id =11
       optional int64 max_id_delta = 12 (Delta to the max id)

Also about 10 bytes per block.

However, I'm reluctant to add any of these unless there's a commitment
to immediately add software support for these.

>
> I guess this would work. However, you still would have one random file
> access for each blob. A normal hard drive has a seek time of about 5ms
> -> skipping 1000 blobs would take about 5 seconds.

Blobs at 100kb are small enough that sequential reads are faster than
seeking to the next blob header position. At a typical 80mb/sec
streaming HD throughput, it would be about 2-3 minutes to scan a
planet. The main benefit of this is saving the decompression and
decoding time, which account for about 80% off the total time.

> Would this still make it possible to add further index data structures later on?

Most likely. We can always add another field to the indexdata. What
kind of indexing data structure are you thinking of?

>
>> Can you and others who want this feature please make a convincing case
>> of why every planet should get about 1mb bigger?
>
> I would use this for parsing a set of relations / ways without using a
> lot of memory / disk space: First scan relations, then scan ways, then
> scan nodes. Adding one megabyte does not seem much, especially since
> we already have about the same amount of redundancy in the blob
> headers by using a string for the blob type.

Other peoples thoughts on this?

> We could decrease the
> overhead a little bit by just storing a bool for each of the three.

There are about 80k blobs. If 1-byte tags are used for the counts,
overhead is:9 bytes each:

   2 bytes indexdata tag&length in the BlobHeader
   3*1 bytes (tags for 3 fields)
   2*1 bytes (varint count for N==0)
   1*2 bytes (varint count for N < 2**14)

I assume that few blobs contain more than one entity type. Using
booleans only saves one byte of overhead compared to this.

>
> However, the same could also be achieved by having the PBF elements
> ordered such that relations come before ways and ways come before
> nodes.

A third alternative: Scan the file and partially decode each blob,
just enough to determine if a blob has nodes, ways, or relations and
remember their offsets in the file. Then in remaining passes, you only
fully decode blobs with the entities you care about. This (along with
putting counts in the indexdata field) has the perk of being robust
even in partially unordered files.

>
>> This is already documented in the Wiki. (See OSMHeader section). The
>> current osmosis writer doesn't set that feature because AFAICT, there
>> is no functionality in osmosis for the writer to know that the file is
>> sorted so that it knows to set that flag.
>
> I have no idea how the PBF extracts are actually generated from the
> OSM data base. Would there be some way to influence the order of
> elements and letting Osmosis know about this without adding much
> overhead?

The problem is that osmosis doesn't have functionality to 'push'
metadata through the pipeline. The pbf reader, as a consumer, has
barely any metadata available to put in the file header. It has no
idea if the data is sorted or not so can't set the flag. The solution
is to improve the XML format so that bounds tags include a key-val
dictionary, make osmosis propagate that metadata through its
processing, and have the xml&pbf writers put that metadata into the
OSMHeader blob or <bounds> XML tag. Existing osmosis filters need to
be patched to properly set or reset the is-sorted flag. Then ask the
people who do planet dumps to set the flag on the dbase dumps. (And,
given we'll have a full key-val dictionary, we can include other
metadata like URL's for getting minute changesets, datestamps, and
other goodies.)

Scott