[OSM-dev] Timestamp in PBF files

Scott Crosby scott at sacrosby.com
Sat Nov 24 19:36:13 GMT 2012


On Fri, Nov 23, 2012 at 5:03 AM, <marqqs at gmx.eu> wrote:

> Hi Scott,
>
> in brief to the 1-degrees granularity:
>
> 1. Do whole processing in 64 bit:
> This would mean to need much more RAM space when processing ways'
> coordinates. We should not do this unless this granularity is really
> required.
>

If you want your program to do all processing with 100 nanodegree
granularity instead of 1 nanodegree granularity, then you can use ints
throughout. The limitation is that your software will lose precision on any
PBF file whose data has 1 nanodegree granularity, which is probably not a
problem in practice. AFAIK, there are no PBF files with a granularity that
is not a multiple of 100 or with lat_offset and lon_offset != 0.


>
> 2. Your formula:
>   latitude_int = ((lat_offset + granularity*lat)/50+1)/2
> Good idea, but again, this would mean one more multiplication, one more
> division (and two additions, one shift). These operations usually can be
> done in no time, however that's different if you need to do them a Billion
> times.
>

I'm curious, have you benchmarked the difference?

> There are still people out there who have 32 bit machines, I presume they
> do not have 64 bits hardware multiplication units, hence the processing
> time will increase.
>
>
In any case, if the file has a granularity that is a multiple of 100, you
can use this specialized formula instead:
   // This calculation can be done using 32-bit ints.
   latitude_int = (lat_offset/50+1)/2 + (granularity/100)*lat

This can be further specialized for when the granularity is 100 to:
   // This calculation can be done using 32-bit ints.
   latitude_int = (lat_offset/50+1)/2 + lat
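
As a rough check of the formulas above, here is a small Python sketch. It is not taken from any PBF library, and the function names are made up for illustration. It assumes the goal is to convert nanodegrees to 100-nanodegree units with rounding, and it uses Python's floor division, which differs from C's truncating division for negative values, so a C port may round negative inputs differently.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def to_units_general(lat, granularity, lat_offset=0):
    """General formula: needs a 64-bit intermediate in languages like C."""
    return ((lat_offset + granularity * lat) // 50 + 1) // 2


def to_units_multiple_of_100(lat, granularity, lat_offset=0):
    """Specialized formula, valid only if granularity is a multiple of 100."""
    if granularity % 100 != 0:
        raise ValueError("granularity must be a multiple of 100")
    return (lat_offset // 50 + 1) // 2 + (granularity // 100) * lat


def fits_int32(value):
    """True if value is representable as a signed 32-bit integer."""
    return INT32_MIN <= value <= INT32_MAX


# 180 degrees in 100-nanodegree units still fits a signed 32-bit int,
# while the same angle expressed in nanodegrees does not.
MAX_LON_UNITS = 180 * 10**7
MAX_LON_NANODEGREES = 180 * 10**9
```

With floor division the two functions agree whenever the granularity is a multiple of 100, which is what makes the 32-bit shortcut possible.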


> 3. Process sequence:
> Using the granularity factor, lon/lat of every node in an OSMData
> fileblock must be read, stored temporarily and transformed later. Thus you
> have to access every data twice: first to read it, and a second time when
> you transform its granularity. This might be a flaw in PBF data model...
> Could we at least change this in that manner that the granularity
> information comes _before_ the real data? Same applies to lon/lat offset
> and date granularity.
>

No can do. Google's protobuf format doesn't specify the order in which
the components of a message are serialized (this is to support
concatenation of messages without decoding them). Their implementation
serializes in tag-order, and I chose larger numbers for the granularity
tags than for the primitive block tags.
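
Because a decoder cannot rely on the granularity fields arriving before the node data, one option is to buffer the raw values and convert them once the whole block has been read. The Python sketch below illustrates that idea on an already-decoded list of (field name, value) pairs. The pair list and the function name are hypothetical stand-ins for a real protobuf parser, and the defaults (granularity 100, offset 0) are assumed from the published PBF schema.

```python
def decode_block_latitudes(fields):
    """Return latitudes in nanodegrees from (name, value) pairs in any order."""
    granularity = 100   # assumed schema default
    lat_offset = 0      # assumed schema default
    raw_lats = []       # buffered until every field has been seen

    for name, value in fields:
        if name == "granularity":
            granularity = value
        elif name == "lat_offset":
            lat_offset = value
        elif name == "lat":
            raw_lats.append(value)
        # Any other field name is ignored in this sketch.

    # Second pass: only now is it safe to apply granularity and offset.
    return [lat_offset + granularity * lat for lat in raw_lats]


# The same block with the granularity field serialized last or first.
late = [("lat", 515000000), ("lat", -10), ("granularity", 1000)]
early = [("granularity", 1000), ("lat", 515000000), ("lat", -10)]
```

Both orderings yield the same result here, at the cost of touching every coordinate twice, which is the overhead raised in point 3.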


>
> In the end - there always will be a lot of programs which do not need this
> quasi "optional feature" "granularity" and simply will not support it.


>
> Metadata...
>
> We had the same discussion a year ago. Do you remember?
> https://wiki.openstreetmap.org/wiki/Talk:PBF_Format#File_Timestamp.3F
> I'm curious if - and I hope that - we manage to extend the PBF data format
> this time. :-)


> The file time stamp I added was meant as an interim solution: I took the
> already defined "optional feature" and stored a key-val pair in it, for
> example "timestamp=2011-10-16T15:45:00Z".
>
> I think this example shows what we really need: a flexible format for file
> related meta data. With key-val pairs, everyone could add optional data
> whenever they are needed in a toolchain. This is the flexibility we are
> used to have from OSM XML format.
>

I understand the desire for this, but I want to put some thought into it to
avoid the situation that created this thread, where the same metadata is
stored in different locations, and in different formats.

How about two types of metadata storage, one type is standardized in the
OSMHeader object directly:


message HeaderBlock {
  optional HeaderBBox bbox = 1;
  /* Additional tags to aid in parsing this dataset */
  repeated string required_features = 4;
  repeated string optional_features = 5;
  /* Other ad-hoc metadata */
  repeated AdHocMetadata adhoc_metadata = 6; // See below.

  optional string writingprogram = 16;
  optional string source = 17; // From the bbox field.
  optional string timestamp = 18; // From the OSM planet header.
  optional int64 replication_timestamp = 19; // In microseconds since 1970 UTC.
  optional string copyright = 20;
  optional string contributors = 21;
  optional string license = 22;
}


(New fields taken from the new planet header.) Question: since I haven't
reviewed OSM replication options, do we want one timestamp or two
timestamps, and should they be int64 or string?


> To combine this flexibility with the advantages of Protobuf format
> (compressed storage of different data types) we need to allow meta
> formatted objects - or something like this:
>
> message HeaderBlock {
>   ...
>   repeated HeaderMeta = 20;
> }
>
> message HeaderMeta {
>   required string HeaderKey = 1;
>   optional HeaderMetaVarint = 10;
>   optional HeaderMetaString = 12;
> // see type definitions there:
> https://wiki.openstreetmap.org/wiki/PBF#Format_example
> // Only _one_ of the three optional objects should be used; did not know
> how to define this in Protobuf without an additional hierarchy layer.
> }
>
> What do you think about this suggestion?
>
>
And, I agree with your idea of having key-value metadata, but, IMHO, ad-hoc
non-standardized metadata keys should be scoped to the author or creator of
that key-value. Say, something like this:

message AdHocMetadata {
   // Fully qualified URI of the author of this metadata, e.g., a website
   // for a toolchain or program, a company using this for internal tracking
   // data, or an email address of the person who created it. The author has
   // exclusive ownership of all keys and values assigned under their ID.
   required string author = 1;
   required string key = 2; // Key assigned by the author.
   // Should this key be copied into derived data?
   required bool copied_into_derived = 3;
   // These are generic fields; the supplier is free to use any subset of
   // them.
   repeated sint64 value_int = 8;
   repeated string value_string = 9;
   repeated double value_double = 10;
   // Byte fields can contain other serialized protobuf objects.
   repeated bytes value_bytes = 11;
}

Question, should I keep field #3? Useful for helping to track processing
pipelines, or do OSM processing pipelines currently not handle pushing
through arbitrary metadata?
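
To make the question about field #3 concrete, here is a hypothetical Python sketch of how a processing tool could honour copied_into_derived when writing derived output. The dataclass mirrors the proposed message only loosely (one value list instead of four), and derive_metadata is an invented helper, not part of any existing OSM tool. The author URIs are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class AdHocMetadata:
    author: str                # URI identifying who owns this key
    key: str
    copied_into_derived: bool
    value_string: list = field(default_factory=list)


def derive_metadata(entries):
    """Keep only the entries whose author asked for them to be propagated."""
    return [e for e in entries if e.copied_into_derived]


entries = [
    AdHocMetadata("http://example.org/toolchain", "source_extract", True, ["planet"]),
    AdHocMetadata("mailto:someone@example.org", "scratch_note", False, ["tmp"]),
]
kept = derive_metadata(entries)
```

A pipeline that does not understand a given key could still apply this filter, since the decision depends only on the flag and not on the key's meaning.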

Thoughts on both proposals for metadata?
Scott