On Fri, Nov 23, 2012 at 5:03 AM,  <span dir="ltr"><<a href="mailto:marqqs@gmx.eu" target="_blank">marqqs@gmx.eu</a>></span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Hi Scott,<br>

<br>

in brief to the 1-degrees granularity:<br>

<br>

1. Do whole processing in 64 bit:<br>

This would require much more RAM when processing ways' coordinates. We should not do this unless this granularity is really required.<br></blockquote><div><br></div><div>If you want your program to do all processing with 100 nanodegree granularity instead of 1 nanodegree granularity, then you can use ints throughout. The only limitation is that data from a PBF file with 1 nanodegree granularity would lose precision, which is probably not a problem in practice. AFAIK, there are no PBF files whose granularity is not a multiple of 100, or whose lat_offset or lon_offset is nonzero.</div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

2. Your formula:<br>

<div class="im">  latitude_int = ((lat_offset + granularity*lat)/50+1)/2<br>

</div>Good idea, but again, this would mean one more multiplication, one more division (and two additions, one shift). These operations usually can be done in no time, however that's different if you need to do them a Billion times.<br>


</blockquote><div><br></div><div>I'm curious, have you benchmarked the difference?</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">There are still people out there who have 32 bit machines, I presume they do not have 64 bits hardware multiplication units, hence the processing time will increase.<br>


<br></blockquote><div><br></div><div><div>In any case, if the file has a granularity that is a multiple of 100,  you can use this specialized formula instead:</div><div><span style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:13px;background-color:rgb(255,255,255)">   latitude_int = (lat_offset/50+1)/2 + (granularity/100)*lat // This calculation can be done using 32-bit ints.</span></div>

<div><span style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:13px;background-color:rgb(255,255,255)"><br></span></div><div>This can be further specialized for when the granularity is 100 to:</div></div><div>

<div><span style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:13px;background-color:rgb(255,255,255)">   latitude_int = (lat_offset/50+1)/2 + lat // This calculation can be done using 32-bit ints.</span></div>
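<div><br></div><div>To make the relationship between these formulas concrete, here is a small Python sketch (not taken from any existing PBF reader; the function names are made up for illustration). It assumes results are wanted in units of 100 nanodegrees. Note that Python's // rounds toward negative infinity while C's / truncates toward zero, so results for negative inputs may differ by one unit in a C implementation, and Python ints never overflow, so this only checks the arithmetic, not the 32-bit claim.</div>

```python
def to_units_general(raw, offset, granularity):
    """General formula: works for any granularity.

    In a C-like language the product granularity * raw needs a
    64-bit intermediate value.
    """
    return ((offset + granularity * raw) // 50 + 1) // 2


def to_units_specialized(raw, offset, granularity):
    """Specialized formula: only valid if granularity is a multiple of 100.

    The offset term can be computed once per block; the per-node work
    is one multiplication and one addition.
    """
    if granularity % 100 != 0:
        raise ValueError("granularity must be a multiple of 100")
    return (offset // 50 + 1) // 2 + (granularity // 100) * raw


# 51.5 degrees stored with the default granularity of 100 and no offset.
print(to_units_general(515000000, 0, 100))
print(to_units_specialized(515000000, 0, 100))
```

<div>With floor division the two functions agree whenever the granularity is a multiple of 100, because granularity*lat is then an exact multiple of 50 and can be pulled out of both divisions.</div>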

</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

3. Process sequence:<br>

Using the granularity factor, lon/lat of every node in an OSMData fileblock must be read, stored temporarily and transformed later. Thus you have to access every data twice: first to read it, and a second time when you transform its granularity. This might be a flaw in PBF data model... Could we at least change this in that manner that the granularity information comes _before_ the real data? Same applies to lon/lat offset and date granularity.<br>

</blockquote><div><br></div><div>No can do. Google's protobuf format doesn't specify the order in which the components of a message are serialized (this is to support concatenation of messages without decoding them). Their implementation serializes in tag-order, and I chose larger numbers for the granularity tags than for the primitive block tags.</div>
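<div><br></div><div>Roughly, a reader that cannot rely on field order has to do something like the following Python sketch. This is only an illustration: the function name and the (name, value) input form are hypothetical stand-ins for whatever a real protobuf decoder hands back, and the defaults (granularity 100, offsets 0) are the ones declared in osmformat.proto.</div>

```python
def decode_primitive_block(fields):
    """Convert raw node coordinates to nanodegrees.

    `fields` is a hypothetical pre-parsed form: an iterable of
    (name, value) pairs that may arrive in any order.
    """
    granularity = 100  # default from osmformat.proto
    lat_offset = 0     # default from osmformat.proto
    lon_offset = 0     # default from osmformat.proto
    raw_nodes = []

    # First pass: collect everything, because the scaling fields
    # may come after the node data.
    for name, value in fields:
        if name == "granularity":
            granularity = value
        elif name == "lat_offset":
            lat_offset = value
        elif name == "lon_offset":
            lon_offset = value
        elif name == "node":
            raw_nodes.append(value)

    # Second pass: apply the scaling once all fields are known.
    return [
        (lat_offset + granularity * lat, lon_offset + granularity * lon)
        for lat, lon in raw_nodes
    ]


# The result does not depend on where the granularity field appears.
early = [("granularity", 1000), ("node", (51500000, -120000))]
late = [("node", (51500000, -120000)), ("granularity", 1000)]
print(decode_primitive_block(early) == decode_primitive_block(late))
```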

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

In the end - there always will be a lot of programs which do not need this quasi "optional feature" "granularity" and simply will not support it. </blockquote><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


<br>

<br>

Metadata...<br>

<br>

We had the same discussion a year ago. Do you remember?<br>

<a href="https://wiki.openstreetmap.org/wiki/Talk:PBF_Format#File_Timestamp.3F" target="_blank">https://wiki.openstreetmap.org/wiki/Talk:PBF_Format#File_Timestamp.3F</a><br>

I'm curious if - and I hope that - we manage to extend the PBF data format this time. :-) </blockquote><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

The file time stamp I added was meant as an interim solution: I took the already defined "optional feature" and stored a key-val pair in it, for example "timestamp=2011-10-16T15:45:00Z".<br>

<br>

I think this example shows what we really need: a flexible format for file related meta data. With key-val pairs, everyone could add optional data whenever they are needed in a toolchain. This is the flexibility we are used to have from OSM XML format.<br>

</blockquote><div><br></div><div>I understand the desire for this, but I want to put some thought into it to avoid the situation that created this thread, where the same metadata is stored in different locations, and in different formats.</div>

<div><br></div><div>How about two types of metadata storage? One type is standardized directly in the OSMHeader object:</div><div><br></div><div><pre class="de1" style="padding:0px;border:0px none white;background-color:rgb(249,249,249);line-height:1.2em;font-size:12.499999046325684px;margin-top:0px;margin-bottom:0px;background-image:none;vertical-align:top">

<br></pre><pre class="de1" style="padding:0px;border:0px none white;background-color:rgb(249,249,249);line-height:1.2em;font-size:12.499999046325684px;margin-top:0px;margin-bottom:0px;background-image:none;vertical-align:top">

message HeaderBlock {

  optional HeaderBBox bbox = 1;

  /* Additional tags to aid in parsing this dataset */

  repeated string required_features = 4;

  repeated string optional_features = 5;

  /* Other ad-hoc metadata */</pre><pre class="de1" style="padding:0px;border:0px none white;background-color:rgb(249,249,249);line-height:1.2em;font-size:12.499999046325684px;margin-top:0px;margin-bottom:0px;background-image:none;vertical-align:top">

  repeated AdHocMetadata adhoc_metadata = 6; // See below.</pre><pre class="de1" style="padding:0px;border:0px none white;background-color:rgb(249,249,249);line-height:1.2em;font-size:12.499999046325684px;margin-top:0px;margin-bottom:0px;background-image:none;vertical-align:top">

 
  optional string writingprogram = 16; 

  optional string source = 17; // From the bbox field.</pre><pre class="de1" style="padding:0px;border:0px none white;background-color:rgb(249,249,249);line-height:1.2em;font-size:12.499999046325684px;margin-top:0px;margin-bottom:0px;background-image:none;vertical-align:top">

  optional string timestamp = 18; // from OSM planet header.</pre><pre class="de1" style="padding:0px;border:0px none white;background-color:rgb(249,249,249);line-height:1.2em;font-size:12.499999046325684px;margin-top:0px;margin-bottom:0px;background-image:none;vertical-align:top">

  optional int64 replication_timestamp = 19; // In microseconds since 1970 UTC.</pre><pre class="de1" style="padding:0px;border:0px none white;background-color:rgb(249,249,249);line-height:1.2em;font-size:12.499999046325684px;margin-top:0px;margin-bottom:0px;background-image:none;vertical-align:top">

  optional string copyright = 20;</pre><pre class="de1" style="padding:0px;border:0px none white;background-color:rgb(249,249,249);line-height:1.2em;font-size:12.499999046325684px;margin-top:0px;margin-bottom:0px;background-image:none;vertical-align:top">

  optional string contributors = 21;</pre><pre class="de1" style="padding:0px;border:0px none white;background-color:rgb(249,249,249);line-height:1.2em;font-size:12.499999046325684px;margin-top:0px;margin-bottom:0px;background-image:none;vertical-align:top">

  optional string license = 22;</pre><pre class="de1" style="padding:0px;border:0px none white;background-color:rgb(249,249,249);line-height:1.2em;font-size:12.499999046325684px;margin-top:0px;margin-bottom:0px;background-image:none;vertical-align:top">

}</pre></div><div><br></div><div>(New fields are taken from the new planet header.) Question: since I haven't reviewed the OSM replication options, do we want one timestamp or two, and should they be int64 or string?</div>
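<div><br></div><div>For comparing the two representations, here is a Python sketch of converting between an ISO 8601 string like the one in the quoted "timestamp=..." example and an int64 count of microseconds since 1970 UTC. The function names are made up for this example, and it assumes the string always ends in "Z" with whole seconds.</div>

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def iso_to_micros(text):
    """Parse e.g. '2011-10-16T15:45:00Z' into microseconds since 1970 UTC."""
    moment = datetime.strptime(text, ISO_FORMAT).replace(tzinfo=timezone.utc)
    # Dividing two timedeltas with // gives an exact integer.
    return (moment - EPOCH) // timedelta(microseconds=1)


def micros_to_iso(micros):
    """Format microseconds since 1970 UTC back into the string form."""
    moment = EPOCH + timedelta(microseconds=micros)
    return moment.strftime(ISO_FORMAT)


micros = iso_to_micros("2011-10-16T15:45:00Z")
print(micros)
print(micros_to_iso(micros))
```

<div>The integer form is cheaper to compare and sort; the string form is readable without tooling.</div>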

<div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

To combine this flexibility with the advantages of Protobuf format (compressed storage of different data types) we need to allow meta formatted objects - or something like this:<br>

<br>

message HeaderBlock {<br>

  ...<br>

  repeated HeaderMeta = 20;<br>

}<br>

<br>

message HeaderMeta {<br>

  required string HeaderKey = 1;<br>

  optional HeaderMetaVarint = 10;<br>

  optional HeaderMetaString = 12;<br>

// see type definitions there: <a href="https://wiki.openstreetmap.org/wiki/PBF#Format_example" target="_blank">https://wiki.openstreetmap.org/wiki/PBF#Format_example</a><br>

// Only _one_ of the three optional objects should be used; did not know how to define this in Protobuf without an additional hierarchy layer.<br>

}<br>

<br>

What do you think about this suggestion?<br>

<br></blockquote><div><br></div><div><div>And, I agree with your idea of having key-value metadata, but, IMHO, ad-hoc non-standardized metadata keys should be scoped to the author or creator of that key-value. Say, something like this:</div>

<div><br></div><div>message AdHocMetadata {</div><div>   required string author = 1; // Fully qualified URI of the author of this metadata, e.g., </div><div>                                          // a website for toolchain, program, a company using this for</div>

<div>                                          // internal tracking data, or an email address of the person who created it. The author has</div><div>                                          // exclusive ownership of all keys and values assigned under their ID.</div>

<div>   required string key = 2; // Key assigned by the author.</div><div>   required bool copied_into_derived = 3; // Should this key be copied into derived data.</div><div>   // These are generic fields; the supplier is free to use all of them or any subset.</div>

<div>   repeated sint64 value_int = 8;</div><div>   repeated string value_string = 9;</div><div>   repeated double value_double = 10;</div><div>   repeated bytes value_bytes = 11; // bytes fields can contain other serialized protobuf objects.</div>

<div>}</div><div><br></div><div>Question: should I keep field #3? It could help track processing pipelines, or do OSM processing pipelines currently not pass arbitrary metadata through?</div></div><div><br>
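<div>To show how the author scoping and field #3 might be used by a tool, here is a Python sketch that mirrors the proposed message as a plain dataclass. It is not generated protobuf code; the helper names and the example.org author URI are invented for illustration, and only two of the four value fields are modeled.</div>

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AdHocMetadata:
    """Plain-Python stand-in for the proposed AdHocMetadata message."""
    author: str                # URI or email address that owns the key
    key: str                   # key assigned by that author
    copied_into_derived: bool  # field #3 in the proposal
    value_int: List[int] = field(default_factory=list)
    value_string: List[str] = field(default_factory=list)


def lookup(entries, author, key):
    """Find entries by (author, key); the same key under another author is ignored."""
    return [e for e in entries if e.author == author and e.key == key]


def metadata_for_derived(entries):
    """Keep only the entries a pipeline stage should pass on to its output."""
    return [e for e in entries if e.copied_into_derived]


entries = [
    AdHocMetadata("https://example.org/toolA", "timestamp", True,
                  value_string=["2011-10-16T15:45:00Z"]),
    AdHocMetadata("https://example.org/toolB", "timestamp", False,
                  value_int=[42]),
]
print(len(lookup(entries, "https://example.org/toolA", "timestamp")))
print([e.author for e in metadata_for_derived(entries)])
```

<div>Because lookups always include the author, two tools can both use a key named "timestamp" without colliding.</div>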

</div><div>Thoughts on both proposals for metadata?</div><div>Scott</div><div><br></div></div>