[OSM-dev] Release candidate for OSM binary format is in osmosis trunk.

Mon Sep 6 13:02:09 BST 2010

On Mon, Sep 6, 2010 at 6:58 AM, Scott Crosby <scrosby at cs.rice.edu> wrote:

> On Sun, Sep 5, 2010 at 11:42 AM, Frederik Ramm <frederik at remote.org>
> wrote:
> > Scott,
> >
> > Scott Crosby wrote:
> >>
> >> message HeaderBlock {
> >>  required HeaderBBox bbox = 1;
> >>
> >>  // Author, name, and version number of the dataset in this file
> >>  // (to permit patches/updates to be incrementally applied).
> >>  optional string datasetauthor = 16; // TODO: WANT THIS?
> >>  optional string datasetname = 17;  // TODO: WANT THIS?
> >>  optional int64 version = 18; // TODO: WANT THIS?
> >>
> >>  // Program generating this data
> >>  optional string writingprogram = 19;  // TODO: WANT THIS?
> >> }
> >
> > To start regular updates after importing a full planet file, one
> > typically needs to find out which state.txt file on
> > planet.openstreetmap.org to copy.
> >
> > The current algorithm for this is:
> >
> > * decide whether you want daily, hourly, or minutely updates;
> > * find out the latest timestamp in your data set, or alternatively use
> >   the time of dataset creation;
> > * find the latest state.txt file from the appropriate directory that was
> >   created before your own latest timestamp;
> > * copy that to your Osmosis working directory.
> >
> > In order to make this really easy, a data file should (in order of
> > preference) either
>
> I can define fields or sub-messages in the header to store any of this
> information. A representation of the state file, URLs, timestamps, or
> all three.
>
> However, your preferred suggestion is to contain the information to
> synthesize a state.txt file for updates. I can include this in the
> header. I looked at some existing state files and here is what I'm
> guessing the schema is. Is it correct?
>
> message OneReplicationStateV1 {
>   required string base_url = 1;
>   required int64 sequence_number = 2;
>   required int64 timestamp = 3; // Milliseconds since 1970
>   required int64 txn_max = 4;
>   required int64 txn_max_queried = 5;
>   repeated int64 txn_ready_list = 6;
>   repeated int64 txn_active_list = 7;
> }
>
> message ReplicationStateV1 {
>   optional OneReplicationStateV1 minute = 16;
>   optional OneReplicationStateV1 hour = 17;
>   optional OneReplicationStateV1 day = 18;
>   optional OneReplicationStateV1 week = 19;
> }
>
> One question as a sanity-check. How many items will txn_active_list
> and txn_ready_list typically have in the average case, worst case
> (worst 1%), and absolutely worst case (worst .0000001%)?
>
> There's one other issue with including replication information in the
> header.
>
> Making that information usable is a separate challenge as that state
> information has to be pushed through the osmosis pipe both to and from
> the format. I'd have to ask Brett to chime in on how to do this; my
> guess is as part of a Bounds object. I'm assuming that osmosis is used
> to generate these dumps, if not, then the program generating the dump
> has to be modified to generate a binary format.
>
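
For reference, the state.txt selection step quoted above (pick the newest
state file created before the data set's latest timestamp) could be sketched
roughly as follows. This is a hypothetical Python illustration, not Osmosis
code: StateFile and pick_state are made-up names, and it assumes the
candidate state files for the chosen interval (day, hour, or minute) have
already been listed with their creation times.

```python
from datetime import datetime, timezone
from typing import NamedTuple, Optional, Sequence


class StateFile(NamedTuple):
    """Hypothetical record for one state.txt candidate."""
    sequence_number: int
    timestamp: datetime  # creation time of this state.txt


def pick_state(states: Sequence[StateFile],
               latest_data_ts: datetime) -> Optional[StateFile]:
    """Return the newest state created strictly before latest_data_ts.

    Returns None when every candidate is at or after the data timestamp.
    """
    candidates = [s for s in states if s.timestamp < latest_data_ts]
    return max(candidates, key=lambda s: s.timestamp, default=None)


# Illustrative values only; the sequence numbers are invented.
states = [
    StateFile(100, datetime(2010, 9, 1, tzinfo=timezone.utc)),
    StateFile(101, datetime(2010, 9, 2, tzinfo=timezone.utc)),
    StateFile(102, datetime(2010, 9, 3, tzinfo=timezone.utc)),
]
chosen = pick_state(states, datetime(2010, 9, 2, 12, 0, tzinfo=timezone.utc))
```

The chosen state file is the one that would then be copied into the Osmosis
working directory.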

I don't have an answer on how to do this.  I've so far avoided adding
anything like this to the pipeline because it is very difficult to make all
tasks support it in a meaningful way.  The existing bound support is messy
enough and I question its usefulness.  Bound support makes sense for an
editor like JOSM, but much less sense for Osmosis.  JOSM owns the entire
lifecycle of an OSM file, only supports the one file format, and typically
receives all data from the API.  Osmosis is more generic, may receive files
from a number of sources, and this makes it much harder to preserve and
manipulate metadata.

Adding this type of info would probably require a re-think of how metadata
is passed through the pipeline.  One reason why Osmosis is so flexible is
that the data model it supports is very simple.  Adding extra data to
support specific use cases will make this more difficult to achieve.

I'm not sure that it makes much sense to add replication state support to
binary files without adding it to other storage formats as well (XML, pgsql,
apidb?).

That all sounds a bit negative, which isn't really my intent.  I guess what
I'm saying is that I'm not keen to see replication support specifically
added to the pipeline, but rather a rethink of how metadata can be passed
through the pipeline in a more generic fashion.  It needs to take into
account all tasks and not just the couple that are used for one use case.
This could potentially be used for replication data, bounds information, and
maybe even other information such as whether the data has been manipulated
in some fashion (eg. clipIncompleteEntities option on the bounding box
task).
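
To make the idea a little more concrete, here is a rough sketch of what a
generic metadata container might look like. It is written in Python only for
brevity and does not reflect any existing Osmosis class; StreamMetadata,
with_entry, and apply_bounding_box are hypothetical names, and the key names
are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class StreamMetadata:
    """Hypothetical key/value metadata carried alongside the entity stream."""
    entries: Dict[str, Any] = field(default_factory=dict)

    def with_entry(self, key: str, value: Any) -> "StreamMetadata":
        # Return a new container so upstream tasks never see later changes.
        return StreamMetadata({**self.entries, key: value})


def apply_bounding_box(meta: StreamMetadata,
                       bbox: Tuple[float, float, float, float],
                       clip_incomplete_entities: bool) -> StreamMetadata:
    """Sketch of a task that updates only the keys it understands.

    Keys it does not know about (such as replication state) pass through.
    """
    return (meta
            .with_entry("bounds", bbox)
            .with_entry("manipulated",
                        {"clipIncompleteEntities": clip_incomplete_entities}))


source_meta = StreamMetadata().with_entry("replication", {"sequence_number": 101})
result_meta = apply_bounding_box(source_meta, (-1.0, 50.0, 1.0, 52.0), True)
```

The point of the sketch is that a task only touches the keys it knows about,
so replication data, bounds, and manipulation flags can share one mechanism.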

Brett