[osmosis-dev] Proposal for Allowing Additional Data in Pipeline

Wed Jun 8 09:33:55 BST 2011

On Wed, May 11, 2011 at 10:59:34PM +1000, Brett Henderson wrote:
> This email is just musings at this point.  I'm not sure if I'll be able to
> implement anything anytime soon, but I'd be interested in people's thoughts
> on this.
> 
> Until now I've intentionally kept the core data classes in Osmosis as simple
> as possible to simplify maintenance and ensure consistency across all
> tasks.  I've only added attributes that are required to support basic OSM
> data and avoided any extensions from creeping in.
> 
> However it can be quite limiting when there is no way of passing additional
> data through the pipeline.  Examples of additional data might be:
> 
>    - A "mutated" flag of some kind to flag when a particular entity has been
>    changed and shouldn't be uploaded to the main API.  An example is when ways
>    are clipped at bounding box boundaries.
>    - A "visible" flag.  I hesitate to include this one because Osmosis
>    supports this via change streams, not optional visible attributes.
>    - Header information to be attached to the Bound element such as
>    replication timestamp information, source URLs, etc.
>    - Custom data exchanged between specialised tasks.  For example, a
>    polygon processing task might add full geometric information to a way.
> 
> To add some flexibility I'm thinking along the following lines:
> 
>    - Add a new collection to entities that can be optionally populated with
>    String/Object pairs.  Conceptually similar to a Map<String, Object> but
>    possibly stored like existing Tag objects in a simple Collection (currently
>    implemented as an ArrayList) for efficiency.
>    - The collection may be null when no data is required to minimise
>    overhead in the common case.  Consumers would need to explicitly check for
>    null which is a tad ugly but I think warranted here.
>    - Modify key tasks such as XML tasks to support serialising these
>    additional values as attributes on the entities themselves (eg. <node
>    id="1" version="1" ... mutated="true" /> ).  Alternatively represent them
>    as sub-elements (eg. metatag stored as <node id="1" ...><mtag k="mymtag"
>    v="myvalue"/></node>) .  The object would simply have the toString method
>    called on it to get a string representation.  Reading from XML would result
>    in a String object.
>    - Tasks not caring about the data would simply pass the objects on
>    without modification.
>    - Some Sink tasks such as PostgreSQL database tasks would ignore the
>    additional data.
>    - Some tasks such as --bounding-box could add a flag such as "mutated".
>    - Rename the existing Bound entity to something more generic like Header
>    to allow more file attributes to be persisted.
> 
> I think this approach would allow additional data to be attached to entities
> in a generic fashion without Osmosis itself having to add special support
> for it.  It would keep the pipeline generic but allow specialised tasks to
> exchange their own custom data.  I think representing the value part of data
> as an Object rather than String makes more sense because it allows custom
> tasks to exchange complete objects instead of forcing serialisation to and
> from String.
> 
> The additional data could in theory be represented as Tags without changing
> the pipeline at all, but it gets messy mixing real data with metadata.
> 
> I'm not sure if it makes sense to add support for this to the Bound object,
> or to simply allow Tag objects to be added instead.  Perhaps tags make more
> sense here?  The whole Bound concept has always fitted awkwardly in Osmosis,
> so I'm not sure how to tackle this one.
> 
> Hmm, a somewhat rambling email :-)  Any thoughts?
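
To make the quoted proposal concrete, here is a rough sketch of the nullable
String/Object collection on an entity. All names (ExtraDatum, SketchEntity,
putExtra, getExtra, extraAsXmlAttributes) are hypothetical and do not exist in
Osmosis. Replacing an existing key and skipping XML escaping are my own
assumptions, not part of the proposal.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; these are not the real Osmosis classes.
final class ExtraDatum {
    final String key;
    final Object value;

    ExtraDatum(String key, Object value) {
        this.key = key;
        this.value = value;
    }
}

class SketchEntity {
    // Stays null until a task attaches something, so plain entities pay nothing.
    private List<ExtraDatum> extraData;

    void putExtra(String key, Object value) {
        if (extraData == null) {
            extraData = new ArrayList<>();
        }
        for (int i = 0; i < extraData.size(); i++) {
            if (extraData.get(i).key.equals(key)) {
                // Assumption: a second put for the same key replaces the value.
                extraData.set(i, new ExtraDatum(key, value));
                return;
            }
        }
        extraData.add(new ExtraDatum(key, value));
    }

    // May return null; consumers have to check, as the proposal says.
    List<ExtraDatum> getExtraData() {
        return extraData;
    }

    Object getExtra(String key) {
        if (extraData == null) {
            return null;
        }
        for (ExtraDatum d : extraData) {
            if (d.key.equals(key)) {
                return d.value;
            }
        }
        return null;
    }

    // Renders the extra data as XML attributes using toString() on each value.
    // No XML escaping is done in this sketch.
    String extraAsXmlAttributes() {
        StringBuilder sb = new StringBuilder();
        if (extraData != null) {
            for (ExtraDatum d : extraData) {
                sb.append(' ').append(d.key).append("=\"")
                  .append(String.valueOf(d.value)).append('"');
            }
        }
        return sb.toString();
    }
}
```

A task like --bounding-box would then call something like
putExtra("mutated", Boolean.TRUE) on a clipped way, and a task that does not
care would pass the entity on untouched.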

A flexible mechanism like this would be very interesting and useful, but I also
see a lot of potential for confusion. Some tasks can handle certain extra data,
some tasks can't. Some would only work if extra data is present, some would
silently do wrong things when expected data is not present, etc. Currently
all tasks can be plugged together and if you are trying to combine them in a
way that doesn't work (for instance a task reading change data instead of one
reading plain data), Osmosis will complain.

So I think this needs to be a bit more formalized so Osmosis can still make
those checks.

The Map<String, Object> could be connected to some registry for the strings
(or the keys could be objects instead of strings, carrying some extra
information). For each key you would have at least these two options when
a task encounters an object of this type in the pipeline and doesn't
understand it:
* it will just ignore it and optionally pass it on
* it has to complain.
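
Something like the following is what I have in mind, as a rough sketch only.
The names (UnknownKeyPolicy, ExtraDataRegistry, register, mayProceed) are made
up for illustration and are not part of Osmosis. Treating unregistered keys as
"must complain" is just one possible default.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical names throughout; nothing here exists in Osmosis.
enum UnknownKeyPolicy {
    IGNORE,   // a task that doesn't understand the key may ignore or pass it on
    COMPLAIN  // a task that doesn't understand the key has to fail
}

final class ExtraDataRegistry {
    private final Map<String, UnknownKeyPolicy> policies = new HashMap<>();

    void register(String key, UnknownKeyPolicy policy) {
        policies.put(key, policy);
    }

    // Returns true if a task may continue after seeing this key: either it
    // understands the key, or the key is registered as ignorable.
    boolean mayProceed(String key, Set<String> understoodByTask) {
        if (understoodByTask.contains(key)) {
            return true;
        }
        // Assumption: keys that were never registered count as COMPLAIN.
        return policies.get(key) == UnknownKeyPolicy.IGNORE;
    }
}
```

With "geometry" registered as IGNORE and "mutated" as COMPLAIN, a task that
understands neither could proceed past "geometry" but would have to stop on
"mutated". Osmosis could run the same check when the pipeline is assembled, if
each task declared the keys it produces and understands.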

Jochen
-- 
Jochen Topf  jochen at remote.org  http://www.remote.org/jochen/  +49-721-388298