[osmosis-dev] Proposal for Allowing Additional Data in Pipeline

Wed May 11 13:59:34 BST 2011

Hi All,

This email is just musings at this point.  I'm not sure if I'll be able to
implement anything anytime soon, but I'd be interested in people's thoughts
on this.

Until now I've intentionally kept the core data classes in Osmosis as simple
as possible to simplify maintenance and ensure consistency across all
tasks.  I've only added attributes that are required to support basic OSM
data and have avoided letting any extensions creep in.

However, it can be quite limiting when there is no way of passing additional
data through the pipeline.  Examples of additional data might be:

   - A "mutated" flag of some kind to mark when a particular entity has been
   changed and shouldn't be uploaded to the main API.  An example is when ways
   are clipped at bounding box boundaries.
   - A "visible" flag.  I hesitate to include this one because Osmosis
   supports this via change streams, not optional visible attributes.
   - Header information to be attached to the Bound element such as
   replication timestamp information, source URLs, etc.
   - Custom data exchanged between specialised tasks.  For example, a
   polygon processing task might add full geometric information to a way.

To add some flexibility I'm thinking along the following lines:

   - Add a new collection to entities that can be optionally populated with
   String/Object pairs.  Conceptually similar to a Map<String, Object> but
   possibly stored like existing Tag objects in a simple Collection (currently
   implemented as an ArrayList) for efficiency.
   - The collection may be null when no data is required to minimise
   overhead in the common case.  Consumers would need to explicitly check for
   null which is a tad ugly but I think warranted here.
   - Modify key tasks such as XML tasks to support serialising these
   additional values as attributes on the entities themselves (eg. <node
   id="1" version="1" ... mutated="true" />).  Alternatively represent them
   as sub-elements (eg. a metatag stored as <node id="1" ...><mtag k="mymtag"
   v="myvalue"/></node>).  On writing, the object would simply have its
   toString method called to get a string representation.  Reading from XML
   would result in a String object.
   - Tasks not caring about the data would simply pass the objects on
   without modification.
   - Some Sink tasks such as PostgreSQL database tasks would ignore the
   additional data.
   - Some tasks such as --bounding-box could add a flag such as "mutated".
   - Rename the existing Bound entity to something more generic like Header
   to allow more file attributes to be persisted.
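
To make the idea in the list above a bit more concrete, here is a minimal
sketch of what the optional collection might look like.  Every name in it
(MetaEntry, SketchEntity, putMeta, getMeta, getMetaData) is hypothetical and
made up for illustration; none of them exist in Osmosis.  It only shows the
"null until needed" list of String/Object pairs, not a finished design.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical String/Object pair, stored much like a Tag would be.
final class MetaEntry {
    final String key;
    final Object value;

    MetaEntry(String key, Object value) {
        this.key = key;
        this.value = value;
    }
}

// Hypothetical stand-in for an entity class; not the real Osmosis Entity.
class SketchEntity {
    private final long id;

    // Stays null until something is attached, so entities without
    // additional data carry no extra collection.
    private List<MetaEntry> metaData;

    SketchEntity(long id) {
        this.id = id;
    }

    long getId() {
        return id;
    }

    // May return null; consumers have to check before iterating.
    List<MetaEntry> getMetaData() {
        return metaData;
    }

    // Adds a pair, replacing any existing pair with the same key.
    void putMeta(String key, Object value) {
        if (metaData == null) {
            metaData = new ArrayList<MetaEntry>();
        }
        for (int i = 0; i < metaData.size(); i++) {
            if (metaData.get(i).key.equals(key)) {
                metaData.set(i, new MetaEntry(key, value));
                return;
            }
        }
        metaData.add(new MetaEntry(key, value));
    }

    // Linear scan; returns null when the key (or the whole list) is absent.
    Object getMeta(String key) {
        if (metaData == null) {
            return null;
        }
        for (MetaEntry entry : metaData) {
            if (entry.key.equals(key)) {
                return entry.value;
            }
        }
        return null;
    }
}
```

A task such as --bounding-box could then call something like
way.putMeta("mutated", Boolean.TRUE) on a clipped way, while tasks that don't
care about the data would never touch the list at all.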

I think this approach would allow additional data to be attached to entities
in a generic fashion without Osmosis itself having to add special support
for it.  It would keep the pipeline generic but allow specialised tasks to
exchange their own custom data.  I think representing the value part of data
as an Object rather than String makes more sense because it allows custom
tasks to exchange complete objects instead of forcing serialisation to and
from String.
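
As a rough illustration of the Object versus String point, the sketch below
shows how an XML writer might turn an arbitrary value into an attribute by
calling toString on it.  MetaXmlSketch and its methods are hypothetical names
invented here, not existing Osmosis code, and the escaping is deliberately
basic.

```java
// Hypothetical helper; not part of Osmosis.
final class MetaXmlSketch {
    private MetaXmlSketch() {
    }

    // Escapes the characters that are unsafe inside a double-quoted
    // XML attribute value.  The ampersand must be handled first.
    static String escape(String raw) {
        return raw.replace("&", "&amp;")
                  .replace("<", "&lt;")
                  .replace(">", "&gt;")
                  .replace("\"", "&quot;");
    }

    // Renders one String/Object pair as key="value", using the value's
    // toString (via String.valueOf) for the string form.
    static String toAttribute(String key, Object value) {
        return key + "=\"" + escape(String.valueOf(value)) + "\"";
    }
}
```

Inside a single pipeline a task can hand over a Boolean, or a complete
geometry object, untouched.  Only when the data passes through XML does it
collapse to its string form, and a reader would hand back a plain String
rather than the original type.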

The additional data could in theory be represented as Tags without changing
the pipeline at all, but it gets messy mixing real data with metadata.

I'm not sure if it makes sense to add support for this to the Bound object,
or to simply allow Tag objects to be added instead.  Perhaps tags make more
sense here?  The whole Bound concept has always fitted awkwardly in Osmosis,
so I'm not sure how to tackle this one.

Hmm, a somewhat rambling email :-)  Any thoughts?

Cheers
Brett