[OSM-dev] Changeset And Replication

Tue Oct 21 12:39:41 BST 2008

Hi All,

I'm in the process of updating Osmosis to work with API 0.6, or more 
specifically to work with the new MySQL schema.

The biggest change is the introduction of changesets.  I'm interested in 
people's thoughts on how this should be done.

**** Option 1 ****
My initial plan is to largely ignore the changeset table.  I will 
continue to use the node/way/relation history tables as I do in 0.5 and 
only use the changeset table as a means of joining to the user table.  
When writing updates to a destination MySQL database, I will create a 
changeset per user per replication interval.  In other words if using 
minute changesets, there will be one changeset created per user per 
minute.  Hourly changesets will result in one changeset per user per 
hour.  This should be straightforward to implement.  This will have two 
major limitations:
1. Changesets will not align with changesets in the master production 
database.
2. The bounding box information on the changesets will all be set to the 
whole planet.  It may be possible to make the bounding boxes accurate 
but it will add a large overhead to processing so I won't provide it in 
the initial release.
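
A rough sketch of the grouping described in Option 1, in Python for 
brevity rather than Osmosis's Java.  The function name, the record 
fields and the (user_id, timestamp) input shape are all made up for 
illustration and are not part of Osmosis or the 0.6 schema:

```python
from datetime import datetime, timezone

# Whole-planet bounding box given to every synthetic changeset (limitation 2).
PLANET_BBOX = (-180.0, -90.0, 180.0, 90.0)  # min_lon, min_lat, max_lon, max_lat


def assign_changesets(edits, interval_seconds, first_id=1):
    """Create one synthetic changeset per (user, replication interval).

    `edits` is an iterable of (user_id, timestamp) pairs, a hypothetical
    stand-in for rows read from the node/way/relation history tables.
    Returns (changesets, assignments), where assignments[i] is the
    synthetic changeset id given to the i-th edit.
    """
    changesets = {}   # (user_id, interval_index) -> changeset record
    assignments = []
    next_id = first_id
    for user_id, ts in edits:
        # Edits in the same interval share an interval index.
        interval_index = int(ts.timestamp()) // interval_seconds
        key = (user_id, interval_index)
        if key not in changesets:
            changesets[key] = {
                "id": next_id,
                "user_id": user_id,
                "interval_start": datetime.fromtimestamp(
                    interval_index * interval_seconds, tz=timezone.utc
                ),
                "bbox": PLANET_BBOX,
            }
            next_id += 1
        assignments.append(changesets[key]["id"])
    return list(changesets.values()), assignments
```

With interval_seconds=60 this gives one changeset per user per minute, 
and with 3600 one per user per hour.  The ids are local to the 
destination database, which is exactly why they won't line up with the 
production changesets (limitation 1).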

**** Option 2 ****
A possible enhancement is to replicate changesets themselves.  There are 
a number of ways this could be done but the current changeset 
implementation makes all of them difficult in their own way.  I would 
have liked to use changesets themselves as the basis of replication to 
identify what data has been written during a change interval, but this is 
not possible because changesets are not guaranteed to be independent 
of (i.e. non-overlapping with) other changesets, cannot be relied upon to 
be closed in a timely fashion (thus having no further updates), and don't 
have a closing timestamp.  The second method I've been leaning towards 
is to introduce a new changeset element type in the changeset file which 
will include all changesets that have been created (but may not be 
closed yet) in the change interval.  This second method has the issue 
that the bounding box information may not be final because more changes 
may yet be written.
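
To illustrate the bounding box problem, here is a small Python sketch.  
The helper name and the (min_lon, min_lat, max_lon, max_lat) tuple 
layout are assumptions made for the example, not how the API stores 
bboxes:

```python
def expand_bbox(bbox, lon, lat):
    """Return bbox grown to include the point (lon, lat).

    bbox is (min_lon, min_lat, max_lon, max_lat), or None before the
    first edit in the changeset has been seen.
    """
    if bbox is None:
        return (lon, lat, lon, lat)
    min_lon, min_lat, max_lon, max_lat = bbox
    return (
        min(min_lon, lon),
        min(min_lat, lat),
        max(max_lon, lon),
        max(max_lat, lat),
    )


# Two edits are written before the replication interval ends.
bbox_at_replication = None
for lon, lat in [(-0.1, 51.5), (0.2, 51.4)]:
    bbox_at_replication = expand_bbox(bbox_at_replication, lon, lat)

# The changeset is still open, so a later edit can land far away.
bbox_final = expand_bbox(bbox_at_replication, 2.3, 48.9)
```

The bbox written to the change file at replication time covers only the 
first two edits, so a replica holding it is stale as soon as the third 
edit arrives, unless the changeset is sent again in a later interval.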

**** Problems with Changeset Replication ****
In short I don't have a way of creating useful changesets in replicated 
databases.  The first option creates artificial changesets without bbox 
information (although it could provide bbox information at the cost of a 
large overhead on initial import), and the second option has problems with 
bbox information due to the bboxes changing after the point of 
replication.  If changesets are not important outside of the main 
database then we can proceed with Option 1.  If replicated changesets 
are considered useful, then I can't see a workable solution for Option 2 
using the current changeset implementation and believe a change in 
design is required.  I'd like to see replicated changesets but the 
usefulness may be outweighed by increased complexity.

**** Possible Fixes ****
The easiest fix from a replication point of view would be to make 
changesets atomic, but this precludes live editors like Potlatch.
Another option is to introduce a form of locking where records are 
locked until their owning changesets are completed, but this adds 
complexity to the current implementation and may block edits if 
changesets are long-lived.

The advantage of either fix is that Osmosis knows for sure that a 
changeset is complete and is thus a candidate for replication, and that 
the changeset can be applied in isolation from other changesets so long as 
changesets are applied in chronological order.  I'd like to see the 
locking method employed.  This would require a daemon that limits 
the duration of changesets to sensible values (e.g. 5 minutes, but 
potentially variable based on changeset activity) and auto-closes 
changesets when the timeout expires.  For extra points the API could avoid 
exposing edited data until changesets are closed.
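
A minimal Python sketch of what one pass of such a daemon might do.  It 
assumes each changeset record has a created_at field and a closed_at 
field (the latter does not exist in the current implementation), and 
that the timeout is measured from creation.  All names are hypothetical:

```python
from datetime import datetime, timedelta, timezone


def close_expired(changesets, now, max_age=timedelta(minutes=5)):
    """Close every open changeset that has been open for max_age or longer.

    Each changeset is a dict with "id", "created_at" and "closed_at"
    (None while still open).  Returns the ids closed on this pass.
    """
    closed = []
    for cs in changesets:
        if cs["closed_at"] is None and now - cs["created_at"] >= max_age:
            cs["closed_at"] = now
            closed.append(cs["id"])
    return closed


def replicable(changesets):
    """Closed changesets, ordered by closing time for chronological apply."""
    done = [cs for cs in changesets if cs["closed_at"] is not None]
    return sorted(done, key=lambda cs: cs["closed_at"])
```

The daemon would call close_expired on a schedule, and replication 
would only ever read from replicable, so it never sees a changeset that 
can still change.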

Hopefully the above makes sense.  Any thoughts and feedback appreciated.

Brett