[OSM-dev] Changeset And Replication
Brett Henderson
brett at bretth.com
Tue Oct 21 12:39:41 BST 2008
Hi All,
I'm in the process of updating Osmosis to work with API 0.6, or more
specifically to work with the new MySQL schema.
The biggest change is the introduction of changesets. I'm interested in
people's thoughts on how this should be done.
**** Option 1 ****
My initial plan is not to look at the changeset table at all. I will
continue to use the node/way/relation history tables as I do in 0.5 and
only use the changeset table as a means of joining to the user table.
When writing updates to a destination MySQL database, I will create a
changeset per user per replication interval. In other words, if using
minute intervals, there will be one changeset created per user per
minute. Hourly intervals will result in one changeset per user per
hour. This should be straightforward to implement, but it has two
major limitations:
1. Changesets will not align with changesets in the master production
database.
2. The bounding box information on the changesets will all be set to the
whole planet. It may be possible to make the bounding boxes accurate
but it will add a large overhead to processing so I won't provide it in
the initial release.
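The grouping rule in Option 1 could look roughly like the sketch below. It is a Python illustration only, not Osmosis code: the `Change` record, the `assign_changesets` function and the sequential id scheme are all assumptions made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical record for one row read from a node/way/relation history table.
@dataclass(frozen=True)
class Change:
    user_id: int
    timestamp: datetime

# Whole-planet bounding box (min_lon, min_lat, max_lon, max_lat), per limitation 2.
WHOLE_PLANET = (-180.0, -90.0, 180.0, 90.0)

def assign_changesets(changes, interval_seconds, first_id=1):
    """Give every (user, replication interval) pair one synthetic changeset id."""
    ids = {}
    assigned = []
    next_id = first_id
    for change in sorted(changes, key=lambda c: c.timestamp):
        # Bucket number of the replication interval this change falls into.
        bucket = int(change.timestamp.timestamp()) // interval_seconds
        key = (change.user_id, bucket)
        if key not in ids:
            ids[key] = next_id
            next_id += 1
        assigned.append((change, ids[key], WHOLE_PLANET))
    return assigned
```

With `interval_seconds=60`, two edits by the same user inside one minute share a changeset id, while an edit in the next minute gets a new one. None of the ids correspond to changesets in the master database, which is limitation 1.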
**** Option 2 ****
A possible enhancement is to replicate changesets themselves. There are
a number of ways this could be done but the current changeset
implementation makes all of them difficult in their own way. I would
have liked to use changesets themselves as a basis of replication to
identify what data has been written during a change interval but this is
not possible because changesets are not guaranteed to be independent
of (ie. non-overlapping with) other changesets, cannot be relied upon to be
closed in a timely fashion (thus having no further updates), and don't
have a closing timestamp. The second method I've been leaning towards
is to introduce a new changeset element type in the changeset file which
will include all changesets that have been created (but may not be
closed yet) in the change interval. This second method has the issue
that the bounding box information may not be final because more changes
may yet be written.
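The bounding box issue can be shown with a small sketch. This is Python for illustration only; `expand_bbox` is a hypothetical helper and the coordinates are made up.

```python
def expand_bbox(bbox, lon, lat):
    """Return bbox (min_lon, min_lat, max_lon, max_lat) grown to include a point."""
    if bbox is None:
        return (lon, lat, lon, lat)
    min_lon, min_lat, max_lon, max_lat = bbox
    return (min(min_lon, lon), min(min_lat, lat),
            max(max_lon, lon), max(max_lat, lat))

# Edits written to an open changeset before the change interval ends.
bbox = None
for lon, lat in [(151.20, -33.87), (151.21, -33.86)]:
    bbox = expand_bbox(bbox, lon, lat)
replicated_bbox = bbox  # what the change file would carry for this changeset

# The changeset is still open, so a later edit widens the box on the master.
bbox = expand_bbox(bbox, 144.96, -37.81)
final_bbox = bbox
```

Here `replicated_bbox` and `final_bbox` differ, so a replica that only saw the changeset once would hold a stale bounding box unless the changeset were sent again in a later interval.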
**** Problems with Changeset Replication ****
In short I don't have a way of creating useful changesets in replicated
databases. The first option creates artificial changesets without bbox
information (although it could have bbox information by adding a large
overhead to initial import), and the second option has problems with
bbox information due to the bboxes changing after the point of
replication. If changesets are not important outside of the main
database then we can proceed with Option 1. If replicated changesets
are considered useful, then I can't see a workable solution for Option 2
using the current changeset implementation and believe a change in
design is required. I'd like to see replicated changesets but the
usefulness may be outweighed by increased complexity.
**** Possible Fixes ****
The easiest fix from a replication point of view would be to make
changesets atomic but this precludes live editors like Potlatch.
Another option is to introduce a form of locking where records are
locked until their owning changesets are completed but this adds
complexity to the current implementation and may block edits if
changesets are long-lived.
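As a rough sketch of the locking idea, assuming a simple in-memory lock table (the `ElementLocks` class and its methods are hypothetical and say nothing about how the API would actually store locks):

```python
class ElementLocks:
    """Hypothetical lock table: element -> id of the open changeset that owns it."""

    def __init__(self):
        self._owner = {}

    def try_edit(self, element, changeset_id):
        """Allow the edit only if the element is free or already owned by this changeset."""
        owner = self._owner.setdefault(element, changeset_id)
        return owner == changeset_id

    def close_changeset(self, changeset_id):
        """Release every element held by the changeset once it is complete."""
        self._owner = {e: o for e, o in self._owner.items() if o != changeset_id}


locks = ElementLocks()
first = locks.try_edit(("node", 1), 10)    # changeset 10 takes the lock
blocked = locks.try_edit(("node", 1), 11)  # changeset 11 is refused
locks.close_changeset(10)
retry = locks.try_edit(("node", 1), 11)    # succeeds after 10 is closed
```

The refused edit in the middle is the cost mentioned above: while changeset 10 stays open, nobody else can touch that node.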
The advantage of either fix is that osmosis knows for sure that a
changeset is complete and is thus a candidate for replication, and that
the changeset can be applied in isolation from other changesets so long as
changesets are applied in chronological order. I'd like to see the
locking method employed. This would require a daemon to run which limits
the duration of changesets to sensible values (eg. 5 minutes but
potentially variable based on changeset activity) and auto-closes
changesets if the timeout expires. For extra points the API could avoid
exposing edited data until changesets are closed.
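A minimal sketch of what such a daemon pass and the resulting replication query might do is below. It is Python for illustration only: the `Changeset` record, the `closed_at` field and both function names are assumptions, and ordering by closing time is just one reading of "chronological order".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical changeset record; closed_at is the closing timestamp that
# the current schema lacks.
@dataclass
class Changeset:
    id: int
    created_at: datetime
    closed_at: Optional[datetime] = None

MAX_DURATION = timedelta(minutes=5)  # the example value given above

def auto_close_expired(changesets, now, max_duration=MAX_DURATION):
    """One daemon pass: close open changesets that have exceeded the timeout."""
    closed = []
    for cs in changesets:
        if cs.closed_at is None and now - cs.created_at >= max_duration:
            cs.closed_at = now
            closed.append(cs.id)
    return closed

def replication_candidates(changesets):
    """Closed changesets only, ordered by closing time (an assumed ordering)."""
    done = [cs for cs in changesets if cs.closed_at is not None]
    return sorted(done, key=lambda cs: cs.closed_at)
```

Open changesets never appear in `replication_candidates`, so each changeset that osmosis picks up is final and carries its final bounding box.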
Hopefully the above makes sense. Any thoughts and feedback appreciated.
Brett