[OSM-dev] Minute Diffs Broken
Brett Henderson
brett at bretth.com
Tue May 5 01:51:01 BST 2009
Frederik Ramm wrote:
> Hi,
>
> Brett Henderson wrote:
>> I'm not reading any of the changeset table data so the behaviour of
>> the closed_at field doesn't affect osmosis. The changeset table is
>> effectively useless to osmosis processing because changesets aren't
>> atomic.
>
> Thinking about possible solutions:
>
> 1. When updating things in a transaction, set the timestamp to the
> commit time of the transaction. I don't believe PostgreSQL can do it.
If we could do this it'd be great.
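To illustrate the problem option 1 is aimed at: a row is stamped when the statement runs, not when the transaction commits, so a change committed just after a minute boundary can carry a timestamp inside a window whose diff has already been generated. A toy illustration (all values hypothetical):

```python
from datetime import datetime

# The row's timestamp reflects statement time, inside window W...
stamped_at = datetime(2009, 5, 5, 1, 59, 58)
# ...but the transaction only commits after W's boundary.
committed_at = datetime(2009, 5, 5, 2, 0, 2)

window_end = datetime(2009, 5, 5, 2, 0, 0)
diff_generated_at = window_end  # minute diff for W produced at the boundary

# The change becomes visible only after commit, yet is stamped inside W,
# so neither W's diff nor the next one picks it up: the change is lost.
missed = committed_at > diff_generated_at and stamped_at < window_end
```

Setting the timestamp to the commit time would make the stamp and visibility agree, which is why option 1 would solve this cleanly if the database supported it.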
>
> 2. As you said, introduce changes to the database, like dirty bits or
> change logs or so.
It's my only option at the moment. It has a number of advantages such
as being able to process immediately behind the API with no delay. But
it introduces a lot more complexity. Part of the issue is that several
downstream osmosis tasks want the data. My preference would be to use
the "dirty log" as a simple marker table and then pull all changes into
a separate offline database for distribution amongst the various
consuming osmosis processes. It is also possible to only have a single
osmosis consumer (eg. minute diffs) and perform post processing to merge
them into hourly and daily diffs but an offline database would make
other things easier such as full history deltas.
If we went down this path it would need significant enhancements to be made
to the core database, something to stream changes out of the core db
into a changes database, and something to feed those changes into the
existing diff files. I think it's perfectly do-able and I can't see any
major showstoppers, but not a trivial task. I'd need a lot of help from
others :-)
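A minimal sketch of the marker-table idea, assuming a hypothetical `dirty_log` table written in the same transaction as each edit, with sqlite3 standing in for both the core and offline databases (all table and column names are illustrative, not the real API schema):

```python
import sqlite3

# Two in-memory databases stand in for the core API db and the
# offline "changes" db that downstream osmosis consumers would read.
core = sqlite3.connect(":memory:")
offline = sqlite3.connect(":memory:")

core.executescript("""
CREATE TABLE nodes (id INTEGER, version INTEGER, lat REAL, lon REAL);
-- Simple marker table: one row per entity touched, written in the
-- same transaction as the edit itself so it can never be missed.
CREATE TABLE dirty_log (entity_type TEXT, entity_id INTEGER);
""")
offline.execute(
    "CREATE TABLE nodes (id INTEGER, version INTEGER, lat REAL, lon REAL)")

def api_edit(node_id, version, lat, lon):
    """An API write: the edit and its dirty marker commit atomically."""
    with core:
        core.execute("INSERT INTO nodes VALUES (?,?,?,?)",
                     (node_id, version, lat, lon))
        core.execute("INSERT INTO dirty_log VALUES ('node', ?)", (node_id,))

def replicate():
    """Pull everything marked dirty into the offline db, clear the markers."""
    ids = [r[0] for r in core.execute(
        "SELECT entity_id FROM dirty_log WHERE entity_type='node'")]
    for nid in ids:
        rows = core.execute("SELECT * FROM nodes WHERE id=?", (nid,)).fetchall()
        with offline:
            offline.execute("DELETE FROM nodes WHERE id=?", (nid,))
            offline.executemany("INSERT INTO nodes VALUES (?,?,?,?)", rows)
    with core:
        core.execute("DELETE FROM dirty_log")
    return ids

api_edit(1, 1, 51.5, -0.1)
api_edit(2, 1, 48.9, 2.3)
replicated = replicate()
```

Because the marker is part of the edit's own transaction, a replicator polling `dirty_log` sees changes as soon as they commit, which is what allows processing immediately behind the API with no settling delay.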
>
> 3. Make a semantic change to the way we handle diffs: Let the diff for
> interval X not be "all changes with timestamp within X" but instead
> "all changes that happened in a changeset that was closed within X".
> Changesets not being atomic should pose no problem for this (because
> when it's closed, it's closed). This would adversely affect downstream
> systems in that some changes are held back until the changeset is
> closed (whereas they are passed on immediately now), but on the other
> hand you could afford to generate the minutely diff at 5 seconds past
> the minute because you do not have to wait for transactions to settle
> (the actual changeset close never happens inside a transaction).
I think this would introduce far too large a delay. What is the maximum
age of a changeset? That is the delay that may occur between making an
edit and it appearing in replica databases. I don't think that would be
suitable for tiles at home and mapnik for instance. It would be simple to
implement though. This was my original plan until I learnt that
changesets weren't going to be atomic.
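For comparison, option 3 amounts to keying the diff window on changeset close time rather than edit time. A rough sketch with hypothetical data:

```python
from datetime import datetime

# Hypothetical edits: (entity_id, version, edit_time, changeset_id)
edits = [
    (1, 1, datetime(2009, 5, 5, 1, 58), 100),
    (2, 1, datetime(2009, 5, 5, 1, 59), 101),
    (1, 2, datetime(2009, 5, 5, 2, 1), 100),
]
# Close times per changeset; changeset 100 closes well after the window.
closed_at = {
    100: datetime(2009, 5, 5, 2, 30),
    101: datetime(2009, 5, 5, 2, 0),
}

def diff_for_interval(start, end):
    """All changes belonging to a changeset that closed within [start, end).

    Edits in changesets still open (or closed later) are held back, which
    is why the worst-case delay equals the maximum age of a changeset.
    """
    closed = {cs for cs, t in closed_at.items() if start <= t < end}
    return [e for e in edits if e[3] in closed]

window = diff_for_interval(datetime(2009, 5, 5, 2, 0),
                           datetime(2009, 5, 5, 2, 1))
```

Here only changeset 101's edit lands in the 02:00 minute diff; both edits in changeset 100 are held back until 02:30, illustrating the delay concern above.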
It's worth noting that if we went with option 2 we'd have to include
part of option 3. If data was missed from one changeset due to delayed
commit it would have to be included in a subsequent changeset which is a
slight change from the current behaviour. It shouldn't impact consumers
so long as entity versions are ordered correctly in diff files.
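That ordering requirement can be enforced with a simple sort when writing the diff file; a sketch with hypothetical change tuples:

```python
# Hypothetical changes collected for one diff file: (entity_id, version).
# Some versions arrive via a later changeset than their predecessors.
changes = [(7, 3), (2, 1), (7, 2), (2, 2)]

# As long as each entity's versions appear in ascending order, a consumer
# can apply the diff correctly regardless of which changeset delivered them.
ordered = sorted(changes)  # sorts by entity id, then by version
```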
Brett