[OSM-dev] Minute Diffs Broken

Brett Henderson brett at bretth.com
Tue May 5 01:51:01 BST 2009

Frederik Ramm wrote:
> Hi,
> Brett Henderson wrote:
>> I'm not reading any of the changeset table data so the behaviour of 
>> the closed_at field doesn't affect osmosis.  The changeset table is 
>> effectively useless to osmosis processing because changesets aren't 
>> atomic.
> Thinking about possible solutions:
> 1. When updating things in a transaction, set the timestamp to the 
> commit time of the transaction. I don't believe PostgreSQL can do it.
If we could do this it'd be great.
> 2. As you said, introduce changes to the database, like dirty bits or 
> change logs or so.
It's my only option at the moment.  It has a number of advantages such 
as being able to process immediately behind the API with no delay.  But 
it introduces a lot more complexity.  Part of the issue is that several 
downstream osmosis tasks want the data.  My preference would be to use 
the "dirty log" as a simple marker table and then pull all changes into 
a separate offline database for distribution amongst the various 
consuming osmosis processes.  It is also possible to only have a single 
osmosis consumer (e.g. minute diffs) and perform post-processing to merge 
them into hourly and daily diffs, but an offline database would make 
other things easier such as full history deltas.

If we went down this path it would need significant enhancements to the 
core database, something to stream changes out of the core db into a 
changes database, and something to feed those changes into the existing 
diff files.  I think it's perfectly doable and I can't see any major 
showstoppers, but it's not a trivial task.  I'd need a lot of help from 
others :-)
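The dirty-log idea could be sketched roughly as below. All table and column names are assumptions for illustration, and sqlite stands in for both the core and offline databases: the API marks each changed entity in a marker table within the same transaction, and a single streaming process drains the markers into the offline database that downstream consumers read.

```python
import sqlite3

# Core database: real entity data plus a "dirty log" marker table.
core = sqlite3.connect(":memory:")
core.execute("CREATE TABLE nodes (id INTEGER, version INTEGER, lat REAL, lon REAL)")
core.execute("CREATE TABLE dirty_log (node_id INTEGER)")

# Offline changes database for distribution to consuming osmosis processes.
offline = sqlite3.connect(":memory:")
offline.execute("CREATE TABLE nodes (id INTEGER, version INTEGER, lat REAL, lon REAL)")

# The API writes a node and marks it dirty in the same transaction, so a
# committed row is always accompanied by its marker.
core.execute("INSERT INTO nodes VALUES (1, 1, 51.5, -0.1)")
core.execute("INSERT INTO dirty_log VALUES (1)")
core.commit()

# The streaming process: read markers, copy the marked rows across, then
# clear the markers it has processed.
dirty_ids = [r[0] for r in core.execute("SELECT node_id FROM dirty_log")]
for node_id in dirty_ids:
    row = core.execute("SELECT * FROM nodes WHERE id = ?", (node_id,)).fetchone()
    offline.execute("INSERT INTO nodes VALUES (?, ?, ?, ?)", row)
core.execute("DELETE FROM dirty_log")
core.commit()
offline.commit()

print(offline.execute("SELECT COUNT(*) FROM nodes").fetchone()[0])  # 1
```

Because the marker is written in the same transaction as the data, visibility of the marker implies visibility of the change, so no time-window guessing is needed.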
> 3. Make a semantic change to the way we handle diffs: Let the diff for 
> interval X not be "all changes with timestamp within X" but instead 
> "all changes that happened in a changeset that was closed within X". 
> Changesets not being atomic should pose no problem for this (because 
> when it's closed, it's closed). This would adversely affect downstream 
> systems in that some changes are held back until the changeset is 
> closed (whereas they are passed on immediately now), but on the other 
> hand you could afford to generate the minutely diff at 5 seconds past 
> the minute because you do not have to wait for transactions to settle 
> (the actual changeset close never happens inside a transaction).
I think this would introduce far too large a delay.  What is the maximum 
age of a changeset?  That is the delay that may occur between making an 
edit and it appearing in replica databases.  I don't think that would be 
suitable for Tiles at Home or Mapnik, for instance.  It would be simple 
to implement, though.  This was my original plan until I learnt that 
changesets weren't going to be atomic.
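For comparison, option 3 could be sketched like this (records and field names are made up): a diff for interval [start, end) contains every change belonging to a changeset whose closed_at falls inside that interval, so edits in long-lived changesets are held back until the changeset closes.

```python
# Hypothetical in-memory records illustrating closed-changeset diffs.
changesets = {
    100: {"closed_at": 65},
    101: {"closed_at": 130},  # still open when the first diff is cut
}
changes = [
    {"entity": "node/1", "changeset": 100, "timestamp": 62},
    {"entity": "node/2", "changeset": 101, "timestamp": 63},
]

def diff_by_closed_changeset(start, end):
    """All changes whose changeset closed within [start, end)."""
    closed = {cid for cid, cs in changesets.items() if start <= cs["closed_at"] < end}
    return [c for c in changes if c["changeset"] in closed]

# node/2 was edited inside [60, 120) but is held back until its changeset
# closes, so it only appears in the [120, 180) diff.
print([c["entity"] for c in diff_by_closed_changeset(60, 120)])   # ['node/1']
print([c["entity"] for c in diff_by_closed_changeset(120, 180)])  # ['node/2']
```

The delay for node/2 here is the lifetime of its changeset, which is exactly the latency concern above.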

It's worth noting that if we went with option 2 we'd have to include 
part of option 3.  If data was missed from one diff due to a delayed 
commit it would have to be included in a subsequent diff, which is a 
slight change from the current behaviour.  It shouldn't impact consumers 
so long as entity versions are ordered correctly in diff files.
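That ordering requirement can be sketched simply (with made-up records): even if a delayed commit lands an older version in a later batch, consumers stay consistent as long as each entity's versions appear in ascending order within a diff.

```python
# Hypothetical pending changes, where a delayed commit arrived out of order.
pending = [
    {"id": 7, "version": 3},
    {"id": 7, "version": 2},  # delayed commit, arrived late
    {"id": 9, "version": 1},
]

# Sort by (entity id, version) before writing the diff file, so each
# entity's versions are applied by consumers in the right order.
ordered = sorted(pending, key=lambda c: (c["id"], c["version"]))
print([(c["id"], c["version"]) for c in ordered])  # [(7, 2), (7, 3), (9, 1)]
```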

