[OSM-dev] Minute Diffs Broken

Tue May 5 02:27:20 BST 2009

Greg Troxel wrote:
> Frederik Ramm <frederik at remote.org> writes:
>
>   
>> 3. Make a semantic change to the way we handle diffs: Let the diff for 
>> interval X not be "all changes with timestamp within X" but instead "all 
>> changes that happened in a changeset that was closed within X". 
>> Changesets not being atomic should pose no problem for this (because 
>> when it's closed, it's closed). This would adversely affect downstream 
>> systems in that some changes are held back until the changeset is closed 
>> (whereas they are passed on immediately now), but on the other hand you 
>> could afford to generate the minutely diff at 5 seconds past the minute 
>> because you do not have to wait for transactions to settle (the actual 
>> changeset close never happens inside a transaction).
>>     
>
> So obviously we aren't running "SET TRANSACTION ISOLATION LEVEL
> SERIALIZABLE", since that would kill performance and make things harder,
> but it would solve this :-)
>
> It's possible for a transaction with effective time T to have a
> commit time of T', and the minute scan for A-B for T < B < T' is not
> seeing the changeset, and the B-C minute scan is considering it not in
> bounds.
>
> If the real requirement for minute diffs is that the union of them is
> right, then having the minute diff generator keep track of all the
> changeset IDs it has seen in the last hour, and do a query that is
> basically:
>
>   select all changesets from the last 30 minutes
>   exclude all changesets in the previous 60 minute diffs
>
> then the missing changeset would show up in the next diff, which would
> be the minute it was committed in, not the minute it was started in.  If
> it's known there are no holes then changeset > top_changeset could make
> this faster.
>   
I don't think we can use changeset ids as a way of tracking processed 
changes due to the delay that introduces.  We have to track on 
individual entities.

Entity ids will not show up in sequence, because existing entities can
be modified.  This means we can't check for holes and query with, for
example, 'node_id > top_node_id'.

That leaves us having to re-query a window as long as the maximum time a
transaction could stay open.  I don't know how to bound this.  Obviously
5 minutes is not enough.  Maybe 15 would be?  If we go with a 15 minute
interval, combining that with the existing 5 minute delay means we have
to read 10 minutes' worth of data for every minute diff.  That's 10
times more data to be read from the database each time.  It would
probably work but it would increase the load on the main database.  The
other thing we'd have to do is introduce a local database of some kind
to track processed ids, because osmosis gets launched from cron every
minute and doesn't maintain any state between invocations other than
the current timestamp.

It would work.  But hopefully there's a cleaner way.
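
For illustration, here is a rough sketch of the windowed re-read with a
local store of processed ids described above.  It is not osmosis code:
the function names, the sqlite schema and the 10 minute WINDOW are all
assumptions, and keying on (type, id, version) is my guess at what
"processed ids" would need to be, since the same entity id can appear
again after a modification.  The main database is stood in for by a plain
list of rows.

```python
import sqlite3
from datetime import datetime, timedelta

# Assumed look-back window; the real bound on open transactions is unknown.
WINDOW = timedelta(minutes=10)


def open_state(path=":memory:"):
    """Open the local store of entity versions already written to a diff.

    Hypothetical schema: one row per (kind, id, version) already emitted.
    A real cron-launched job would pass a file path, not ":memory:".
    """
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS processed ("
        " kind TEXT, id INTEGER, version INTEGER, ts TEXT,"
        " PRIMARY KEY (kind, id, version))"
    )
    return db


def next_diff(state, visible_rows, window_end):
    """Return entity versions in the window that no earlier diff contained.

    visible_rows stands in for the query against the main database: an
    iterable of (kind, id, version, timestamp) for committed rows.
    """
    window_start = window_end - WINDOW
    diff = []
    for kind, id_, version, ts in visible_rows:
        if not (window_start <= ts < window_end):
            continue
        key = (kind, id_, version)
        seen = state.execute(
            "SELECT 1 FROM processed WHERE kind=? AND id=? AND version=?",
            key,
        ).fetchone()
        if seen is None:
            state.execute(
                "INSERT INTO processed VALUES (?, ?, ?, ?)",
                key + (ts.isoformat(),),
            )
            diff.append(key)
    # Rows older than the window can never be re-read, so forget them.
    state.execute(
        "DELETE FROM processed WHERE ts < ?", (window_start.isoformat(),)
    )
    state.commit()
    return diff
```

A row whose transaction commits late is simply picked up by the first run
that can see it, and later runs skip it because it is already in the
local store.  The cost is the one noted above: every run re-reads the
whole window from the main database.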
