<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

Greg Troxel wrote:

<blockquote cite="mid:rmibpq8qrqs.fsf@fnord.ir.bbn.com" type="cite">

  <pre wrap="">Frederik Ramm <a class="moz-txt-link-rfc2396E" href="mailto:frederik@remote.org"><frederik@remote.org></a> writes:

  </pre>

  <blockquote type="cite">

    <pre wrap="">3. Make a semantic change to the way we handle diffs: Let the diff for 

interval X not be "all changes with timestamp within X" but instead "all 

changes that happened in a changeset that was closed within X". 

Changesets not being atomic should pose no problem for this (because 

when it's closed, it's closed). This would adversely affect downstream 

systems in that some changes are held back until the changeset is closed 

(whereas they are passed on immediately now), but on the other hand you 

could afford to generate the minutely diff at 5 seconds past the minute 

because you do not have to wait for transactions to settle (the actual 

changeset close never happens inside a transaction).

    </pre>

  </blockquote>

  <pre wrap=""><!---->

So obviously we aren't running "SET TRANSACTION ISOLATION LEVEL

SERIALIZABLE", since that would kill performance and make things harder,

but it would solve this :-)

It's possible for a transaction with effective time T to have a

commit time of T', and the minute scan for A-B for T &lt; B &lt; T' is not

seeing the changeset, and the B-C minute scan is considering it not in

bounds.

If the real requirement for minute diffs is that the union of them is

right, then having the minute diff generator keep track of all the

changeset IDs it has seen in the last hour, and do a query that is

basically:

  select all changesets from the last 30 minutes

  exclude all changesets in the previous 60 minute diffs

then the missing changeset would show up in the next diff, which would

be the minute it was committed in, not the minute it was started in.  If

it's known there are no holes then changeset > top_changeset could make

this faster.

  </pre>

</blockquote>

I don't think we can use changeset ids as a way of tracking processed
changes, because of the delay that introduces: changes would be held
back until their changeset is closed.  We have to track individual
entities instead.<br>

<br>

Entity ids will not arrive in sequential order, because existing
entities can be modified.  This means we can't check for holes and
query with 'node_id > top_node_id', for example.<br>

<br>
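A minimal sketch of why an id high-water mark falls short, assuming a
made-up change log of (node_id, version) pairs; the names here are
hypothetical and not taken from osmosis:<br>
<pre wrap="">
```python
# Hypothetical change log in commit order: (node_id, version).
# Node 101 is created first and modified again after node 103 exists.
changes = [(101, 1), (102, 1), (103, 1), (101, 2)]


def new_since(changes, top_node_id):
    """Return only the changes whose node id is above the high-water mark."""
    return [c for c in changes if c[0] > top_node_id]


# Suppose the previous diff already covered ids up to 102.
picked = new_since(changes, top_node_id=102)
print(picked)  # the modification (101, 2) is not in this list
```
</pre>
The new node is picked up, but the second version of node 101 is
silently skipped because its id is below the mark.<br>
<br>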

That leaves us having to re-query a window as long as the maximum time
a transaction could stay open.  I don't know how to bound this.
Obviously 5 minutes is not enough.  Maybe 15 would be?  If we go with a
15 minute interval, combining that with the existing 5 minute delay
means we have to read 10 minutes worth of data for every minutely diff.
That's 10 times more data to be read from the database at a time.  It
would probably work, but it would increase the load on the main
database.  The other thing we'd have to do is introduce a local
database of some kind to track processed ids, because osmosis gets
launched from cron every minute and doesn't maintain any state between
invocations other than the current timestamp.<br>

<br>
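As a rough sketch of that idea, assuming SQLite as the local store and
(entity_id, version) as the key for a processed change: the table
layout, function names and the one hour retention are all made up for
illustration, and the rows would really come from the window query
against the main database.<br>
<pre wrap="">
```python
import sqlite3


def open_state(path=":memory:"):
    """Open (or create) the local store of already-emitted changes."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS processed ("
        " entity_id INTEGER, version INTEGER, seen_at INTEGER,"
        " PRIMARY KEY (entity_id, version))"
    )
    return db


def select_new(db, window_rows, now, keep_seconds=3600):
    """Return the rows from the re-read window that were not emitted before.

    window_rows: (entity_id, version) pairs covering the whole window.
    keep_seconds must be longer than the window, otherwise a pruned row
    that is still inside the window would be emitted a second time.
    """
    fresh = []
    for entity_id, version in window_rows:
        cur = db.execute(
            "INSERT OR IGNORE INTO processed VALUES (?, ?, ?)",
            (entity_id, version, now),
        )
        if cur.rowcount == 1:  # the row was inserted, so it is new to us
            fresh.append((entity_id, version))
    # Forget entries that can no longer fall inside the window.
    db.execute("DELETE FROM processed WHERE ? - seen_at > ?", (now, keep_seconds))
    db.commit()
    return fresh


db = open_state()
first = select_new(db, [(101, 1), (102, 1)], now=0)
# One minute later the window overlaps the previous one and also holds
# a late commit (103, 1) and a modification (101, 2).
second = select_new(db, [(101, 1), (102, 1), (103, 1), (101, 2)], now=60)
print(first, second)
```
</pre>
Each run only emits what the earlier runs have not seen, so a change
that commits late lands in the diff for the minute it became visible.<br>
<br>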

It would work.  But hopefully there's a cleaner way.<br>

<br>

</body>

</html>