[OSM-dev] Hourly diffs are missing edits (too)

Matt Amos zerebubuth at gmail.com
Wed Oct 7 16:50:43 BST 2009


On Wed, Oct 7, 2009 at 11:22 AM, Frederik Ramm <frederik at remote.org> wrote:
> Hi,
>
> Tom Hughes wrote:
>>> Could Postgres be persuaded to abort any transaction that runs longer
>>> than "n" minutes (e.g. 30), and then we run the hourlies at hh:31 or so?
>>> That would probably be a slight inconvenience to those who happen to
>>> start 35-minute transactions but they should just learn to do their bulk
>>> imports properly ;-)
>>
>> Um... Are you being serious?
>
> Yes. In most cases, whatever HTTP client they were using to upload will
> already have terminated the TCP connection and complained about a
> timeout anyway, leading them to upload the same thing again...

unless they're on an extremely slow connection and the diff is
trickling into the server at a few bytes per second.

>> Why? We've already solved the problem with the transactional diffs...
>
> For many uses, I view the hourly diffs as superior as they will contain
> less "noise" (edits that cancel each other out will not be present in
> the hourlies). So I would really like to have reliable hourly diffs -
> even if they come with a considerable delay.

it's possible to do this by aggregating the replication diffs and
dropping intermediate versions. of course, this won't exactly match an
hour boundary, but should usually be pretty close.
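
as a rough sketch of what "aggregating and dropping intermediate
versions" could mean (python; the `Change` record and `aggregate`
function are made-up names for illustration, not the format or code
used by the replication tools), assuming each diff is a list of
changes and the diffs arrive in commit order:

```python
from typing import Iterable, NamedTuple


class Change(NamedTuple):
    """Hypothetical, simplified record of one entry in a replication diff."""
    action: str      # "create", "modify" or "delete"
    elem_type: str   # "node", "way" or "relation"
    elem_id: int
    version: int


def aggregate(diffs: Iterable[Iterable[Change]]) -> list:
    """Merge diffs (given in commit order) into one, keeping only the
    highest version of each element seen in the window."""
    first_action = {}   # first action seen per element in this window
    latest = {}         # highest-version change seen per element
    for diff in diffs:
        for change in diff:
            key = (change.elem_type, change.elem_id)
            first_action.setdefault(key, change.action)
            prev = latest.get(key)
            if prev is None or change.version > prev.version:
                latest[key] = change

    merged = []
    for key, change in latest.items():
        created_here = first_action[key] == "create"
        if created_here and change.action == "delete":
            # created and deleted inside the window: cancels out entirely
            continue
        if created_here:
            # consumers have never seen this element, so it is still a create
            change = change._replace(action="create")
        merged.append(change)
    return sorted(merged, key=lambda c: (c.elem_type, c.elem_id))
```

feeding it every replication diff from roughly one hour gives a single
diff without the "noise" of intermediate versions. treating
create-then-delete as a no-op is a design choice of this sketch.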

> My reasoning is that I feel uneasy when these transactions are
> completely unlimited. For all I know, there might be a freak case where
> a transaction runs for 8 days and I suddenly have an object in this
> week's planet which is 8 days old but wasn't in last week's planet (etc.).

but since the transaction commits atomically, the only evidence of
this would be the timestamp on the element. the timestamp on the
element may as well be replaced with the timestamp of the replication
diff it came in.
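
for illustration only, a consumer that prefers commit time over element
time could re-stamp elements as it reads a diff. this is a minimal
python sketch; `Element` and `restamp` are hypothetical names, and the
diff's own timestamp is assumed to be known to the caller:

```python
from dataclasses import dataclass, replace
from datetime import datetime


@dataclass(frozen=True)
class Element:
    """Hypothetical, simplified element as read from a diff."""
    elem_type: str
    elem_id: int
    version: int
    timestamp: datetime


def restamp(elements, diff_time: datetime) -> list:
    """Return copies of the elements carrying the diff's timestamp
    instead of their own; the originals are left untouched."""
    return [replace(e, timestamp=diff_time) for e in elements]
```

with that, an element written by a very long transaction shows up dated
by the diff that delivered it, not by when the upload started.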

> So I would like to have *some* certainty about the run time of
> transactions. Even if it is 24 hours or so - just some value where I
> know that no transaction can possibly exceed this duration. And if we
> should find that only one promille of all transactions takes longer than
> thirty minutes then I'd be prepared to sacrifice that one transaction if
> that buys me accurate hourly diffs at 31 minutes past the hour.

aggregating replication diffs gives you accurate nearly-hourly diffs
at 1 minute past the hour - wouldn't you prefer those? all it takes is
a little mental re-adjustment away from a time-based stream towards a
transaction-based stream. come on in, the mental water is fine ;-)

cheers,

matt



