[OSM-dev] Minute Diffs Broken

Brett Henderson brett at bretth.com
Tue May 5 23:54:48 BST 2009


Brett Henderson wrote:
> Steve Singer wrote:
>> On Tue, 5 May 2009, Brett Henderson wrote:
>>
>>> That does look interesting.  I'd hope to use that outside the main 
>>> database though.  My thoughts were to use triggers to populate short 
>>> term flag tables which a single threaded process would read, use as 
>>> keys to select modified data into an offline database, then clear.  
>>> This offline database could then use a queueing system such as PgQ 
>>> (I haven't seen it before, will have to check it out) to send events 
>>> to the various consumers of the data.  I'd like to minimise access 
>>> to the central database if possible because 1. it will scale better, 
>>> and 2. it adds less burden to existing DBAs.
>>
>> I agree you'd only want one process pulling data from the central 
>> database and then let other clients pull from another machine.  You'd 
>> have to examine how different your trigger + scanning process code 
>> will be from using PgQ with 1 consumer that then stores the data in 
>> another db for publishing.  You should at least look to see what 
>> problems they solved.
> I'll take a look.  You're right, I should avoid poorly inventing 
> something that others have already done a better job of :-)  I'd hate 
> to impose a bottleneck on the entire app.
On second thoughts, a queueing mechanism may not be appropriate.  A 
queue would be ideal if it contained all the data needed for 
replication, but that isn't likely to be the case.

If a trigger were applied to the node table, for instance, the 
trigger would log the node id and version for the replication process 
to pick up.  Once the replication process picks up that id, it can 
retrieve the node and its associated tags for replication.  What I 
have to avoid is running a SELECT statement per node; I need to pick 
nodes up in large batches.  I could issue SELECT statements with 
large numbers of ids in a "WHERE node.id IN []" clause, but that 
doesn't scale very far.  It would be much nicer if I could do a join 
against a table containing the ids I want to retrieve.  A queue 
serialises events, which then have to be merged back into a large set 
for retrieval.  I can't do one-by-one processing if I want to remain 
efficient.
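
To make that concrete, here's a rough PostgreSQL sketch of the 
trigger/flag-table idea.  The table and column names (nodes, 
node_tags, changed_nodes and so on) are illustrative assumptions, not 
the actual OSM schema:

    -- Hypothetical flag table holding changed node ids and versions.
    CREATE TABLE changed_nodes (
        node_id bigint NOT NULL,
        version int NOT NULL
    );

    -- Trigger function logging the id/version of any inserted or
    -- updated node; the replication process reads then clears this.
    CREATE FUNCTION log_node_change() RETURNS trigger AS $$
    BEGIN
        INSERT INTO changed_nodes (node_id, version)
        VALUES (NEW.id, NEW.version);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER nodes_change_logger
        AFTER INSERT OR UPDATE ON nodes
        FOR EACH ROW EXECUTE PROCEDURE log_node_change();

    -- Batch retrieval: a single join instead of per-node SELECTs or
    -- an enormous "WHERE node.id IN (...)" list.
    SELECT n.id, n.version, n.latitude, n.longitude, t.k, t.v
    FROM nodes n
    JOIN changed_nodes c
      ON c.node_id = n.id AND c.version = n.version
    LEFT JOIN node_tags t
      ON t.node_id = n.id AND t.version = n.version;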

The second problem is that I don't always want all records in the 
queue.  I'd still like to be able to break records up into time 
intervals rather than grabbing everything available in the order it 
was logged.  However, this mightn't be an issue in the central 
database; it's more of an issue in the distribution database.  So the 
main retrieval daemon could just grab everything (or bound it by an 
object count limit) and dump it into the distribution database, where 
time-based chunks could be extracted.
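
As a sketch of the distribution side (again with an assumed table 
name and columns, not an existing schema), the daemon could stamp 
each change as it lands, and consumers would then pull time-bounded 
chunks:

    -- Hypothetical staging table in the distribution database; each
    -- replicated change is stamped when the daemon inserts it.
    CREATE TABLE replicated_changes (
        node_id bigint NOT NULL,
        version int NOT NULL,
        logged_at timestamp NOT NULL DEFAULT now()
    );

    -- A consumer extracts a time-based chunk, e.g. one minute's
    -- worth, rather than taking everything in arrival order.
    SELECT node_id, version
    FROM replicated_changes
    WHERE logged_at >= '2009-05-05 23:00:00'
      AND logged_at <  '2009-05-05 23:01:00';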

Brett




