[OSM-dev] Minute Diffs Broken

Steve Singer ssinger_pg at sympatico.ca
Wed May 6 14:04:56 BST 2009

On Wed, 6 May 2009, Brett Henderson wrote:

> On second thoughts, a queueing mechanism may not be appropriate.  A queue 
> would be great if the queue contained all the data needed for replication but 
> that isn't likely to be the case.

You could:

A. Have a trigger put the entire row contents in the queue (extra write IO)
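A minimal sketch of option A, assuming a hypothetical simplified node table (the table, column, and function names here are illustrative, not the actual OSM schema):

```sql
-- Hypothetical queue table holding the full row contents (option A).
CREATE TABLE node_queue (
    queue_id   bigserial PRIMARY KEY,
    id         bigint    NOT NULL,
    version    bigint    NOT NULL,
    latitude   integer,
    longitude  integer,
    changed_at timestamp NOT NULL DEFAULT now()
);

-- Trigger function copying each changed row into the queue;
-- this is where the extra write IO is paid.
CREATE FUNCTION enqueue_node() RETURNS trigger AS $$
BEGIN
    INSERT INTO node_queue (id, version, latitude, longitude)
    VALUES (NEW.id, NEW.version, NEW.latitude, NEW.longitude);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER node_queue_trigger
    AFTER INSERT OR UPDATE ON nodes
    FOR EACH ROW EXECUTE PROCEDURE enqueue_node();
```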

B. Have Rails put the data for the entire upload into the queue (not using
triggers; the queue doesn't even have to be PostgreSQL-based, you could even
look at using an open-source queuing system like ActiveMQ). The downside
of this is that it does introduce some additional failure scenarios and
software components. Also, if the queue is persisted on disk (it might not
have to be) you'd again be paying a write IO penalty.

C. As you suggest, put the ids in the queue and do bulk selects with a
large IN (...). I have a vague recollection of Slony using this technique for
selecting data. You could also structure things so you could join the node
table to your queue table. I'm not sure if PgQ is structured to allow this
or not.
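A sketch of option C with the same illustrative names: the trigger logs only (id, version), and the replication process drains a batch with a join rather than per-node selects or a huge IN (...) list.

```sql
-- Lightweight queue: ids and versions only (option C).
CREATE TABLE node_id_queue (
    queue_id bigserial PRIMARY KEY,
    id       bigint NOT NULL,
    version  bigint NOT NULL
);

-- Batch retrieval by joining the node table to the queue table.
SELECT n.*
FROM nodes n
JOIN node_id_queue q
  ON q.id = n.id AND q.version = n.version
WHERE q.queue_id <= 100000;   -- bound the batch by an object count

-- After a successful replication pass, drain what was consumed.
DELETE FROM node_id_queue WHERE queue_id <= 100000;
```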

> If a trigger was applied to the node table for instance, then the trigger 
> would log the node id and version for the replication process to pick up. 
> Once the replication process picks up that id then it can retrieve the node 
> and associated tags to be replicated.  What I have to avoid is running select 
> statements per node, I need to pick them up in a large batch.  I could do 
> SELECT statements with large numbers of ids in a "WHERE node.id IN []" clause 
> but that doesn't scale very far.  It would be much nicer if I could do a join 
> to a table containing the ids I want to retrieve.  A queue is serialising 
> events which then have to be merged into a large set again for retrieval.  I 
> can't do one-by-one processing if I want to remain efficient.
> The second problem is that I don't always want all records in the queue.  I'd 
> still like to be able to break up records into time intervals rather than 
> grabbing everything available in the order it was logged.  However this 
> mightn't be an issue in the central database, it's more of an issue in the 
> distribution database.  So the main retrieval daemon could just grab 
> everything (or bound it by an object count limit) and dump into the 
> distribution database where time-based chunks could be extracted.
> Brett
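The time-based chunking Brett describes for the distribution database might look something like this (a sketch against a hypothetical changed_nodes table, not an actual schema):

```sql
-- Extract one time interval's worth of changes from the
-- distribution database (hypothetical table and column names).
SELECT *
FROM changed_nodes
WHERE changed_at >= '2009-05-06 13:00:00'
  AND changed_at <  '2009-05-06 14:00:00'
ORDER BY changed_at;
```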

More information about the dev mailing list