[OSM-dev] Deriving Change Sets

Brett Henderson brett at bretth.com
Sun Jul 1 06:20:19 BST 2007


Frederik Ramm wrote:
> This is something I would really love to see get off the ground. (I 
> remember an evening at the Essen meeting where I complained about the 
> weekly dump, and Nick Black said something like "well we could do it 
> daily", and I said "daily is not enough", and he went "well one could 
> do hourly with proper equipment" and I said "dumps, dumps, dumps, I 
> don't want no stupid dumps, I want live data..." - a discussion ensued 
> about what you'd possibly need live data for, but until today I 
> maintain that we should just provide data as live as possible without 
> asking what people want to use it for.)
At the risk of rambling, I feel the same way.  Dumps are extremely 
valuable, simple to implement, and probably the right solution for many 
problems.  But they have limitations, such as:
* Data is always out of date.
* Attempting to increase dump frequency adds significant load to the 
data source (i.e. the database and other OSM server infrastructure).
* Data is re-transmitted every time a new dump is requested, adding to 
network utilisation.
* At the receiving end, the complete dataset must be processed every 
time a new dump is used.

A method of synchronisation avoids the problems above.

A regular synchronisation mechanism enables possibilities such as the 
following:
* Some end users, such as Mapnik, can produce more up-to-date maps and 
can avoid significant processing by importing only changes.
* To alleviate load on the current API and primary database, current 
users of the API such as tiles at home *may* (this may be 
controversial :-) be able to switch to using a "near live" feed.
* Tasks can respond to changes more effectively.  To use tiles at home 
as an example, the replication task from primary database to rendering 
database could examine changed nodes and automatically flag tiles that 
need re-rendering, eliminating the need to manually request tile 
re-renders (a rough sketch of this follows the list).
* Read-only tasks without hard real-time requirements can be moved off 
the API (and core database), allowing the core infrastructure to scale 
to a larger number of users.

> But I always thought - as long as "near live" feeds are what one wants 
> - it would be much cheaper in terms of processing power to simply log 
> each change as performed by the API.
Agree.
>
> Your approach would be required if there were other ways to change the 
> data but in our situation where anything that changes data has to go 
> through rails anyway, why not have rails log these things and simply 
> process the log files?
The current tool I'm working on (Osmosis) aims to support any change 
detection mechanism.  It consists of a pipeline of connected tasks 
processing OSM data.  There are producer tasks, consumer tasks, and 
combinations of both.  They can be plugged together in arbitrary 
arrangements so that producer tasks don't care which consumer tasks 
are receiving their data.
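
As a rough illustration of the shape of that pipeline, here is what 
the producer and consumer contracts might look like.  The interface 
and method names below are my own shorthand, not necessarily the ones 
Osmosis itself uses:

// Placeholder for a node, segment, way, etc.
class Element {
}

interface Sink {
    void process(Element element);  // consume one piece of OSM data
    void complete();                // called once the producer is done
}

interface Source {
    void setSink(Sink sink);  // plug any consumer into this producer
    void run();               // push data into the attached sink
}

// A producer pushes data without knowing which consumer receives it.
class ExampleReader implements Source {
    private Sink sink;

    public void setSink(Sink sink) {
        this.sink = sink;
    }

    public void run() {
        // ... read elements from a file or database and call
        // sink.process(element) for each one ...
        sink.complete();
    }
}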

Currently I'm working on a task that produces change sets by examining 
a database, because this could be used today without changes to other 
OSM infrastructure.  In effect, I'm treating the history tables of the 
database as my log file, aiming not to touch data that hasn't changed.
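
The core of that reader is just a timestamp-bounded query.  The table 
and column names below (nodes, timestamp, visible) are assumptions 
about the production schema rather than verified details:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;

// Sketch only: read node rows modified since the previous run.
public class NodeChangeReader {
    public void readChanges(Connection db, Timestamp lastRun)
            throws SQLException {
        PreparedStatement stmt = db.prepareStatement(
                "SELECT id, latitude, longitude, visible, timestamp"
                + " FROM nodes WHERE timestamp > ? ORDER BY timestamp");
        stmt.setTimestamp(1, lastRun);
        ResultSet rs = stmt.executeQuery();
        while (rs.next()) {
            // visible = false marks a delete; otherwise the row
            // represents a create or modify.
            boolean visible = rs.getBoolean("visible");
            // ... emit a change record to the downstream task ...
        }
        rs.close();
        stmt.close();
    }
}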

If we want to improve on this further (to make data even closer to 
real-time), a new task would need to be written that interacts more 
closely with the rails API (a custom log file written by the API would 
certainly be one way of doing this).
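
Such a task could be as simple as tailing an append-only log.  The 
file format below, one "type id action" record per line, is entirely 
invented for illustration; no such log exists today:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Sketch only: parse a hypothetical change log written by the API,
// e.g. a line of the form "node 123 modify".
public class ApiLogReader {
    public void readLog(String path) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(path));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] fields = line.split(" ");
            String type = fields[0];    // node, segment, or way
            long id = Long.parseLong(fields[1]);
            String action = fields[2];  // create, modify, or delete
            // ... fetch the full element and emit a change downstream ...
        }
        reader.close();
    }
}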
> That way you wouldn't have to ask the database for anything, and 
> unlike something based on DB replication or triggers, you would be 
> independent of the target system.
Agree, remaining independent of the target system is essential.

Rather than try to convince people with words that this is a good idea, 
I'm trying to prove it with some working code.  Unfortunately real work 
keeps interrupting my "play" time so it's taken me a bit longer than I 
would have liked :-)  If I can get a database change reader and database 
change writer working on the current schema, it will have reached a 
point where it has real possibilities.  At that point, I'll write up 
some wiki pages describing its usage and see if I can attract some interest.

The current code is available from:
https://www.bretth.com/repos/main/osmosis/trunk

If anybody wants it moved to the main osm repository, I'm more than 
happy to do so.




