[OSM-dev] Deriving Change Sets

Fri Jun 29 13:52:22 BST 2007

I'm hoping that somebody with some knowledge of the database schema may 
be able to provide some guidance.  Sorry for the long email, my 
questions are in the "Problem" section, the "Background" and 
"Constraints" section provide some additional info that explains what 
I'm trying to do.

*** Background ***
I'm trying to derive a change set within a time interval using the 
history stored in a mysql database.  The end purpose of this is to build 
a mechanism for regular updates to be fed from the main osm database to 
other data destinations without requiring complete planet dumps.  This 
is similar to a database replication mechanism but doesn't assume 
anything about the end destination.  The end destination may well be 
another mysql database, but it could be a pgsql database used for map 
generation, or it could be an existing planet file for the purposes of 
minimising download sizes.  Without assuming too much about the thoughts 
of others on this subject, this could provide the basis for an efficient 
mechanism for increasing the regularity of planet updates.  I'm aiming 
this to be capable of running very regularly with small time intervals, 
not just down to the day level but much less (hourly, perhaps minutes).  
But I'm getting way ahead of myself ...

*** Constraints ***
I'm hoping to derive these changes using a fixed number of queries to 
the database.  For example, if I could perform a single query to each of 
the nodes, segments, ways, way_segments and way_tags tables it will be 
far more efficient than performing a number of queries in proportion to 
the number of changes.  In addition, I want to restrict the data 
returned by timestamp (will require appropriate indexing) so that full 
table reads are avoided.
If this can be achieved, it will scale for both big and small change 
sets.  Large change sets will not result in excessive numbers of queries 
to the database.  Small change sets will not result in the entire 
database being read.  This allows the replication schedule to be set to 
any size.

*** Problem ***
For each of the data types in OSM (ie. nodes, segments and ways), I want 
to know the creates, the modifies, and the deletes within a specific 
time interval.

Nodes
If I just read the current_nodes table for records with timestamp values 
within my range, I will receive a list of nodes that have been updated 
in my time interval.  This will tell me which nodes have been deleted in 
that time (where visible = 0), but for all other records I can't tell 
the difference between a create and a modify.
If I read the nodes table for records with timestamp values within my 
range, I have the same problem because I still can't tell the difference 
between creates and modifies.
The only approach I can think of is as follows:
- Get the ids of all nodes that have been modified in the time interval 
using the current_nodes table.
- For *each* id in the list, read all of its history from the nodes 
table (ordered by timestamp) so that I can tell the difference between a 
create and modify.

Segments
The same problems for nodes apply to segments.

Ways
Ways have a version column on the ways table which solves the problems 
occurring for nodes and segments.  I can tell the difference between a 
create and a modify based on the fact that the version of a create 
record is always "1".
Obtaining the records from the related tables (way_segments and 
way_tags) is a little bit tricky but presumably I can get this data 
using left outer joins to those tables.

If it is not possible to do this without a version column on the nodes 
and segments tables, is there any reason (other than time and effort) 
for a version column not to be added?

Brett