[OSM-dev] Deriving Change Sets
Brett Henderson
brett at bretth.com
Fri Jun 29 13:52:22 BST 2007
I'm hoping that somebody with some knowledge of the database schema may
be able to provide some guidance. Sorry for the long email, my
questions are in the "Problem" section, the "Background" and
"Constraints" section provide some additional info that explains what
I'm trying to do.
*** Background ***
I'm trying to derive a change set within a time interval using the
history stored in a mysql database. The end purpose of this is to build
a mechanism for regular updates to be fed from the main osm database to
other data destinations without requiring complete planet dumps. This
is similar to a database replication mechanism but doesn't assume
anything about the end destination. The end destination may well be
another mysql database, but it could be a pgsql database used for map
generation, or it could be an existing planet file for the purposes of
minimising download sizes. Without assuming too much about the thoughts
of others on this subject, this could provide the basis for an efficient
mechanism for increasing the regularity of planet updates. I'm aiming
this to be capable of running very regularly with small time intervals,
not just down to the day level but much less (hourly, perhaps minutes).
But I'm getting way ahead of myself ...
*** Constraints ***
I'm hoping to derive these changes using a fixed number of queries to
the database. For example, if I could perform a single query to each of
the nodes, segments, ways, way_segments and way_tags tables it will be
far more efficient than performing a number of queries in proportion to
the number of changes. In addition, I want to restrict the data
returned by timestamp (will require appropriate indexing) so that full
table reads are avoided.
If this can be achieved, it will scale for both big and small change
sets. Large change sets will not result in excessive numbers of queries
to the database. Small change sets will not result in the entire
database being read. This allows the replication schedule to be set to
any size.
*** Problem ***
For each of the data types in OSM (ie. nodes, segments and ways), I want
to know the creates, the modifies, and the deletes within a specific
time interval.
Nodes
If I just read the current_nodes table for records with timestamp values
within my range, I will receive a list of nodes that have been updated
in my time interval. This will tell me which nodes have been deleted in
that time (where visible = 0), but for all other records I can't tell
the difference between a create and a modify.
If I read the nodes table for records with timestamp values within my
range, I have the same problem because I still can't tell the difference
between creates and modifies.
The only approach I can think of is as follows:
- Get the ids of all nodes that have been modified in the time interval
using the current_nodes table.
- For *each* id in the list, read all of its history from the nodes
table (ordered by timestamp) so that I can tell the difference between a
create and modify.
Segments
The same problems for nodes apply to segments.
Ways
Ways have a version column on the ways table which solves the problems
occurring for nodes and segments. I can tell the difference between a
create and a modify based on the fact that the version of a create
record is always "1".
Obtaining the records from the related tables (way_segments and
way_tags) is a little bit tricky but presumably I can get this data
using left outer joins to those tables.
If it is not possible to do this without a version column on the nodes
and segments tables, is there any reason (other than time and effort)
for a version column not to be added?
Brett
More information about the dev
mailing list