[OSM-dev] Reading OSM History dumps

Mon Aug 23 14:18:22 BST 2010

On Sat, Aug 21, 2010 at 11:40 AM, Peter Körner <osm-lists at mazdermind.de> wrote:
> Hi
>
> I during the last week I thought intensively about the new full history dump
> and how we could use it. I wrote some kind of paper and also some demo code
> to check how we could get osm history information into a postgis database
> with linestings and all this delicate features osmosis offers.
>
> I've put it on the wiki at
> <http://wiki.openstreetmap.org/wiki/User:MaZderMind/Reading_OSM_History_dumps>
>
> I'd love to hear some comments about it.

I have a few comments that might give you a different impementation
strategy that would work entirely in the database, generating geometry
data and supporting all the usual indexes. If it works, it'd be a
*lot* less custom coding and let the database deal with the heavy
lifting.

My idea is that we assume that a given snapshot of the database at a
given changeset number is unique. IE, given a changeset number, can we
reconstruct the exact contents of the database at that particular
version.? If so, then the changeset&id number becomes a primary key
for all entities.   We can identify each version of each
node/way/relation in normal form based on the id and changset it was
created/updated.

Ajunct each way, node, relation with two columns, a 'start_changeset"
and a 'end_changeset'. If end_changeset == MAX_INT the node/way is
current.   The primary key for a node becomes the pair
(id,start_changeset). The primary key for a way becomes the pair
(id,start_changeset). Tables like {node,way,relation}_tags, way_nodes
or relation_members also use (id, start_changeset) as their key.

Derived data like way.linestring, way.bbox, relation.bbox, which
change whenever the referenced nodes changes position, cannot be done
in normal form as part of the way or relation tables. If it is stored
into a different table, we can exploit the uniqueness of
(id,changeset) to store it uniquely.

create table way_bbox (
   id int not null;
   start_changeset int not null,
   end_changeset int not null,
   linestring .....
)

Fill this table by iterating over all changeset numbers.

   CREATE TEMP TABLE todo as select id from ways where start_changeset == %1;

This will be guaranteed to enumerate each way&changeset number ONCE,
not doing any redundant work.

    CREATE TEMP TABLE todo_1b as select todo.id, %1 as way_changeset,
nodes.start_changeset from nodes, ways, way_nodes, todo WHERE
(nodes.id = way_nodes.node_id and nodes.start_changeset between
way.start_changeset and way.end_changeset) and (ways.id =
way_nodes.way_id and ways.start_changeset = way_nodes.start_changeset)
and ways.id == todo.id and way.start_changeset = %1 group by todo.id,
nodes.start_changeset.

This joins the way_nodes and ways tables to get a list of all nodes
that in a way in the obvious way. We then join with the nodes table to
get all of the node version changes that occur across this one
changeset in the way. This gets us a set of distinct timestamps for
each way when the way's geometry changes.

Or, do it in one SQL:
    CREATE TEMP TABLE todo_2 as select way.id, way.start_changeset as
way_changeset, nodes.start_changeset from nodes, ways, way_nodes, todo
WHERE (nodes.id = way_nodes.node_id and nodes.start_changeset between
way.start_changeset and way.end_changeset) and (ways.id =
way_nodes.way_id and ways.start_changeset = way_nodes.start_changeset)
group by way.id, way_changeset, nodes.start_changeset.

This code is slightly broken as it doesn't generate end_changeset
numbers for the node interval ranges. I believe windowing functions or
manual fixups will be needed. Assume that that is done and the todo_2
table has start_changset and end_changeset numbers.

Now, given that table, create the distinct geometry for each.
    CREATE TABLE bbox as (select todo_2.id, todo_2.start_changeset,
todo_2.end_changeset, bbox_merge(nodes.location) as bbox from nodes,
todo_2, way_nodes, where (nodes.id = way_nodes.node_id and
nodes.start_changeset between todo_2.start_changeset and
todo_2.end_changeset) and (todo_2.id = way_nodes.way_id and
todo_2.way_changeset = way_nodes.start_changeset) and ways.id ==
todo.id group by todo.id, todo_2.start_changeset,
todo_2.end_changeset.

To handle the out-of-order changeset numbers from old API conversions,
either #1: Ignore the issue as it hopefully applies to old/historic
data, or define a  virtual_changeset_id distinct from
api_changeset_id.  api_changeset_id remembers the api changeset
included in the OSM database, and virtual_changeset_id is used above
in the schema. The mapping from one to another can be arbitrary, eg,
using timestamp information..

Scott