[OSM-dev] OWL + OSM Activity Server

Thu Oct 18 11:07:10 BST 2012

Hi Matt/Ian,

I'd like to continue this thread. Since yesterday I have been looking
into the OWL code (specifically the C++ and the database parts) and I
have some thoughts.

One major difference in our approach is how we process OSC files.

Correct me if I'm wrong, but OWL goes through an OSC file the same way
the XML parser (libxml TextReader, to be exact) does - top-down, element
by element. Every change is examined and saved to the database.

What I do instead (see [1]) is scan the whole OSC file to find all
distinct changesets and then process each changeset in turn. So in
effect I split the OSC file into smaller OSC files that each contain
the changes for only one changeset.
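
To make the idea concrete, here is a minimal sketch of the grouping
step. It is in Python rather than the Ruby of [1] and is not a port of
that code; the function name and the sample data are made up for
illustration. It only assumes the usual osmChange layout, where
<create>/<modify>/<delete> blocks contain elements carrying a
"changeset" attribute.

```python
import xml.etree.ElementTree as ET
from collections import OrderedDict


def split_osc_by_changeset(osc_xml):
    """Group the changes of an osmChange document by changeset id.

    Returns an OrderedDict mapping changeset id (as a string) to a list
    of (action, element) pairs, kept in document order.
    """
    root = ET.fromstring(osc_xml)
    changesets = OrderedDict()
    for action in root:          # <create>, <modify> or <delete>
        for element in action:   # <node>, <way> or <relation>
            changeset_id = element.get("changeset")
            changesets.setdefault(changeset_id, []).append(
                (action.tag, element))
    return changesets


# Made-up sample: two changesets interleaved in one OSC file.
SAMPLE = """<osmChange version="0.6">
  <create>
    <node id="1" version="1" changeset="100" lat="52.0" lon="21.0"/>
    <node id="2" version="1" changeset="101" lat="52.1" lon="21.1"/>
  </create>
  <modify>
    <node id="3" version="2" changeset="100" lat="52.2" lon="21.2"/>
  </modify>
</osmChange>"""

groups = split_osc_by_changeset(SAMPLE)
for changeset_id, changes in groups.items():
    print(changeset_id, [(a, e.get("id")) for a, e in changes])
```

Each group can then be handed to the next step as a unit. This sketch
loads the whole file into memory, which may not be acceptable for large
diffs.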

I need to do it this way because of the nature of an activity - right
now I consider one changeset to be one "mapping" activity. This means
that, for example, I need to feed the whole changeset at once to
Changemonger in order to get back a nice changeset description.

It also means that I need to have the database in the state just before
a given changeset is applied, in order to find the changeset's geometry
and to let Changemonger query the database and get the correct state as
well.

Next is scalability. I see that OWL has a quite sophisticated "tiling"
algorithm that distributes changes between database tables.

This looks scalable to me :-) One question would be how you get the
data out of this structure - I guess that a bounding box can be
transformed into a list of tiles, then into a list of tables, and then
you do one huge SQL UNION? I have not yet looked into the Rails part of
OWL to see how the data is used.
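
Just to illustrate what I mean by that guess, here is a rough Python
sketch. Everything in it is an assumption on my side: I use the
standard slippy-map tile formula, and the zoom level, the
"changes_<x>_<y>" table naming and the selected columns are
hypothetical. I have not checked how OWL actually maps tiles to tables.

```python
import math


def lonlat_to_tile(lon, lat, zoom):
    """Standard slippy-map tile (x, y) containing a lon/lat point."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so points on the edge of the world stay inside the grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tiles_for_bbox(min_lon, min_lat, max_lon, max_lat, zoom):
    """All tiles touched by a bounding box (tile y grows southwards)."""
    x_min, y_max = lonlat_to_tile(min_lon, min_lat, zoom)
    x_max, y_min = lonlat_to_tile(max_lon, max_lat, zoom)
    return [(x, y)
            for x in range(x_min, x_max + 1)
            for y in range(y_min, y_max + 1)]


def union_query(tiles):
    """One big UNION ALL over hypothetical per-tile tables."""
    selects = ["SELECT * FROM changes_%d_%d" % (x, y) for x, y in tiles]
    return "\nUNION ALL\n".join(selects)


tiles = tiles_for_bbox(-10.0, -10.0, 10.0, 10.0, zoom=1)
print(tiles)
print(union_query(tiles))
```

The number of tables in the UNION grows with the size of the bounding
box, which is the part I am curious about.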

In any case, I'm interested in your opinion on the following:

1. Have a table of changesets (or corresponding activities - does not
matter) with a geometry column that represents the changeset's geometry -
node coordinates, way linestrings, relations translated into geometry etc.

I guess at this point that would be about 15M rows, judging by where
changeset ids are nowadays.

2. Simply use PostGIS and the GIST index on the geometry column to query 
the thing. Perhaps first do a simple bounding box vs bounding box query 
to narrow down the suspects and then ST_Intersects.
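
A minimal sketch of what points 1 and 2 could look like, written as a
small Python helper that only builds the SQL. The table name
"changesets", the column names "id" and "geom" and the index name are
placeholders, SRID 4326 is an assumption, and the %s placeholders follow
the psycopg2 style. The && operator is the bounding box overlap test
that can use the GiST index; as far as I know ST_Intersects already
includes such a check itself, so the explicit && may be redundant.

```python
# Hypothetical schema: changesets(id bigint, geom geometry).
CREATE_INDEX_SQL = (
    "CREATE INDEX changesets_geom_idx "
    "ON changesets USING GIST (geom)"
)


def changesets_in_bbox_query(min_lon, min_lat, max_lon, max_lat, limit=100):
    """Return (sql, params) selecting changesets that intersect a bbox.

    Step 1 (geom && envelope) is the cheap bounding box vs bounding box
    filter; step 2 (ST_Intersects) is the exact geometry test.
    """
    sql = (
        "SELECT id FROM changesets "
        "WHERE geom && ST_MakeEnvelope(%s, %s, %s, %s, 4326) "
        "AND ST_Intersects(geom, ST_MakeEnvelope(%s, %s, %s, %s, 4326)) "
        "ORDER BY id DESC LIMIT %s"
    )
    bbox = (min_lon, min_lat, max_lon, max_lat)
    return sql, bbox + bbox + (limit,)


sql, params = changesets_in_bbox_query(20.9, 52.1, 21.1, 52.3, limit=50)
print(sql)
print(params)
```

The (sql, params) pair would be passed to a cursor's execute() call;
nothing here touches a database, so it says nothing about timing.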

So the question is - do you think this scales? Specifically, how long
would a typical query take? I plan to do some tests with artificial data
since I don't have the server resources to generate such a table from
real data (OSC files "since forever" and a planet database).

I'm asking this question because this is how I currently do things. So 
if the answer is "no" then it's clear that I need to go into OWL.

And finally a bunch of random questions:

1. How many changesets and changes do you currently have in your planet 
OWL instance?

2. Perhaps you answered that already, but just to clarify - do you
already have the code to respond to bounding box/tile queries (e.g.
"return all changes within X") on this new database structure with >4k
tables?

3. If yes to (2), how long does a query take?

[1] 
https://github.com/ppawel/osm-activity-publishers/blob/master/changeset-publisher/process_osc.rb#L139

Paweł