[OSM-dev] Osmosis - handling of linestring column

Wed Jan 14 22:30:06 GMT 2009

Lars Francke wrote:
> Hi,
>
> Osmosis used to create a bbox column for the ways table. As I didn't
> need this I replaced all its queries with queries to create and update
> a LINESTRING-column instead. This works flawlessly and as I understand
> this feature has since been added to Osmosis (I'm using 0.29.2). As
> this update seems(!) to be the main source for slowdowns when applying
> diffs I wondered what could be done about it.
>   
Yes, the feature has been added, but only for the 0.6 code.

There are 3 optional features in the 0.6 codebase: the way.bbox column, 
the way.linestring column and the action table (which contains all changes 
for the current import and is useful for updating custom downstream tables).
> The column needs to be updated when any of the nodes in a way gets updated.
> In my hack and the current NodeDAO.java[1] file the
> SQL_UPDATE_WAY_LINESTRING is executed once for each node that has been
> updated and once again if the way has been updated (WayDAO.java). So a
> way with 1000 nodes would be updated up to 1001 times. First question
> is: Is my assumption correct? If not please ignore the rest ;-)
>   
Yes, that is correct.
> If it is correct a solution depends on the OSM-format. If a node is
> updated does Osmosis produce a <modify>-element of all the related
> ways? If yes the solution would be to simply delete the
> SQL_UPDATE_WAY_LINESTRING from NodeDAO. If no we could compile a list
> of all ways that are "touched" by the node updates and after
> successfully importing everything update these ways. This wouldn't be
> the best solution memory-wise and perhaps not even speed up things at
> all because we'd have to find out which ways a node belongs to for
> each updated node. But this is all I could come up with.
>   
The osm format doesn't contain related elements; it only contains those 
elements that have changed.  Finding related elements would require 
additional queries while extracting data from the main database, which 
I've kept to a minimum.  In effect the load is placed on the client 
instead of the server, because scaling the central server is one of the 
primary objectives.  It has the additional benefit of keeping the 
changeset file sizes minimal.

The difficult bit is identifying which ways are impacted by a node 
change.  Currently the code is naive and just runs an update query on 
all ways related to the node being modified.  As you point out, this 
isn't ideal, so it might be possible to break this into two parts: 1. 
Identify impacted ways, 2. Update ways.  Step 1 will still need to be 
done per node (although we could query on several nodes within a single 
query), so the savings there might not be noticeable (performance seems 
to be most impacted by disk seeks, not database round trips).  Step 2, 
however, could be reduced significantly if each way is impacted by many 
nodes (i.e. we'd only update each way once rather than once per node).
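
As a rough illustration of that two-step split, here is a minimal sketch 
in Python.  It is not Osmosis code: the way_nodes table layout, the 
function names impacted_ways and update_ways, and the batch size are all 
assumptions made for the example, and an in-memory SQLite database stands 
in for the real one.  The update_linestring callback is a placeholder for 
whatever query rebuilds a single way's linestring.

```python
import sqlite3


def impacted_ways(conn, node_ids, batch_size=500):
    """Step 1: collect the distinct ways that reference any changed node.

    Several nodes are looked up per query to cut down on round trips.
    """
    way_ids = set()
    node_ids = list(node_ids)
    for start in range(0, len(node_ids), batch_size):
        batch = node_ids[start:start + batch_size]
        placeholders = ",".join("?" * len(batch))
        rows = conn.execute(
            "SELECT DISTINCT way_id FROM way_nodes "
            f"WHERE node_id IN ({placeholders})",
            batch,
        )
        way_ids.update(row[0] for row in rows)
    return way_ids


def update_ways(way_ids, update_linestring):
    """Step 2: rebuild each impacted way exactly once."""
    for way_id in sorted(way_ids):
        update_linestring(way_id)


# Hypothetical data: way 1 has 1000 nodes, way 2 shares node 0 with it.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE way_nodes (way_id INTEGER, node_id INTEGER, sequence_id INTEGER)"
)
conn.executemany(
    "INSERT INTO way_nodes VALUES (?, ?, ?)",
    [(1, n, n) for n in range(1000)] + [(2, 0, 0), (2, 5000, 1)],
)

changed_nodes = range(1000)      # every node of way 1 was modified
updated = []                     # records which ways were rebuilt
update_ways(impacted_ways(conn, changed_nodes), updated.append)
print(updated)
```

With this data each of the two ways is rebuilt once, even though 1000 
nodes changed.  The set of impacted way ids is the only thing held in 
memory, so the cost in RAM grows with the number of distinct ways touched 
by a diff rather than with the number of node changes.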

The current solution is the simplest, and I hoped it would be 
satisfactory.  If it's taking too long we'll have to find a better way.  
So long as the solution can fit in a "reasonable" amount of RAM I'm 
happy.  I doubt I'll be able to look at this myself soon, so feel free 
to experiment with improvements.
> I don't have access to a running database at the moment as I'm
> currently rebuilding the server so all this is not thoroughly checked.
> But any comments are greatly appreciated.
>   
Brett