[OSM-dev] Updating Nodes table to use a flag for 'current' data

Christopher Schmidt crschmidt at crschmidt.net
Sat Jun 3 02:19:27 BST 2006


I did some testing today to determine whether there is a faster way
than the current subselects to find the nodes which are to be displayed
on the map.

I used http://crschmidt.net/projects/openstreetmap/make-random-data.txt
to create 500,000 random nodes, and then ran it again, creating a table
with a total of 1 million rows, one set of history for each node.

I added an indexed tinyint column ('current') to the nodes table:
adding the index took 21 seconds, and altering 500,000 nodes to set
current=0 took 1 minute.
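For concreteness, the schema change was along these lines (the column
and index names are from my test setup, not an existing OSM
convention):

alter table nodes add column current tinyint not null default 1;
create index nodes_current on nodes (current);

The 500,000 historical rows were then demoted with a single update; the
predicate for picking the older copy of each node depends on the test
data (the migration query at the end of this mail shows the general
approach).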

Queries against the dataset were then run in the format that OSM uses:

select id, latitude, longitude, visible, tags from
  (select * from
    (select nodes.id, nodes.latitude, nodes.longitude, nodes.visible,
            nodes.tags
     from nodes, nodes as a
     where a.latitude > 4 and a.latitude < 5
       and a.longitude > 19 and a.longitude < 20
       and nodes.id = a.id
     order by nodes.timestamp desc) as b
   group by id) as c
where visible = true and latitude > 4 and latitude < 5
  and longitude > 19 and longitude < 20;


I then ran the query for a different area (to keep the results from
being served out of the query cache), using the flag:

select id, latitude, longitude, visible, tags from nodes
where visible = true and current = 1
  and latitude > 5 and latitude < 6
  and longitude > 20 and longitude < 21;

As I performed more queries, results came back more quickly as the
caches warmed up. Throughout the process, however, the flag method was
consistently about 3.5 times as fast as the OSM method:

OSM method: 76 rows in set (45.69 sec)
Flag method: 94 rows in set (12.20 sec)

OSM method: 88 rows in set (3.63 sec)
Flag method: 103 rows in set (1.54 sec)

OSM method: 83 rows in set (0.35 sec)
Flag method: 97 rows in set (0.11 sec)

And, for selecting 5 degree by 5 degree areas, the difference was about
the same:

OSM method: 2537 rows in set (2.42 sec)
Flag method: 2498 rows in set (0.79 sec)

For a 1 degree area, while loading data at several thousand rows per
second via the make-random-data script:

OSM method: 440 rows in set (6.78 sec)
Flag method: 386 rows in set (1.60 sec)
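
(For reference, those speedups work out to 45.69/12.20 = 3.7,
3.63/1.54 = 2.4, 0.35/0.11 = 3.2, 2.42/0.79 = 3.1, and 6.78/1.60 = 4.2,
to one decimal place.)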


Assuming historical data that is, on average, as dense as the current
data, adding a flag would probably offer a speedup of a factor of 3 in
selects.

While updates to the database are taking place, reads still proceed. As
I said earlier, updating a 500,000-node dataset took about 60 seconds,
which works out to an insignificant amount of database time for any
single update.

Migrating from the current data to the flagged data would be a
significant cost: I was unable to come up with a way to do it that did
not cause MySQL to lock up in some way.  (I'm running a dev version of
MySQL 5.0, so this may be a problem with my MySQL.) However, this is a
one-time cost, and I'm sure it could be worked around by people who
have more experience with databases than I do.

Selecting a count of the current (non-historical) nodes while inserting
new ones at a rate of several thousand per second takes around 0.2
seconds.
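
For concreteness, that count is a query along the lines of:

select count(*) from nodes where current = 1;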

I'm assuming there are a significant number of edits to the nodes.
Assuming that OSM continues to grow and to maintain historical data as
it does now, becoming more and more like Wikipedia, the flag method
will show increasing benefits.

Queries other than the select in getnodes (and other relevant
functions) would not be affected; however, createnode would need to
update the nodes table to set current=0 on the existing rows for the
given node id before inserting the new row with current=1.
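
A sketch of that extra statement (the literal id stands in for whatever
node id createnode is given):

update nodes set current = 0 where id = 12345;
-- then insert the new version of node 12345 with current = 1 as usual

Assuming id is indexed, this update only touches that one node's rows.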

The times are reproducible in relative terms: all runs fall within a
range of about a 3 to 4 times difference in the time required. I
expect, but cannot be sure, that more historical data will continue to
negatively impact the queries, so this factor of 3, measured on data
with only a 1:1 ratio of historical to current rows, will most likely
continue to 'improve' with the flagged method as more historical data
becomes available.

Note that this test is not going to accurately reflect performance on
the live server. I believe that within a 1 degree area around London,
OSM would have significantly more than 400 nodes ... and in the middle
of the Atlantic, where my test data is, there are probably far fewer :)

I'd be happy to write a patch to the getnodes and update/create node
functions, as well as describe what I've done to the MySQL schema, if
someone would like me to, or to do any other work necessary to help
migrate to this change, although I would need help on how to update the
existing data. A query like:

update nodes set current = 1 where (id, timestamp) in
  (select id, timestamp from
    (select nodes.id, nodes.timestamp
     from nodes, nodes as a
     where a.id = nodes.id
     order by nodes.timestamp desc) as b
   group by id);

Ought to do it, but doesn't seem to in my MySQL, so this would need to
be tested before migrating the live data. (Honestly, I'd recommend
performing this change on a copy of the database during some form of
maintenance period, then copying that data back into OSM.)

Hopefully this is a helpful analysis. Please let me know if there's
anything else I can do to convince you that things can be improved by
using this method :)

-- 
Christopher Schmidt
Web Developer



