[OSM-talk] OGL, OSM, NASA

Nick Hill nick at nickhill.co.uk
Tue Apr 25 18:37:38 BST 2006


Raphael Jacquot wrote:

> I'm not sure that database size on disk is of any interest (after all, 
> 250G disks go for about 80€). My *experience* with postgres on a laptop 
> (that is, with a 4200 rpm laptop drive) has been pretty good (running my 
> navsys.py navigation / capture system).
> Also, my *experience* with it and a standard seagate ATA 100 drive on a 
> 40 wire cable (because that's what's inside the box) is pretty good too.

The data size on disc makes a huge difference.

The issue isn't the amount of disc space it takes, but what proportion 
of the data - in particular the parts of the index that will actually 
be used - can be held in memory for the look-up. The bigger the data on 
disc, the less likely the point you need will already be in the memory 
cache. (The Linux kernel uses all otherwise unused memory for disc 
caching and evicts pages on a least-recently-used basis.)

It also makes a big difference how randomly distributed the data is in 
the index.

In practical terms, I can perform queries returning 25,600 GPS points 
from a 100 million point field on a 1GB machine, using integer 
coordinates, in 0.2 seconds (single thread on a 2.6GHz AMD Hammer). 
That is a query on a random point. Please see osm-dev for my test program.
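
To give an idea of what I mean by the integer case, here is a rough 
sketch, not the actual test program (that is on osm-dev). The table and 
column names are made up; it just assumes GPS points stored as scaled 
integers (microdegrees) with a B-tree index on (lat, lon):

# Rough sketch only - not the real test program (see osm-dev for that).
# Assumes a table gps_points with integer lat/lon in microdegrees and a
# B-tree index on (lat, lon); all names here are illustrative.
import random
import time

import psycopg2

conn = psycopg2.connect("dbname=osm")          # assumed local database
cur = conn.cursor()

# One-off setup, for reference:
# CREATE INDEX gps_points_lat_lon ON gps_points (lat, lon);

lat = random.randint(-90000000, 90000000)      # random point, microdegrees
lon = random.randint(-180000000, 180000000)
half = 50000                                   # 0.05 degree half-width

start = time.time()
cur.execute(
    "SELECT lat, lon FROM gps_points"
    " WHERE lat BETWEEN %s AND %s AND lon BETWEEN %s AND %s",
    (lat - half, lat + half, lon - half, lon + half))
rows = cur.fetchall()
print("%d points in %.3f seconds" % (len(rows), time.time() - start))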

Using the much larger geometric point type, the same query on the same 
hardware, with (only) a 10 million GPS point field, initially takes 9 
seconds, which drops to 2 or 3 seconds as parts of the index are loaded 
into memory.

For the geometric type, if I constrain the random queries to a small 
area of the 10 million point field, they start at 2 seconds and then 
drop to 0.041 seconds.
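
The geometric case looks roughly like the sketch below (again the 
schema is illustrative, not my real one): a point column with a GiST 
index, queried with a box. Running the same query twice shows the 
warm-cache effect described above - the second run is faster because 
the index pages it touched are now cached.

# Illustrative only: point column "pos" with a GiST index, queried with
# a bounding box.  Table, column and index names are assumptions.
import time

import psycopg2

conn = psycopg2.connect("dbname=osm")
cur = conn.cursor()

# One-off setup, for reference:
# CREATE INDEX gps_points_pos_gist ON gps_points USING gist (pos);

bbox = "((5.6,45.1),(5.8,45.3))"      # lon/lat box, e.g. the Grenoble area

for attempt in (1, 2):                # second run hits a warm cache
    start = time.time()
    cur.execute("SELECT count(*) FROM gps_points WHERE pos <@ %s::box",
                (bbox,))
    count = cur.fetchone()[0]
    print("run %d: %d points in %.3f seconds"
          % (attempt, count, time.time() - start))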

I have not been able to set up a practical demonstration using the 
geometric point data type on a 100 million GPS point field, because 
generating the index takes disproportionately longer as the data set 
grows. I have successfully performed the tests on a 100 million point 
field using integers.

The issue also is not the number of conductors, or even the transfer 
rate between the disc and the motherboard; that makes no difference at 
all. The issue is latency: how long it takes for the CPU to get the 
piece of data it needs to decide where the next point is. If the data 
is in memory, the latency is of the order of tens of nanoseconds. If it 
is on the hard drive, it is of the order of milliseconds.
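
A back-of-envelope illustration - the per-access figures below are 
round assumptions, not measurements, but they show why a query whose 
index pages are on disc takes seconds while the cached case takes 
effectively no time:

# Back-of-envelope only; the figures are rough assumptions.
pages_touched = 30            # index/heap pages one query might visit
ram_access = 100e-9           # ~100 nanoseconds if the page is cached
disk_seek = 8e-3              # ~8 milliseconds per random read

print("all pages cached:  %.6f s" % (pages_touched * ram_access))
print("all pages on disc: %.3f s" % (pages_touched * disk_seek))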




> 
> mebbe if your mysql was using innodb or something it would be faster, 
> but there's no way this is gonna work with myisam (which is after all a 
> flat file a la d-base 3)
> 
> here are the data sizes:
> 
> on the laptop:
> 75845 gps points
> 11389 nodes       \
> 12048 segments    / only the grenoble area
> 30M are used
> 
> on the main server
> (no gps points - yet)
> 296806 nodes     \
> 295899 segments     / a large part of the UK and France
> 320M are used
> 
> that's including all the indexes
> 

This size of data set will easily fit into the memory of a modest 
machine. If the R-tree representation fits into memory, it is fast. If 
it doesn't, it is monstrously slow.

How will the database perform when we get to 10 million or 100 million 
points?
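
A crude estimate of why this matters - the 80 bytes per indexed point 
is only a guess at the index overhead, but the conclusion is not very 
sensitive to it:

# Crude scaling estimate; 80 bytes per indexed point is a guess.
bytes_per_entry = 80
ram = 1024 ** 3               # the 1GB machine from the tests above

# roughly today's node count, then 10 million, then 100 million points
for points in (300000, 10 * 1000 * 1000, 100 * 1000 * 1000):
    size = points * bytes_per_entry
    verdict = "fits in RAM" if size < ram else "does NOT fit in RAM"
    print("%11d points -> ~%5d MB of index, %s"
          % (points, size // 2 ** 20, verdict))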





