[OSM-talk] OGL, OSM, NASA
Nick Hill
nick at nickhill.co.uk
Tue Apr 25 18:37:38 BST 2006
Raphael Jacquot wrote:
> I'm not sure that database size on disk is of any interest (after all,
> 250G disks go for about 80€). My *experience* with postgres on a laptop
> (that is, with a 4200 rpm laptop drive) has been pretty good (running my
> navsys.py navigation / capture system).
> Also, my *experience* with it and a standard seagate ATA 100 drive on a
> 40 wire cable (because that's what's inside the box) is pretty good too.
The data size on disc makes a huge difference.
The issue isn't the amount of disc space it takes up, but what
proportion of the data - and in particular the parts of the index that
will be used - is in memory for the look-up. The bigger the data on
disc, the less likely it is that the point you need will be in the
memory cache. (The Linux kernel uses all otherwise unused memory for
disc caching and evicts pages on a least-recently-used basis.)
It also makes a big difference how randomly distributed the data is in
the index.
In practical terms, using integers I can perform queries returning
25,600 GPS points from a 100 million GPS point field on a 1GB machine
in 0.2 seconds (single thread on a 2.6GHz AMD Hammer). That is a query
on a random point. Please see osm-dev for my test program.
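(For anyone who wants to reproduce something like this without digging
through osm-dev: a rough sketch of the kind of query I mean is below.
It assumes PostgreSQL via psycopg2, a gps_points(lat, lon) table with
lat/lon stored as degrees * 10^7 and a btree index on (lat, lon); those
names and the scaling are illustrative, not necessarily what my test
program does.)

    import random, time
    import psycopg2

    # Assumed schema: gps_points(lat integer, lon integer), lat/lon held
    # as degrees * 10^7, with a btree index on (lat, lon).
    conn = psycopg2.connect("dbname=osm")        # assumed database name
    cur = conn.cursor()

    # Pick a random centre point and query a small box around it.
    lat = random.randint(-900000000, 900000000)
    lon = random.randint(-1800000000, 1800000000)
    half = 100000                                # ~0.01 degrees each side

    t0 = time.time()
    cur.execute("SELECT lat, lon FROM gps_points"
                " WHERE lat BETWEEN %s AND %s"
                "   AND lon BETWEEN %s AND %s",
                (lat - half, lat + half, lon - half, lon + half))
    rows = cur.fetchall()
    print("%d points in %.3f seconds" % (len(rows), time.time() - t0))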
Using the much larger geometric point type, the same query on the same
hardware, with (only) a 10 million GPS point field, initially takes 9
seconds, dropping to 2 or 3 seconds as parts of the index are loaded
into memory.
For the geometric type, if I constrain the queries to random points
within a small area of the 10 million point field, they start at 2
seconds and then drop to 0.041 seconds.
I have not been able to set up a practical demonstration using the
geometric point data type for a 100 million GPS point field, because
generating the index takes disproportionately longer as the data set
grows. I have successfully performed tests with a 100 million point
field using integers.
The issue also is not the number of conductors in the cable, or even
the transfer rate between the disc and the motherboard; that makes no
difference at all. The issue is latency: how long it takes the CPU to
get the piece of data it needs to decide where the next point is. If
the data is in memory, the latency is on the order of tens of
nanoseconds. If it is on the hard drive, it is on the order of
milliseconds.
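To put rough numbers on that (both figures below are typical ball-park
values, not measurements from my test machine):

    # Back-of-envelope: why a cold index walk is so much slower than a
    # warm one.  ~8ms per random disc seek vs ~100ns per random read
    # from main memory (both assumed, order-of-magnitude figures).
    seek_time = 8e-3
    ram_time = 100e-9
    lookups = 1000        # say the query touches ~1000 index/heap pages

    print("from disc: %.1f s" % (lookups * seek_time))    # about 8 s
    print("from RAM:  %.4f s" % (lookups * ram_time))     # about 0.0001 s

A thousand scattered reads cost next to nothing once the pages are
cached, but several seconds when every one of them is a head movement -
which is exactly the cold-versus-warm difference in the timings above.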
>
> mebbe if your mysql was using innodb or something it would be faster,
> but there's no way this is gonna work with myisam (which is after all a
> flat file a la d-base 3)
>
> here are the data sizes:
>
> on the laptop:
> 75845 gps points
> 11389 nodes \
> 12048 segments / only the grenoble area
> 30M are used
>
> on the main server
> (no gps points - yet)
> 296806 nodes \
> 295899 segments / a large part of the UK and France
> 320M are used
>
> that's including all the indexes
>
A data set of this size will easily fit into the memory of a modest
machine. If the R-tree representation fits into memory, it is fast. If
it doesn't, it is monstrously slow.
How will the database perform when we get to 10 million or 100 million
points?
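One way to get a feel for when that happens is to scale Raphael's own
figures linearly - 320M for roughly 600,000 nodes and segments including
indexes. That is a crude assumption (GPS points are leaner rows than
nodes and segments, and index growth isn't perfectly linear), but it
shows the order of magnitude:

    # Back-of-envelope: scale the quoted 320MB / ~600,000 objects linearly
    # and check the result against a 1GB machine.
    objects_now = 296806 + 295899
    bytes_now = 320 * 1024 ** 2
    per_object = bytes_now / float(objects_now)      # roughly 570 bytes each

    ram = 1024 ** 3                                   # 1GB of memory
    for n in (objects_now, 10 * 10 ** 6, 100 * 10 ** 6):
        size = n * per_object
        verdict = "fits in 1GB" if size < ram else "does not fit in 1GB"
        print("%9d objects -> %6.0f MB, %s" % (n, size / 1024.0 ** 2, verdict))

On those (rough) numbers the data plus indexes outgrow a 1GB machine
well before the 10 million mark - the point at which an R-tree that no
longer fits in memory becomes monstrously slow.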