[OSM-dev] Further Database optimisation and data size

Nick Hill nick at nickhill.co.uk
Thu Apr 20 20:50:35 BST 2006


From these tests, using integers for the key columns makes database retrieval much faster.
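
For reference, here is a minimal sketch of the representation the test script at the end of this mail implies: co-ordinates in degrees are stored as signed integers scaled by 10^7, so -180 to +180 maps onto roughly -1800000000 to +1800000000. The helper names (deg_to_int, int_to_deg) are purely illustrative and are not part of the benchmark script.

#!/usr/bin/perl -w
use strict;

# Scale factor implied by the benchmark below: 180 degrees == 1800000000.
my $SCALE = 10_000_000;

# Illustrative helpers: convert between floating-point degrees and the
# scaled signed-integer representation used for the key columns.
sub deg_to_int { my ($deg) = @_; return int($deg * $SCALE); }
sub int_to_deg { my ($int) = @_; return $int / $SCALE; }

printf "51.25 deg -> %d\n", deg_to_int(51.25);        # 512500000
printf "512500000 -> %.7f deg\n", int_to_deg(512500000);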

Storing the data completely in memory can be helpful, but is not necessarily decisive. Previous tests have shown that tripling the available memory from 340MB to 980MB makes around a 10% difference for 100 million points stored as doubles. That is on a 1.7GB data set.

How cheaply the queries can be performed, in terms of processing power, is perhaps of greatest importance. In addition, I plan to run an optimisation which will put most groups of tiles within four hard-drive seeks of the data for any table, without changing any SQL logic (I have not yet finished tests on this); a rough illustration of the idea follows.
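
As a rough illustration only (not necessarily the final form of that optimisation), MyISAM lets you physically re-order a table so that rows with nearby (lat,lon) values end up close together on disk, which cuts the number of seeks a tile query needs without touching any application SQL. The connection details below are simply the ones from the test script at the end of this mail.

#!/usr/bin/perl -w
use strict;
use DBI;

my $dbh = DBI->connect("DBI:mysql:database=nickh;host=localhost",
                       "nickh", "xxxxxx", { RaiseError => 1 });

# Physically sort the MyISAM data file by (lat,lon) so rows that fall in
# the same tile are stored in neighbouring blocks. The schema and all
# existing queries are unchanged.
$dbh->do('ALTER TABLE integer_test ORDER BY lat, lon');

# Optional: defragment the data file and rebuild the index afterwards.
$dbh->do('OPTIMIZE TABLE integer_test');

$dbh->disconnect;

Sorting purely by (lat,lon) only approximates tile clustering; grouping by a tile number or a space-filling curve would cluster more tightly, but this shows the principle.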

Also, my tests have shown that the size of the data set being queried has a massive impact on query performance. Please see my post of 10/04/06 01:33 and the results below.

I have re-run the 10/4/06 tests with the co-ordinates as integers. The 
results are as follows (the environment is exactly the same as the tests 
of 10/4/06 01:33):

Creating a plan with 100000 points and averaging over 25 queries
Points_per_tile Query_Time
25600           0.118
25600           0.119
25600           0.119
25600           0.119
12800           0.069
6400            0.042
3200            0.026
1600            0.017
800             0.011
400             0.008
200             0.005
100             0.004
50              0.003
Creating a plan with 1000000 points and averaging over 25 queries
Points_per_tile Query_Time
25600           0.224
25600           0.223
25600           0.222
25600           0.223
12800           0.145
6400            0.093
3200            0.062
1600            0.043
800             0.029
400             0.020
200             0.015
100             0.011
50              0.008
Creating a plan with 10000000 points and averaging over 25 queries
Points_per_tile Query_Time
25600           0.558
25600           0.548
25600           0.551
25600           0.551
12800           0.376
6400            0.257
3200            0.181
1600            0.125
800             0.087
400             0.062
200             0.044
100             0.031
Creating a plan with 100000000 points and averaging over 25 queries
Points_per_tile Query_Time
25600           2.422
25600           2.332
25600           2.493
25600           2.446
12800           1.769
6400            1.295
3200            0.866
1600            0.657
800             0.456
400             0.328
200             0.233
100             0.159
50              0.118

Conclusion:
For data sets of 10M points and less, using integers for the key columns 
improves database query speed by a factor of between 2 and 3. For 100M 
points, query speed is improved by 55%. Tripling available memory from 
340MB to 980MB improved query times by 10-20% in the earlier 
double-precision tests; with integer columns, for data sets of 100M 
points and smaller, the memory size (340MB/980MB) makes no difference.

Tests performed on a 32-bit AMD Sempron (athlon-xp type), running Debian 
sarge with X+KDE.




Source code:
#!/usr/bin/perl -w

#Program creates random point fields equivalent to bitfieldtest.pl, except that
#the data is stored as regular signed integers (degrees scaled by 10^7).
#To represent the globe as closely as possible, extents between -180 and just
#under +180 degrees are used. Adding 180 would therefore normalise the range so
#the international date line is 0 and the Prime Meridian is 180.


use DBI;
use Time::HiRes qw( usleep ualarm gettimeofday tv_interval );

$DBHOST = "localhost";
$DBNAME = "nickh";
$DBUSER = "nickh";
$DBPASS = "xxxxxx";

#initialise database
$driver = "mysql";
$dsn = "DBI:$driver:database=$DBNAME;host=$DBHOST";
$dbh = DBI->connect($dsn, $DBUSER, $DBPASS);

#@plane_densities=(100000000);
@plane_densities=(100000,1000000,10000000,100000000);
@tile_points=(25600,25600,25600,25600,12800,6400,3200,1600,800,400,200,100,50);
$query_iterations=25;
$debug=0;

sub create_bitfield;
sub run_tests;

foreach $density(@plane_densities){
print "Creating a plan with $density points and averaging over $query_iterations queries\nPoints_per_tile Query_Time\n";
create_bitfield($density);
	foreach $tilepoints(@tile_points){
	my $testtime=run_tests($density,$tilepoints);
	printf '%-14d  %.3f',$tilepoints,$testtime;
	print "\n";
	}
}
	


$dbh->disconnect;
exit 0;


sub create_bitfield{
#Takes the number of points in the point field as argument.
#We first create the table without an index, as the index slows populating the table.
my $prep=q~
DROP TABLE IF EXISTS integer_test;

CREATE TABLE `integer_test` (
   `lat` integer NOT NULL default '0',
   `lon` integer NOT NULL default '0'
   ) TYPE=MyISAM
~;

#drop/create tables without index
foreach $aprepare(split(/;/,$prep)) {
#print "preparing $preparation";
	$dbh->do($aprepare);
	}


#populate table
for ($i=0;$i<$_[0];$i+=100){
#create 100 element batched inserts (value,value),(value,value) etc
my $insert='';
    for ($j=0;$j<100;$j++){
    if ($j>0) {$insert .= ','; }
    $lat=int(rand 3599999999)-1800000000;
    $lon=int(rand 3599999999)-1800000000;
    $insert .= "($lat,$lon)";
    }

my $sql="INSERT INTO `integer_test` VALUES $insert;";
#print "SQL1 is $sql1\n\n";
$dbh->do($sql);

}
#After populating table, we create indexes.
#print "Creating index... This may take some time\n";
my $sql='CREATE INDEX index1 ON integer_test (lat,lon);';
$dbh->do($sql);
}


sub run_tests{
#Parameters: Points in field; Size of tile in average number of points
my $number_of_points=$_[0];
my $target_query_return=$_[1];
my $proportion_of_each_axis=1/(sqrt($number_of_points/$target_query_return));
my $max_extent=1-$proportion_of_each_axis;  #Maximum extent for query without extending beyond bounds
my $returnedrows=0;
$query1total=0;

for($i=0;$i<$query_iterations;$i++){
#$lat1=(0.4+(rand ($max_extent/10)));
#$lon1=(0.4+(rand ($max_extent/10)));

$lat1=int(rand ($max_extent*3599999999))-1800000000;
$lon1=int(rand ($max_extent*3599999999))-1800000000;

$lat2=int($lat1+($proportion_of_each_axis*3599999999));
$lon2=int($lon1+($proportion_of_each_axis*3599999999));

if ($debug){print "querying bounds $lat1 $lon1 $lat2 $lon2 with queries 
\n";}

$query1="SELECT lat,lon from integer_test WHERE lat>$lat1 and lat<$lat2 
and lon>$lon1 and lon<$lon2";


#print "Query 1: $query1\n";
$t0 = Time::HiRes::time();
my $sth = $dbh->prepare($query1);
$sth->execute();
$returnedrows+=$sth->rows();
#$sth->finish;
$elapsed=Time::HiRes::time()-$t0;
#print "Fetched $rows points in $elapsed \n";
$query1total+=$elapsed;


}
#print $returnedrows;
if ($debug){print "Each of the $query_iterations returned an average " . 
$returnedrows/$query_iterations . "records\n";}
#print "unit test time is " . ($query1total/$query_iterations) . 
"seconds \n\n";
return ($query1total/$query_iterations);
}



