[OSM-dev] Analysis of query execution time based on table size and records returned

Nick Hill nick at nickhill.co.uk
Mon Apr 10 01:33:40 BST 2006


I have written a script which generates a point-field emulating a table 
of GPX points. The script then runs a series of queries emulating 
different tile sizes.

This data should help determine optimal tile sizes, and help predict our 
database requirements as the GPX and other tables grow.
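
For reference, the tile sizes work as follows: with N points spread uniformly over the unit square, a tile expected to contain k points must cover a fraction k/N of the plane, i.e. a square of side sqrt(k/N). A minimal stand-alone illustration of the calculation the script performs (the numbers are just examples):

#!/usr/bin/perl -w
# Side length of a square tile expected to hold $k of $n points
# uniformly distributed over the unit square.
use strict;

my $n = 1_000_000;          # points in the table
my $k = 400;                # target points per tile
my $side = sqrt($k / $n);   # fraction of each axis, here 0.02
printf "tile side: %.4f (%d tiles per axis)\n", $side, 1 / $side;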

OUTPUT DATA:
+++++++++++++++++++++++
Creating a plane with 100000 points and averaging over 25 queries
Points_per_tile Query_Time(s)
25600           0.323
25600           0.313
25600           0.313
25600           0.312
12800           0.174
6400            0.100
3200            0.059
1600            0.036
800             0.023
400             0.015
200             0.011
100             0.007
50              0.005
Creating a plane with 1000000 points and averaging over 25 queries
Points_per_tile Query_Time(s)
25600           0.500
25600           0.499
25600           0.501
25600           0.501
12800           0.307
6400            0.195
3200            0.126
1600            0.083
800             0.057
400             0.038
200             0.027
100             0.019
50              0.014
Creating a plane with 10000000 points and averaging over 25 queries
Points_per_tile Query_Time(s)
25600           1.230
25600           1.392
25600           1.365
25600           1.219
12800           0.820
6400            0.580
3200            0.367
1600            0.267
800             0.199
400             0.148
200             0.101
100             0.087
50              0.051
Creating a plane with 100000000 points and averaging over 25 queries
Points_per_tile Query_Time(s)
25600           3.830
25600           3.822
25600           3.729
25600           3.794
12800           2.655
6400            1.794
3200            1.321
1600            0.943
800             0.657
400             0.464
200             0.340
100             0.258
50              0.191

+++++++++++++++++++++++++++++++++


As can be seen:

1) As the table gets bigger, all queries take longer, and smaller 
queries become disproportionately inefficient.

	The 100-point query takes 36 times longer on the 10^8-point table 
than on the 10^5-point table (0.258s vs 0.007s).

	The 25600-point query takes about 12 times longer on the 10^8-point 
table than on the 10^5-point table (roughly 3.79s vs 0.315s, averaged).

It may be worth considering a multi-table database, with each table 
covering a small area, to keep individual table sizes down.
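
As a rough sketch of what that might look like, assume a hypothetical 
10x10 grid of tables over the unit square, named int_test_<x>_<y> 
(neither the grid nor the naming exists today; this is purely 
illustrative):

#!/usr/bin/perl -w
# Hypothetical sketch: find the per-cell tables covering a bounding
# box, assuming one table per 0.1 x 0.1 cell of the unit square.
use strict;

my $cells = 10;   # assumed grid resolution

sub tables_for_bbox {
    my ($lat1, $lon1, $lat2, $lon2) = @_;
    my @tables;
    for my $x (int($lat1 * $cells) .. int($lat2 * $cells)) {
        for my $y (int($lon1 * $cells) .. int($lon2 * $cells)) {
            push @tables, "int_test_${x}_${y}";
        }
    }
    return @tables;
}

# A tile-sized query would normally touch just one small table:
print join(", ", tables_for_bbox(0.31, 0.55, 0.33, 0.57)), "\n";

With 10^8 points, each cell table would hold around 10^6 rows, a size 
at which the small queries above are still cheap.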


2) It is always more efficient to fetch the data for a complete area 
in one query than to fetch the same area tile by tile.

To see this, compare running 16 small (400-point) queries against one 
large (6400-point) query covering the same area, at each database size:

10^5 points
16 x 0.015s = 0.240s
1 x 0.100s = 0.100s
Ratio: 1:2.4

10^6 points
16 x 0.038s = 0.608s
1 x 0.195s = 0.195s
Ratio: 1:3.1

10^7 points
16 x 0.148s = 2.368s
1 x 0.580s = 0.580s
Ratio: 1:4.1

10^8 points
16 x 0.464s = 7.424s
1 x 1.794s = 1.794s
Ratio: 1:4.1
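
The practical implication for clients: fetch the whole area in one 
bounding-box query and split it into tiles afterwards. A sketch 
against the int_test table used by the script below ($dbh is the DBI 
handle from that script; the 4x4 grid and coordinates are arbitrary 
examples matching the 16-tile comparison):

# One large query, then client-side binning into a 4x4 tile grid,
# instead of 16 separate tile queries.
my ($qlat, $qlon, $qside) = (0.2, 0.2, 0.08);   # example area of interest
my $rows = $dbh->selectall_arrayref(
    "SELECT lat,lon FROM int_test WHERE lat>? AND lat<? AND lon>? AND lon<?",
    undef, $qlat, $qlat + $qside, $qlon, $qlon + $qside);

my %tile;   # "x,y" => rows falling in that sub-tile
for my $r (@$rows) {
    my $tx = int(($r->[0] - $qlat) / ($qside / 4));
    my $ty = int(($r->[1] - $qlon) / ($qside / 4));
    push @{ $tile{"$tx,$ty"} }, $r;
}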


As the number of points in the database increases, both the number of 
rows returned by each query and the table size increase. These two 
effects multiply, compounding the load on the database server.

For example, doubling the number of GPX points in the database:
1) increases query execution time by about 40%, because each query 
matches more points, and
2) increases execution time by a further 14% or so, because the points 
must be extracted from a bigger data set,
giving a compound load increase of about 60% (1.40 x 1.14 = 1.60).

As a rule of thumb, then, assuming usage patterns remain flat (an 
unsafe assumption), every doubling of the data set will increase 
database load by about 60%.

As the data set grows, the usefulness of OSM increases, which in turn 
increases the potential load on the server. This usage factor could at 
times be a square or cube function of the data-set size. Assuming a 
square function, a doubling of database size could result in a load 
increase of 6.4x (2^2 x 1.6 = 6.4).
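
To make the compounding concrete, here is a small projection under 
exactly those assumptions (a 1.6x per-query cost increase per 
doubling, with usage either flat or growing as the square of the 
data-set size):

#!/usr/bin/perl -w
# Relative server load after $d doublings of the data set, assuming
# per-query cost grows 1.6x per doubling and usage is either flat or
# proportional to the square of the data-set size.
use strict;

for my $d (0 .. 4) {
    my $per_query = 1.6 ** $d;
    my $flat      = $per_query;                    # flat usage
    my $square    = (2 ** $d) ** 2 * $per_query;   # usage ~ size^2
    printf "%d doublings: flat %.1fx, square-usage %.1fx\n",
        $d, $flat, $square;
}

One doubling gives the 1.6x and 6.4x figures quoted above.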





Perl script for the test:

#!/usr/bin/perl -w
# Creates random point fields of a given density over extents 0..1 in
# both x and y - in other words, a known number of points at random
# locations on a 2D plane.
# For each point field, runs a set of queries over areas calculated to
# contain a given average number of points, and evaluates the time
# taken to run each query.

use DBI;
use Time::HiRes qw( usleep ualarm gettimeofday tv_interval );

$DBHOST = "localhost";
$DBNAME = "nickh";
$DBUSER = "nickh";
$DBPASS = "xxxxxx";

#initialise database
$driver = "mysql";
$dsn = "DBI:$driver:database=$DBNAME;host=$DBHOST";
$dbh = DBI->connect($dsn, $DBUSER, $DBPASS);

@plane_densities=(100000,1000000,10000000,100000000);
@tile_points=(25600,25600,12800,6400,3200,1600,800,400,200,100,50);
$query_iterations=25;
$debug=0;

sub create_bitfield;
sub run_tests;

foreach $density (@plane_densities) {
	print "Creating a plane with $density points and averaging over $query_iterations queries\nPoints_per_tile Query_Time(s)\n";
	create_bitfield($density);
	foreach $tilepoints (@tile_points) {
		my $testtime = run_tests($density, $tilepoints);
		printf "%-14d  %.3f\n", $tilepoints, $testtime;
	}
}


$dbh->disconnect;
exit 0;


sub create_bitfield{
# Takes the number of points in the point field as its argument.
# The table is created without an index, since an index slows down
# populating the table; the index is added after the inserts.
my $prep=q~
DROP TABLE IF EXISTS int_test;

CREATE TABLE `int_test` (
   `lat` double NOT NULL default '0',
   `lon` double NOT NULL default '0'
   ) TYPE=MyISAM
~;

# Drop/create the table without an index. DBI executes one statement
# at a time, so split the batch on semicolons.
foreach $aprepare (split(/;/, $prep)) {
	$dbh->do($aprepare);
}


# Populate the table with 100-row batched inserts:
# (value,value),(value,value), etc.
for ($i = 0; $i < $_[0]; $i += 100) {
	my $insert = '';
	for ($j = 0; $j < 100; $j++) {
		if ($j > 0) { $insert .= ','; }
		$lat = rand 0.999999999;
		$lon = rand 0.999999999;
		$insert .= "($lat,$lon)";
	}
	my $sql = "INSERT INTO `int_test` VALUES $insert;";
	$dbh->do($sql);
}
# After populating the table, create the index. This may take some time.
my $sql = 'CREATE INDEX index1 ON int_test (lat,lon);';
$dbh->do($sql);
}


sub run_tests{
# Parameters: points in the field; tile size as an average number of
# points per tile.
my $number_of_points = $_[0];
my $target_query_return = $_[1];
my $proportion_of_each_axis = 1/(sqrt($number_of_points/$target_query_return));
my $max_extent = 1 - $proportion_of_each_axis;  # maximum query origin that stays within bounds
my $returnedrows = 0;  # initialised to avoid an undef warning under -w
$query1total = 0;

for($i=0;$i<$query_iterations;$i++){
$lat1=rand $max_extent;
$lon1=rand $max_extent;

$lat2=$lat1+$proportion_of_each_axis;
$lon2=$lon1+$proportion_of_each_axis;

if ($debug){ print "querying bounds $lat1 $lon1 $lat2 $lon2\n"; }

$query1 = "SELECT lat,lon FROM int_test WHERE lat>$lat1 AND lat<$lat2 AND lon>$lon1 AND lon<$lon2";


#print "Query 1: $query1\n";
$t0 = Time::HiRes::time();
my $sth = $dbh->prepare($query1);
$sth->execute();
$returnedrows+=$sth->rows();
#$sth->finish;
$elapsed=Time::HiRes::time()-$t0;
#print "Fetched $rows points in $elapsed \n";
$query1total+=$elapsed;


}
#print $returnedrows;
if ($debug){print "Each of the $query_iterations returned an average " . 
$returnedrows/$query_iterations . "records\n";}
#print "unit test time is " . ($query1total/$query_iterations) . 
"seconds \n\n";
return ($query1total/$query_iterations);
}








