[OSM-talk] Bounding box
Raphaël Jacquot
sxpert at sxpert.org
Mon Feb 5 21:44:04 GMT 2007
Nick Hill wrote:
> Hello Raphael
>
> Insane queries are those whose very high demand don't stack up to the
> close to zero value of the result.
ok, so you define insane queries as ones that need to return a
significant part of the available data for no good reason. point taken.
in that case, such a query can still have a significant impact on other
queries that attempt to run in parallel, and may lead to locking.
>> there is no need to use partitionning or other 'hackish' schemes if
>> the indexing engine features the proper index generation algorithms
>> (such as R-tree based indexes)
>
> I don't consider partitioning hackish.
partitioning the *tables* is hackish. that is, you modify the
fundamental structure of the table to allow extracting data on
*portions* of the data set, based on a set of rules that are arbitrarily
defined (such as cutting the tables into a grid of sorts).
such partitioning will, on average, require checking one, two, or four
partitions, depending on where the request's bounding box falls
relative to the grid.
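to make the one/two/four-partitions point concrete, here is a small sketch (the grid cell size and the coordinates are made up for illustration, not anything OSM actually uses):

```python
import math

CELL = 1.0  # grid cell size in degrees (arbitrary, for illustration only)

def cells_for_bbox(min_lon, min_lat, max_lon, max_lat):
    """Return the set of (col, row) grid cells the bounding box overlaps."""
    cols = range(int(math.floor(min_lon / CELL)), int(math.floor(max_lon / CELL)) + 1)
    rows = range(int(math.floor(min_lat / CELL)), int(math.floor(max_lat / CELL)) + 1)
    return {(c, r) for c in cols for r in rows}

# a small bbox inside one cell -> 1 partition to check
print(len(cells_for_bbox(0.1, 0.1, 0.2, 0.2)))  # 1
# a bbox crossing a vertical grid line -> 2 partitions
print(len(cells_for_bbox(0.9, 0.1, 1.1, 0.2)))  # 2
# a bbox crossing a grid corner -> 4 partitions
print(len(cells_for_bbox(0.9, 0.9, 1.1, 1.1)))  # 4
```

any query whose box straddles a grid line pays for every cell it touches, which is exactly the overhead the index-based approach avoids.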
>
>> Of course, you'll then tell me that the index takes a lot of space or
>> something. well, that's what computing is all about, compromises.
>> that is, either your data is very compressed and needs a lot of
>> computing power to sift through, or, you sacrifice storage space to
>> have a fast to go through data structure, and your data is fast to
>> access. you can't have both.
>
> I agree to a small part, but mostly disagree. I agree that our indexing
> scheme is sub-optimal. Your assumption that the bbox and node limit
> requirement is substantially as a result of the indexing scheme is
> incorrect. I also strongly disagree with your hidden assumption that an
> OGC R-tree index as commonly implemented is anywhere near optimal. I
> also disagree with the implication that the space inefficiency of the
> OGC indexing scheme somehow makes the index perform better than any
> other theoretical model or that a different scheme is somehow compressed.
>
now, that's a lot of unsubstantiated disagreement :D
I did *not* refer to OGC at any point in the above discussion, but to
R-trees in general.
an R-tree is a data structure that, similarly to a b-tree, indexes a
set of matryoshka-like nested boxes in a way that balances the amount
of data to sift through in a single box. so, for instance, you'd have a
box for the center of London, and another one for, say, the entirety of
Cornwall, each storing the same amount of data.
the boxes work so that you can drill down the index while comparing
against the bounding box of the request. at some point, you'll find
that the request is perfectly enclosed within a particular box (or a
set of boxes), and the request will then go through the set of points
inside that particular box and find the appropriate ones.
R-trees are not specific to OGC in any way. they are just a particular
type of index that allows fast searching of items within n-D data.
they are, for instance, used to search through OLAP results in
financial applications (in which case they are nR-trees, as they can
have n dimensions). they can also be used to search a calendar-like
application (days on one dimension, hours on another, and perhaps
separate calendars on a third).
the box limit requirement is a function of the time it takes for
a) the database to return the resultset
b) the web server to format that resultset to xml
c) some overhead dependent on whatever scripting language is used
thing is, we only observe the sum (a + b + c), and we are trying to
make conjectures about why that sum is so large without looking at each
component independently.
the database server probably has a way to explain (and probably to
time) the query made to gather the data for a particular area.
so... we need numbers for a and b (and then we can derive c)
if there are timing functions in the language used, then it should be a
simple matter of doing
time1= current_time()
request_data
time2= current_time()
format_data
time3= current_time()
and then output (time2 - time1) and (time3 - time2)
do the above for multiple data sets (for instance, a large part of a
city such as London, some countryside area, ...)
show us some hard numbers, and we may understand the problems better.