[OSM-talk] Bounding box

Nick Hill nick at nickhill.co.uk
Tue Feb 6 00:01:06 GMT 2007



Raphaël Jacquot wrote:
> Nick Hill wrote:
>>> there is no need to use partitionning or other 'hackish' schemes if 
>>> the indexing engine features the proper index generation algorithms 
>>> (such as R-tree based indexes)
>>
>> I don't consider partitioning hackish.
> 
> partitionning the *tables* is hackish. that is, you modify the 
> fundamental structure of the table to allow extracting data on 
> *portions* of the data set, based on a set of rules that are arbitrarily 
> defined (such as cutting the tables in a grid of sorts).

Mysql implements partitioning which is transparent to the queries or API.

> 
> said partitionning will lead to, in average, require checking one, two, 
> or four datasets, depending on how the request's bounding box is designed.

Unless the partitioning is extremely fine-grained, few bounding boxes will cross 
the border between partitions.

> 
>>
>>> Of course, you'll then tell me that the index takes a lot of space or 
>>> something. well, that's what computing is all about, compromises.
>>> that is, either your data is very compressed and needs a lot of 
>>> computing power to sift through, or, you sacrifice storage space to 
>>> have a fast to go through data structure, and your data is fast to 
>>> access. you can't have both.
>>
>> I agree to a small part, but mostly disagree. I agree that our 
>> indexing scheme is sub-optimal. Your assumption that the bbox and node 
>> limit requirement is substantially as a result of the indexing scheme 
>> is incorrect. I also strongly disagree with your hidden assumption 
>> that an OGC R-tree index as commonly implemented is anywhere near 
>> optimal. I also disagree with the implication that the space 
>> inefficiency of the OGC indexing scheme somehow makes the index 
>> perform better than any other theoretical model or that a different 
>> scheme is somehow compressed.
>>
> 
> now, that's a lot of unsubstantiated disagreement :D
> 
> I did *not* refered to OGC at any point in the above discussion, but to 
> R-trees in general.

I agree you didn't mention OGC, but given you didn't specifically tell me what 
you *were* referring to, and given you alluded to the nature of an inflated size 
  index, I think most people would forgive me for concluding that is what is in 
your mind.

> 
> an R-tree is a data structure that indexes, similarly to a b-tree, a set 
> of matrioshka-like boxes in a way that allows balancing the amount of 
> data to sift through in a single box. so, for instance, you'd have a box 
> for the center of london, and another one for, say the entirety of 
> cornwall that would store the same amount of data.
> the boxes work so that you can drill down the index while comparing the 
> bounding box of the request. at some point, you'll find that the request 
> is perfectly enclosed withing a particular box (or a set of various 
> boxes) and the request will then go though the set of points inside that 
> particular box and find the appropriate ones.

Yes

> 
> R-trees are not specific to OGC in any way. they are just a particular 
> type of index that allow fast search of items within n-D data.
> they are for instance used to search through OLAP results in financial 
> applications (in which case they are nR-trees, as they can have n 
> dimensions), they can also be used to search in a calendar like 
> application (so you have days on one dimension, and hours on the other, 
> and perhaps separate calendars on a third dimension)
> 
> the box limit requirement is a function of the time it takes for
> a) the database to return the resultset
> b) the time it takes for the web server to format that resultset to xml
> c) some value dependant on whatever scripting language used

z)You can also add to that things like: Tendency for a query to flush other used 
data from a cache

y) I/O taken to retrieve perhaps hundreds of millions of data points, sort them, 
find duplicates, return whatever page

If we used a more efficient indexing scheme, the database may perhaps more 
easily determine the implication of the request from bounding box information.

To perform tests, get the mysql schema from SVN. Load it into mysql. Get the 
planet dump then use planet2mysql to generate the OSM dataset in mysql.

Download the API from SVN. Point the API at the database. Run queries. Make 
changes to the database. Find some fabulous new way of increasing the speed with 
unlimited bounding boxes and no server slowdown. give us all the figures. 
Convince everyone. I'll buy you a pint.






More information about the talk mailing list