[OSM-talk] (no subject)

Wed Apr 26 17:03:02 BST 2006

Hi,

I wrote three R-Tree indexes for PostGIS.  All of them were based on the
examples that Oleg and Teodor (www.sai.msu.su/~megera/postgres/gist)
did for the postgresql internal types.  Oleg and Teodor wrote/maintain
the GiST index for Postgresql - and Rtree index can be implemented with
a GiST index.

The original (postgis about 0.2) used the Quadratic-Time node-splitting
algorithm.  The actual 'datum' stored in the index was the SRID
(int32), and 4 double-precision numbers representing the bounding box.

PostGIS 0.3-0.9 used the same datum, but used the Linear-Time
node-splitting algorithm.

PostGIS "light weight geometry" (1.0+) uses the Linear-Time
node-splitting algorithm, but instead of using doubles to represent the
bounding box, single-precision numbers are used.  The float32 bbox is
expanded so its the smallest bbox that contains the double-precision
bounding box.  SRID is no longer stored in the index.

Using float32s reduced the actual data in the index by 50%, but only
actually reduced the size of the index by about 30% because there's
overhead associated with each entry.  But, it has two advantages (a)
the entire index is smaller, so its more likely to be in the disk cache
and (b) each node contains more individual 'datum' values, meaning that
the height of the tree is reduced - any particular search is likely to
require less actual disk reads.  The height of the tree is proportional
to log-to-base-N where N=number of bboxes that fit in a disk page.

R-trees are similar to "normal" B-trees, but you search multiple paths
in the tree simultaneously.  At the bottom of the tree (leaf nodes),
each of the bboxes will "point to" a row in the main table.  All the
bboxes in node are unioned together and that is stored in the node
above.

  A
 / \
B   C

In this simple example, there are 3 nodes in the tree A,B,C.  B and C
are leaf nodes and will contain a bunch of boxes (say about 300), each
linking to a single row in the main table.  A is an internal node and,
in this example, only contains two bboxes, one being the union of all
the bboxes in B and the other being the union of all the bboxes in C.

So, a bbox index search is very simple:
search through all the bboxes in A (an internal node), if a bbox
intersects the search bbox then we'll descend into that sub-tree.  If
the node is a leaf, then an overlapping bbox means we can add the row
to the result set.

The main difference between a Btree and an Rtree is that you traverse
multiple paths at the same time.  If your search bbox is large, you may
have to search both the B sub-tree AND the C sub-tree.

Each node is a disk page in size (usually 8k).

Hopefully that explains how a Rtree is organized.

>1) Do geometric R-tree type indexes need 2, 4 or 8 indexed values per
point?

I'm not sure what this means.

>2) Do geometric R-tree indexes grow in a non-linear fashion with number
of indexed rows?

Size of the index (in disk-pages) is proportional to N*Log-base-K(N)
where N=number of rows and K=number of bboxes that fit in a node (disk
page).

>3) Does the R-tree indexing time for new points necessarily grow in a
non-linear fashion with number of indexed rows?

Experimentally, doubling the size of a postgis table means that it takes
twice as long to index.  Theoretically it should be much worse, but
because of the disk cache, its much better than expected.
Insertion will typically not affect more than Log(N) pages, but its
possibly worse.  Most of the time it will fit in a leaf node and only
require one or two pages to be written.  Worst case you can cause a
major re-shuffle of the index.

>4) If the R-tree can't fit in memory, is there an R-tree implementation
which can substantially cut down the number of disc seeks?

Most DB's will not load very much of the index into memory - thats the
job of the disk cache.  For highly auto-correlated queries for a small
area (ie. panning around on a map) the index will be almost certainly
in memory.  If you are doing random queries (ie. 100s of people asking
for different locations), then its likely that the top few layers of
the index will be in memory, but each query may generate page faults
and require loading in pages from the "bottom" of the index.

Typically, if your geometries are "large" (ie. lots of points), then
your index will be highly efficient. If your geometries are
single-points (and no other columns in the table), then your index size
will be about the same size as your table size.

dave

----------------------------------------------------------
This mail sent through IMP: https://webmail.limegroup.com/