[OSM-dev] on-disk indexing of geodata

Fri Oct 17 16:20:52 BST 2008

On Friday 17 October 2008, Sascha Silbe wrote:
> On Fri, Oct 17, 2008 at 03:25:59PM +0200, Freek wrote:
> > Generally you keep a bit of free space in all nodes to accommodate a
> > number of insertions and deletions, and only split or merge nodes when
> > they overflow or underflow so that you don't need to move around all
> > kinds of elements all the time.
>
> But how do you do those splits, i.e. node inserts? Just moving
> everything behind the insertion point by one block would be way too
> expensive for large trees, so there's probably a better method.

Just pick a new block somewhere else where you have free space (for example at 
the end of your mmap()'ed file), then split the pointers to the child nodes 
evenly over the two blocks (old and new one) so they are about half-full, and 
finally add a pointer to the new block in the parent node (again splitting if 
necessary, etc.).

> [...] I mmap() the whole database. For europe it's less than twice the
> size of my physical memory (5.3GB vs. 3.5GB), so most of it fits into
> memory. For whole planet imports it gets much slower (5.6h instead of
> 2.5h if it were scaling linearly).

Perhaps that is because you can still do most of the calculations in memory 
for the smaller data sets so you are hit harder, relatively, by I/O 
performance for the larger data sets.

> R-Tree generation scales fine (still fits into memory),
> 1D index sorting is O(2n) 

O(2n)? I guess you are using big O's for a different purpose then I do 
normally ;-)

> and way processing 
> (including MBR calculation) is O(5n). That's probably due to the nodes
> being sorted by id in the database so spatial locality is reduced.
> Quicksort (used for sorting 1D indices) also doesn't use locality for
> the outer iterations, but isn't a bottleneck yet.

Ok, otherwise you may want to look at mergesort or an external-memory variant 
thereof.

-- 
Freek