[OSM-dev] Binary OSM db (was: Re: on-disk indexing of geodata)

Fri Oct 17 21:46:39 BST 2008

On Friday 17 October 2008, Sascha Silbe wrote:
> On Fri, Oct 17, 2008 at 07:47:14PM +0200, Freek wrote:
> > Looks like we have enough ideas for adding indexes and such to an OSM
> > binary format, so if the rest of the structure is fixed I think we can
> > get it off the ground (I wouldn't mind at least helping out here and
> > there).
>
> OK, so what do "we" want in a binary DB? Currently there is (in my
> implementation):

"we" referring to anyone wanting to get involved :-)

> Node/way/relation internal <-> external id:
>     - int -> ext O(1)
>     - ext -> int O(log n)

So what kind of indexes do you have, hash tables and balanced binary search 
trees (what about external memory)? Anyway, when do you use such an index? I 
guess one would want to know the OSM id of objects returned by certain 
queries, but not specifically a translation from internal to external id 
(though of course you probably get that for free if you have an index on 
internal id).
Second case: when would you want to know the internal id of a given OSM 
object? Probably we need this if we want to implement updates...

> Node location (2x 32bit int, 64bit Peano):
>     - id -> location O(1)
>     - location -> int O(log n)

Just a bit of a technical thing (not very important for the design at this 
stage), but with "Peano" I suppose you mean the order that looks like a lot 
of nested Z's? Historically this is not a curve Peano came up with (though 
many people call it /a/ or even /the/ Peano curve). Actually, he proposed a 
different one that does not have jumps and hence has better locality 
properties. A disadvantage of that one may be that an implementation would be 
slightly more complex, but such a discussion can better be deferred.

> Node/way/relation types (using tag -> id mapping with wildcard support
> during import):
>     - id -> types ~O(1)
>     - type -> ids ~O(log n)

Type? Can I, for example, retrieve all post boxes using this index?

> Node/way/relation "interesting" bitmap (name/ref/is_in set and/or known
> type):
>     - id -> interesting O(1)

Important design decision, but more for the data format itself: does it store 
all nodes/ways/relations, also the "uninteresting" ones?

> Node/way/relation tags (k/v-pairs are \n-separated and properly
> escaped):
>     - id -> all tags O(1)

Guess we would want something like that, yes.

> Way MBR / RTree:
>     - id -> MBR O(1)
>     - MBR -> ids ~O(log n)

This is an index for retrieving all objects within a specified query 
rectangle, right? (So basically what we've been talking about ;-)

> Way members (ordered sequence of node ids):
>     - way id -> member ids ~O(1)
>     - member id -> way ids ~O(1)
>
> Relation members (ordered sequence of id+type+role):
>     - rel id -> member ids ~O(1)
>     - member id -> rel ids ~O(log n)

Will the data format have all these kinds of information (tags, members, etc.) 
in different places (on disk), or will they all be in the same place? In the 
latter case it may be logical to retrieve all this information with one 
query.

> Node/way/relation name/ref/is_in:
>     - text files sorted by id (intended for namefinder - could be
> changed into "proper" database entries similar to the way tags are
> handled)

I think name finding is a topic on its own. Also, a good look at the current 
online OSM name finder and Gosmore may be a good idea.

> Actually, there probably isn't a one-fits-all solution for the indices,
> it depends too much on the application and its environment. So each
> index should be optional for the high-level API (makes it more complex,
> though).

Indeed, a number of separate indexes for different purposes seems natural. An 
index for efficiently retrieving all data elements in a certain region may be 
a good place to start.

-- 
Freek