[OSM-dev] Binary OSM db (was: Re: on-disk indexing of geodata)

Sun Oct 19 12:42:39 BST 2008

On Sunday 19 October 2008, Sascha Silbe wrote:
> On Fri, Oct 17, 2008 at 10:46:39PM +0200, Freek wrote:
> >> Node/way/relation internal <-> external id:
> >>      - int -> ext O(1)
> >>      - ext -> int O(log n)
> >
> > So what kind of indexes do you have, hash tables and balanced binary
> > search trees (what about external memory)?
>
> Much simpler, as more complex index structures also use more memory
> (thus risking disk access penalty). Sorted lists, scanned with binary
> search (=> O(log n)).

Ok, but what about the O(1)?

Regarding index structures, I think you can reduce disk access penalty by 
using slightly more complex index structures (in this case B-trees for 
example). The O(log_2 n) is the number of disk operations you have to 
perform. Using a B-tree you can get this down to O(log_B n), where B is the 
number of elements per node. If you use your "implicit pointer" idea, you 
don't even have the extra storage penalty :-)

> > Anyway, when do you use such an index? I guess one would want to know
> > the OSM id of objects returned by certain queries, but not
> > specifically a translation from internal to external id (though of
> > course you probably get that for free if you have an index on internal
> > id).
>
> Usually the ID mappings are used on in- and output, respectively. The
> program core will work purely with internal IDs.

So you transparently rewrite internal IDs to external ones and vice versa on 
in- and output?
What is the advantage of having separate internal IDs? (I mean, there's 
probably a good reason, I just don't know...)

> > [...] Can I, for example, retrieve all post boxes using this index?
>
> Yes, you can. If you add an appropriate entry to the rules file prior to
> running the importer, that is. Excerpt from my current rules file:
>
>      {'tags': {'amenity': 'post_box'},             'maxScaleDisplay':
> 10000, 'maxScaleName':    5000, 'icon': 'public/post_box'},
>
> This file is preprocessed with a Python script, producing a file that's
> easy to parse in C++ (but much less readable).
> It's also used by my moving map viewer (with basic navigation support),
> that's why there are rendering styles in it, too.

Ah, nice. By the way, is all this still in (early) development, or is there 
already a description/website/demo online?

> >> Node/way/relation "interesting" bitmap (name/ref/is_in set and/or
> >> known
> >> type):
> >>      - id -> interesting O(1)
> >
> > Important design decision, but more for the data format itself: does
> > it store all nodes/ways/relations, also the "uninteresting" ones?
>
> Yes, it does - otherwise this bitmap wouldn't be necessary.

Ok. Do you plan to make your current format into a kind of standard for 
binary, indexed OSM data or do you want to go with, say, Marcus' plans (for 
which I don't clearly see if he wants it to be lossy or lossless)?

> Already had a good look on the current namefinder - I'm quite impressed
> by its performance, given that it's based on MySQL+PHP.
> Didn't know that Gosmore has a geolocation feature, though.

I remembered that it did. The feature list on the wiki says:
"Incremental search of all tags. Results are ordered from nearest to 
farthest."

-- 
Freek