[OSM-dev] Binary OSM db (was: Re: on-disk indexing of geodata)

Sun Oct 19 13:43:12 BST 2008

On Sun, Oct 19, 2008 at 12:42 PM, Freek <freek_osm at vanwal.nl> wrote:
> On Sunday 19 October 2008, Sascha Silbe wrote:
>> On Fri, Oct 17, 2008 at 10:46:39PM +0200, Freek wrote:
>> >> Node/way/relation internal <-> external id:
>> >>      - int -> ext O(1)
>> >>      - ext -> int O(log n)
>> >
>> > So what kind of indexes do you have, hash tables and balanced binary
>> > search trees (what about external memory)?
>>
>> Much simpler, as more complex index structures also use more memory
>> (thus risking disk access penalty). Sorted lists, scanned with binary
>> search (=> O(log n)).
>
> Ok, but what about the O(1)?

as i understand it (having lurked on this discussion) the internal to
external mapping is done by mmap()ed array lookup.

> Regarding index structures, I think you can reduce disk access penalty by
> using slightly more complex index structures (in this case B-trees for
> example). The O(log_2 n) is the number of disk operations you have to
> perform. Using a B-tree you can get this down to O(log_B n), where B is the
> number of elements per node. If you use your "implicit pointer" idea, you
> don't even have the extra storage penalty :-)

in this case i think a trie (or a van emde boas tree) would be the
most efficient, but at the cost of implementation complexity. if this
is to become a common format, its definitely worth keeping it simple
:-)

>> > Anyway, when do you use such an index? I guess one would want to know
>> > the OSM id of objects returned by certain queries, but not
>> > specifically a translation from internal to external id (though of
>> > course you probably get that for free if you have an index on internal
>> > id).
>>
>> Usually the ID mappings are used on in- and output, respectively. The
>> program core will work purely with internal IDs.
>
> So you transparently rewrite internal IDs to external ones and vice versa on
> in- and output?
> What is the advantage of having separate internal IDs? (I mean, there's
> probably a good reason, I just don't know...)
>
>> > [...] Can I, for example, retrieve all post boxes using this index?
>>
>> Yes, you can. If you add an appropriate entry to the rules file prior to
>> running the importer, that is. Excerpt from my current rules file:
>>
>>      {'tags': {'amenity': 'post_box'},             'maxScaleDisplay':
>> 10000, 'maxScaleName':    5000, 'icon': 'public/post_box'},
>>
>> This file is preprocessed with a Python script, producing a file that's
>> easy to parse in C++ (but much less readable).
>> It's also used by my moving map viewer (with basic navigation support),
>> that's why there are rendering styles in it, too.
>
> Ah, nice. By the way, is all this still in (early) development, or is there
> already a description/website/demo online?
>
>> >> Node/way/relation "interesting" bitmap (name/ref/is_in set and/or
>> >> known
>> >> type):
>> >>      - id -> interesting O(1)
>> >
>> > Important design decision, but more for the data format itself: does
>> > it store all nodes/ways/relations, also the "uninteresting" ones?
>>
>> Yes, it does - otherwise this bitmap wouldn't be necessary.
>
> Ok. Do you plan to make your current format into a kind of standard for
> binary, indexed OSM data or do you want to go with, say, Marcus' plans (for
> which I don't clearly see if he wants it to be lossy or lossless)?
>
>> Already had a good look on the current namefinder - I'm quite impressed
>> by its performance, given that it's based on MySQL+PHP.
>> Didn't know that Gosmore has a geolocation feature, though.
>
> I remembered that it did. The feature list on the wiki says:
> "Incremental search of all tags. Results are ordered from nearest to
> farthest."
>
> --
> Freek
>
> _______________________________________________
> dev mailing list
> dev at openstreetmap.org
> http://lists.openstreetmap.org/listinfo/dev
>