[OSM-dev] Binary OSM db (was: Re: on-disk indexing of geodata)

Sat Oct 18 23:41:28 BST 2008

On Fri, Oct 17, 2008 at 10:46:39PM +0200, Freek wrote:

>> Node/way/relation internal <-> external id:
>>      - int -> ext O(1)
>>      - ext -> int O(log n)
> So what kind of indexes do you have, hash tables and balanced binary 
> search trees (what about external memory)?
Much simpler, as more complex index structures also use more memory 
(thus risking disk access penalty). Sorted lists, scanned with binary 
search (=> O(log n)).

> Anyway, when do you use such an index? I guess one would want to know 
> the OSM id of objects returned by certain queries, but not 
> specifically a translation from internal to external id (though of 
> course you probably get that for free if you have an index on internal 
> id).
Usually the ID mappings are used on in- and output, respectively. The 
program core will work purely with internal IDs.

> Just a bit of a technical thing (not very important for the design at 
> this stage), but with "Peano" I suppose you mean the order that looks 
> like a lot of nested Z's? Historically this is not a curve Peano came 
> up with (though many people call it /a/ or even /the/ Peano curve).
Good to know. I used the nomenclature from the already-mentioned paper 
[1] by Faloutsos and Roseman.

> Actually, he proposed a different one that does not have jumps and 
> hence has better locality properties. A disadvantage of that one may 
> be that an implementation would be slightly more complex, but such a 
> discussion can better be deferred.
Using a different space-filling curve is definitely something on the 
unwritten part of the TODO list. I mainly used Peano because it was easy 
to implement. It's also quite fast, but only the bulk importer would 
have to invoke the function often, ordinary programs shouldn't be 
affected significantly.

>> Node/way/relation types (using tag -> id mapping with wildcard 
>> support
>> during import):
>>      - id -> types ~O(1)
>>      - type -> ids ~O(log n)
> Type? Can I, for example, retrieve all post boxes using this index?
Yes, you can. If you add an appropriate entry to the rules file prior to 
running the importer, that is. Excerpt from my current rules file:

     {'tags': {'amenity': 'post_box'},             'maxScaleDisplay':   
10000, 'maxScaleName':    5000, 'icon': 'public/post_box'},

This file is preprocessed with a Python script, producing a file that's 
easy to parse in C++ (but much less readable).
It's also used by my moving map viewer (with basic navigation support), 
that's why there are rendering styles in it, too.

>> Node/way/relation "interesting" bitmap (name/ref/is_in set and/or 
>> known
>> type):
>>      - id -> interesting O(1)
> Important design decision, but more for the data format itself: does 
> it store all nodes/ways/relations, also the "uninteresting" ones?
Yes, it does - otherwise this bitmap wouldn't be necessary.

>> Way MBR / RTree:
>>      - id -> MBR O(1)
>>      - MBR -> ids ~O(log n)
> This is an index for retrieving all objects within a specified query 
> rectangle, right? (So basically what we've been talking about ;-)
Exactly. :)

> Will the data format have all these kinds of information (tags, 
> members, etc.) in different places (on disk), or will they all be in 
> the same place?
Everything is in a different file. That can reduce performance (due to 
increased number of seeks), but enables different applications (using 
different subsets of the data) to (efficiently) share the same database.
This also accounts for using different subsets of the data in different 
stages of the program: ID on in/output, node location etc. in "core" 
operations.

>> Node/way/relation name/ref/is_in:
>>      - text files sorted by id (intended for namefinder - could be
>> changed into "proper" database entries similar to the way tags are
>> handled)
> I think name finding is a topic on its own.
At least partially, yes.

> Also, a good look at the current online OSM name finder and Gosmore 
> may be a good idea.
Already had a good look on the current namefinder - I'm quite impressed 
by its performance, given that it's based on MySQL+PHP.
Didn't know that Gosmore has a geolocation feature, though.

[1] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.45.9043

CU Sascha

-- 
http://sascha.silbe.org/
http://www.infra-silbe.de/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 481 bytes
Desc: Digital signature
URL: <http://lists.openstreetmap.org/pipermail/dev/attachments/20081019/7e780689/attachment.pgp>