[OSM-dev] Update on osmosis dataset support

Brett Henderson brett at bretth.com
Sat Jan 5 23:49:23 GMT 2008


This is just an FYI for those who are interested.

The osmosis pipeline code in svn now supports the concept of a dataset 
which gives downstream tasks random read access to osm data.  Currently 
you can access individual osm nodes/ways/relations by their identifier, 
or read entire sections of the planet by bounding box.  In effect, it 
provides a way for tasks to directly access a database without forcing 
them to consume data in a streaming fashion.
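
To make that concrete, a read interface along the following lines is 
roughly what a downstream task gets hold of.  This is only a sketch; the 
method names and the Node/Way/Relation/EntityContainer types stand in 
for the real osmosis classes rather than matching their exact signatures.

    import java.util.Iterator;

    // Sketch only: Node, Way, Relation and EntityContainer stand in for
    // the osmosis entity classes.
    public interface DatasetReader {

        // Random access to individual entities by identifier.
        Node getNode(long id);
        Way getWay(long id);
        Relation getRelation(long id);

        // Every entity whose location falls inside the given bounding box.
        Iterator<EntityContainer> iterateBoundingBox(
                double left, double right, double top, double bottom);
    }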

*But* the missing piece is a working dataset database implementation: an 
on-disk planet representation that allows these queries to be performed.

I was planning to create a simple read-only customdb implementation that 
builds an on-disk database using simple data files and indexes.  I've 
almost completed this but I don't think it will be very practical.  It's 
horrendously slow (at least 12 hours for an import) and it breaks during 
an internal sort because too many file handles are open.  The file 
handle problem should be fixable (it's either a file handle leak or a 
poorly tuned file-based merge sort implementation) but it isn't easy to 
debug given the size of the dataset.  After some benchmarking I think I 
can speed up the import, but the problem will only get worse as the 
planet grows, so I'm now investigating an alternative approach.
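
For what it's worth, the sort of tuning I have in mind for the merge 
sort is to merge the pre-sorted chunk files in passes, so that only a 
bounded number of files is ever open at once.  The sketch below is 
illustrative only (it merges plain text lines rather than serialised 
entities, and assumes at least one chunk); it isn't the osmosis code.

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    // Sketch only: external merge of pre-sorted chunk files, performed in
    // passes so that no more than MAX_OPEN_FILES files are open at once.
    public class ChunkMerger {

        private static final int MAX_OPEN_FILES = 64;

        public static File mergeAll(List<File> sortedChunks) throws IOException {
            List<File> current = new ArrayList<File>(sortedChunks);
            while (current.size() > 1) {
                List<File> next = new ArrayList<File>();
                for (int i = 0; i < current.size(); i += MAX_OPEN_FILES) {
                    int end = Math.min(i + MAX_OPEN_FILES, current.size());
                    next.add(mergeGroup(current.subList(i, end)));
                }
                current = next;
            }
            return current.get(0);
        }

        private static File mergeGroup(List<File> group) throws IOException {
            File out = File.createTempFile("merge", ".tmp");
            BufferedReader[] readers = new BufferedReader[group.size()];
            String[] heads = new String[group.size()];
            BufferedWriter writer = new BufferedWriter(new FileWriter(out));
            try {
                for (int i = 0; i < group.size(); i++) {
                    readers[i] = new BufferedReader(new FileReader(group.get(i)));
                    heads[i] = readers[i].readLine();
                }
                while (true) {
                    // Pick the smallest head line across all open chunks.
                    int min = -1;
                    for (int i = 0; i < heads.length; i++) {
                        if (heads[i] != null
                                && (min < 0 || heads[i].compareTo(heads[min]) < 0)) {
                            min = i;
                        }
                    }
                    if (min < 0) {
                        break;  // every chunk exhausted
                    }
                    writer.write(heads[min]);
                    writer.newLine();
                    heads[min] = readers[min].readLine();
                }
            } finally {
                writer.close();
                for (BufferedReader reader : readers) {
                    if (reader != null) {
                        reader.close();  // release handles promptly
                    }
                }
            }
            return out;
        }
    }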

One way to solve the import speed problem is to allow changes to be 
applied to the dataset database.  That way you can do a single import 
and keep it up to date with daily (or hourly) diffs.  One way to 
implement this is to use the existing mysql schema, but I don't want to 
do that because it puts osmosis out of reach of the average user who 
just wants to manipulate planet files without setting up a bunch of 
additional infrastructure.
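
Applying a diff to a random-access store is conceptually simple: each 
change record carries an action, creates and modifies reduce to an 
upsert, and deletes remove the entry.  The sketch below uses made-up 
EntityStore and ChangeAction names, not the actual osmosis types.

    // Sketch only: EntityStore and ChangeAction are placeholders.
    enum ChangeAction { CREATE, MODIFY, DELETE }

    interface EntityStore {
        void put(long id, byte[] serialisedEntity);
        void delete(long id);
    }

    class ChangeApplier {

        // Apply a single change record; a diff is just a stream of these.
        static void apply(EntityStore store, ChangeAction action, long id, byte[] entity) {
            switch (action) {
            case CREATE:
            case MODIFY:
                // Creates and modifies both reduce to storing the latest version.
                store.put(id, entity);
                break;
            case DELETE:
                store.delete(id);
                break;
            }
        }
    }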

I've started playing with the Berkeley DB Java Edition.  It appears to 
be everything I was looking for: it's very fast, it allows updates, it's 
transactional (very useful when applying diffs), and it's very easy to 
program against, letting me dump objects straight in using my existing 
custom serialisation mechanism.  I'll try to get something basic working 
over the next few days and see how it compares.
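
As a rough idea of what the BDB JE usage looks like, here's a minimal 
sketch of storing and retrieving serialised nodes keyed by id.  The 
NodeStore class and the key encoding are my own illustration, not 
finished osmosis code.

    import java.io.File;
    import java.nio.ByteBuffer;

    import com.sleepycat.je.Database;
    import com.sleepycat.je.DatabaseConfig;
    import com.sleepycat.je.DatabaseEntry;
    import com.sleepycat.je.DatabaseException;
    import com.sleepycat.je.Environment;
    import com.sleepycat.je.EnvironmentConfig;
    import com.sleepycat.je.LockMode;
    import com.sleepycat.je.OperationStatus;
    import com.sleepycat.je.Transaction;

    // Sketch only: a transactional store of serialised nodes keyed by id.
    public class NodeStore {

        private final Environment env;
        private final Database db;

        public NodeStore(File dir) throws DatabaseException {
            EnvironmentConfig envConfig = new EnvironmentConfig();
            envConfig.setAllowCreate(true);
            envConfig.setTransactional(true);
            env = new Environment(dir, envConfig);

            DatabaseConfig dbConfig = new DatabaseConfig();
            dbConfig.setAllowCreate(true);
            dbConfig.setTransactional(true);
            db = env.openDatabase(null, "node", dbConfig);
        }

        // The value is whatever the existing custom serialisation mechanism
        // produces for the node.
        public void putNode(long id, byte[] serialisedNode) throws DatabaseException {
            Transaction txn = env.beginTransaction(null, null);
            db.put(txn, new DatabaseEntry(keyFor(id)), new DatabaseEntry(serialisedNode));
            txn.commit();
        }

        public byte[] getNode(long id) throws DatabaseException {
            DatabaseEntry value = new DatabaseEntry();
            OperationStatus status =
                    db.get(null, new DatabaseEntry(keyFor(id)), value, LockMode.DEFAULT);
            return status == OperationStatus.SUCCESS ? value.getData() : null;
        }

        public void close() throws DatabaseException {
            db.close();
            env.close();
        }

        private static byte[] keyFor(long id) {
            return ByteBuffer.allocate(8).putLong(id).array();
        }
    }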

An alternative which has already been suggested is SQLite.  However, the 
main implementation is C-based and presumably relies on native 
libraries, which are a real pain to deploy for Java apps.  The 
alternative is their pure Java implementation, which from a quick google 
appears to be slower.  The main thing turning me off SQLite for now, 
though, is that it is SQL based.  It will take far more development 
effort to support a SQL database than a simple byte array store like BDB.

If I get a basic dataset implementation working properly, I'll look into 
extending it to support the storage of bounding boxes.  This will allow 
you to maintain a dataset for only a subset of the planet (e.g. a box 
around the UK), which should minimise the amount of disk space required 
to maintain a local dataset of the area you're interested in.
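
The bounding box handling itself is trivial; the point is deciding at 
import/update time whether an entity falls inside the area being 
maintained.  Something along these lines (the class is illustrative and 
the UK numbers are rough):

    // Sketch only: decide whether a point falls inside the maintained area.
    public class MaintainedArea {

        private final double left;    // min longitude
        private final double right;   // max longitude
        private final double bottom;  // min latitude
        private final double top;     // max latitude

        public MaintainedArea(double left, double right, double bottom, double top) {
            this.left = left;
            this.right = right;
            this.bottom = bottom;
            this.top = top;
        }

        public boolean contains(double lat, double lon) {
            return lon >= left && lon <= right && lat >= bottom && lat <= top;
        }

        public static void main(String[] args) {
            // Roughly a box around the UK.
            MaintainedArea uk = new MaintainedArea(-8.7, 2.0, 49.8, 61.0);
            System.out.println(uk.contains(51.5, -0.1));  // central London -> true
        }
    }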

In summary, I'm getting there and think this should be useful, but it's 
taking a few iterations to make something practical due to the sheer 
size of OSM data these days.  I'm trying to keep things as simple as 
possible for the end user, but it is likely to remain more involved than 
the existing bounding box task, which extracts a bounding box directly 
from a planet file.

Cheers,
Brett




