[OSM-dev] Update on osmosis dataset support
brett at bretth.com
Sat Jan 5 23:49:23 GMT 2008
This is just an FYI for those who are interested.
The osmosis pipeline code in svn now supports the concept of a dataset,
which gives downstream tasks random read access to OSM data.
Currently you can access individual OSM nodes/ways/relations by their
identifier, or read entire sections of the planet by bounding box. In
effect, it provides a way for tasks to directly access a database
without forcing them to access data in a streaming fashion.
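The random-access idea can be sketched roughly like this (names here are illustrative only, not the actual osmosis API):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of random read access to OSM data: lookup by
// identifier, or by bounding box. Not the actual osmosis interface.
class Node {
    final long id;
    final double lat, lon;
    Node(long id, double lat, double lon) {
        this.id = id; this.lat = lat; this.lon = lon;
    }
}

interface OsmReader {
    Node getNode(long id);                        // lookup by identifier
    Iterable<Node> getNodes(double minLat, double minLon,
                            double maxLat, double maxLon); // by bounding box
}

// Trivial in-memory implementation to illustrate the contract; a real
// dataset database would back this with on-disk storage.
class InMemoryReader implements OsmReader {
    private final Map<Long, Node> nodes = new HashMap<>();
    void add(Node n) { nodes.put(n.id, n); }
    public Node getNode(long id) { return nodes.get(id); }
    public Iterable<Node> getNodes(double minLat, double minLon,
                                   double maxLat, double maxLon) {
        List<Node> out = new ArrayList<>();
        for (Node n : nodes.values())
            if (n.lat >= minLat && n.lat <= maxLat
                && n.lon >= minLon && n.lon <= maxLon)
                out.add(n);
        return out;
    }
}
```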
*But* the missing part is a working dataset database implementation
providing an on-disk planet representation allowing these queries to be
performed.
I was planning to create a simple read-only customdb implementation that
builds an on-disk database using simple data files and indexes. I've
almost completed this but I don't think it will be very practical. It's
horrendously slow (at least 12 hours for an import) and it breaks during
an internal sort due to too many file handles being open. The file
handle problem should be easily fixable (either a file handle leak or a
poorly tuned file-based merge sort implementation) but not so easy to
debug due to the size of the dataset. After some benchmarking, I think
I can speed up the import but the problem will only get worse as the
planet grows. I'm now investigating an alternative approach.
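For illustration, one standard way to tune a file-based merge sort so it never holds too many handles open is to merge runs in passes of at most K files at a time. This is a generic sketch of that technique, not the osmosis code:

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

// Sketch of a k-way merge of sorted runs (one long per line) that bounds
// the number of simultaneously open file handles. Merging more runs than
// maxOpenFiles is done in multiple passes over intermediate files.
class RunMerger {
    static Path merge(List<Path> runs, int maxOpenFiles) throws IOException {
        while (runs.size() > 1) {
            List<Path> next = new ArrayList<>();
            for (int i = 0; i < runs.size(); i += maxOpenFiles) {
                List<Path> group = runs.subList(i,
                        Math.min(i + maxOpenFiles, runs.size()));
                next.add(mergeGroup(group));
            }
            runs = next;
        }
        return runs.get(0);
    }

    // Merge one group of sorted runs; only group.size() readers are open.
    private static Path mergeGroup(List<Path> group) throws IOException {
        Path out = Files.createTempFile("merged", ".txt");
        List<BufferedReader> readers = new ArrayList<>();
        // Heap entries are {value, readerIndex}, ordered by value.
        PriorityQueue<long[]> heap =
                new PriorityQueue<>(Comparator.comparingLong((long[] a) -> a[0]));
        try (BufferedWriter w = Files.newBufferedWriter(out)) {
            for (int i = 0; i < group.size(); i++) {
                BufferedReader r = Files.newBufferedReader(group.get(i));
                readers.add(r);
                String line = r.readLine();
                if (line != null) heap.add(new long[]{Long.parseLong(line), i});
            }
            while (!heap.isEmpty()) {
                long[] top = heap.poll();
                w.write(Long.toString(top[0]));
                w.newLine();
                String line = readers.get((int) top[1]).readLine();
                if (line != null)
                    heap.add(new long[]{Long.parseLong(line), top[1]});
            }
        } finally {
            for (BufferedReader r : readers) r.close();
        }
        return out;
    }
}
```

The trade-off is extra passes over the data when the run count exceeds the handle limit, which is why a poorly tuned merge can be both slow and handle-hungry at the same time.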
One way to solve the import speed problems is to allow changes to be
applied to the dataset database. This allows you to do a single import
and keep it up to date with daily (or hourly) diffs. One way to
implement this is to use the existing MySQL schema, but I don't want to
because it puts osmosis out of reach of the average user who just wants
to manipulate planet files without setting up a bunch of additional
software.
I've started playing with the Berkeley DB java edition. It appears to
be everything I was looking for: it's very fast, it allows updates, it's
transactional (very useful when applying diffs), and it's very easy to
program against, allowing me to dump objects straight in using my
existing custom serialisation mechanism. I'll try to get something
basic working over the next few days and see how it compares.
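The appeal of a plain byte-array store is that an entity can be serialised straight into a value buffer held against its identifier as the key. A rough sketch of that kind of custom serialisation (illustrative only, not the actual osmosis mechanism or the BDB API):

```java
import java.io.*;

// Illustrative node serialisation to/from a byte array -- the kind of
// value a key/value store like Berkeley DB JE would hold against a
// node-id key. This is a sketch, not the osmosis serialisation format.
class NodeCodec {
    static byte[] encode(long id, double lat, double lon) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeLong(id);      // fixed-width fields, no SQL schema needed
        out.writeDouble(lat);
        out.writeDouble(lon);
        return buf.toByteArray();
    }

    // Returns {id, lat, lon} decoded from the buffer.
    static double[] decode(byte[] data) throws IOException {
        DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(data));
        long id = in.readLong();
        return new double[]{id, in.readDouble(), in.readDouble()};
    }
}
```

Because the store only sees opaque byte arrays, the serialisation format can evolve without touching any database schema, which is part of why it takes less development effort than mapping entities onto SQL tables.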
An alternative which has already been suggested is SQLite. However, it
appears that the main implementation is C-based, which presumably relies
on native libraries, and those are a real pain to deploy for Java apps.
The alternative is their pure Java implementation, which from a quick
Google search appears to be slower. The main thing turning me off SQLite for now
though is that it is SQL based. It will take far more development
effort to support a SQL database than a simple byte array store like BDB.
If I get a basic dataset implementation working properly, I'll look into
extending it to support the storage of bounding boxes. This will allow
you to only maintain a dataset for a subset of the planet (e.g. a box
around the UK). This should minimise the amount of disk required to
maintain a local dataset of the area you're interested in.
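Maintaining only a subset would come down to a containment test applied while importing or applying diffs, along these lines (a sketch under assumed semantics: entities outside the configured box are simply dropped):

```java
// Sketch: a bounding box filter applied during import or diff
// application so that only data inside the configured box (e.g. around
// the UK) is stored, keeping disk usage proportional to the area kept.
class BoundingBox {
    final double minLat, minLon, maxLat, maxLon;
    BoundingBox(double minLat, double minLon, double maxLat, double maxLon) {
        this.minLat = minLat; this.minLon = minLon;
        this.maxLat = maxLat; this.maxLon = maxLon;
    }
    boolean contains(double lat, double lon) {
        return lat >= minLat && lat <= maxLat
            && lon >= minLon && lon <= maxLon;
    }
}
```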
In summary, I'm getting there and I think this should be useful, but
it's taking a few iterations to make something practical due to the
sheer size of OSM data these days. I'm trying to keep things as simple
as possible for the end user, but it is likely to be more involved than
the existing bounding box task, which extracts a bounding box directly
from a planet file.