[osmosis-dev] Reading OSM History dumps

Mon Aug 23 16:09:37 BST 2010

On 23.08.2010 13:35, Brett Henderson wrote:
> If you can keep your working set entirely in memory then it's a much
> simpler problem and you can avoid stores altogether. Obviously to
> process planet sized data that isn't workable.  So moving on ...
I wrote a sample store that uses nested HashMaps as its backing
structure so that I can get the real history logic done first. I hope
to come back to the store problem later.

<http://svn.toolserver.org/svnroot/mazder/osmhist/osmosis-plugin/history/src/org/openstreetmap/osmosis/history/common/InMemoryHistoryNodeStore.java>
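A minimal sketch of what a nested-HashMap store keyed by id and then by
version could look like. This is not the code behind the link above; the
class NestedMapNodeStore, its NodeVersion stand-in and all method names are
hypothetical and only illustrate the idea.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not the InMemoryHistoryNodeStore from the project.
final class NestedMapNodeStore {

    // Minimal stand-in for one version of a node (a real node carries more data).
    static final class NodeVersion {
        final long id;
        final int version;
        final long timestamp;

        NodeVersion(long id, int version, long timestamp) {
            this.id = id;
            this.version = version;
            this.timestamp = timestamp;
        }
    }

    // Outer map: node id -> inner map; inner map: version number -> node version.
    private final Map<Long, Map<Integer, NodeVersion>> byId = new HashMap<>();

    void add(NodeVersion node) {
        byId.computeIfAbsent(node.id, k -> new HashMap<>()).put(node.version, node);
    }

    // Direct lookup of a known id/version pair; returns null if absent.
    NodeVersion get(long id, int version) {
        Map<Integer, NodeVersion> versions = byId.get(id);
        return versions == null ? null : versions.get(version);
    }

    // All stored versions of an id, sorted by ascending version number.
    List<NodeVersion> getAll(long id) {
        List<NodeVersion> result = new ArrayList<>();
        Map<Integer, NodeVersion> versions = byId.get(id);
        if (versions != null) {
            result.addAll(versions.values());
        }
        result.sort(Comparator.comparingInt(n -> n.version));
        return result;
    }
}
```

Everything lives on the heap, so this only works while the working set
fits in memory, as noted in the quoted text above.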

> To create your own store implementation you can build on the Osmosis
> persistence support.  All classes that are persistable implement the
> Storeable interface and have a constructor with "StoreReader sr,
> StoreClassRegister scr" arguments.
>
> The existing IndexedObjectStore assumes that the key is a long but
> provides a good example to start from.  The underlying IndexStore it
> uses can support any type of key as long as it has a fixed width (ie.
> always persists to the same number of bytes).
It would need a 96-bit key (id long + version int). I am not aware
of any type wider than 64 bits in Java, so I'm not sure how I could
build a store with a 96-bit index, but I think I have to take a deeper
look into the IndexStore & company.
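A composite key does not need a single 96-bit primitive; it only has to
serialize to the same number of bytes every time. The sketch below packs a
long id and an int version into 12 bytes with java.nio.ByteBuffer. The class
NodeVersionKey is hypothetical and does not use the Osmosis IndexStore or
Storeable interfaces, whose exact signatures are not shown in this thread.

```java
import java.nio.ByteBuffer;

// Hypothetical fixed-width key: 8 bytes of id followed by 4 bytes of version.
final class NodeVersionKey implements Comparable<NodeVersionKey> {

    static final int WIDTH_BYTES = Long.BYTES + Integer.BYTES; // always 12 bytes

    final long id;
    final int version;

    NodeVersionKey(long id, int version) {
        this.id = id;
        this.version = version;
    }

    // Serializes the key to exactly WIDTH_BYTES bytes (big-endian).
    byte[] toBytes() {
        return ByteBuffer.allocate(WIDTH_BYTES).putLong(id).putInt(version).array();
    }

    // Rebuilds a key from bytes produced by toBytes().
    static NodeVersionKey fromBytes(byte[] bytes) {
        if (bytes.length != WIDTH_BYTES) {
            throw new IllegalArgumentException("expected " + WIDTH_BYTES + " bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        return new NodeVersionKey(buf.getLong(), buf.getInt());
    }

    // Orders by id first, then by version, so all versions of an id are adjacent.
    @Override
    public int compareTo(NodeVersionKey other) {
        int byId = Long.compare(id, other.id);
        return byId != 0 ? byId : Integer.compare(version, other.version);
    }
}
```

Ordering by id and then version keeps all versions of one node next to each
other in a sorted index, which would also help the "all versions of an id"
lookup.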

The timestamp is just a 64-bit long value, so the only problem there is
doing the comparison, but this is the easy part, I think.

> It may
> be possible to make the existing IndexedObjectStore more generic but I'd
> need to experiment with it.
I'll try to keep all changes local to my project. Once it's
finished you can take classes over to core as they're needed.

> Hmm, but thinking more about your problem it may make more sense to
> stick with the IndexedObjectStore and store a list of Nodes as each
> element instead of single Nodes.  I suspect in most cases you won't know
> the exact version you're looking for when you're loading a Node
In the first phase, when selecting the versions of the nodes used to
create a version of a way, I'll have a lot of timestamp searches (find
the oldest node that is younger than the timestamp of the way) that need
the timestamp index.
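In memory, a sorted map gives this kind of timestamp search directly. The
sketch below keeps one java.util.TreeMap per node, mapping timestamp to
version number. The class TimestampIndex and its method names are
hypothetical. It shows the lookup described above (oldest version strictly
younger than a timestamp) and, for comparison, the opposite lookup (newest
version at or before a timestamp).

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical per-node timestamp index: timestamp (millis) -> version number.
final class TimestampIndex {

    private final TreeMap<Long, Integer> versionByTimestamp = new TreeMap<>();

    void put(long timestampMillis, int version) {
        versionByTimestamp.put(timestampMillis, version);
    }

    // Oldest version whose timestamp is strictly greater than the given one,
    // or null if no such version exists.
    Integer oldestVersionAfter(long timestampMillis) {
        Map.Entry<Long, Integer> e = versionByTimestamp.higherEntry(timestampMillis);
        return e == null ? null : e.getValue();
    }

    // Newest version whose timestamp is less than or equal to the given one,
    // or null if no such version exists.
    Integer newestVersionAtOrBefore(long timestampMillis) {
        Map.Entry<Long, Integer> e = versionByTimestamp.floorEntry(timestampMillis);
        return e == null ? null : e.getValue();
    }
}
```

Both lookups run in logarithmic time per node; an on-disk store would need
an equivalent ordered index on the timestamp.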

Later on, when the intermediate versions are calculated, I'll need a
lookup for all versions of an id.

A direct request for a known id/version will, as far as I can see at this
early stage, not be used too often (maybe during linestring building).

> (you'll
> only know node ids when looking at a way after all), and will only know
> a timestamp range.  When looking up a specific node/version/timestamp
> combination you would have to load all versions of a node from the
> IndexedObjectStore then linearly search for a match in the (usually
> fairly limited) list of objects.  You will possibly need to create your
> own Storeable list type to hold all versions of a particular Node
> because I don't think one exists.
The main problem I see is that such a list won't be of fixed size. When
I write it to the store and later add another version, it will grow
bigger and have to be re-allocated in the store file, freeing up space at
its old position. Basically a malloc/realloc/free in files.

> Just keep in mind that Osmosis stores aren't particularly fast to query
> because they're based on very simple data structures.  They tend to
> result in huge amounts of disk seeks when processing, so there may be
> libraries out there that perform better.  The main reason they were
> originally developed was to minimise external library dependencies and I
> haven't revisited that decision since Osmosis put on weight (ie. it now
> relies on many third-party jars).
Thinking about all this, I find that we're re-inventing the wheel. I'll
try to use JavaDB as the backend store. It is written entirely in Java
and is thus cross-platform, supports B-tree indexes on multiple
fields, and can reside both in memory and on disk. If it turns out to be
fast enough, it may be a good alternative to a custom binary file/memory
store.

> A slightly rambling email, but hopefully some of it is useful :-)
Of course it is. Coming from PHP land, I'm getting to love the sane
Eclipse environment, but I get scared when looking at the class
infrastructure used to store the data. So your mail helped me find my way
through the jungle of classes ^^

Peter