[osmosis-dev] Reading OSM History dumps
brett at bretth.com
Mon Aug 23 12:35:14 BST 2010
On Mon, Aug 23, 2010 at 3:44 AM, Peter Körner <osm-lists at mazdermind.de> wrote:
> On 22.08.2010 08:26, Brett Henderson wrote:
>> You seem to have thought about most of the complexities of the problem
>> already so you know what you're dealing with.
> I think it is all solvable with just enough logic :) I did the demo
> implementation in PHP to see if this is possible, and I think I understand
> the OSM data structure well enough to know what it means.
> But I don't know Osmosis and Java well enough to know how to implement the
> simple multi-level arrays from PHP in a way that will work with those really
> big files.
> What I need is a store that can
> - store all versions of a Node*
> - access a specific version of a node
> - access all versions of a node
> - access the oldest version of a node that was created before Date X
> *not only the Node's location but also the Meta-Info (Timestamp, User,
> UserID) because you would want to have this as the Meta-Info on the
> generated intermediate Way-Versions.
> I looked into the three implementations of NodeLocationStore (especially
> the InMemoryNodeLocationStore) and I was thinking how I could extend the
> really simple fixed-size memory store to be able to store a complete Node
> and index by Id and Version at the same time.
> Because there is no fixed number of versions per Node I can't use a
> simple offset = NodeID * NodeSize calculation. Instead I have to write the
> nodes one after another as they come in and save the offsets in a list, but
> I'm not sure how to build a list in Java that allows random access to the
> offsets of all versions of a node as well as to a specific version.
> I also found the IndexedObjectStore class in
> org.openstreetmap.osmosis.core.store and I thought about extending it to
> track three Indexes (NodeID, Version and Timestamp). Do you know if this
> would be workable?
Most of Osmosis is written to handle arbitrary sized data sets so it avoids
holding data in memory and persists temporary data to disk. If you can keep
your working set entirely in memory then it's a much simpler problem and you
can avoid stores altogether. Obviously, to process planet-sized data that
isn't workable. So moving on ...
All the store implementations have been created to solve specific problems
at some point in time so don't assume there is any grand architecture or
intelligent thought process behind it all :-) In your case you might have
to create a more generic store that has a multi-part key.
To create your own store implementation you can build on the Osmosis
persistence support. All classes that are persistable implement the
Storeable interface and have a constructor with "StoreReader sr,
StoreClassRegister scr" arguments.
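The pattern is easy to mimic in plain Java. The sketch below uses DataInput/DataOutput as stand-ins for Osmosis's StoreReader/StoreWriter (whose actual signatures differ), and the field layout is an assumption; the point is the convention: write fields in a fixed order, and give the class a constructor that reads them back in the same order.

```java
import java.io.*;

// Illustrative stand-in for the Osmosis Storeable pattern: the object
// serialises itself field by field, and a matching constructor reads the
// fields back in the same order. DataInput/DataOutput stand in for the
// real StoreReader/StoreWriter interfaces, which have different signatures.
class NodeRecord {
    final long id;
    final int version;
    final long timestamp;
    final double lat;
    final double lon;

    NodeRecord(long id, int version, long timestamp, double lat, double lon) {
        this.id = id; this.version = version; this.timestamp = timestamp;
        this.lat = lat; this.lon = lon;
    }

    // Mirrors the "StoreReader" constructor: read fields in write order.
    NodeRecord(DataInput in) throws IOException {
        this(in.readLong(), in.readInt(), in.readLong(),
             in.readDouble(), in.readDouble());
    }

    void store(DataOutput out) throws IOException {
        out.writeLong(id);
        out.writeInt(version);
        out.writeLong(timestamp);
        out.writeDouble(lat);
        out.writeDouble(lon);
    }
}
```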
The existing IndexedObjectStore assumes that the key is a long but provides
a good example to start from. The underlying IndexStore it uses can support
any type of key as long as it has a fixed width (i.e. it always persists to the
same number of bytes). So you could create a new multi-part key object that
implements the IndexElement interface and has id/version/timestamp
components. You could then build a new IndexedObjectStore implementation
that utilises your key type. It may be possible to make the existing
IndexedObjectStore more generic but I'd need to experiment with it.
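As a rough illustration of what such a multi-part key might look like (the IndexElement interface itself is not reproduced here, and the field layout is an assumption), a fixed-width id/version/timestamp key in plain Java could be:

```java
import java.io.*;

// Sketch of a fixed-width multi-part key (id + version + timestamp),
// mirroring the constraint on Osmosis IndexStore keys: every key must
// persist to exactly the same number of bytes.
class NodeVersionKey implements Comparable<NodeVersionKey> {
    static final int WIDTH = 8 + 4 + 8;  // bytes: long + int + long

    final long id;
    final int version;
    final long timestamp;

    NodeVersionKey(long id, int version, long timestamp) {
        this.id = id; this.version = version; this.timestamp = timestamp;
    }

    void write(DataOutput out) throws IOException {
        out.writeLong(id);
        out.writeInt(version);
        out.writeLong(timestamp);  // always WIDTH bytes, never more or less
    }

    static NodeVersionKey read(DataInput in) throws IOException {
        return new NodeVersionKey(in.readLong(), in.readInt(), in.readLong());
    }

    // Order by id first, then version, so all versions of a node are adjacent.
    public int compareTo(NodeVersionKey o) {
        if (id != o.id) return Long.compare(id, o.id);
        return Integer.compare(version, o.version);
    }
}
```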
As with most Osmosis stores, the IndexStore cannot be read until it is fully
written so you'll have to take that into account. It does allow unsorted
data to be added which may be helpful although sorting data on input is
usually fairly simple through the use of the FileBasedSort class.
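FileBasedSort itself spills batches to disk and merges them so it scales beyond memory; as a purely in-memory stand-in, ordering incoming node versions by (id, version) so that all versions of a node end up adjacent might look like:

```java
import java.util.*;

// In-memory stand-in for what Osmosis's FileBasedSort achieves on disk:
// sort incoming node versions by (id, version) so that all versions of a
// node are adjacent before they are bulk-written to a store.
class NodeVersionSorter {
    // Each record is a {id, version} pair; real code would sort full entities.
    static long[][] sort(long[][] records) {
        long[][] sorted = records.clone();
        Arrays.sort(sorted, (a, b) -> {
            if (a[0] != b[0]) return Long.compare(a[0], b[0]);
            return Long.compare(a[1], b[1]);
        });
        return sorted;
    }
}
```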
Hmm, but thinking more about your problem it may make more sense to stick
with the IndexedObjectStore and store a list of Nodes as each element
instead of single Nodes. I suspect in most cases you won't know the exact
version you're looking for when you're loading a Node (you'll only know node
ids when looking at a way after all), and will only know a timestamp range.
When looking up a specific node/version/timestamp combination you would have
to load all versions of a node from the IndexedObjectStore then linearly
search for a match in the (usually fairly limited) list of objects. You
will probably need to create your own Storeable list type to hold all
versions of a particular Node because I don't think one exists.
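A plain-Java sketch of that lookup pattern, with a HashMap standing in for the IndexedObjectStore and interpreting the timestamp query as "the version current at time t" (an assumption about the query you need):

```java
import java.util.*;

// Sketch of the "list of versions per node id" approach: one lookup by id
// returns every version, and matching a timestamp is a linear scan over
// the (usually short) list. A HashMap stands in for IndexedObjectStore.
class NodeHistory {
    // Hypothetical record layout: {version, timestamp}, keyed by node id.
    private final Map<Long, List<long[]>> byId = new HashMap<>();

    void add(long id, long version, long timestamp) {
        byId.computeIfAbsent(id, k -> new ArrayList<>())
            .add(new long[] {version, timestamp});
    }

    // Latest version with timestamp <= t, i.e. the version current at time t;
    // returns null if the node did not exist yet.
    long[] versionAt(long id, long t) {
        long[] best = null;
        for (long[] v : byId.getOrDefault(id, List.of())) {
            if (v[1] <= t && (best == null || v[1] > best[1])) best = v;
        }
        return best;
    }
}
```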
Just keep in mind that Osmosis stores aren't particularly fast to query
because they're based on very simple data structures. They tend to result
in huge amounts of disk seeks when processing, so there may be libraries out
there that perform better. The main reason they were originally developed
was to minimise external library dependencies and I haven't revisited that
decision since Osmosis put on weight (i.e. it now relies on many third-party
libraries anyway).
>> You mentioned the problem of obtaining test data. I'd suggest using:
> They are in .osc format but I need a task to convert from .osc to
> history-.osm and back, too.
>> That is a full history from day one of the project up until now. It is
>> already in the OSM change format that Osmosis understands. Cutting
>> bounding boxes out of full history data is a difficult (but not
> In regard to the Node-Moved-In/-Out problem, yes. At the moment I'm working
> with self-contained history files that contain all referenced items from
> version 1 onward. When I start to convert .osc files into history-.osm files
> I will have to deal with objects with incomplete histories (when a node has
> been moved I only know its new position). There will be a need to feed in a
> second data source, like an already existing database.
Osmosis does have a stream type called "dataset" which complements the
"entity" and "change" streams. If you have a database allowing random
access to a full dataset (like the "simple" schema), you can expose a
dataset reader to downstream tasks allowing them to query for specific
entities as required. The --read-pgsql task does exactly that: rather than
streaming the contents of the database, it provides an interface for
downstream tasks to query it.
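The names below are hypothetical, not the Osmosis API, but they sketch the dataset idea: instead of receiving every entity as a stream, a downstream task gets a random-access view it can query entity by entity, which is exactly what backfilling an incomplete history needs.

```java
// Hypothetical sketch of the "dataset" idea; these interface and class
// names are illustrative, not the real Osmosis API.
interface RandomAccessDataset {
    NodeSnapshot getNode(long id);  // fetch a single entity on demand
}

class NodeSnapshot {
    final long id;
    final double lat;
    final double lon;

    NodeSnapshot(long id, double lat, double lon) {
        this.id = id; this.lat = lat; this.lon = lon;
    }
}

// A downstream task fills gaps in an incomplete history by querying the
// dataset (which could be backed by e.g. a "simple" schema database).
class HistoryBackfill {
    static NodeSnapshot resolve(RandomAccessDataset db, long missingNodeId) {
        return db.getNode(missingNodeId);
    }
}
```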
>> problem that you may have to solve in order to move
>> forward. In order to build way linestrings for all way versions and for
>> all node versions impacting the way you will have to solve a similar
>> problem to understanding how to cut bbox data so you may be able to kill
>> a couple of birds with one stone.
> I'm not really sure if this will work as all I'm focusing on now is to get
> a complete dump analyzed, but we may get closer to this goal.
A slightly rambling email, but hopefully some of it is useful :-)