[OSM-dev] Complete history of OSM data - questions and discussion

Lars Francke lars.francke at gmail.com
Wed Nov 11 06:41:47 GMT 2009


Hi!

I and many (okay, at least a few) others have shown interest in the
complete history data of OSM. I understand that a lot of this data is
available throughout the web in the form of old snapshots and diffs,
but it comes in outdated formats and is by no means complete or easy
to use. I also had a look at the System Admin page on the wiki, but I
don't really know whom to contact, hence this post on the mailing list.

My question is what would have to be done to produce a complete dump
of the data. I have read previous requests for this data, and it seems
as if there is no general objection to such a dump but that no one has
written the proper tool for the job so far. As I have some free time
on my hands (and about a hundred ideas/requests for the data for
osmdoc), I'd be willing to at least _try_ to get something done.

There are a few questions that probably need answering first, and I
hope we can start a discussion about them:
- Am I correct in assuming that there are no general objections from
the OSM server folks against such a dump? (Which would render the rest
of this e-mail useless ;-)
- Is anyone else currently working on this?
- Which format should the data be dumped in?
- How should the data be distributed, and what are the storage space
requirements?
- How often should dumps be produced?

*Format*
1) The easiest option would be to just use the PostgreSQL COPY command
(http://www.postgresql.org/docs/8.3/interactive/sql-copy.html). This
would produce a file that can be read into any other PostgreSQL
database. (A rough sketch of what this could look like follows the
pros/cons below.)

Pros:
- Easy to do
- Probably one of the fastest options
- Low overhead in the file formats

Cons:
- As far as I know there is no way to have COPY compress its output
directly, so everything would have to be written uncompressed first
- The binary format is not really portable or easy to use: it forces
PostgreSQL as the target and doesn't allow filtering the data (text
formats are available, though)
- Even with the text formats the data would be scattered (i.e. tags
wouldn't be stored with the elements, node references wouldn't be
stored with the ways, ...)
- No OSM tools support these formats
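
For what it's worth, here is a minimal sketch of what such an export
could look like from a client, assuming psycopg2 and direct read
access to the API database; the connection string and the table name
are placeholders, not the real schema. As far as I can tell,
COPY ... TO STDOUT streams the rows to the client, so the output could
at least be compressed on the fly there instead of being written
uncompressed on the server first:

    import gzip
    import psycopg2

    # Placeholder connection string and table name - not the real
    # server configuration or schema.
    conn = psycopg2.connect("dbname=openstreetmap")
    cur = conn.cursor()

    # Stream one table into a gzipped file; "nodes" stands in for
    # whichever history tables would actually be exported.
    out = gzip.open("nodes.copy.gz", "wb")
    cur.copy_expert("COPY nodes TO STDOUT", out)
    out.close()

    cur.close()
    conn.close()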

2) A dump of all changesets in OsmChange format (e.g.
http://www.openstreetmap.org/api/0.6/changeset/3010332/download ). As
I understand it, changesets have been created for every change. I
don't quite understand why the first changeset (and the first
nodes/ways) date from sometime in 2005 and not 2004, but I bet someone
here can enlighten me. (A rough sketch of what this could look like
follows the pros/cons below.)

Pros:
- Well-known data format; many tools can work with OsmChange
- Good if the user wants to rebuild/relive the change events, as the
changesets should come roughly in the correct order/timeline
- Possibility to split the process into multiple parts (e.g. history
files with 50,000 changesets each)
- Easy to update -> just append the new changesets (although the
long-running transactions that are 'haunting' the diffs pose the same
problem here)

Cons:
- XML file size overhead (doesn't matter that much once compressed)
- Probably a lot slower than the COPY method
- Custom code would have to be written to do this export, but it
shouldn't be too hard to iterate over every changeset; the necessary
indexes already seem to exist
- Potentially bad if one is interested mainly in the elements
themselves, as the history data for a single element could be
scattered throughout the whole file
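
Just to illustrate the shape of the output, here is a rough sketch
that pulls individual changesets through the public API and
concatenates them. It is only meant to show the idea - a real export
would obviously have to run against the database rather than fire
millions of API requests, and the result here is a plain concatenation
of OsmChange documents rather than one well-formed XML file:

    import gzip
    import urllib2

    API = "http://www.openstreetmap.org/api/0.6"

    out = gzip.open("history-changesets.osc.gz", "wb")
    # Changeset ids are assumed to form a (mostly) dense sequence
    # starting at 1; the upper bound is a placeholder, not the real
    # maximum.
    for changeset_id in xrange(1, 1000):
        url = "%s/changeset/%d/download" % (API, changeset_id)
        try:
            out.write(urllib2.urlopen(url).read())
        except urllib2.HTTPError:
            # Skip ids that the API rejects for whatever reason.
            continue
    out.close()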

3) A dump of all OSM elements in OSM format (e.g.
http://www.openstreetmap.org/api/0.6/node/60078445/history ). (A rough
sketch of what this could look like follows the pros/cons below.)

Pros:
- Good if the user is interested in the elements and their history and
not in the "flow" of changes
- Probably the best format for rebuilding a "custom" database of OSM,
as it is grouped by element rather than "arbitrarily" by
changeset/date
- Easily split into smaller files (nodes, ways, relations, changesets,
further subdivided by id ranges or something else)
- Easy to process, although tools might not work out of the box

Cons:
- XML file size overhead, custom code needed (or can Osmosis already
do this?), slower than COPY
- This format doesn't have that much tool support as far as I know
(multiple versions of an element in a single file)
- Not very easy to update; the whole process would have to be redone
(or the new changesets would have to be examined)
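
As a small illustration of what "multiple versions of an element in a
single file" means, here is a sketch that fetches the full history of
the example node above through the API and lists its versions. Again,
a real dump would of course read this straight from the database for
every element instead of going through the API:

    import urllib2
    import xml.etree.ElementTree as ET

    API = "http://www.openstreetmap.org/api/0.6"

    # Fetch the full history of one node (the example node from above).
    data = urllib2.urlopen("%s/node/60078445/history" % API).read()
    root = ET.fromstring(data)

    # The response is one <osm> document whose <node> children are the
    # successive versions of the same element.
    for version in root.findall("node"):
        print version.get("version"), version.get("timestamp")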


A few personal remarks:
- I personally favor option 3) but that is mainly because of my
requirements for osmdoc.
- I don't see missing tool support as a big problem, as I suspect that
the majority of the users of this data will have/want their own tools
to analyze or store the data (just guessing).


*Distribution and space requirements*
I really can't say much about this as I have no idea of the size of
the database or the space available on the server(s), but I hope one
of the admins can tell me more. The planet has been distributed via
BitTorrent in the past, so that might be an option for the history
dump as well, but it really is too early to tell.

*Interval of the dumps*
Theoretically only one dump would be needed, as there are now the
replicate diffs, which should provide every change to the database.
But as they are - at the moment - only available at 'minute'
granularity, one might still want to dump the history regularly
(whatever 'regularly' means; again, this depends on the space
requirements and on whether there is demand for it at all).
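
To make the "one dump plus replicate diffs" idea a bit more concrete,
here is a very rough sketch of a consumer that appends newer diffs to
an existing history file. I haven't checked how and where the
replicate diffs are actually published, so the base URL, the file
naming and the sequence numbers below are pure assumptions and would
have to be adapted:

    import gzip
    import urllib2
    from StringIO import StringIO

    # Assumption: the replicate diffs are published as sequentially
    # numbered gzipped OsmChange files under some base URL. This URL is
    # a placeholder, not the real location.
    BASE_URL = "http://example.org/minute-replicate"

    def fetch_diff(sequence_number):
        """Download one diff and return its uncompressed content."""
        url = "%s/%d.osc.gz" % (BASE_URL, sequence_number)
        compressed = urllib2.urlopen(url).read()
        return gzip.GzipFile(fileobj=StringIO(compressed)).read()

    # Append everything published since the initial dump was taken;
    # both sequence numbers are placeholders.
    last_in_dump = 1234
    latest_available = 1300
    out = open("history-update.osc", "ab")
    for seq in xrange(last_in_dump + 1, latest_available + 1):
        out.write(fetch_diff(seq))
    out.close()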


I have probably forgotten some important aspects/problems/points and I
hope to receive some feedback on this. I know that any "dump" program
would have to be written so as not to interfere with normal operations
(there is only one DB server, if I'm correct), but the current planet
dump program probably gives a good indication of the load such a dump
would produce. Again, I have no data about this.

Any pointers from the system administrators about the specifics and
whom best to contact would be very welcome. Remarks about the data or
its potential format (or possible uses for the data) are of course
welcome too!

Cheers,
Lars



