[OSM-dev] History File Formats
brett at bretth.com
Thu May 12 00:50:54 BST 2011
On Thu, May 12, 2011 at 2:14 AM, Frederik Ramm <frederik at remote.org> wrote:
> On 05/11/11 15:31, Brett Henderson wrote:
>> I then used the osc file format to create not only the commonly used
>> daily/hourly/minutely replication files, but also a complete dump of
>> database history updated daily (stored in one file per day).
>> I was slightly surprised then to see the creation of the new full
>> history files.
> I wasn't sure what these are, and I think I am still confused. What is the
> difference between the daily replication file and the file in /history/?
> Assuming I want to have a local, current full history file, how can I use
> these diffs to amend an existing full file that I have?
The daily files are still "delta" style change files. They do not contain
"full history", they only contain enough to get from day N to day N + 1.
They're fine for patching planet files, but no good for patching full
history files.
The history files are full history files similar to the minutely and hourly
files. The other major difference is that they contain all information from
day 1 of the project rather than just the last month or two.
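To make the delta versus full history distinction concrete, here is a minimal Python sketch. The tuple layout and the to_delta helper are invented for illustration and are not the osc schema or anything Osmosis does internally. A full history stream keeps every version of every entity, while a delta keeps only what is needed to move from one state to the next.

```python
# Hypothetical change record: (entity_type, entity_id, version, action).
# Illustration only; this is not the osc schema or Osmosis internals.
full_history = [
    ("node", 1, 1, "create"),
    ("node", 1, 2, "modify"),
    ("node", 1, 3, "modify"),
    ("way", 7, 1, "create"),
    ("way", 7, 2, "delete"),
]


def to_delta(changes):
    """Keep only the highest version of each entity seen in the interval."""
    latest = {}
    for entity_type, entity_id, version, action in changes:
        key = (entity_type, entity_id)
        if key not in latest or version > latest[key][2]:
            latest[key] = (entity_type, entity_id, version, action)
    return sorted(latest.values())


delta = to_delta(full_history)
print(len(full_history), "records in full history,", len(delta), "in the delta")
```

The sketch simplifies one thing: it keeps the last action as recorded and does not try to fold a create followed by a modify into a single create. The point is only that the intermediate versions are gone from the delta, which is why it cannot patch a full history file.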
Obviously you can't maintain a full osh file using Osmosis because Osmosis
doesn't use that representation of history. You can, however, create a
single osc file containing full history by combining
--read-change-interval and --append-change. I've just realised, though, that
--append-change doesn't produce sorted output, so it may be necessary to
follow it with --sort-change, which is very slow. I should take another
look at --append-change, because it should be possible to produce sorted
output without requiring a temp file based merge sort.
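As a rough illustration of why a temp file shouldn't be needed: if each input stream is already sorted, a streaming k-way merge yields sorted output while holding only one record per input in memory. The Python sketch below is not how Osmosis implements anything. The record layout, the TYPE_RANK ordering (node, way, relation, then id, then version) and the function names are assumptions made for the example.

```python
import heapq

# Assumed ordering: nodes before ways before relations, then by id and version.
TYPE_RANK = {"node": 0, "way": 1, "relation": 2}


def sort_key(change):
    """Sort key for a hypothetical (entity_type, entity_id, version) record."""
    entity_type, entity_id, version = change
    return (TYPE_RANK[entity_type], entity_id, version)


def merge_sorted_changes(*streams):
    """Lazily merge already-sorted change streams into one sorted stream."""
    return heapq.merge(*streams, key=sort_key)


# Two daily inputs, each already sorted by sort_key.
day1 = [("node", 1, 1), ("node", 2, 1), ("way", 7, 1)]
day2 = [("node", 1, 2), ("way", 7, 2), ("relation", 3, 1)]

merged = list(merge_sorted_changes(day1, day2))
for change in merged:
    print(change)
```

This only works when every input is itself sorted. A plain concatenation of the inputs gives no such guarantee, which is the case where a full sort becomes necessary.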
In summary, the history files are the only mechanism Osmosis provides to get
a full history file albeit in osc format. It may be necessary to enhance
some tasks to speed up appending of multiple files into one file if that is
required. However if this is achieved you'll then have a mechanism allowing
you to keep a full history file within a day of the main OSM database.
This then raises the question of whether an osc or osh format is more
appropriate. They both contain the same information but in a different
>> But I do wonder why we've now gone back to a single
>> massive file approach which is updated rarely and requires a full
>> download each time when the existing files allow incremental download of
>> recent changes.
> Is it so bad to generate a full history file four times a year? I haven't
> done the maths but I guess that downloading 300 daily diffs and applying
> them to a 300-day old history file will take some time too. What happens if
> users change their account names, or set their "public" flag?
A full file several times a year is fine if that's all you need. But it's
not suitable for daily or even weekly updates.
As for the changing of account names or public flags, that's a very good
point:
public flag + denormalised user info = incredible nuisance
I don't think public flag is likely to be a major issue. The amount of data
impacted should be fairly small given that all new users are public and the
number of old users making edits public at this point should be very small.
Perhaps I'm wrong?
The username is a bigger issue, although depending on what you're doing with
the data it may not matter so long as later history files contain the
correct username. If you're storing user data in a separate table (such as
the Osmosis pgsnapshot schema) then it will be updated when you apply later
files anyway, but if you want a simple XML (or pbfh?) file containing full
history then it's a problem. No simple way around this one unfortunately.
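The difference between the two cases can be sketched in a few lines of Python. This is a toy model with invented names (users, versions, apply_record), not the pgsnapshot schema. When the username lives in a separate table keyed by user id, applying a later record refreshes the name for every older version too. A flat file that copies the username into each version has no equivalent.

```python
# Toy normalised store: usernames are kept once per uid, not per version.
users = {}      # uid -> most recently seen username
versions = []   # (entity_id, version, uid)


def apply_record(entity_id, version, uid, username):
    """Store the entity version and overwrite the username for this uid."""
    users[uid] = username
    versions.append((entity_id, version, uid))


# The same user (uid 42) renames their account between two edits.
apply_record(1, 1, 42, "old_name")
apply_record(1, 2, 42, "new_name")

# Resolving names at read time shows the new name on both versions.
resolved = [(entity_id, version, users[uid]) for entity_id, version, uid in versions]
print(resolved)
```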
The simplest fix for both public flag and updated usernames is to reset the
extract timestamp every so often and force a full refresh which will take
>> It leaves me with a few questions:
>> * Are the Osmosis-based daily full history extracts even used?
>> Should I disable/delete them?
> I'm not using them but mainly because I don't know how ;)
I suspect that's the case for a lot of people :-) Hopefully this makes it a
little bit clearer.