[OSM-dev] Osmosis Replication Statistics
Brett Henderson
brett at bretth.com
Wed Aug 22 14:46:34 BST 2007
Hi All,
I'm inching ever closer towards osmosis being ready for wider use. This
is just an FYI for anybody who is interested to know where I'm at. I've
recently been testing osmosis replication features against full osm
history data and fixing up the various issues that I've encountered as a
result of this.
This email is pretty long. It starts with some stats breaking down the
time taken for osmosis to perform various change operations. For now
I'm focussing on the generation of snapshot and change files. Applying
them to a database will come later; those tasks are feature complete but
not heavily tested yet, and they are a lower priority. Later in the
email I describe a couple of problems I'm facing.
I produced a set of monthly changesets from the start of 2006. The file
sizes (in bytes) are as follows:
20060101.osm 53452410
20060101-20060201.osc 29113922
20060201-20060301.osc 39577139
20060301-20060401.osc 25482918
20060401-20060501.osc 279046515
20060501-20060601.osc 1401380463
20060601-20060701.osc 747336625
20060701-20060801.osc 862656139
20060801-20060901.osc 777061744
20060901-20061001.osc 838770700
20061001-20061101.osc 1261867463
20061101-20061201.osc 327296919
20061201-20070101.osc 959647780
20070101-20070201.osc 1223347868
20070201-20070301.osc 298255681
20070301-20070401.osc 326782626
20070401-20070501.osc 802615880
20070501-20070601.osc 851776524
20070601-20070701.osc 848936402
20070701-20070801.osc 1012877148
The most recent one (20070701-20070801.osc) is quite large so I'll focus
on it. It took 43m26s to produce.
Splitting into 7 day intervals and ignoring the last couple of days
produces the following stats.
File                   Size (bytes)  Duration
20070701-20070708.osc  164027848     10m37s
20070708-20070715.osc  220498572     10m10s
20070715-20070722.osc  347431979     14m42s
20070722-20070729.osc  253686702     13m40s
Selecting the biggest file in the interval (20070715-20070722.osc) and
breaking into 1 day intervals produces the following stats.
File                   Size (bytes)  Duration
20070715-20070716.osc  16684056      1m41s
20070716-20070717.osc  44242820      2m55s
20070717-20070718.osc  62863288      3m02s
20070718-20070719.osc  45727744      2m31s
20070719-20070720.osc  54787426      2m27s
20070720-20070721.osc  77041093      3m07s
20070721-20070722.osc  63953029      2m44s
Selecting the biggest file in the interval (20070720-20070721.osc) and
breaking into 4 hour intervals produces the following stats.
File                       Size (bytes)  Duration
2007072000-2007072004.osc  8932383       10.229s
2007072004-2007072008.osc  17590822      17.930s
2007072008-2007072012.osc  12615890      16.762s
2007072012-2007072016.osc  14360116      21.107s
2007072016-2007072020.osc  13339044      23.095s
2007072020-2007072100.osc  10841897      31.845s
(Note: running the last extraction a second time took 21.129s.)
Selecting the biggest file in the interval (2007072004-2007072008.osc)
and breaking into 1 hour intervals produces the following stats.
File                       Size (bytes)  Duration (run 3 times)
2007072004-2007072005.osc  3160144       4.872s, 4.655s, 4.985s
2007072005-2007072006.osc  4156641       5.249s, 5.185s, 4.729s
2007072006-2007072007.osc  5766073       9.204s, 8.871s, 8.638s
2007072007-2007072008.osc  4584598       4.250s, 4.650s, 3.983s
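To compare the timings across interval sizes, it can help to turn each size and duration into a throughput figure. The following is only a rough Python sketch: the helper names are made up for illustration, and the sample rows are the largest file at each granularity taken from the tables above.

```python
import re


def parse_duration(text: str) -> float:
    """Convert a duration such as '43m26s' or '10.229s' into seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes = int(match.group(1) or 0)
    seconds = float(match.group(2))
    return minutes * 60 + seconds


def throughput(size_bytes: int, duration: str) -> float:
    """Bytes of change file produced per second of extraction time."""
    return size_bytes / parse_duration(duration)


# (interval, size in bytes, duration) copied from the stats above.
samples = [
    ("1 month", 1012877148, "43m26s"),
    ("7 days", 347431979, "14m42s"),
    ("1 day", 77041093, "3m07s"),
    ("4 hours", 17590822, "17.930s"),
    ("1 hour", 5766073, "9.204s"),
]

for label, size, duration in samples:
    print(f"{label:8} {throughput(size, duration) / 1e6:6.2f} MB/s")
```

This only looks at output bytes per second of wall-clock time; it says nothing about where the time is spent.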
Another bit of news is that I've just written merge tasks that will
allow multiple files (both base osm files and change files) to be merged
together. If 3 minutes to produce the daily change file extraction is
too long, the extraction can be broken down into smaller time intervals
and the result files merged together afterwards. Martijn requested this
feature and may use it to help process his AND files.
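As a rough illustration of breaking an extraction into smaller intervals, the Python sketch below splits one day into 4 hour windows and builds file names in the same style as the ones listed earlier. It does not call osmosis; the function names are hypothetical, and running the extraction for each window and merging the results are left out.

```python
from datetime import datetime, timedelta


def split_interval(start: datetime, end: datetime, step: timedelta):
    """Yield (begin, finish) pairs covering [start, end) in steps of `step`."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    begin = start
    while begin < end:
        # Clamp the final window so it never runs past `end`.
        finish = min(begin + step, end)
        yield begin, finish
        begin = finish


def change_file_name(begin: datetime, finish: datetime) -> str:
    """Name a change file by its hour-resolution start and end times."""
    fmt = "%Y%m%d%H"
    return f"{begin.strftime(fmt)}-{finish.strftime(fmt)}.osc"


day = datetime(2007, 7, 20)
windows = split_interval(day, day + timedelta(days=1), timedelta(hours=4))
names = [change_file_name(b, f) for b, f in windows]

for name in names:
    print(name)
```

Each window could then be extracted separately and the resulting files handed to the merge task in time order.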
So how long will it take to generate a new daily planet from the
previous daily planet? My next task is to pick the biggest daily change
file (20070720-20070721.osc) and apply it to 20070720.osm, but I need to
generate 20070720.osm first. I hope to do this in the next day or so.
In the meantime, this will give some indication. Applying all 2006
change files in a single pipeline against 20060101.osm to produce
20070101.osm took 59m29s. The result is 6146181638 bytes in size (yeah,
huge for some reason, more on that in a sec). So it appears that
generating a new daily change set takes less than 4 minutes in its
current state (the largest daily file above took 3m07s). Applying it to
an existing planet will take very roughly one hour and can be performed
offline.
Now, back to the huge planet file. I have no idea why the planet file
is so big. My first thought was that my 12-month change application
process was wrong so I took a snapshot at 20070101 to verify the
results. The files were *almost* identical in size. The differences
are contained in the file attached to this email.
So two problems to focus on:
1. Why is the planet file so big?
2. Why do I have differences between my snapshotted 20070101 planet and
my derived 20070101 using 12 months of changes?
Problem 1.
I have no idea what is causing this huge planet. My only thought is
that perhaps data exists in the history tables that doesn't exist in the
current tables. I'm really not sure what's going on here. I need to
look into this further.
Problem 2.
I've examined a random sample of the changes between my two 20070101.osm
files. For each change I examined the history of the entity in
question. In every case I've checked, the change can be explained by the
fact that the two most recent history rows (as of beginning 2007) have
identical timestamps. This means my queries sometimes return one row and
sometimes the other, depending on the particular query characteristics.
I don't think there's much I can do about this. Given that it is a very
small set of changes, it is probably something we can live with and fix
on a case by case basis as problems are picked up.
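The ambiguity can be shown with a small Python sketch. The rows below are invented sample data, not taken from the database, and the `row_seq` field is a purely hypothetical tie-breaker; nothing here implies such a column exists in the history tables.

```python
from typing import NamedTuple


class HistoryRow(NamedTuple):
    entity_id: int
    timestamp: str  # ISO 8601 strings sort chronologically
    row_seq: int    # hypothetical tie-breaker, not a known column
    tags: str


def latest_by_timestamp(rows):
    # max() keeps the first maximal element it sees, so when two rows
    # share a timestamp the winner depends on the order of the rows.
    return max(rows, key=lambda r: r.timestamp)


def latest_with_tiebreak(rows):
    # Adding a second sort key makes the choice independent of row order.
    return max(rows, key=lambda r: (r.timestamp, r.row_seq))


a = HistoryRow(42, "2006-12-30T10:15:00", 1, "highway=residential")
b = HistoryRow(42, "2006-12-30T10:15:00", 2, "highway=tertiary")

# Same rows, different order, different "latest" row.
print(latest_by_timestamp([a, b]).tags)
print(latest_by_timestamp([b, a]).tags)

# With a tie-breaker both orders agree.
print(latest_with_tiebreak([a, b]).tags)
print(latest_with_tiebreak([b, a]).tags)
```

Without some second ordering key of this kind, two queries that read the rows in a different order can legitimately disagree, which matches what I'm seeing.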
Congratulations if you've made it this far :-) Next up I'll produce
some timings for creating daily planets and then I'll start looking into
the huge planet in more detail. But for now, bed time.
Cheers,
Brett
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 20070101diff.osc
Type: text/xml
Size: 21157 bytes
Desc: not available
URL: <http://lists.openstreetmap.org/pipermail/dev/attachments/20070822/5f800fc2/attachment.xml>