[OSM-dev] Osmosis Replication Statistics

Brett Henderson brett at bretth.com
Wed Aug 22 14:46:34 BST 2007


Hi All,

I'm inching ever closer towards osmosis being ready for wider use.  This 
is just an FYI for anybody who is interested to know where I'm at.  I've 
recently been testing osmosis replication features against full osm 
history data and fixing up the various issues that I've encountered as a 
result of this.

This email is pretty long.  It starts with some stats breaking down the 
time taken for osmosis to perform various change operations.  For now 
I'm focussing on the generation of snapshot and change files.  Applying 
them to a database will come later; the tasks are feature complete but 
not heavily tested yet, and they are a lower priority.  Later in the 
email I describe a couple of problems I'm facing.

I produced a set of monthly changesets from the start of 2006.  The file 
sizes (in bytes) are as follows:
20060101.osm            53452410
20060101-20060201.osc   29113922
20060201-20060301.osc   39577139
20060301-20060401.osc   25482918
20060401-20060501.osc  279046515
20060501-20060601.osc 1401380463
20060601-20060701.osc  747336625
20060701-20060801.osc  862656139
20060801-20060901.osc  777061744
20060901-20061001.osc  838770700
20061001-20061101.osc 1261867463
20061101-20061201.osc  327296919
20061201-20070101.osc  959647780
20070101-20070201.osc 1223347868
20070201-20070301.osc  298255681
20070301-20070401.osc  326782626
20070401-20070501.osc  802615880
20070501-20070601.osc  851776524
20070601-20070701.osc  848936402
20070701-20070801.osc 1012877148

The most recent one (20070701-20070801.osc) is quite large so I'll focus 
on it.  It took 43m26s to produce.

Splitting into 7 day intervals and ignoring the last couple of days 
produces the following stats.
File                  Size      Duration
20070701-20070708.osc 164027848 10m37s
20070708-20070715.osc 220498572 10m10s
20070715-20070722.osc 347431979 14m42s
20070722-20070729.osc 253686702 13m40s
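
As an illustration of how the interval boundaries above can be derived, 
here is a minimal Python sketch.  It is not part of osmosis; the function 
names split_intervals and change_file_name are made up for this example.  
It cuts a date range into fixed-length steps, drops the trailing partial 
interval, and builds file names in the same style as the table.

```python
from datetime import date, timedelta


def split_intervals(start, end, step):
    """Return consecutive (begin, finish) pairs of length `step` in [start, end).

    A trailing interval shorter than `step` is dropped.
    """
    intervals = []
    begin = start
    while begin + step <= end:
        intervals.append((begin, begin + step))
        begin += step
    return intervals


def change_file_name(begin, finish):
    """Build a name such as 20070701-20070708.osc (naming is an assumption)."""
    return f"{begin:%Y%m%d}-{finish:%Y%m%d}.osc"


names = [
    change_file_name(begin, finish)
    for begin, finish in split_intervals(
        date(2007, 7, 1), date(2007, 8, 1), timedelta(days=7)
    )
]
print("\n".join(names))
```

Passing timedelta(days=1) or timedelta(hours=4) with datetime values gives 
the finer splits used further down, although the hourly file names would 
need an extra %H in the format string.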

Selecting the biggest file in the interval (20070715-20070722.osc) and 
breaking into 1 day intervals produces the following stats.
File                  Size     Duration
20070715-20070716.osc 16684056 1m41s
20070716-20070717.osc 44242820 2m55s
20070717-20070718.osc 62863288 3m02s
20070718-20070719.osc 45727744 2m31s
20070719-20070720.osc 54787426 2m27s
20070720-20070721.osc 77041093 3m07s
20070721-20070722.osc 63953029 2m44s

Selecting the biggest file in the interval (20070720-20070721.osc) and 
breaking into 4 hour intervals produces the following stats.
File                      Size     Duration
2007072000-2007072004.osc  8932383 10.229s
2007072004-2007072008.osc 17590822 17.930s
2007072008-2007072012.osc 12615890 16.762s
2007072012-2007072016.osc 14360116 21.107s
2007072016-2007072020.osc 13339044 23.095s
2007072020-2007072100.osc 10841897 31.845s (a second run took 21.129s)

Selecting the biggest file in the interval (2007072004-2007072008.osc) 
and breaking into 1 hour intervals produces the following stats.
File                      Size    Duration (run 3 times)
2007072004-2007072005.osc 3160144 4.872s,4.655s,4.985s
2007072005-2007072006.osc 4156641 5.249s,5.185s,4.729s
2007072006-2007072007.osc 5766073 9.204s,8.871s,8.638s
2007072007-2007072008.osc 4584598 4.250s,4.650s,3.983s
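
To compare the runs at different interval lengths, it can help to turn 
each size and duration pair into a throughput figure.  The sketch below is 
only an illustration: parse_duration is a made-up helper, and the sample 
rows are copied from the figures above.

```python
import re


def parse_duration(text):
    """Convert a duration such as '43m26s' or '10.229s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


# (file, size in bytes, duration) taken from the figures above
samples = [
    ("20070701-20070801.osc", 1012877148, "43m26s"),
    ("20070715-20070722.osc", 347431979, "14m42s"),
    ("20070720-20070721.osc", 77041093, "3m07s"),
    ("2007072004-2007072008.osc", 17590822, "17.930s"),
    ("2007072006-2007072007.osc", 5766073, "9.204s"),
]

for name, size, duration in samples:
    seconds = parse_duration(duration)
    # Output bytes written per second of extraction time, in MB/s.
    print(f"{name:27s} {size / seconds / 1e6:6.2f} MB/s")
```

This only measures output size against wall-clock time, so it says nothing 
about how much of each run is fixed query overhead.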

Another bit of news is that I've just written merge tasks that will 
allow multiple files (both base osm files and change files) to be merged 
together.  If 3 minutes to produce the daily change file extraction is 
too long, the extraction can be broken down into smaller time intervals 
and the result files merged together afterwards.  Martijn requested this 
feature and may use it to help process his AND files.
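
For anyone unfamiliar with the idea, merging sorted files boils down to a 
k-way merge.  The Python sketch below is not how the osmosis merge tasks 
are implemented; it assumes each input is already sorted by the same key 
(entity type, then id), and the TYPE_RANK ordering and the record layout 
are made up for the example.  It does not try to resolve the case where 
the same entity appears in more than one input.

```python
import heapq

# Hypothetical ordering of entity types; an assumption for this sketch.
TYPE_RANK = {"node": 0, "segment": 1, "way": 2}


def sort_key(record):
    """Order records by entity type, then by id."""
    entity_type, entity_id, _payload = record
    return (TYPE_RANK[entity_type], entity_id)


def merge_streams(*streams):
    """Lazily merge already-sorted streams into one sorted stream."""
    return heapq.merge(*streams, key=sort_key)


first = [("node", 1, "a"), ("node", 5, "b"), ("way", 2, "c")]
second = [("node", 3, "d"), ("segment", 7, "e"), ("way", 1, "f")]

merged = list(merge_streams(first, second))
for record in merged:
    print(record)
```

Because heapq.merge is lazy, only one pending record per input needs to be 
held in memory at a time, which matters for files of this size.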

So how long will it take to generate a new daily planet from the 
previous daily planet?  My next task is to pick the biggest daily change 
file (20070720-20070721.osc) and apply it to 20070720.osm but I need to 
generate 20070720.osm first.  I hope to do this in the next day or so.

In the meantime, this will give some indication.  Applying all 2006 
change files in a single pipeline against 20060101.osm to produce 
20070101.osm took 59m29s.  The result is 6146181638 bytes in size (yeah, 
huge for some reason, more on that in a sec).  So it appears that 
generating a new daily change set takes less than 4 minutes in its 
current state.  Applying this to an existing planet will take very 
approximately one hour and can be performed offline.

Now, back to the huge planet file.  I have no idea why the planet file 
is so big.  My first thought was that my 12-month change application 
process was wrong so I took a snapshot at 20070101 to verify the 
results.  The files were *almost* identical in size.  The differences 
are contained in the file attached to this email.

So two problems to focus on:
1. Why is the planet file so big?
2. Why do I have differences between my snapshotted 20070101 planet and 
my derived 20070101 using 12 months of changes?

Problem 1.
I have no idea what is causing this huge planet.  My only thought is 
that perhaps data exists in the history tables that doesn't exist in the 
current tables.  I'm really not sure what's going on here.  I need to 
look into this further.

Problem 2.
I've examined a random sample of the changes between my two 20070101.osm 
files.  For each change I examined the history of the entity in 
question.  In every case I've checked the change can be explained by the 
fact that the two most recent history rows (as of beginning 2007) have 
identical timestamps.  This means my queries sometimes return one row, 
sometimes the other depending on the particular query characteristics.  
I don't think there's much I can do about this.  Given that it is a very 
small set of changes, it is probably something we can live with and fix 
on a case by case basis as problems are picked up.
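
The effect of identical timestamps can be reproduced in a few lines of 
Python.  This is only an illustration with made-up rows, not the actual 
osmosis queries: list order stands in for whatever order the database 
happens to return rows in, and latest_row is a hypothetical helper.

```python
from datetime import datetime


def latest_row(rows, cutoff):
    """Return the row with the greatest timestamp before `cutoff`.

    When several rows share that timestamp, max() keeps the first one it
    sees, so the result depends on the order of `rows`.
    """
    candidates = [row for row in rows if row["timestamp"] < cutoff]
    return max(candidates, key=lambda row: row["timestamp"])


# Two hypothetical history rows for one entity with the same timestamp.
stamp = datetime(2006, 12, 30, 12, 0, 0)
row_a = {"id": 42, "timestamp": stamp, "tags": "name=A"}
row_b = {"id": 42, "timestamp": stamp, "tags": "name=B"}
cutoff = datetime(2007, 1, 1)

from_snapshot = latest_row([row_a, row_b], cutoff)
from_changes = latest_row([row_b, row_a], cutoff)

print(from_snapshot["tags"], from_changes["tags"])
```

Both results are equally valid answers to "the latest row before the 
cutoff", which is why the two 20070101.osm files can disagree on such 
entities.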

Congratulations if you've made it this far :-)  Next up I'll produce 
some timings for creating daily planets and then I'll start looking into 
the huge planet in more detail.  But for now, bed time.

Cheers,
Brett

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 20070101diff.osc
Type: text/xml
Size: 21157 bytes
Desc: not available
URL: <http://lists.openstreetmap.org/pipermail/dev/attachments/20070822/5f800fc2/attachment.xml>

