On Thu, Apr 29, 2010 at 2:15 AM, Frederik Ramm <span dir="ltr">&lt;<a href="mailto:frederik@remote.org">frederik@remote.org</a>&gt;</span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Scott,<div class="im"><br>
<br>
Scott Crosby wrote:<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
I would like to announce code implementing a binary OSM format that<br>
supports the full semantics of the OSM XML. <br>
</blockquote>
<br></div>
This all sounds very interesting, and you seem to have spent a lot of thought on it and documented it well.<br>
<br>
If I understand it correctly, this is meant to be a replacement for the XML files as a "transport format" for XML data. It is not meant to offer random access in any way, and thus differs from other attempts at creating binary formats that could be used in lieu of databases, having indexes and all.<br>
<br></blockquote><div><br>Roughly, yes, this is intended as a transport format, but the design is flexible.<br><br>If the file is physically ordered so that blocks have strong geographic locality and block metadata includes bounding boxes, then those bounding boxes can be used to skip unneeded blocks. If the file is physically ordered in type/id order (as the current planet is), and block metadata includes the minimum and maximum id for each block, then the metadata can likewise be used to examine only the desired blocks. If both kinds of search are critical, then generate two files, one with geographic locality and one sorted by type/id. Storing 'planet-omitmeta.bin' *AND* 'planet.bin' is still cheaper than storing 'planet-100303.osm.gz'.<br>
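<br>To make the block-skipping idea concrete, here is a rough Python sketch. It is not the actual reader code, and everything in it is an assumption made for illustration: the `BlockHeader` class, its field names, and the two helper functions are hypothetical and are not part of the format as it stands today.<br>

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Tuple

# (min_lon, min_lat, max_lon, max_lat); a hypothetical layout for illustration.
BBox = Tuple[float, float, float, float]


@dataclass
class BlockHeader:
    """Hypothetical per-block metadata, not the real header of the format."""
    offset: int                        # where the block starts in the file
    bbox: Optional[BBox] = None        # present only if the writer recorded it
    entity_type: Optional[str] = None  # "node", "way" or "relation"
    min_id: Optional[int] = None
    max_id: Optional[int] = None


def bbox_overlaps(a: BBox, b: BBox) -> bool:
    """True if the two boxes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def blocks_for_bbox(headers: Iterable[BlockHeader], query: BBox) -> Iterator[BlockHeader]:
    """Yield only the blocks that might hold entities inside `query`."""
    for h in headers:
        # A block without a bounding box cannot be ruled out, so it is kept.
        if h.bbox is None or bbox_overlaps(h.bbox, query):
            yield h


def blocks_for_id(headers: Iterable[BlockHeader], entity_type: str,
                  entity_id: int) -> Iterator[BlockHeader]:
    """Yield only the blocks whose id range could contain the entity."""
    for h in headers:
        if h.entity_type is None or h.min_id is None or h.max_id is None:
            yield h  # no id metadata: the block has to be examined
        elif h.entity_type == entity_type and h.min_id <= entity_id <= h.max_id:
            yield h
```

<br>Blocks that lack the optional metadata are always returned, so a reader built this way stays correct on files that were written without it and merely loses the speedup.<br>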
<br> As you point out, a binary format means different things to different people. I chose this design because it would be useful as-is and it could offer future features without requiring changes to the file format. Geographic searches and searches by type/id merely wait on code being written to physically reorder the file. Adding the appropriate fields to the metadata header is trivial in comparison. In addition, nothing in the design precludes having fileblocks that contain data other than OSM entities. Fileblocks can contain an index from a node ID to the ways and relations that contain it. Metadata headers on these blocks can indicate which block contains the index entries for a particular node.<br>
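<br>As a sketch of what such an index block could hold, the following Python builds a mapping from each node ID to the ways and relations that reference it. The function name and the input shapes (plain dicts of way and relation members) are assumptions for illustration only; no such index exists in the current code.<br>

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Parent = Tuple[str, int]  # ("way", way_id) or ("relation", relation_id)


def build_node_index(
    ways: Dict[int, List[int]],
    relations: Dict[int, List[Tuple[str, int]]],
) -> Dict[int, List[Parent]]:
    """Map each node id to the ways and relations that directly reference it.

    ways:      way id -> ordered list of node ids (hypothetical input shape)
    relations: relation id -> list of (member_type, member_id) pairs
    """
    index = defaultdict(set)
    for way_id, node_ids in ways.items():
        for node_id in node_ids:
            # A set collapses repeats, e.g. a closed way listing a node twice.
            index[node_id].add(("way", way_id))
    for relation_id, members in relations.items():
        for member_type, member_id in members:
            if member_type == "node":
                index[member_id].add(("relation", relation_id))
    # Sorted lists give a deterministic order for writing into a block.
    return {node_id: sorted(parents) for node_id, parents in index.items()}
```

<br>Only direct membership is recorded here; a node that belongs to a relation through a way would need a second lookup on the way.<br>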
<br>My vision is that in many cases it is better to have a simple format that is very dense and lets you skip the 95% of the data that you don't care about, rather than design a very complex or significantly larger format (e.g., a relational database) that lets you skip 99+% of the data that you don't care about. The more advanced formats may return less data, but the simple format still means reading 20 times less data than reading everything.<br>
<br><br></div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Maybe we should be careful about naming these formats to make their purpose clearer. The generic "OSM binary format" seems to mean different things to different people. The file extension ".bin" is perhaps not the best choice.<br>
<br></blockquote><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Have you considered/evaluated "Fast Infoset" and if so, what were the reasons against that?<div class="im"><br>
<br></div></blockquote><div><br>No, I was not aware of that compressed XML design. <br> </div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
<div class="im">
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
It is 5x-10x faster at<br>
reading and writing and 30-50% smaller<br>
</blockquote>
<br></div>
The size figure is obviously compared to bz2; is the "5x-10x faster" also compared to bz2, and if so, compared to the native Java bz2 or the external C one?<div class="im"><br></div></blockquote><div><br>For file sizes, I was comparing to bzip2. For performance, I compared against the gzip'ed planet; I didn't have the patience to compress or decompress that much XML in bzip2.<br>
<br></div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div class="im">
<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
an entire planet, including<br>
all metadata, can be read in about 12 minutes and written in about 50<br>
minutes on a 3 year old dual-core machine. <br>
</blockquote>
<br></div>
How did you measure write performance decoupled from read performance? Surely your 3 year old dual-core machine did not have the 150 gigs of RAM needed to suck the entire planet into memory?<br>
<br></blockquote><div><br>I benchmarked:<br><br> osmosis --read-bin file=planet.bin --write-null<br> osmosis --read-bin file=planet.bin --write-bin file=planet2.bin<br><br>And measured ~12 minutes of CPU time for the first and ~60 minutes of CPU time for the second.<br>
<br>With a dual-core system, using '--b bufferCapacity=20000' gives some concurrency and writing can be done in somewhere around 40 minutes. <br> </div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
You have paid an impressive amount of attention to details in order to achieve the good performance and compression rates that you do. I'm slightly concerned about the robustness of it all - in the past, we often had planet files that were broken one way or the other, and it was usually possible to remedy this with some standard grep, sed, or dd actions - if one of your files ever breaks then I guess it is likely to be complete garbage ;-)<div class="im">
<br></div></blockquote><div><br>Without knowing how those prior planets were broken, I can't say whether analogous breakage of files in my format could be repaired.<br><br> </div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
<div class="im">
<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Probably the most important TODO is packaging and fixing the build system.<br>
I have almost no experience with ant and am unfamiliar with java<br>
packaging practices, so I'd like to request help/advice on ant and suggestions on<br>
how to package the common parsing/serializing code so that it can be<br>
re-used across different programs.<br>
</blockquote>
<br></div>
I suggest you ask on osmosis-dev, and get your new code into the Osmosis trunk quickly so people can play with it.<br>
<br></blockquote><div><br>I think it would be prudent to get suggestions from the OSM community first. Once the code is in osmosis, our ability to make compatibility-breaking changes to the format will be reduced.<br>
<br>Scott<br><br></div></div>