Hello!<br><br>I would like to announce code implementing a binary OSM format that<br>supports the full semantics of the OSM XML. It is 5x-10x faster at<br>reading and writing and 30-50% smaller; an entire planet, including<br>
all metadata, can be read in about 12 minutes and written in about 50<br>minutes on a 3 year old dual-core machine. I have implemented an<br>osmosis reader and writer and have enhancements to the map splitter to<br>read the format. Code is pure Java and uses Google protocol buffers<br>
for the low-level serialization.<br><br>Comparing the file sizes:<br><br> 8.2gb planet-100303.osm.bz2<br>12 gb planet-100303.osm.gz<br> 5.2gb planet-omitmeta.bin<br> 6.2gb planet.bin<br><br>The omitmeta version omits the uid/user/version/timestamp metadata<br>
fields on each entity and are faster to generate and read.<br><br>The design is very extensible. The low-level file format is designed<br>to support random access at the 'fileblock' granularity, where a<br>fileblock can contain ~8k OSM entities. There is *no* tag hardcoding<br>
used; all keys and values are stored in full as opaque strings. For<br>future scalability, 64 bit node/way/relation ID's are assumed. The<br>current serializer preserves the order of OSM entities and tags on OSM<br>entities. To flexibly handle multiple resolutions, the granularity, or<br>
resolution used for representing locations and timestamps is<br>adjustable in multiples of 1 millisecond and 1 nanodegree and can be<br>set independently for each fileblock. The default scaling factor is<br>1000 milliseconds and 100 nanodegrees, corresponding to about ~1cm at<br>
the equator. These are the current resolution of the OSM database.<br><br>Smaller files can be generated. At 10 microdegrees granularity,<br>corresponding to about 1m of resolution, the filesize decreases by<br>about 1gb. Space may also be saved by removing uninteresting UUID<br>
tags or perhaps by having stronger geographic locality when building<br>the file.<br><br>I have also tested the binary format on some SRTM contour lines in OSM<br>0.5 XML format, obtaining about a 50:1 compression ratio. This might<br>
be further improved by choosing a granularity equal to the isohypsis<br>grid size.<br><br>// Testing<br><br>I have tested this code on the Cloudmade extract of Rhode<br>Island. After converting the entire file to and from binary format,<br>
the XML output is bytewise identical to original file except for the<br>one line indicating the osmosis version number.<br><br>When run through the splitter, the output is not bytewise identical to<br>before because of round-off errors 16 digits after the decimal point;<br>
this could be fixed by having the splitter behave like osmosis and<br>only output 7 significant digits.<br><br>// To use:<br><br>Demonstration code is available on github at<br><br> <a href="http://github.com/scrosby/OSM-Osmosis" target="_blank">http://github.com/scrosby/OSM-Osmosis</a> and<br>
<a href="http://github.com/scrosby/OSM-splitter" target="_blank">http://github.com/scrosby/OSM-splitter</a><br><br>See the 'master' branches.<br><br>Please note that this is at present unpackaged demonstration code and<br>
the fileformat may change to incorporate suggestions. Also note that<br>
the shared code between the splitter and osmosis currently lives in<br>the osmosis git repository. You'll also need to go into the<br>crosby.binary directory and run the protocol compiler ('protoc') on<br>the .proto files (See comments in those files for the command line.).<br>
<br><br>/// The design ///<br><br>I use Google protocol buffers for the low-level store. Given a<br>specification file of one or more messages, the protocol buffer<br>compiler writes low-level serialization code. Messages may contain<br>
other messages, forming hierarchical structures. Protocol buffers also<br>support extensibility; new fields can be added to a message and old<br>clients can read those messages without recompiling. For more details,<br>please see <a href="http://code.google.com/p/protobuf/" target="_blank">http://code.google.com/p/protobuf/</a>. Google officially<br>
supports C++, Java, and Python, but compilers exist for other<br>languages. An example message specification is:<br><br>message Node {<br> required sint64 id = 1;<br> required sint64 lat = 7;<br> required sint64 lon = 8;<br>
repeated uint32 keys = 9 [packed = true]; // Denote strings <br> repeated uint32 vals = 10 [packed = true];// Denote strings<br> optional Info info = 11; // Contains metadata<br>}<br><br><br>Protocol buffers use a variable-bit encoding for integers. An integer<br>
is encoded at 7 bits per byte, where the high bit indicates whether or<br>not the next byte is to be read. When messages contain small integers,<br>the filesize is minimized. Two encodings exist, one intended for<br>mostly positive integers and one for signed integers. In the standard<br>
encoding, integers [0,127] require one byte, [128,16383] require two<br>bytes, etc. In the signed number encoding, the sign bit is placed in<br>the least significant position; numbers [-64,63] require one byte,<br>[-8192,8191] require two bytes, and so forth.<br>
<br>To encode OSM entities into protocol buffers, I collect 8k entities to<br>form a 'file block', Within each block, I extract all strings (key,<br>value, role, user) into a separate string table. Thereafter, strings<br>
are referred to by their index into this table. To ensure that<br>frequently used strings have small indexes, I sort the string table by<br>the use frequency for each string. To improve deflate compressibility<br>of the stringtable I then sort strings that have the same frequency<br>
lexiconographically.<br><br>For ways and relations, which contain the ID's of other nodes, I<br>exploit the tendency of consecutive nodes in a way or relation to have<br>nearby node ID's by using delta compression, resulting in small integers.<br>
<br><br>Within each block, I then divide entities into groups that contain<br>consecutive messages all of the same type (node/way/relation), with<br>one interesting case. If there is a batch of consecutive nodes to be<br>
output that have no tags at all, I use a special dense format. I omit<br>the tags and store the group 'columnwise', as an array of ID's, array<br>of latitudes, and array of longitudes, and delta-encode each<br>
column. This reduces header overheads and allows delta-coding to work<br>very effectively. With the default ~1cm granularity, nodes within<br>about 6 km of each other can be represented by as few as 7 bytes<br>each plus the costs of the metadata if it is included.<br>
<br>message DenseNodes {<br> repeated sint64 id = 1 [packed = true]; // DELTA coded<br> repeated sint64 lat = 7 [packed = true]; // DELTA coded<br> repeated sint64 lon = 8 [packed = true]; // DELTA coded<br> repeated Info info = 12;<br>
}<br><br>After being serialized into a string, each block is gzip/deflate <br>compressed individually. When deflate compression is not <br>used, the file is somewhat faster to generate and parse and <br>filesizes increases from 6.2gb to 21gb for the file<br>
containing metadata and from 5.2 to 8.8gb for the file without<br>metadata. This means that metadata accounts for 60% of the<br>uncompressed file size and compression reduces the size of the map<br>data by 45% and the metadata by 87%.<br>
<br>In osmosis, I can select the number of entities in a block, whether<br>deflate compression is used, and the granularity at the command line.<br><br>// Common code<br><br>Unfortunately, osmosis, mkgmap, and the splitter each use a different<br>
representation for OSM entities. This means that I have to reimplement<br>the entity serialization code for each program and there is less<br>common code between the implementations than I would like. I have only<br>implemented a serializer and deserializer for osmosis and a<br>
deserializer for the splitter. Some code is common, for instance the<br>protocol buffer definitions and code for reading and parsing at the<br>file level.<br><br>// Changes to osmosis<br><br>The changes to osmosis are just some new tasks to handle reading and<br>
writing the binary format. Possible future improvements could include<br>extending the fileformat to support changelists or using this format<br>for storing data internally.<br><br>// Changes to the splitter <br><br>As my binary format is about one quarter the size of the cache used by<br>
the current splitter design, I ripped out that code. I then refactored<br>the file reader code of the splitter to operate at the level of<br>entities, not XML tags. Finally, I added the binary parser. When<br>splitting the entire planet file, the first pass takes about 10<br>
minutes to read the file and calculate the regions. The remaining<br>passes are much slower as they generate gzipped XML. I have not<br>enhanced the splitter to generate binary output as mkgmap does not<br>support the binary file format and would require refactoring.<br>
However, I suspect that if it generated binary output, the splitter<br>could do a full split of the planet in one to two hours.<br><br>Currently, the splitter requires that the osmosis be in the classpath<br>in order to be able to access the common code.<br>
<br><br>// Possible enhancements<br><br>In the design, each block is independently decodable and there is<br>support for attaching metadata (E.g., bounding boxes) so that unneeded<br>blocks can be skipped. If nodes and ways and relations are placed into<br>
blocks with high geographic locality, a bounding box test may be able<br>to skip substantial portions of the file and increase<br>efficiency for extracting relevant subsets. Alternative indexing<br>structures are possible, such as cross-references between a blocks.<br>
Currently, no such metadata is generated or used.<br><br>The extendability of protocol buffer messages to support optional<br>fields can be used to cheaply store optional extra data, such as<br>bounding boxes, in nodes, ways and relations. There is no space<br>
penalty in having unused fields in files that do not contain optional<br>data, except that each field requires permanently reserving a tag<br>number (the '= 7'). Small tag numbers in [1,15] are more valuable than<br>
larger numbers because they require only one byte of header<br>overhead. Tag numbers in [16,2047] require two bytes.<br><br>I estimate that adding a bounding box to each way would require 20-25<br>bytes per way for 50M ways or about a gigabyte when deflate<br>
compression is disabled.<br><br>// Rejected design choices<br><br>There are a few design choices I rejected. I chose not to delta encode<br>node ID's, latitudes, and longitudes for nodes that contain tags.<br>Although this would save around 3-5%, it makes the parser and<br>
serializer more complex.<br><br>I considered specially encoding tags with integer-valued or<br>float-valued values. They occur fairly often in a planet dump (~15%),<br>but encoding them in binary saves almost nothing (.2%) and is a<br>
significant slowdown.<br><br>// TODO's<br><br>Probably the most important TODO is packaging and fixing the build system.<br>I have no almost no experience with ant and am unfamiliar with java<br>packaging practices, so I'd like to request help/advice on ant and suggestions on<br>
how to package the common parsing/serializing code so that it can be<br>re-used across different programs.<br><br>For future-compatibility reasons, I need to consider including<br>file-format flags and version numbers in the file header to indicate<br>
when a file includes features that a reader must support. (e.g., dense<br>nodes, as described above.)<br><br>// Questions<br><br>Ideas, suggestions, improvements, and other uses of this format?<br><br>Should I checkin the protobuf.jar and generated code?<br>
<br>Can anyone suggest how to add appropriate ant rules, or know of a<br>convenient ant buildfile generator/editor that integrates well with<br>eclipse.<br><br>// Command lines:<br><br>Some example command lines to generate files, for invocation from<br>
the build directory ~/source/osmosis/eclipse:<br><br><br>java -cp ".:../lib/default/*:/usr/share/java/protobuf.jar" org.openstreetmap.osmosis.core.Osmosis --read-xml file=/mnt/map/planet-100303.osm.gz --lp --write-bin file=planet-all.bin<br>
<br><br>java -cp ".:../lib/default/*:/usr/share/java/protobuf.jar" org.openstreetmap.osmosis.core.Osmosis --read-xml file=/mnt/map/planet-100303.osm.gz --lp --write-bin file=planet.bin omitmetadata=true<br><br>
# Version with ~1m precision <br>
java -cp ".:../lib/default/*:/usr/share/java/protobuf.jar" org.openstreetmap.osmosis.core.Osmosis --read-xml file=/mnt/map/planet-100303.osm.gz --lp --write-bin file=planet-for-display.bin omitmetadata=true granularity=10000<br>
<br><br>