[OSM-dev] New OSM binary fileformat implementation.

Wed Apr 28 21:28:20 BST 2010

I got it built,

Build instructions :
first get the protobuf, and build manually,

  526  svn checkout http://protobuf.googlecode.com/svn/trunk/ protobuf-read-only
  527  ls
  528  cd protobuf-read-only/
  529  ls
  530  bash ./autogen.sh
  531  ls
  532  make
  533  ./configure
  534  make
  535  sudo make install

go into the java dir and use ant to build the jar files.

got to the sources of the OSM tool :
install the protobuf into the lib/compile, normally this will be
handled by ivy, but there are issues.
 cp /home/mdupont/experiments/osm/protobuf-read-only/java/target/protobuf-java-2.3.1-pre.jar
lib/compile/

  607  cd src/
  611  cd crosby/
  613  cd binary/
  614  ls
  615  protoc --java_out=../.. fileformat.proto
  616  protoc --java_out=../.. osmformat.proto

this generates the needed code.

now you can build.

On Wed, Apr 28, 2010 at 7:02 PM, Scott Crosby <scrosby06 at gmail.com> wrote:
> Hello!
>
> I would like to announce code implementing a binary OSM format that
> supports the full semantics of the OSM XML. It is 5x-10x faster at
> reading and writing and 30-50% smaller; an entire planet, including
> all metadata, can be read in about 12 minutes and written in about 50
> minutes on a 3 year old dual-core machine. I have implemented an
> osmosis reader and writer and have enhancements to the map splitter to
> read the format. Code is pure Java and uses Google protocol buffers
> for the low-level serialization.
>
> Comparing the file sizes:
>
>  8.2gb   planet-100303.osm.bz2
> 12  gb   planet-100303.osm.gz
>  5.2gb   planet-omitmeta.bin
>  6.2gb   planet.bin
>
> The omitmeta version omits the uid/user/version/timestamp metadata
> fields on each entity and are faster to generate and read.
>
> The design is very extensible. The low-level file format is designed
> to support random access at the 'fileblock' granularity, where a
> fileblock can contain ~8k OSM entities. There is *no* tag hardcoding
> used; all keys and values are stored in full as opaque strings. For
> future scalability, 64 bit node/way/relation ID's are assumed. The
> current serializer preserves the order of OSM entities and tags on OSM
> entities. To flexibly handle multiple resolutions, the granularity, or
> resolution used for representing locations and timestamps is
> adjustable in multiples of 1 millisecond and 1 nanodegree and can be
> set independently for each fileblock. The default scaling factor is
> 1000 milliseconds and 100 nanodegrees, corresponding to about ~1cm at
> the equator. These are the current resolution of the OSM database.
>
> Smaller files can be generated. At 10 microdegrees granularity,
> corresponding to about 1m of resolution, the filesize decreases by
> about 1gb. Space may also be saved by removing uninteresting UUID
> tags or perhaps by having stronger geographic locality when building
> the file.
>
> I have also tested the binary format on some SRTM contour lines in OSM
> 0.5 XML format, obtaining about a 50:1 compression ratio. This might
> be further improved by choosing a granularity equal to the isohypsis
> grid size.
>
> // Testing
>
> I have tested this code on the Cloudmade extract of Rhode
> Island. After converting the entire file to and from binary format,
> the XML output is bytewise identical to original file except for the
> one line indicating the osmosis version number.
>
> When run through the splitter, the output is not bytewise identical to
> before because of round-off errors 16 digits after the decimal point;
> this could be fixed by having the splitter behave like osmosis and
> only output 7 significant digits.
>
> // To use:
>
> Demonstration code is available on github at
>
>    http://github.com/scrosby/OSM-Osmosis       and
>    http://github.com/scrosby/OSM-splitter
>
> See the 'master' branches.
>
> Please note that this is at present unpackaged demonstration code and
> the fileformat may change to incorporate suggestions. Also note that
> the shared code between the splitter and osmosis currently lives in
> the osmosis git repository. You'll also need to go into the
> crosby.binary directory and run the protocol compiler ('protoc') on
> the .proto files (See comments in those files for the command line.).
>
>
> /// The design ///
>
> I use Google protocol buffers for the low-level store. Given a
> specification file of one or more messages, the protocol buffer
> compiler writes low-level serialization code. Messages may contain
> other messages, forming hierarchical structures. Protocol buffers also
> support extensibility; new fields can be added to a message and old
> clients can read those messages without recompiling. For more details,
> please see http://code.google.com/p/protobuf/. Google officially
> supports C++, Java, and Python, but compilers exist for other
> languages.  An example message specification is:
>
> message Node {
>    required sint64 id = 1;
>    required sint64 lat = 7;
>    required sint64 lon = 8;
>    repeated uint32 keys = 9 [packed = true]; // Denote strings
>    repeated uint32 vals = 10 [packed = true];// Denote strings
>    optional Info info = 11; // Contains metadata
> }
>
>
> Protocol buffers use a variable-bit encoding for integers. An integer
> is encoded at 7 bits per byte, where the high bit indicates whether or
> not the next byte is to be read. When messages contain small integers,
> the filesize is minimized. Two encodings exist, one intended for
> mostly positive integers and one for signed integers. In the standard
> encoding, integers [0,127] require one byte, [128,16383] require two
> bytes, etc. In the signed number encoding, the sign bit is placed in
> the least significant position; numbers [-64,63] require one byte,
> [-8192,8191] require two bytes, and so forth.
>
> To encode OSM entities into protocol buffers, I collect 8k entities to
> form a 'file block', Within each block, I extract all strings (key,
> value, role, user) into a separate string table. Thereafter, strings
> are referred to by their index into this table. To ensure that
> frequently used strings have small indexes, I sort the string table by
> the use frequency for each string. To improve deflate compressibility
> of the stringtable I then sort strings that have the same frequency
> lexiconographically.
>
> For ways and relations, which contain the ID's of other nodes, I
> exploit the tendency of consecutive nodes in a way or relation to have
> nearby node ID's by using delta compression, resulting in small integers.
>
>
> Within each block, I then divide entities into groups that contain
> consecutive messages all of the same type (node/way/relation), with
> one interesting case. If there is a batch of consecutive nodes to be
> output that have no tags at all, I use a special dense format. I omit
> the tags and store the group 'columnwise', as an array of ID's, array
> of latitudes, and array of longitudes, and delta-encode each
> column. This reduces header overheads and allows delta-coding to work
> very effectively. With the default ~1cm granularity, nodes within
> about 6 km of each other can be represented by as few as 7 bytes
> each plus the costs of the metadata if it is included.
>
> message DenseNodes {
>    repeated sint64 id = 1 [packed = true]; // DELTA coded
>    repeated sint64 lat = 7 [packed = true]; // DELTA coded
>    repeated sint64 lon = 8 [packed = true]; // DELTA coded
>    repeated Info info = 12;
> }
>
> After being serialized into a string, each block is gzip/deflate
> compressed individually. When deflate compression is not
> used, the file is somewhat faster to generate and parse and
> filesizes increases from 6.2gb to 21gb for the file
> containing metadata and from 5.2 to 8.8gb for the file without
> metadata. This means that metadata accounts for 60% of the
> uncompressed file size and compression reduces the size of the map
> data by 45% and the metadata by 87%.
>
> In osmosis, I can select the number of entities in a block, whether
> deflate compression is used, and the granularity at the command line.
>
> // Common code
>
> Unfortunately, osmosis, mkgmap, and the splitter each use a different
> representation for OSM entities. This means that I have to reimplement
> the entity serialization code for each program and there is less
> common code between the implementations than I would like. I have only
> implemented a serializer and deserializer for osmosis and a
> deserializer for the splitter. Some code is common, for instance the
> protocol buffer definitions and code for reading and parsing at the
> file level.
>
> // Changes to osmosis
>
> The changes to osmosis are just some new tasks to handle reading and
> writing the binary format.  Possible future improvements could include
> extending the fileformat to support changelists or using this format
> for storing data internally.
>
> // Changes to the splitter
>
> As my binary format is about one quarter the size of the cache used by
> the current splitter design, I ripped out that code. I then refactored
> the file reader code of the splitter to operate at the level of
> entities, not XML tags. Finally, I added the binary parser.  When
> splitting the entire planet file, the first pass takes about 10
> minutes to read the file and calculate the regions. The remaining
> passes are much slower as they generate gzipped XML. I have not
> enhanced the splitter to generate binary output as mkgmap does not
> support the binary file format and would require refactoring.
> However, I suspect that if it generated binary output, the splitter
> could do a full split of the planet in one to two hours.
>
> Currently, the splitter requires that the osmosis be in the classpath
> in order to be able to access the common code.
>
>
> // Possible enhancements
>
> In the design, each block is independently decodable and there is
> support for attaching metadata (E.g., bounding boxes) so that unneeded
> blocks can be skipped. If nodes and ways and relations are placed into
> blocks with high geographic locality, a bounding box test may be able
> to skip substantial portions of the file and increase
> efficiency for extracting relevant subsets. Alternative indexing
> structures are possible, such as cross-references between a blocks.
> Currently, no such metadata is generated or used.
>
> The extendability of protocol buffer messages to support optional
> fields can be used to cheaply store optional extra data, such as
> bounding boxes, in nodes, ways and relations. There is no space
> penalty in having unused fields in files that do not contain optional
> data, except that each field requires permanently reserving a tag
> number (the '= 7'). Small tag numbers in [1,15] are more valuable than
> larger numbers because they require only one byte of header
> overhead. Tag numbers in [16,2047] require two bytes.
>
> I estimate that adding a bounding box to each way would require 20-25
> bytes per way for 50M ways or about a gigabyte when deflate
> compression is disabled.
>
> // Rejected design choices
>
> There are a few design choices I rejected. I chose not to delta encode
> node ID's, latitudes, and longitudes for nodes that contain tags.
> Although this would save around 3-5%, it makes the parser and
> serializer more complex.
>
> I considered specially encoding tags with integer-valued or
> float-valued values. They occur fairly often in a planet dump (~15%),
> but encoding them in binary saves almost nothing (.2%) and is a
> significant slowdown.
>
> // TODO's
>
> Probably the most important TODO is packaging and fixing the build system.
> I have no almost no experience with ant and am unfamiliar with java
> packaging practices, so I'd like to request help/advice on ant and
> suggestions on
> how to package the common parsing/serializing code so that it can be
> re-used across different programs.
>
> For future-compatibility reasons, I need to consider including
> file-format flags and version numbers in the file header to indicate
> when a file includes features that a reader must support. (e.g., dense
> nodes, as described above.)
>
> // Questions
>
> Ideas, suggestions, improvements, and other uses of this format?
>
> Should I checkin the protobuf.jar and generated code?
>
> Can anyone suggest how to add appropriate ant rules, or know of a
> convenient ant buildfile generator/editor that integrates well with
> eclipse.
>
> // Command lines:
>
> Some example command lines to generate files, for invocation from
> the build directory ~/source/osmosis/eclipse:
>
>
> java -cp ".:../lib/default/*:/usr/share/java/protobuf.jar"
> org.openstreetmap.osmosis.core.Osmosis --read-xml
> file=/mnt/map/planet-100303.osm.gz --lp --write-bin file=planet-all.bin
>
>
> java -cp ".:../lib/default/*:/usr/share/java/protobuf.jar"
> org.openstreetmap.osmosis.core.Osmosis --read-xml
> file=/mnt/map/planet-100303.osm.gz --lp --write-bin file=planet.bin
> omitmetadata=true
>
> # Version with ~1m precision
> java -cp ".:../lib/default/*:/usr/share/java/protobuf.jar"
> org.openstreetmap.osmosis.core.Osmosis --read-xml
> file=/mnt/map/planet-100303.osm.gz --lp --write-bin
> file=planet-for-display.bin omitmetadata=true granularity=10000
>
>
>
> _______________________________________________
> dev mailing list
> dev at openstreetmap.org
> http://lists.openstreetmap.org/listinfo/dev
>
>