[OSM-dev] Binary OSM; the first pass encoder

Sun Nov 9 04:33:35 GMT 2008

Hi All,

Because I am getting more and more disappointed with the current state
of affairs with respect to downloading OSM content, some people on the
Dutch OSM IRC channel thought of an alternative way of distribution
that could potentially provide binary diffs against any download made
in the past.

I wrote the first implementation of it in the last couple of hours and
tested it on the Dutch dataset. The current gzip-compressed data is
about 135MB. Extracted, it represents 1.4GB of XML.

The binary file is completely analogous to the XML, with no shortcuts
whatsoever. The first reduction to a binary format containing only the
data shrank the set to 418MB, which bzip2 compresses further to 78MB.

In principle it is nothing more than:
N [long id] [float lat] [float lon] [time_t timestamp]
[uint length of userfield] [non terminated userfield]

And likewise for the other subtrees.
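
For illustration only (this is not the code from the repository), the
node record above could be packed and unpacked roughly like this in
Python. The byte order and field widths are assumptions, since the
layout above does not fix them: little-endian, an 8-byte id, 4-byte
floats, an 8-byte timestamp and a 4-byte length. The names encode_node
and decode_node are hypothetical.

```python
import struct

# Assumed layout (not confirmed by the post): little-endian, 1-byte tag 'N',
# 8-byte signed id, 4-byte float lat, 4-byte float lon, 8-byte timestamp,
# 4-byte unsigned length of the user field.
NODE_HEADER = struct.Struct("<cqffqI")


def encode_node(node_id, lat, lon, timestamp, user):
    """Pack one node: fixed-size header followed by the unterminated user name."""
    user_bytes = user.encode("utf-8")
    header = NODE_HEADER.pack(b"N", node_id, lat, lon, timestamp, len(user_bytes))
    return header + user_bytes


def decode_node(buf, offset=0):
    """Unpack one node record; return its fields and the offset of the next record."""
    tag, node_id, lat, lon, timestamp, user_len = NODE_HEADER.unpack_from(buf, offset)
    if tag != b"N":
        raise ValueError("not a node record")
    start = offset + NODE_HEADER.size
    user = buf[start:start + user_len].decode("utf-8")
    return (node_id, lat, lon, timestamp, user), start + user_len


record = encode_node(42, 52.5, 4.75, 1226205215, "Stefan")
fields, next_offset = decode_node(record)
```

Because the user field carries its own length, records can be read back
to back without any terminator or separator between them.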

As discussed before, it is possible to do a second-pass binary encoding
with all strings in a distinct table, where the linked list can be
recovered as an array from the storage. This would make a significant
difference for the tag keys alone.

In this case all string fields can be converted to unsigned long
fields; for now, 4G of distinct strings seems enough :)
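
A minimal sketch of that second-pass idea in Python, assuming a simple
in-memory table; the StringTable class and its method names are
hypothetical and not taken from the repository. Each distinct string is
stored once, and every record refers to it by its integer index.

```python
class StringTable:
    """Maps each distinct string to a small integer index (hypothetical helper)."""

    def __init__(self):
        self.index = {}    # string -> position in self.strings
        self.strings = []  # position -> string; this array is what gets stored

    def intern(self, s):
        """Return the index of s, adding it to the table on first sight."""
        idx = self.index.get(s)
        if idx is None:
            idx = len(self.strings)
            self.index[s] = idx
            self.strings.append(s)
        return idx

    def lookup(self, idx):
        """Recover the original string from its index."""
        return self.strings[idx]


table = StringTable()
keys = ["highway", "name", "highway", "highway", "name"]
encoded = [table.intern(k) for k in keys]
```

Here five tag keys are written as five integers plus a table of two
strings, which is where the saving on repeated tag keys would come from.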

If you are interested, you can take a peek at:
http://repo.or.cz/w/handlerosm.git?a=tree;f=osmbinary;h=1701a9194285a56e7a91536def314fb8b2e95350;hb=96c7b81af692df89bc6c5eba999e9bb61c92323c

Stefan