[OSM-dev] Help needed for processing the planet.osm (for osmdoc.com)

Mon Aug 24 13:22:41 BST 2009

Hi Lars,

While it's not solving exactly the same problem as yours, the mkgmap splitter
utility is faced with similar challenges. It is written in Java and uses
various techniques to reduce the amount of memory required while processing
the planet.osm file. I've spent quite a bit of time profiling and tuning it, so
hopefully there are some ideas (or code) in there that can help you out. 
For example there are some custom collection-like classes for efficiently 
holding primitives, bit-level storage of data, and conditional use of different 
data structures depending on whether a common case or an uncommon case is
encountered. Quite a bit of effort has also been put into avoiding unnecessary
object construction. Additionally, I checked in an update yesterday that 
creates a disk cache after parsing the planet file for the first time. After 
that it reads from this cache rather than making multiple passes over the 
planet XML file.
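
To give a flavour of the primitive-collection and bit-level ideas, here is a
minimal sketch. It is not the splitter's actual code: the class name
PackedNodeStore, the 1e-7 degree fixed-point scale and the lat/lon packing
layout are all assumptions made up for illustration.

```java
import java.util.Arrays;

/** Hypothetical sketch: node data kept in primitive arrays, no per-node objects. */
final class PackedNodeStore {
    private long[] ids = new long[16];
    // Assumed layout: latitude in the high 32 bits, longitude in the low 32 bits.
    private long[] coords = new long[16];
    private int size;

    /** Converts degrees to fixed-point units of 1e-7 degrees (always fits in an int). */
    static int toFixed(double degrees) {
        return (int) Math.round(degrees * 1e7);
    }

    void add(long id, double lat, double lon) {
        if (size == ids.length) {
            // Grow both arrays together; no boxing and no object per node.
            ids = Arrays.copyOf(ids, size * 2);
            coords = Arrays.copyOf(coords, size * 2);
        }
        ids[size] = id;
        coords[size] = ((long) toFixed(lat) << 32) | (toFixed(lon) & 0xFFFFFFFFL);
        size++;
    }

    long id(int index) { return ids[index]; }

    /** Latitude in 1e-7 degree units (arithmetic shift keeps the sign). */
    int latFixed(int index) { return (int) (coords[index] >> 32); }

    /** Longitude in 1e-7 degree units (the cast keeps the low 32 bits). */
    int lonFixed(int index) { return (int) coords[index]; }

    int size() { return size; }
}
```

Each node then costs 16 bytes of array space, whereas a HashMap of boxed
Long keys to small coordinate objects costs several times that once object
headers and references are counted.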

My suggestion is that you try doing something similar; make one pass over 
the XML that writes out the data to a custom binary format. Then you'll be 
able to make multiple passes over the data much more quickly, processing 
a subset of the data each time. You can choose an appropriately sized subset
of the data depending on how much you want to trade off speed vs memory usage
(that's exactly what the --max-areas parameter does with the splitter).
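
As a rough sketch of the "one pass to binary, then many cheap passes" idea:
the class NodeCache, its method names and the fixed 16-byte record layout
below are hypothetical and are not the splitter's cache format. The XML
parsing itself is left out; the arrays stand in for whatever your parser
produces.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

/** Hypothetical sketch of a binary node cache: 16 bytes per node (id, lat, lon). */
final class NodeCache {

    /** First pass: dump the parsed nodes as fixed-size binary records. */
    static void write(OutputStream target, long[] ids, int[] lats, int[] lons) {
        try (DataOutputStream out =
                 new DataOutputStream(new BufferedOutputStream(target))) {
            for (int i = 0; i < ids.length; i++) {
                out.writeLong(ids[i]);
                out.writeInt(lats[i]);   // fixed-point latitude
                out.writeInt(lons[i]);   // fixed-point longitude
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Later passes: stream the cache and look only at nodes inside one box. */
    static int countInBox(InputStream source,
                          int minLat, int maxLat, int minLon, int maxLon) {
        int count = 0;
        try (DataInputStream in =
                 new DataInputStream(new BufferedInputStream(source))) {
            while (true) {
                try {
                    in.readLong();       // node id, not needed for counting
                } catch (EOFException endOfCache) {
                    break;               // clean end of the record stream
                }
                int lat = in.readInt();
                int lon = in.readInt();
                if (lat >= minLat && lat <= maxLat
                        && lon >= minLon && lon <= maxLon) {
                    count++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }
}
```

In real use you would pass Files.newOutputStream(path) on the first pass and
a fresh Files.newInputStream(path) for each later pass, with one pass per
subset (here, one bounding box) so that only that subset's working data has
to sit in memory.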

You can grab the splitter from here if you want to take a look:

http://www.mkgmap.org.uk/page/tile-splitter

I've also worked on other similar problems at my job where I've used in-memory 
compression of data to greatly reduce the RAM required. This approach depends 
a lot on being able to find a good way to exploit any redundancy in the particular 
data you're working with.
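
One generic example of exploiting redundancy, not necessarily what I used
at work and not taken from the splitter: sorted IDs tend to sit close
together, so storing the gaps between them as variable-length integers
takes far fewer bytes than storing each 64-bit value. The class name
DeltaVarint is made up, and the sketch assumes the IDs are non-negative and
sorted in ascending order.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical sketch: delta + base-128 varint encoding of sorted, non-negative IDs. */
final class DeltaVarint {

    static byte[] encode(long[] sortedIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long previous = 0;
        for (long id : sortedIds) {
            long delta = id - previous;       // small when IDs are clustered
            while ((delta & ~0x7FL) != 0) {
                // Emit the low 7 bits with the continuation bit set.
                out.write((int) ((delta & 0x7F) | 0x80));
                delta >>>= 7;
            }
            out.write((int) delta);           // final byte, continuation bit clear
            previous = id;
        }
        return out.toByteArray();
    }

    static long[] decode(byte[] data, int count) {
        long[] ids = new long[count];
        int pos = 0;
        long previous = 0;
        for (int i = 0; i < count; i++) {
            long delta = 0;
            int shift = 0;
            byte b;
            do {
                b = data[pos++];
                delta |= (long) (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            previous += delta;                // undo the delta step
            ids[i] = previous;
        }
        return ids;
    }
}
```

How much this saves depends entirely on how clustered the values are, so
it is worth measuring on a real extract before committing to it.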

I'm happy to discuss this further with you offline if you like.

Chris