[OSM-dev] Help needed for processing the planet.osm (for osmdoc.com)

Lars Francke lars.francke at gmail.com
Mon Aug 17 14:37:09 BST 2009


Hi,

I'm in the process of updating osmdoc.com after several people have
reminded me that the data is a bit... old :)
The following explanations are rather technical and Java-centric,
but the underlying problem should be language-independent.

Previously I used several Map/Reduce jobs running on Hadoop. That
took two days and multiple steps and was very complicated. I've
decided to rewrite this import process. My first goal is that it
should be a "fire and forget" thing: I just want to start the
process and get the results. The second goal is that it should
finish in under a week :)

I'm currently parsing the planet.osm file using StAX and building
several maps in memory. One maps a key (e.g. "amenity") to an array
of integers holding the current counts for changesets, nodes,
relations, ways and distinct values. Another maps values (e.g.
"pub") to an array of integers, and so on. With several hundred
million tags these maps grow too large for my RAM very quickly. I'm
looking for ideas on how to solve this problem and how to process
this data in a reasonably performant way.
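
To make that concrete, the accumulation step looks roughly like this
(a stripped-down sketch; class and variable names are made up, and
the real code also tracks per-value counts and the set of distinct
values per key):

import java.io.FileInputStream;
import java.util.HashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class TagCounter {
    // Slots in the per-key count array.
    static final int CHANGESET = 0, NODE = 1, RELATION = 2, WAY = 3;

    public static void main(String[] args) throws Exception {
        Map<String, int[]> keyCounts = new HashMap<String, int[]>();
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream(args[0]));
        int current = -1;
        while (reader.hasNext()) {
            if (reader.next() != XMLStreamConstants.START_ELEMENT) continue;
            String name = reader.getLocalName();
            if ("changeset".equals(name)) current = CHANGESET;
            else if ("node".equals(name)) current = NODE;
            else if ("relation".equals(name)) current = RELATION;
            else if ("way".equals(name)) current = WAY;
            else if ("tag".equals(name) && current != -1) {
                String key = reader.getAttributeValue(null, "k");
                int[] counts = keyCounts.get(key);
                if (counts == null) {
                    counts = new int[4];
                    keyCounts.put(key, counts);
                }
                // This map (and the equivalent one for values) is what
                // outgrows the heap.
                counts[current]++;
            }
        }
        reader.close();
    }
}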

Things I've tried already:
- I've used Ehcache with eviction of elements to a backing DiskStore
(see the first sketch after this list). While this works flawlessly,
Ehcache keeps an index of the DiskStore in memory, and eventually
this index becomes too large for my memory, too :) There is no
option to disable this behaviour (I know it is done for performance
reasons, but that is not as important in this case). From the
documentation I gathered that OSCache works the same way, so I
didn't bother trying it.

- At the moment I'm testing JBoss Cache, and with "cache
passivation" it seems to have the feature I need. Unfortunately
there are problems here as well, which have to do with the way
elements are evicted (i.e. passivated) to and loaded (i.e.
activated) from the backing store on separate threads. Another
possibility is that I'm just not doing it right ;-) So if there is a
JBoss Cache expert in the room, please stand up! Either way, JBoss
Cache seems like overkill for this job.

- The very first thing I tried was writing everything directly to
the database (PostgreSQL) as soon as it no longer fit into memory,
roughly in the row-at-a-time fashion of the second sketch after this
list. The performance was just horrible, and it would have taken far
too many days to process a single planet.osm.
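
For the Ehcache attempt, the setup was along these lines (a sketch
from memory; the cache name and sizes are illustrative, not what I
actually used):

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;

public class EhcacheSetup {
    public static void main(String[] args) {
        CacheManager manager = CacheManager.create();
        // Constructor args: name, maxElementsInMemory, overflowToDisk,
        // eternal, timeToLiveSeconds, timeToIdleSeconds.
        Cache keyCounts = new Cache("keyCounts", 500000, true, true, 0, 0);
        manager.addCache(keyCounts);
        // Elements beyond the in-memory limit are evicted to the
        // DiskStore, but Ehcache keeps the DiskStore's key index on
        // the heap, and that index is what eventually fills my memory.
        keyCounts.put(new Element("amenity", new int[5]));
        manager.shutdown();
    }
}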
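
And the direct-to-database variant was essentially this (simplified;
the table and column names are made up). With one statement, and
under autocommit one transaction, per tag, that's hundreds of
millions of round trips, which is presumably where the time goes:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class DirectDbWrite {
    public static void main(String[] args) throws Exception {
        // Connection details are placeholders.
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/osmdoc", "user", "password");
        PreparedStatement ps = conn.prepareStatement(
                "UPDATE key_counts SET node_count = node_count + 1"
                        + " WHERE k = ?");
        // Executed once per tag seen in the planet file.
        ps.setString(1, "amenity");
        ps.executeUpdate();
        ps.close();
        conn.close();
    }
}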

So what I'm looking for is a simple (I really don't want to have to
set up another Hadoop cluster) but still reasonably performant way
to process the planet.osm and aggregate the needed data into a
format suitable for importing into a relational database. Any ideas
or input are welcome, and if anyone wants the source code for what
I've done so far, please email me directly. At the moment I'm stuck
and out of ideas :(

Cheers,
Lars



