[OSM-dev] Help needed for processing the planet.osm (for osmdoc.com)

Lars Francke lars.francke at gmail.com
Tue Aug 18 12:35:48 BST 2009


On Tue, Aug 18, 2009 at 09:54, Jochen Topf <jochen at remote.org> wrote:
>> I'll need output in the following form:
>> tag-key, number of changesets, nodes, relations and ways this key is
>> used on, number of distinct values
>> tag-value, the tag-key this value belongs to, number of changesets,
>> nodes, relations and ways this value is used on
>>
>> Additionally the following information would be nice:
>> key/key combinations and how often these two keys are used together
>> on changesets, nodes, relations and ways
>
> I think the bad news is that this kind of job really needs to be done in RAM.
> Using the disk you'll just be paging blocks in and out all the time.

Yep...

> I have a Perl job doing almost exactly what you want to do. It reads CSV
> files (which have been generated in an earlier step from the planet XML)
> and does all the counting and spits out CSV again. I don't know how
> much memory it needs at the moment (should probably check that :-), but
> it fits in the 16GB the machine has. It counts the number of nodes, ways
> and relations having the same tag key, key-value combo and key-key combo.
> It takes about three hours on my machine.

16 GB would be _really_ nice :)
Unfortunately I only have 2 to 3 GB, which is why I run into these
swapping problems.
But it seems as if you generate exactly the data I need (except the
changeset tags, but I can generate those easily). Mind sharing? :)
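For reference, the counting step Jochen describes could be sketched in Java
roughly like this. The "objectType,key,value" CSV layout is my assumption (his
actual format may differ), and keeping a set of distinct values per key is
exactly what blows up memory for high-cardinality keys like 'name':

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch only: tallies how often each tag key appears per object type
// and how many distinct values each key has seen.
public class TagCounter {
    // key -> counts for [nodes, ways, relations]
    final Map<String, long[]> keyCounts = new HashMap<>();
    // key -> set of distinct values (this is the memory hog)
    final Map<String, Set<String>> distinctValues = new HashMap<>();

    void count(String line) {
        // limit=3 so commas inside the value column are preserved
        String[] cols = line.split(",", 3);
        String type = cols[0], key = cols[1], value = cols[2];

        long[] counts = keyCounts.computeIfAbsent(key, k -> new long[3]);
        if (type.equals("node")) {
            counts[0]++;
        } else if (type.equals("way")) {
            counts[1]++;
        } else if (type.equals("relation")) {
            counts[2]++;
        }
        distinctValues.computeIfAbsent(key, k -> new HashSet<>()).add(value);
    }
}
```

With the whole planet this map has to hold every key and every distinct
value at once, which is where the RAM requirement comes from.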

> You could probably save RAM by using some kind of tighter data structure.
> Depending on the programming language you might be wasting huge amounts
> of memory. But not everybody wants to write in C.

C is not an option for me. I profiled my applications and I am quite
happy with the memory footprint of a single entry in my cache.

> If you absolutely can't get by with the RAM, you might need some kind of
> least-recently-used cache where you can keep the counters for the tags used
> most often in RAM and put the others on disk. Maybe use the results from
> the previous run to optimize this.

That's what I tried (Ehcache and JBoss Cache both use, or can be
configured to use, an LRU eviction policy) but neither worked for me,
for various reasons. I really don't want to implement this myself just
for this :) I'm looking for anyone who has already (and successfully)
done this (preferably in Java), because I ran into problems.

> You might also get a performance
> increase if you special case certain tags. For instance you know that
> it doesn't really make much sense to count how often the different values
> for the 'name' tag appear. There are other keys like this, like strange
> id tags from imports.

While you are correct that these numbers don't make _a lot_ of sense
for some special tags (e.g. 'name'), I don't want to special-case any
tags. That was one of the motivations for starting osmdoc.com: I wanted
to look at the name and created_by tags and so on. So special-casing
certain keys or values is not an option for me.

Thanks a lot for your input. While I still have no solution, at least
I now know that I'm not just doing it wrong :)

Lars
