[OSM-dev] Tagstat History/Change over time
scrosby at cs.rice.edu
Sun Sep 5 05:30:56 BST 2010
I have some code for collecting tag statistics that may be useful. It
doesn't do as much as osmdoc/tagstat does, but it is a *lot* less
resource intensive. It runs about as fast as osmosis can parse a
planet dump and collects statistics entirely in-memory. The current
configuration uses a gigabyte or so of RAM. Presently it is sitting in
github in my old osmosis work. I haven't had a chance to repackage it
for new osmosis.
For each tag, it records:
  - If it has N < 8000 or so distinct values, it tracks the set of
    distinct values and the use count of each.
  - For 8000 < N < 500000 distinct values, it reports an *estimate* of
    the number of distinct values.
  - For N > 500000 distinct values, it reports ">500000" distinct values.
How it works:
Iterate over each tag. For each key, store its value into a
HashMap<String,Integer> recording the frequency. If the hashmap has
more than 1000 distinct values in it, stop storing the values. Instead
insert each value into a Bloom filter of 1000000 bits. (Idea: for each
value, do filter.setBit(value.hash() % 1000000). Then at the end,
count the number of '1' bits. If the bloom filter is more than half
full, just report >500000.)
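
A minimal Java sketch of the scheme described above, for illustration only;
it is not the code from the github repository. The class name TagValueStats,
its method names, and the constructor parameters are all hypothetical. Two
details are assumptions on my part: the values already held in the map are
replayed into the bitmap at the switch so they are not lost, and
correctedEstimate() applies the standard linear-counting correction
-m * ln(1 - ones/m), which the description above does not mention. The raw
count of '1' bits is only a lower bound, because distinct values can hash to
the same bit.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Value statistics for one key: exact counts while small, a bitmap afterwards. */
class TagValueStats {
    private final int exactLimit;   // hypothetical: max distinct values tracked exactly
    private final int filterBits;   // hypothetical: size of the single-hash bitmap
    private Map<String, Integer> counts = new HashMap<>();
    private BitSet filter;          // stays null while we are still in exact mode

    TagValueStats(int exactLimit, int filterBits) {
        this.exactLimit = exactLimit;
        this.filterBits = filterBits;
    }

    void add(String value) {
        if (filter != null) {
            setBit(value);
            return;
        }
        counts.merge(value, 1, Integer::sum);
        if (counts.size() > exactLimit) {
            // Too many distinct values: stop storing them and switch to the bitmap.
            // Assumption: replay what we have so far so it is still counted.
            filter = new BitSet(filterBits);
            for (String seen : counts.keySet()) {
                setBit(seen);
            }
            counts = null;
        }
    }

    private void setBit(String value) {
        // floorMod keeps the index non-negative even for negative hash codes.
        filter.set(Math.floorMod(value.hashCode(), filterBits));
    }

    boolean isExact() {
        return filter == null;
    }

    /** Use count of a value; only meaningful while in exact mode. */
    int count(String value) {
        return isExact() ? counts.getOrDefault(value, 0) : -1;
    }

    /** Number of '1' bits, a lower bound on the number of distinct values. */
    int setBits() {
        return isExact() ? 0 : filter.cardinality();
    }

    /** Assumption, not from the post: linear-counting correction for collisions. */
    double correctedEstimate() {
        double emptyFraction = 1.0 - (double) setBits() / filterBits;
        return -filterBits * Math.log(emptyFraction);
    }

    String report() {
        if (isExact()) {
            return counts.size() + " distinct values";
        }
        int ones = setBits();
        if (ones > filterBits / 2) {
            // More than half full: the estimate is no longer trustworthy.
            return ">" + (filterBits / 2) + " distinct values";
        }
        return "~" + ones + " distinct values (estimate)";
    }
}
```

With new TagValueStats(1000, 1000000) per key, this gives the three tiers
listed above: exact counts, an estimate, or ">500000". Keeping one such
object per key in a HashMap<String,TagValueStats> is one way to wire it up.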