[OSM-talk] [Tagging] Introducing Taginfo

Sun Oct 24 10:56:05 BST 2010

On Thu, Oct 21, 2010 at 06:07:33PM -0500, Scott Crosby wrote:
> On Wed, Oct 13, 2010 at 2:38 AM, Jochen Topf <jochen at remote.org> wrote:
> 
> > On Tue, Oct 12, 2010 at 10:58:34PM +0200, Sebastian Klein wrote:
> > > Jochen Topf wrote:
> > >> On Tue, Oct 05, 2010 at 04:46:26PM +0200, Pieren wrote:
> > >>> If I could have one request, it would be nice to see the amount of
> > different
> > >>> contributors using the same tag. This to distinguish between quantity
> > and
> > >>> popularity. I know it might be challenging since we should only count
> > the
> > >>> user of the tag creation in the element history...
> > >>
> > >> On http://taginfo.openstreetmap.de/keys there is a 'users' column. But
> > this
> > >> doesn't look at the history, only the current use. It gives you still
> > some
> > >> idea, but its not perfect. But reading the history is not an option at
> > the
> > >> moment, because this would need far more resources.
> > >>
> > >> The number of users is also taking into account when creating the tag
> > cloud
> > >> for the home page. This way some tags from imports which are very common
> > in
> > >> the database but have a small user count are downgraded. :-)
> > >>
> > >> Jochen
> > >
> > > Is it planned to have users count for the individual key pages? It can
> > > be interesting to see how popular common_key=some_exotic_value really is.
> > > Sometimes it is used frequently, but by a single mapper only.
> >
> > Users are counted for keys only and not for key=value combinations, because
> > there are just too many key=value combinations and too many users to do
> > this
> > counting efficiently. At least I haven't come up with an idea how to do
> > this.
> > maybe somebody else can.
> >
> > Currently for every key I create a hash with all users in it, that use this
> > key. When I am through all the tags, I count how many elements there are in
> > each hash and thats the number of users for this key. This is rather
> > inefficient and could probably be improved using some clever hashing for
> > the price of some inaccuracies (which don't matter too much in this case,
> > all we really want to know is roughly how many users there are).
> 
> 
> Very nice work you have and a very neat tool you wrote.
> 
> I have an idea that might help with these counts. Bloom filters. For
> instance, a few months ago, I did a test where I counted the frequency of
> different various values for each key across a whole planet in only a few
> tens of megabytes of RAM. I used a hash table for counting, but when there
> were too many distinct values for a key (>1000), I switched to a bloom
> filter of 1000000 bits and could only report the number of distinct values
> for that key. If there were >500000 distinct values, I just reported
> ">500000". Something similar might work in other areas of your design. I
> think it could solve this:
>
> > But even when this is done in a more efficient manner, we can't to that
> > for 50 million different key=value combinations. We might be able to do
> > it for the more popular combinations, after all if a key=value combination
> > only appears twice in the whole database, it doesn't really matter if that
> > was from one or two users.
> >
> 
> in about 8gb of RAM. Have a smallish bloom filter, say, 64 bits for each of
> the 50m distinct tags. Store the user that uses that tag in the bloom
> filter. You'll need 50m of them (only 400mb), plus the hash table from tag
> to bloom. Then, figure out which bloom filters have only one bit set. Those
> tag pairs have, with high probability, only been written by one user. In a
> second pass, figure out which user was responsible for causing that bit to
> be set. The hash table from tags to bloom filters will require more memory
> than the filters themselves, but you could cheat. Have an array of 512
> million 64-bit bloom filters. Use hashing to map from a tag to the index of
> the bloom filter in that  array and ignore the chances of collisions. To
> detect accidental collisions between where two tags map to the same bloom
> filter, or two users map to the same index in the bloom filter, use a keyed
> hash function and repeat the above a 3-5 times with different keys. Use the
> minimum value across the different passes.

I have been very reluctant to do anything that gives me results that are not
100% accurate. But I see that this might be too much to ask for with the kind
of ressources we have.

There are some interesting ideas here. In this case the bloom filter could give
me exactly the needed answer, namely whether there are lots of users using a
tag or only a few. I already have the hash table from tag to some structure
that I use to count several things, so I only need the bloom filter. But the
bloom filter would have to be bigger than 64 bits, wouldn't it? User IDs are
in the range from 1 to about 360000 currently. If I want to know with a
certain probability whether a tag has less than or more than n users, how
can I figure out how big the bloom filter has to be?

Jochen
-- 
Jochen Topf  jochen at remote.org  http://www.remote.org/jochen/  +49-721-388298