[OSM-talk] [Tagging] Introducing Taginfo

Scott Crosby scrosby at cs.rice.edu
Fri Oct 22 00:07:33 BST 2010


On Wed, Oct 13, 2010 at 2:38 AM, Jochen Topf <jochen at remote.org> wrote:

> On Tue, Oct 12, 2010 at 10:58:34PM +0200, Sebastian Klein wrote:
> > Jochen Topf wrote:
> >> On Tue, Oct 05, 2010 at 04:46:26PM +0200, Pieren wrote:
> >>> If I could have one request, it would be nice to see the amount of
> different
> >>> contributors using the same tag. This to distinguish between quantity
> and
> >>> popularity. I know it might be challenging since we should only count
> the
> >>> user of the tag creation in the element history...
> >>
> >> On http://taginfo.openstreetmap.de/keys there is a 'users' column. But
> this
> >> doesn't look at the history, only the current use. It gives you still
> some
> >> idea, but its not perfect. But reading the history is not an option at
> the
> >> moment, because this would need far more resources.
> >>
> >> The number of users is also taking into account when creating the tag
> cloud
> >> for the home page. This way some tags from imports which are very common
> in
> >> the database but have a small user count are downgraded. :-)
> >>
> >> Jochen
> >
> > Is it planned to have users count for the individual key pages? It can
> > be interesting to see how popular common_key=some_exotic_value really is.
> > Sometimes it is used frequently, but by a single mapper only.
>
> Users are counted for keys only and not for key=value combinations, because
> there are just too many key=value combinations and too many users to do
> this
> counting efficiently. At least I haven't come up with an idea how to do
> this.
> maybe somebody else can.
>
> Currently for every key I create a hash with all users in it, that use this
> key. When I am through all the tags, I count how many elements there are in
> each hash and thats the number of users for this key. This is rather
> inefficient and could probably be improved using some clever hashing for
> the price of some inaccuracies (which don't matter too much in this case,
> all we really want to know is roughly how many users there are).


Very nice work you have and a very neat tool you wrote.

I have an idea that might help with these counts. Bloom filters. For
instance, a few months ago, I did a test where I counted the frequency of
different various values for each key across a whole planet in only a few
tens of megabytes of RAM. I used a hash table for counting, but when there
were too many distinct values for a key (>1000), I switched to a bloom
filter of 1000000 bits and could only report the number of distinct values
for that key. If there were >500000 distinct values, I just reported
">500000". Something similar might work in other areas of your design. I
think it could solve this:


> But even when this is done in a more efficient manner, we can't to that
> for 50 million different key=value combinations. We might be able to do
> it for the more popular combinations, after all if a key=value combination
> only appears twice in the whole database, it doesn't really matter if that
> was from one or two users.
>

in about 8gb of RAM. Have a smallish bloom filter, say, 64 bits for each of
the 50m distinct tags. Store the user that uses that tag in the bloom
filter. You'll need 50m of them (only 400mb), plus the hash table from tag
to bloom. Then, figure out which bloom filters have only one bit set. Those
tag pairs have, with high probability, only been written by one user. In a
second pass, figure out which user was responsible for causing that bit to
be set. The hash table from tags to bloom filters will require more memory
than the filters themselves, but you could cheat. Have an array of 512
million 64-bit bloom filters. Use hashing to map from a tag to the index of
the bloom filter in that  array and ignore the chances of collisions. To
detect accidental collisions between where two tags map to the same bloom
filter, or two users map to the same index in the bloom filter, use a keyed
hash function and repeat the above a 3-5 times with different keys. Use the
minimum value across the different passes.

Scott
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openstreetmap.org/pipermail/talk/attachments/20101021/bf2622be/attachment.html>


More information about the talk mailing list