[OSM-talk] [Tagging] Introducing Taginfo

Sun Oct 24 11:58:49 BST 2010

Hi,
I'm looking to match about 700 features, so 2,000 should be enough to
account for the variable tagging.
I guess it could be run again, for 5,000 primary key/value pairs to
get a better sample of the planets feature tagging system.

cheers,
sam

On 10/24/10, Jochen Topf <jochen at remote.org> wrote:
> On Thu, Oct 21, 2010 at 06:07:33PM -0500, Scott Crosby wrote:
>> On Wed, Oct 13, 2010 at 2:38 AM, Jochen Topf <jochen at remote.org> wrote:
>>
>> > On Tue, Oct 12, 2010 at 10:58:34PM +0200, Sebastian Klein wrote:
>> > > Jochen Topf wrote:
>> > >> On Tue, Oct 05, 2010 at 04:46:26PM +0200, Pieren wrote:
>> > >>> If I could have one request, it would be nice to see the amount of
>> > different
>> > >>> contributors using the same tag. This to distinguish between
>> > >>> quantity
>> > and
>> > >>> popularity. I know it might be challenging since we should only
>> > >>> count
>> > the
>> > >>> user of the tag creation in the element history...
>> > >>
>> > >> On http://taginfo.openstreetmap.de/keys there is a 'users' column.
>> > >> But
>> > this
>> > >> doesn't look at the history, only the current use. It gives you still
>> > some
>> > >> idea, but its not perfect. But reading the history is not an option
>> > >> at
>> > the
>> > >> moment, because this would need far more resources.
>> > >>
>> > >> The number of users is also taking into account when creating the tag
>> > cloud
>> > >> for the home page. This way some tags from imports which are very
>> > >> common
>> > in
>> > >> the database but have a small user count are downgraded. :-)
>> > >>
>> > >> Jochen
>> > >
>> > > Is it planned to have users count for the individual key pages? It can
>> > > be interesting to see how popular common_key=some_exotic_value really
>> > > is.
>> > > Sometimes it is used frequently, but by a single mapper only.
>> >
>> > Users are counted for keys only and not for key=value combinations,
>> > because
>> > there are just too many key=value combinations and too many users to do
>> > this
>> > counting efficiently. At least I haven't come up with an idea how to do
>> > this.
>> > maybe somebody else can.
>> >
>> > Currently for every key I create a hash with all users in it, that use
>> > this
>> > key. When I am through all the tags, I count how many elements there are
>> > in
>> > each hash and thats the number of users for this key. This is rather
>> > inefficient and could probably be improved using some clever hashing for
>> > the price of some inaccuracies (which don't matter too much in this
>> > case,
>> > all we really want to know is roughly how many users there are).
>>
>>
>> Very nice work you have and a very neat tool you wrote.
>>
>> I have an idea that might help with these counts. Bloom filters. For
>> instance, a few months ago, I did a test where I counted the frequency of
>> different various values for each key across a whole planet in only a few
>> tens of megabytes of RAM. I used a hash table for counting, but when there
>> were too many distinct values for a key (>1000), I switched to a bloom
>> filter of 1000000 bits and could only report the number of distinct values
>> for that key. If there were >500000 distinct values, I just reported
>> ">500000". Something similar might work in other areas of your design. I
>> think it could solve this:
>>
>> > But even when this is done in a more efficient manner, we can't to that
>> > for 50 million different key=value combinations. We might be able to do
>> > it for the more popular combinations, after all if a key=value
>> > combination
>> > only appears twice in the whole database, it doesn't really matter if
>> > that
>> > was from one or two users.
>> >
>>
>> in about 8gb of RAM. Have a smallish bloom filter, say, 64 bits for each
>> of
>> the 50m distinct tags. Store the user that uses that tag in the bloom
>> filter. You'll need 50m of them (only 400mb), plus the hash table from tag
>> to bloom. Then, figure out which bloom filters have only one bit set.
>> Those
>> tag pairs have, with high probability, only been written by one user. In a
>> second pass, figure out which user was responsible for causing that bit to
>> be set. The hash table from tags to bloom filters will require more memory
>> than the filters themselves, but you could cheat. Have an array of 512
>> million 64-bit bloom filters. Use hashing to map from a tag to the index
>> of
>> the bloom filter in that  array and ignore the chances of collisions. To
>> detect accidental collisions between where two tags map to the same bloom
>> filter, or two users map to the same index in the bloom filter, use a
>> keyed
>> hash function and repeat the above a 3-5 times with different keys. Use
>> the
>> minimum value across the different passes.
>
> I have been very reluctant to do anything that gives me results that are not
> 100% accurate. But I see that this might be too much to ask for with the
> kind
> of ressources we have.
>
> There are some interesting ideas here. In this case the bloom filter could
> give
> me exactly the needed answer, namely whether there are lots of users using a
> tag or only a few. I already have the hash table from tag to some structure
> that I use to count several things, so I only need the bloom filter. But the
> bloom filter would have to be bigger than 64 bits, wouldn't it? User IDs are
> in the range from 1 to about 360000 currently. If I want to know with a
> certain probability whether a tag has less than or more than n users, how
> can I figure out how big the bloom filter has to be?
>
> Jochen
> --
> Jochen Topf  jochen at remote.org  http://www.remote.org/jochen/
> +49-721-388298
>
>
> _______________________________________________
> talk mailing list
> talk at openstreetmap.org
> http://lists.openstreetmap.org/listinfo/talk
>

-- 
Twitter: @Acrosscanada
Blogs: http://acrosscanadatrails.posterous.com/
http://Acrosscanadatrails.blogspot.com
Facebook: http://www.facebook.com/sam.vekemans
Skype: samvekemans
IRC: irc://irc.oftc.net #osm-ca Canadian OSM channel (an open chat room)
@Acrosscanadatrails