[Talk-us] Finding one-trick ponies

Frederik Ramm frederik at remote.org
Thu Jul 6 22:23:26 UTC 2017


Hi,

in case someone wants to run their own analyses on what might constitute
spam edits, here are a couple of steps I took to come up with the numbers
posted over in the "SEO Damage" thread. Some Perl required!

1. Download the latest changeset dump (changesets-latest.osm.bz2)

2. Here's a small Perl script that counts the changesets per user
and lists those that are the respective user's only work:

--- cut ---

while(<>)
{
    if (/<changeset id="(\d+)".*user="([^"]*)".*num_changes=\"(\d+)"/)
    {
        if (defined($user))
        {
            push(@{$changesets->{$user}}, [ $id, $changes, $comment, $editor ]);
            $num->{$user}++;
        }
        ($id, $user, $changes) = ($1, $2, $3);
        undef $comment;
        undef $editor;
    }
    elsif (/<tag k="comment" v="(.*)"/)
    {
        $comment = $1;
    }
    elsif (/<tag k="created_by" v="(.*)"/)
    {
        $editor = $1;
    }
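}

# the loop above stores a changeset only when the next one starts,
# so the last one in the dump has to be stored here
if (defined($user))
{
    push(@{$changesets->{$user}}, [ $id, $changes, $comment, $editor ]);
    $num->{$user}++;
}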

foreach my $user(keys %$num)
{
    # you could change the below to "next if ($num->{$user}>2)"
    # if you wanted to list those that have one or two changesets etc.
    next unless($num->{$user}==1);
    # this grabs the user's first changeset which in my configuration
    # is also the only changeset, you might need a loop here if you
    # want to output multiple
    $cs = $changesets->{$user}->[0];
    # the below quits if the changeset has more than one edit
    next unless ($cs->[1]==1);
    # output user name, changeset id, comment, and editor
    printf '"%s",%d,"%s", "%s"%s', $user, $cs->[0], $cs->[2], $cs->[3], "\n";
}

--- cut ---

Run this with

bzcat changesets-latest.osm.bz2 | perl myscript.pl > changesets.csv

This is what gave me the initial list of 140k changesets.
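
For anyone who would rather not use Perl, here is a rough Python sketch of
the same step. It is not the script that produced the numbers above. The
function name and the sample data are made up for illustration, and it
assumes that changesets without a user attribute can simply be skipped. It
uses a real XML parser and the csv module, so comments containing commas or
quotes come out properly escaped.

```python
import csv
import io
import sys
import xml.etree.ElementTree as ET
from collections import defaultdict


def one_changeset_users(stream):
    """Yield (user, changeset id, comment, editor) for every user whose
    only changeset contains exactly one edit."""
    per_user = defaultdict(list)
    for _, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != "changeset":
            continue
        user = elem.get("user")
        if user is not None:  # assumption: skip changesets without a user
            tags = {t.get("k"): t.get("v") for t in elem.findall("tag")}
            per_user[user].append((
                int(elem.get("id")),
                int(elem.get("num_changes", "0")),
                tags.get("comment", ""),
                tags.get("created_by", ""),
            ))
        elem.clear()  # drop the tag children once they are recorded
    for user, changesets in per_user.items():
        if len(changesets) == 1 and changesets[0][1] == 1:
            cs_id, _, comment, editor = changesets[0]
            yield user, cs_id, comment, editor


# Made-up sample in the shape of the changeset dump
SAMPLE = b"""<osm>
<changeset id="1" user="alice" uid="1" num_changes="1">
  <tag k="comment" v="Best pizza in town, visit us"/>
  <tag k="created_by" v="iD 2.2.2"/>
</changeset>
<changeset id="2" user="bob" uid="2" num_changes="1"/>
<changeset id="3" user="bob" uid="2" num_changes="4"/>
<changeset id="4" user="carol" uid="3" num_changes="7"/>
</osm>"""

rows = list(one_changeset_users(io.BytesIO(SAMPLE)))
csv.writer(sys.stdout).writerows(rows)
```

For the real dump, the stream could come from
bz2.open("changesets-latest.osm.bz2", "rb"). Like the Perl version, this
keeps every changeset in memory, so expect it to need a lot of RAM.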

3. Now if you want to continue and download the contents of each
changeset so identified, run the CSV through this other script:

--- cut ---
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

while(<>)
{
    chomp;
    my ($user, $cs, $comment, $editor) = split(/,/);
    # line below ignores all with a short comment - this is where one
    # could also filter for other kinds of characteristic comments
    next unless (length($comment) > 50);
    my $r = $ua->get("http://api.openstreetmap.org/api/0.6/changeset/$cs/download");
    if ($r->is_success)
    {
        print;
        foreach (split(/\n/, $r->content()))
        {
            if (/<(node|way|relation).* id="(\d+)"/)
            {
                print ",\"$1\",$2";
                if (/ version="([^"]+)"/)
                {
                    print ",$1";
                }
                else
                {
                    print ",";
                }
                if (/ lat="([^"]+)"/)
                {
                    print ",$1";
                }
                else
                {
                    print ",";
                }
                if (/ lon="([^"]+)"/)
                {
                    print ",$1";
                }
                else
                {
                    print ",";
                }
            }
            elsif(/<tag k="([^"]+)" v="([^"]+)"/)
            {
                $k=$1;
                $v=$2;
                $k =~ y/"/'/;
                $v =~ y/"/'/;
                # print the cleaned-up copies, not the raw captures
                print ",\"$k=$v\"";
            }
        }
        print "\n";
    }
}
--- cut ---

like so

cat changesets.csv | perl otherscript.pl > changesets-with-edit.csv

The script tries to download the object from the changeset and outputs a
CSV with the important properties. (It's not really geared towards
changesets with more than one edit though.)
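
The extraction part of that script can be sketched in Python with a real XML
parser instead of regular expressions. This is only an illustration: the
function name and the sample document are invented, the sample merely
imitates the osmChange layout that the changeset download call returns, and
the HTTP request itself is left out.

```python
import xml.etree.ElementTree as ET

OBJECT_TYPES = ("node", "way", "relation")


def objects_in_osmchange(xml_text):
    """Return one flat row per object: type, id, version, lat, lon,
    followed by its tags as "k=v" strings."""
    rows = []
    for elem in ET.fromstring(xml_text).iter():
        if elem.tag not in OBJECT_TYPES:
            continue
        row = [
            elem.tag,
            int(elem.get("id")),
            elem.get("version", ""),
            elem.get("lat", ""),  # ways and relations have no coordinates
            elem.get("lon", ""),
        ]
        row.extend(f"{t.get('k')}={t.get('v')}" for t in elem.findall("tag"))
        rows.append(row)
    return rows


# Made-up sample imitating a changeset download
SAMPLE = """<osmChange version="0.6">
 <create>
  <node id="123" version="1" changeset="9" lat="49.0" lon="8.4">
   <tag k="name" v="Joe&apos;s &quot;Best&quot; Pizza"/>
   <tag k="website" v="http://example.com"/>
  </node>
 </create>
 <modify>
  <way id="7" version="3" changeset="9">
   <nd ref="123"/>
   <tag k="note" v="a, b"/>
  </way>
 </modify>
</osmChange>"""

for row in objects_in_osmchange(SAMPLE):
    print(row)
```

Unlike the regex approach, this returns one row per object, so changesets
with several edits are not squeezed into a single line, and XML entities
such as &quot; are decoded by the parser.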

After this I had ~12k changes left, and used "grep" to concentrate
on those that had a website, note, or description tag, leaving me with
~3500.
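
The grep step could look roughly like this in Python. This is a sketch under
assumptions, not the command actually used: it assumes the column layout
written by the script above (four changeset columns, five object columns,
then the tags), it matches the three keys exactly rather than as substrings,
and the sample rows are invented.

```python
import csv
import io

WANTED_KEYS = {"website", "note", "description"}
# user, changeset, comment, editor, type, id, version, lat, lon come first
FIRST_TAG_COLUMN = 9


def has_wanted_tag(row, wanted=WANTED_KEYS):
    """True if any tag column of the row uses one of the wanted keys."""
    for field in row[FIRST_TAG_COLUMN:]:
        key, sep, _value = field.partition("=")
        if sep and key in wanted:
            return True
    return False


# Made-up rows in the layout of changesets-with-edit.csv
SAMPLE = '''"alice",1,"Best pizza in town, open daily", "iD 2.2.2","node",123,1,49.0,8.4,"name=Pizza","website=http://example.com"
"bob",2,"added a bench near the playground entrance", "iD 2.2.2","node",124,1,49.1,8.5,"amenity=bench"
'''

# skipinitialspace copes with the blank before the editor column
reader = csv.reader(io.StringIO(SAMPLE), skipinitialspace=True)
kept = [row for row in reader if has_wanted_tag(row)]
for row in kept:
    print(row[0], row[1])
```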

Then, if you want to augment that by downloading the *latest* version of
each object to see whether it is still the same as before, you could pipe
that CSV through a script like this:

--- cut ---
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

while(<>)
{
    chomp;
    /^(.*),"(node|way|relation)",(\d+),(\d+),(.*)/ or next;
    my ($a, $b, $c, $d, $e)=($1, $2, $3, $4, $5);
    my $r = $ua->get("http://api.openstreetmap.org/api/0.6/$b/$c");
    undef $version;
    undef $t;
    if ($r->is_success)
    {
        foreach (split(/\n/, $r->content()))
        {
            if (/<(node|way|relation).* id="(\d+)"/)
            {
                if (/ version="([^"]+)"/)
                {
                    $version=$1;
                }
            }
            elsif(/<tag k="([^"]+)" v="([^"]+)"/)
            {
                $k=$1;
                $v=$2;
                $k =~ y/"/'/;
                $v =~ y/"/'/;
                $t->{$k}=$v;
            }
        }
    }
    print "$a,\"$b\",$c,$d,$version";
    $same++ if ($d == $version);
    foreach (split(/,/, $e))
    {
        if (/^"(.*)=(.*)"$/ && ($2 ne $t->{$1}))
        {
            printf ',"%s=%s->%s"', $1, $2, $t->{$1};
        }
        else
        {
            print ",$_";
        }
    }
    print "\n";
}

--- cut ---

And you'll end up with something like my "one trick ponies" CSV posted
in the other thread.
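
The comparison at the heart of that last script, stripped of the download
and CSV handling, can be sketched in Python as follows. The function name
and the example tags are made up. As in the Perl version, a tag that has
since been removed shows up with an empty value after the arrow.

```python
def describe_tag_changes(old_tags, latest_tags):
    """Return "k=v" for tags that are unchanged in the latest version and
    "k=v->new" for tags whose value differs or that were removed."""
    described = []
    for key, old_value in old_tags.items():
        latest_value = latest_tags.get(key, "")
        if latest_value == old_value:
            described.append(f"{key}={old_value}")
        else:
            described.append(f"{key}={old_value}->{latest_value}")
    return described


# Made-up example: the website was changed, the note was removed later
old = {"name": "Pizza", "website": "http://example.com", "note": "call us"}
latest = {"name": "Pizza", "website": "http://example.org"}
changes = describe_tag_changes(old, latest)
print(changes)
```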

This is all super hacky of course, suffers from lack of proper escaping
and XML parsing, and could all be done properly in a more modern
language. But if this encourages one or two people to play a bit then
maybe it was already useful.

Bye
Frederik

-- 
Frederik Ramm  ##  eMail frederik at remote.org  ##  N49°00'09" E008°23'33"


