[OSM-dev] OSM Project Idea for GSoC 17: Vandalism Detection in Map Edits

Wed Dec 21 18:13:31 UTC 2016

Hi,

As Ethan mentioned, binary classification would be a tough nut to crack
because a single changeset can contain multiple objects. According to
the paper [2], OSMPatrol marked 44% of the edits as possible vandalism,
which suggests a lot of false positives. 50% of the users whose edits were
detected as vandalism had a reputation above 66%. Hence, some work could
also be done on the rule for assigning reputation to users. Since I have
worked on text processing in the past, I'd like to work on detecting this
type of vandalism: *Adding fake data and tags*.
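
To make the false-positive argument concrete, here is a rough sketch of the
best precision a detector can reach when it flags a given share of edits.
The 44% flag rate is the figure quoted above; the true vandalism rate is NOT
known from the paper and the value used below is purely a hypothetical
placeholder, as is the function name.

```python
def precision_upper_bound(flag_rate, vandalism_rate):
    """Best-case precision of a detector that flags `flag_rate` of all edits.

    Even if every truly vandal edit is among the flagged ones, the share of
    flagged edits that are real vandalism cannot exceed
    vandalism_rate / flag_rate.
    """
    if not 0 < flag_rate <= 1:
        raise ValueError("flag_rate must be in (0, 1]")
    if not 0 <= vandalism_rate <= 1:
        raise ValueError("vandalism_rate must be in [0, 1]")
    return min(1.0, vandalism_rate / flag_rate)


if __name__ == "__main__":
    flag_rate = 0.44        # share of edits flagged, as quoted from the paper
    assumed_rate = 0.022    # hypothetical true vandalism rate, an assumption
    bound = precision_upper_bound(flag_rate, assumed_rate)
    print(f"precision can be at most {bound:.2%}")
```

Under that assumed rate, at least 95% of the flagged edits would be false
positives, no matter how good the recall is.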

As Jason mentioned, there would be plenty of data representing good edits,
so we know that most edits are regular or correct. Hence, the
distribution of good to bad edits would be skewed, and there is a lot
of scope for modeling this problem as an Anomaly Detection task. As I
mentioned above, some edits by users with a reputation above 66%
are also marked as potential vandalism. This goes against the well-defined
notion of normal behaviour for a highly reputed user, so such an edit is an
anomaly in that user's behaviour.

I plan to use *One Class SVM* (the kernel can be chosen using cross
validation) and *Isolation Forest* to detect these kinds of possible
outliers. The anomaly detection systems implemented in the past usually
detect outliers against the dataset as a whole rather than focusing on a
particular stakeholder (the editor in this case). Hence, I would like to
explore whether such a per-editor system would improve Precision and Recall
simultaneously.
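
As a minimal sketch of the per-editor idea, the code below builds a small,
simplified Isolation Forest from one editor's past edits and scores new
edits against it. It is a pure-Python illustration and not a proposed
implementation; in practice a library such as scikit-learn (IsolationForest,
OneClassSVM) would likely be used instead. The feature layout (for example,
tags added and objects deleted per changeset) and all function names are
assumptions made for illustration.

```python
import math
import random


def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximation of H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def build_tree(rows, depth, max_depth, rng):
    """Recursively split rows on a random feature at a random threshold."""
    if depth >= max_depth or len(rows) <= 1:
        return ("leaf", len(rows))
    feature = rng.randrange(len(rows[0]))
    lo = min(r[feature] for r in rows)
    hi = max(r[feature] for r in rows)
    if lo == hi:  # cannot split on this feature; stop here (simplification)
        return ("leaf", len(rows))
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[feature] < split]
    right = [r for r in rows if r[feature] >= split]
    return ("node", feature, split,
            build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))


def path_length(tree, row, depth=0):
    """Depth at which row lands in a leaf, adjusted for unsplit leaf size."""
    if tree[0] == "leaf":
        return depth + c_factor(tree[1])
    _, feature, split, left, right = tree
    child = left if row[feature] < split else right
    return path_length(child, row, depth + 1)


def anomaly_scores(history, candidates, n_trees=100, sample_size=64, seed=0):
    """Score candidates in (0, 1); higher means more unusual for this editor.

    `history` is a list of feature tuples from ONE editor's past edits
    (hypothetical features), so "normal" is defined per editor.
    """
    if len(history) < 2:
        raise ValueError("need at least two past edits for this editor")
    rng = random.Random(seed)
    size = min(sample_size, len(history))
    max_depth = math.ceil(math.log2(size))
    trees = [build_tree(rng.sample(history, size), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = c_factor(size)
    scores = []
    for row in candidates:
        mean_depth = sum(path_length(t, row) for t in trees) / n_trees
        scores.append(2.0 ** (-mean_depth / norm))
    return scores
```

Calling anomaly_scores(history, new_edits) with one editor's own history
gives scores relative to that editor's usual behaviour; an edit far outside
the editor's typical range should receive a higher score than a typical one.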

Thanks,
Animesh Sinha

On Tue, Dec 20, 2016 at 9:27 AM, Jason Remillard <remillard.jason at gmail.com>
wrote:

> Hi,
>
> There is obviously plenty of data that represents "good" changes. The
> data working group reversions could be used to train a classifier on
> what a bad edit looks like. After that, looking for changesets that
> are logically erased after a short period of time (say 2 weeks) might
> also yield some bad changesets.
>
> Jason
>
> On Sun, Dec 18, 2016 at 6:38 PM, Animesh Sinha
> <sinha.animesh34 at gmail.com> wrote:
> > Hi,
> >
> > I am a first-year master's student at Purdue University and would like to
> > propose a project idea for GSoC 2017. I have worked on Vandalism Detection
> > in Wikipedia in the past and understand how important it is to predict
> > whether a piece of information is correct, as it may otherwise mislead
> > others.
> >
> > Hence, I would like to propose this project idea:
> >
> > Title: Detect if a user edit made in OSM is a vandal edit or regular.
> > Summary: It's a very challenging task to manually monitor malicious edits
> > or spam for a large active user base. I plan to identify cases of
> > vandalism on OSM by classifying edits as either regular or vandal. This is
> > clearly a Binary Classification task, but if the distribution of regular
> > and vandalism cases in the dataset is skewed, it can also be explored as
> > an Anomaly Detection problem.
> > Requirements: Lots of data about the edits made, information about the
> > users making the edit, information about the people annotating the true
> > labels, etc.
> >
> > I would appreciate it if someone could provide feedback on the project
> > idea and the requirements needed.
> >
> > Thanks,
> > Animesh Sinha
> >
> > _______________________________________________
> > dev mailing list
> > dev at openstreetmap.org
> > https://lists.openstreetmap.org/listinfo/dev
> >
>