<div dir="ltr">Hi,<div><br></div><div>As Ethan mentioned, binary classification would be a tough nut to crack because a changeset can contain multiple objects. According to the paper [2], OSMPatrol marked 44% of the edits as possible vandalism, which suggests a large number of false positives. In addition, 50% of the users whose edits were flagged as vandalism had a reputation of more than 66%, so the rule for assigning user reputation could also be improved. Since I have worked on text processing in the past, I'd like to work on detecting this type of vandalism: <b>Adding fake data and tags</b>.</div><div><br></div><div>As Jason mentioned, there is plenty of data representing good edits, so we know that most edits are regular or correct. The distribution of good to bad edits is therefore skewed, which leaves a lot of scope for modeling this problem as an Anomaly Detection task. As noted above, some edits by users with a reputation of more than 66% are also marked as potential vandalism. Such edits go against the well-defined notion of normal behavior for a highly reputed user, so they can be treated as anomalies in that user's behavior.</div><div><br></div><div>I plan to use <b>One Class SVM</b> (the kernel can be chosen using cross-validation) and <b>Isolation Forest</b> to detect these kinds of possible outliers. Anomaly detection systems implemented in the past usually detect outliers against the dataset as a whole rather than focusing on a particular stakeholder (the editor, in this case). I would therefore like to explore whether a per-editor system could increase Precision and Recall simultaneously.</div><div><br></div><div>Thanks,</div><div>Animesh Sinha</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Dec 20, 2016 at 9:27 AM, Jason Remillard <span dir="ltr">&lt;<a href="mailto:remillard.jason@gmail.com" target="_blank">remillard.jason@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br>

<br>

There is obviously plenty of data that represents "good" changes. The<br>

data working group reversions could be used to train a classifier on<br>

what a bad edit looks like. After that, looking for change sets that<br>

are logically erased after a short period of time (say 2 weeks), might<br>

also yield some bad change sets.<br>

<span class="HOEnZb"><font color="#888888"><br>

Jason<br>

</font></span><span class="im HOEnZb"><br>

On Sun, Dec 18, 2016 at 6:38 PM, Animesh Sinha<br>

&lt;<a href="mailto:sinha.animesh34@gmail.com">sinha.animesh34@gmail.com</a>&gt; wrote:<br>

</span><div class="HOEnZb"><div class="h5">> Hi,<br>

><br>

> I am a first-year master's student at Purdue University and would like to<br>

> propose a project idea for GSoC 2017. I have worked on Vandalism Detection<br>

> in Wikipedia in the past and understand how important it is to predict whether<br>

> a piece of information is correct, since incorrect information may mislead others.<br>

><br>

> Hence, I would like to propose this project idea:<br>

><br>

> Title: Detect if a user edit made in OSM is a vandal edit or regular.<br>

> Summary: It's a very challenging task to monitor the malicious edits or<br>

> spam manually for a large active user base. I plan to identify the cases of<br>

> vandalism on OSM by classifying edits as either regular or vandal. This is<br>

> clearly a Binary Classification task, but if the distribution of regular and<br>

> vandalism cases in the dataset is skewed, it can also be explored as an<br>

> Anomaly Detection problem.<br>

> Requirements: Lots of data about the edits made, information about the users<br>

> making the edit, information about the people annotating the true labels,<br>

> etc.<br>

><br>

> I would appreciate it if someone could provide feedback on the project idea and<br>

> the requirements needed.<br>

><br>

> Thanks,<br>

> Animesh Sinha<br>

><br>

</div></div><div class="HOEnZb"><div class="h5">> ______________________________<wbr>_________________<br>

> dev mailing list<br>

> <a href="mailto:dev@openstreetmap.org">dev@openstreetmap.org</a><br>

> <a href="https://lists.openstreetmap.org/listinfo/dev" rel="noreferrer" target="_blank">https://lists.openstreetmap.<wbr>org/listinfo/dev</a><br>

><br>

</div></div></blockquote></div><br></div>
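<div>To make the per-editor anomaly detection idea from the reply above more concrete, here is a minimal sketch in pure Python of an Isolation Forest fitted separately on each editor's edits rather than on the whole dataset. It is only an illustration under stated assumptions: the feature tuples (tags added, objects touched), the function names and the editor key are all hypothetical, and each tree is built on the editor's full history instead of a random subsample, which is a simplification for small per-editor histories.</div>

```python
import math
import random

EULER_GAMMA = 0.5772156649


def average_path_length(n):
    """Expected path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(rows, depth, max_depth, rng):
    """Recursively split rows on a random feature at a random threshold."""
    if depth >= max_depth or len(rows) <= 1:
        return ("leaf", len(rows))
    feature = rng.randrange(len(rows[0]))
    low = min(row[feature] for row in rows)
    high = max(row[feature] for row in rows)
    if low == high:
        # All remaining rows agree on this feature; stop splitting here.
        return ("leaf", len(rows))
    threshold = rng.uniform(low, high)
    left = [row for row in rows if row[feature] < threshold]
    right = [row for row in rows if row[feature] >= threshold]
    return ("node", feature, threshold,
            build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))


def path_length(tree, row, depth=0):
    """Depth at which row reaches a leaf, adjusted for unsplit rows left in that leaf."""
    if tree[0] == "leaf":
        return depth + average_path_length(tree[1])
    _, feature, threshold, left, right = tree
    child = left if row[feature] < threshold else right
    return path_length(child, row, depth + 1)


def anomaly_scores(rows, n_trees=100, seed=0):
    """Score each row in (0, 1); values closer to 1 mean the row is isolated sooner."""
    if len(rows) < 2:
        return [0.5 for _ in rows]  # too little history to judge
    rng = random.Random(seed)
    max_depth = math.ceil(math.log2(len(rows)))
    trees = [build_tree(rows, 0, max_depth, rng) for _ in range(n_trees)]
    norm = average_path_length(len(rows))
    scores = []
    for row in rows:
        mean_depth = sum(path_length(tree, row) for tree in trees) / n_trees
        scores.append(2.0 ** (-mean_depth / norm))
    return scores


def score_per_editor(edits_by_user, **kwargs):
    """Fit one forest per editor, so each edit is compared with that editor's own history."""
    return {user: anomaly_scores(rows, **kwargs)
            for user, rows in edits_by_user.items()}
```

<div>Calling score_per_editor with a dictionary that maps each editor to a list of feature tuples returns one score per edit. An edit that differs sharply from the same editor's usual pattern should receive a score closer to 1, even if it would look unremarkable against the dataset as a whole. Whether this actually improves Precision and Recall would still have to be checked against labeled data such as the data working group reversions.</div>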