[OSM-talk] Student feedback request : Weighted product model for erroneous contributions in OSM
Sam Grace
samdgrace at yahoo.com
Thu Mar 3 12:27:05 UTC 2016
Hello, as per advice posted here :
https://help.openstreetmap.org/questions/47710/what-is-the-correct-avenue-for-contacting-the-osm-community-with-a-research-related-question
I am using the mailing list to seek feedback from the OSM community on the design of a weighting system that aims to grade erroneous contributions in OSM. Error candidates are selected from the full history file if an object's geometry is modified and later rolled back to a former state. The research focusses on rollbacks of way geometry. I've used Peter Korner's 'OSM history importer', which calculates linestrings for ways and 'minor' way versions when way node members are moved. This allowed me to detect erroneous displacement of ways as well as erroneous node membership edits.
Erroneously created ways are identified by association: if the repairer of an erroneously modified way also deleted a created way belonging to the same changeset, that way is added to the results.
I have attempted to design a weighted product model to represent the impact of these contributions.
Below is a list of identified characteristics of the erroneous contributions that could contribute to the relative loss of quality in the OSM dataset. The starting score is assigned according to the type of object that has been vandalised; 7 other factors moderate that score up or down.
***********************************************************************************************************
Object type & Starting score
Categories are based on the most commonly occurring tags identified in a set of erroneous contributions on a sample extract of the full history file that best describe a way object. All objects that do not contain one of these tags are classed as 'other'.
Start Score
1) building = 'yes'    0.25
2) waterway = 'stream', highway = 'other', and all objects that don't fall under a predefined category    0.5
3) highway = 'path', 'bridleway', 'footway', 'high', 'track', 'cycleway'; waterway = 'river', 'canal'    2
4) building = 'other'    3
5) boundary = 'protected_area', 'national_park'; highway = 'unclassified', 'residential'    4
6) highway = 'tertiary', 'secondary', 'secondary_link', 'tertiary_link'    5
7) highway = 'motorway', 'trunk', 'primary', 'secondary', 'motorway_link', 'trunk_link', 'primary_link'; boundary = 'administrative'    7
******************************************************************
moderate 1 – relative length of way
Each record in the sample dataset, including the error candidates, is graded into quintile buckets according to its length relative to other records in the same category. The starting score is multiplied by a factor according to the reverted way's relative length. As length is not applicable to buildings, they are all multiplied by 0.5 regardless of their length.
multiplication factor
1) object length in top 20%    1
2) object length in top 40%    0.8
3) object length in top 60%    0.6
4) object length in top 80%    0.4
5) object length in bottom 20%    0.2
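
The bucketing above could be sketched roughly as follows. This is only an illustration: the function name `length_factor`, ranking by the number of strictly shorter ways, and the handling of bucket boundaries are my assumptions, not part of the model as described.

```python
from bisect import bisect_left

# Factors for the five length buckets, shortest fifth first.
LENGTH_FACTORS = (0.2, 0.4, 0.6, 0.8, 1.0)

def length_factor(length_m, category_lengths, is_building=False):
    """Return the 'moderate 1' factor for one way.

    category_lengths holds the lengths of all sampled ways in the same
    category (assumed non-empty).
    """
    if is_building:
        return 0.5  # buildings get a flat factor regardless of length
    ranked = sorted(category_lengths)
    # Number of ways in the category that are strictly shorter (assumption).
    shorter = bisect_left(ranked, length_m)
    # Integer arithmetic maps the rank to a bucket 0..4 without float rounding.
    bucket = min(5 * shorter // len(ranked), 4)
    return LENGTH_FACTORS[bucket]
```

With 100 sampled lengths of 1 to 100 metres, a 100 m way lands in the top bucket and a 1 m way in the bottom one.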
******************************************************************
moderate 2 – percentage of vandalized nodes in way
The starting score is further moderated according to the share of the way's nodes that were vandalised.
multiplication factor
1) >= 80% (nodes vandalized in each way object) 1
2) 50 - 80% 0.75
3) 25 - 50% 0.5
4) < 25% 0.25
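
As a rough sketch, the bands above translate into a simple threshold check. The name `node_ratio_factor` is hypothetical, the last band is read as "below 25%", and treating each lower bound as inclusive is an assumption, since the list does not say which band a value of exactly 50% or 25% belongs to.

```python
def node_ratio_factor(vandalised_nodes, total_nodes):
    """Return the 'moderate 2' factor from the share of vandalised nodes."""
    ratio = vandalised_nodes / total_nodes
    if ratio >= 0.8:
        return 1.0
    if ratio >= 0.5:   # lower bound inclusive (assumption)
        return 0.75
    if ratio >= 0.25:  # lower bound inclusive (assumption)
        return 0.5
    return 0.25        # fewer than a quarter of the nodes affected
```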
******************************************************************
moderate 3 - tags 1
Are the vandalized ways heavily tagged ?
multiplication factor
1)above average tags for object category 1.2
2)average or below 1
******************************************************************
moderate 4 – tags 2
have the tags been erroneously edited as well as the geometry ?
multiplication factor
1)yes 1.2
2)no 1
******************************************************************
moderate 5 - amendment type
6 different kinds of erroneous contributions were identified. The weighting system aims to reflect their relative severity. Minor modifications are scored down; deletions and big changes are scored up.
multiplication factor
1) node order reversed, node added or removed while positional geometry unaffected, erroneously created objects    0.25
2) node added or removed and positional geometry affected    0.5
3) nodes in way erroneously displaced by 100 metres or less    1
4) way object deleted or node displaced by more than 100 metres    2
******************************************************************
moderate 6 - fix rate detection
Erroneous contributions that are picked up quickly are scored down; those that remain undetected for longer are scored up.
multiplication factor
1) < 3 days 0.25
2) 3 days – 1 week 0.75
3) 1 week – 1 month 1
4) 1 month – 6 months 1.5
5) > 6 months 2
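
A minimal sketch of the time bands above, for anyone who wants to try them on their own data. The name `detection_factor` is hypothetical, and taking "1 month" as 30 days and "6 months" as 182 days is my assumption; the bands themselves come from the list.

```python
from datetime import timedelta

# (exclusive upper bound of the band, factor), in increasing order.
DETECTION_BANDS = [
    (timedelta(days=3), 0.25),
    (timedelta(days=7), 0.75),
    (timedelta(days=30), 1.0),   # "1 month" taken as 30 days (assumption)
    (timedelta(days=182), 1.5),  # "6 months" taken as 182 days (assumption)
]

def detection_factor(time_to_fix):
    """Return the 'moderate 6' factor for the time between error and fix."""
    for upper_bound, factor in DETECTION_BANDS:
        if time_to_fix < upper_bound:
            return factor
    return 2.0  # still undetected after roughly six months
```

For example, `detection_factor(timedelta(hours=2, minutes=36, seconds=2))` falls in the first band and `detection_factor(timedelta(days=52))` in the fourth.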
******************************************************************
moderate 7 - fix rate absolute
Sometimes an erroneous contribution is reverted to a state older than the immediately preceding version, which also undoes other mappers' seemingly good work. Such cases should be scored up.
multiplication factor
1)fix rate detection = fix rate absolute 1
2)fix rate absolute – fix rate detection > 1 month 1.2
3)fix rate absolute – fix rate detection > 6 month 1.5
***********************************************************************************************************
In testing this system, some candidates obtain a very high score and some virtually nil, which is what I was hoping the model would produce. Consider a scenario where an unnecessary node is added to a tiny section of footpath and fixed immediately. Compare that to a country border that is erroneously deleted and remains undetected for several weeks.
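
To make the contrast concrete, here is a small sketch of the product itself applied to those two scenarios. The function name `weighted_score`, the dictionary keys, and the particular factor picks for each scenario are illustrative assumptions chosen from the tables above, not values taken from the candidate table.

```python
from math import prod

def weighted_score(start_score, factors):
    """Multiply the starting score by every moderating factor."""
    return start_score * prod(factors.values())

# Hypothetical footpath case: extra node, tiny way, fixed within hours.
footpath_node = weighted_score(2, {
    "length": 0.2,          # bottom 20% for its category
    "node_ratio": 0.25,     # fewer than 25% of nodes touched
    "tags_heavy": 1.0,
    "tags_edited": 1.0,
    "amendment": 0.25,      # node added, geometry unaffected
    "detection": 0.25,      # fixed in under 3 days
    "fix_absolute": 1.0,
})

# Hypothetical border case: long administrative boundary deleted for weeks.
border_deleted = weighted_score(7, {
    "length": 1.0,          # top 20% for its category
    "node_ratio": 1.0,      # whole way affected
    "tags_heavy": 1.0,
    "tags_edited": 1.0,
    "amendment": 2.0,       # way object deleted
    "detection": 1.0,       # 1 week to 1 month
    "fix_absolute": 1.0,
})

print(f"footpath: {footpath_node:.5f}, border: {border_deleted:.5f}")
```

The two tag factors are not shown in the excerpt of candidates. If both are assumed to be 1.2, the same function gives 14.4 for a deleted tertiary highway (start score 5) with full length and node weights, a detection factor of 1 and an amendment factor of 2.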
Below is an excerpt of scores from a table of 321 candidates. I've selected a few of the highest scoring, a few of the lowest scoring and a few from the middle. Way id and version are included if you would like to look these up via the openstreetmap URL. Note: the “displaced way” error category is based on rolled-back nodes in the way, so the rollback will not be seen in the way version.
***********************************************************************************************************
Error candidate Example 1
way id : 224884227
Version : 5
way_length : 54 metres
length weight : 50%
ratio of vandalized nodes : 25%
detection rate : 02:36:02 (factor 0.25)
amendment type : node change geom uneffect
way category : "building"=>"other"
Overall weighted score : 0.007
******************************************************************
Error Candidate Example 2
way id : 275421474
Version : 1
way_length : 1
length weight : 50%
ratio of vandalized nodes : 100%
detection rate : 1 day 00:36:56
amendment type : created way
way category : untagged
Overall weighted score : 0.0078
******************************************************************
Error candidate Example 3
way id : 56084404
Version : 3
way_length : 26
length weight : 20%
ratio of vandalized nodes : 100%
detection rate : 52 days 10:04:21
amendment type : order reverse
way category : "highway"=>"path"
Overall weighted score : 0.4050
******************************************************************
Error candidate Example 4
way id : 124188191
Version : 4
way_length : 127
length weight : 40%
ratio of vandalized nodes : 100%
detection rate : 4 days 03:25:27
amendment type : node change geom effect
way category : "highway"=>"residential"
Overall weighted score : 1.0800
******************************************************************
Error candidate Example 5
way id : 56084398
Version : 4
way_length : 702
length weight : 80%
ratio of vandalized nodes : 75%
detection rate : 52 days 10:04:20
amendment type : node change geom effect
way category : "highway"=>"path"
Overall weighted score : 2.0250
******************************************************************
Error candidate Example 6
way id : 220903484
Version : 1
way_length : 833
length weight : 80%
ratio of vandalized nodes : 100%
detection rate : 52 days 05:27:41
amendment type : displaced way
way category : "highway"=>"path"
Overall weighted score : 3.6000
******************************************************************
Error candidate Example 7
way id : 258494217
Version : 4
way_length : 2974
length weight : 100%
ratio of vandalized nodes : 100%
detection rate : 52 days 05:27:26
amendment type : way deleted
way category : "highway"=>"path"
Overall weighted score : 10.8000
******************************************************************
Error candidate Example 8
way id : 170015088
Version : 3
way_length : 4577
length weight : 100%
ratio of vandalized nodes : 100%
detection rate : 7 days 17:18:53
amendment type : way deleted
way category : "highway"=>"tertiary"
Overall weighted score : 14.4000
***********************************************************************************************************
Of course this is all highly subjective, so I welcome comments from anyone interested.
Sam Grace