[OSM-talk] Wikipedia/Wikidata admins cleanup
Tomas Straupis
tomasstraupis at gmail.com
Wed Jan 4 19:54:00 UTC 2017
> This whole conversation reinforces my (un-educated, I confess) idea of the
> uselessness of cross-referencing the Wikipedia ecosystem with OSM via OSM
> tags.
>
> Automated addition of wikidata ids to OSM objects seems worthwhile, so why not
> do it on the fly instead of writing it to the database? Next year maybe?
When you have a lot of data, it becomes difficult to verify it. That
is, you run into an "Oracle problem": it takes too much time for
somebody with the necessary knowledge to verify the data, or you have
no Oracle who can verify the data at all. One way of solving the
Oracle problem is to have a second dataset captured INDEPENDENTLY.
You can then compare the two (or more) datasets and identify PROBABLE
errors. When such errors/incompatibilities are found, they should be
checked and resolved manually. If you do a dumb copy/overwrite of the
data (possibly converted data, as in the case of wikipedia article ->
wikidata id), you lose the "two dataset" situation. That is, you can
no longer use the two datasets to solve the Oracle problem, i.e. to
verify the data in BOTH datasets. You simply take one of the datasets
(or somebody's assumption of how data in set A converts to set B) as
"correct" and overwrite the other dataset, thus destroying the
possibility of genuine data validation.
So such automated addition of wikidata tags without local knowledge
does more damage than good. And if the whole change is based on the
existing wikipedia tags anyway, the conversion can be done on the fly
by anybody with minimal knowledge - there is no need to write it into
the database.
And to give a more practical perspective: we have been doing an
OSM<->wikipedia comparison for more than two years now. We take OSM
objects which have a wikipedia tag and so get one dataset of
page:coordinates. Then we take the wikipedia dump (***-latest-geo_tags)
and get a different dataset of page:coordinates. We compare those two
datasets and identify mismatches, and then manually check each of them
to make sure the information in BOTH datasets is correct. We NEVER do
any automated update. This is the only way to keep data quality at a
high level. Try going through at least a hundred of these mismatches
and you will see how many different situations there are, and you will
understand why an automated update is NOT an option.
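
To make the workflow concrete, here is a minimal sketch of such a
comparison in Python. The file names, the CSV layout and the 1 km
mismatch threshold are illustrative assumptions, not our actual
scripts - the point is only the shape of the check: two independently
captured coordinate datasets, compared, with every mismatch left for a
human to inspect.

# Sketch of the OSM<->Wikipedia coordinate comparison described above.
# Assumed inputs (already extracted to CSV; names are hypothetical):
#   osm_wikipedia.csv  -> article, lat, lon  (from OSM objects with a wikipedia tag)
#   wiki_geo_tags.csv  -> article, lat, lon  (from the *-latest-geo_tags dump)
import csv
import math

THRESHOLD_M = 1000  # illustrative: flag pairs more than ~1 km apart

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def load(path):
    with open(path, newline='', encoding='utf-8') as f:
        return {row['article']: (float(row['lat']), float(row['lon']))
                for row in csv.DictReader(f)}

osm = load('osm_wikipedia.csv')    # dataset 1: coordinates via OSM objects
wiki = load('wiki_geo_tags.csv')   # dataset 2: coordinates via Wikipedia dump

# Compare the two independently captured datasets and list PROBABLE errors.
# Every mismatch is only reported, never auto-fixed - neither dataset is
# assumed to be the correct one.
for article in sorted(osm.keys() & wiki.keys()):
    d = haversine_m(*osm[article], *wiki[article])
    if d > THRESHOLD_M:
        print(f'{article}: OSM {osm[article]} vs Wikipedia {wiki[article]} ({d:.0f} m apart)')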
--
Tomas