[OSM-talk] Adding Wikidata tags to 70k items automatically

Archer arch3r at gulli.com
Sun Aug 31 19:08:24 UTC 2014


2014-08-31 20:19 GMT+02:00 Edward Betts <edward at 4angle.com>:

> Archer <arch3r at gulli.com> wrote:
> > Please don’t get me wrong. I’m a big fan of Wikidata, but I'm against
> > an automated import. The mismatches list gives good examples that your
> > matching algorithm doesn't work very well:
> > http://edwardbetts.com/osm-wikidata/mismatches.html
> >
> > Some examples:
> >
> > 1. Isar Nuclear Power Plant <http://wikidata.org/wiki/Q569510>: your
> > algorithm matches only one reactor of the power plant: Isar 2
> > <http://www.openstreetmap.org/way/32918120> but the right matching
> > would be Kernkraftwerke
> > Isar <http://www.openstreetmap.org/way/23802422>
>
> Q569510 is matching Isar 2 (Way 32918120) because Isar 2 is in the list of
> German aliases in the Wikidata object:
>
> [ "KKW Isar", "AKW Isar", "Isar 2", "Kernkraftwerk Isar I", "Isar 1",
>   "Atomkraftwerk Isar" ]
>
> The German label on the Wikidata item is "Kernkraftwerke Isar", notice the
> extra 'e' on the end of the first word.
>
> I could add Levenshtein distance calculations to my matching; we could say
> that if there is a single-character difference the names match. With this
> change both OSM objects would match and my code would skip the Wikidata
> item.
>
> The problem with this change is that hill and hall would match.
>
Ok, but the Wikidata object describes the whole power plant and not only
one reactor.
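
To illustrate the single-character tolerance discussed in the quoted text,
here is a minimal sketch of an edit-distance check. The function names and
the max_distance parameter are my own assumptions, not the code of the
matching tool.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]


def names_match(a: str, b: str, max_distance: int = 1) -> bool:
    """Hypothetical helper: names match if they differ by at most max_distance edits."""
    return levenshtein(a.casefold(), b.casefold()) <= max_distance


print(names_match("Kernkraftwerk Isar", "Kernkraftwerke Isar"))
print(names_match("hill", "hall"))
```

Both calls return True, which shows the concern raised above: a tolerance
of one character accepts the spelling variant but also unrelated words.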

I'd propose taking "is a" (Wikidata property P31) into account. For
example, Q569510 is classified in Wikidata as a nuclear power plant
(Q134447), so the matching algorithm should look for the corresponding OSM
tags. For power plants the right tag would be power=plant. Otherwise there
should be no match.
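
A rough sketch of such a type check, under my own assumptions: the mapping
table, the function name and the example tags are hypothetical, and only
the pair Q134447 / power=plant comes from the paragraph above.

```python
# Hypothetical mapping from Wikidata "is a" classes to acceptable OSM tags.
# Only the Q134447 -> power=plant entry is taken from this thread.
CLASS_TO_OSM_TAGS = {
    "Q134447": [("power", "plant")],
}


def types_compatible(instance_of, osm_tags):
    """Return True if any Wikidata class maps to a tag present on the OSM object."""
    for qid in instance_of:
        for key, value in CLASS_TO_OSM_TAGS.get(qid, []):
            if osm_tags.get(key) == value:
                return True
    return False


# Assumed example tags, for illustration only.
whole_plant = {"power": "plant", "name": "Kernkraftwerke Isar"}
one_reactor = {"power": "generator", "name": "Isar 2"}

print(types_compatible(["Q134447"], whole_plant))
print(types_compatible(["Q134447"], one_reactor))
```

With these assumed tags the first call accepts the whole plant and the
second rejects the single reactor, even though its name is among the
aliases.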


> > 2. Heligoland <http://wikidata.org/wiki/Q3038>: you’ve matched the island
> > Heligoland <http://www.openstreetmap.org/relation/3787052> but the right
> > match would be the municipality Heligoland
> > <http://www.openstreetmap.org/relation/1157962> (for the island there
> > exists a different object in Wikidata)
>
> I can't find the Wikidata item that represents the island.
>


island: https://www.wikidata.org/wiki/Q3129772
municipality: https://www.wikidata.org/wiki/Q3038
archipelago: https://www.wikidata.org/wiki/Q17515918


> > I also don’t understand why you prefer nodes to ways or relations.
> > Ways and relations provide more information (e.g. the extent of an area)
> > than nodes. The matching algorithm should first look for relations; when
> > there’s no relation it should search for ways. Nodes should come last.
>
> The matching algorithm is only considering objects within 400m, so the
> nodes happen to be close, but the centre of the relation is more than 400m
> from the location in Wikidata.
>
> I've modified my matching algorithm to use much larger distances for some
> types of object; it is running now. My hope is that when it is finished the
> code will detect the presence of the node and relation and skip the
> Wikidata item. Most of these node vs relation mismatches should disappear.
>

The radius for natural and administrative features should be much bigger.
For example, if you want to find the island Hispaniola you'll need a radius
of 93 km. There are also big glaciers, lakes, etc.
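
A per-type search radius could be sketched like this. The 400 m default is
the limit mentioned in this thread; the other radii, the kind names and
the function names are assumptions for illustration and would need tuning.

```python
import math

DEFAULT_RADIUS_M = 400  # limit mentioned in this thread

# Assumed, illustrative values only.
RADIUS_BY_KIND_M = {
    "island": 100_000,
    "glacier": 50_000,
    "lake": 50_000,
    "administrative": 100_000,
}


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres on a sphere of radius 6371 km."""
    r = 6_371_000
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def within_radius(kind, wd_coord, osm_coord):
    """Compare the distance against the radius configured for this kind of feature."""
    limit = RADIUS_BY_KIND_M.get(kind, DEFAULT_RADIUS_M)
    return haversine_m(*wd_coord, *osm_coord) <= limit


# Two points half a degree of latitude apart (roughly 55 km).
print(within_radius("island", (19.0, -71.0), (19.5, -71.0)))
print(within_radius("building", (19.0, -71.0), (19.5, -71.0)))
```

The same pair of coordinates is accepted for an island and rejected for a
kind that falls back to the 400 m default.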


>
> > What does your matching algorithm do when a Wikidata object describes
> > different objects and therefore should be split?
> >
> > A good example of this is the Wikidata object for Thasos
> > <https://www.wikidata.org/wiki/Q204096> (currently it describes the island
> > and the municipality “Thasos”) but the object has to be split into two
> > Wikidata objects so that you can say “the island Thasos lies in the
> > administrative division Thasos”. There are also other examples like mixed
> > up nature reserves, lakes and administrative divisions in Wikidata which
> > you have to solve before you can import the IDs into OSM.
>
> My code doesn't do anything special with a Wikidata item that represents
> multiple things like islands and municipalities. If Wikidata/Wikipedia
> claim a thing is an island, and in OSM there is a thing tagged with
> place=island and the same name, they will match.
>
> OSM objects can be tagged as both an island and a municipality.

I'd propose dropping Wikidata objects which have any of the following
property combinations:
"is a" island and at the same time administrative division
"is a" nature reserve and administrative division
"is a" lake and administrative division
"is a" forest and administrative division
These are the combinations where I've encountered problems in Wikidata so
far.
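
The filter above can be sketched as follows. This is only an illustration:
a real implementation would compare Wikidata class IDs, which I have
replaced with plain labels here, and the function name is hypothetical.

```python
# Plain labels stand in for Wikidata class IDs in this sketch.
PHYSICAL_KINDS = {"island", "nature reserve", "lake", "forest"}
ADMIN_KIND = "administrative division"


def is_ambiguous(kinds):
    """True if the item is an administrative division and also a physical feature."""
    kinds = set(kinds)
    return ADMIN_KIND in kinds and bool(kinds & PHYSICAL_KINDS)


print(is_ambiguous(["island", "administrative division"]))  # item would be dropped
print(is_ambiguous(["island"]))                             # item would be kept
```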

Another problem here is the municipality Langeneß
(https://www.wikidata.org/wiki/Q29931): the algorithm matches it to the
island, which is also called "Langeneß". But the island has its own
Wikidata object (https://www.wikidata.org/wiki/Q13747872). OSM tags and
Wikidata properties ("is a", P31) should be compared, and only if the
attributes agree should there be a match.

Or Mawson Peak <http://www.openstreetmap.org/node/2774722248>: the
algorithm matched it to Big Ben (volcano)
<https://www.wikidata.org/wiki/Q858516>, but it should be Mawson Peak
<https://www.wikidata.org/wiki/Q2114101> (Mawson Peak is the highest point
of the volcano "Big Ben"). It seems that the algorithm focuses too much on
aliases in Wikidata.
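
One way to reduce the weight of aliases would be to try labels first and
fall back to aliases only when no label matches. A minimal sketch; the
candidate structure, the function name and the alias list for Q858516 are
assumptions for illustration.

```python
def best_match(osm_name, candidates):
    """Prefer an exact label match; use aliases only if no label matches.

    Returns the item id, or None when the result is missing or ambiguous.
    """
    label_hits = [c for c in candidates if c["label"] == osm_name]
    if label_hits:
        return label_hits[0]["id"] if len(label_hits) == 1 else None
    alias_hits = [c for c in candidates if osm_name in c["aliases"]]
    return alias_hits[0]["id"] if len(alias_hits) == 1 else None


# Assumed data: the alias on Q858516 is illustrative.
candidates = [
    {"id": "Q858516", "label": "Big Ben", "aliases": ["Mawson Peak"]},
    {"id": "Q2114101", "label": "Mawson Peak", "aliases": []},
]

print(best_match("Mawson Peak", candidates))  # Q2114101
```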