[OSM-talk] Adding Wikidata tags to 70k items automatically

Edward Betts edward at 4angle.com
Sun Aug 31 18:19:04 UTC 2014


Archer <arch3r at gulli.com> wrote:
> Please don’t understand me wrong. I’m a big fan of Wikidata but I'm against
> an automated import. The mismatches list gives good examples that your
> matching algorithm doesn't work very well:
> http://edwardbetts.com/osm-wikidata/mismatches.html
> 
> Some examples:
> 
> 1. Isar Nuclear Power Plant <http://wikidata.org/wiki/Q569510>: your
> algorithm matches only one reactor of the power plant: Isar 2
> <http://www.openstreetmap.org/way/32918120> but the right matching
> would be Kernkraftwerke Isar
> <http://www.openstreetmap.org/way/23802422>

Q569510 matches Isar 2 (Way 32918120) because "Isar 2" is in the list of
German aliases on the Wikidata item:

[ "KKW Isar", "AKW Isar", "Isar 2", "Kernkraftwerk Isar I", "Isar 1",
  "Atomkraftwerk Isar" ]

The German label on the Wikidata item is "Kernkraftwerke Isar"; notice the
extra 'e' at the end of the first word.

I could add Levenshtein distance calculations to my matching, so that names
differing by a single character count as a match. With this change both OSM
objects would match, and my code would skip the Wikidata item as ambiguous.

The problem with this change is that "hill" and "hall" would also match.
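
A minimal sketch of that idea in Python, not the code I am actually running.
The function names and the max_distance parameter are made up for
illustration, and the strings are only examples:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def names_match(a: str, b: str, max_distance: int = 1) -> bool:
    """Hypothetical rule: names match if at most max_distance edits apart."""
    return levenshtein(a.lower(), b.lower()) <= max_distance


# One extra 'e' is a single edit, so these two would match.
print(names_match("Kernkraftwerke Isar", "Kernkraftwerk Isar"))
# The false positive: "hill" and "hall" are also a single edit apart.
print(names_match("hill", "hall"))
```

A threshold relative to the name length might reduce false positives on
short names, but I haven't tried that.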

> 2. Heligoland <http://wikidata.org/wiki/Q3038>: you’ve matched the island
> Heligoland <http://www.openstreetmap.org/relation/3787052> but the right
> match would be the municipality Heligoland
> <http://www.openstreetmap.org/relation/1157962> (for the island there
> exists a different object in Wikidata)

I can't find the Wikidata item that represents the island.

> 3. Puerto Rico <http://wikidata.org/wiki/Q1183>: the Wikidata objects says
> „is a unincorporated area of the United states“ – the right match therefore
> would be the administrative relation: Puerto Rico
> <http://www.openstreetmap.org/relation/306157> but your algorithm matches
> the island: Island of Puerto Rico
> <http://www.openstreetmap.org/node/357271412>

The English Wikipedia article Puerto Rico is in the 'Islands of Puerto Rico'
category, so my code considers Q1183 to represent an island. Node 357271412 is
tagged as place=island, so it is a perfect match.

We could argue that the node doesn't have much purpose in OSM and that its
tags could be merged into Relation 306157.

> I also don’t understand why you prefer nodes instead of ways or relations.
> Ways and relations provide more information (e.g. extent of an area) than
> nodes. The Matching algorithm should first look for relations, when there’s
> no relation it should search for ways. Nodes should come last.

The matching algorithm only considers OSM objects within 400m of the location
in Wikidata. The nodes happen to be that close, but the centre of the relation
is more than 400m away.
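
To illustrate the distance cut-off, here is a rough Python sketch. It assumes
each OSM candidate has already been reduced to a single point (for a relation,
its centre), uses the haversine formula, and the dict layout and function
names are hypothetical rather than taken from my code:

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within_radius(item_lat, item_lon, candidates, radius_m=400):
    """Keep only the candidates whose point lies within radius_m of the item."""
    return [c for c in candidates
            if haversine_m(item_lat, item_lon, c["lat"], c["lon"]) <= radius_m]


# Made-up coordinates: a node about 110m away and a relation centre
# about 1.1km away from the item's location.
candidates = [
    {"id": "node", "lat": 54.181, "lon": 7.885},
    {"id": "relation", "lat": 54.190, "lon": 7.885},
]
nearby = within_radius(54.180, 7.885, candidates)
print([c["id"] for c in nearby])
```

With a fixed 400m radius only the node survives, which is the effect
described above.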

I've modified my matching algorithm to use much larger distances for some
types of object, and it is running now. My hope is that when it has finished,
the code will detect the presence of both the node and the relation and skip
the Wikidata item. Most of these node vs relation mismatches should then
disappear.
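
The logic is roughly the following Python sketch. The radius values and all
names here are placeholders for illustration, not the numbers I am using:

```python
DEFAULT_RADIUS_M = 400

# Hypothetical per-type search radii in metres; large features get more room.
RADIUS_BY_TYPE_M = {
    "island": 20_000,
    "municipality": 20_000,
}


def radius_for(item_type):
    """Search radius for an item type, falling back to the 400m default."""
    return RADIUS_BY_TYPE_M.get(item_type, DEFAULT_RADIUS_M)


def pick_unambiguous(matches):
    """Return the only match, or None when there are zero or several."""
    return matches[0] if len(matches) == 1 else None


# With the larger radius both objects are found, so the item is skipped.
print(pick_unambiguous(["node 357271412", "relation 306157"]))
```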

> What does your matching algorithm when a Wikidata object describes
> different objects and therefore should be split?
> 
> A good example for this is the Wikidata object for Thasos
> <https://www.wikidata.org/wiki/Q204096> (currently it describes the island
> and the municipality “Thasos”) but the object has to be split into two
> Wikidata objects so that you can say “the island Thasos lies in the
> administrative division Thasos”. There are also other examples like mixed
> up nature reserves, lakes and administrative divisions in Wikidata which
> you have to solve before you can import the IDs into OSM.

My code doesn't do anything special with a Wikidata item that represents
multiple things, such as an island and a municipality. If Wikidata/Wikipedia
claim a thing is an island, and in OSM there is a thing tagged with
place=island and the same name, they will match.
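
As a simplified Python sketch of that rule: it assumes the item's types have
already been mapped to OSM place=* values, and the function and parameter
names are hypothetical:

```python
def is_match(item_types, item_names, osm_tags):
    """True if the OSM object's place value and name both fit the item."""
    return (osm_tags.get("place") in item_types
            and osm_tags.get("name") in item_names)


# An item that is considered both an island and a municipality.
item_types = {"island", "municipality"}
item_names = {"Thasos"}

print(is_match(item_types, item_names, {"place": "island", "name": "Thasos"}))
print(is_match(item_types, item_names, {"place": "village", "name": "Thasos"}))
```

Nothing in this check depends on the item having only one type.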

OSM objects can be tagged as both an island and a municipality.

-- 
Edward.


