[OSM-talk] 176k Wikidata tags to add to OSM
edward at 4angle.com
Mon Nov 24 11:28:00 UTC 2014
This is a progress report about my attempt to match Wikidata items and OSM
Here are some pages about adding Wikidata identifiers to OSM:
The list is available here, split up by English Wikipedia category:
Some OSM/Wikidata items will appear in multiple categories.
Each page of results is sorted by distance, then by the English Wikidata
label. The results include links to Wikidata, the location on OSM from
Wikidata and the matched OSM object.
A quick recap about how my system works. I have a list of categories on
Wikipedia with the appropriate tags on OpenStreetMap. For example, articles in
the subcategories of the category "Airports by Country" should appear on the
map tagged as aeroway=aerodrome.
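A minimal sketch of such a category list, assuming a plain Python dict. Only the airport entry comes from the description above; the names CATEGORY_TAGS, expected_tags and has_expected_tag are hypothetical.

```python
# Hypothetical mapping from an English Wikipedia category to the OSM tags
# expected on matching objects. Only the airport entry is from the post.
CATEGORY_TAGS = {
    "Airports by Country": [("aeroway", "aerodrome")],
}


def expected_tags(category):
    """Return the (key, value) tag pairs expected for a category, or []."""
    return CATEGORY_TAGS.get(category, [])


def has_expected_tag(osm_tags, category):
    """True if an OSM tag dict carries at least one expected tag."""
    return any(osm_tags.get(k) == v for k, v in expected_tags(category))
```

Keeping the tags as a list of pairs leaves room for categories that map to more than one acceptable tag.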
I use a Wikimedia Labs tool called CatScan to get a list of every article in
the category or subcategory: https://tools.wmflabs.org/catscan2/catscan2.php
For each article in English Wikipedia there is a matching item in Wikidata. I
use the Wikidata API to find the Wikidata items within the category. Items
without coordinates are skipped.
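The coordinate filter can be sketched as below. It assumes the input is the "entities" object from a Wikidata wbgetentities response, where coordinates are stored as claims on property P625 (coordinate location). The function names are hypothetical and this is not necessarily how the real code is organised.

```python
def entity_coords(entity):
    """Return (lat, lon) from an entity's P625 claim, or None if absent."""
    for claim in entity.get("claims", {}).get("P625", []):
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue  # "novalue" and "somevalue" snaks carry no datavalue
        value = snak["datavalue"]["value"]
        return value["latitude"], value["longitude"]
    return None


def with_coords(entities):
    """Yield (qid, lat, lon) for entities that have coordinates."""
    for qid, entity in entities.items():
        coords = entity_coords(entity)
        if coords is not None:  # items without coordinates are skipped
            yield qid, coords[0], coords[1]
```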
Once all the categories are processed I have a list of Wikidata items that
include coordinates and the label in multiple languages. I split this list up
by coordinates into half degree squares. I use the Overpass API to look for
OSM objects (nodes, ways and relations) with a name and the expected tags.
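A rough sketch of the half degree split and the Overpass lookup, assuming one query per square and per expected tag. The helper names and the exact query shape are assumptions; the bounding box order (south, west, north, east) is the one Overpass QL uses.

```python
import math

CELL = 0.5  # cell size in degrees, as described above


def cell_for(lat, lon):
    """Return the south-west corner of the half degree square for a point."""
    return (math.floor(lat / CELL) * CELL, math.floor(lon / CELL) * CELL)


def overpass_query(cell, key, value):
    """Build an Overpass QL query for named objects with key=value in a cell."""
    south, west = cell
    bbox = f"({south},{west},{south + CELL},{west + CELL})"
    selector = f'["{key}"="{value}"]["name"]{bbox};'
    kinds = ("node", "way", "relation")
    # "out tags bb" asks for the tags and the bounding box of each object.
    return "[out:json];(" + "".join(k + selector for k in kinds) + ");out tags bb;"
```

For example, overpass_query(cell_for(51.497, -0.144), "aeroway", "aerodrome") builds a query for the square whose south-west corner is 51.0, -0.5.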
The acceptable distance for most objects is 1km; for some entity types it has
been increased further. I've included a distance field in my results, so you
can see how far apart the matched items are.
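The distance check can be illustrated with the haversine formula. This is a sketch only: the choice of haversine and the names below are assumptions, and the 1km default is the limit quoted above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius
MAX_DISTANCE_KM = 1.0     # default limit; larger for some entity types


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def within_range(wikidata_point, osm_point, limit_km=MAX_DISTANCE_KM):
    """True if two (lat, lon) points are no further apart than the limit."""
    return haversine_km(*wikidata_point, *osm_point) <= limit_km
```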
The names in the OSM object are compared with the labels and aliases in the
Wikidata item. The code looks at the various name keys listed in the
http://wiki.openstreetmap.org/wiki/Key:name page. I exclude old_name from the
The matching code considers addr:housename and can match buildings with
Wikidata item labels that are street addresses to the addr:housenumber and
addr:street tags. For example "8 Canada Square" will match a building tagged
with "addr:housenumber=8" and "addr:street=Canada Square".
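The name and address comparisons can be sketched as follows. The list of name keys is an assumption based on the Key:name wiki page, the normalisation (lower-casing and collapsing whitespace) is a guess, and all function names are hypothetical.

```python
# Name-like keys considered for matching; old_name is left out on purpose.
# This particular list is an assumption, not taken from the real code.
NAME_KEYS = {"name", "alt_name", "int_name", "loc_name", "nat_name",
             "official_name", "reg_name", "short_name"}


def norm(text):
    """Lower-case and collapse whitespace for a loose comparison."""
    return " ".join(text.lower().split())


def osm_names(tags):
    """Collect normalised names from name-like keys, including name:xx."""
    return {norm(value) for key, value in tags.items()
            if key in NAME_KEYS or key.startswith("name:")}


def name_match(wikidata_names, tags):
    """True if any Wikidata label or alias equals an OSM name."""
    return bool({norm(n) for n in wikidata_names} & osm_names(tags))


def address_match(label, tags):
    """True if a Wikidata label matches the address tags of an OSM object."""
    wanted = norm(label)
    housename = tags.get("addr:housename")
    if housename and norm(housename) == wanted:
        return True
    number = tags.get("addr:housenumber")
    street = tags.get("addr:street")
    if number and street:
        return norm(f"{number} {street}") == wanted
    return False
```

With these definitions, address_match("8 Canada Square", {"addr:housenumber": "8", "addr:street": "Canada Square"}) is true.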
The Overpass API can calculate the centroid of an OSM object, which is what I
used in the past. I've switched to using the bounding box for the object, which
gives better results for large objects like lakes and forests.
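One way a bounding box could be used in the distance check is sketched below: the distance is zero when the Wikidata point lies inside the box, otherwise it is measured to the nearest point of the box. This is an assumption about the approach, not a description of the real code. Clamping in latitude and longitude is only an approximation of the nearest point, and boxes crossing the antimeridian are ignored.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def clamp(value, low, high):
    """Restrict value to the interval [low, high]."""
    return max(low, min(value, high))


def distance_to_bbox_km(lat, lon, bbox):
    """Distance in km from a point to a (south, west, north, east) box."""
    south, west, north, east = bbox
    near_lat = clamp(lat, south, north)
    near_lon = clamp(lon, west, east)
    return haversine_km(lat, lon, near_lat, near_lon)
```

A point in the middle of a large lake then counts as distance zero, even if it is several km from the lake's centroid.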
The result is that I now have a list of 176,794 OSM objects and matching
Wikidata items. The whole process of extracting the data and looking for
matches takes about three days to run. This is after quite a few changes
to speed it up. I think there are still more improvements possible. I will
post the code on github soon.
It has been suggested that I shouldn't be using Wikipedia at all and should
instead be looking at the 'instance of' property in Wikidata. Using English
Wikipedia introduces an English-language bias: there are items in Wikidata
without an associated article in English Wikipedia. I use Wikipedia
categories because use of the 'instance of' property is very
patchy. The majority of the items in my result list don't include the
'instance of' property. A related piece of work will be to populate this
field in Wikidata, but for now I'm focused on linking OSM and Wikidata.
The system gets confused by chains of restaurants and shops. The Wikidata item
will often include the coordinates of the headquarters. The name will match
with a nearby store. I should be able to fix this by filtering out Wikidata
chain store items.
Example: John Lewis - UK department store chain
Wikidata coordinates are 51.497, -0.144 near Victoria station.
The match is for the flagship store in Oxford Circus, 2km from the HQ.
Some of the coordinates in Wikipedia and Wikidata are wrong; there are many
cases where the location in Wikidata is 5km or more from where it should be.
London Hackspace moved from Islington to Hackney in 2009. The location has
been updated on OSM, but Wikidata still has the old location:
There are two pubs in London called Barley Mow that are less than 1km apart,
both are mapped on OSM. One of the pubs has an item in Wikidata (Q17985738).
My code is matching it to the wrong pub. I will fix this.
When checking the results for fountains I found that the Butt-Millet Memorial
Fountain is mapped twice in different locations:
There are already 25k things with a Wikidata tag in OSM. When I compare this
list with my generated list I find 2,000 cases where a given Wikidata ID is
assigned to a different OSM object from the one picked by my system. In
addition there are 26 OSM objects with a wikidata tag pointing at a different
Many of these mismatches are villages, towns and municipalities in Germany.
One possible way to solve this is if I exclude settlements in Germany from my
list. It looks like Germany doesn't need an automated import of Wikidata tags.
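The comparison between existing wikidata tags and the generated list can be sketched like this. It assumes both sides are available as dicts from an OSM object id (for example "node/1") to a Wikidata ID; the data layout and the function name are hypothetical.

```python
def compare(existing, generated):
    """Find disagreements between existing tags and generated matches.

    Both arguments map an OSM object id to a Wikidata ID.
    Returns (moved, retagged):
      moved    - the Wikidata ID is already used on a different OSM object
      retagged - the OSM object already has a different Wikidata ID
    """
    existing_by_qid = {}
    for osm_id, qid in existing.items():
        existing_by_qid.setdefault(qid, set()).add(osm_id)

    moved, retagged = [], []
    for osm_id, qid in generated.items():
        tagged = existing_by_qid.get(qid)
        if tagged and osm_id not in tagged:
            moved.append((qid, sorted(tagged), osm_id))
        current = existing.get(osm_id)
        if current and current != qid:
            retagged.append((osm_id, current, qid))
    return moved, retagged
```

Both lists are worth checking by hand, since either side of a disagreement may be the wrong one.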
There are many more OSM objects tagged with a wikipedia tag. I haven't tried
comparing them to my results.
I'm going to continue to refine my results and reduce the number of false
positives. Once I'm happy with the list I'll post it here. When we have
reached consensus I'll add the Wikidata tags to OSM. I won't upload my results
as a single changeset, I'll split it up by region, maybe in one degree squares.