[OSM-talk] Adding wikidata tags to the remaining objects with only wikipedia tag

Andy Townsend ajt1047 at gmail.com
Thu Sep 28 00:54:16 UTC 2017

On 26/09/2017 18:08, Yuri Astrakhan wrote:
>       When data consumers want to get a link to corresponding wikipedia
>     article, doing that with wikipedia[:xx] tags is straightforward. Doing
>     the same with wikidata requires additional pointless and time
>     consuming abrakadabra.
> no, you clearly haven't worked with any data consumers recently. Data 
> consumers want Wikidata, much more than wikipedia tags - please talk 
> to them.

That would be me in a former job, I think.

One of the things I used to spend a lot of time on was finding 
ways to encode data so that knowledge could be shared by e.g. field 
engineers, and then analysing the results to find out 
what was related to what, what caused what, and how much store you could 
set by a particular result or prediction.  There are a couple of points 
worth sharing from that experience:

1) The first point to make about human-contributed data is that it's 
variable.  Some people will say something is probably an X, some people 
probably a Y.  The reality is that they're actually both right some of 
the time.  You might think (in the context of e.g. shop brands) "hang on 
- surely a shop can be only one brand?  It must be _either_ X or Y!" but 
you'd be wrong.  There are _always_ exceptions, and there will always be 
"errors" - you just don't know which way is right and which wrong.

2) The second point that's relevant here is that codes such as CODE1, 
CODE2 etc. are to be avoided at all costs since they don't enable any 
natural visualisation of what's been captured.  You have already said 
"but surely every system that displays data can look up the description" 
but anyone familiar with the real world knows that that simply won't 
happen.  This means that there's no way for an ordinary mapper to verify 
whether the magic code on an OSM item is correct or not.  Verifiability 
is one of the key concepts of OSM (see 
https://wiki.openstreetmap.org/wiki/Verifiability et al) and anything 
that moves away from it means that data isn't going to be maintained, 
because people simply won't understand what it means.  I suspect that a 
key part of the success of OSM was the reliance on natural 
language-based keys and values, and a loose tagging scheme that allowed 
easy expansion.

3) The third point is that a database that has been "cleaned" so that 
there are no "errors" in it is worth far less than one that hasn't, when 
you're trying to understand the complex relationships between objects.  
This goes against most normal data-processing instincts, because 
normally you would try to ensure that data has full referential 
integrity - but where there are edge cases (and as per (1) above there 
are always edge cases) different consumers will very likely want to 
treat those edge cases differently, which they can't do if someone has 
"helpfully" merged all the edge cases into more popular categories.

To be blunt, if I were trying to process OSM data and needed to get 
into the wikidata/wikipedia world based on it (for example because I 
wanted the municipal coat of arms - something not in OSM) I'd take a 
wikipedia link over a wikidata one every time because all mappers will 
have been able to see the text of the wikipedia link rather than just 
something like Q123456.  You've made the point that things change in 
wikipedia regularly (articles get renamed etc.), but it's important to 
remember that things change in the real world all the time as well - and 
a link that's suddenly pointing at something different in wikipedia is 
immediately apparent, whereas if Q123456 were no longer relevant 
(because the real-world thing had changed), the mismatch wouldn't be 
apparent at all.
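To make the contrast concrete, here is a minimal sketch (the tag values are hypothetical examples, not real OSM data) of what a consumer or a mapper actually sees in each case:

```python
# Sketch: what a human can read directly from each kind of OSM tag.
# Tag values below are hypothetical examples.

def wikipedia_url(tag_value):
    """A wikipedia=xx:Title tag maps straight to a readable URL."""
    lang, _, title = tag_value.partition(":")
    return f"https://{lang}.wikipedia.org/wiki/{title.replace(' ', '_')}"

def wikidata_url(tag_value):
    """A wikidata=Q... tag yields only an opaque identifier; the
    label still has to be looked up via the Wikidata API or a dump."""
    return f"https://www.wikidata.org/wiki/{tag_value}"

tags = {"wikipedia": "en:Canterbury Cathedral", "wikidata": "Q123456"}
print(wikipedia_url(tags["wikipedia"]))  # readable at a glance
print(wikidata_url(tags["wikidata"]))    # still just Q123456 until resolved
```

The wikipedia tag is verifiable by eye; the wikidata tag needs an extra resolution step before anyone can tell what it points at.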

All that said, I don't see wikidata as a key component (or even a very 
useful component) of OSM - but we all map things that are of interest to 
us - some people map in great detail the style of British telephone 
boxes or the "Royal Cipher" on postboxes which I see absolutely no point 
in, but if it's verifiable, why not - I'm sure I'm mapping stuff that is 
irrelevant to them.  A problem with wikidata (as noted above) is that 
I'm not sure that it _is_ verifiable data - I suspect it'll get stale 
after adding and never be maintained, simply because people will never 
notice that it's wrong.

(and on an unrelated comment in the same message)

> Sure, it can be via dump parsing, but it is a much more complicated 
> than querying.  Would you rather use Overpass turbo to do a quick 
> search for some weird thing that you noticed, or download and parse 
> the dump?  Most people would rather do the former.

It depends - if you want to do a "quick search for something" then an 
equivalent to overpass turbo might be an option, but in the real world 
what you'd _actually_ want to do is a local database query. 
Unfortunately that side of things seems to be completely missing (or at 
least very well-hidden) - wikidata seems to be quite immature in that 
respect.   Where's the "switch2osm" for wikidata?  Where's the 
"osm2pgsql" or "osmosis"?  Sure I can download 20Gb of gzipped JSON from 
https://dumps.wikimedia.org/wikidatawiki/entities/20170925/ and try and 
write some sort of parser based on 
https://www.mediawiki.org/wiki/Wikibase/DataModel/JSON , but this seems 
very much like going back to banging the rocks together (and no, a 
third-party query interface that depends on an external network 
connection such as https://query.wikidata.org/ or anything else isn't a 
better option).
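For what it's worth, the dump format described at the Wikibase JSON page is one big JSON array with each entity on its own line, so "some sort of parser" can at least stream it rather than load 20 GB at once. A rough sketch (the sample entity is made up; a real dump would also need gzip decompression):

```python
import json

def iter_entities(lines):
    """Stream entities from a Wikidata JSON dump.  The dump is one
    large JSON array, but each entity sits on its own line with a
    trailing comma, so it can be parsed line by line."""
    for line in lines:
        line = line.strip()
        if line in ("[", "]", ""):
            continue  # skip the array delimiters
        yield json.loads(line.rstrip(","))

# Tiny inline stand-in for a real dump file (hypothetical entity):
sample = [
    "[",
    '{"id": "Q123456", "labels": {"en": {"language": "en", "value": "Example"}}},',
    "]",
]
for ent in iter_entities(sample):
    print(ent["id"], ent["labels"]["en"]["value"])  # Q123456 Example
```

That gets the data out, but it's still a long way from a "wikidata2pgsql" that would make local querying routine.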

