[OSM-talk] Adding wikidata tags to the remaining objects with only wikipedia tag
ajt1047 at gmail.com
Thu Sep 28 00:54:16 UTC 2017
On 26/09/2017 18:08, Yuri Astrakhan wrote:
> When data consumers want to get a link to corresponding wikipedia
> article, doing that with wikipedia[:xx] tags is straightforward. Doing
> the same with wikidata requires additional pointless and time
> consuming abrakadabra.
> no, you clearly haven't worked with any data consumers recently. Data
> consumers want Wikidata, much more than wikipedia tags - please talk
> to them.
That would be me in a former job, I think.
One of the things that I used to spend a lot of time doing was finding
ways to encode data so that knowledge could be shared by e.g. field
engineers, and then analysing those results so that you can find out
what was related to what, what caused what, and how much store you can
set by a particular result or prediction. There are a couple of points
worth sharing from that experience:
1) The first point to make about human-contributed data is that it's
variable. Some people will say something is probably an X, some people
probably a Y. The reality is that they're actually both right some of
the time. You might think (in the context of e.g. shop brands) "hang on
- surely a shop can be only one brand? It must be _either_ X or Y!" but
you'd be wrong. There are _always_ exceptions, and there will always be
"errors" - you just don't know which way is right and which wrong.
2) The second point that's relevant here is that codes such as CODE1,
CODE2 etc. are to be avoided at all costs since they don't enable any
natural visualisation of what's been captured. You have already said
"but surely every system that displays data can look up the description"
but anyone familiar with the real world knows that that simply won't
happen. This means that there's no way for an ordinary mapper to verify
whether the magic code on an OSM item is correct or not. Verifiability
is one of the key concepts of OSM (see
https://wiki.openstreetmap.org/wiki/Verifiability et al) and anything
that moves away from it means that data isn't going to be maintained,
because people simply won't understand what it means. I suspect that a
key part of the success of OSM was the reliance on natural
language-based keys and values, and a loose tagging scheme that allowed
3) The third point is that a database that has been "cleaned" so that
there are no "errors" in it is worth far less than one that hasn't, when
you're trying to understand the complex relationships between objects.
This goes against most normal data processing instincts because
obviously normally you'd try and ensure that data has full referential
integrity - but where there are edge cases (and as per (1) above there
are always edge cases) different consumers will very likely want to
treat those edge cases differently, which they can't do if someone has
"helpfully" merged all the edge cases into more popular categories.
To be blunt, if I was trying to process OSM data and had a need to get
into the wikidata/wikipedia world based on it (for example because I
wanted the municipal coat of arms - something not in OSM) I'd take a
wikipedia link over a wikidata one every time because all mappers will
have been able to see the text of the wikipedia link rather than just
something like Q123456. You've made the point that things change in
wikipedia regularly (articles get renamed etc.), but it's important to
remember that things change in the real world all the time as well - and
a link that's suddenly pointing at something different in wikipedia is
immediately apparent, whereas if Q123456 was no longer relevant (because
the real world thing has changed) it wouldn't be.
All that said, I don't see wikidata as a key component (or even a very
useful component) of OSM - but we all map things that are of interest to
us - some people map in great detail the style of British telephone
boxes or the "Royal Cipher" on postboxes which I see absolutely no point
in, but if it's verifiable, why not - I'm sure I'm mapping stuff that is
irrelevant to them. A problem with wikidata (as noted above) is that
I'm not sure that it _is_ verifiable data - I suspect it'll get stale
after adding and never be maintained, simply because people will never
notice that it's wrong.
(and on an unrelated comment in the same message)
> Sure, it can be via dump parsing, but it is a much more complicated
> than querying. Would you rather use Overpass turbo to do a quick
> search for some weird thing that you noticed, or download and parse
> the dump? Most people would rather do the former.
It depends - if you want to do a "quick search for something" then an
equivalent to overpass turbo might be an option, but in the real world
what you'd _actually_ want to do is a local database query.
Unfortunately that side of things seems to be completely missing (or at
least very well-hidden) - wikidata seems to be quite immature in that
respect. Where's the "switch2osm" for wikidata? Where's the
"osm2pgsql" or "osmosis"? Sure I can download 20GB of gzipped JSON from
https://dumps.wikimedia.org/wikidatawiki/entities/20170925/ and try and
write some sort of parser based on
https://www.mediawiki.org/wiki/Wikibase/DataModel/JSON , but this seems
very much like going back to banging the rocks together (and no, a
third-party query interface that depends on an external network
connection such as https://query.wikidata.org/ or anything else isn't a
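
To give a rough idea of what the "write some sort of parser" route
involves, here is a minimal Python sketch. It assumes the dump layout
described on the Wikibase JSON page: one large JSON array with one
entity per line, each entity carrying an "id" and a "sitelinks" map
keyed by site (e.g. "enwiki"). The function names (iter_entities,
sitelink_index, index_dump) are made up for illustration, and a real
import into a local database would need far more than this.

```python
import gzip
import json
from typing import Dict, Iterable, Iterator


def iter_entities(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one entity dict per line of a Wikidata JSON dump.

    Assumes the dump is a single JSON array with one entity per line,
    so the bare "[" / "]" lines and trailing commas are skipped.
    """
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue
        yield json.loads(line)


def sitelink_index(lines: Iterable[str], site: str = "enwiki") -> Dict[str, str]:
    """Map entity IDs (e.g. "Q42") to the article title on one wiki."""
    index: Dict[str, str] = {}
    for entity in iter_entities(lines):
        link = entity.get("sitelinks", {}).get(site)
        if link is not None:
            index[entity["id"]] = link["title"]
    return index


def index_dump(path: str, site: str = "enwiki") -> Dict[str, str]:
    """Stream a gzipped dump from disk without unpacking it first."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return sitelink_index(handle, site)
```

Streaming line by line avoids loading the whole dump into memory, but
the result is still only an in-memory dictionary, not the queryable
local database asked for above.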