[Tagging] Discussion about Multivalued Keys
meillo at marmaro.de
Thu Jan 28 10:15:57 UTC 2016
I'd like to share some thoughts about the ``How to implement MV in
OSM'' question, as opened in:
I'd prefer to first have explicit agreement that we actually need
MV ... but as the implementation discussion is already rolling ...
Initially, I wanted to add a section to the talk page of the wiki
page but my text now appears to be better suited for an email. Please
feel free to integrate any part of my considerations into the wiki.
I see two general ways to solve the problem of MV in OSM:
1) Allow multiple identical keys per object (as it was before API
0.6, I learned). This means tag names of one object need *not* be
unique. When we talk about tag names being unique, we should
distinguish between being unique in the data storage and being
unique at the surface (GUI). It seems ... well, what does it seem?
Are we more concerned about the technical storage level or about
the user experience? Which of them are we discussing?
2) Make multiple identical keys unique by some *technical* measure.
This is the currently assumed way to go, or at least so it appears
to me.
I (now) think that it is important to keep the value domain free
from logic and thus have it reserved for literal data. This means
that MVs need to be implemented in the key domain.
Currently, we mostly discuss concrete examples. The assumption
is that the user would have to deal with these suffixes. Maybe he
doesn't have to. It might be possible to abstract the user's view
from the internal storage. Then the actual encoding becomes
irrelevant from the user's perspective. Multiple identical keys
could be presented to him (even grouped) ... and they'd be
translated (e.g. by appending arbitrary suffixes (e.g. hashes of
the value)) at the interface to the data storage layer. (I focus
on unordered MVs here.)
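
To make the idea of such a translation layer more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration only: the separator character, the use of a truncated SHA-1 hash as suffix, and the function names to_storage and to_user are hypothetical and not part of any existing OSM software. For simplicity it gives every key a suffix, not just the repeated ones.

```python
import hashlib

# Hypothetical reserved separator (ASCII "unit separator"); the actual
# character would have to be agreed upon and forbidden in key names.
SEP = "\x1f"


def to_storage(tags):
    """Map user-facing (key, value) pairs to unique storage keys.

    The suffix is a short hash of the value, so two identical pairs
    collapse into one entry, which fits unordered MVs. A truncated hash
    can collide in principle; a real implementation would need to
    handle that case.
    """
    stored = {}
    for key, value in tags:
        suffix = hashlib.sha1(value.encode("utf-8")).hexdigest()[:8]
        stored[key + SEP + suffix] = value
    return stored


def to_user(stored):
    """Strip the technical suffix again and return sorted (key, value) pairs."""
    return sorted((k.split(SEP, 1)[0], v) for k, v in stored.items())


# What the user enters: plain, repeated keys.
user_tags = [("ref", "foo"), ("ref", "bar"), ("name", "Main Street")]
stored = to_storage(user_tags)

# The storage layer sees three distinct keys; the user sees the original tags.
assert len(stored) == 3
assert to_user(stored) == sorted(user_tags)
```

The user only ever sees the output of to_user; the suffixes exist solely between the interface and the data storage.
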
As a user, I'd never want to have to deal with this MV problem at
all, which means no encoding should be required by me, neither in
the value *nor* in the key domain. If there are two refs, then I'd
want to tag: ref=foo + ref=bar. The internal storage should not be
the user's problem.
Of course, it's not that easy, because raw data is dealt with much
too often. Nonetheless, we should keep in mind that a separation
of the user's view from the data storage can solve colliding wishes.
Concerning the choice of how to add such a suffix:
We should realize what we try to do here: We're violating the
first normal form for relational databases, by encoding two
separate bits of information in one field (the key's name and some
unique suffix). We already came to the opinion that encoding
multiple values in one field in the value domain is bad ... but it
is equally bad in the key domain.
And it is even worse if the separator is not (technically)
reserved for that specific purpose. If we used the underscore
(_) to separate the key's name from the unique suffix, then the
technical separation of name and suffix would be pretty fragile,
because names already contain underscores. The split would be
guesswork, relying on the suffix being a number.
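
A small Python sketch of why this split is guesswork. The splitting rule (a trailing underscore plus digits is the suffix) is only my assumption of how such a parser might work, and zone_30 is a made-up key name standing in for any name that happens to end in a number.

```python
import re


def split_underscore(stored_key):
    """Guess (name, suffix): a trailing "_<digits>" is taken as the suffix."""
    match = re.fullmatch(r"(.+)_(\d+)", stored_key)
    if match:
        return match.group(1), match.group(2)
    return stored_key, None


# Intended case: the second "ref" of an object.
assert split_underscore("ref_2") == ("ref", "2")

# A name containing an underscore survives only because it does not
# happen to end in digits.
assert split_underscore("old_name") == ("old_name", None)

# Hypothetical key name that itself ends in a number: the parser cannot
# tell it apart from "zone" plus suffix "30" and splits it wrongly.
assert split_underscore("zone_30") == ("zone", "30")
```
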
Hence, if we do encode two separate values in one field, then we
had better try hard to make the separator reserved. This not only
spares us escaping, but also allows us to search for exact key
names, because the search engine can then be enabled to know which
is the name to compare and which is the suffix to ignore.
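
As a rough illustration of such a search, assuming a reserved separator exists: in the Python sketch below the choice of "#" as separator, the suffixes, the key ref_old and the helper names are all hypothetical.

```python
# Hypothetical reserved separator; assumed never to occur in a key name
# or in a suffix.
SEP = "#"


def key_name(stored_key):
    """Return the name part of a stored key; no escaping or guessing needed."""
    return stored_key.partition(SEP)[0]


def find_values(stored, name):
    """Return all values whose key name equals `name` exactly."""
    return sorted(v for k, v in stored.items() if key_name(k) == name)


stored = {
    "ref#a1": "foo",
    "ref#b2": "bar",
    "ref_old#c3": "baz",  # different name, must not match "ref"
    "name": "Main Street",  # keys without a suffix still work
}

assert find_values(stored, "ref") == ["bar", "foo"]
assert find_values(stored, "name") == ["Main Street"]
```

Because the separator cannot appear in a name, the search compares exact names and a key like ref_old is never mistaken for ref.
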
The underscore approach fails in this respect just as the colon
approach does. Of the currently discussed approaches, only the
subscript (bracket) syntax satisfies this need (assuming that there
are currently no brackets in key names). However, its closing bracket
is technically superfluous and only motivated by the thinking that
humans have to see these suffixes.
What we need, in my opinion, is one single character that must
never be part of any key name and never be part of any suffix.
Using this separator, we encode two separate bits of information
in one field (the key field) ... and thus have effectively three
columns in a two-column table.
At the surface (GUI) we should rather hide the technical suffix.
Ordered MVs are not considered here. It is not clear if we need
to consider them.