[OSM-dev] Disallowing certain characters in tag keys

Jochen Topf jochen at remote.org
Sat Oct 16 19:44:38 BST 2010


Hi!

I am currently fighting some issues where tags with strange characters in them
need to be represented in a URL for Taginfo. Lots of other websites probably
will have similar issues. Characters like /, ?, &, etc. have special meaning
in URLs so if they appear in tags I can't have those tags in URLs. Sometimes
escaping characters as %XX helps, sometimes not. And those problems are not
confined to web pages and URLs only. Special characters that need escaping
are often a problem.

We can't really do anything about that with regard to tag values, they must be
allowed to contain all those characters. But it would help at least a little if
we knew those characters can never appear in tag keys. And I can't really see a
legitimate reason why we need those characters in keys. Looking at the database
almost all cases where they appear in keys are obvious errors. Out of the about
20000 different keys, there are only about 190 keys with problematic characters
in them (another about 800 with whitespace). Really the only case that I can't
immediately rule out as errors or see an alternative tagging are tag keys like
"maxspeed:weight>7.5". And with those you can already see the problems: Some of
them have ">" instead of the ">".

So I'd like us to think about whether we can disallow a few characters from
appearing in tag keys. Technically this would mean changing the API to check
for those characters, removing any that are already in the database (can be
done with normal manual edits because there are so few cases) and adding checks
to the editors so that they can give meaningful error messages. Shouldn't be
too hard.

So, what characters am I talking about? I haven't drawn up a complete list
and we certainly would need to discuss this further.

Here is a preliminary list:

Whitespace   Should use '_' instead of whitespace in keys, whitespace are
             also very confusing for users, especially at beginning and end
             of a text.

<>&/+?#;%'"  Special characters in XML, HTML and/or URLs.

\'"          Characters often used for quoting.

=            Because its used in many places as the separation character
             between tag key and tag value. If we disallow this, we can always
             treat one string like "foo=bar" as k:foo, v:bar without any
             ambiguities.

This is a small list of special characters, all other characters should still
be allowed. That means tag keys can still be in Chinese or whatever. We'd just
disallow a few characters of which we know that they will make problems again
and again.

And to emphasize this again: I am only talking about tag keys. Tag values must
be allowed to contain the full Unicode set of characters.

Jochen
-- 
Jochen Topf  jochen at remote.org  http://www.remote.org/jochen/  +49-721-388298




More information about the dev mailing list