[OSM-dev] Disallowing certain characters in tag keys

Tue Oct 19 11:52:09 BST 2010

On Tue, Oct 19, 2010 at 10:25 AM, Jochen Topf <jochen at remote.org> wrote:
> On Tue, Oct 19, 2010 at 10:06:15AM +0100, Tom Hughes wrote:
>> On 16/10/10 19:44, Jochen Topf wrote:
>>
>>> I am currently fighting some issues where tags with strange characters in them
>>> need to be represented in a URL for Taginfo. Lots of other websites probably
>>> will have similar issues. Characters like /, ?, &, etc. have special meaning
>>> in URLs so if they appear in tags I can't have those tags in URLs. Sometimes
>>> escaping characters as %XX helps, sometimes not. And those problems are not
>>> confined to web pages and URLs only. Special characters that need escaping
>>> are often a problem.
>>
>> I really don't understand the problem here - as far as I know all
>> characters can be used in URLs so long as they are properly escaped. If
>> your server software is not coping with that for some reason then I
>> think it's a bug.
>
> That might well be a bug. But those bugs creep up all the time, because these
> things are hard to do and because the specs are not as clear as they should be.
> I am not saying these things can't be done right, but wouldn't it be nice if
> we can get rid of that problem instead of everybody writing software for OSM
> having to make sure all those cases are handled properly?
>
>> As a test I just created a file called '<>&+?#;%.html' in an apache
>> served directory and then asked Firefox to fetch:
>>
>>   http://server/%3c%3e%26%2b%3f%23%3b%25.html
>>
>> and it was retrieved just fine.
>
> And now try the same thing again creating a filename with a '/' in it and see
> whether it works this time. It doesn't, because '/' is special for Unix (and
> HTTP) and you need to create a directory with the first part of your name and
> then the second as file. If you would actually want to create one file for
> every key in the OSM database in your filesystem, you'd have a problem.
>
> Your example basically proves my point. :-)

No, it really doesn't.

Let's put it this way - there is a subset[1] of Unicode code points
that is valid for both keys and values. If you find any characters
emitted by OSM that lie outwith that range, then do let us know[3]. But
we've taken great care to permit all other code points in both keys
and values alike, since we've no idea when someone is going to need
them. Your example of why we need > (and presumably <) is actually a
great example to undermine your point.
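
For what it's worth, here is a rough sketch of that range check in
Python. It assumes the subset in [1] is the Char production from the
XML 1.0 spec; the function names are made up for illustration and are
not part of any OSM code.

```python
def is_xml_char(ch: str) -> bool:
    """True if ch matches the XML 1.0 'Char' production (assumed to be
    the subset referred to in footnote [1])."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # most of the BMP, below surrogates
        or 0xE000 <= cp <= 0xFFFD      # rest of the BMP, minus FFFE/FFFF
        or 0x10000 <= cp <= 0x10FFFF   # everything outside the BMP
    )


def is_valid_tag_text(text: str) -> bool:
    """Hypothetical helper: every character of a key or value is allowed."""
    return all(is_xml_char(ch) for ch in text)


# Punctuation such as / ? & < > is inside the range; NUL is not.
print(is_valid_tag_text("a/b?c&d<e>"))
print(is_valid_tag_text("bad\x00key"))
```

Note that nothing in that range singles out /, ?, & or friends.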

Some of these characters need escaping for particular purposes. If you
find a Unicode character that cannot be URL-encoded[2], then do let us
know. Or if you find another encoding scenario which can only encode a
subset of Unicode code points, let us know.
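
As a minimal sketch of what "properly escaped" can look like, here is
one way to do it with Python's standard urllib.parse. The helper name
and the sample keys are made up for illustration; the one deliberate
choice is safe="", because quote() leaves "/" alone by default.

```python
from urllib.parse import quote, unquote


def encode_key(key: str) -> str:
    """Percent-encode a tag key for use as a single URL path segment.

    safe="" makes quote() escape "/" as well, which it would otherwise
    leave unescaped.
    """
    return quote(key, safe="")


# Illustrative keys only: punctuation, a slash, non-ASCII, and a
# non-BMP character.
sample_keys = ["<>&+?#;%", "a/b", "name:日本語", "\U0001F600"]

for key in sample_keys:
    encoded = encode_key(key)
    # Decoding must give back exactly what went in.
    assert unquote(encoded) == key
    print(encoded)
```

Whether your web server then passes an encoded %2F through to the
application untouched is a separate question, and worth testing.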

Your application should be able to handle every valid input. You've
found that your application is buggy, and now you're asking for the
input to be changed. But just the keys, not the values, and only
current data, not historical data. It seems a bit ... weird. And your
original list of characters is completely arbitrary, not based on any
formal specification as far as I can see.

If your editor can't handle all necessary characters, fix the editor.
If your application can't handle all the characters, fix the
application. And if you find dealing with " or = or & in a key to be
"hard", it's probably worth taking some time to test with non-BMP
characters.

(If you later find that having ');DROP DATABASE;-- in a key or value
is breaking your database inserts, then please don't ask for these
characters to be banned too!)
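
A sketch of the usual fix, using Python's built-in sqlite3 purely as a
stand-in for whatever database you actually run. The table layout and
function name are invented for the example. The point is that the key
is passed as a bound parameter and never pasted into the SQL string.

```python
import sqlite3

# In-memory database with a made-up table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (k TEXT, v TEXT)")


def insert_tag(conn: sqlite3.Connection, key: str, value: str) -> None:
    """Insert one tag using placeholders, so the driver treats the key
    and value as data rather than as SQL."""
    conn.execute("INSERT INTO tags (k, v) VALUES (?, ?)", (key, value))


hostile_key = "');DROP DATABASE;--"
insert_tag(conn, hostile_key, "yes")

# The key is stored verbatim and the table is still there.
rows = conn.execute("SELECT k, v FROM tags").fetchall()
assert rows == [(hostile_key, "yes")]
```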

Thanks,
Andy

[1] See http://www.w3.org/TR/2008/REC-xml-20081126/#charsets
[2] http://en.wikipedia.org/wiki/Urlencode - / is %2f, by the way.
[3] But you shouldn't rely on it; program defensively anyway. Not
all OSM files are generated by the API.