[OSM-dev] Disallowing certain characters in tag keys

Wed Oct 20 09:34:53 BST 2010

On Tue, Oct 19, 2010 at 11:52:09AM +0100, Andy Allan wrote:
> On Tue, Oct 19, 2010 at 10:25 AM, Jochen Topf <jochen at remote.org> wrote:
> > On Tue, Oct 19, 2010 at 10:06:15AM +0100, Tom Hughes wrote:
> >> On 16/10/10 19:44, Jochen Topf wrote:
> >>
> >>> I am currently fighting some issues where tags with strange characters in them
> >>> need to be represented in a URL for Taginfo. Lots of other websites probably
> >>> will have similar issues. Characters like /, ?,&, etc. have special meaning
> >>> in URLs so if they appear in tags I can't have those tags in URLs. Sometimes
> >>> escaping characters as %XX helps, sometimes not. And those problems are not
> >>> confined to web pages and URLs only. Special characters that need escaping
> >>> are often a problem.
> >>
> >> I really don't understand the problem here - as far as I know all
> >> characters can be used in URLs so long as they are properly escaped. If
> >> your server software is not coping with that for some reason then I
> >> think it's a bug.
> >
> > That might well be a bug. But those bugs creep up all the time, because these
> > things are hard to do and because the specs are not as clear as they should be.
> > I am not saying these things can't be done right, but wouldn't it be nice if
> > we can get rid of that problem instead of everybody writing software for OSM
> > having to make sure all those cases are handled properly?
> >
> >> As a test I just created a file called '<>&+?#;%.html' in an apache
> >> served directory and then asked Firefox to fetch:
> >>
> >>   http://server/%3c%3e%26%2b%3f%23%3b%25.html
> >>
> >> and it was retrieved just fine.
> >
> > And now try the same thing again creating a filename with a '/' in it and see
> > whether it works this time. It doesn't, because '/' is special for Unix (and
> > HTTP) and you need to create a directory with the first part of your name and
> > then the second as file. If you would actually want to create one file for
> > every key in the OSM database in your filesystem, you'd have a problem.
> >
> > Your example basically proves my point. :-)
> 
> No, it really doesn't.

Obviously I haven't made my point clear enough. I am saying, those special
characters don't work like normal characters in many cases. They have special
meanings. For instance as directory separators. Or in URLs or HTML code or
programming languages. So whenever you do anything where those characters
can appear, you have to take special care that your code doesn't break. And
programmers are notoriously bad at taking that special care.

> Let's put it this way - there is a subset[1] of unicode code points
> that is valid for both keys and values. If you find any characters
> emitted by OSM that lie outwith that range, then do let us know[3]. But
> we've taken great care to permit all other code points in both keys
> and values alike, since we've no idea when someone is going to need
> them. Your example of why we need > (and presumably <) is actually a
> great example to undermine your point.

It's really a matter of weighing the different cases. On the one hand it
makes sense to allow "everything", because you never know what you will
need. But on the other hand it makes sense to restrict what you allow
to make handling easier. We have restricted the number of characters
in keys and values for instance. There are certainly cases where it
would be nice to have more characters, but for practical reasons they
are restricted. We have put in a restriction that a key can only appear
once on an object. That's also for practical purposes. I am arguing that
there are other things we can do to make working with OSM tags more
convenient at, I think, no extra cost.

Look at what happened with email addresses: You can have nearly every ASCII
character in email addresses, spaces and double quotes are allowed for
instance, but you have to escape them in the right way. "Real" mail programs
can generally handle that. But most scripts that people write don't. The result
is that in practice you can't use all those characters in email addresses,
because they work only half the time. If you send programmers to the RFC
and ask them to implement it properly, they can't figure out how to do that
and give up. And each one implements his own system, each having his own
list of characters that work and that don't work. The end-result is a rather
small list of characters that always work and some that work sometimes.
(See the details at http://www.remote.org/jochen/mail/info/chars.html )

I argue that if we disallow some characters we can actually expect developers
to implement "our spec". If we leave "the spec" too open, people will
ignore the difficult parts. If too many programs don't work with the difficult
bits, those tags will in practice not be usable anyway, so why not forbid them
outright and all have an easier life?

> Some of these characters need escaping for particular purposes. If you
> find a unicode character that cannot be URLencoded[2], then do let us
> know. Or if you find another encoding scenario which can only encode a
> sub-set of unicode code points, let us know.
> 
> Your application should be able to handle every valid input. You've
> found that your application is buggy, and now you're asking for the
> input to be changed. But just the keys, not the values, and only
> current data, not historical data. It seems a bit ... weird. And your
> original list of characters is completely arbitrary, not based on any
> formal specification as far as I can see.

Yes, it seems a bit weird at first that I want to change keys and not values.
But while it might be inconsistent, it's practical. The weighing I spoke about
above comes out differently for different cases.

I haven't made up my mind about historical data. Most people don't work with
historical data, so, again, it's not that big an issue there. We would need to
discuss this further and maybe look into the data. I wouldn't mind faking the
history to clean this up. We have faked the history for a whole country to
avoid copyright problems, so faking it for a few hundred keys that nobody ever
used seems to be a small problem.

My list of characters comes from more than a decade of experience writing
software and seeing my own software and other people's software fail because I
or somebody else did not take special cases into account. That's why I wrote "I
haven't drawn up a complete list and we certainly would need to discuss this
further. Here is a preliminary list: ..." I fully expect us to haggle over each
and every character and only then come up with a sensible list. The list of
characters is not tied to a formal specification because this is not about
formal specifications, it's about practical use. Formal specifications generally
allow more than what can practically be used. I don't care what the HTTP spec
says; if it doesn't work with Apache, that's much more important for every
practical case.

I have spent a few more hours and I think I got all the problems out of
Taginfo. I have three different functions for escaping (HTML, URLs, and JSON),
some of them implemented on the server side in Ruby and on the client side in
Javascript. Plus the extra escaping that XAPI needs. Extra bonus for handling
the case that the key and/or value is empty. (Why exactly do we need this?) I
have worked around the limitations in Apache and the Sinatra framework, the
URLs look a bit uglier than they maybe could have, but I now think everything
works. (But, please, everybody, go find the bugs, there must be more in there :-)
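
To illustrate why one tag key needs three different escapings, here is a
minimal sketch. It is written in Python only for the sake of the example
(Taginfo uses Ruby and Javascript), and the function names are made up for
illustration; this is not Taginfo's actual code.

```python
import html
import json
from urllib.parse import quote, unquote


def escape_html(text):
    # Escape &, <, > and both quote characters so the text is safe
    # in HTML content and in attribute values.
    return html.escape(text, quote=True)


def escape_url_segment(text):
    # safe="" makes quote() percent-encode "/" too; by default it is left alone.
    return quote(text, safe="")


def escape_json(text):
    # json.dumps returns a complete, quoted JSON string literal.
    return json.dumps(text)


key = 'a/b&c="d"'  # a made-up "difficult" tag key
print(escape_html(key))         # a/b&amp;c=&quot;d&quot;
print(escape_url_segment(key))  # a%2Fb%26c%3D%22d%22
print(escape_json(key))         # "a/b&c=\"d\""
```

The same key comes out differently in each context, and using the wrong
function for a context is exactly the kind of mistake that is easy to make.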

To be fair, most of that would have been necessary anyway, because I also have to
support tag values. But Taginfo is a bit special, because it wouldn't make much
sense if it doesn't work with all allowed tags especially the weird ones.

I hope that everybody writing software for OSM knows about the different
escaping and precautions needed for XML, HTML, HTTP, SQL, regular expressions
and what not, so that they all write perfect software that can handle all the
cases in the right way so that we can enjoy the benefits of having a double
quote or an equals sign in our tag keys. :-)

> If your editor can't handle all necessary characters, fix the editor.
> If your application can't handle all the characters, fix the
> application. And if you find dealing with " or = or & in a key to be
> "hard", it's probably worth taking some time to test with non-BMP
> characters.
> 
> (If you later find that having ');DROP DATABASE;-- in a key or value
> is breaking your database inserts, then please don't ask for these
> characters to be banned too!)
> 
> Thanks,
> Andy
> 
> [1] See http://www.w3.org/TR/2008/REC-xml-20081126/#charsets
> [2] http://en.wikipedia.org/wiki/Urlencode - / is %2f, by the way.

That's not the problem in this case. The problem seems to be that Apache
decodes the %2f into / and then does the path matching, instead of just
matching against / and passing the %2f through to the application. The spec
is not really clear about who is responsible for un-escaping on the server
side and anyway it's, as I said, not the spec that's important, but what
software actually does. And I guess most web frameworks will decode those
URLs for you, so your application doesn't have to.
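
The ordering problem can be shown in a few lines. This is only a sketch in
Python of the two possible orders of operation; it does not claim to be what
Apache does internally, and the /keys/ path layout and function names are
assumptions made up for the example.

```python
from typing import List
from urllib.parse import quote, unquote


def split_then_decode(path: str) -> List[str]:
    # Split on the literal "/" separators first, then decode each segment.
    # An escaped %2F stays inside its segment.
    return [unquote(segment) for segment in path.strip("/").split("/")]


def decode_then_split(path: str) -> List[str]:
    # Decode first: the escaped %2F turns into "/" and can no longer be
    # told apart from a real path separator.
    return unquote(path).strip("/").split("/")


path = "/keys/" + quote("a/b", safe="")  # "/keys/a%2Fb"
print(split_then_decode(path))  # ['keys', 'a/b']
print(decode_then_split(path))  # ['keys', 'a', 'b']
```

If any layer between the browser and the application decodes before it
splits, the key "a/b" arrives as two path components, no matter how carefully
the link was escaped.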

> [3] But you shouldn't rely on it, and defensively program anyway. Not
> all OSM files are generated by the API.

That's true. But if some characters are not allowed I only have to make sure my
program doesn't break in some bad way (like opening a security hole), I don't
have to make sure it does a sensible thing. For instance an empty tag key is
probably no security problem, but I am generating an HTML page with links to
each key. And you can't click on a link that is zero characters long. If I
knew this could never happen in normal operations, I could just ignore this case.
But because it can, I have to special-case this.
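
A minimal sketch of that special case, again in Python for illustration only.
The placeholder text, the /keys/?key= URL layout and the function name are
hypothetical and not what Taginfo actually uses.

```python
import html
from urllib.parse import quote

EMPTY_LABEL = "(empty key)"  # hypothetical placeholder text


def key_link(key: str) -> str:
    # Put the key into a query parameter so an empty key still gives a valid URL.
    href = "/keys/?key=" + quote(key, safe="")
    # An empty key would give a zero-width link, so show a placeholder instead.
    label = html.escape(key) if key else EMPTY_LABEL
    return '<a href="{}">{}</a>'.format(html.escape(href, quote=True), label)


print(key_link("a&b"))  # <a href="/keys/?key=a%26b">a&amp;b</a>
print(key_link(""))     # <a href="/keys/?key=">(empty key)</a>
```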

Jochen
-- 
Jochen Topf  jochen at remote.org  http://www.remote.org/jochen/  +49-721-388298