[OSM-dev] usernames, keys, and values

Thu Oct 29 18:52:43 GMT 2009

On Thu, Oct 29, 2009 at 6:16 PM, Anthony <osm at inbox.org> wrote:
> On Thu, Oct 29, 2009 at 1:31 PM, Matt Amos <zerebubuth at gmail.com> wrote:
>> On Wed, Oct 28, 2009 at 3:38 PM, Andy Allan <gravitystorm at gmail.com> wrote:
>>> I would say that if the dump code and
>>> http://www.w3.org/TR/REC-xml/#NT-Char
>>> are in conflict, there's a bug in the dump code. But since I'm not
>>> going to fix it, maybe I'll keep my opinions quiet :-)
>>>
>>> As for the rails code, there is (AFAIK) no explicit character
>>> checking. The server implicitly relies on libxml to ensure the
>>> characters in the XML requests and responses are only those allowed by
>>> the XML spec above.
>>
>> there is explicit checking in the potlatch API, as that doesn't go
>> through libxml:
>>
>> http://trac.openstreetmap.org/browser/sites/rails_port/app/controllers/amf_controller.rb#L909
>
> There doesn't seem to be a spec, so everyone's just making it up as
> they go along.

sort of. the "spec" is just that the API talks XML, and we aim not to
have any restrictions beyond that. so anything that's a valid XML
character (http://www.w3.org/TR/REC-xml/#charsets) should be allowed
in OSM.

> But I'm going to attempt to clarify, with a quote from W3: "In
> attribute values, the character information items TAB (#x9), newline
> (#xA), and carriage-return (#xD) are represented by "&#x9;", "&#xA;",
> and "&#xD;" respectively."
> (http://www.w3.org/TR/2000/WD-xml-c14n-20000119.html)
>
> Including a tab, newline, or carriage return unescaped in an xml
> attribute would clearly be incorrect.  But as long as it's escaped,
> it's valid xml.  <tag k="name" v="line 1&#xA;line 2" /> is valid xml.
> It may or may not represent valid OSM data.  This is why I'm saying my
> question has nothing at all to do with XML.

but there are other escaped values which are invalid XML (e.g: &#x0).

> Apparently under that potlatch code, tabs, carriage returns, and
> newlines are not allowed in keys or values (I don't actually know
> ruby/rails enough to say for sure, but that seems to be what Matt just
> pointed out).

the code, which is a bit opaque:

delete "\000-\037", "^\011\012\015"

says that it'll delete any char in the range 0-37 (octal), but not 11,
12 or 15. in ascii this corresponds to anything "less than" a space,
but not tab, newline or carriage return. it should be (modulo bugs)
the same as the XML valid character productions.

> On the other hand, usernames apparently *can*, at this
> point, contain these characters.  Actually changing one's username to
> include them would require using an input method other than the web
> page, but I don't see any code to forbid this.

yes, we should probably add a rails validation to stop people using
those chars in their username (who needs a tab in their username
anyway?)

> On the other hand, the planet dump code is silently changing control
> characters to "?".  This could cause problems (for instance, two
> usernames might wind up being silently changed to identical values),
> though it would probably require a deliberate attack.

the planet dump format includes the uids, so this case would be
detectable. however, this behaviour in planet dump is a bug and i'll
have a go at fixing it.

> I wonder, what happens if someone enters tabs into keys or values
> through the API (where there apparently are no checks for this), and
> then someone tries to edit it in potlatch?  Looks like a denial of
> service attack to me.

there don't need to be any checks - the xml is parsed using an xml
parser, which would barf if those chars were in the API calls.
potlatch also can't enter those chars via its API. i don't think
there's any attacks possible here, but i'd like to know if there are!

> It would be a good idea to release an official spec on exactly what
> characters are allowed in keys, values, and usernames.  Just
> disallowing control characters (decimal value less than 32) altogether
> would probably be the best.  But if the decision is made to allow
> them, fine, they need to be handled properly.

that's a good idea. i'll stick something up on the wiki - for
reference i think the current "spec" is the XML valid character
productions, although i can't think of any particular reason to keep
\t, \n or \r:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

http://www.w3.org/TR/REC-xml/#NT-Char

cheers,

matt