[OSM-dev] broken utf8 in minute changeset 200907140650
Jon Burgess
jburgess777 at googlemail.com
Sat Jul 18 20:13:17 BST 2009
On Tue, 2009-07-14 at 11:16 +0100, Andy Allan wrote:
> Interestingly, it's actually valid UTF8 (they are ASCII control
> characters). The problem is that XML defines a subset of Unicode
> characters that excludes these and a few other ranges.
>
> http://www.w3.org/TR/REC-xml/#NT-Char
>
> None of the rails code is explicitly aware of the difference between
> UTF8 and this XML-UTF8-subset. All the XML parsing is done by libxml2*
> (so we haven't come across this distinction before) but this was
> entered via Potlatch and so wasn't parsed by an XML parser. Arguably
> it does the right thing, because during the API 0.6 work we decided that
> "all UTF8" would be valid in OSM tags (and that there wouldn't be any
> normalization between e.g. e-acute and e+combining acute etc etc) but
> maybe we should tweak that definition to say only "all XML-UTF8-subset
> characters" as defined in the above link are permitted.
>
> Test cases and code fixes to follow. This was all figured out by Matt.
Looks like we still have an issue with a stray ^H from this afternoon:
$ gzip -dc 200907181527-200907181528.osc.gz | xmllint -noout -
-:4540: parser error : invalid character in attribute value
<tag k="name" v=ирогова"/>
^
-:4540: parser error : attributes construct error
<tag k="name" v=ирогова"/>
^
-:4540: parser error : Couldn't find end of Start Tag tag line 4540
<tag k="name" v=ирогова"/>
^
-:4540: parser error : PCDATA invalid Char value 8
<tag k="name" v=ирогова"/>
http://www.openstreetmap.org/browse/way/37799039
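
For what it's worth, a check against the Char production linked above could be sketched roughly like this in Python. The function names are made up for illustration, and this is not what the rails code or Potlatch currently does; it only encodes the XML 1.0 Char ranges (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF) as a regular expression and reports anything outside them.

```python
import re

# Matches any character that falls OUTSIDE the XML 1.0 Char production:
#   Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHAR = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text):
    """Return (offset, code point) pairs for characters XML 1.0 does not allow.

    Hypothetical helper name, for illustration only.
    """
    return [(m.start(), ord(m.group())) for m in _INVALID_XML_CHAR.finditer(text)]


def strip_invalid_xml_chars(text):
    """Return text with every character outside the XML Char production removed.

    Hypothetical helper name, for illustration only.
    """
    return _INVALID_XML_CHAR.sub("", text)


if __name__ == "__main__":
    # Made-up tag value containing a stray backspace (^H, code point 8).
    value = "ab\x08c"
    print(find_invalid_xml_chars(value))   # [(2, 8)]
    print(strip_invalid_xml_chars(value))  # abc
```

Whether the API should reject such values or silently strip them is a separate question; the sketch only shows that the check itself is cheap.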
Jon