[OSM-dev] broken utf8 in minute changeset 200907140650

Jon Burgess jburgess777 at googlemail.com
Sat Jul 18 20:13:17 BST 2009


On Tue, 2009-07-14 at 11:16 +0100, Andy Allan wrote:
> Interestingly, it's actually valid UTF8 (they are ASCII control
> characters). The problem is that XML defines a subset of Unicode
> characters that excludes these and a few other ranges.
> 
> http://www.w3.org/TR/REC-xml/#NT-Char
> 
> None of the rails code is explicitly aware of the difference between
> UTF8 and this XML-UTF8-subset. All the XML parsing is done by libxml2*
> (so we haven't come across this distinction before) but this was
> entered via Potlatch and so wasn't parsed by an XML parser. Arguably
> it does the right thing, because during the API 0.6 work we decided that
> "all UTF8" would be valid in OSM tags (and that there wouldn't be any
> normalization between e.g. e-acute and e+combining acute etc etc) but
> maybe we should tweak that definition to say only "all XML-UTF8-subset
> characters" as defined in the above link are permitted.
> 
> Test cases and code fixes to follow. This was all figured out by Matt.

Looks like we still have an issue with a stray ^H from this afternoon:

$ gzip -dc 200907181527-200907181528.osc.gz | xmllint -noout -
-:4540: parser error : invalid character in attribute value
      <tag k="name" v=ирогова"/>
                       ^
-:4540: parser error : attributes construct error
      <tag k="name" v=ирогова"/>
                       ^
-:4540: parser error : Couldn't find end of Start Tag tag line 4540
      <tag k="name" v=ирогова"/>
                       ^
-:4540: parser error : PCDATA invalid Char value 8
      <tag k="name" v=ирогова"/>

http://www.openstreetmap.org/browse/way/37799039
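
For anyone who wants to check tag values before they reach an XML writer, here is a minimal sketch in Python of the "XML-UTF8-subset" test Andy describes, based on the Char production in the XML 1.0 spec linked above. The function names are made up for illustration and this is not the rails or Potlatch code; the sample value is only my guess at what the offending tag looks like (a backspace, char 8, in front of the Cyrillic text).

```python
import re

# XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# The negated class below matches everything outside that set.
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text):
    """Return (index, code point) pairs for characters XML 1.0 does not allow."""
    return [(m.start(), ord(m.group())) for m in _INVALID_XML_CHARS.finditer(text)]


def strip_invalid_xml_chars(text):
    """Return text with every character outside the XML 1.0 Char set removed."""
    return _INVALID_XML_CHARS.sub("", text)


if __name__ == "__main__":
    # Hypothetical tag value: a stray ^H (U+0008) followed by valid text.
    value = "\x08ирогова"
    print(find_invalid_xml_chars(value))
    print(strip_invalid_xml_chars(value))
```

Whether such characters should be rejected at upload time or silently stripped is a policy question; the sketch only shows how to detect them.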


Jon





