[OSM-dev] broken utf8 in minute changeset 200907140650

Andy Allan gravitystorm at gmail.com
Tue Jul 14 11:16:28 BST 2009


Interestingly, it's actually valid UTF8 (the offending characters are
ASCII control characters). The problem is that XML defines a subset of
Unicode characters that excludes these and a few other ranges.

http://www.w3.org/TR/REC-xml/#NT-Char
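
To illustrate the distinction, here is a minimal Python sketch. It assumes the stray character is U+0001; the actual control character in the changeset is not identified in this thread, so that value is purely illustrative. The helper name parses_as_xml is also made up for this example. The bytes decode cleanly as UTF8, yet a conforming XML parser refuses them.

```python
import xml.etree.ElementTree as ET

# Assumption: U+0001 stands in for whatever control character was in the tag.
raw = b'<tag k="name" v="\x01Meycauayan City Northbound Entry Point"/>'

# Decoding succeeds: every ASCII control character is valid UTF8.
text = raw.decode("utf-8")


def parses_as_xml(data: bytes) -> bool:
    """Return True if the bytes form a well-formed XML document."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


print("decoded as UTF8:", repr(text[:20]))
print("well-formed XML:", parses_as_xml(raw))
print("well-formed once the control character is removed:",
      parses_as_xml(raw.replace(b"\x01", b"")))
```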

None of the rails code is explicitly aware of the difference between
UTF8 and this XML-UTF8-subset. All the XML parsing is done by libxml2*
(so we haven't come across this distinction before) but this was
entered via Potlatch and so wasn't parsed by an XML parser. Arguably
it does the right thing, because during the API 0.6 work we decided
that "all UTF8" would be valid in OSM tags (and that there wouldn't be
any normalization between e.g. e-acute and e+combining acute, etc.) but
maybe we should tweak that definition to say only "all XML-UTF8-subset
characters" as defined in the above link are permitted.
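
For reference, a check against the XML 1.0 Char production from the link above could look roughly like the following Python sketch. This is not the rails fix, only an illustration of the rule; the function names is_xml_char and is_xml_safe are hypothetical.

```python
def is_xml_char(ch: str) -> bool:
    """True if ch matches the XML 1.0 Char production:
    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    """
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml_safe(value: str) -> bool:
    """True if every character of a tag key or value is allowed in XML 1.0."""
    return all(is_xml_char(ch) for ch in value)


# No normalization is applied: precomposed e-acute and e + combining acute
# both pass, and they remain distinct strings.
print(is_xml_safe("caf\u00e9"), is_xml_safe("cafe\u0301"))
# A control character (U+0001 chosen only as an example) is rejected.
print(is_xml_safe("\x01Meycauayan City Northbound Entry Point"))
```

A check like this would only reject such values; whether to reject or silently strip the characters is a separate policy decision.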

Test cases and code fixes to follow. This was all figured out by Matt.

Cheers,
Andy

* hopefully, but that's not been audited

On Tue, Jul 14, 2009 at 9:42 AM, Jon Burgess<jburgess777 at googlemail.com> wrote:
> I noticed that the diff parsing on the tile server stopped this morning.
> This changeset seems to be the cause:
>
> $ gzip -dc 200907140650-200907140651.osc.gz | xmllint -noout -
> -:36: parser error : invalid character in attribute value
>      <tag k="name" v="▒Meycauayan City Northbound Entry Point"/>
>                       ^
> -:36: parser error : attributes construct error
>      <tag k="name" v="▒Meycauayan City Northbound Entry Point"/>
>                       ^
>
> http://www.openstreetmap.org/browse/node/410383150
> http://www.openstreetmap.org/browse/node/441527354
>
>
>        Jon
>
>
>



