[OSM-dev] Non-ASCII characters in XML generated from PostGIS

Nick Whitelegg Nick.Whitelegg at solent.ac.uk
Fri May 18 11:50:18 BST 2012


>There are a few issues here that are colliding.

>First of all, you're returning HTML entities in an XML document.
>That's going to throw an error. In your example, you have &agrave;,
>which isn't defined for XML, only for HTML.

>http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
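
To illustrate the point about entities, here is a minimal Python sketch (not the PHP code under discussion) showing that a standard XML parser rejects a named HTML entity but accepts the equivalent numeric character reference. The element name and the helper parses_as_xml are made up for the example.

```python
import xml.etree.ElementTree as ET

def parses_as_xml(text):
    """Return True if the string is well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Named HTML entity: not one of the five entities predefined by XML.
html_style = "<name>Caf&eacute;</name>"
# Numeric character reference for U+00E9: valid in any XML document.
numeric = "<name>Caf&#233;</name>"

print(parses_as_xml(html_style))
print(parses_as_xml(numeric))
# ascii() keeps the output printable on any terminal encoding.
print(ascii(ET.fromstring(numeric).text))
```

XML predefines only &amp;, &lt;, &gt;, &quot; and &apos;; anything else has to be declared in a DTD or written as a numeric reference or as the literal character.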

>Secondly, you're getting your character encodings in a muddle. The
>database is storing your e-acute in UTF-8. It's Unicode character
>U+00E9, and represented in the UTF-8 encoding as two bytes (C3 and
>A9).

>http://www.fileformat.info/info/unicode/char/e9/index.htm <- look at
>the UTF-8 (hex) section
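
The byte values can be checked quickly in Python. This is only a sketch of the encoding arithmetic, using nothing beyond the standard str.encode and bytes.decode calls; it also shows what happens when UTF-8 bytes are misread as ISO-8859-1, and why ISO-8859-1 cannot hold Greek text at all.

```python
e_acute = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# Code point, then the bytes produced by each encoding.
print(hex(ord(e_acute)))
print(e_acute.encode("utf-8").hex())
print(e_acute.encode("iso-8859-1").hex())

# Reading the two UTF-8 bytes as ISO-8859-1 yields two wrong characters.
mojibake = e_acute.encode("utf-8").decode("iso-8859-1")
print(len(mojibake), ascii(mojibake))

# ISO-8859-1 has no code point for Greek alpha (U+03B1).
try:
    "\u03b1".encode("iso-8859-1")
    greek_fits = True
except UnicodeEncodeError:
    greek_fits = False
print(greek_fits)
```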

>I don't understand why you're trying to force the UTF-8 characters
>into an ISO-8859-1 encoding - which has barely enough code points to
>cover Western European languages, never mind Greek, Russian or any
>other OSM data. Stick to UTF-8 encoding in your XML, and you should be
>fine. Provided, of course, you can persuade PHP that it's an XML
>rather than an HTML document, to sort out the entity encoding as I
>explained above.
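
As a rough sketch of the "stick to UTF-8" advice, the Python below (standing in for the PHP script, which is not shown in this thread) lets an XML library do the escaping and writes the result as UTF-8 bytes. The element layout and the function build_osm_xml are hypothetical and only loosely modelled on OSM XML.

```python
import xml.etree.ElementTree as ET

def build_osm_xml(name):
    """Serialise one hypothetical OSM-style node as UTF-8 encoded XML bytes."""
    osm = ET.Element("osm")
    node = ET.SubElement(osm, "node", id="1")
    ET.SubElement(node, "tag", k="name", v=name)
    # The library escapes & as &amp; and writes other characters as raw UTF-8.
    return ET.tostring(osm, encoding="utf-8", xml_declaration=True)

# "Cafe" with e-acute, an ampersand, and a Greek place name.
name = "Caf\u00e9 & \u0391\u03b8\u03ae\u03bd\u03b1"
xml_bytes = build_osm_xml(name)
print(xml_bytes)

# Parsing the bytes back recovers the original string unchanged.
round_trip = ET.fromstring(xml_bytes).find("node/tag").get("v")
print(round_trip == name)
```

The only escaping needed is for the XML special characters; no named HTML entities appear in the output, and the Greek text survives because nothing is squeezed into ISO-8859-1.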

Hello Andy,

OK thanks for that. That's the problem with speaking English as a first language and only working with a UK subset of OSM data... you can go for years and years developing software before hitting these issues! 

Nick

