[OSM-dev] planet.osm - fix
David Sheldon
dave-osm at earth.li
Tue Aug 15 15:04:27 BST 2006
On Tue, Aug 15, 2006 at 03:48:57PM +0200, Michael Strecke wrote:
> If I understand the Wikipedia article correctly, Unicode is this huge
> collection of characters (European, Chinese, Arabic, etc.). These
> characters are usually designated with U+XXXX
These Unicode code points are what XML represents with numeric character
references of the form &#...; (decimal) or &#x...; (hexadecimal):
http://www.w3.org/TR/2004/REC-xml-20040204/#sec-references
Their meaning is not affected by the encoding of the document they appear
in. The encoding only specifies how the bytes of the XML stream map to
characters. Handily, UTF-8, ASCII and Latin-1 all encode &, # and ; as
the same bytes.
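To illustrate, here is a minimal Python sketch using the standard
xml.etree.ElementTree module. The <tag> element and its k/v attributes are
only a made-up sample, not taken from planet.osm. It parses the same
document under three declared encodings and should get the same attribute
value back each time, because the character reference names a code point
rather than a byte.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: U+00FC (u with diaeresis) written as a numeric
# character reference, so the document text itself is pure ASCII.
TEMPLATE = ('<?xml version="1.0" encoding="{enc}"?>'
            '<tag k="name" v="M&#252;nchen"/>')

def parse_with(enc):
    """Encode the sample in `enc`, parse the bytes, return the v attribute."""
    data = TEMPLATE.format(enc=enc).encode(enc)
    return ET.fromstring(data).get("v")

ENCODINGS = ("utf-8", "us-ascii", "iso-8859-1")

# The reference resolves to the same code point whatever the encoding is.
values = [parse_with(enc) for enc in ENCODINGS]

# The three delimiter characters are the same bytes in all three encodings.
delimiter_bytes = {"&#;".encode(enc) for enc in ENCODINGS}
```

Here values should hold the same string three times, and delimiter_bytes
should contain a single entry.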
> The encoding is usually specified in the xml header (and is missing in
> planet.osm):
>
> <?xml version="1.0" encoding="utf-8"?>
XML specifies UTF-8 as the default if a document has neither an encoding
declaration nor a byte order mark:
http://www.w3.org/TR/2004/REC-xml-20040204/#NT-EncodingDecl
"it is a fatal error for an entity including an encoding declaration
to be presented to the XML processor in an encoding other than that
named in the declaration, or for an entity which begins with neither a
Byte Order Mark nor an encoding declaration to use an encoding other
than UTF-8."
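A quick way to see this rule in practice, again as a sketch with Python's
xml.etree.ElementTree (which uses expat underneath) and a made-up sample
element: with no declaration and no BOM, UTF-8 bytes parse fine, while the
same text sent as Latin-1 bytes is rejected as not well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with no XML declaration and no byte order mark.
doc = '<tag k="name" v="M\u00fcnchen"/>'

# UTF-8 bytes: accepted, since UTF-8 is the default encoding.
utf8_value = ET.fromstring(doc.encode("utf-8")).get("v")

# Latin-1 bytes: the lone byte 0xFC is not valid UTF-8, so the parser
# should refuse the document instead of guessing an encoding.
try:
    ET.fromstring(doc.encode("iso-8859-1"))
    latin1_rejected = False
except ET.ParseError:
    latin1_rejected = True
```

Other parsers may word the error differently, but a conforming one should
not silently accept the Latin-1 version.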
> Latin-1 or UTF-8... there should be at least a consistent use.
> I vote for UTF-8, in case some Japanese start mapping their streets.
I vote for ASCII, with anything that cannot be represented in ASCII
written as a character reference in the &#...; form. That way there should
be no transport problems, only people who don't understand the XML
specification trying to write XML parsers.
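As a rough sketch of what that would look like on the writing side (this
is Python's xml.etree.ElementTree, not whatever actually generates
planet.osm, and the element is a made-up sample): serialising to us-ascii
turns every non-ASCII character into a numeric character reference, and
any conforming parser gets the original string back.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample value mixing Japanese and Latin-1 characters.
original = "\u6771\u4eac M\u00fcnchen"
elem = ET.Element("tag", k="name", v=original)

# Serialise as ASCII; characters outside ASCII become &#NNNN; references.
ascii_bytes = ET.tostring(elem, encoding="us-ascii")

# Every byte is 7-bit clean, so this decode cannot fail.
ascii_text = ascii_bytes.decode("ascii")

# Parsing the ASCII output restores the original characters.
round_tripped = ET.fromstring(ascii_bytes).get("v")
```

The same replacement is available for plain strings through
str.encode("ascii", "xmlcharrefreplace").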
David
--
A debugged program is one for which you have not yet found the conditions
that make it fail.
-- Jerry Ogdin