[OSM-dev] planet.osm - fix

Tue Aug 15 15:04:27 BST 2006

On Tue, Aug 15, 2006 at 03:48:57PM +0200, Michael Strecke wrote:
> If I understand the Wikipedia article correctly, Unicode is this huge
> collection of characters (European, Chinese, Arabic, etc.). These
> characters are usually designated with U+XXXX.

These Unicode code points are what XML represents with &#...;
character references:

  http://www.w3.org/TR/2004/REC-xml-20040204/#sec-references

These are not affected by the encoding used to read the current document.
The encoding only specifies how to turn the bytes of the XML stream into
characters. Handily, UTF-8, ASCII and Latin-1 all encode &, # and ; as
the same bytes.
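As a rough illustration of that point, here is a minimal Python sketch
using the standard xml.etree.ElementTree module. The <tag> element and
the street name are made up for the example and are not taken from
planet.osm. The same reference, &#223; (U+00DF), should come out as the
same character whichever of the three encodings is declared:

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element document; only the declared encoding varies.
TEMPLATE = '<?xml version="1.0" encoding="{enc}"?><tag k="name" v="Stra&#223;e"/>'

def parse_value(enc):
    # The document text is pure ASCII, so the body bytes are identical
    # under all three encodings; only the declaration differs.
    data = TEMPLATE.format(enc=enc).encode(enc)
    return ET.fromstring(data).get("v")

values = {enc: parse_value(enc) for enc in ("us-ascii", "iso-8859-1", "utf-8")}

# The three markup characters of a reference encode to the same bytes.
markup = "&#;"
same_bytes = (markup.encode("ascii")
              == markup.encode("iso-8859-1")
              == markup.encode("utf-8"))
```

Every entry in values should hold the same Unicode string, with U+00DF
in place of the reference.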

> The encoding is usually specified in the xml header (and is missing in
> planet.osm):
> 
> 	<?xml version="1.0" encoding="utf-8"?>

XML specifies UTF-8 as the default if there is neither an encoding
declaration nor a Byte Order Mark:

  http://www.w3.org/TR/2004/REC-xml-20040204/#NT-EncodingDecl

  "... it is a fatal error for an entity including an encoding declaration
  to be presented to the XML processor in an encoding other than that
  named in the declaration, or for an entity which begins with neither a
  Byte Order Mark nor an encoding declaration to use an encoding other
  than UTF-8."
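A small sketch of what this rule means in practice, again assuming
Python's xml.etree.ElementTree (expat underneath) as the parser and a
made-up <tag> element. Without a declaration, UTF-8 bytes should parse,
while Latin-1 bytes containing a non-ASCII character should be rejected
until the encoding is declared:

```python
import xml.etree.ElementTree as ET

# Hypothetical element with one non-ASCII character (U+00DF).
text = '<tag k="name" v="Stra\u00dfe"/>'

utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("iso-8859-1")

def try_parse(data):
    # Return the attribute value, or None if the parser rejects the input.
    try:
        return ET.fromstring(data).get("v")
    except ET.ParseError:
        return None

# No declaration, UTF-8 bytes: accepted under the default.
utf8_result = try_parse(utf8_bytes)

# No declaration, Latin-1 bytes: 0xDF followed by "e" is not valid UTF-8.
latin1_result = try_parse(latin1_bytes)

# Same Latin-1 bytes, but now with a matching encoding declaration.
declared_result = try_parse(
    b'<?xml version="1.0" encoding="iso-8859-1"?>' + latin1_bytes
)
```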

> Latin-1 or UTF-8... there should be at least a consistent use.
> I vote for UTF-8, in case some Japanese start mapping their streets.

I vote for ASCII, with anything that cannot be represented in ASCII
written as a character reference in the &#...; form. This way there
should be no transport problems, just people who don't understand the
XML specification trying to write XML parsers.
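To make the proposal concrete, here is a hedged sketch of how such
ASCII-only output could be produced with Python's
xml.etree.ElementTree. This is only an illustration, not how planet.osm
is actually generated, and the element and its value are invented:

```python
import xml.etree.ElementTree as ET

# Hypothetical tag with Japanese and German characters in the value.
original = "\u6771\u4eac Stra\u00dfe"
elem = ET.Element("tag", k="name", v=original)

# Serialising as US-ASCII makes ElementTree emit &#...; references
# for every character that ASCII cannot represent.
ascii_bytes = ET.tostring(elem, encoding="us-ascii")

# The result is plain 7-bit text; decoding fails if any byte is >= 0x80.
ascii_text = ascii_bytes.decode("ascii")

# A conforming parser turns the references back into the same string.
round_trip = ET.fromstring(ascii_bytes).get("v")
```

Any conforming parser should recover the original string from the
references, so nothing is lost by restricting the file itself to ASCII.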

David

-- 
A debugged program is one for which you have not yet found the conditions
that make it fail.
		-- Jerry Ogdin