[OSM-dev] planet.osm - fix

Tue Aug 15 14:48:57 BST 2006

David Sheldon wrote:
> On Tue, Aug 15, 2006 at 02:11:34PM +0200, Michael Strecke wrote:
>> <?xml version="1.0"?>
>> <osm version="0.3" generator="OpenStreetMap server">
>>   <way id="2837877" timestamp="2006-08-09 23:53:34">
>>     <seg id="10134927"/>
>>     <tag k="name" v="Genter Stra&#xDF;e"/>
>>   </way>
>> </osm>
>>
>> Not UTF-8, but latin-1 encoding. :(
> 
> This is Unicode encoding (see
> http://www.unicode.org/charts/PDF/U0080.pdf) ,

It may be *a* Unicode encoding, but it's not UTF-8.

If I understand the Wikipedia article correctly, Unicode is this huge
collection of characters (European, Chinese, Arabic, etc.). These
characters are usually designated with U+XXXX.

UTF-8 is one way to encode these U+XXXX characters; Latin-1 is another.

UTF-8 encodes a Unicode character into 1, 2, 3 or 4 bytes (1 for ASCII,
2 for most European characters, 3 or 4 for everything else).
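
To make the byte counts concrete, here is a quick sketch in Python. The
sample characters and the helper name utf8_length are my own choice for
illustration, not anything taken from planet.osm.

```python
# Count how many bytes UTF-8 needs for a few sample characters.
# The samples and the helper name are illustrative assumptions.
samples = {
    "A": "ASCII letter",
    "\u00df": "German sharp s",
    "\u65e5": "CJK ideograph",
    "\U0001d11e": "musical G clef",
}


def utf8_length(ch):
    """Return the number of bytes the single character ch takes in UTF-8."""
    return len(ch.encode("utf-8"))


for ch, description in samples.items():
    print(f"U+{ord(ch):04X} {description}: {utf8_length(ch)} byte(s)")
```

The sharp s from the street name above lands in the 2-byte group, the
Japanese character in the 3-byte group.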

Latin-1 encodes its additional characters into 1 byte in the range
between 0x80 and 0xff (so it can only represent a small subset of all
Unicode characters).
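
The difference is easy to see side by side. This is only a sketch;
encode_both is a made-up helper, and the two test characters are my own
examples.

```python
def encode_both(text):
    """Return (latin1_bytes, utf8_bytes) for text.

    latin1_bytes is None when text contains a character that Latin-1
    cannot represent.
    """
    try:
        latin1 = text.encode("latin-1")
    except UnicodeEncodeError:
        latin1 = None
    return latin1, text.encode("utf-8")


# U+00DF (sharp s): one byte in Latin-1, two bytes in UTF-8.
print(encode_both("\u00df"))   # (b'\xdf', b'\xc3\x9f')

# U+65E5 (a Japanese character): not representable in Latin-1 at all.
print(encode_both("\u65e5"))   # (None, b'\xe6\x97\xa5')
```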

The encoding is usually declared in the XML header (planet.osm leaves it
out):

	<?xml version="1.0" encoding="utf-8"?>
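
Here is a small sketch of why the declaration matters, using Python's
standard xml.etree.ElementTree parser. The document body is a cut-down
variant of the example above with a literal sharp s in it, and
parse_name is a hypothetical helper. The last call assumes the parser
falls back to UTF-8 when no encoding is declared, which is what I see
with ElementTree.

```python
import xml.etree.ElementTree as ET

# Cut-down sample document with a literal sharp s (U+00DF) in the name.
BODY = '<osm><way id="1"><tag k="name" v="Genter Stra\u00dfe"/></way></osm>'


def parse_name(declaration, byte_encoding):
    """Encode declaration + BODY with byte_encoding, parse, return the name."""
    data = (declaration + BODY).encode(byte_encoding)
    root = ET.fromstring(data)
    return root.find("way/tag").get("v")


# Declaration and actual bytes agree: both variants parse fine.
print(parse_name('<?xml version="1.0" encoding="utf-8"?>', "utf-8"))
print(parse_name('<?xml version="1.0" encoding="iso-8859-1"?>', "latin-1"))

# No encoding declared, but Latin-1 bytes in the file: the parser rejects it.
try:
    parse_name('<?xml version="1.0"?>', "latin-1")
except ET.ParseError as err:
    print("parse error:", err)
```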

> and is specified by XML
> to be the same as using the UTF-8 sequence.

Links please.

AFAIK you can use Latin-1 as the encoding of an XML file if you want to
(and you should then declare it in the header). But planet.osm mixes the
UTF-8 and Latin-1 encoding schemes in one file, which most parsers
don't like.
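
As a workaround sketch (not something the server does, as far as I
know), a mixed file could be cleaned up line by line: try UTF-8 first and
fall back to Latin-1. decode_mixed is a made-up helper, the second
street name is invented test data, and the approach is only a heuristic:
a Latin-1 line whose bytes happen to form valid UTF-8 would be misread.

```python
def decode_mixed(raw):
    """Decode bytes line by line: UTF-8 first, Latin-1 as the fallback.

    Heuristic only; assumes each line is consistently in one encoding.
    """
    lines = []
    for line in raw.split(b"\n"):
        try:
            lines.append(line.decode("utf-8"))
        except UnicodeDecodeError:
            # Latin-1 maps every byte to a character, so this cannot fail.
            lines.append(line.decode("latin-1"))
    return "\n".join(lines)


# One Latin-1 line and one UTF-8 line in the same "file" (invented data).
mixed = "Genter Stra\u00dfe".encode("latin-1") + b"\n" + "K\u00f6nigsallee".encode("utf-8")

try:
    mixed.decode("utf-8")
except UnicodeDecodeError as err:
    print("strict UTF-8 decoding fails:", err)

print(decode_mixed(mixed))
```

The result could then be written back out as clean UTF-8.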

If there is any documentation that contradicts these assumptions, please
let me know.

Latin-1 or UTF-8... the usage should at least be consistent.
I vote for UTF-8, in case some Japanese start mapping their streets.