[OSM-dev] planet.osm - fix
Michael Strecke
MStrecke at gmx.de
Tue Aug 15 14:48:57 BST 2006
David Sheldon wrote:
> On Tue, Aug 15, 2006 at 02:11:34PM +0200, Michael Strecke wrote:
>> <?xml version="1.0"?>
>> <osm version="0.3" generator="OpenStreetMap server">
>> <way id="2837877" timestamp="2006-08-09 23:53:34">
>> <seg id="10134927"/>
>> <tag k="name" v="Genter Straße"/>
>> </way>
>> </osm>
>>
>> Not UTF-8, but latin-1 encoding. :(
>
> This is Unicode encoding (see
> http://www.unicode.org/charts/PDF/U0080.pdf) ,
It may be *a* Unicode encoding, but it's not UTF-8.
If I understand the Wikipedia article correctly, Unicode is a huge
collection of characters (European, Chinese, Arabic, etc.). These
characters are usually designated as U+XXXX.
UTF-8 is one way to encode these U+XXXX characters; Latin-1 is another.
UTF-8 encodes a Unicode character into 1, 2, 3 or 4 bytes (1 for ASCII,
2 for most European characters, 3 for the rest of the Basic Multilingual
Plane, such as Chinese and Japanese, and 4 for everything beyond it).
Latin-1 encodes each of its additional characters as a single byte in
the range 0x80 to 0xFF, so it can only represent a small subset of
Unicode.
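To make the difference concrete, here is a minimal Python sketch that encodes the street name from the example above both ways. The helper name is_valid_utf8 is made up for illustration and is not part of any OSM tool.

```python
# "ß" is the single code point U+00DF; only its byte representation differs.
name = "Genter Stra\u00dfe"

utf8_bytes = name.encode("utf-8")      # ß becomes the two bytes 0xC3 0x9F
latin1_bytes = name.encode("latin-1")  # ß becomes the single byte 0xDF


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(utf8_bytes.hex(" "))
print(latin1_bytes.hex(" "))
print(is_valid_utf8(utf8_bytes), is_valid_utf8(latin1_bytes))
```

The Latin-1 bytes are not valid UTF-8 here, because 0xDF announces a two-byte sequence and the following "e" is not a continuation byte.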
The encoding is usually specified in the XML header (planet.osm does
not specify one):
<?xml version="1.0" encoding="utf-8"?>
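As a rough illustration of why the header matters, the following Python sketch feeds the same Latin-1 bytes to the standard library parser twice, once with and once without an encoding in the header. Without a declaration (and without a byte order mark) an XML parser has to assume UTF-8. The function parse_name and the two sample documents are assumptions made up for this example; only the tag itself comes from the excerpt above.

```python
import xml.etree.ElementTree as ET

# The street name as Latin-1 bytes: "ß" is the single byte 0xDF.
latin1_value = "Genter Stra\u00dfe".encode("latin-1")

# Same payload, once with the encoding declared and once without.
declared = (b'<?xml version="1.0" encoding="iso-8859-1"?>'
            b'<tag k="name" v="' + latin1_value + b'"/>')
undeclared = (b'<?xml version="1.0"?>'
              b'<tag k="name" v="' + latin1_value + b'"/>')


def parse_name(document: bytes):
    """Return the v attribute, or None if the parser rejects the document."""
    try:
        return ET.fromstring(document).get("v")
    except ET.ParseError:
        return None


print(parse_name(declared))
print(parse_name(undeclared))
```

The declared document parses to the expected street name, while the undeclared one is rejected as not well-formed.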
> and is specified by XML
> to be the same as using the UTF-8 sequence.
Links please.
AFAIK you can use Latin-1 as the encoding of an XML file if you want to
(in which case you should specify it in the header). But planet.osm uses
a mix of the UTF-8 and Latin-1 encoding schemes in one file, which most
parsers reject.
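One possible way to clean up such a mixed file, sketched in Python: try each line as UTF-8 first and fall back to Latin-1 only when that fails. This is a heuristic, not a definitive fix. It assumes every line is entirely in one encoding, and a Latin-1 line that happens to be valid UTF-8 would pass through unchanged. The function name line_to_utf8 and the two sample lines are made up for illustration.

```python
def line_to_utf8(raw_line: bytes) -> bytes:
    """Return raw_line as UTF-8, treating undecodable lines as Latin-1."""
    try:
        raw_line.decode("utf-8")
        return raw_line  # already valid UTF-8, keep the bytes untouched
    except UnicodeDecodeError:
        # Every byte value is defined in Latin-1, so this decode cannot fail.
        return raw_line.decode("latin-1").encode("utf-8")


# Hypothetical sample: the same tag once in Latin-1 and once in UTF-8.
mixed = [
    b'<tag k="name" v="Genter Stra\xdfe"/>\n',
    b'<tag k="name" v="Genter Stra\xc3\x9fe"/>\n',
]

cleaned = [line_to_utf8(line) for line in mixed]
print(b"".join(cleaned).decode("utf-8"))
```

Trying UTF-8 first matters: decoding as Latin-1 never fails, so doing it the other way round would silently garble the lines that are already correct.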
If there is any documentation that contradicts these assumptions, please
let me know.
Latin-1 or UTF-8... either way, it should at least be used consistently.
I vote for UTF-8, in case some Japanese start mapping their streets.