[OSM-dev] planet.osm - fix
Michael Strecke
MStrecke at gmx.de
Tue Aug 15 13:11:34 BST 2006
Michael Strecke wrote:
> I'm just writing a short program to identify the offending elements.
My UTF-8 checker identified roughly 400 elements (nodes, segments, ways)
that are not UTF-8 encoded; most of them are latin-1.
The task to correct those elements seems feasible.
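A minimal sketch of such a check, assuming Python 3. This is not the
checker mentioned above; the function name classify_encoding is made up
for illustration. It only tests whether a raw byte string decodes as
UTF-8, which is enough to flag latin-1 values such as "Stra\xdfe".

```python
def classify_encoding(raw: bytes) -> str:
    """Classify a raw tag value as 'ascii', 'utf-8' or 'not-utf-8'.

    Hypothetical helper: latin-1 text with 8-bit characters almost never
    forms valid UTF-8 sequences, so a failed decode is a strong hint that
    the value was stored in a legacy encoding.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "not-utf-8"
    # Pure ASCII is valid in both encodings and needs no correction.
    return "ascii" if text.isascii() else "utf-8"


if __name__ == "__main__":
    for value in (b"Genter Stra\xc3\x9fe", b"Genter Stra\xdfe", b"Genter Strasse"):
        print(value, classify_encoding(value))
```

Values reported as 'not-utf-8' would still need a human look before
conversion, since the check cannot tell latin-1 from other 8-bit codesets.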
But the codeset handling of the API should be corrected first. Here is
what I found:
The German word for "street" is "Straße", and many street names contain
the word "Straße".
The encodings for the German "ß" are (hex):
UTF-8: C3 9F
latin-1: DF
AFAIK, in XML documents those can be sent either as 8-bit characters or
as an HTML entity, e.g. &szlig;
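The byte values above can be reproduced with a few lines of Python 3 (a
sketch; the variable names are only for illustration). The last line
shows the numeric character reference, which is the escaped form that
plain XML accepts without a DTD, since the named HTML entities are not
predefined in XML.

```python
eszett = "\u00df"  # the character "ß"

utf8_bytes = eszett.encode("utf-8")
latin1_bytes = eszett.encode("latin-1")
# Escape the character as a numeric reference instead of sending 8-bit data.
numeric_ref = eszett.encode("ascii", errors="xmlcharrefreplace")

print(utf8_bytes.hex())    # c39f
print(latin1_bytes.hex())  # df
print(numeric_ref)         # b'&#223;'
```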
Step 1: Upload in UTF-8
=======================
connect: (www.openstreetmap.org, 80)
send: u'PUT /api/0.3/way/2837877 HTTP/1.1\r\nHost:
www.openstreetmap.org\r\nContent-Length: 140\r\nAccept-Encoding:
gzip\r\nAuthorization: Basic XXXXXXXXXXXXXXXXX\r\nUser-Agent:
pyosmeditor/0.1.1\r\n\r\n'
send: '<osm version="0.3" generator="pyosmeditor"><way id="2837877">\n
<seg id="10134927"/>\n \n <tag k="name" v="Genter
Stra\xc3\x9fe"/></way></osm>'
reply: 'HTTP/1.1 200 OK\r\n'
The upload is in UTF-8 (\xc3\x9f), sent as 8-bit characters.
Step 2: Download of the information (way command):
==================================================
wget is used to prevent any re-encoding on this end:
wget
http://xxxxx%40xxxx.xxx:xxxxxxx@www.openstreetmap.org/api/0.3/way/2837877
<?xml version="1.0"?>
<osm version="0.3" generator="OpenStreetMap server">
<way id="2837877" timestamp="2006-08-09 23:53:34">
<seg id="10134927"/>
<tag k="name" v="Genter Straße"/>
</way>
</osm>
Not UTF-8, but latin-1 encoding. :(
It seems that the upload itself was understood correctly (otherwise the
server wouldn't have chosen the correct latin-1 character for "ß").
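Why this matters for clients: the reply declares no encoding, so an XML
parser has to assume UTF-8 and may reject the latin-1 byte. A small
sketch with Python's standard library parser; the document is shortened
to the one tag, and the helper name parses is made up. Other parsers may
react differently.

```python
import xml.etree.ElementTree as ET

# The tag as the server returned it: "ß" as the single latin-1 byte 0xDF.
LATIN1_BODY = b'<tag k="name" v="Genter Stra\xdfe"/>'

undeclared = b'<?xml version="1.0"?>' + LATIN1_BODY
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + LATIN1_BODY


def parses(document: bytes) -> bool:
    """Return True if the standard library parser accepts the document."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    print(parses(undeclared))  # no declaration: parser assumes UTF-8
    print(parses(declared))    # declaration matches the actual bytes
    print(ET.fromstring(declared).get("v"))
```

So the server would have to either send UTF-8 or declare the encoding it
really uses.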
Step 3: Download via map:
=========================
wget
http://xxxxxx%40xxxxx.xxx:xxxxx@www.openstreetmap.org/api/0.3/map?bbox=6.93100777156,50.9354174816,6.93938526447,50.9423637681
<?xml version="1.0"?>
<osm version="0.3" generator="OpenStreetMap server">
...
</way>
<way id="2837877" timestamp="2006-08-09 23:53:34">
<seg id="10134927"/>
<tag k="name" v="Genter Straße"/>
</way>
Well, it starts like UTF-8, but what is 0x178? A bug.
All in all, not very consistent behavior.
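One possible explanation for the 0x178, offered as a guess and not
confirmed by anything on the server side: U+0178 is what the byte 0x9F
becomes when it is read as Windows-1252 (cp1252). If the UTF-8 bytes
C3 9F are misread that way and then encoded as UTF-8 a second time, the
result starts with C3 as well, which would match "starts like UTF-8".
A Python 3 sketch of that assumed double encoding:

```python
original = "Stra\u00dfe"                  # "Straße"
utf8_bytes = original.encode("utf-8")     # ... C3 9F ...

# Assumed bug: the UTF-8 bytes are misread as Windows-1252 text.
misread = utf8_bytes.decode("cp1252")     # 0xC3 -> U+00C3, 0x9F -> U+0178
double_encoded = misread.encode("utf-8")  # what would go out on the wire

print([hex(ord(c)) for c in misread[4:6]])  # ['0xc3', '0x178']
print(double_encoded)                       # b'Stra\xc3\x83\xc5\xb8e'

# If this is what happened, the damage is reversible:
repaired = double_encoded.decode("utf-8").encode("cp1252").decode("utf-8")
print(repaired == original)                 # True
```

A misreading as strict latin-1 would give U+009F instead of U+0178, so
the value points at cp1252 in particular.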
From what I have seen in a quick Google search, Ruby seems to lack
proper Unicode support (unlike Python). Is this still true?
And the fact that the various databases are encoded differently may also
add its share to the confusion.