[OSM-dev] osmosis utf-8
Brett Henderson
brett at bretth.com
Wed Nov 7 11:56:23 GMT 2007
Martijn van Oosterhout wrote:
> On Nov 6, 2007 11:01 PM, Brett Henderson <brett at bretth.com> wrote:
>
>> Last night I also tried running the "set names 'latin1'" statement from
>> code. The JDBC driver documentation explicitly says not to use this
>> command because the driver won't detect the change. Sounded perfect but
>> again seemed to make no difference which surprised me.
>>
>
> I thought of another solution for the short term. The problem is that
> the characters are double encoded, because the rails code is
> connecting as latin1 but sending utf8. If you set osmosis so that it
> treats the input and output OSM files as latin1 then data will be
> borked inside osmosis but on disk it would look like the utf-8 the
> rails server sees.
>
> From the docs it seems you only need to build an OutputStreamWriter
> around the file with the encoding "ISO-8859-1".
>
> Doesn't help if you're writing to a DB though.
>
> Hope this helps,
>
I did a quick test as you suggested, changing line 111 of BaseXmlWriter
from the "UTF-8" encoding to the "ISO-8859-1" encoding.
It results in the tag being written as:
<tag k="name" v="Ilmenauer Stra�?e"/>
Not sure why the two ? characters are different. The file is at the
following URL if you're interested:
http://planet.openstreetmap.org/daily/testfile.osc
Presumably Java knows that there are two characters but can't write them
in its latin1 encoding so writes question marks instead (although the
first question mark isn't a normal one so I might be misinterpreting
something). Perhaps the Java ISO-8859-1 encoding is a subset of the
MySQL latin1 encoding, not too sure what's going on here ...
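
One way to check this guess is to write a sample string through an
OutputStreamWriter and look at the raw bytes. The sketch below is only an
illustration: the class and method names are made up, and the input
string assumes that the double-encoded "ß" reaches Java as the two
characters U+00C3 and U+0178, which is what a cp1252-style reading of the
UTF-8 bytes C3 9F would give (MySQL documents its latin1 as cp1252 rather
than strict ISO-8859-1).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class Latin1WriteDemo {
    // Writes text through an ISO-8859-1 OutputStreamWriter and returns
    // the resulting bytes as space-separated hex.
    static String writeAsLatin1Hex(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.ISO_8859_1)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : out.toByteArray()) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        return hex.toString().trim();
    }

    public static void main(String[] args) {
        // Assumed in-memory form of the double-encoded "Straße".
        System.out.println(writeAsLatin1Hex("Stra\u00C3\u0178e"));
        // prints 53 74 72 61 C3 3F 65
    }
}
```

Under that assumption U+00C3 is written as the byte C3, but U+0178 has no
ISO-8859-1 representation and becomes a plain "?" (3F). A UTF-8 viewer
then sees C3 as a broken lead byte and shows a replacement character,
followed by the ordinary question mark, which would match the two
different "?" characters in the tag.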
I think something similar will happen if I try re-encoding strings in
code. If I receive a string, convert to "ISO-8859-1" bytes then read
back as UTF-8 I'll just receive a bunch of question marks.
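
The re-encoding idea can be tried in isolation. This is a rough sketch,
not osmosis code: ReencodeDemo and reencode are hypothetical names, and
the sample string again assumes the double-encoded "ß" arrives as U+00C3
U+0178. It compares ISO-8859-1 with windows-1252 as the intermediate
charset; windows-1252 is not one of Java's guaranteed charsets, although
mainstream JDKs ship it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ReencodeDemo {
    // Encodes the string with the intermediate charset, then decodes
    // the resulting bytes as UTF-8.
    static String reencode(String doubleEncoded, Charset intermediate) {
        byte[] raw = doubleEncoded.getBytes(intermediate);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumed in-memory form of the double-encoded "Straße".
        String stored = "Stra\u00C3\u0178e";

        // ISO-8859-1 cannot encode U+0178, so the second byte is lost.
        System.out.println(reencode(stored, StandardCharsets.ISO_8859_1));

        // windows-1252 maps U+0178 back to byte 9F, restoring C3 9F.
        System.out.println(reencode(stored, Charset.forName("windows-1252")));
    }
}
```

With ISO-8859-1 the result is "Stra", a replacement character, "?" and
"e". With windows-1252 the original "Straße" comes back. This is not a
complete fix: a few bytes (81, 8D, 8F, 90, 9D) are undefined in
windows-1252, so characters that depend on them would still be lost.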
I'm a little curious how the jdbc driver encodes data when the
connection is set to latin1. Presumably it will also write ? characters.
The following configuration settings are from the Charsets.properties
inside the jdbc driver jar file:
javaToMysqlMappings=\
US-ASCII = usa7,\
US-ASCII = ascii,\
Big5 = big5,\
GBK = gbk,\
SJIS = sjis,\
EUC_CN = gb2312,\
EUC_JP = ujis,\
EUC_JP_Solaris = >5.0.3 eucjpms,\
EUC_KR = euc_kr,\
EUC_KR = >4.1.0 euckr,\
ISO8859_1 = *latin1,\
ISO8859_1 = latin1_de,\
ISO8859_1 = german1,\
ISO8859_1 = danish,\
ISO8859_2 = latin2,\
ISO8859_2 = czech,\
ISO8859_2 = hungarian,\
ISO8859_2 = croat,\
ISO8859_7 = greek,\
ISO8859_7 = latin7,\
ISO8859_8 = hebrew,\
ISO8859_9 = latin5,\
ISO8859_13 = latvian,\
ISO8859_13 = latvian1,\
ISO8859_13 = estonia,\
Cp437 = *>4.1.0 cp850,\
Cp437 = dos,\
Cp850 = Cp850,\
Cp852 = Cp852,\
Cp866 = cp866,\
KOI8_R = koi8_ru,\
KOI8_R = >4.1.0 koi8r,\
TIS620 = tis620,\
Cp1250 = cp1250,\
Cp1250 = win1250,\
Cp1251 = *>4.1.0 cp1251,\
Cp1251 = win1251,\
Cp1251 = cp1251cias,\
Cp1251 = cp1251csas,\
Cp1256 = cp1256,\
Cp1251 = win1251ukr,\
Cp1257 = cp1257,\
MacRoman = macroman,\
MacCentralEurope = macce,\
UTF-8 = utf8,\
UnicodeBig = ucs2,\
US-ASCII = binary,\
Cp943 = sjis,\
MS932 = sjis,\
MS932 = >4.1.11 cp932,\
WINDOWS-31J = sjis,\
WINDOWS-31J = >4.1.11 cp932,\
CP932 = sjis,\
CP932 = *>4.1.11 cp932,\
SHIFT_JIS = sjis,\
ASCII = ascii,\
LATIN5 = latin5,\
LATIN7 = latin7,\
HEBREW = hebrew,\
GREEK = greek,\
EUCKR = euckr,\
GB2312 = gb2312,\
LATIN2 = latin2
Given that latin1 is mapped to ISO8859_1 in this file I assume the
driver will just lose information if characters outside ISO-8859-1 are
written to the db.
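
To get a feel for which characters would be at risk, the Java charset
named in the mapping can be asked directly. The sketch below only
exercises Java's own ISO8859_1 charset. Whether the JDBC driver really
encodes with it, or has special handling for latin1, is an assumption I
haven't verified. EncodableCheck and unencodable are illustrative names.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class EncodableCheck {
    // Returns the characters of text that the named charset cannot
    // represent, in order of appearance.
    static String unencodable(String text, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        StringBuilder lost = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!encoder.canEncode(c)) {
                lost.append(c);
            }
        }
        return lost.toString();
    }

    public static void main(String[] args) {
        // "Straße Łódź": only the L with stroke and the z with acute
        // fall outside ISO-8859-1.
        System.out.println(unencodable("Stra\u00DFe \u0141\u00F3d\u017A", "ISO8859_1"));
    }
}
```

Western European names such as "Straße" pass unchanged, while anything
beyond that range would presumably be replaced when written.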