[OSM-dev] osmosis utf-8

Wed Nov 7 11:56:23 GMT 2007

Martijn van Oosterhout wrote:
> On Nov 6, 2007 11:01 PM, Brett Henderson <brett at bretth.com> wrote:
>   
>> Last night I also tried running the "set names 'latin1'" statement from
>> code.  The JDBC driver documentation explicitly says not to use this
>> command because the driver won't detect the change.  Sounded perfect but
>> again seemed to make no difference which surprised me.
>>     
>
> I thought of another solution for the short term.  The problem is that
> the characters are double encoded, because the rails code is
> connecting as latin1 but sending utf8. If you set osmosis so that it
> treats the input and output OSM files as latin1 then data will be
> borked inside osmosis but on disk it would look like the utf-8 the
> rails server sees.
>
> From the docs it seems you only need to build an OutputStreamWriter
> around the file with the encoding "ISO-8859-1".
>
> Doesn't help if you're writing to a DB though.
>
> Hope this helps,
>   
I did a quick test as you suggested, changing line 111 of BaseXmlWriter 
from the "UTF-8" encoding to the "ISO-8859-1" encoding.

It results in the tag being written as:
<tag k="name" v="Ilmenauer Stra�?e"/>

Not sure why the two ? characters are different. The file is at the 
following URL if you're interested:
http://planet.openstreetmap.org/daily/testfile.osc

Presumably Java knows that there are two characters but can't write them 
in its latin1 encoding, so it writes question marks instead (although the 
first question mark isn't a normal one, so I might be misinterpreting 
something). Perhaps the Java ISO-8859-1 encoding is a subset of the 
MySQL latin1 encoding; I'm not too sure what's going on here ...
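
To make the guess above concrete, here is a minimal sketch of what an 
ISO-8859-1 OutputStreamWriter does with a character it cannot encode. 
It rests on an assumption that is not confirmed anywhere in this thread: 
that the double-encoded "ß" reaches osmosis as the two characters U+00C3 
and U+0178, which is what you would get if MySQL latin1 behaves like 
windows-1252 rather than strict ISO-8859-1. The class and method names 
are made up for illustration and are not osmosis code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of osmosis.
class Latin1WriterDemo {

    // Writes the text through an ISO-8859-1 OutputStreamWriter and
    // returns the raw bytes that would end up in the file.
    static byte[] writeAsLatin1(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.ISO_8859_1)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Formats bytes as space-separated upper-case hex for inspection.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // ASSUMPTION: a double-encoded "ß" as it might come back from the db.
        String doubleEncoded = "\u00C3\u0178";
        // U+00C3 fits in ISO-8859-1; U+0178 does not and is replaced by '?'.
        System.out.println(hex(writeAsLatin1(doubleEncoded))); // prints "C3 3F"
    }
}
```

If that assumption holds, it would account for the two different marks: 
the byte C3 on its own is not valid UTF-8, so a viewer shows it as a 
replacement glyph, while 3F is a genuine question mark.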

I think something similar will happen if I try re-encoding strings in 
code. If I receive a string, convert to "ISO-8859-1" bytes then read 
back as UTF-8 I'll just receive a bunch of question marks.
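
A rough sketch of that re-encoding round trip, for anyone who wants to 
try it. The input string is an assumption on my part (a double-encoded 
"Straße" arriving as "Stra" + U+00C3 + U+0178 + "e"), and the 
windows-1252 variant is only there for comparison, since MySQL's latin1 
corresponds to cp1252 rather than strict ISO-8859-1. The class and 
method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of osmosis.
class ReencodeDemo {

    // Encodes the text to bytes with the given charset, then decodes
    // those bytes as UTF-8.
    static String reencode(String text, Charset viaCharset) {
        byte[] bytes = text.getBytes(viaCharset);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // ASSUMPTION: what a double-encoded "Straße" might look like in Java.
        String doubleEncoded = "Stra\u00C3\u0178e";

        // Via ISO-8859-1: U+0178 is unmappable and becomes '?', which leaves
        // the byte C3 dangling, so UTF-8 decoding yields U+FFFD followed by '?'.
        String viaLatin1 = reencode(doubleEncoded, StandardCharsets.ISO_8859_1);
        System.out.println(viaLatin1.equals("Stra\uFFFD?e")); // prints "true"

        // Via windows-1252: both characters map to single bytes (C3 9F),
        // which decode as UTF-8 to the original "ß".
        String viaCp1252 = reencode(doubleEncoded, Charset.forName("windows-1252"));
        System.out.println(viaCp1252.equals("Stra\u00DFe")); // prints "true"
    }
}
```

So under these assumptions the ISO-8859-1 round trip loses the character, 
while a windows-1252 round trip might recover it. I have not checked this 
against the real data.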

I'm a little curious how the JDBC driver encodes data when the 
connection is set to latin1. Presumably it will also write ? characters. 
The following configuration settings are from the Charsets.properties 
file inside the JDBC driver jar:
javaToMysqlMappings=\
US-ASCII = usa7,\
US-ASCII = ascii,\
Big5 = big5,\
GBK = gbk,\
SJIS = sjis,\
EUC_CN = gb2312,\
EUC_JP = ujis,\
EUC_JP_Solaris = >5.0.3 eucjpms,\
EUC_KR = euc_kr,\
EUC_KR = >4.1.0 euckr,\
ISO8859_1 = *latin1,\
ISO8859_1 = latin1_de,\
ISO8859_1 = german1,\
ISO8859_1 = danish,\
ISO8859_2 = latin2,\
ISO8859_2 = czech,\
ISO8859_2 = hungarian,\
ISO8859_2 = croat,\
ISO8859_7 = greek,\
ISO8859_7 = latin7,\
ISO8859_8 = hebrew,\
ISO8859_9 = latin5,\
ISO8859_13 = latvian,\
ISO8859_13 = latvian1,\
ISO8859_13 = estonia,\
Cp437 = *>4.1.0 cp850,\
Cp437 = dos,\
Cp850 = Cp850,\
Cp852 = Cp852,\
Cp866 = cp866,\
KOI8_R = koi8_ru,\
KOI8_R = >4.1.0 koi8r,\
TIS620 = tis620,\
Cp1250 = cp1250,\
Cp1250 = win1250,\
Cp1251 = *>4.1.0 cp1251,\
Cp1251 = win1251,\
Cp1251 = cp1251cias,\
Cp1251 = cp1251csas,\
Cp1256 = cp1256,\
Cp1251 = win1251ukr,\
Cp1257 = cp1257,\
MacRoman = macroman,\
MacCentralEurope = macce,\
UTF-8 = utf8,\
UnicodeBig = ucs2,\
US-ASCII = binary,\
Cp943 = sjis,\
MS932 = sjis,\
MS932 = >4.1.11 cp932,\
WINDOWS-31J = sjis,\
WINDOWS-31J = >4.1.11 cp932,\
CP932 = sjis,\
CP932 = *>4.1.11 cp932,\
SHIFT_JIS = sjis,\
ASCII = ascii,\
LATIN5 = latin5,\
LATIN7 = latin7,\
HEBREW = hebrew,\
GREEK = greek,\
EUCKR = euckr,\
GB2312 = gb2312,\
LATIN2 = latin2

Given that latin1 is mapped to ISO8859_1 in this file, I assume the 
driver will simply lose information if characters outside ISO-8859-1 
are written to the db.