[OSM-dev] osmosis utf-8

Tue Nov 6 10:47:20 GMT 2007

Martijn van Oosterhout wrote:
> On 11/6/07, Tom Hughes <tom at compton.nu> wrote:
>   
>>> AIUI the data is simply doubly encoded in the DB. The JDBC driver is
>>> doing the right thing by giving exactly what's in the database. I
>>> don't think you're going to "trick" JDBC into working around a problem
>>> like that.
>>>       
>> Just setting the connection character set to Latin-1 explicitly
>> should work (it's what I do with mysqldump which defaults to using
>> UTF-8) but it will only work for our broken server config and not
>> for any sensibly setup databases.
>>     
>
> My theory is that the JDBC driver sees the connection is Latin1 and
> converts the incoming stream back to unicode. In Java all strings are
> unicode, the JDBC drivers I know about automatically convert any
> incoming stream as appropriate. They even go so far as to detect if
> the user is trying to change the encoding.
>
> The end result is that you always get exactly what's in the DB, no
> matter what the config is. This is usually what you want, just it
> isn't here...
>
> Have a nice day,
>   
From what I can tell, setting the connection character set as Tom
suggested doesn't work. I've tried setting all manner of connection
properties to UTF-8 and ISO8859_1 (latin1). Each setting seemed to
configure both the server and client side at once. I couldn't find a way
to get the server to send in one encoding and the client to read in
another; I think the JDBC driver tells the server to change whenever the
client changes. When using a Hebrew encoding, I ended up with ?
characters, as if the server wasn't able to encode the data and wrote ?,
which the client then read. But without sniffing the connection it's hard
to tell exactly what's going on.

Martijn, I can try your trick and suspect it may work, but it's going to
be a lot of coding effort because JDBC string reads occur all over
the shop in osmosis. If the MySQL latin1 differs from Java's ISO8859_1
then I'm screwed anyway. I'm also concerned that it will only work for
characters that fit into the latin1 encoding; I wonder what would happen
to other characters such as Chinese.
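
For illustration, here is a minimal sketch of the kind of re-decoding
under discussion. It assumes the trick amounts to turning each string read
from JDBC back into Latin-1 bytes and decoding those bytes as UTF-8. The
class and method names are hypothetical and not part of osmosis, and the
corruption is simulated purely on the Java side, so it says nothing about
how MySQL's own latin1 conversion behaves.

```java
import java.nio.charset.StandardCharsets;

public class DoubleEncodingDemo {

    // Simulates the suspected corruption: UTF-8 bytes misread as Latin-1
    // characters (assumption: this mirrors what ends up in the database).
    static String doubleEncode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Hypothetical repair helper: map each char back to its Latin-1 byte,
    // then decode the resulting byte sequence as UTF-8.
    static String repair(String fromDb) {
        byte[] raw = fromDb.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // French, Chinese and Hebrew samples, written as Unicode escapes
        // so the source file encoding does not matter.
        String[] samples = {
            "caf\u00e9",
            "\u4e2d\u6587",
            "\u05e9\u05dc\u05d5\u05dd"
        };
        for (String s : samples) {
            String mangled = doubleEncode(s);
            System.out.println(s.equals(repair(mangled)));
        }
    }
}
```

Within Java the round trip is lossless for any character, Chinese
included, because ISO_8859_1 maps every byte value to exactly one char.
Whether the same holds against the real database depends on MySQL's
latin1 matching Java's ISO8859_1 byte for byte, which is the open question
above.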

How hard is it to fix the main db? Does it just require a dump and 
restore or is more substantial surgery required? Or don't we know yet?

Cheers,
Brett