[OSM-dev] osmosis utf-8

Tue Nov 6 09:43:22 GMT 2007

In message <2fc2c5f10711060133y22fdfca7iea0f1ca2f87c0be0 at mail.gmail.com>
        Martijn van Oosterhout <kleptog at gmail.com> wrote:

> On 11/6/07, Brett Henderson <brett at bretth.com> wrote:
>> The issue appears to be that somewhere along the line the multi-byte
>> utf-8 character is getting turned into two separate characters. Every
>> time I attempt to retrieve the data, I get two characters back. I have
>> not been able to find a way to trick the JDBC driver into somehow
>> re-assembling the broken data.
>
> AIUI the data is simply doubly encoded in the DB. The JDBC driver is
> doing the right thing by giving exactly what's in the database. I
> don't think you're going to "trick" JDBC into working around a problem
> like that.

Just setting the connection character set to Latin-1 explicitly
should work (it's what I do with mysqldump, which defaults to using
UTF-8), but it will only work for our broken server config and not
for any sensibly set up databases.

> What I would suggest is simply recoding the data again after you
> receive it. From looking at the docs something like this should work:
>
> res = new String( input.getBytes( "Latin-1" ), "UTF-8" )
>
> (I didn't check the encoding names).
>
> You probably need some toggle to enable it only when you need it. And
> some exception handling in case the data is buggered anyway...

The problem is that you need MySQL's interpretation of "Latin-1" and
not the real one ;-) MySQL's version is actually closer to Windows
CP1252 (also known as Windows Latin-1).
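
To make the difference concrete, here is a rough Java sketch of the
recoding step using CP1252 rather than ISO-8859-1. The class and method
names (Recode, recode) are made up for illustration and are not part of
osmosis. It also assumes that MySQL stores the five byte values CP1252
leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) as the matching C1
control characters, so those are passed through unchanged; treat that
as an assumption to check against the server in question.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Recode {
    // Java's name for Windows CP1252 ("Windows Latin-1").
    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Undo one round of double encoding: turn each character back into the
    // single byte it was decoded from, then decode the bytes as UTF-8.
    static String recode(String mangled) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < mangled.length(); i++) {
            char c = mangled.charAt(i);
            if (c >= 0x80 && c <= 0x9f) {
                // Assumption: C1 controls came from bytes CP1252 leaves
                // undefined, so the byte value equals the code point.
                out.write(c);
            } else {
                byte[] b = String.valueOf(c).getBytes(CP1252);
                out.write(b, 0, b.length);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The euro sign is E2 82 AC in UTF-8; read as CP1252 those bytes
        // become U+00E2, U+201A, U+00AC.
        String mangled = "\u00e2\u201a\u00ac";
        System.out.println(recode(mangled).equals("\u20ac"));
    }
}
```

With real ISO-8859-1 the middle character (U+201A) has no byte value at
all, so getBytes() would substitute a '?' and the result would no longer
be valid UTF-8. As Martijn says, this wants a toggle and some error
handling around it, since data that was never doubly encoded will be
damaged by the same treatment.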

Tom

-- 
Tom Hughes (tom at compton.nu)
http://www.compton.nu/