[OSM-dev] osmosis utf-8

Tue Nov 6 09:33:30 GMT 2007

On 11/6/07, Brett Henderson <brett at bretth.com> wrote:
> The issue appears to be that somewhere along the line the multi-byte
> utf-8 character is getting turned into two separate characters. Every
> time I attempt to retrieve the data, I get two characters back. I have
> not been able to find a way to trick the JDBC driver into somehow
> re-assembling the broken data.

AIUI the data is simply doubly encoded in the DB. The JDBC driver is
doing the right thing by giving exactly what's in the database. I
don't think you're going to "trick" JDBC into working around a problem
like that.
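
To illustrate what "doubly encoded" means here, the following is a minimal sketch, not taken from osmosis or the API code. The class and method names are made up for the example, and it assumes the intermediate misreading was ISO-8859-1. It shows how one multi-byte UTF-8 character ends up as two stored characters:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; simulates a writer whose UTF-8 bytes are
// treated as Latin-1 by the connection before being stored.
class DoubleEncodingDemo {
    static String doubleEncode(String original) {
        // The client sends the text as UTF-8 bytes...
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        // ...but each byte is misread as one Latin-1 character.
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String stored = doubleEncode("\u00E9"); // e with acute accent
        // One original character has become two characters.
        System.out.println(stored.length()); // prints 2
        // Stored as UTF-8, those two characters take four bytes.
        System.out.println(stored.getBytes(StandardCharsets.UTF_8).length); // prints 4
    }
}
```

The JDBC driver then faithfully returns those two characters, which matches the behaviour described in the quoted text.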

What I would suggest is simply recoding the data again after you
receive it. From looking at the docs something like this should work:

res = new String( input.getBytes( "ISO-8859-1" ), "UTF-8" );

(I didn't check the encoding names).

You probably need some toggle to enable it only when you need it. And
some exception handling in case the data is buggered anyway...
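
A rough sketch of what the toggle and the exception handling could look like follows. The class name, the `enabled` flag and the fall-back behaviour are assumptions for illustration, not existing osmosis code. It also assumes the data was misread as ISO-8859-1 (Java's canonical name for Latin-1); MySQL's latin1 is closer to windows-1252, so a few characters may need that charset instead.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: undoes one level of double encoding when enabled.
class DoubleEncodingFixer {
    private final boolean enabled; // the "toggle"

    DoubleEncodingFixer(boolean enabled) {
        this.enabled = enabled;
    }

    String fix(String input) {
        if (!enabled || input == null) {
            return input;
        }
        try {
            // Turn each character back into the single byte it came from.
            // REPORT makes this fail for characters outside Latin-1.
            ByteBuffer bytes = StandardCharsets.ISO_8859_1.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(input));
            // Decode those bytes as UTF-8, failing on invalid sequences
            // instead of silently inserting replacement characters.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(bytes)
                    .toString();
        } catch (CharacterCodingException e) {
            // Data is not cleanly double encoded; leave it untouched.
            return input;
        }
    }
}
```

Using strict encoders and decoders rather than the plain String constructor means a value that is not actually double encoded comes back unchanged instead of being mangled further.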

> Does anybody know what happens to an existing database if encoding
> settings in my.cnf are changed? Tom mentioned some double encoding
> issues which appear to be at the root of this issue. Presumably changing
> my.cnf to make everything utf-8 will either break something or possibly
> do nothing at all if the problem setting is persisted in the database
> somewhere.

AIUI the problem is that Ruby does not check the actual encoding of
the connection but simply assumes something. Changing the encoding of
the connection will simply break everything :(

Have a nice day,
-- 
Martijn van Oosterhout <kleptog at gmail.com> http://svana.org/kleptog/