[OSM-dev] osmosis utf-8

Tue Nov 6 05:49:22 GMT 2007

Brett Henderson wrote:
> Christopher Schmidt wrote:
>> http://labs.metacarta.com/osm/?zoom=17&lat=6528334.17358&lon=1235482.2807&layers=B00 
>>
>>
>> Looks like Osmosis is still having issues with UTF-8.
> Thanks for letting me know. I'm fairly sure it's the database 
> connection causing problems but it's not clear how to fix it. I'll 
> experiment with different encoding settings over the next couple of 
> days and see if any of them fix it.
>
> Ruby and C apps apparently work correctly when using default 
> connection encoding settings, but java doesn't. I'm guessing Ruby and 
> C both share a common library but the jdbc driver is pure java and 
> might operate differently. I'm concerned this may not be fixable with 
> the current database configuration. Anyway, I'll see what I can do.
I'm stuck and don't know what else to try. I want to get the daily diffs 
working correctly so that I can move on to the next task of making them 
easier to consume.

I'm trying to get the name tag of way 4810727 to dump correctly but the 
dumped data is always incorrect. The multi-byte character in that name 
tag always gets written as two characters.
http://www.openstreetmap.org/api/0.5/way/4810727/history
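
To make the symptom concrete, here is a minimal Python sketch of how one multi-byte character can come out as two. It assumes the tag value is "Ilmenauer Straße" and that the UTF-8 bytes are being read as MySQL latin1 (which corresponds to cp1252). Both are assumptions for illustration, not a trace of what osmosis or the JDBC driver actually does.

```python
# Assumed tag value; the real one is in the way history linked above.
name = "Ilmenauer Straße"

utf8_bytes = name.encode("utf-8")      # "ß" becomes the two bytes C3 9F
misread = utf8_bytes.decode("cp1252")  # each byte is read as its own character

print(len(name), len(misread))         # the misread string is one character longer
print(misread)                         # "ß" now shows up as "Ã" followed by "Ÿ"
```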

The current osmosis behaviour is to use the MySQL JDBC driver defaults 
for all character set encoding configuration. There appear to be several 
parameters that affect character sets as described in the following URL:
http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-configuration-properties.html

The parameters are useUnicode, characterEncoding, characterSetResults 
and clobCharacterEncoding. useUnicode and characterEncoding were 
available in version 1.1g of the driver (a long time ago), 
characterSetResults was added in v3.0.13 of the driver (more recently 
but still some time ago) and clobCharacterEncoding was added in v5.0.0 
of the driver (very recently). I initially only focused on the first two 
settings.

useUnicode defaults to true, and characterEncoding defaults to the 
server's "character_set_server" value, detected at connection time. On 
the current production MySQL DB, I believe character_set_server is set 
to latin1. I have tried the four combinations of useUnicode=true|false 
and characterEncoding=UTF-8|ISO8859_1 (latin1) without any effect on the 
output.

Moving on to characterSetResults. I left useUnicode at its default in all 
remaining tests. I tried all four combinations of 
characterEncoding=ISO8859_1|UTF-8 and 
characterSetResults=ISO8859_1|UTF-8. No change in output.

Last chance is clobCharacterEncoding. I tried all eight combinations of 
characterEncoding=ISO8859_1|UTF-8, characterSetResults=ISO8859_1|UTF-8 
and clobCharacterEncoding=ISO8859_1|UTF-8. Still no change in output.
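
The test matrix above can be written down mechanically. The following Python sketch only builds the connection URLs for each combination of the three properties. The host and schema in BASE_URL are hypothetical placeholders, and the helper name candidate_urls is made up for this example.

```python
from itertools import product

# Hypothetical host and schema names, used only to make the URLs concrete.
BASE_URL = "jdbc:mysql://localhost/osm"

ENCODINGS = ["ISO8859_1", "UTF-8"]
PROPERTIES = ["characterEncoding", "characterSetResults", "clobCharacterEncoding"]


def candidate_urls():
    """Return one JDBC URL per combination of the three encoding properties."""
    urls = []
    for values in product(ENCODINGS, repeat=len(PROPERTIES)):
        query = "&".join(f"{prop}={value}" for prop, value in zip(PROPERTIES, values))
        urls.append(f"{BASE_URL}?{query}")
    return urls


for url in candidate_urls():
    print(url)
```

Two values for each of three properties gives eight URLs, each of which would need its own dump run to compare the output.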

I'm sure my changes were taking effect. If I put in an invalid encoding 
at any point, osmosis would crash. When I changed the encoding to 
ISO8859_8 (Hebrew), the name tag changed to "Ilmenauer Stra??e".
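
One plausible reading of the "??" result, sketched in Python: if the name has already been split into two Western European characters, neither of them exists in the Hebrew character set, so each one is replaced with a question mark. This assumes the original value is "Ilmenauer Straße" and that the misreading goes through cp1252 (MySQL's latin1). It is a guess at the mechanism, not something confirmed against the driver.

```python
name = "Ilmenauer Straße"                        # assumed original tag value
misread = name.encode("utf-8").decode("cp1252")  # "ß" split into two characters

# Neither of the two characters has a slot in ISO 8859-8, so both are replaced.
as_hebrew = misread.encode("iso8859_8", errors="replace")
print(as_hebrew.decode("ascii"))
```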

The issue appears to be that somewhere along the line the multi-byte 
UTF-8 character is getting turned into two separate characters. Every 
time I attempt to retrieve the data, I get two characters back. I have 
not been able to find a way to trick the JDBC driver into somehow 
reassembling the broken data.
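
For reference, this is what reassembling would look like if it were done outside the driver, sketched in Python. The function name reassemble is hypothetical, and the sketch assumes the damage is exactly one round of UTF-8 bytes read as cp1252 (MySQL's latin1). Strings that do not fit that pattern are returned unchanged.

```python
def reassemble(text: str) -> str:
    """Undo one round of UTF-8-read-as-cp1252 damage, if the text fits that pattern."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not double-encoded (or damaged in some other way): leave it alone.
        return text


broken = "Ilmenauer Stra\u00c3\u0178e"  # assumed form of the broken name tag
print(reassemble(broken))
```

A workaround like this only hides the problem on the reading side; the stored data would still be double encoded.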

I checked the default settings when logging in via a standard connection 
to the production db.
Server characterset: latin1
Db characterset: utf8
Client characterset: latin1
Conn. characterset: latin1

All environments that I've previously tested on have the 
default-character-set property set to utf8 in the [mysql] and [mysqld] 
sections of my.cnf which results in:
Server characterset: utf8
Db characterset: utf8
Client characterset: utf8
Conn. characterset: utf8

All in all, I've spent about 3 hours on this with zero results. Getting 
more than a little frustrated ;-) I'm giving up for now unless anybody 
has any suggestions. I suspect the only way to fix this is to fix the 
root cause in the production database. I don't know how difficult this is.

Does anybody know what happens to an existing database if encoding 
settings in my.cnf are changed? Tom mentioned some double encoding 
issues which appear to be at the root of this issue. Presumably changing 
my.cnf to make everything utf-8 will either break something or possibly 
do nothing at all if the problem setting is persisted in the database 
somewhere.

Brett