[OSM-dev] osmosis utf-8
Brett Henderson
brett at bretth.com
Tue Nov 6 05:49:22 GMT 2007
Brett Henderson wrote:
> Christopher Schmidt wrote:
>> http://labs.metacarta.com/osm/?zoom=17&lat=6528334.17358&lon=1235482.2807&layers=B00
>>
>>
>> Looks like Osmosis is still having issues with UTF-8.
> Thanks for letting me know. I'm fairly sure it's the database
> connection causing problems but it's not clear how to fix it. I'll
> experiment with different encoding settings over the next couple of
> days and see if any of them fix it.
>
> Ruby and C apps apparently work correctly when using default
> connection encoding settings, but java doesn't. I'm guessing Ruby and
> C both share a common library but the jdbc driver is pure java and
> might operate differently. I'm concerned this may not be fixable with
> the current database configuration. Anyway, I'll see what I can do.
I'm stuck and don't know what else to try. I want to get the daily diffs
working correctly so that I can move on to the next task of making them
easier to consume.
I'm trying to get the name tag of way 4810727 to dump correctly but the
dumped data is always incorrect. The multi-byte character in that name
tag always gets written as two characters.
http://www.openstreetmap.org/api/0.5/way/4810727/history
The current osmosis behaviour is to use the MySQL JDBC driver defaults
for all character set encoding configuration. There appear to be several
parameters that affect character sets as described in the following URL:
http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-configuration-properties.html
The parameters are useUnicode, characterEncoding, characterSetResults
and clobCharacterEncoding. useUnicode and characterEncoding were
available in version 1.1g of the driver (a long time ago),
characterSetResults was added in v3.0.13 of the driver (more recently
but still some time ago) and clobCharacterEncoding was added in v5.0.0
of the driver (very recently). I initially only focused on the first two
settings.
useUnicode defaults to true, and characterEncoding defaults to the
server's "character_set_server" value, detected during connection. On the
current production MySQL DB, I believe the character_set_server parameter
is set to latin1. I have tried the four combinations of
useUnicode=true|false and characterEncoding=UTF-8|ISO8859_1(latin1)
without any effect on the output.
Moving on to characterSetResults. I left useUnicode at default in all
remaining tests. I tried all combinations of
characterEncoding=ISO8859_1|UTF-8 and
characterSetResults=ISO8859_1|UTF-8. No change in output.
Last chance is clobCharacterEncoding. I tried all combinations of
characterEncoding=ISO8859_1|UTF-8, characterSetResults=ISO8859_1|UTF-8
and clobCharacterEncoding=ISO8859_1|UTF-8. Still no change in output.
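For anyone wanting to repeat these tests, here is a rough sketch of how the parameters above can be appended to a MySQL JDBC URL as query-string properties. The host name "localhost", the database name "osm" and the helper buildUrl are made up for illustration; this is not how osmosis itself builds its connection.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class JdbcUrlSketch {
    // Hypothetical helper: appends driver properties to a jdbc:mysql URL
    // in the order they were added to the map.
    static String buildUrl(String host, String db, Map<String, String> params) {
        StringBuilder url = new StringBuilder("jdbc:mysql://" + host + "/" + db);
        char separator = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            url.append(separator).append(e.getKey()).append('=').append(e.getValue());
            separator = '&';
        }
        return url.toString();
    }

    public static void main(String[] args) {
        // One of the combinations described above; host and db are assumptions.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("useUnicode", "true");
        params.put("characterEncoding", "UTF-8");
        params.put("characterSetResults", "UTF-8");
        System.out.println(buildUrl("localhost", "osm", params));
    }
}
```

The resulting string would be passed to DriverManager.getConnection along with a user name and password.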
I'm sure my changes were taking effect. If I put in an invalid encoding
at any point, osmosis would crash. When I changed the encoding to
ISO8859_8 (Hebrew), the name tag changed to "Ilmenauer Stra??e".
The issue appears to be that somewhere along the line the multi-byte
utf-8 character is getting turned into two separate characters. Every
time I attempt to retrieve the data, I get two characters back. I have
not been able to find a way to trick the JDBC driver into somehow
re-assembling the broken data.
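To illustrate the symptom outside the driver, the sketch below shows how one multi-byte UTF-8 character becomes two characters when its bytes are read as latin1, and how that could in principle be reversed in application code after retrieval. It assumes the real name is "Ilmenauer Straße" and that the damage is a plain UTF-8-read-as-ISO-8859-1 mix-up; neither is confirmed for the production data, and the class and method names are invented for the example.

```java
import java.nio.charset.StandardCharsets;

class DoubleEncodingSketch {
    // Simulates the suspected fault: UTF-8 bytes interpreted as latin1.
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses it: recover the raw bytes, then decode them as UTF-8.
    static String repair(String broken) {
        byte[] raw = broken.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "Ilmenauer Stra\u00dfe"; // assumed correct value of the name tag
        String broken = misdecode(name);
        System.out.println(name.length() + " chars become " + broken.length());
        System.out.println("repaired equals original: " + repair(broken).equals(name));
    }
}
```

This only works if no bytes were lost on the way, and it would corrupt any rows that are already correct, so it is a diagnostic aid rather than a fix.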
I checked the default settings when logging in via a standard connection
to the production db.
Server characterset: latin1
Db characterset: utf8
Client characterset: latin1
Conn. characterset: latin1
All environments that I've previously tested on have the
default-character-set property set to utf8 in the [mysql] and [mysqld]
sections of my.cnf which results in:
Server characterset: utf8
Db characterset: utf8
Client characterset: utf8
Conn. characterset: utf8
All in all, I've spent about 3 hours on this with zero results. Getting
more than a little frustrated ;-) I'm giving up for now unless anybody
has any suggestions. I suspect the only way to fix this is to fix the
root cause in the production database. I don't know how difficult this is.
Does anybody know what happens to an existing database if encoding
settings in my.cnf are changed? Tom mentioned some double encoding
issues which appear to be at the root of this issue. Presumably changing
my.cnf to make everything utf-8 will either break something or possibly
do nothing at all if the problem setting is persisted in the database
somewhere.
Brett