[OSM-dev] osmosis applied to latest planet
brett at bretth.com
Fri Oct 26 10:35:35 BST 2007
Okay, that makes sense ... I think.
For a long time I used the default encoding, but I changed that the last
time I had utf8 issues. The cause of that one turned out to be my use of
the default encoding when writing a file, in which case you do have to
set the encoding explicitly. The MySQL JDBC driver is supposed to
autodetect the encoding, so overriding it seemed a bit dodgy at the time.
I'll remove the forced encoding. Hopefully that will fix the problem.
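The file-writing case I mentioned -- where the platform default encoding
bites unless you set one explicitly -- can be sketched like this
(illustrative Python rather than the actual Java code; the filename and
the cp1252 default are just assumptions for the demo):

```python
import os
import tempfile

text = "caf\u00e9"  # a name containing a non-ASCII character
path = os.path.join(tempfile.mkdtemp(), "changeset.osm")  # hypothetical file

# Relying on the platform default encoding can silently write something
# other than UTF-8 (e.g. cp1252 on a Windows machine):
with open(path, "w", encoding="cp1252") as f:
    f.write(text)
with open(path, "rb") as f:
    assert f.read() != text.encode("utf-8")  # 0xe9, not 0xc3 0xa9

# Setting the encoding explicitly gives the same bytes everywhere:
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, "rb") as f:
    assert f.read() == text.encode("utf-8")
```

The same principle applies in Java: a FileWriter built without an
explicit charset inherits the platform default.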
Tom Hughes wrote:
> In message <4721988A.1020903 at bretth.com>
> Brett Henderson <brett at bretth.com> wrote:
>> I downloaded the problem area from the api and imported it into a
>> local database. I then dumped the db as a changeset. The data
>> doesn't get corrupted anywhere. Something is different when running
>> on the production servers, but I don't know what. I've tried running
>> it on both my Windows laptop (codepage 1252, I believe), which has
>> utf8 settings applied in my.cnf, and my Linux server, which seems to
>> have a utf8 console but no utf8 default in my.cnf, although the
>> database instance itself was created with the utf8 encoding.
>> In both cases the data can be imported and dumped successfully without
>> utf8 loss.
> What you're seeing looks like it is the result of the weird double
> encoding we have in the master database. At a guess something is
> reading from the database via a connection that has been explicitly
> put into UTF-8 mode, and that will break things when reading from
> the master database and give you double-encoded data. You need to
> leave the connection with the default character set.
> The full story, or at least what Jon Burgess and I think is going
> on, is as follows...
> At some point the database was (I think) converted to UTF-8 so that
> it thinks it is storing UTF-8 data. The default connection character
> set was not changed however, and the API code (both pre and post
> rails) does not explicitly set the connection character set.
> So you have ruby code running the API which receives XML which of
> course contains UTF-8 data. It parses that into ruby variables which
> are themselves UTF-8 and then writes them to MySQL.
> Now the ruby MySQL client library seemingly doesn't automatically
> set the connection to UTF-8 or do any conversion if the connection
> is set to something else. So it just writes UTF-8 data to the
> database via a connection that is expecting Latin-1 data.
> Because the database is UTF-8 but is (it thinks) receiving Latin-1
> data it does a conversion as it writes the data to the database.
> That's all fine so long as the reverse happens when you retrieve
> the data. If you retrieve data via a UTF-8 connection then it
> doesn't bother undoing the conversion, and you get mangled data.
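The double-encoding round trip Tom describes can be sketched at the byte
level (illustrative Python, not the actual Ruby or Java code; "latin-1"
stands in for the connection's default character set):

```python
original = "caf\u00e9"  # "café" -- UTF-8 data arriving in the API's XML

# The Ruby client sends raw UTF-8 bytes down a connection whose
# character set defaults to latin1:
wire_bytes = original.encode("utf-8")

# The server believes it is receiving latin1, so it "converts" to UTF-8
# while storing -- producing double-encoded data:
stored = wire_bytes.decode("latin-1").encode("utf-8")

# Reading back over a default (latin1) connection applies the reverse
# conversion, so the data looks fine:
read_default = stored.decode("utf-8").encode("latin-1").decode("utf-8")
assert read_default == original

# Reading over a connection explicitly set to UTF-8 skips that reverse
# step, and you get mojibake:
read_utf8 = stored.decode("utf-8")
print(read_utf8)  # caf\u00c3\u00a9 ("cafÃ©")
```

This is why leaving the connection on the default character set happens
to work against the master database, while forcing UTF-8 breaks it.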