[OSM-dev] osmosis applied to latest planet

Fri Oct 26 10:35:35 BST 2007

Okay, that makes sense ... I think.

For a long time I used the default encoding, but I changed it the last 
time I had UTF-8 issues.  That problem turned out to be caused by me 
using the default encoding when writing a file, which is a case where 
you do have to set the encoding explicitly.  The MySQL JDBC driver is 
supposed to autodetect the encoding, so overriding it seemed a bit dodgy 
at the time.

I'll remove the forced encoding.  Hopefully that will fix the problem.

Tom Hughes wrote:
> In message <4721988A.1020903 at bretth.com>
>         Brett Henderson <brett at bretth.com> wrote:
>
>   
>> I downloaded the problem area from the API and imported it into a
>> local database.  I then dumped the db as a changeset.  The data
>> doesn't get corrupted anywhere.  Something is different when running
>> on the production servers but I don't know what.  I've tried running
>> it on two machines.  My Windows laptop (codepage 1252, I believe) has
>> utf8 settings applied in my.cnf.  My Linux server seems to be set to
>> a utf8 console and does not default to utf8 in my.cnf, but the
>> database instance itself was created with the utf8 encoding.
>> In both cases the data can be imported and dumped successfully without
>> utf8 loss.
>>     
>
> What you're seeing looks like it is the result of the weird double
> encoding we have in the master database. At a guess something is
> reading from the database via a connection that has been explicitly
> put into UTF-8 mode, and that will break things when reading from
> the master database and give you double-encoded data. You need to
> leave the connection with the default character set.
>
> The full story, or at least what Jon Burgess and I think is going
> on, is as follows...
>
> At some point the database was (I think) converted to UTF-8 so that
> it thinks it is storing UTF-8 data. The default connection character
> set was not changed, however, and the API code (both pre- and
> post-Rails) does not explicitly set the connection character set.
>
> So you have Ruby code running the API which receives XML, which of
> course contains UTF-8 data. It parses that into Ruby variables, which
> are themselves UTF-8, and then writes them to MySQL.
>
> Now the Ruby MySQL client library seemingly doesn't automatically
> set the connection to UTF-8 or do any conversion if the connection
> is set to something else. So it just writes UTF-8 data to the
> database via a connection that is expecting Latin-1 data.
>
> Because the database is UTF-8 but is (it thinks) receiving Latin-1
> data, it converts that data from Latin-1 to UTF-8 as it writes it,
> so the stored text ends up encoded twice.
>
> That's all fine so long as the reverse happens when you retrieve
> the data, which it does over a default (Latin-1) connection. If you
> retrieve data via a UTF-8 connection then it doesn't bother undoing
> the conversion and you get mangled data back.
>
> Tom
>
>
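
For anyone who wants to see the effect Tom describes without touching a
database, here is a rough Python sketch of the round trip.  It is only a
simulation under stated assumptions: Python's "latin-1" codec stands in
for MySQL's latin1 connection character set, and the three function
names are made up for illustration.  None of this is osmosis or API
code.

```python
# Simulation only: no database is involved.  Python's "latin-1" codec is
# assumed to behave like MySQL's latin1 connection character set for the
# characters used here.  All function names are hypothetical.


def store_via_latin1_connection(client_bytes: bytes) -> bytes:
    """Write path: the server believes each incoming byte is one Latin-1
    character and converts it to UTF-8 for storage."""
    return client_bytes.decode("latin-1").encode("utf-8")


def read_via_latin1_connection(stored: bytes) -> bytes:
    """Read path on a default connection: the server converts the stored
    UTF-8 back to Latin-1, which undoes the write-side conversion."""
    return stored.decode("utf-8").encode("latin-1")


def read_via_utf8_connection(stored: bytes) -> bytes:
    """Read path on a forced UTF-8 connection: no conversion is applied,
    so the client receives the double-encoded bytes as stored."""
    return stored


original = "Z\u00fcrich"            # "Zürich"
sent = original.encode("utf-8")     # bytes the API code hands to MySQL
stored = store_via_latin1_connection(sent)

default_read = read_via_latin1_connection(stored).decode("utf-8")
utf8_read = read_via_utf8_connection(stored).decode("utf-8")

print(repr(default_read))  # matches the original text
print(repr(utf8_read))     # mangled: each non-ASCII char became two
```

The default-connection read gives back the original text, while the
forced UTF-8 read returns the double-encoded form, which is consistent
with the corruption only showing up against the master database.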