[OSM-dev] osmosis applied to latest planet

Tom Hughes tom at compton.nu
Fri Oct 26 08:59:36 BST 2007

In message <4721988A.1020903 at bretth.com>
        Brett Henderson <brett at bretth.com> wrote:

> I downloaded the problem area from the API and imported it into a
> local database.  I then dumped the DB as a changeset.  The data
> doesn't get corrupted anywhere.  Something is different when running
> on the production servers but I don't know what.  I've tried running
> it on both my Windows laptop (codepage 1252 I believe), which has utf8
> settings applied in my.cnf, and my Linux server, which seems to be set
> to a utf8 console.  The server is not defaulted to utf8 in my.cnf, but
> the database instance itself was created with the utf8 encoding.
> In both cases the data can be imported and dumped successfully without
> utf8 loss.

What you're seeing looks like the result of the weird double
encoding we have in the master database. At a guess, something is
reading from the database via a connection that has been explicitly
put into UTF-8 mode. That breaks reads from the master database
and gives you double-encoded data. You need to leave the connection
on its default character set.

The full story, or at least what Jon Burgess and I think is going
on, is as follows...

At some point the database was (I think) converted to UTF-8, so
it thinks it is storing UTF-8 data. The default connection character
set was not changed, however, and the API code (both pre- and
post-Rails) does not explicitly set the connection character set.

So you have Ruby code running the API, which receives XML that of
course contains UTF-8 data. It parses that into Ruby variables, which
are themselves UTF-8, and then writes them to MySQL.

Now the Ruby MySQL client library seemingly doesn't automatically
set the connection to UTF-8, nor does it do any conversion if the
connection is set to something else. So it just writes UTF-8 data to
the database via a connection that is expecting Latin-1 data.

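Because the database is UTF-8 but is (it thinks) receiving Latin-1
data, it converts from Latin-1 to UTF-8 as it writes the data to the
database. Each byte of a multi-byte UTF-8 sequence is treated as a
separate Latin-1 character and encoded again, which is where the
double encoding comes from.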
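
As a rough illustration of that write path, here is a minimal Python
sketch. It does not talk to MySQL at all; the function name
store_via_latin1_connection is made up for this example, and Python's
"latin-1" codec is only an approximation of MySQL's latin1 character
set (the two differ for some bytes in the 0x80-0x9F range).

```python
def store_via_latin1_connection(client_bytes: bytes) -> bytes:
    """Hypothetical stand-in for the server-side conversion on write."""
    # The server believes every incoming byte is one Latin-1 character...
    as_text = client_bytes.decode("latin-1")
    # ...and re-encodes those characters as UTF-8 for storage.
    return as_text.encode("utf-8")


original = "café"
sent = original.encode("utf-8")        # what the API actually sends
stored = store_via_latin1_connection(sent)

print(sent)    # b'caf\xc3\xa9'
print(stored)  # b'caf\xc3\x83\xc2\xa9'
```

The two bytes of the UTF-8 "é" each become a two-byte sequence in the
table, so the stored value is UTF-8 applied twice.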

That's all fine so long as the reverse happens when you retrieve
the data. If you retrieve data via a UTF-8 connection, the database
doesn't bother undoing the conversion and you get mangled data.
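
The two read paths can be sketched the same way. This is only a
simulation in Python under the assumptions above, not real MySQL
client code; both function names are hypothetical, and Python's
"latin-1" codec stands in for MySQL's latin1.

```python
def read_via_latin1_connection(stored: bytes) -> bytes:
    """Hypothetical: the server converts stored UTF-8 back to Latin-1."""
    return stored.decode("utf-8").encode("latin-1")


def read_via_utf8_connection(stored: bytes) -> bytes:
    """Hypothetical: the charsets match, so bytes are returned unchanged."""
    return stored


# Double-encoded value as it would sit in the table for "café".
stored = "café".encode("utf-8").decode("latin-1").encode("utf-8")

# Default connection: the conversion is undone and the client sees
# the UTF-8 bytes it originally wrote.
good = read_via_latin1_connection(stored).decode("utf-8")

# UTF-8 connection: nothing is undone, so the client decodes the
# double-encoded bytes directly.
bad = read_via_utf8_connection(stored).decode("utf-8")

print(good)  # café
print(bad)   # cafÃ©
```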


Tom Hughes (tom at compton.nu)
