[OSM-dev] osmosis utf-8
Brett Henderson
brett at bretth.com
Thu Nov 8 11:59:55 GMT 2007
Martijn van Oosterhout wrote:
> On Nov 8, 2007 2:16 AM, Brett Henderson <brett at bretth.com> wrote:
>
>> Is there a simple tcp proxy/tunnel application I can use to log
>> connection data to file? It might be more useful than me guessing at
>> what is going on between osmosis and the database.
>>
>
> Maybe tcpdump, if the connection isn't encrypted...
>
Cool, I'll have to try this out. I knew there had to be something like
that but couldn't find it in a couple of minutes of googling.
>
>> I've just created test-utf8.osc and test-iso-8859-1.osc in the
>> http://planet.openstreetmap.org/daily directory.
>> Both are performed with a utf-8 database connection. The output file
>> encoding is changed as indicated by the file name.
>>
>
> Ok, this is weird. The utf8 file has c3 83 c5 b8 and the iso-8859-1
> has c3 3f. Now utf8(c3 83) = latin1(c3) so that's good. But utf8(c5
> b8) is unicode(0x178), which has no latin1 encoding (it's a Y with
> two dots above it, 'Ÿ').
>
> I'm going to take a guess in suggesting the character is supposed to
> be a 'ß', unicode(0xDF) = utf8(c3 9f). It turns out that in windows
> code page 1252 the character 'Ÿ' is represented by 0x9f. So we have one
> or more of:
>
Yep, that's the desired character.
> 1. what mysql thinks is latin1 is not
> 2. ruby is connecting in a windows code page 1252
> 3. The recoding from the server encoding to java is wrong
>
> In any case, can you set the file output encoding to windows cp1252
> and see what happens?
>
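To make the byte arithmetic above concrete, here is a minimal Python sketch. It is not part of osmosis; it simply assumes that the UTF-8 bytes for 'ß' get read as cp1252 somewhere between the database and the output file, and then shows what each of the three output encodings would produce for the resulting text.

```python
# 'ß' (U+00DF) as it should appear in a UTF-8 file.
correct_utf8 = "ß".encode("utf-8")

# Assumption: somewhere along the way these two bytes are interpreted as
# cp1252, which turns them into two separate characters ('Ã' and 'Ÿ').
misread = correct_utf8.decode("cp1252")

# Write the misread text back out in each of the three encodings tried.
as_utf8 = misread.encode("utf-8")
as_latin1 = misread.encode("iso-8859-1", errors="replace")  # 'Ÿ' becomes '?'
as_cp1252 = misread.encode("cp1252")

print("utf-8     :", as_utf8.hex(" "))
print("iso-8859-1:", as_latin1.hex(" "))
print("cp1252    :", as_cp1252.hex(" "))
```

Under that assumption the utf-8 and iso-8859-1 results match the byte sequences seen in the two test files, and the cp1252 result is byte-for-byte the correct UTF-8 for 'ß'.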
That lines up with what Tom was saying about MySQL using a
windows-1252-like encoding. I'm feeling a little silly: I tried to find
the name of the 1252 encoding yesterday to try it out and came to the
conclusion that Java didn't support it. I was wrong (not sure why I
didn't see it ...), otherwise I might have fixed this sooner.
Check out:
http://planet.openstreetmap.org/daily/test-cp1252.osc
Unless I'm mistaken that's the required output! Thanks for your
assistance on this, hugely appreciated. You've become the "go to" man
for utf-8 bug solving ;-) I'll do a local change to the copy of osmosis
on dev to make it write in Cp1252, and I'll do a proper fix to make it
optional on the command line over the next few days.
TomH, do you know if MySQL uses cp1252 exactly or are there some subtle
differences we should be aware of?
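
One place where such differences could show up is the handful of byte values that a strict cp1252 mapping leaves unassigned. As a rough sketch, the following lists the bytes that Python's own cp1252 codec refuses to decode; whether MySQL (or Java's Cp1252) treats those bytes the same way is exactly the open question, so this only narrows down where to look.

```python
# Collect every byte value that Python's strict cp1252 codec cannot decode.
undefined = []
for value in range(256):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        undefined.append(value)

print([hex(v) for v in undefined])
```

Any data containing those bytes would be worth checking separately once the Cp1252 output is in place.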
Cheers,
Brett