[OSM-dev] osmosis utf-8

Brett Henderson brett at bretth.com
Thu Nov 8 11:59:55 GMT 2007


Martijn van Oosterhout wrote:
> On Nov 8, 2007 2:16 AM, Brett Henderson <brett at bretth.com> wrote:
>   
>> Is there a simple tcp proxy/tunnel application I can use to log
>> connection data to file?  It might be more useful than me guessing at
>> what is going on between osmosis and the database.
>>     
>
> Maybe tcpdump, if the connection isn't encrypted...
>   
Cool, I'll have to try this out.  I knew there had to be something like 
that but couldn't find it in a couple of minutes of googling.
>   
>> I've just created test-utf8.osc and test-iso-8859-1.osc in the
>> http://planet.openstreetmap.org/daily directory.
>> Both were generated with a utf-8 database connection.  The output file
>> encoding is changed as indicated by the file name.
>>     
>
> Ok, this is weird. The utf8 file has c3 83 c5 b8 and the iso-8859-1
> has c3 3f. Now utf8(c3 83) = latin1(c3) so that's good. But utf8(c5
> b8) is not latin1, being unicode(0x178) which is not latin1 (it's a Y
> with two dots above it 'Ÿ').
>
> I'm going to take a guess in suggesting the character is supposed to
> be a 'ß', unicode(0xDF) = utf8(c3 9f). It turns out that in windows
> code page 1252 the character 'Ÿ' is represented by 0x9f. So we have one
> or more of:
>   
Yep, that's the desired character.
> 1. what mysql thinks is latin1 is not
> 2. ruby is connecting in a windows code page 1252
> 3. The recoding from the server encoding to java is wrong
>
> In any case, can you set the file output encoding to windows cp1252
> and see what happens?
>   
That lines up with what Tom was saying about MySQL using a 
windows-1252-like encoding.  I'm feeling a little silly.  I tried to find 
the name of the 1252 encoding yesterday so I could try it out, and came 
to the conclusion that java didn't support it.  I was wrong (not sure why 
I didn't see it ...), otherwise I might have fixed this sooner.
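
For anyone following along, the byte sequences Martijn describes can be 
reproduced with a short sketch.  It's Python rather than the Java that 
osmosis uses, and it assumes that somewhere along the way the utf-8 bytes 
for 'ß' are read as Windows-1252 text, and that Python's cp1252 codec 
matches what MySQL does.  Neither assumption is confirmed in this thread.

```python
# Sketch only: reproduces the byte sequences discussed in this thread.
# Assumption: the UTF-8 bytes for the character are misread as cp1252.

def hexbytes(data: bytes) -> str:
    """Format bytes as space-separated lowercase hex, e.g. 'c3 9f'."""
    return " ".join(f"{b:02x}" for b in data)

original = "\u00df"                  # 'ß', the character we actually want
stored = original.encode("utf-8")    # the two bytes c3 9f

# Reading those bytes as cp1252 gives 'Ã' (0xc3) followed by 'Ÿ' (0x9f).
misread = stored.decode("cp1252")

# Writing the misread text back out in each of the three test encodings.
as_utf8 = misread.encode("utf-8")
# 'Ÿ' has no iso-8859-1 code point, so it is replaced with '?' (0x3f).
as_latin1 = misread.encode("iso-8859-1", errors="replace")
as_cp1252 = misread.encode("cp1252")

for label, data in (("utf-8", as_utf8),
                    ("iso-8859-1", as_latin1),
                    ("cp1252", as_cp1252)):
    print(f"{label:>10}: {hexbytes(data)}")
```

Under those assumptions the utf-8 and iso-8859-1 lines match the bytes seen 
in the two test files (c3 83 c5 b8 and c3 3f), and the cp1252 line comes 
out as c3 9f, which is the correct utf-8 for 'ß'.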

Check out:
http://planet.openstreetmap.org/daily/test-cp1252.osc

Unless I'm mistaken that's the required output!  Thanks for your 
assistance on this, hugely appreciated.  You've become the "go to" man 
for utf-8 bug solving ;-)  For now I'll make a local change to the copy 
of osmosis on dev so that it writes in Cp1252.  Over the next few days 
I'll do a proper fix that makes the encoding optional on the command line.

TomH, do you know if MySQL uses cp1252 exactly or are there some subtle 
differences we should be aware of?

Cheers,
Brett




