[OSM-dev] osmosis utf-8

Thu Nov 8 01:16:07 GMT 2007

Martijn van Oosterhout wrote:
> On Nov 7, 2007 12:56 PM, Brett Henderson <brett at bretth.com> wrote:
>   
>> I did a quick test as you said changing line 111 of BaseXmlWriter from
>> using a "UTF-8" encoding to "ISO-8859-1" encoding.
>>
>> It results in the tag being written as:
>> <tag k="name" v="Ilmenauer Stra�?e"/>
>>     
>
> Ok, just checking: it's still connecting to the DB with UTF-8 right,
> so it's not mysql putting in the question marks? (you see, I expect
> Java to throw an exception if the encoding fails).
>
> With the DB connection in UTF-8 can you show the output of
> getBytes("UTF-8") of the value of that tag, because it's now totally
> unclear to me what you're actually getting out of the DB.
>
> Have a nice day,
>   
Good point, I was connecting with default connection settings which were 
probably latin1.  However, I just ran it again with the db connection 
set to utf-8 and writing the xml as ISO-8859-1 with the same result.

Is there a simple tcp proxy/tunnel application I can use to log 
connection data to file?  It might be more useful than me guessing at 
what is going on between osmosis and the database.

I've just created test-utf8.osc and test-iso-8859-1.osc in the 
http://planet.openstreetmap.org/daily directory.
Both were generated with a utf-8 database connection.  The output file 
encoding is changed as indicated by the file name.

The utf8 file should contain identical content to a call to 
getBytes("UTF-8") because I'm just writing strings directly to the file 
with utf-8 encoding.  I am escaping special XML characters as per the 
following list but that is all.
'<', "&lt;"
'>', "&gt;"
'"', "&quot;"
'\'', "&apos;"
'&', "&amp;"
'\n', "&#xA;"
'\r', "&#xD;"
'\t', "&#x9;"

As for Java exceptions, I don't believe it will throw an exception if 
the encoding fails.  It's fairly common behaviour for encoding libraries 
to use a ? character where a character can't be represented in the 
target encoding.  I know .NET does this and it appears that Java does as 
well.
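
A small sketch of both behaviours, assuming a current JDK (the class and 
method names are invented for the example).  String.getBytes silently 
substitutes the charset's replacement byte, while a CharsetEncoder set to 
REPORT throws instead, which is the behaviour Martijn was expecting.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class EncodingFailureSketch {
    // Lenient path: unmappable characters become '?' without any error.
    static byte[] lenient(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Strict path: an unmappable character raises CharacterCodingException.
    static byte[] strict(String s) throws CharacterCodingException {
        CharsetEncoder enc = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer buf = enc.encode(CharBuffer.wrap(s));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) {
        String s = "\u0178"; // Y with diaeresis, outside ISO-8859-1
        System.out.println((char) lenient(s)[0]);
        try {
            strict(s);
            System.out.println("encoded without error");
        } catch (CharacterCodingException e) {
            System.out.println("strict encoder refused: " + e);
        }
    }
}
```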

There's one thing I don't understand.  When MySQL returns latin1 data no 
data is lost, but when Java converts to ISO-8859-1 data is lost.  
Perhaps the mysql latin1 allows more characters to be represented.  It 
seems like I need a special MySQL latin1 equivalent encoding in Java so 
that I can remove a layer of utf8 encoding without losing information.  
Without sniffing the network connection though it's hard to tell what's 
going on.
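
One way to test that guess without a network sniffer is sketched below.  
It rests on two assumptions that would need checking: that the stored 
values are UTF-8 bytes which were read as a single-byte charset (double 
encoding), and that MySQL's latin1 behaves like Java's "windows-1252" 
rather than ISO-8859-1.  The class and method names are made up for the 
example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DoubleEncodingSketch {
    // Assumption: this is the closest Java equivalent of MySQL latin1.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Simulates the suspected damage: UTF-8 bytes misread as cp1252.
    static String doubleEncode(String original) {
        return new String(original.getBytes(StandardCharsets.UTF_8), CP1252);
    }

    // Removes one layer: encode with a single-byte charset, then
    // interpret the resulting bytes as UTF-8.
    static String removeLayer(String mangled, Charset singleByte) {
        return new String(mangled.getBytes(singleByte), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String mangled = doubleEncode("Stra\u00DFe");
        // The second UTF-8 byte of the sharp s (0x9F) maps to U+0178 in
        // cp1252, which ISO-8859-1 cannot represent.
        System.out.println(removeLayer(mangled, CP1252));
        System.out.println(removeLayer(mangled, StandardCharsets.ISO_8859_1));
    }
}
```

If the assumptions hold, the windows-1252 round trip gives back the 
original street name, while the ISO-8859-1 one loses the 0x9F byte to a 
? substitution.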

Cheers,
Brett