[OSM-dev] UTF8 problem with last night's daily .osc

Sun Aug 31 04:15:20 BST 2008

Grant Slater wrote:
> Karl Newman wrote:
>   
>> If I recall correctly, the database column is not actually set for 
>> UTF-8 (but is double-encoded to return actual UTF-8 to the client...). 
>> Wouldn't it be a better long-term fix to change the database to UTF-8 
>> (or whatever), then presumably MySql wouldn't allow invalid sequences 
>> to be stored? Still would be a good idea to raise an error if the 
>> length was too long, though.
>>
>>     
>
> Yes, this is the case.
>
> It's on the list of planned fixes for the 0.6 API. More on the wiki soon.
>
> / Grant
>   
I'm repeating a lot of what has already been said but here goes.

To "correctly" parse an xml file you have to first decode utf-8, *then* 
look for xml delimiters such as quotes, etc.  At least that's the way a 
standard xml parser works.  It might be possible to use a regex parser 
in an ascii mode to work around incorrectly encoded utf-8 data, but this 
is just avoiding the real issue.  It shouldn't be that hard to ensure 
that correctly encoded data is stored in the database.
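
To illustrate the "decode first, then parse" order, here is a minimal 
Python sketch.  It is not taken from osmosis or any OSM tool; the 
function name and the sample data are made up for the example.

```python
import xml.etree.ElementTree as ET


def parse_strict(raw: bytes) -> ET.Element:
    """Decode UTF-8 first, then parse the XML (hypothetical helper)."""
    # Strict decoding raises UnicodeDecodeError on any invalid byte
    # sequence, before a single quote or angle bracket is interpreted.
    text = raw.decode("utf-8")
    return ET.fromstring(text)


# "Z\u00fcrich" is 'Zurich' with a u-umlaut, correctly encoded as UTF-8.
good = '<node id="1"><tag k="name" v="Z\u00fcrich"/></node>'.encode("utf-8")
# Same content, but the umlaut is a lone 0xFC byte (Latin-1), which is
# not a valid UTF-8 sequence.
bad = b'<node id="1"><tag k="name" v="Z\xfcrich"/></node>'

print(parse_strict(good).find("tag").get("v"))

try:
    parse_strict(bad)
except UnicodeDecodeError as err:
    print("rejected before XML parsing:", err.reason)
```

The point is only that the bad file fails at the decoding step, so the 
parser never gets a chance to tell you which tag was at fault.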

Which brings up the real issue.  The database is using the wrong encoding.

The production osmosis changeset extraction code has a hack enabled 
which reads the data from the database in the doubly encoded form but 
then fixes it while writing out the xml file.  But the *fix* is a bit 
ugly and uses a custom character set encoding.  It works if the data in 
the database is encoded as expected.  If other apps write data 
incorrectly to the database then osmosis will spit out garbage in the 
changeset files.  Adding invalid utf-8 detection in osmosis is 
difficult because the fix gets applied within a file encoding stream 
which occurs after 
the point where the xml content is produced.  Even if invalid utf-8 
detection code were added, it would only know that the entire xml file 
was corrupt and would have no way of discarding a problematic tag or 
whatever introduced the problem in the first place.
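
A rough Python sketch of the idea, for anyone who hasn't seen double 
encoding before.  This is not the osmosis code: I'm assuming here that 
"doubly encoded" means utf-8 bytes that were stored as if each byte 
were one Latin-1 character, and plain "latin-1" stands in for the 
custom character set.  All function names are hypothetical.

```python
def double_encode(text: str) -> str:
    """Simulate what ends up in the database (assumed behaviour)."""
    # The UTF-8 bytes are misread as Latin-1, one character per byte.
    return text.encode("utf-8").decode("latin-1")


def write_with_hack(stored: str) -> bytes:
    """Emit each stored character as a single byte, undoing double_encode.

    Raises UnicodeEncodeError for characters above U+00FF.
    """
    return stored.encode("latin-1")


def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Data written as expected: the hack restores valid UTF-8.
stored_ok = double_encode("Z\u00fcrich")
print(is_valid_utf8(write_with_hack(stored_ok)))

# Data written by an app that did NOT double-encode: the hack turns the
# u-umlaut into a lone 0xFC byte, and the output file is invalid UTF-8.
stored_bad = "Z\u00fcrich"
print(is_valid_utf8(write_with_hack(stored_bad)))
```

Because the conversion happens on the byte stream for the whole file, 
the second case only shows up as "this file is broken", with no link 
back to the tag that caused it.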

If the production api and/or potlatch are introducing corrupted data 
into the database they should be fixed (I can't help here).  Now that 
corrupted data exists in the database it should also be fixed (I'll take 
a look at this now).  If we correct the database encoding issues then I 
can disable the encoding hack in osmosis and it will always emit valid 
utf-8, thereby fixing the invalid changeset files.  What I'm not keen on 
doing is trying to detect invalid utf-8 data and correct it somehow in 
osmosis.  I'm happy to accept a patch if somebody can do this, however ...
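
For what it's worth, a patch along those lines would probably have to 
check each tag before the xml is produced, rather than in the output 
stream.  A minimal Python sketch of that check, under the same 
assumption as before (stored strings are utf-8 bytes misread as 
Latin-1); the helper names are hypothetical and none of this exists in 
osmosis:

```python
def repair_value(stored):
    """Return the repaired string, or None if it is not double-encoded UTF-8."""
    try:
        return stored.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None


def repair_tags(tags):
    """Split a {key: value} dict into repaired tags and rejected originals."""
    repaired, rejected = {}, {}
    for key, value in tags.items():
        k, v = repair_value(key), repair_value(value)
        if k is None or v is None:
            # Keep the original pair so it can be logged or fixed by hand.
            rejected[key] = value
        else:
            repaired[k] = v
    return repaired, rejected


tags = {
    "name": "Z\u00c3\u00bcrich",  # double-encoded as expected
    "note": "caf\u00e9",          # written directly, not double-encoded
}
repaired, rejected = repair_tags(tags)
print(repaired)
print(rejected)
```

This only catches values whose bytes are not valid utf-8.  A string 
that was written directly but happens to look like valid double-encoded 
data would be "repaired" into something wrong, so it is a detection 
aid and not a substitute for fixing the database encoding.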