[OSM-dev] UTF8 problem with last night's daily .osc
brett at bretth.com
Sun Aug 31 04:15:20 BST 2008
Grant Slater wrote:
> Karl Newman wrote:
>> If I recall correctly, the database column is not actually set for
>> UTF-8 (but is double-encoded to return actual UTF-8 to the client...).
>> Wouldn't it be a better long-term fix to change the database to UTF-8
>> (or whatever), then presumably MySql wouldn't allow invalid sequences
>> to be stored? Still would be a good idea to raise an error if the
>> length was too long, though.
> Yes, this is the case.
> It's on the list of planned fixes for the 0.6 API. More on the wiki soon.
> / Grant
I'm repeating a lot of what has already been said but here goes.
To "correctly" parse an xml file you have to first decode utf-8, *then*
look for xml delimiters such as quotes, etc. At least that's the way a
standard xml parser works. It might be possible to use a regex parser
in an ascii mode to work around incorrectly encoded utf-8 data but this
is just avoiding the real issue. It shouldn't be that hard to ensure
correctly encoded data is stored in the database.
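As a rough illustration of the "decode first, then parse" order described above, here is a minimal Python sketch. It is not osmosis code (osmosis is Java); the function name parse_osc_bytes and the sample element names are made up for the example. The only point is that a strict UTF-8 decode rejects bad bytes before any XML delimiter is examined.

```python
import xml.etree.ElementTree as ET


def parse_osc_bytes(raw: bytes) -> ET.Element:
    """Hypothetical helper: decode strictly as UTF-8, then parse as XML."""
    # Step 1: strict decode. Any invalid UTF-8 sequence raises
    # UnicodeDecodeError here, before XML parsing starts.
    text = raw.decode("utf-8", errors="strict")
    # Step 2: only now are quotes, angle brackets, etc. interpreted.
    return ET.fromstring(text)
```

Feeding it a byte string that contains a bare Latin-1 byte (for example 0xFC in place of a properly encoded "ü") fails at step 1, which is roughly what a standard XML parser does with a corrupt .osc file.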
Which brings up the real issue. The database is using the wrong encoding.
The production osmosis changeset extraction code has a hack enabled
which reads the data from the database in the doubly encoded form but
then fixes it while writing out the xml file. But the *fix* is a bit
ugly and uses a custom character set encoding. It works if the data in
the database is encoded as expected. If other apps write data
incorrectly to the database then I will spit garbage out in changeset
files. Adding invalid utf-8 detection in osmosis is difficult because
the fix gets applied within a file encoding stream which occurs after
the point where the xml content is produced. Even if invalid utf-8
detection code were added, it would only know that the entire xml file
was corrupt and would have no way of discarding a problematic tag or
whatever introduced the problem in the first place.
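To make the double encoding concrete, here is a minimal Python sketch of the idea. It is an assumption-laden illustration, not the osmosis implementation: it assumes the column behaves like ISO-8859-1 (MySQL's latin1 actually differs slightly), and the function names double_encode and undo_double_encoding are invented for the example.

```python
def double_encode(text: str) -> str:
    """Simulate the stored form: UTF-8 bytes reinterpreted as Latin-1 characters."""
    return text.encode("utf-8").decode("latin-1")


def undo_double_encoding(mangled: str) -> str:
    """Reverse the reinterpretation; only valid if the data was double-encoded."""
    # Map each character back to a single byte, then decode those bytes as UTF-8.
    # Raises UnicodeEncodeError if a character does not fit in one byte, or
    # UnicodeDecodeError if the resulting bytes are not valid UTF-8.
    return mangled.encode("latin-1").decode("utf-8")
```

If another application writes a plain "ü" (one byte, 0xFC) into the same column, undo_double_encoding raises an error in this sketch. A streaming charset encoder that does not validate its input would instead pass the bad byte through, which matches the garbage described above.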
If the production api and/or potlatch are introducing corrupted data
into the database they should be fixed (I can't help here). Now that
corrupted data exists in the database it should also be fixed (I'll take
a look at this now). If we correct the database encoding issues then I
can disable the encoding hack in osmosis and it will always emit valid
utf-8, therefore fixing the invalid changeset files. What I'm not keen on
doing is trying to detect invalid utf-8 data and correct it somehow in
osmosis. I'm happy to accept a patch if somebody can do this however ...