[OSM-dev] UTF-8 errors in our DB, or elsewhere?

Frederik Ramm frederik at remote.org
Tue Feb 12 02:11:05 GMT 2008


Hi,

   in the course of producing shapefiles, I applied the libxml2 built-in
character set conversion from UTF-8 to Latin-1 to our tag values, and
found a lot of problems (about 20k nodes/ways) where it complained.

I cross-checked some of these by downloading the data directly from
the API (the data I used for converting has been through Osmosis a
number of times and I wanted to make sure it isn't an Osmosis bug),
then running iconv on it. Some went through ok, but many really do
seem to be wrong.
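
A cross-check like the one above could be sketched roughly as follows.
This is only an illustration in Python, not what I actually ran; the
function name classify is made up. It separates two cases that a
UTF-8 to Latin-1 conversion can lump together: bytes that are not
valid UTF-8 at all, and valid UTF-8 that simply has no Latin-1
equivalent (Cyrillic, for instance).

```python
def classify(raw: bytes) -> str:
    """Classify a raw tag value (hypothetical helper, for illustration).

    Returns "invalid-utf8" if the bytes are not well-formed UTF-8,
    "not-latin1" if they decode fine but contain characters outside
    Latin-1, and "ok" if the conversion would succeed.
    """
    try:
        # Strict decoding rejects truncated or malformed sequences.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "invalid-utf8"
    try:
        # Latin-1 only covers code points U+0000..U+00FF.
        text.encode("latin-1")
    except UnicodeEncodeError:
        return "not-latin1"
    return "ok"


if __name__ == "__main__":
    samples = [b"caf\xc3\xa9", b"caf\xe9", "Москва".encode("utf-8")]
    for sample in samples:
        print(sample, classify(sample))
```

Only the "invalid-utf8" case would point at a real problem in the
data; "not-latin1" is just a limit of the target character set.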

Can anybody tell me something about libxml2 character set conversion -
is it considered buggy?

And about the UTF-8 bugs in the database: are they real? For example:

http://www.openstreetmap.org/api/0.5/way/8138279

doesn't seem proper UTF-8 to me but maybe I'm wrong. If it really
isn't proper UTF-8, then why don't more of our tools choke on it? Is
everybody "sanitizing" in one form or the other? Should we generate
some statistics about UTF-8 problems and try to purge them?
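
For the statistics, a first rough pass might look like the sketch
below. It is an assumption-laden illustration: the name utf8_stats and
the file name in the usage comment are made up, and it checks raw
lines rather than parsing XML, since an XML parser may refuse input
that is not valid UTF-8. Splitting on newlines is safe here because
the byte 0x0A never occurs inside a multi-byte UTF-8 sequence.

```python
from typing import Iterable, List, Tuple


def utf8_stats(lines: Iterable[bytes]) -> Tuple[int, List[int]]:
    """Count lines and collect the 1-based numbers of lines that are
    not well-formed UTF-8 (hypothetical helper, for illustration)."""
    total = 0
    bad: List[int] = []
    for number, raw in enumerate(lines, start=1):
        total += 1
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(number)
    return total, bad


# Possible usage on a local extract (file name is a placeholder):
#
#     with open("extract.osm", "rb") as handle:
#         total, bad = utf8_stats(handle)
#     print(f"{len(bad)} of {total} lines are not valid UTF-8")
```

The reported line numbers would still have to be mapped back to node
and way IDs before anything could be purged.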

Here's a list of objects that libxml2 complained about (not complete
as I didn't process a full planet):

http://www.remote.org/frederik/tmp/utf8.txt

Bye
Frederik

-- 
Frederik Ramm  ##  eMail frederik at remote.org  ##  N49°00.09' E008°23.33'




