[OSM-dev] Mixed character encoding in planet.osm - plan for fixing it
Ralf Zimmermann
Ralf at Zimmermann.com
Sat Nov 4 11:28:33 GMT 2006
Over the last few weeks, several people have found that the character
encoding in the planet.osm files is not fully valid UTF-8.
I would like to clean up this mess.
Let's start with some thoughts on the data storage, input and output.
Output:
In the first line of the planet.osm file, it states that the character
encoding should be UTF-8:
<?xml version="1.0" encoding="UTF-8"?>
Most of the characters in planet.osm are encoded in UTF-8, but some are
encoded differently.
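To see the problem for yourself, you can scan the dump byte by byte. Here is a minimal sketch (the filename argument is whatever local copy of planet.osm you have) that reports every line whose bytes do not decode as UTF-8:

```python
# Sketch: report lines in a planet.osm dump that are not valid UTF-8.
# Reads the file in binary mode so no decoding happens implicitly.
import sys

def find_invalid_utf8(path):
    bad = []
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as e:
                bad.append((lineno, e.reason))
    return bad

if __name__ == "__main__" and len(sys.argv) > 1:
    for lineno, reason in find_invalid_utf8(sys.argv[1]):
        print(lineno, reason)
```

Line-by-line checking is coarse (a line can contain several tags), but it is enough to confirm whether a given dump is affected at all.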
Is this an issue with the script that produces the dump, or is the wrong
encoding already in the database, with the script just passing it
through?
To me, it sounds like the latter - the wrong encoding is already in the
database.
(Someone please correct me if this assumption is wrong.)
Looking at http://svn.openstreetmap.org/sql/mysql-schema.sql, I see that
the tables current_nodes and current_way_tags are defined as
CHARSET=utf8. To me, that means that the database is storing all name
tags in UTF-8.
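Note that CHARSET=utf8 only declares how the column is supposed to be interpreted; it does not guarantee that clients actually sent UTF-8. A common failure mode (an assumption about what happened here, not a confirmed diagnosis) is a mismatch between the client's real encoding and the connection character set. The round trips can be illustrated without a database:

```python
# Sketch of two ways mis-encoded text ends up in a CHARSET=utf8 column.

# Case 1: client sends UTF-8 but the connection is declared latin1.
# The server "converts" the bytes again, producing double-encoded mojibake
# that is still valid UTF-8 (so it would NOT show up as a decode error):
original = "ü"                               # U+00FC
utf8_bytes = original.encode("utf-8")        # b'\xc3\xbc'
misread = utf8_bytes.decode("latin-1")       # 'Ã¼' - treated as two latin1 chars
stored = misread.encode("utf-8")             # b'\xc3\x83\xc2\xbc' in the table

# Case 2: raw latin1 bytes are written without any conversion.
# These are NOT valid UTF-8, which would explain the invalid
# sequences people are seeing in planet.osm:
raw_latin1 = original.encode("latin-1")      # b'\xfc'
try:
    raw_latin1.decode("utf-8")
except UnicodeDecodeError:
    print("raw latin1 b'\\xfc' is not valid UTF-8")
```

Since the planet files contain byte sequences that are outright invalid UTF-8, case 2 (unconverted bytes passed straight through) seems the more likely explanation.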
How did the wrong encoding get into the database? Here are my first
thoughts:
- JOSM
- Online applet on OSM web page
- other editors
I am working with JOSM every day. It seems to handle German umlauts very
well. I am not sure whether the same holds for characters from other
languages.
How can we validate that none of the normal input methods listed above
is the source of the encoding issue?
In parallel, I am thinking of cleaning up the database. First, I will
try to make a list of all entries with non-valid UTF-8 encoding, based
on the latest planet.osm.
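As a first pass at such a list, one could pull the tag key/value attributes out of the raw bytes and keep only the ones that fail to decode. A rough sketch (the attribute regex is a deliberate simplification of real XML parsing, good enough for the flat tag lines in the dump):

```python
# Sketch: collect tag attribute values from a planet.osm dump whose
# bytes are not valid UTF-8, as raw material for a cleanup list.
import re

# Matches k="..." and v="..." attributes on a raw byte line.
TAG_VALUE = re.compile(rb'[kv]="([^"]*)"')

def invalid_values(path):
    seen = []
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            for value in TAG_VALUE.findall(raw):
                try:
                    value.decode("utf-8")
                except UnicodeDecodeError:
                    seen.append((lineno, value))
    return seen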
Is anyone else working on cleaning up this issue? I don't want to
interfere.
Ralf
Munich/Germany