[OSM-dev] Mixed character encoding in planet.osm - plan for fixing it

Ralf Zimmermann Ralf at Zimmermann.com
Sat Nov 4 11:28:33 GMT 2006


Over the last few weeks, several people have noticed that the character 
encoding in the planet.osm files is not fully valid UTF-8.

I would like to clean up this mess.
Let's start with some thoughts on the data storage, input and output.

Output:
The first line of the planet.osm file declares the character 
encoding as UTF-8:
   <?xml version="1.0" encoding="UTF-8"?>
Most of the characters in planet.osm are encoded in UTF-8, but some are 
encoded differently.
Is this an issue with the script that produces the dump, or is the wrong 
encoding already in the database, with the script just passing it through?

To me, it sounds like the latter: the wrong encoding is already in 
the database.
(Someone please correct me if this assumption is wrong.)
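One way to check the dump side of the question is to try decoding the file
as strict UTF-8 and see where it fails. A minimal sketch (the file name and
function name are just examples, not part of any existing tool):

```python
# Sketch: report line numbers of a planet.osm dump that are not valid UTF-8.
def find_invalid_utf8_lines(path):
    bad = []
    with open(path, "rb") as f:          # read raw bytes, not decoded text
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8", "strict")
            except UnicodeDecodeError:
                bad.append(lineno)
    return bad
```

If this returns an empty list for a freshly produced dump but not for the
published planet.osm, the problem would be in the export script; if both
show the same bad lines, the bad bytes are already in the database.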

Looking at http://svn.openstreetmap.org/sql/mysql-schema.sql, I see that 
the tables current_nodes and current_way_tags are defined with 
CHARSET=utf8. To me, that means the database is supposed to store all 
name tags in UTF-8.
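Note that CHARSET=utf8 only declares how MySQL interprets the column; if a
client writes with a mismatched connection charset, wrong bytes can still end
up stored. The two classic failure modes can be demonstrated without any
database at all (a sketch, using "München" as an example string):

```python
# Failure mode 1: raw Latin-1 bytes stored as-is -- not valid UTF-8 at all.
latin1_bytes = "München".encode("latin-1")       # b'M\xfcnchen'
try:
    latin1_bytes.decode("utf-8")
    stored_as_valid_utf8 = True
except UnicodeDecodeError:
    stored_as_valid_utf8 = False                 # decoding fails

# Failure mode 2: double-encoded UTF-8 -- decodes fine, but is mojibake.
double = "München".encode("utf-8").decode("latin-1").encode("utf-8")
print(double.decode("utf-8"))                    # "MÃ¼nchen", not "München"
```

The second case is nastier: it is valid UTF-8 byte-wise, so a pure validity
check will not catch it, only a human (or a heuristic) looking at the text.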

How did the wrong encoding get into the database? Here are my first 
thoughts:
- JOSM
- Online applet on OSM web page
- other editors

I am working with JOSM every day. It seems to handle German umlauts very 
well, but I am not sure whether that holds for characters from other 
languages.

How can we validate that none of the normal input methods listed above 
is the source of the encoding issue?

In parallel, I am thinking of cleaning up the database. As a first step, 
I will try to make a list of all entries with invalid UTF-8 encoding, 
based on the latest planet.osm.
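For that list, a rough first pass could scan the dump line by line and record
the ids of elements whose bytes do not decode as UTF-8. A sketch, assuming the
element id appears as id='...' or id="..." on the same line as the bad bytes
(the function and regex are illustrative, not an existing script):

```python
import re

# Matches id='123' or id="123" in a raw byte line of the dump.
ID_RE = re.compile(br"""\bid=["'](\d+)["']""")

def list_invalid_entries(path):
    """Return ids of elements on lines that are not valid UTF-8
    (None when no id is found on the offending line)."""
    bad_ids = []
    with open(path, "rb") as f:
        for raw in f:
            try:
                raw.decode("utf-8", "strict")
            except UnicodeDecodeError:
                m = ID_RE.search(raw)
                bad_ids.append(int(m.group(1)) if m else None)
    return bad_ids
```

The resulting id list could then be cross-checked against the database before
touching any rows.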

Is anyone else working on cleaning up this issue? I don't want to 
interfere.

Ralf
Munich/Germany




