[OSM-dev] Mixed character encoding in planet.osm - plan for fixing it

Jonas Svensson jonass at lysator.liu.se
Sun Nov 5 10:23:39 GMT 2006

On 4 Nov 2006 at 12:28, Ralf Zimmermann wrote:

> Over the last weeks, several people have found out that the character 
> encoding in the planet.osm files is not fully valid UTF-8.
> I would like to clean up this mess.
> Let's start with some thoughts on the data storage, input and output.

I have done some testing with the new planet-061105.osm.bz2 and 
some other tools. It seems to me that the data in the database is 
now correct, but the data exported to the planet dump is broken. For 
example, node 2385021 is broken in the dump but correct when you 
download it in your browser through the API, or download it and 
look at it in JOSM. The broken name for that node in the dump is 
"Handelsh�jskole Syd" (the single byte F8, which would be correct 
if the encoding were ISO Latin-1). JOSM and others show 
"Handelshøjskole Syd" (bytes C3 B8), which is correct UTF-8. 
Please tell me if this analysis is faulty.
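
To illustrate the two byte sequences above, here is a minimal Python 
sketch. It is not taken from any OSM tool; the helper name 
is_valid_utf8 is made up for this example, and the byte strings are 
typed in by hand from the values quoted above.

```python
# Bytes as they appear in the planet dump: a lone F8 for the "ø".
broken = b"Handelsh\xf8jskole Syd"
# Bytes as returned by the API / shown by JOSM: C3 B8 for the "ø".
correct = b"Handelsh\xc3\xb8jskole Syd"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(broken))
print(is_valid_utf8(correct))
# Decoding the broken bytes as Latin-1 gives the same text as the
# correct bytes decoded as UTF-8.
print(broken.decode("latin-1") == correct.decode("utf-8"))
```

The last comparison is what suggests that the dump contains Latin-1 
bytes where UTF-8 was expected.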

My UTF8Sanitizer now reports 580 errors in planet-061105.osm.bz2 so 
I think you should look for the error in the export script creating 
the planet dump.
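
For anyone who wants to check a dump themselves, a rough sketch of 
such a check in Python could look like the following. This is not 
how UTF8Sanitizer works; the function names are invented for this 
example, and it counts lines that fail to decode, so its total need 
not match the error count given above.

```python
import bz2


def count_bad_lines(lines) -> int:
    """Count byte lines that are not valid UTF-8 (hypothetical helper)."""
    bad = 0
    for line in lines:
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad += 1
    return bad


def count_bad_lines_in_dump(path: str) -> int:
    """Stream a bz2-compressed dump and count undecodable lines."""
    # Splitting on newlines is safe here: the byte 0x0A never occurs
    # inside a multi-byte UTF-8 sequence.
    with bz2.open(path, "rb") as f:
        return count_bad_lines(f)


sample = [
    b"<tag k='name' v='ok'/>\n",
    b"<tag k='name' v='Handelsh\xf8jskole Syd'/>\n",      # Latin-1 byte
    b"<tag k='name' v='Handelsh\xc3\xb8jskole Syd'/>\n",  # valid UTF-8
]
print(count_bad_lines(sample))
```

Usage on a real file would be something like 
count_bad_lines_in_dump("planet-061105.osm.bz2"), assuming the file 
is in the current directory.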


