[OSM-dev] Mixed character encoding in planet.osm - plan for fixing it
jonass at lysator.liu.se
Sun Nov 5 10:23:39 GMT 2006
On 4 Nov 2006 at 12:28, Ralf Zimmermann wrote:
> Over the last weeks, several people have found out that the character
> encoding in the planet.osm files is not fully valid UTF-8.
> I would like to clean up this mess.
> Let's start with some thoughts on the data storage, input and output.
I have done some testing with the new planet-061105.osm.bz2 and
some other tools. It seems to me that the data in the database is now
correct but the data exported to the planet dump is broken. For
example, node 2385021 is broken in the dump but correct if you
download it in your browser using the API, or download and view it
in JOSM. The broken name for that node in the dump is
"Handelsh�jskole Syd" (byte F8, which would be correct if the
encoding were ISO Latin-1); JOSM and the others show
"Handelshøjskole Syd" (bytes C3 B8), which is correct UTF-8.
Please tell me if this analysis is faulty.
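To make the byte-level difference concrete, here is a small Python sketch. It only illustrates the two encodings of "ø" discussed above; the helper name is_valid_utf8 is made up for this example and is not part of any OSM tool.

```python
# The node name, with "ø" written as an escape so the example does not
# depend on the encoding of this file.
name = "Handelsh\u00f8jskole Syd"

utf8_bytes = name.encode("utf-8")      # "ø" becomes the two bytes C3 B8
latin1_bytes = name.encode("latin-1")  # "ø" becomes the single byte F8


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(utf8_bytes.hex(" "))
print(latin1_bytes.hex(" "))
print(is_valid_utf8(utf8_bytes), is_valid_utf8(latin1_bytes))
```

A lone F8 byte can never appear in valid UTF-8, which is why a Latin-1 encoded name in the dump shows up as an error.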
My UTF8Sanitizer now reports 580 errors in planet-061105.osm.bz2, so
I think you should look for the error in the export script that
creates the planet dump.
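For anyone who wants to cross-check a dump without UTF8Sanitizer, a rough Python sketch follows. This is not how UTF8Sanitizer works; the function names are hypothetical, and it counts lines containing invalid UTF-8 rather than individual errors, so the number it gives need not match the 580 above.

```python
import bz2
from typing import Iterable, Iterator, Tuple


def find_invalid_utf8_lines(lines: Iterable[bytes]) -> Iterator[Tuple[int, bytes]]:
    """Yield (line number, raw bytes) for each line that is not valid UTF-8."""
    for lineno, raw in enumerate(lines, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            yield lineno, raw


def scan_dump(path: str) -> int:
    """Count the lines in a bz2-compressed dump that fail to decode as UTF-8."""
    # bz2.open in binary mode yields the decompressed file line by line,
    # so the whole planet file never has to fit in memory.
    with bz2.open(path, "rb") as dump:
        return sum(1 for _ in find_invalid_utf8_lines(dump))
```

Calling scan_dump("planet-061105.osm.bz2") would give the count for this dump; printing the line numbers from find_invalid_utf8_lines makes it easy to compare the affected nodes against what the API returns.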