[OSM-dev] Mixed character encoding in planet.osm - plan for fixing it

raphael Jacquot sxpert at sxpert.org
Wed Nov 8 10:29:44 GMT 2006

Jonas Svensson wrote:
> On Wed, 8 Nov 2006, raphael Jacquot wrote:
>> Ralf Zimmermann wrote:
>>> How did the wrong encoding get into the database? Here are my first
>>> thoughts:
>>> - JOSM
>>> - Online applet on OSM web page
>>> - other editors
>> I'd blame MySQL first, as Postgres complains loudly when you try to
>> insert something that's not valid UTF-8 into a UTF-8 database
> Pardon me for not understanding. Can you please explain why you say
> there are errors in the database? I have not checked each and every tag,
> but the ones I tested this Sunday were correct UTF-8 when retrieved by
> the API <http://wiki.openstreetmap.org/index.php/REST> but faulty when
> extracted from the database dump file
> <http://planet.openstreetmap.org/planet-061105.osm.bz2>. Is the
> webserver converting characters from Latin-1 to UTF-8?
> /Jonas

Well... if the planet dump contains mostly correct UTF-8, except for
a few hundred entries (as the utf8sanitizer shows), then the script
generating the file is probably working correctly and some entries in
the database must be wrong. It's as simple as that.
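
To make the "mostly correct, a few hundred bad entries" check concrete, here is a minimal Python sketch that counts lines which are not valid UTF-8. This is not how utf8sanitizer works internally (I have not looked); the function name and the sample data are made up for illustration.

```python
def count_invalid_utf8_lines(lines):
    """Count byte strings in `lines` that do not decode as UTF-8."""
    bad = 0
    for raw in lines:
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            bad += 1
    return bad


# Hypothetical sample: one ASCII line, one correct UTF-8 line,
# and one line that was written as Latin-1 instead of UTF-8.
sample = [
    b'<tag k="highway" v="secondary" />\n',
    'v="Gamla Ledbergsv\u00e4gen"'.encode("utf-8"),
    'v="Gamla Ledbergsv\u00e4gen"'.encode("latin-1"),
]
print(count_invalid_utf8_lines(sample))
```

Any iterable of byte strings works, so a dump opened in binary mode (for example with `bz2.open(path, "rb")`) could be passed in directly. Note that this only catches byte sequences that are invalid as UTF-8; text that was encoded twice is still valid UTF-8 and would pass.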

These, for instance, are good examples of such entries:

   <node id="100478" lat="58.4189071655273" lon="15.5100708007812" 
     <tag k="name" v="Kärnabrunns gatan" />
     <tag k="highway" v="secondary" />
   <node id="100479" lat="58.4187088012695" lon="15.5110473632812" 
     <tag k="name" v="Gamla Ledbergsvägen" />
     <tag k="highway" v="secondary" />
   <node id="100480" lat="58.4186553955078" lon="15.5122394561768" 
     <tag k="name" v="Gamla Ledbergsvägen" />
     <tag k="highway" v="secondary" />
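
A pattern like "Ã¤" is what typically appears when the UTF-8 bytes of "ä" are read as Latin-1 and then encoded as UTF-8 a second time. Assuming that is what happened to the values above (an assumption, I have not verified it against the database), a rough Python sketch for undoing one such round could look like this. The function name is hypothetical.

```python
def repair_double_encoding(text):
    """Undo one UTF-8 -> Latin-1 -> UTF-8 round trip, if the text has one.

    Returns the text unchanged when it cannot be reinterpreted that way,
    which is the case for most correctly encoded non-ASCII strings.
    """
    try:
        # Map each character back to the single byte it was misread from,
        # then decode those bytes as the UTF-8 they originally were.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# "\u00c3\u00a4" is the two-character sequence shown in the names above.
print(repair_double_encoding("Gamla Ledbergsv\u00c3\u00a4gen"))
```

Plain ASCII values such as "secondary" pass through unchanged, and a correctly encoded "ä" fails the reinterpretation and is returned as-is. It is a heuristic, though: a legitimate string that happens to look like misread UTF-8 would be altered, so results should be reviewed before writing anything back.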

