[OSM-dev] Mixed character encoding in planet.osm - plan for fixing it

raphael Jacquot sxpert at sxpert.org
Wed Nov 8 09:57:37 GMT 2006


Erik Johansson wrote:
> On 11/5/06, Jonas Svensson <jonass at lysator.liu.se> wrote:
>> On 4 Nov 2006 at 12:28, Ralf Zimmermann wrote:
>>
>>> Over the last weeks, several people have found out that the character
>>> encoding in the planet.osm files is not fully valid UTF-8.
>>>
>>> I would like to clean up this mess.
>>> Let's start with some thoughts on the data storage, input and output.
>> I have done some testing with the new planet-061105.osm.bz2 and
>> some other tools. To me it seems like the data in the database now
>> is correct but the data exported to the planet dump is broken. For
>> example if I look at node 2385021, it is broken in the dump but is
>> correct if you download it in your browser using the api or
>> download and look at it using JOSM. The broken name for that node
>> in the dump is "Handelsh�jskole Syd" (character code F8, would be
>> correct if the encoding had been ISO Latin-1); JOSM and others say
>> "Handelshøjskole Syd" (character code C3 B8), which is correct
>> UTF-8. Please tell me if this analysis is faulty.
> 
> Yes API download gives UTF-8 and planet dump gives latin1
> 

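For anyone who wants to see the two byte sequences discussed above side by side, here is a minimal Python sketch. Python is only my choice for illustration, and the helper name is_valid_utf8 is made up. It encodes the node name both ways and shows that a strict UTF-8 decoder rejects the Latin-1 form.

```python
name = "Handelshøjskole Syd"

latin1_bytes = name.encode("latin-1")  # ø becomes the single byte F8
utf8_bytes = name.encode("utf-8")      # ø becomes the two bytes C3 B8

def hexdump(data: bytes) -> str:
    """Format bytes as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in data)

def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(hexdump(latin1_bytes), is_valid_utf8(latin1_bytes))
print(hexdump(utf8_bytes), is_valid_utf8(utf8_bytes))
```

A lone F8 byte can never start a valid UTF-8 sequence, which is why the name breaks in the dump.
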
This is not the whole story.
At the beginning of planet.osm, there are valid UTF-8 tags:

2136260   :   0   0   "   /   >  nl  sp  sp   <   n   o   d   e  sp   i
         303a 2230 3e2f 200a 3c20 6f6e 6564 6920
2136300   d   =   "   1   0   0   4   7   8   "  sp   l   a   t   =   "
         3d64 3122 3030 3734 2238 6c20 7461 223d
2136320   5   8   .   4   1   8   9   0   7   1   6   5   5   2   7   3
         3835 342e 3831 3039 3137 3536 3235 3337
2136340   "  sp   l   o   n   =   "   1   5   .   5   1   0   0   7   0
         2022 6f6c 3d6e 3122 2e35 3135 3030 3037
2136360   8   0   0   7   8   1   2   "  sp   t   i   m   e   s   t   a
         3038 3730 3138 2232 7420 6d69 7365 6174
2136400   m   p   =   "   2   0   0   6   -   0   8   -   1   9   T   1
         706d 223d 3032 3630 302d 2d38 3931 3154
2136420   1   :   1   0   :   0   7   +   0   1   :   0   0   "   >  nl
         3a31 3031 303a 2b37 3130 303a 2230 0a3e
2136440  sp  sp  sp  sp   <   t   a   g  sp   k   =   "   n   a   m   e
         2020 2020 743c 6761 6b20 223d 616e 656d
2136460   "  sp   v   =   "   K   C   $   r   n   a   b   r   u   n   n
         2022 3d76 4b22 a4c3 6e72 6261 7572 6e6e
2136500   s  sp   g   a   t   a   n   "  sp   /   >  nl  sp  sp  sp  sp
         2073 6167 6174 226e 2f20 0a3e 2020 2020
2136520   <   t   a   g  sp   k   =   "   h   i   g   h   w   a   y   "
         743c 6761 6b20 223d 6968 6867 6177 2279
2136540  sp   v   =   "   s   e   c   o   n   d   a   r   y   "  sp   /
         7620 223d 6573 6f63 646e 7261 2279 2f20
2136560   >  nl  sp  sp   <   /   n   o   d   e   >
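
A note on reading the dump, with a small sketch. I am assuming od-style output here: the character rows drop the high bit of each byte, and the hex rows are 16-bit words printed on a little-endian machine, so each pair of bytes appears swapped. Under those assumptions, the Python below rebuilds the bytes of the row at offset 2136460 and checks that they decode as UTF-8.

```python
# Hex words copied from the row at offset 2136460 in the dump above.
words = "2022 3d76 4b22 a4c3 6e72 6261 7572 6e6e".split()

# Assumption: little-endian 16-bit words, so "a4c3" is the byte pair C3 A4.
raw = b"".join(int(w, 16).to_bytes(2, "little") for w in words)

# Strict decode: raises UnicodeDecodeError if the bytes are not valid UTF-8.
text = raw.decode("utf-8")
print(text)

# Clearing the high bit of every byte reproduces the character row of the dump.
stripped = bytes(b & 0x7F for b in raw).decode("ascii")
print(stripped)
```

The "C $" in the character row is therefore C3 A4, the UTF-8 encoding of "ä", so this tag really is valid UTF-8.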

Then, for some reason, there are some non-UTF-8 characters in there.
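
To locate where the bad bytes start, something along these lines might do. It is only a sketch: find_invalid_utf8 is a made-up helper, and the demo runs on two in-memory lines rather than the real file. It reads the input line by line in binary mode, which is safe because the newline byte never occurs inside a UTF-8 multi-byte sequence, and reports the byte offset of each line that fails a strict decode.

```python
import io

def find_invalid_utf8(stream, limit=10):
    """Yield (byte_offset, line_number, raw_line) for lines that are not valid UTF-8.

    stream must be a binary file-like object; at most `limit` hits are reported.
    """
    offset = 0
    found = 0
    for line_number, line in enumerate(stream, start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start is the index of the offending byte within this line.
            yield offset + err.start, line_number, line
            found += 1
            if found >= limit:
                return
        offset += len(line)

# Demo on synthetic data: one valid UTF-8 line, then one Latin-1 line.
good = '<tag k="name" v="Kärnabrunnsgatan" />\n'.encode("utf-8")
bad = '<tag k="name" v="Handelshøjskole Syd" />\n'.encode("latin-1")
sample = io.BytesIO(good + bad)

hits = list(find_invalid_utf8(sample))
for byte_offset, line_number, raw_line in hits:
    print(byte_offset, line_number, raw_line)
```

For the real dump, the stream could come from bz2.open("planet-061105.osm.bz2", "rb") in the standard library, which also iterates line by line.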