[OSM-dev] Mixed character encoding in planet.osm - plan for fixing it
raphael Jacquot
sxpert at sxpert.org
Wed Nov 8 09:57:37 GMT 2006
Erik Johansson wrote:
> On 11/5/06, Jonas Svensson <jonass at lysator.liu.se> wrote:
>> On 4 Nov 2006 at 12:28, Ralf Zimmermann wrote:
>>
>>> Over the last weeks, several people have found out that the character
>>> encoding in the planet.osm files is not fully valid UTF-8.
>>>
>>> I would like to clean up this mess.
>>> Let's start with some thoughts on the data storage, input and output.
>> I have done some testing with the new planet-061105.osm.bz2 and
>> some other tools. To me it seems like the data in the database now
>> is correct but the data exported to the planet dump is broken. For
>> example if I look at node 2385021, it is broken in the dump but is
>> correct if you download it in your browser using the api or
>> download and look at it using JOSM. The broken name for that node
>> in the dump is "Handelsh�jskole Syd" (character code F8, would be
>> correct if the encoding had been ISO Latin-1), JOSM and others say
>> "Handelshøjskole Syd" (character code C3 B8), which is correct
>> UTF-8. Please tell me if this analysis is faulty.
>
> Yes, the API download gives UTF-8 and the planet dump gives Latin-1.
>
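That is not the whole story.
At the beginning of planet.osm, there are valid UTF-8 tags. In the dump
below, the 16-bit hex words are byte-swapped, so a4c3 is the byte pair
C3 A4, the valid UTF-8 encoding of the "ä" in "Kärnabrunnsgatan":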
2136260 : 0 0 " / > nl sp sp < n o d e sp i
303a 2230 3e2f 200a 3c20 6f6e 6564 6920
2136300 d = " 1 0 0 4 7 8 " sp l a t = "
3d64 3122 3030 3734 2238 6c20 7461 223d
2136320 5 8 . 4 1 8 9 0 7 1 6 5 5 2 7 3
3835 342e 3831 3039 3137 3536 3235 3337
2136340 " sp l o n = " 1 5 . 5 1 0 0 7 0
2022 6f6c 3d6e 3122 2e35 3135 3030 3037
2136360 8 0 0 7 8 1 2 " sp t i m e s t a
3038 3730 3138 2232 7420 6d69 7365 6174
2136400 m p = " 2 0 0 6 - 0 8 - 1 9 T 1
706d 223d 3032 3630 302d 2d38 3931 3154
2136420 1 : 1 0 : 0 7 + 0 1 : 0 0 " > nl
3a31 3031 303a 2b37 3130 303a 2230 0a3e
2136440 sp sp sp sp < t a g sp k = " n a m e
2020 2020 743c 6761 6b20 223d 616e 656d
2136460 " sp v = " K C $ r n a b r u n n
2022 3d76 4b22 a4c3 6e72 6261 7572 6e6e
2136500 s sp g a t a n " sp / > nl sp sp sp sp
2073 6167 6174 226e 2f20 0a3e 2020 2020
2136520 < t a g sp k = " h i g h w a y "
743c 6761 6b20 223d 6968 6867 6177 2279
2136540 sp v = " s e c o n d a r y " sp /
7620 223d 6573 6f63 646e 7261 2279 2f20
2136560 > nl sp sp < / n o d e >
Then, for some reason, there are some non-UTF-8 characters later in the file.
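
To pin down where the file stops being valid UTF-8, something like the following could be run over the dump. This is only a rough sketch, not a tool that exists in the OSM tree: the function names find_invalid_utf8 and scan_stream are made up for illustration, and the two sample strings are taken from the names quoted in this thread. It relies on Python's strict UTF-8 decoder and reports the byte offset of every sequence the decoder rejects.

```python
import io


def find_invalid_utf8(data):
    """Return a list of (offset, offending_bytes) for invalid UTF-8 in data.

    Hypothetical helper: it decodes strictly, records where the decoder
    gives up, then resumes right after the rejected bytes.
    """
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the buffer is valid
        except UnicodeDecodeError as err:
            start = pos + err.start
            end = pos + err.end
            problems.append((start, data[start:end]))
            pos = end
    return problems


def scan_stream(stream):
    """Yield (absolute_offset, offending_bytes) for a binary stream.

    Works line by line; this is safe because the byte 0x0A never occurs
    inside a multi-byte UTF-8 sequence.
    """
    offset = 0
    for line in stream:
        for rel, bad_bytes in find_invalid_utf8(line):
            yield offset + rel, bad_bytes
        offset += len(line)


# Sample data based on the two names mentioned in this thread.
good = 'v="K\u00e4rnabrunnsgatan"'.encode("utf-8")    # contains C3 A4
bad = "Handelsh\u00f8jskole Syd".encode("latin-1")    # contains a lone F8

if __name__ == "__main__":
    sample = io.BytesIO(good + b"\n" + bad + b"\n")
    for offset, bad_bytes in scan_stream(sample):
        print(offset, bad_bytes.hex())
```

For the real file, the stream would be the planet dump opened in binary mode (for example through the bz2 module), so that the offsets can be compared with the ones in the dump above.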