[OSM-dev] Problems parsing planet.osm with Perl XML::Parser
james at mastros.biz
Wed Nov 1 16:38:13 GMT 2006
On Wed, Nov 01, 2006 at 04:19:36PM +0100, Ralf Zimmermann wrote:
> With a lot of OSM files, the script works just fine. But when I
> throw the planet file planet-061023.osm on this script, I get the
> following error message:
> not well-formed (invalid token) at line 587103, column 37, byte 45215417 at line 187
> Looking at the planet file shows the following line as being problematic:
> 587102: <node id="543408" lat="51.2714" lon="7.13737" timestamp="2006-02-16T16:43:38+00:00">
> 587103: <tag k="name" v="Ð°Ð±Ð²Ð³Ð´ÐµÐ¶Ð·Ð¸ÐºÐ»Ð¼Ð½Ð¾Ð¿ÑÑ?ÑÑÑÑÑÑÑÑÑÑÑÑ?ÑÑ?Ð?ÐÐÐÐÐÐÐÐÐÐÐ?ÐÐÐ Ð¡Ð¢Ð£Ð¤Ð¥Ð¦Ð§Ð¨Ð©Ð¬Ð«ÐªÐÐ®Ð¯" />
> 587104: <tag k="class" v="node" />
> 587105: </node>
> I eliminated this node from the planet file and I get other lines that have the same issue, for example:
> 1729956: <tag k="name" v="Handelshøjskole Syd" />
> Somehow, the parser does not like the special characters in the name
> tag. Whereas the first example seems somewhat malformed, the second
> example looks ok to me.
> To me it seems like the parser has a problem. But how can I solve that?
> Has anyone here used XML::Parser and experienced similar issues with special characters?
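You've diagnosed this somewhat backwards, I'm afraid. The file
doesn't have an encoding declaration. According to the XML spec,
that means the parser must treat it as UTF-8 (unless it starts with
a UTF-16 byte-order mark), and the garbage on line 587103 simply
isn't valid UTF-8. Several other lines have things that appear to be
Latin-1 instead of UTF-8, so the parser is right to reject them: the
problem is in the data, not in XML::Parser.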
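
One way to confirm this is to scan the raw bytes for lines that do not decode as UTF-8. The following is only a minimal sketch in Python, not something taken from the planet tooling: the function name find_invalid_utf8_lines and the sample bytes are made up for illustration, and the sample assumes the "Handelshøjskole" line was stored as Latin-1 (byte 0xF8 for "ø").

```python
def find_invalid_utf8_lines(data: bytes):
    """Return (line_number, latin1_text) for each line that is not valid UTF-8."""
    bad = []
    for number, raw in enumerate(data.split(b"\n"), start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            # Latin-1 maps every byte to a character, so this decode never
            # fails; it is only a guess at what the author meant to write.
            bad.append((number, raw.decode("latin-1")))
    return bad


# Hypothetical sample: line 1 is valid UTF-8, line 2 has a bare Latin-1 byte.
sample = (
    b'<tag k="name" v="K\xc3\xb8benhavn" />\n'
    b'<tag k="name" v="Handelsh\xf8jskole Syd" />\n'
)

for number, text in find_invalid_utf8_lines(sample):
    print(number, text)
```

For a file the size of the planet dump it would be better to open it in binary mode and loop over it line by line rather than read it all into memory; the decode check itself stays the same.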
-=- James Mastros