[OSM-dev] Problems parsing planet.osm with Perl XML::Parser

Wed Nov 1 16:30:42 GMT 2006

I'm having a similar problem with the python SAX parser.

The first value is Cyrillic -- you're either missing the font or trying 
to print it on a non-UTF-8 display.

From my Ubuntu box's terminal it comes out as:

<node id="543408" lat="51.2714" lon="7.13737" timestamp="2006-02-16T16:43:38+00:00">
  <tag k="name" v="????????????????/*<D1>?*/???????????/*<D1>?*/?/*<D1>?<D0>?*/???????????/*<D0>?*/??????????????????" />
  <tag k="class" v="node" />
</node>

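The /*<D1>?*/ and /*<D0>?*/ stuff that appears is a pair of bytes such 
as D0 3F, which is invalid UTF-8: D0 starts a two-byte sequence, so the 
next byte has to be a continuation byte in the range 80-BF, and 3F is a 
plain ASCII '?'. This is probably why the XML parser is failing. It's 
definitely that character where it all dies.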
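
As a rough illustration of the point above, here is a minimal Python 
sketch that reports where a byte string stops being valid UTF-8. The 
function name first_invalid_utf8_offset is made up for this example; it 
only relies on the standard bytes.decode() behaviour.

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the offset of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte of the offending sequence.
        return err.start
    return None


# D0 B0 is a complete two-byte sequence (Cyrillic small letter a).
valid_pair = b"\xd0\xb0"
# D0 followed by 3F ('?') is not: 3F is not a continuation byte.
broken_pair = b"\xd0\x3f"

print(first_invalid_utf8_offset(valid_pair))
print(first_invalid_utf8_offset(broken_pair))
```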

The second one looks OK to you because your display is not using UTF-8. 
The ø is a single byte F8 (its ISO-8859-1 encoding) -- this is also 
invalid UTF-8.
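
To make that concrete, here is a small Python sketch. It assumes the 
name was stored as ISO-8859-1 bytes, which is a guess based on how it 
displays; the variable names are only for illustration.

```python
# Assumption: the value sits in the file as ISO-8859-1, where ø is one byte (F8).
raw = b"Handelsh\xf8jskole Syd"

try:
    raw.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    # A lone F8 byte can never appear in valid UTF-8.
    is_valid_utf8 = False

# Interpreting the same bytes as ISO-8859-1 recovers the intended text,
# which can then be re-encoded as proper UTF-8 (ø becomes the two bytes C3 B8).
as_text = raw.decode("iso-8859-1")
as_utf8 = as_text.encode("utf-8")

print(is_valid_utf8)
print(as_utf8)
```

This only helps when the wrong encoding is known; it cannot repair the 
first example, where bytes have already been replaced by '?'.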

So this is either an issue with the planet dump script, or the tag value 
is corrupted in the database itself. If it's the database then it needs 
clearing up, and the API needs to ensure that only good values are inserted.
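
For the clean-up side, something along these lines could list the 
affected lines in a dump before feeding it to an XML parser. This is 
only a sketch: find_invalid_utf8_lines is a hypothetical helper, not 
part of any OSM tool, and it checks each line on its own, which assumes 
no UTF-8 sequence is split across a newline (true for valid UTF-8).

```python
import io


def find_invalid_utf8_lines(stream):
    """Yield (line_number, column) for each line that is not valid UTF-8.

    `stream` is any binary file object. Line numbers and columns are
    1-based, and the column is a byte offset within the line.
    """
    for line_number, line in enumerate(stream, start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError as err:
            yield line_number, err.start + 1


# Tiny in-memory stand-in for a planet file, with one corrupted tag value.
sample = io.BytesIO(
    b'<node id="1">\n'
    b'  <tag k="name" v="\xd0\x3f" />\n'
    b'</node>\n'
)
bad_lines = list(find_invalid_utf8_lines(sample))
print(bad_lines)
```

For a real file you would pass open(path, "rb") instead of the BytesIO 
object; only the first bad byte on each line is reported.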

Dave

Ralf Zimmermann wrote:
> I want to write some Perl scripts in order to filter OSM data. As a first attempt, I wrote the file osm_stats.pl, which only counts the number of nodes, segments and ways.
>
> With a lot of OSM files, the script works just fine. But when I run this script on the planet file planet-061023.osm, I get the following error message:
>
> not well-formed (invalid token) at line 587103, column 37, byte 45215417 at /usr/lib/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi/XML/Parser.pm line 187
>
> Looking at the planet file shows the following line as being problematic:
> 587102:   <node id="543408" lat="51.2714" lon="7.13737" timestamp="2006-02-16T16:43:38+00:00">
> 587103:     <tag k="name" v="Ð°Ð±Ð²Ð³Ð´ÐµÐ¶Ð·Ð¸ÐºÐ»Ð¼Ð½Ð¾Ð¿ÑÑ?ÑÑÑÑÑÑÑÑÑÑÑÑ?ÑÑ?Ð?ÐÐÐÐÐÐÐÐÐÐÐ?ÐÐÐ Ð¡Ð¢Ð£Ð¤Ð¥Ð¦Ð§Ð¨Ð©Ð¬Ð«ÐªÐÐ®Ð¯" />
> 587104:     <tag k="class" v="node" />
> 587105:   </node>
>
> I eliminated this node from the planet file and I get other lines that have the same issue, for example:
> 1729956:     <tag k="name" v="Handelshøjskole Syd" />
>
> Somehow, the parser does not like the special characters in the name tag. While the first example seems somewhat malformed, the second example looks OK to me.
> To me it seems like the parser has a problem. But how can I solve that?
>
> Has anyone here used XML::Parser and experienced similar issues with special characters?
>
> Ralf
>
>
> --- osm_stats.pl -----------------------------
> #!/usr/bin/perl -w
>
> use strict;
>
> use XML::Parser;
> my $num_nodes = 0;
> my $num_segments = 0;
> my $num_ways = 0;
> # Style 'Subs' calls a sub named after each start tag (node, segment, way).
> my $p = XML::Parser->new(Style => 'Subs');
> $p->parsefile($ARGV[0], ProtocolEncoding => 'UTF-8');
> print "Statistics of file $ARGV[0]:\n";
> print "Nodes:    $num_nodes\n";
> print "Segments: $num_segments\n";
> print "Ways:     $num_ways\n";
>
> sub node {
>    $num_nodes++;
> }
> sub segment {
>    $num_segments++;
> }
> sub way {
>    $num_ways++;
> }