[OSM-dev] Problems parsing planet.osm with Perl XML::Parser

Ralf Zimmermann Ralf at Zimmermann.com
Wed Nov 1 15:19:36 GMT 2006


I want to write some Perl scripts to filter OSM data. As a first attempt, I wrote osm_stats.pl, which only counts the number of nodes, segments and ways.

With many OSM files, the script works just fine. But when I run it on the planet file planet-061023.osm, I get the following error message:

not well-formed (invalid token) at line 587103, column 37, byte 45215417 at /usr/lib/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi/XML/Parser.pm line 187

Looking at the planet file shows the following line as being problematic:
587102:   <node id="543408" lat="51.2714" lon="7.13737" timestamp="2006-02-16T16:43:38+00:00">
587103:     <tag k="name" v="абвгдежзиклмнопÑÑ?ÑÑÑÑÑÑÑÑÑÑÑÑ?ÑÑ?Ð?ÐÐÐÐÐÐÐÐÐÐÐ?ÐÐРСТУФХЦЧШЩЬЫЪЭЮЯ" />
587104:     <tag k="class" v="node" />
587105:   </node>

After I removed this node from the planet file, the parser fails on other lines with the same error, for example:
1729956:     <tag k="name" v="Handelshøjskole Syd" />

The parser apparently rejects the non-ASCII characters in the name tag. The first example does look malformed, but the second looks fine to me.
So it seems to me that the problem is in the parser. How can I solve this?
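
One way to narrow this down is to check whether the reported lines are valid UTF-8 at the byte level, independently of XML::Parser. The following is only a sketch under an unconfirmed assumption: that the planet file contains byte sequences that are not valid UTF-8 (for example an "ø" stored as the single Latin-1 byte 0xF8). The helper name is_valid_utf8 is hypothetical; decode and FB_CROAK come from the core Encode module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: returns 1 if the raw bytes in $bytes are valid
# (strict) UTF-8, 0 otherwise.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() may modify its argument when a CHECK flag is set
    # With FB_CROAK, decode() dies on the first malformed sequence.
    return eval { decode('UTF-8', $copy, FB_CROAK); 1 } ? 1 : 0;
}

# If a file name is given, print every line that is not valid UTF-8,
# prefixed with its line number.
if (@ARGV) {
    open(my $fh, '<:raw', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
    while (my $line = <$fh>) {
        print "$.: $line" unless is_valid_utf8($line);
    }
    close($fh);
}
```

If this prints the same line numbers that XML::Parser complains about, the file itself is not well-formed UTF-8 and the parser is right to reject it. If it prints nothing, the cause lies elsewhere.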

Has anyone here used XML::Parser and experienced similar issues with special characters?

Ralf


--- osm_stats.pl -----------------------------
#!/usr/bin/perl -w

use strict;

use XML::Parser;
my $num_nodes = 0;
my $num_segments = 0;
my $num_ways = 0;
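die "Usage: $0 file.osm\n" unless @ARGV;
# The 'Subs' style calls a sub named after each element (node, segment, way)
# whenever that element's start tag is seen.
my $p = XML::Parser->new(Style => 'Subs');
$p->parsefile($ARGV[0], ProtocolEncoding => 'UTF-8');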
print "Statistics of file $ARGV[0]:\n";
print "Nodes:    $num_nodes\n";
print "Segments: $num_segments\n";
print "Ways:     $num_ways\n";

sub node {
   $num_nodes++;
}
sub segment {
   $num_segments++;
}
sub way {
   $num_ways++;
}





