[OSM-dev] Split osm line with perl

Sun Nov 29 18:43:08 GMT 2009

Hi,

Ævar Arnfjörð Bjarmason wrote:
> If I were to acquaint myself with
> it I'd be sure not to start by writing the millionth buggy tagsoup
> parser using regexes though.

As I said, a good craftsperson will know all available tools and choose 
the one that best suits the job, and not disregard a whole family of 
tools just because he believes them to be inferior (or "uncool").

If you are dealing with the kind of XML emitted by Osmosis, you can make 
assumptions about the structure. Assumptions that will break if you try 
to deal with other files of course, but assumptions that make things 
faster as long as you stay within the envelope.

Assume you want to count the different values for the "highway" tag.

The following millionth buggy tagsoup parser (which anyone familiar with 
Perl can write without looking up the details of an XML parser library) 
does this for Germany in 108 seconds:

perl -e 'while (<>) { $count{$1}++ if /<tag k="highway" v="([^"]*)"/; }
foreach (sort { $count{$b} <=> $count{$a} } keys %count) {
printf "%6d %s\n", $count{$_}, $_; }'

Your XML-parser-based code, into which I injected the same counting routine,

perl -CI -MXML::Parser -E 'my $x = XML::Parser->new(Handlers => {
Start => sub { my ($p, $e, %kv) = @_; return unless $e eq "tag";
return unless $kv{k} eq "highway"; $count{$kv{v}}++; } });
$x->parse(*STDIN);
foreach (sort { $count{$b} <=> $count{$a} } keys %count) {
printf "%6d %s\n", $count{$_}, $_; }'

arrives at the same result in 915 seconds; that is about 8.5 times as 
long, a performance penalty of roughly 750%.

Yes, the primitive version will choke if there's a line break or if 
someone uses ' instead of "; it doesn't decode UTF-8 properly and it 
will not resolve entities. Your version does all this, and precisely 
because it does, takes more than eight times as long.

A good programmer should be aware of this, and not pay for the XML 
parser bells and whistles if he doesn't need them.

I may be a bit old-fashioned but I had to take exception to the 
arrogance that spoke from your post. It is exactly that kind of attitude 
that I often see in young programmers: "I implemented this by the book 
and it doesn't go any faster." - "But how do you expect us to run this 
on a nightly basis when your code takes 28 hours to run?" - "Use more 
machines, dude. Never heard of map/reduce?" - and all that because they 
are too snotnosed to parse XML with a regex if required.

I'm not calling for premature optimisation, and nothing would be more 
stupid than trying to parse a 100-line user-written config file with 
anything other than a proper and tested XML parser. But discounting 
regex-based XML parsing outright, without having some knowledge about 
the cost incurred, is imprudent, and does not go well with the air of 
superiority that you gave off.

Bye
Frederik

-- 
Frederik Ramm  ##  eMail frederik at remote.org  ##  N49°00'09" E008°23'33"