[OSM-dev] Split osm line with perl

Sun Nov 29 21:03:53 GMT 2009

On Sun, Nov 29, 2009 at 18:43, Frederik Ramm <frederik at remote.org> wrote:
> Ævar Arnfjörð Bjarmason wrote:
>>
>> If I were to acquaint myself with
>> it I'd be sure not to start by writing the millionth buggy tagsoup
>> parser using regexes though.
>
> As I said, a good craftsperson will know all available tools and choose the
> one that best suits the job, and not disregard a whole family of tools just
> because he believes them to be inferior (or "uncool").
>
> If you are dealing with the kind of XML emitted by Osmosis, you can make
> assumptions about the structure. Assumptions that will break if you try to
> deal with other files of course, but assumptions that make things faster as
> long as you stay within the envelope.
>
> Assume you want to count the different values for the "highway" tag.
>
> The following millionth buggy tagsoup parser (which anyone familiar with
> Perl can write without looking up the details of an XML parser library)
> does this for Germany in 108 seconds:
>
> perl -e 'while (<>) { $count{$1}++ if /<tag k="highway" v="([^"]*)"/ }
>   foreach (sort { $count{$b} <=> $count{$a} } keys %count) {
>     printf "%6d %s\n", $count{$_}, $_ }'
>
> Your XML parser based code into which I injected the same counting routine,
>
> perl -CI -MXML::Parser -E 'my $x = XML::Parser->new(Handlers => {
>   Start => sub { my ($p, $e, %kv) = @_; return unless $e eq "tag";
>     return unless $kv{k} eq "highway"; $count{$kv{v}}++; } });
>   $x->parse(*STDIN);
>   foreach (sort { $count{$b} <=> $count{$a} } keys %count) {
>     printf "%6d %s\n", $count{$_}, $_ }'
>
> arrives at the same result in 915 seconds, roughly 8.5 times as long (a
> performance penalty of about 750%).
>
> Yes, the primitive version will choke if there's a line break or if someone
> uses ' instead of "; it doesn't decode UTF-8 properly and it will not
> resolve entities. Your version does all this, and precisely because it does,
> takes more than eight times as long.
>
> A good programmer should be aware of this, and not pay for the XML parser
> bells and whistles if he doesn't need them.
>
> I may be a bit old-fashioned but I had to take exception to the arrogance
> that spoke from your post. It is exactly that kind of attitude that I often
> see in young programmers: "I implemented this by the book and it doesn't go
> any faster." - "But how do you expect us to run this on a nightly basis when
> your code takes 28 hours to run?" - "Use more machines, dude. Never heard of
> map/reduce?" - and all that because they are too snotnosed to parse XML with
> a regex if required.
>
> I'm not calling for premature optimisation, and nothing would be more stupid
> than trying to parse a 100-line user-written config file with anything else
> than a proper and tested XML parser. But discounting regex-based XML parsing
> outright, without having some knowledge about the cost incurred, is
> imprudent, and does not go well with the air of superiority that you gave
> off.

I think Perl's regex engine is cool; in fact, if you're using it, you're
using my code.

However, when a self-admitted Perl newbie starts a thread saying he's
already split up an XML file by lines and asks how he can parse those
lines, it's worth stepping back and asking whether that's really the
approach he wants to be taking. Most of the answers given in this
thread are the right answers to the wrong question.

Admittedly, my response was a bit snotty, mostly because I've spent
untold hours maintaining large swaths of Perl code which, for no good
reason, reinvented in a buggy and undocumented manner something for
which a perfectly good library already existed.

A lot of Perl programmers really do have no idea how to use CPAN
judging by the amount of code they churn out which duplicates
well-known and tested CPAN modules with their own badly reinvented
wheels.

Of course there are cases where the libraries aren't sufficient, as you
rightly point out, but nothing about Maarten's question indicated that
this was one of them. Sometimes you have to dig yourself into the hole
of implementing & maintaining your own tagsoup parser, but I wouldn't
help a newbie dig that hole for himself unless I was certain that was
what he really needed.
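
To make that hole concrete, here is a minimal sketch of two of the
failure modes listed in the quoted message. It applies the same regex
to three made-up sample lines, all of which are well-formed XML for a
highway tag. The sample lines and the count_highways name are my own
assumptions for illustration, not taken from any real OSM file.

```perl
use strict;
use warnings;

# Count "highway" values using the line-based regex from the quoted message.
# count_highways is a hypothetical helper name used only for this sketch.
sub count_highways {
    my @lines = @_;
    my %count;
    for my $line (@lines) {
        $count{$1}++ if $line =~ /<tag k="highway" v="([^"]*)"/;
    }
    return \%count;
}

# Made-up sample lines; each is a valid way to write a highway tag in XML.
my @lines = (
    q{<tag k="highway" v="primary"/>},     # the form the regex expects
    q{<tag k='highway' v='primary'/>},     # single-quoted attributes
    q{<tag k="highway" v="a &amp; b"/>},   # value containing an entity
);

my $count = count_highways(@lines);
printf "%d %s\n", $count->{$_}, $_ for sort keys %$count;
```

The single-quoted line is silently skipped, and the entity arrives
undecoded as the literal text "a &amp; b" rather than "a & b". Neither
matters as long as the input stays within the envelope Frederik
describes, which is exactly the thing a newbie can't be expected to
know.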

And by the way your program would be slightly faster if you used
"(.*?)" instead of "([^"]*)". Minimally greedy matching is faster than
using negated character classes.
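
Relative regex speed depends on the Perl version and on the input, so
it's worth measuring rather than taking anyone's word for it. A rough
sketch using the core Benchmark module follows; the sample line and the
sub names are assumptions of mine, and a real test should use lines
from the actual file.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample line in the Osmosis-style layout discussed above.
my $line = qq{    <tag k="highway" v="residential"/>\n};

# Negated character class, as in the quoted one-liner.
sub match_negated {
    return $line =~ /<tag k="highway" v="([^"]*)"/ ? $1 : undef;
}

# Minimally greedy (lazy) quantifier, as suggested above.
sub match_lazy {
    return $line =~ /<tag k="highway" v="(.*?)"/ ? $1 : undef;
}

# Both patterns must extract the same value before timing means anything.
die "patterns disagree" unless match_negated() eq match_lazy();

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    negated_class => \&match_negated,
    lazy_dot      => \&match_lazy,
});
```

On a single short line the difference may well be lost in the noise, so
running both one-liners over the full extract is the more honest test.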