[OSM-dev] perl and special utf-8 characters

Robert Joop 5313501608656osm at rainbow.in-berlin.de
Sat Mar 19 07:58:27 GMT 2011


On 11-03-11 15:11:50 CET, Gary68 wrote:
> hi,
> 
> i want to find out if certain characters (german umlaute) are contained
> in a string that i work char by char.
> 
> 	my $text = "abc äöü" ;
> 	my $out = "" ;
> 	@chars = split //, $text ;
> 
> 	foreach my $c (@chars) {
> 		# here a condition is needed ! 
> 		if ( $c eq <umlaut> ) {
> 			$out .= $c ;
> 		}
> 	}

You have to tell perl the encoding of your script.
(That's because you use non-ASCII string literals in your script.)
If your script is encoded in UTF-8, write "use utf8;".

> unfortunately the umlaute are represented as two bytes - or whatever is
> the correct term here.

That's an indication that you've got UTF-8: each umlaut is encoded as
two bytes there.
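
To see the effect of "use utf8" on a literal, here is a minimal sketch.
It assumes the script file itself is saved as UTF-8; the variable names
are only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: this file is saved as UTF-8.
my ($as_bytes, $as_chars);
{
    no utf8;                      # the literal is taken byte by byte
    $as_bytes = length("ä");
}
{
    use utf8;                     # the literal is decoded as UTF-8
    $as_chars = length("ä");
}
print "without utf8: $as_bytes, with utf8: $as_chars\n";
```

Without the pragma the split in your loop hands you the two bytes one
at a time, so no single element can ever be equal to an umlaut.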

> is there someone who could spend 3 lines of code. probably some encode
> and decode is needed...

You need to decode when you need to turn bytes into characters, e.g.
when you read bytes from a CGI parameter or from a file.
You need to encode when you need to turn characters into bytes, e.g.
when you print them or write them to a file, as one can never really
know what perl's current internal representation is.
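
A minimal sketch of that round trip, using decode and encode from the
Encode module. The input bytes are a made-up stand-in for what a file
or CGI parameter would deliver; they are written with escapes so the
example does not depend on how the script is saved.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: the UTF-8 bytes for "abc äöü".
my $bytes = "abc \xc3\xa4\xc3\xb6\xc3\xbc";

# bytes -> characters
my $chars = decode('UTF-8', $bytes);

# Work char by char: keep only ä (U+00E4), ö (U+00F6), ü (U+00FC).
my $umlauts = join '', grep { /[\x{e4}\x{f6}\x{fc}]/ } split //, $chars;

# characters -> bytes, ready to be printed or written to a file
my $out = encode('UTF-8', $umlauts);

printf "%d bytes in, %d characters, %d umlauts\n",
    length($bytes), length($chars), length($umlauts);
print "$out\n";
```

The first line of output is "10 bytes in, 7 characters, 3 umlauts";
the second is the three umlauts as UTF-8 bytes.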

:r /tmp/g
#!/usr/bin/perl

use utf8;
use Encode;

my $text = "abc äöü" ;
my $out = "" ;
my @chars = split //, $text ;

foreach my $c (@chars) {
	# keep only the characters that are equal to 'ä'
	if ($c eq 'ä') {
		$out .= $c ;
	}
}
print "out='$out'\n";                    # printed as perl holds it internally
print encode ('UTF-8', "out='$out'\n");  # explicitly encoded to UTF-8 bytes
__END__

:r !perl /tmp/g
out='?'
out='ä'

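The first line is perl's internal representation of the characters
written out as-is; perl happened to use latin1 for it, so the 'ä' is a
single byte that is not valid UTF-8 on its own and shows up as '?'.
The second line is the UTF-8 bytes of the external representation.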
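
If you don't want to call encode on every print, an alternative (not
what the script above does) is to put an encoding layer on the output
handle. A sketch, again assuming the script file is saved as UTF-8:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                               # script is assumed to be UTF-8

# Everything printed to STDOUT is encoded to UTF-8 from here on.
binmode(STDOUT, ':encoding(UTF-8)');

my $out = join '', grep { $_ eq 'ä' } split //, "abc äöü";
print "out='$out'\n";
```

With the layer in place a plain print already produces the UTF-8 bytes,
so encoding the string a second time would garble it.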

rj


