[OSM-dev] perl and special utf-8 characters
Robert Joop
5313501608656osm at rainbow.in-berlin.de
Sat Mar 19 07:58:27 GMT 2011
On 11-03-11 15:11:50 CET, Gary68 wrote:
> hi,
>
> i want to find out if certain characters (german umlaute) are contained
> in a string that i work char by char.
>
> my $text = "abc äöü" ;
> my $out = "" ;
> @chars = split //, $text ;
>
> foreach my $c (@chars) {
> # here a condition is needed !
> if ( $c eq <umlaut> ) {
> $out .= $c ;
> }
> }
You have to tell perl the encoding of your script, because the script
contains non-ASCII string literals.
If your script is encoded in UTF-8, write "use utf8;".
> unfortunately the umlaute are represented as two bytes - or whatever is
> the correct term here.
That is an indication that you've got UTF-8, where each umlaut is encoded as two bytes.
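To make the two-bytes-per-umlaut point concrete, here is a minimal sketch that counts characters and bytes for the same string. It assumes the script file is saved as UTF-8; the helper name umlauts_only is made up for illustration and is not part of any module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                      # assumption: this file is saved as UTF-8
use Encode qw(encode);

my $text = "abc äöü";

# Characters: a, b, c, space, ä, ö, ü
my $char_count = length $text;

# Bytes after encoding: 4 ASCII bytes plus 2 bytes per umlaut
my $byte_count = length encode('UTF-8', $text);

# Hypothetical helper: keep only the German umlauts and ß
sub umlauts_only {
    my ($s) = @_;
    return join '', grep { /[äöüÄÖÜß]/ } split //, $s;
}

print "chars=$char_count bytes=$byte_count\n";        # chars=7 bytes=10
print encode('UTF-8', umlauts_only($text) . "\n");    # the three umlauts as UTF-8 bytes
```

Without "use utf8;" the same literal would be seen as 10 separate bytes, and the character class would not match whole umlauts.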
> is there someone who could spend 3 lines of code. probably some encode
> and decode is needed...
You need to decode when you turn bytes into characters, e.g.
when you read bytes from a CGI parameter or from a file.
You need to encode when you turn characters into bytes, e.g. when you
print them, because one can never really know what perl's current
internal representation is.
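The decode-on-input, encode-on-output rule could look roughly like this. It is only a sketch: the helper name extract_umlauts is hypothetical, and the input is assumed to be UTF-8 bytes such as you would get from a file opened without an encoding layer.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes UTF-8 bytes, returns UTF-8 bytes.
sub extract_umlauts {
    my ($bytes) = @_;

    # bytes -> characters
    my $chars = decode('UTF-8', $bytes);

    # work on characters: keep only ä, ö and ü
    my $out = join '', grep { /[\x{E4}\x{F6}\x{FC}]/ } split //, $chars;

    # characters -> bytes
    return encode('UTF-8', $out);
}

# "abc äöü" as raw UTF-8 bytes, standing in for data read from a file
my $input = "abc \xC3\xA4\xC3\xB6\xC3\xBC";
print extract_umlauts($input), "\n";
```

All the character-level work happens between the decode and the encode, so the code in the middle never has to care how the bytes looked on the outside.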
The script (/tmp/g):
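#!/usr/bin/perl
use strict;
use warnings;
use utf8;      # the script itself is encoded in UTF-8
use Encode;    # exports encode() and decode()
my $text = "abc äöü" ;
my $out = "" ;
my @chars = split //, $text ;
foreach my $c (@chars) {
    # keep the character if it is the umlaut we are looking for
    if ($c eq 'ä') {
        $out .= $c ;
    }
}
print "out='$out'\n";                     # internal representation, not encoded
print encode ('UTF-8', "out='$out'\n");   # explicitly encoded as UTF-8 bytes
__END__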
Its output (perl /tmp/g):
out='?'
out='ä'
The first line is printed from perl's internal representation of the
characters, which happened to be Latin-1 here; the second line is the
UTF-8 bytes of the external representation.
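As an alternative to calling encode() on every print, an encoding layer on the filehandle can do the conversion for you. This is a sketch of that variant, not what the script above does; the helper name written_bytes is made up, and it writes to an in-memory buffer only so that the resulting bytes can be inspected.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: print characters through a UTF-8 encoding layer
# into an in-memory buffer and return the bytes that were written.
sub written_bytes {
    my ($chars) = @_;
    my $buf = '';
    open my $fh, '>:encoding(UTF-8)', \$buf or die "open: $!";
    print {$fh} $chars;
    close $fh or die "close: $!";
    return $buf;
}

# \x{E4} is the character ä; the layer turns it into two UTF-8 bytes.
my $bytes = written_bytes("\x{E4}");
printf "%d byte(s) written\n", length $bytes;

# The same layer on STDOUT makes a plain print emit UTF-8.
binmode STDOUT, ':encoding(UTF-8)';
print "out='\x{E4}'\n";
```

With the layer in place, the program deals in characters throughout and the encoding step happens once, at the filehandle.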
rj