GG3.NET

Unicode support in Perl

Why bother with Perl's Unicode support?

You can of course get by without using Unicode in Perl at all. You could instead keep strings in any other encoding, and Perl will handle them as binary data. You can print them to files as binary data and never bother with the headaches that Unicode in Perl causes.

Unicode support in Perl does allow you to do some very powerful things, though. For example, it lets you safely use extended characters in regular expressions, which is a big plus. Imagine you were storing an ISO-2022-JP encoded string in a variable. If you wanted to replace all occurrences of "B" with "b", you would be in a lot of trouble, because ISO-2022-JP uses the byte "B" a lot when encoding strings, including inside its escape sequences. If you had used extended characters internally in Perl instead, you would not have had this problem.
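
Here is a small sketch of that ISO-2022-JP problem. It assumes the core Encode module with its "iso-2022-jp" encoding is available, and the sample text (two kanji followed by a "B") is just something I picked for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Two kanji (U+65E5, U+672C) followed by an ASCII "B", as a character string.
my $chars = "\x{65e5}\x{672c}B";

# The same text as ISO-2022-JP bytes. The escape sequences that switch
# character sets contain the byte "B" themselves.
my $bytes = encode("iso-2022-jp", $chars);

my $b_in_chars = ($chars =~ tr/B//);   # counts only the real letter
my $b_in_bytes = ($bytes =~ tr/B//);   # also counts "B" inside escape sequences

# Substituting on the character string touches only the real letter.
(my $fixed = $chars) =~ s/B/b/g;

# Substituting on the raw bytes rewrites the escape sequences as well,
# so the result is no longer the correct encoding of the fixed text.
(my $mangled = $bytes) =~ s/B/b/g;
my $same = ($mangled eq encode("iso-2022-jp", $fixed)) ? "yes" : "no";

printf "B occurs %d time(s) in the characters, %d time(s) in the bytes\n",
    $b_in_chars, $b_in_bytes;
print "byte-level edit gives the right encoding: $same\n";
```

The character-level substitution changes exactly one letter, while the byte-level one also damages the escape sequences.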

How does it work in Perl?

Perl internally stores strings in one of two possible formats. It uses either the UTF-8 representation or the single-byte ISO-8859-1, a.k.a. Latin1. It also keeps a UTF-8 flag for each string which, if set, means that the variable in question should be read as UTF-8. If the flag is not set, the string is a Latin1 string.

If you wonder why this is so: it was the one way to keep compatibility with old scripts that did not know about Unicode support in Perl. With the current implementation these scripts can still be supported, because Perl stores strings in Latin1 for as long as possible. When it has no other choice (or when the user forces it to), it converts the string to UTF-8, sets the UTF-8 flag and keeps working with it. When two strings are operated on together and one of them has the UTF-8 flag set, the other is automatically upgraded to UTF-8 as well.
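
If you want to watch the flag change, the following sketch may help. It uses utf8::is_utf8 and utf8::upgrade, which are built into perl; the variable names and sample strings are mine.

```perl
use strict;
use warnings;

# Every code point fits in one byte, so Perl keeps the Latin1 form.
my $latin = "R\x{e9}sum\x{e9}";

# U+263A does not fit in Latin1, so this string has to be stored as UTF-8.
my $wide = "\x{263a}";

# Joining the two upgrades the result to UTF-8.
my $joined = $latin . $wide;

# utf8::upgrade forces the UTF-8 form without changing the text itself.
my $forced = $latin;
utf8::upgrade($forced);

for my $pair ([latin => $latin], [wide => $wide],
              [joined => $joined], [forced => $forced]) {
    my ($name, $str) = @$pair;
    printf "%-6s flag=%d length=%d\n",
        $name, (utf8::is_utf8($str) ? 1 : 0), length($str);
}
print "forced still equals latin: ", ($forced eq $latin ? "yes" : "no"), "\n";
```

Note that the flag only describes the internal storage. The upgraded copy still compares equal to the original and has the same length in characters.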

To explain how Unicode works in Perl, I will go through some examples. Imagine you want to store the word "Résumé" in a variable $foo. The code for the letter "é" in Unicode is 0xE9, so you can write it like this:

$foo = "R\x{e9}sum\x{e9}"

and Perl will store this variable internally in Latin1, because it can. Check for yourself:

$ perl -e '$foo = "R\x{e9}sum\x{e9}"; print $foo;' | hexdump -C
00000000 52 e9 73 75 6d e9       |R.sum.|
00000006

OK, you say, but what happens if...

  • ... your terminal expects UTF-8?
  • ... you do not want to use \x{e9}, and you write your scripts in UTF-8?

If you want to write to files in UTF-8

If you want to write to files (this includes STDOUT, which is just another filehandle) in UTF-8, there are a few approaches.

The basic idea is that Perl has its own I/O layer, PerlIO. It associates an encoding with each filehandle and converts everything written to that filehandle into its encoding. This means that if your STDOUT has UTF-8 associated with it, Perl will convert a string that is stored internally in Latin1 to UTF-8 when writing it to STDOUT. The same applies to all filehandles.

One way to associate an encoding with an already-open filehandle is to use binmode like this:

binmode STDOUT, ":utf8";

And try it out:

$ perl -e 'binmode STDOUT, ":utf8"; $foo = "R\x{e9}sum\x{e9}"; print $foo;' | hexdump -C
00000000 52 c3 a9 73 75 6d c3 a9         |R..sum..|
00000008
As you can see, instead of "E9", the output contains "C3 A9", which is the UTF-8 encoding of the Unicode character 0xE9, "é".

Another way is to use the -C command-line switch of perl. "perldoc perlrun" is your friend here. I will just give you an example:

$ perl -CO -e '$foo = "R\x{e9}sum\x{e9}"; print $foo;' | hexdump -C
00000000 52 c3 a9 73 75 6d c3 a9        |R..sum..|
00000008
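
The same idea works for files you open yourself: you can name the encoding layer in the open call. The sketch below is only an illustration. It writes to a temporary file created with File::Temp (my choice, so the example cleans up after itself) and uses the ":encoding(UTF-8)" layer, then reads the file back as raw bytes to show what ended up on disk.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file that is removed when the program exits.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh or die "close: $!";

# Attach the UTF-8 layer at open time instead of calling binmode later.
open(my $out, '>:encoding(UTF-8)', $path) or die "open for write: $!";
print {$out} "R\x{e9}sum\x{e9}";
close $out or die "close: $!";

# Read the file back as raw bytes to inspect what was written.
open(my $in, '<:raw', $path) or die "open for read: $!";
my $bytes = do { local $/; <$in> };
close $in or die "close: $!";

my $hex = join " ", map { sprintf("%02x", ord($_)) } split //, $bytes;
print "$hex\n";   # prints: 52 c3 a9 73 75 6d c3 a9
```

The bytes are the same ones hexdump showed for STDOUT above.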

You want to use UTF-8 in your scripts

If you want to write your scripts in UTF-8, then you need to put "use utf8" in your script, before you use UTF-8 anywhere in it. This will allow you to use UTF-8 in variable names as well as in hard-coded strings, which also includes regular expressions. Here is an example:

$ perl -CO -e 'use utf8; $foo = "Résumé"; print $foo;' | hexdump -C
00000000 52 c3 a9 73 75 6d c3 a9        |R..sum..|
00000008

It doesn't always work!

There can be a number of reasons. The most common one is that Perl by default assumes that your input is not UTF-8. In other words, you always have to tell Perl explicitly when you are handling UTF-8. If, for example, you want to read UTF-8 encoded data from a file, you have to tell Perl in advance that this filehandle carries UTF-8, or it will assume that your input is Latin1. And since any sequence of bytes is valid Latin1, that assumption never fails visibly.

To illustrate with an example, imagine you had stored the character "é" in a file in the UTF-8 encoding. In this case the file would contain two bytes: 0xC3 0xA9. When perl reads the file, it will assume that this is Latin1 and will internally store the Unicode characters 0x00C3 (Ã) and 0x00A9 (©), encoded in Latin1. If you then output this to a filehandle that has UTF-8 associated with it, Perl will convert these two characters to UTF-8 and will actually output the two characters I just mentioned (Ã©) in the UTF-8 encoding (four bytes in total), which is definitely not what you want.
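
You can reproduce those four bytes without any files. This sketch assumes the core Encode module and uses its encode function to stand in for what a UTF-8 output filehandle does; the variable names are mine.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The UTF-8 bytes of U+00E9, misread as the two Latin1 characters
# U+00C3 and U+00A9.
my $misread = "\xC3\xA9";

# Encoding those two characters as UTF-8 yields two bytes for each.
my $output = encode("UTF-8", $misread);

my $hex = join " ", map { sprintf("%02x", ord($_)) } split //, $output;
print "$hex\n";   # prints: c3 83 c2 a9
```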

If you do not set the output encoding for your filehandle, the bytes pass through unchanged, so Perl will indeed output UTF-8, but all the while it will think that you are handling Latin1. If you were to run a regular expression on the input, it could match the wrong data, which is undesirable.
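
Here is a sketch of how a regular expression can go wrong on such undecoded input. It assumes the core Encode module, and the word "café" is just a sample of my own.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 bytes of "caf" plus U+00E9, as Perl sees them when nothing
# told it the input was UTF-8: five Latin1 characters.
my $raw = "caf\xC3\xA9";
my $raw_length = length($raw);

# The same input after being decoded properly: four characters.
my $decoded = decode("UTF-8", "caf\xC3\xA9");
my $decoded_length = length($decoded);

# Looking for U+00C3 finds a character that is not really in the text.
my $raw_has_c3     = ($raw     =~ /\x{c3}/) ? 1 : 0;
my $decoded_has_c3 = ($decoded =~ /\x{c3}/) ? 1 : 0;

# "caf" followed by exactly one character matches only the decoded form.
my $raw_matches     = ($raw     =~ /^caf.$/) ? 1 : 0;
my $decoded_matches = ($decoded =~ /^caf.$/) ? 1 : 0;

print "raw:     length=$raw_length has_c3=$raw_has_c3 matches=$raw_matches\n";
print "decoded: length=$decoded_length has_c3=$decoded_has_c3 matches=$decoded_matches\n";
```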

The best way to handle this kind of problem is to tell Perl in advance that you are reading UTF-8. It will then know that 0xC3 0xA9 on input is the UTF-8 encoded character 0x00E9 "é". How Perl stores the string internally after that should not be your concern, as long as you remember to set the encoding for your output handles.
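
As a sketch of what telling Perl in advance can look like, the example below reads the same two bytes twice: once as raw bytes and once through the ":encoding(UTF-8)" layer. The temporary file from File::Temp is my own scaffolding for the demonstration, and the explicit ":raw" layer stands in for the default, undecoded behaviour.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the two UTF-8 bytes of U+00E9 into a scratch file.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh, ':raw';
print {$tmp_fh} "\xC3\xA9";
close $tmp_fh or die "close: $!";

# Read without decoding: Perl sees two characters.
open(my $plain, '<:raw', $path) or die "open: $!";
my $as_bytes = do { local $/; <$plain> };
close $plain or die "close: $!";

# Read through the UTF-8 layer: Perl sees one character.
open(my $utf8, '<:encoding(UTF-8)', $path) or die "open: $!";
my $as_chars = do { local $/; <$utf8> };
close $utf8 or die "close: $!";

printf "without layer: %d character(s), first is 0x%04X\n",
    length($as_bytes), ord($as_bytes);
printf "with layer:    %d character(s), first is 0x%04X\n",
    length($as_chars), ord($as_chars);
```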

CGI.pm doesn't work!

Form data from CGI.pm is a little tricky. When a browser submits form data, it generally submits it in the encoding that the page is in. It also makes sure that bytes with the high bit set (that is, bytes 0x80 and above) are sent as %##, where ## is the byte value in hexadecimal. So if it is submitting the letter "é" from a Latin1 page, it will send %E9. If the page was in UTF-8, it will send %C3%A9 instead.
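
To see where %E9 and %C3%A9 come from, here is a rough sketch. The percent_encode helper is hypothetical and much simpler than what a real browser does (it ignores details such as how spaces are sent); it only shows that the escapes depend on the bytes of the chosen encoding.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: escape every byte that is not an unreserved
# ASCII character as %## with two uppercase hex digits.
sub percent_encode {
    my ($bytes) = @_;
    $bytes =~ s/([^A-Za-z0-9._~-])/sprintf("%%%02X", ord($1))/ge;
    return $bytes;
}

my $e_acute = "\x{e9}";

# The same character, turned into bytes for two different page encodings.
my $from_latin1_page = percent_encode(encode("ISO-8859-1", $e_acute));
my $from_utf8_page   = percent_encode(encode("UTF-8", $e_acute));

print "Latin1 page sends: $from_latin1_page\n";   # %E9
print "UTF-8 page sends:  $from_utf8_page\n";     # %C3%A9
```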

The CGI.pm module converts these %## occurrences back to bytes, but since it doesn't know what encoding they are in, it reads them as-is. In other words, Perl assumes the data is Latin1 encoded and stores it internally as such. If you were using UTF-8, you would have to use the Encode module and run your text through the decode_utf8 function.

decode_utf8 is a shortcut for decode with the UTF-8 encoding, and can be used to decode data that Perl has wrongly interpreted as Latin1. For more information on its usage, see "perldoc Encode". There is no other workaround at the moment.
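
A minimal sketch of that workaround follows. To keep it self-contained it does not load CGI.pm; the unescape helper is a hypothetical stand-in for the %## conversion that CGI.pm performs, and the submitted value is made up.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Hypothetical stand-in for the %## conversion: turn each escape
# back into a single byte.
sub unescape {
    my ($value) = @_;
    $value =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $value;
}

# What a UTF-8 page would submit for the word "Resume" with two U+00E9.
my $submitted = "R%C3%A9sum%C3%A9";

# After unescaping, Perl holds eight bytes that it treats as Latin1.
my $raw = unescape($submitted);
my $raw_length = length($raw);

# decode_utf8 turns those bytes into the six characters that were meant.
my $text = decode_utf8($raw);
my $text_length = length($text);

print "before decode_utf8: $raw_length characters\n";   # 8
print "after decode_utf8:  $text_length characters\n";  # 6
```

Remember to set the encoding on your output filehandle before printing the decoded text.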

©2005 GG3.NET