Perl: utf8::decode vs. Encode::decode

Time: 2020-12-29 10:03:38

I'm getting some interesting results while trying to understand the differences between Encode::decode("utf8", $var) and utf8::decode($var). I've already discovered that calling the former repeatedly on a variable eventually fails with the error "Cannot decode string with wide characters at...", whereas the latter will happily run as many times as you want, simply returning false.
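The difference in failure behaviour described above can be reproduced with a minimal sketch. It assumes the UTF-8 octets for U+8ACB as input (written as escapes so the encoding of the script file does not matter); the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode ();

# UTF-8 octets for U+8ACB (assumed sample input)
my $octets = "\xE8\xAB\x8B";

# Encode::decode returns a new character string ...
my $chars = Encode::decode("utf8", $octets);

# ... and dies when asked to decode a string that already holds a wide character
my $second = eval { Encode::decode("utf8", $chars) };
my $encode_error = $@;

# utf8::decode works in place and reports failure through its return value
my $copy      = $octets;
my $first_ok  = utf8::decode($copy);   # true: valid UTF-8 octets were converted
my $second_ok = utf8::decode($copy);   # false: $copy now holds a wide character

printf "Encode::decode died on second call: %s\n", $encode_error ? "yes" : "no";
printf "utf8::decode returned: first=%d second=%d\n",
    $first_ok ? 1 : 0, $second_ok ? 1 : 0;
```

Wrapping the Encode::decode call in eval is what keeps the script alive; utf8::decode needs no such guard, but its return value has to be checked explicitly.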

What I'm having trouble understanding is why the length function returns different results depending on which method is used to decode. The problem arises because I am dealing with "doubly encoded" UTF-8 text from an outside file. To demonstrate the issue, I created a text file, "test.txt", containing the following Unicode characters on one line: U+00e8, U+00ab, U+008b, U+000a. These are the double encoding of the Unicode character U+8acb, followed by a newline. The file was saved to disk as UTF-8. I then run the following Perl script:

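#!/usr/bin/perl
use strict;
use warnings;
use Encode ();   # load Encode without importing anything
# utf8::decode and utf8::is_utf8 are built in, so utf8.pm need not be loaded

# Read raw bytes (no encoding layer), so $test starts out as an octet string.
# Parentheses matter here: without them, "or die" binds to the file name.
open(my $fh, '<', 'test.txt') or die "Cannot open test.txt: $!";
my @lines = <$fh>;
close $fh;
my $test = $lines[0];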

print "Length: " . (length $test) . "\n";
print "utf8 flag: " . utf8::is_utf8($test) . "\n";
my @unicode = (unpack('U*', $test));
print "Unicode:\n@unicode\n";
my @hex = (unpack('H*', $test));
print "Hex:\n@hex\n";

print "==============\n";

$test = Encode::decode("utf8", $test);
print "Length: " . (length $test) . "\n";
print "utf8 flag: " . utf8::is_utf8($test) . "\n";
@unicode = (unpack('U*', $test));
print "Unicode:\n@unicode\n";
@hex = (unpack('H*', $test));
print "Hex:\n@hex\n";

print "==============\n";

$test = Encode::decode("utf8", $test);
print "Length: " . (length $test) . "\n";
print "utf8 flag: " . utf8::is_utf8($test) . "\n";
@unicode = (unpack('U*', $test));
print "Unicode:\n@unicode\n";
@hex = (unpack('H*', $test));
print "Hex:\n@hex\n";

This gives the following output:

Length: 7
utf8 flag: 
Unicode:
195 168 194 171 194 139 10
Hex:
c3a8c2abc28b0a
==============
Length: 4
utf8 flag: 1
Unicode:
232 171 139 10
Hex:
c3a8c2abc28b0a
==============
Length: 2
utf8 flag: 1
Unicode:
35531 10
Hex:
e8ab8b0a

This is what I would expect. The length is originally 7 because perl thinks that $test is just a series of bytes. After decoding once, perl knows that $test is a series of UTF-8-encoded characters (i.e. instead of returning a length of 7 bytes, perl returns a length of 4 characters, even though $test is still 7 bytes in memory). After the second decoding, $test contains 4 bytes interpreted as 2 characters, which is what I would expect, since Encode::decode took the 4 code points and interpreted them as UTF-8-encoded bytes, resulting in 2 characters. The strange thing happens when I modify the code to call utf8::decode instead (replacing every $test = Encode::decode("utf8", $test); with utf8::decode($test);).

This gives almost identical output; only the result of length differs:

Length: 7
utf8 flag: 
Unicode:
195 168 194 171 194 139 10
Hex:
c3a8c2abc28b0a
==============
Length: 4
utf8 flag: 1
Unicode:
232 171 139 10
Hex:
c3a8c2abc28b0a
==============
Length: 4
utf8 flag: 1
Unicode:
35531 10
Hex:
e8ab8b0a

It seems like perl first counts the bytes before decoding (as expected), then counts the characters after the first decoding, but then counts the bytes again after the second decoding (not expected). Why would this switch happen? Is there a lapse in my understanding of how these decoding functions work?
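One way to tell whether length is reporting characters or something else is to count the code points independently and compare. The following is only a sketch under assumptions: it hard-codes the seven octets from the first "Hex:" line of the output above instead of reading test.txt, and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode ();

# The seven octets from the first "Hex:" line of the output above
my $octets = "\xC3\xA8\xC2\xAB\xC2\x8B\x0A";

# Route 1: Encode::decode returns a new string each time
my $via_encode = Encode::decode("utf8", Encode::decode("utf8", $octets));

# Route 2: utf8::decode converts its argument in place
my $via_utf8 = $octets;
utf8::decode($via_utf8) for 1 .. 2;

# Count code points without relying on length()
my @encode_points = map { ord } split //, $via_encode;
my @utf8_points   = map { ord } split //, $via_utf8;

print "Encode route: @encode_points\n";   # 35531 10
print "utf8 route:   @utf8_points\n";     # 35531 10
printf "length says %d, code point count is %d\n",
    length($via_utf8), scalar @utf8_points;
```

If the two numbers on the last line disagree, the string contents are fine and only the value reported by length is off.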

Thanks,
Matt

2 answers

#1


4  

You are not supposed to use the functions from the utf8 pragma module. Its documentation says so:

Do not use this pragma for anything else than telling Perl that your script is written in UTF-8.

Always use the Encode module, and also see the question "Checklist for going the Unicode way with Perl". unpack is too low-level; it does not even give you error checking.
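As a rough illustration of the error checking that Encode offers, the sketch below decodes one well-formed and one damaged octet string with Encode::FB_CROAK. The damaged input is a hypothetical example invented here, and LEAVE_SRC is added only so that the inputs are not modified.

```perl
use strict;
use warnings;
use Encode ();

my $good = "\xE8\xAB\x86\x0A";   # well-formed UTF-8
my $bad  = "\xE8(\x86\x0A";      # hypothetical damaged input: 0xE8 lacks its continuation bytes

# FB_CROAK makes decode() die on malformed input;
# LEAVE_SRC keeps it from modifying the argument
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

my $decoded  = Encode::decode('UTF-8', $good, $check);
my $rejected = eval { Encode::decode('UTF-8', $bad, $check); 1 } ? 0 : 1;

printf "good input: %d characters\n", length $decoded;       # good input: 2 characters
printf "bad input rejected: %s\n", $rejected ? "yes" : "no"; # bad input rejected: yes
```

Without a check argument, decode does not die on such input and substitutes a replacement character instead, so malformed data can pass unnoticed.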

You are going wrong with the assumption that the octets E8 AB 86 0A are the result of UTF-8 double-encoding the characters 諆 and newline. This is the representation of a single UTF-8 encoding of these characters. Perhaps the whole confusion on your side stems from that mistake.
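To make the two terms concrete, here is a small sketch that encodes U+8AC6 plus a newline once and then a second time, printing the octets of each result in hex. It assumes that "double encoding" means running the UTF-8 encoder over octets that were already UTF-8.

```perl
use strict;
use warnings;
use Encode ();

my $text  = "\x{8AC6}\n";                      # U+8AC6 plus a newline
my $once  = Encode::encode('UTF-8', $text);    # single encoding
my $twice = Encode::encode('UTF-8', $once);    # octets treated as characters and encoded again

my $once_hex  = unpack 'H*', $once;
my $twice_hex = unpack 'H*', $twice;

print "encoded once:  $once_hex\n";    # e8ab860a
print "encoded twice: $twice_hex\n";   # c3a8c2abc2860a
```

A single encoding yields four octets, while the second pass expands every octet above 0x7F into two, giving seven.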

length is inappropriately overloaded: at certain times it determines the length in characters, at others the length in octets. Use better tools such as Devel::Peek.

#!/usr/bin/env perl
use strict;
use warnings FATAL => 'all';
use Devel::Peek qw(Dump);
use Encode qw(decode);

my $test = "\x{00e8}\x{00ab}\x{0086}\x{000a}";
# or read the octets without implicit decoding from a file, does not matter

Dump $test;
#  FLAGS = (PADMY,POK,pPOK)
#  PV = 0x8d8520 "\350\253\206\n"\0

$test = decode('UTF-8', $test, Encode::FB_CROAK);
Dump $test;
#  FLAGS = (PADMY,POK,pPOK,UTF8)
#  PV = 0xc02850 "\350\253\206\n"\0 [UTF8 "\x{8ac6}\n"]
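When a full Dump is more detail than needed, another option is to make the unit explicit by measuring the decoded string and its UTF-8 encoding separately. This is only a sketch using the same octets as the example above; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode ();

my $octets = "\xE8\xAB\x86\x0A";   # same octets as in the example above
my $chars  = Encode::decode('UTF-8', $octets,
    Encode::FB_CROAK() | Encode::LEAVE_SRC());

my $char_count  = length $chars;                             # characters in the decoded string
my $octet_count = length(Encode::encode('UTF-8', $chars));   # octets in its UTF-8 encoding

print "characters: $char_count, octets: $octet_count\n";     # characters: 2, octets: 4
```

Re-encoding just to count octets costs a copy, but it leaves no doubt about which unit each number is in.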

#2


2  

Turns out this was a bug: https://rt.perl.org/rt3//Public/Bug/Display.html?id=80190.
