I have a simple question that I do not know how to solve in Perl. I know how to convert from UTF-8 to GBK, for example from e4b8ad to d6d0. But I am not sure how to go backward: given d6d0, how do I get e4b8ad?
Please enlighten me! Many thanks.
3 Solutions
#1
3
When you have hex digits, pack is your friend. Following is a REPL session. Notes:
- To reverse the direction, pack the hex digits into octets, decode from GB octets to character string, encode character string to UTF-8 octets, unpack octets into hex digits.
- GBK is superseded. Use of GB18030 (provided by Encode::HanExtra in Perl) has been mandatory for five years already.
$ use Encode qw(decode encode); use Encode::HanExtra; use Devel::Peek qw(Dump);
$ 'e4b8ad'
e4b8ad # hex digits
$ pack('H*', 'e4b8ad')
中
$ Dump(pack('H*', 'e4b8ad'))
SV = PV(0x3657680) at 0x36b7188
REFCNT = 1
FLAGS = (PADTMP,POK,pPOK)
PV = 0x36c0768 "\344\270\255"\0 # octets of UTF-8 encoded data
CUR = 3
LEN = 8
$ decode('UTF-8', pack('H*', 'e4b8ad'))
中
$ Dump(decode('UTF-8', pack('H*', 'e4b8ad')))
SV = PV(0x326c3a0) at 0x36a50c8
REFCNT = 1
FLAGS = (TEMP,POK,pPOK,UTF8)
PV = 0x3698a48 "\344\270\255"\0 [UTF8 "\x{4e2d}"] # character string
CUR = 3
LEN = 8
$ encode('GB18030', decode('UTF-8', pack('H*', 'e4b8ad')))
"\xd6\xd0"
$ Dump(encode('GB18030', decode('UTF-8', pack('H*', 'e4b8ad'))))
SV = PV(0x36a2da0) at 0x36b6d98
REFCNT = 1
FLAGS = (TEMP,POK,pPOK)
PV = 0x36db3e8 "\326\320"\0 # octets of GB18030 encoded data
CUR = 2
LEN = 8
$ unpack('H*', encode('GB18030', decode('UTF-8', pack('H*', 'e4b8ad'))))
d6d0 # hex digits
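The reverse direction described in the notes can be sketched as a small function. This is a minimal sketch, not the answerer's own code: the helper name `gbk_hex_to_utf8_hex` is hypothetical, and it assumes the `GBK` encoding name (an alias for `cp936` that ships with core Encode) so that it runs without Encode::HanExtra. With Encode::HanExtra installed, `'GB18030'` could be substituted.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: hex digits of GBK data in, hex digits of UTF-8 data out.
sub gbk_hex_to_utf8_hex {
    my ($hex) = @_;
    my $gbk_octets  = pack('H*', $hex);              # hex digits -> octets
    my $characters  = decode('GBK', $gbk_octets);    # GBK octets -> character string
    my $utf8_octets = encode('UTF-8', $characters);  # character string -> UTF-8 octets
    return unpack('H*', $utf8_octets);               # octets -> hex digits
}

print gbk_hex_to_utf8_hex('d6d0'), "\n";  # prints e4b8ad
```

Each step mirrors one step of the REPL session, applied in the opposite order of encodings.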
#2
1
The answer to the question asked:
use Encode qw( from_to );
my $gbk = "\xD6\xD0";
from_to(my $utf8 = $gbk, 'GB18030', 'UTF-8'); # E4 B8 AD
or
use Encode qw( decode encode );
my $gbk = "\xD6\xD0";
my $utf8 = encode('UTF-8', decode('GB18030', $gbk)); # E4 B8 AD
However, a more normal flow looks like the following:
open(my $fh_in, '<:encoding(GB18030)', ...) or die ...;
open(my $fh_out, '>:encoding(UTF-8)', ...) or die ...;
while (<$fh_in>) {
...
print $fh_out ...;
...
}
Encode::HanExtra must be installed for Encode to find the encoding.
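To make the file-handle flow concrete, here is a self-contained sketch that converts a small file. The file names and the sample data are assumptions made for illustration, and it uses the `GBK` encoding from core Encode rather than `GB18030`, so it does not depend on Encode::HanExtra.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir      = tempdir(CLEANUP => 1);
my $in_path  = "$dir/in.gbk";     # hypothetical input file
my $out_path = "$dir/out.utf8";   # hypothetical output file

# Create a sample input file holding the GBK octets D6 D0 and a newline.
open(my $fh_raw, '>:raw', $in_path) or die "Cannot write $in_path: $!";
print $fh_raw "\xD6\xD0\n";
close($fh_raw) or die "Cannot close $in_path: $!";

# The input layer decodes GBK on read; the output layer encodes UTF-8 on write.
open(my $fh_in,  '<:encoding(GBK)',   $in_path)  or die "Cannot read $in_path: $!";
open(my $fh_out, '>:encoding(UTF-8)', $out_path) or die "Cannot write $out_path: $!";
while (my $line = <$fh_in>) {
    print $fh_out $line;          # $line is a character string here
}
close($fh_in)  or die "Cannot close $in_path: $!";
close($fh_out) or die "Cannot close $out_path: $!";

# Read the result back as raw octets and show them as hex digits.
open(my $fh_check, '<:raw', $out_path) or die "Cannot read $out_path: $!";
my $octets = do { local $/; <$fh_check> };
close($fh_check) or die "Cannot close $out_path: $!";
print unpack('H*', $octets), "\n";
```

On a Unix-like system this should print `e4b8ad0a`, the UTF-8 octets followed by the newline. The loop body never calls `encode` or `decode` itself, because the I/O layers do the conversion at the file boundary.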
#3
0
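use Encode qw/encode decode/;
my $chars = decode("euc-cn", $euc_cn);  # EUC-CN octets to a Perl character string
my $utf8  = encode("UTF-8", $chars);    # character string to UTF-8 octets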
You can also normally specify the encoding when you open a filehandle (or afterwards with binmode), and it will perform the necessary conversions.
Works like a charm:
perl -e 'open(X,">","/tmp/x"); print X chr(0xd6).chr(0xd0);close(X)'
perl -mEncode -e 'open(X,"<","/tmp/x"); $x=<X>; print Encode::decode("euc-cn",$x);' > /tmp/xx