How do I convert a Unicode text block to UTF-8 (hex) code points?

Time: 2023-01-22 20:15:14

I have a Unicode text-block, like this:

ụ
ư
ứ
Ỳ
Ỷ
Ỵ
Đ

Now I want to convert this original Unicode text block into a text block of UTF-8 (hex) code points (see the Hexadecimal UTF-8 column on this page: https://en.wikipedia.org/wiki/UTF-8), using PHP, like this:

\xe1\xbb\xa5
\xc6\xb0
\xe1\xbb\xa9
\xe1\xbb\xb2
\xe1\xbb\xb6
\xe1\xbb\xb4
\xc4\x90

Not like this:

0x1EE5
0x01B0
0x1EE9
0x1EF2
0x1EF6
0x1EF4
0x0110

Is there any way to do this in PHP?


I have read this topic (PHP: Convert unicode codepoint to UTF-8), but it does not answer my question.


I am sorry, I don't know much about Unicode.

3 solutions

#1


13  

I think you're looking for the bin2hex() function:

Convert binary data into hexadecimal representation

And format by prepending \x to each byte (00-FF)

function str_hex_format ($bin) {
  return '\x'.implode('\x', str_split(bin2hex($bin), 2));
}

For your sample:

// utf8 encoded input
$arr = ["ụ","ư","ứ","Ỳ","Ỷ","Ỵ","Đ"];

foreach($arr AS $v)
  echo $v . " => " . str_hex_format($v) . "\n";

See test at eval.in (link expires)

ụ => \xe1\xbb\xa5
ư => \xc6\xb0
ứ => \xe1\xbb\xa9
Ỳ => \xe1\xbb\xb2
Ỷ => \xe1\xbb\xb6
Ỵ => \xe1\xbb\xb4
Đ => \xc4\x90
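
The question starts from a text block with one character per line rather than an array. A minimal sketch of applying the same bin2hex() idea line by line follows; the helper name text_block_to_hex is hypothetical (it is not part of the original answer), and the input is assumed to be UTF-8 already.

```php
<?php
// Hypothetical helper: converts each line of a UTF-8 text block
// to its \x-escaped byte sequence, keeping the line structure.
function text_block_to_hex(string $block): string
{
    $out = [];
    // Split on CRLF, CR or LF. A byte-mode \R is avoided on purpose,
    // because it would also match 0x85, a valid UTF-8 continuation byte.
    foreach (preg_split('/\r\n|\r|\n/', $block) as $line) {
        // Leave empty lines empty instead of emitting a bare "\x".
        $out[] = $line === ''
            ? ''
            : '\x' . implode('\x', str_split(bin2hex($line), 2));
    }
    return implode("\n", $out);
}

echo text_block_to_hex("ụ\nư\nĐ"), "\n";
```

Splitting on explicit line endings keeps the multi-byte characters intact, since none of the bytes 0x0A and 0x0D can occur inside a UTF-8 multi-byte sequence.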

Decode example: $str = str_hex_format("ụưứỲỶỴĐ"); echo $str;

\xe1\xbb\xa5\xc6\xb0\xe1\xbb\xa9\xe1\xbb\xb2\xe1\xbb\xb6\xe1\xbb\xb4\xc4\x90

echo hex2bin(str_replace('\x', "", $str));

ụưứỲỶỴĐ


For more info about escape sequence \x in double quoted strings see php manual.
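
To illustrate the point about \x in double-quoted strings, here is a small sketch. It assumes the source file itself is saved as UTF-8; the variable names are only for illustration.

```php
<?php
// In a double-quoted string, \xHH is replaced by the byte with that hex value.
$escaped = "\xe1\xbb\xa5";   // three bytes: E1 BB A5
$literal = 'ụ';              // assumes this file is saved as UTF-8

var_dump($escaped === $literal);   // bool(true)
var_dump(strlen($escaped));        // int(3), strlen() counts bytes

// In a single-quoted string the same text stays literal.
var_dump(strlen('\xe1\xbb\xa5'));  // int(12)
```

This is why the output of str_hex_format() can be pasted into a double-quoted PHP string to get the original characters back.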

#2


3  

PHP treats strings as arrays of bytes, regardless of encoding. If you don't need to delimit the UTF-8 characters, then something like this works:

$str='ụưứỲỶỴĐ';
foreach(str_split($str) as $char) // str_split() yields one byte per element
  echo '\x'.str_pad(dechex(ord($char)),2,'0',STR_PAD_LEFT); // pad each byte to two hex digits

Output:

\xe1\xbb\xa5\xc6\xb0\xe1\xbb\xa9\xe1\xbb\xb2\xe1\xbb\xb6\xe1\xbb\xb4\xc4\x90

If you need to delimit the UTF8 characters (i.e. with a newline), then you'll need something like this:

$str='ụưứỲỶỴĐ';
foreach(array_slice(preg_split('~~u',$str),1,-1) as $UTF8char){ // split before/after every UTF8 character and remove first/last empty string
  foreach(str_split($UTF8char) as $char)
    echo '\x'.str_pad(dechex(ord($char)),2,'0',STR_PAD_LEFT); // pad each byte to two hex digits
  echo "\n"; // delimiter
}

Output:

\xe1\xbb\xa5
\xc6\xb0
\xe1\xbb\xa9
\xe1\xbb\xb2
\xe1\xbb\xb6
\xe1\xbb\xb4
\xc4\x90

This splits the string into UTF-8 characters using preg_split and the u flag. Since preg_split returns an empty string before the first character and another after the last character, we need array_slice to drop the first and last elements. This can be easily modified to return an array, for example.
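
As a sketch of the array-returning variant mentioned above: the function name utf8_chars_to_hex is hypothetical, and it uses the PREG_SPLIT_NO_EMPTY flag of preg_split in place of array_slice to drop the empty strings, which is a swapped-in technique rather than the answer's own code.

```php
<?php
// Hypothetical helper: returns one \x-escaped string per UTF-8 character.
function utf8_chars_to_hex(string $str): array
{
    // '//u' splits between characters; PREG_SPLIT_NO_EMPTY drops the
    // empty strings at both ends, so no array_slice() is needed.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        // preg_split() fails with the u flag when the input is not valid UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    $result = [];
    foreach ($chars as $char) {
        $hex = '';
        foreach (str_split($char) as $byte) {
            // %02x zero-pads each byte to two lowercase hex digits.
            $hex .= sprintf('\x%02x', ord($byte));
        }
        $result[] = $hex;
    }
    return $result;
}

print_r(utf8_chars_to_hex('ụưĐ'));
```

Joining the result with implode("\n", ...) gives the newline-delimited output shown earlier.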

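Edit: Another option is json_encode(), but note that it emits \u00XX escapes (\u00e1\u00bb\u00a5...) rather than the \xXX form asked for, and that utf8_encode() is deprecated as of PHP 8.2: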

echo trim(json_encode(utf8_encode('ụưứỲỶỴĐ')),'"');

#3


1  

The main thing you need to do is to tell PHP to interpret the incoming Unicode characters correctly. Once you do that, you can then convert them to UTF-8 and then to hex as needed.

This code fragment takes your example characters as 16-bit Unicode (UCS-2, big-endian), converts them to UTF-8, and then dumps the hex representation of each character.

<?php
// Hex equivalent of "ụưứỲỶỴĐ" in Unicode
$unistr = "\x1E\xE5\x01\xB0\x1E\xE9\x1E\xF2\x1E\xF6\x1E\xF4\x01\x10";
echo " length=" . mb_strlen($unistr, 'UCS-2BE') . "\n";

// Here's the key statement, convert from Unicode 16-bit to UTF-8
$utf8str = mb_convert_encoding($unistr, "UTF-8", 'UCS-2BE');
echo $utf8str . "\n";

for($i=0; $i < mb_strlen($utf8str, 'UTF-8'); $i++) {
    $c = mb_substr($utf8str, $i, 1, 'UTF-8');
    $hex = bin2hex($c);
    echo $c . "\t" . $hex . "\t" . preg_replace("/([0-9a-f]{2})/", '\\\\x\\1', $hex) . "\n";
}

?>

Produces

length=7
ụưứỲỶỴĐ
ụ   e1bba5  \xe1\xbb\xa5
ư   c6b0    \xc6\xb0
ứ   e1bba9  \xe1\xbb\xa9
Ỳ   e1bbb2  \xe1\xbb\xb2
Ỷ   e1bbb6  \xe1\xbb\xb6
Ỵ   e1bbb4  \xe1\xbb\xb4
Đ   c490    \xc4\x90
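
The question contrasts code points (0x1EE5) with UTF-8 bytes (\xe1\xbb\xa5). To make the relationship between the two visible, here is a minimal sketch that decodes the UTF-8 bytes of one character back to its code point by hand. The function name utf8_code_point is hypothetical, and the input is assumed to be exactly one valid UTF-8 character; no validation is done.

```php
<?php
// Hypothetical helper: decodes a single, valid UTF-8 character to its code point.
function utf8_code_point(string $char): int
{
    $bytes = array_values(unpack('C*', $char)); // byte values as integers
    $n = count($bytes);
    if ($n === 1) {
        return $bytes[0]; // ASCII: the byte is the code point
    }
    // The lead byte starts with $n one-bits and a zero; mask them off.
    $cp = $bytes[0] & (0xFF >> ($n + 1));
    // Each continuation byte (10xxxxxx) contributes its low 6 bits.
    for ($i = 1; $i < $n; $i++) {
        $cp = ($cp << 6) | ($bytes[$i] & 0x3F);
    }
    return $cp;
}

foreach (['ụ', 'ư', 'Đ'] as $char) {
    // Prints the character, its code point, and its UTF-8 bytes in hex.
    printf("%s\t0x%04X\t%s\n", $char, utf8_code_point($char), bin2hex($char));
}
```

The code point is a single number per character, while the UTF-8 form is the one to four bytes that encode that number, which is why bin2hex() on the UTF-8 string gives the output the question asks for.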
