How do I decode Unicode escape sequences like "\u00ed" into properly encoded UTF-8 characters?

Time: 2021-03-19 00:19:34

Is there a function in PHP that can decode Unicode escape sequences like "\u00ed" to "í" and all other similar occurrences?

I found a similar question here, but it doesn't seem to work.

7 solutions

#1


Try this:

$str = preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/', function ($match) {
    return mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UCS-2BE');
}, $str);

If the escapes are UTF-16 based, as in C/C++/Java/JSON:

$str = preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/', function ($match) {
    return mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UTF-16BE');
}, $str);
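
One limitation of matching a single \uXXXX at a time is that characters above U+FFFF, which are written as two escapes (a surrogate pair), get converted one half at a time. Here is a minimal sketch of one way around that, assuming the mbstring extension is loaded: match a whole run of escapes and convert the run as UTF-16BE in one go. The function name decode_u_escapes is made up for this example.

```php
<?php
// Hypothetical helper: decodes runs of \uXXXX escapes, so that a surrogate
// pair such as \ud83d\ude38 reaches mb_convert_encoding() as one unit.
function decode_u_escapes($str)
{
    return preg_replace_callback(
        '/(?:\\\\u[0-9a-fA-F]{4})+/',
        function ($match) {
            // Drop every "\u" prefix, leaving only the hex digits of the run.
            $hex = str_replace('\u', '', $match[0]);
            // The hex digits are big-endian UTF-16 code units.
            return mb_convert_encoding(pack('H*', $hex), 'UTF-8', 'UTF-16BE');
        },
        $str
    );
}

echo decode_u_escapes('caf\u00e9 \ud83d\ude38'), "\n";
```

This should print "café" followed by the U+1F638 cat emoji. What happens to an unpaired surrogate depends on mbstring's substitute-character setting, so test that case if it matters to you.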

#2


print_r(json_decode('{"t":"\u00ed"}')); // -> stdClass Object ( [t] => í )
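
If you only have the bare string rather than a whole JSON document, the same trick can be wrapped in a small helper. This is a sketch, assuming PHP 7+ and that the input contains no raw double quotes or control characters; json_unescape is a name I made up. Note that other JSON escapes such as \n are interpreted as well.

```php
<?php
// Hypothetical helper: treats $escaped as the body of a JSON string literal.
// Assumes the input contains no raw double quotes or control characters.
function json_unescape(string $escaped): string
{
    $decoded = json_decode('"' . $escaped . '"');
    if (!is_string($decoded)) {
        throw new InvalidArgumentException(
            'Not a valid JSON string body: ' . json_last_error_msg()
        );
    }
    return $decoded;
}

echo json_unescape('caf\u00e9 \ud83d\ude38'), "\n";
```

Throwing on failure is a design choice here: json_decode() returns null on bad input, which is easy to miss if you echo the result directly.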

#3


PHP 7+

As of PHP 7, you can use the Unicode codepoint escape syntax to do this.

echo "\u{00ed}"; outputs í.
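
Keep in mind that this syntax is resolved when PHP parses a double-quoted literal in your source code, so it does nothing for strings that arrive at runtime. A small sketch of the difference, plus a hypothetical helper (decode_braced_escapes, not a built-in) for runtime strings in the \u{...} form, assuming PHP 7.2+ with mbstring for mb_chr():

```php
<?php
// Double-quoted literal: the compiler turns \u{00ed} into the UTF-8 bytes C3 AD.
$literal = "\u{00ed}";
// Single-quoted literal (or any string read at runtime): the 8 characters stay as typed.
$runtime = '\u{00ed}';

// Hypothetical helper for runtime strings that use the \u{...} form.
function decode_braced_escapes(string $str): string
{
    return preg_replace_callback(
        '/\\\\u\{([0-9a-fA-F]{1,6})\}/',
        function ($m) {
            $char = mb_chr(hexdec($m[1]), 'UTF-8');
            // mb_chr() returns false for invalid code points; keep the text as-is then.
            return $char === false ? $m[0] : $char;
        },
        $str
    );
}

var_dump($literal === decode_braced_escapes($runtime));
```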

#4


$str = '\u0063\u0061\u0074'.'\ud83d\ude38';
$str2 = '\u0063\u0061\u0074'.'\ud83d';

// U+1F638
var_dump(
    "cat\xF0\x9F\x98\xB8" === escape_sequence_decode($str),
    "cat\xEF\xBF\xBD" === escape_sequence_decode($str2)
);

function escape_sequence_decode($str) {

    // [U+D800 - U+DBFF][U+DC00 - U+DFFF]|[U+0000 - U+FFFF]
    $regex = '/\\\u([dD][89abAB][\da-fA-F]{2})\\\u([dD][c-fC-F][\da-fA-F]{2})
              |\\\u([\da-fA-F]{4})/sx';

    return preg_replace_callback($regex, function($matches) {

        if (isset($matches[3])) {
            $cp = hexdec($matches[3]);
        } else {
            $lead = hexdec($matches[1]);
            $trail = hexdec($matches[2]);

            // http://unicode.org/faq/utf_bom.html#utf16-4
            $cp = ($lead << 10) + $trail + 0x10000 - (0xD800 << 10) - 0xDC00;
        }

        // https://tools.ietf.org/html/rfc3629#section-3
        // Characters between U+D800 and U+DFFF are not allowed in UTF-8
        if ($cp > 0xD7FF && 0xE000 > $cp) {
            $cp = 0xFFFD;
        }

        // https://github.com/php/php-src/blob/php-5.6.4/ext/standard/html.c#L471
        // php_utf32_utf8(unsigned char *buf, unsigned k)

        if ($cp < 0x80) {
            return chr($cp);
        } else if ($cp < 0xA0) {
            return chr(0xC0 | $cp >> 6).chr(0x80 | $cp & 0x3F);
        }

        return html_entity_decode('&#'.$cp.';');
    }, $str);
}
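
The surrogate arithmetic in the middle of that function is easier to follow in isolation. Below is a sketch of the same formula, rearranged so each half is offset first; surrogates_to_codepoint is a name chosen for this example, and the int type hints assume PHP 7+.

```php
<?php
// Combine a UTF-16 surrogate pair into one code point.
// $lead is expected in D800..DBFF and $trail in DC00..DFFF.
function surrogates_to_codepoint(int $lead, int $trail): int
{
    return 0x10000 + (($lead - 0xD800) << 10) + ($trail - 0xDC00);
}

// \ud83d\ude38: (0x3D << 10) + 0x238 + 0x10000 = 0x1F638
printf("U+%X\n", surrogates_to_codepoint(0xD83D, 0xDE38)); // U+1F638
```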

#5


This is a sledgehammer approach to replacing raw Unicode escapes with HTML entities. I haven't seen any other place to put this solution, but I assume others have had this problem.

Apply this function to the raw JSON before doing anything else.

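function unicode2html($str){
    $i=65535;
    while($i>0){
        // Zero-pad to four hex digits so that e.g. 237 matches \u00ed
        $hex=sprintf('%04x',$i);
        // Case-insensitive, because escapes may use upper-case hex digits
        $str=str_ireplace("\\u$hex","&#$i;",$str);
        $i--;
    }
    return $str;
}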

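This won't take as long as you think, and it will replace any \uXXXX escape up to U+FFFF with an HTML entity. Characters above U+FFFF are written as two escapes (a surrogate pair) and are not handled correctly by this loop.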

Of course this can be reduced if you know the unicode types that are being returned in the JSON.

For example, my code was getting lots of arrow and dingbat characters. These are between 8448 and 11263. So my production code looks like:

$i=11263;
while($i>=8448){
    ...etc...

You can look up the blocks of Unicode by type here: http://unicode-table.com/en/

If you know you're translating Arabic or Telugu or whatever, you can just replace those codes, not all 65,000.

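You could apply this same sledgehammer to decode straight to UTF-8. chr() only produces single bytes, so use mb_chr() (PHP 7.2+, mbstring) instead, and skip the surrogate range 55296 to 57343, which has no UTF-8 form:

 $str=str_ireplace("\\u$hex",mb_chr($i,'UTF-8'),$str);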
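
For comparison, the same conversion can be done in a single pass with a regular expression instead of up to 65,535 replace calls. This is only a sketch of the idea; unicode_escapes_to_entities is a made-up name, and like the loop it does not merge surrogate pairs.

```php
<?php
// Hypothetical single-pass variant: each \uXXXX becomes a decimal HTML entity.
// Like the loop, it does not merge surrogate pairs.
function unicode_escapes_to_entities($str)
{
    return preg_replace_callback(
        '/\\\\u([0-9a-fA-F]{4})/',
        function ($m) {
            return '&#' . hexdec($m[1]) . ';';
        },
        $str
    );
}

echo unicode_escapes_to_entities('\u2192 \u00ed'), "\n"; // &#8594; &#237;
```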

#6


There is also a solution:
http://www.welefen.com/php-unicode-to-utf8.html

function entity2utf8onechar($unicode_c){
    $unicode_c_val = intval($unicode_c);
    $f=0x80; // 10000000
    $str = "";
    // U-00000000 - U-0000007F:   0xxxxxxx
    if($unicode_c_val <= 0x7F){
        $str = chr($unicode_c_val);
    }
    //U-00000080 - U-000007FF:  110xxxxx 10xxxxxx
    else if($unicode_c_val >= 0x80 && $unicode_c_val <= 0x7FF){
        $h=0xC0; // 11000000
        $c1 = $unicode_c_val >> 6 | $h;
        $c2 = ($unicode_c_val & 0x3F) | $f;
        $str = chr($c1).chr($c2);
    }
    //U-00000800 - U-0000FFFF:  1110xxxx 10xxxxxx 10xxxxxx
    else if($unicode_c_val >= 0x800 && $unicode_c_val <= 0xFFFF){
        $h=0xE0; // 11100000
        $c1 = $unicode_c_val >> 12 | $h;
        $c2 = (($unicode_c_val & 0xFC0) >> 6) | $f;
        $c3 = ($unicode_c_val & 0x3F) | $f;
        $str=chr($c1).chr($c2).chr($c3);
    }
    //U-00010000 - U-0010FFFF:  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    else if($unicode_c_val >= 0x10000 && $unicode_c_val <= 0x10FFFF){
        $h=0xF0; // 11110000
        $c1 = $unicode_c_val >> 18 | $h;
        $c2 = (($unicode_c_val & 0x3F000) >>12) | $f;
        $c3 = (($unicode_c_val & 0xFC0) >>6) | $f;
        $c4 = ($unicode_c_val & 0x3F) | $f;
        $str = chr($c1).chr($c2).chr($c3).chr($c4);
    }
    // RFC 3629 caps UTF-8 at U+10FFFF, so the obsolete 5- and 6-byte
    // forms are not generated; larger values yield an empty string.
    return $str;
}

function entities2utf8($unicode_c){
    // The /e modifier was removed in PHP 7, so use a callback instead.
    // Matches decimal numeric entities such as &#237;
    $unicode_c = preg_replace_callback('/&#(\d+);/', function ($m) {
        return entity2utf8onechar($m[1]);
    }, $unicode_c);
    return $unicode_c;
}
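
If you don't need to hand-roll the byte arithmetic, PHP's built-in html_entity_decode() turns numeric entities into UTF-8 as well. A short sketch with sample input of my own:

```php
<?php
// Sample input chosen for this example: é (U+00E9) and a cat emoji (U+1F638).
$entities = 'caf&#233; &#128568;';

// Ask explicitly for UTF-8 output rather than relying on the default charset.
$utf8 = html_entity_decode($entities, ENT_QUOTES | ENT_HTML5, 'UTF-8');

var_dump($utf8 === "caf\xC3\xA9 \xF0\x9F\x98\xB8");
```

Unlike the function above, this also decodes named entities such as &amp;, which may or may not be what you want.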

#7


This fixes JSON values whose escapes have lost their leading backslash: it adds the \ back before each uXXXX and then decodes the value.

$item = preg_replace_callback('/"(.+?)":"(u.+?)",/', function ($matches) {
    // Put the backslash back in front of each uXXXX sequence only,
    // leaving ordinary letter "u" characters alone
    $matches[2] = preg_replace('/u([0-9a-fA-F]{4})/', '\\\\u$1', $matches[2]);
    $matches[2] = preg_replace('/(")/', '&quot;', $matches[2]);
    $matches[2] = json_decode('"' . $matches[2] . '"');
    return '"' . $matches[1] . '":"' . $matches[2] . '",';
}, $item);
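
If the JSON is intact (the backslashes were not lost) and you just want the escapes written out as real characters, decoding and re-encoding with JSON_UNESCAPED_UNICODE may be enough. A sketch, assuming PHP 7+ and valid JSON input; json_unescape_unicode is a name made up for this example.

```php
<?php
// Assumes $json is valid JSON whose \uXXXX escapes are intact.
function json_unescape_unicode(string $json): string
{
    $data = json_decode($json);
    if (json_last_error() !== JSON_ERROR_NONE) {
        throw new InvalidArgumentException(json_last_error_msg());
    }
    // JSON_UNESCAPED_UNICODE writes multibyte characters literally.
    return json_encode($data, JSON_UNESCAPED_UNICODE);
}

echo json_unescape_unicode('{"t":"\u00ed"}'), "\n";
```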
