How can I convert bytes (UTF-8) to Unicode in PHP?

Time: 2021-01-21 20:13:22

How can I convert

\xF0\x9F\x98\x83

to

\u1F603

in php?

PS: it's an emoji (😃); I need the Unicode code point to use Twemoji.

2 solutions

#1


Interesting, there is not much out there for PHP. There seems to be a promising post, but unfortunately its accepted answer gives incorrect results in your case.

So here's a revised version of Adam's solution, rewritten in PHP.

/**
 * Translates a sequence of UTF-8 bytes to their equivalent Unicode code points.
 * Each code point is prefixed with "\u".
 *
 * @param string $utf8
 *
 * @return string
 */
function utf8_to_unicode($utf8) {
    $i = 0;
    $l = strlen($utf8);

    $out = '';

    while ($i < $l) {
        if ((ord($utf8[$i]) & 0x80) === 0x00) {
            // 0xxxxxxx
            $n = ord($utf8[$i++]);
        } elseif ((ord($utf8[$i]) & 0xE0) === 0xC0) {
            // 110xxxxx 10xxxxxx
            $n =
                ((ord($utf8[$i++]) & 0x1F) <<  6) |
                ((ord($utf8[$i++]) & 0x3F) <<  0)
            ;
        } elseif ((ord($utf8[$i]) & 0xF0) === 0xE0) {
            // 1110xxxx 10xxxxxx 10xxxxxx
            $n =
                ((ord($utf8[$i++]) & 0x0F) << 12) |
                ((ord($utf8[$i++]) & 0x3F) <<  6) |
                ((ord($utf8[$i++]) & 0x3F) <<  0)
            ;
        } elseif ((ord($utf8[$i]) & 0xF8) === 0xF0) {
            // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            $n =
                ((ord($utf8[$i++]) & 0x07) << 18) |
                ((ord($utf8[$i++]) & 0x3F) << 12) |
                ((ord($utf8[$i++]) & 0x3F) <<  6) |
                ((ord($utf8[$i++]) & 0x3F) <<  0)
            ;
        } elseif ((ord($utf8[$i]) & 0xFC) === 0xF8) {
            // 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
            $n =
                ((ord($utf8[$i++]) & 0x03) << 24) |
                ((ord($utf8[$i++]) & 0x3F) << 18) |
                ((ord($utf8[$i++]) & 0x3F) << 12) |
                ((ord($utf8[$i++]) & 0x3F) <<  6) |
                ((ord($utf8[$i++]) & 0x3F) <<  0)
            ;
        } elseif ((ord($utf8[$i]) & 0xFE) === 0xFC) {
            // 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
            $n =
                ((ord($utf8[$i++]) & 0x01) << 30) |
                ((ord($utf8[$i++]) & 0x3F) << 24) |
                ((ord($utf8[$i++]) & 0x3F) << 18) |
                ((ord($utf8[$i++]) & 0x3F) << 12) |
                ((ord($utf8[$i++]) & 0x3F) <<  6) |
                ((ord($utf8[$i++]) & 0x3F) <<  0)
            ;
        } else {
            throw new \Exception('Invalid utf-8 code point');
        }

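        // Format as uppercase hex. Values of up to four hex digits are
        // left-padded with "0" to an even number of digits
        // (e.g. "A9" stays "A9", "3A9" becomes "03A9").
        $n = strtoupper(dechex($n));
        $pad = strlen($n) <= 4 ? strlen($n) + strlen($n) % 2 : 0;
        $n = str_pad($n, $pad, '0', STR_PAD_LEFT);

        // Single quotes avoid any escape-sequence processing of "\u".
        $out .= '\u' . $n;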
    }

    return $out;
}

And in your case:

php > var_dump(utf8_to_unicode("\xF0\x9F\x98\x83"));
string(7) "\u1F603"
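
To see why the four-byte branch yields 1F603 for this input, the bit arithmetic can be worked through by hand. The following is only a minimal sketch of that one branch, with the bytes hard-coded; the variable names `$bytes` and `$codePoint` are illustrative and not part of the function above.

```php
<?php
// The four UTF-8 bytes of the emoji from the question.
$bytes = [0xF0, 0x9F, 0x98, 0x83];

// The lead byte 11110xxx contributes 3 payload bits;
// each continuation byte 10xxxxxx contributes 6 payload bits.
$codePoint =
    (($bytes[0] & 0x07) << 18) |  // 0x00000
    (($bytes[1] & 0x3F) << 12) |  // 0x1F000
    (($bytes[2] & 0x3F) <<  6) |  // 0x00600
    (($bytes[3] & 0x3F) <<  0);   // 0x00003

echo strtoupper(dechex($codePoint)), "\n"; // prints 1F603
```

The masks strip the marker bits (11110 on the lead byte, 10 on each continuation byte), and the shifts put the remaining payload bits back in order.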

#2


Use a combination of:

  1. stripcslashes() to convert the \xFF byte escapes.
    That will result in a UTF-8 string, because that is seemingly what it was originally.

  2. json_encode() to convert "😃" back to a \uFFFF Unicode escape.
    If that's what you want to end up with. (There is not enough context in your question to tell.)
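
The two steps above can be sketched as follows, assuming the input arrives as literal backslash escapes (for example, read from a text file); the variable names are illustrative. Note that for code points above U+FFFF, json_encode() emits a UTF-16 surrogate pair rather than a single \u1F603, so this may or may not be the form you need.

```php
<?php
// Literal backslash escapes (single quotes keep them unprocessed).
$escaped = '\xF0\x9F\x98\x83';

// Step 1: turn the \xNN escapes into raw bytes (4 bytes of UTF-8 here).
$utf8 = stripcslashes($escaped);

// Step 2: json_encode() escapes non-ASCII characters as \uXXXX by default.
// Characters outside the Basic Multilingual Plane become a surrogate pair.
$json = json_encode($utf8);

echo strlen($utf8), "\n"; // prints 4
echo $json, "\n";         // prints "\ud83d\ude03" (including the quotes)
```

If the single code point is required, as in the Twemoji case from the question, the surrogate pair would still have to be combined, or the decoding approach from the first answer used instead.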
