How to convert a UTF-8 string to a byte array?

Date: 2022-12-12 23:00:00

The .charCodeAt function returns the Unicode code of the character, but I would like to get the byte array instead. I know that if the char code is over 127, the character is stored in two or more bytes.

var arr=[];
for(var i=0; i<str.length; i++) {
    arr.push(str.charCodeAt(i))
}

6 answers

#1


44  

The logic of encoding Unicode in UTF-8 is basically:

  • Up to 4 bytes per character can be used. The smallest possible number of bytes is used.

  • Characters up to U+007F are encoded with a single byte.

  • For multibyte sequences, the number of leading 1 bits in the first byte gives the number of bytes for the character. The remaining bits of the first byte can be used to encode bits of the character.

  • The continuation bytes begin with 10, and the other 6 bits encode bits of the character.
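
To make the bit layout in the list above concrete, here is a minimal sketch that encodes a single code point and prints the resulting bytes in binary. The names encodeCodePoint and toBinary are hypothetical helpers made up for this illustration, and the sketch assumes the input is a valid code point that is not a surrogate.

```javascript
// Hypothetical helper: encode one Unicode code point (0..0x10FFFF) as UTF-8.
// Assumption: cp is a valid, non-surrogate code point; no validation is done.
function encodeCodePoint(cp) {
  if (cp < 0x80) {
    return [cp];                                  // 0xxxxxxx
  }
  if (cp < 0x800) {
    return [0xc0 | (cp >> 6),                     // 110xxxxx
            0x80 | (cp & 0x3f)];                  // 10xxxxxx
  }
  if (cp < 0x10000) {
    return [0xe0 | (cp >> 12),                    // 1110xxxx
            0x80 | ((cp >> 6) & 0x3f),            // 10xxxxxx
            0x80 | (cp & 0x3f)];                  // 10xxxxxx
  }
  return [0xf0 | (cp >> 18),                      // 11110xxx
          0x80 | ((cp >> 12) & 0x3f),             // 10xxxxxx
          0x80 | ((cp >> 6) & 0x3f),              // 10xxxxxx
          0x80 | (cp & 0x3f)];                    // 10xxxxxx
}

// Hypothetical helper: show each byte as 8 binary digits.
function toBinary(bytes) {
  return bytes.map(b => b.toString(2).padStart(8, "0")).join(" ");
}

console.log(toBinary(encodeCodePoint(0x41)));    // 01000001
console.log(toBinary(encodeCodePoint(0xe9)));    // 11000011 10101001
console.log(toBinary(encodeCodePoint(0x20ac)));  // 11100010 10000010 10101100
console.log(toBinary(encodeCodePoint(0x1f4a9))); // 11110000 10011111 10010010 10101001
```

The count of leading 1 bits in the first byte (0, 2, 3 or 4) matches the number of bytes in each sequence, and every following byte starts with 10.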

Here's a function I wrote a while back for encoding a JavaScript UTF-16 string in UTF-8:

function toUTF8Array(str) {
    var utf8 = [];
    for (var i=0; i < str.length; i++) {
        var charcode = str.charCodeAt(i);
        if (charcode < 0x80) utf8.push(charcode);
        else if (charcode < 0x800) {
            utf8.push(0xc0 | (charcode >> 6), 
                      0x80 | (charcode & 0x3f));
        }
        else if (charcode < 0xd800 || charcode >= 0xe000) {
            utf8.push(0xe0 | (charcode >> 12), 
                      0x80 | ((charcode>>6) & 0x3f), 
                      0x80 | (charcode & 0x3f));
        }
        // surrogate pair (assumes well-formed input: a high surrogate
        // immediately followed by a low surrogate)
        else {
            i++;
            // UTF-16 encodes 0x10000-0x10FFFF by
            // subtracting 0x10000 and splitting the
            // 20 bits of 0x0-0xFFFFF into two halves
            charcode = 0x10000 + (((charcode & 0x3ff)<<10)
                      | (str.charCodeAt(i) & 0x3ff));
            utf8.push(0xf0 | (charcode >>18), 
                      0x80 | ((charcode>>12) & 0x3f), 
                      0x80 | ((charcode>>6) & 0x3f), 
                      0x80 | (charcode & 0x3f));
        }
    }
    return utf8;
}

#2


27  

JavaScript strings are stored in UTF-16. To get UTF-8, you'll have to convert the string yourself.

One way is to combine encodeURIComponent(), which outputs the UTF-8 bytes percent-encoded, with unescape(), as mentioned on ecmanaut.

var utf8 = unescape(encodeURIComponent(str));

var arr = [];
for (var i = 0; i < utf8.length; i++) {
    arr.push(utf8.charCodeAt(i));
}
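
Since unescape() is a deprecated legacy function, the same idea can be sketched without it by reading the percent-encoded output of encodeURIComponent() directly. This is only an illustration under stated assumptions: utf8BytesViaURI is a hypothetical name, and the input must be well-formed UTF-16, because encodeURIComponent() throws a URIError on a lone surrogate.

```javascript
// Hypothetical helper: derive UTF-8 bytes from encodeURIComponent() output.
// Assumption: str contains no lone surrogates (otherwise a URIError is thrown).
function utf8BytesViaURI(str) {
  const encoded = encodeURIComponent(str);
  const bytes = [];
  for (let i = 0; i < encoded.length; i++) {
    if (encoded[i] === "%") {
      // "%E2" style escape: the two hex digits are one UTF-8 byte
      bytes.push(parseInt(encoded.slice(i + 1, i + 3), 16));
      i += 2;
    } else {
      // characters left unescaped are always ASCII, so one byte each
      bytes.push(encoded.charCodeAt(i));
    }
  }
  return bytes;
}

console.log(utf8BytesViaURI("€"));   // [226, 130, 172]
console.log(utf8BytesViaURI("a%b")); // [97, 37, 98]
```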

#3


7  

The Google Closure library has functions to convert between UTF-8 strings and byte arrays. If you don't want to use the whole library, you can copy the functions from here. For completeness, the code to convert a string to a UTF-8 byte array is:

goog.crypt.stringToUtf8ByteArray = function(str) {
  // TODO(user): Use native implementations if/when available
  var out = [], p = 0;
  for (var i = 0; i < str.length; i++) {
    var c = str.charCodeAt(i);
    if (c < 128) {
      out[p++] = c;
    } else if (c < 2048) {
      out[p++] = (c >> 6) | 192;
      out[p++] = (c & 63) | 128;
    } else if (
        ((c & 0xFC00) == 0xD800) && (i + 1) < str.length &&
        ((str.charCodeAt(i + 1) & 0xFC00) == 0xDC00)) {
      // Surrogate Pair
      c = 0x10000 + ((c & 0x03FF) << 10) + (str.charCodeAt(++i) & 0x03FF);
      out[p++] = (c >> 18) | 240;
      out[p++] = ((c >> 12) & 63) | 128;
      out[p++] = ((c >> 6) & 63) | 128;
      out[p++] = (c & 63) | 128;
    } else {
      out[p++] = (c >> 12) | 224;
      out[p++] = ((c >> 6) & 63) | 128;
      out[p++] = (c & 63) | 128;
    }
  }
  return out;
};

#4


6  

Assuming the question is about a DOMString as input and the goal is to get an array that, when interpreted as a string (e.g. written to a file on disk), would be UTF-8 encoded:

Now that nearly all modern browsers support Typed Arrays, it would be a shame if this approach were not listed:

  • According to the W3C, software supporting the File API should accept DOMStrings in its Blob constructor (see also: String encoding when constructing a Blob)

  • Blobs can be converted to an ArrayBuffer using the .readAsArrayBuffer() function of a FileReader

  • Using a DataView, or by constructing a Typed Array from the buffer read by the FileReader, one can access every single byte of the ArrayBuffer

Example:

// Create a Blob with an Euro-char (U+20AC)
var b = new Blob(['€']);
var fr = new FileReader();

fr.onload = function() {
    var ua = new Uint8Array(fr.result);
    // This will log "3|226|130|172"
    //                  E2  82  AC
    // In UTF-16, it would be only 2 bytes long
    console.log(
        fr.result.byteLength + '|' + 
        ua[0]  + '|' + 
        ua[1] + '|' + 
        ua[2] + ''
    );
};
fr.readAsArrayBuffer(b);

Try it on JSFiddle. I haven't benchmarked this yet, but I can imagine it being efficient for large DOMStrings as input.

#5


4  

The new Encoding API seems to let you both encode and decode UTF-8 easily (using typed arrays):

var encoded = new TextEncoder("utf-8").encode("Γεια σου κόσμε");
var decoded = new TextDecoder("utf-8").decode(encoded);

console.log(encoded, decoded);

Browser support isn't too bad, but at the time this answer was written it was not supported in Microsoft Edge. There's a polyfill that should work in IE11 and Edge.
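
If a plain array is wanted rather than a Uint8Array, a small wrapper like the following sketch may help. The name utf8Bytes is hypothetical. Note that in the current Encoding Standard the TextEncoder constructor takes no arguments and always emits UTF-8, and that a lone surrogate in the input is replaced by U+FFFD instead of raising an error.

```javascript
// Hypothetical helper: return the UTF-8 bytes of str as a plain Array.
function utf8Bytes(str) {
  // encode() returns a Uint8Array; Array.from copies it into a regular array
  return Array.from(new TextEncoder().encode(str));
}

console.log(utf8Bytes("€"));         // [226, 130, 172]
console.log(utf8Bytes("\u{1f4a9}")); // [240, 159, 146, 169]
console.log(utf8Bytes("\ud800"));    // lone surrogate becomes U+FFFD: [239, 191, 189]
```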

#6


2  

You can get the raw bytes of a string by using FileReader.

Save the string in a Blob and call readAsArrayBuffer(). The onload event then delivers an ArrayBuffer, which can be converted to a Uint8Array. Unfortunately, this call is asynchronous.

This little function will help you:

function stringToBytes(str)
{
    let reader = new FileReader();
    let done = () => {};

    reader.onload = event =>
    {
        done(new Uint8Array(event.target.result), str);
    };
    reader.readAsArrayBuffer(new Blob([str], { type: "application/octet-stream" }));

    return { done: callback => { done = callback; } };
}

Call it like this:

stringToBytes("\u{1f4a9}").done(bytes =>
{
    console.log(bytes);
});

Output: [240, 159, 146, 169]

Explanation:

JavaScript uses UTF-16 and surrogate pairs to store Unicode characters in memory. To save Unicode characters in a raw binary byte stream, an encoding is necessary; in most cases UTF-8 is used for this. Without an encoding you cannot save Unicode characters, only ASCII up to 0x7f.

When given a string, the Blob constructor encodes it as UTF-8; FileReader.readAsArrayBuffer() then returns those bytes unchanged.
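
As a possible alternative to the callback wrapper above, the same Blob round trip can be sketched with a Promise. This assumes a runtime that provides Blob.prototype.arrayBuffer() (current browsers and recent Node.js versions); stringToBytesAsync is a hypothetical name, not part of any API.

```javascript
// Hypothetical helper: resolve with the UTF-8 bytes of str as a Uint8Array.
// Assumption: the runtime supports Blob.prototype.arrayBuffer().
async function stringToBytesAsync(str) {
  // the Blob constructor encodes string parts as UTF-8
  const buffer = await new Blob([str]).arrayBuffer();
  return new Uint8Array(buffer);
}

stringToBytesAsync("\u{1f4a9}").then(bytes => {
  console.log(Array.from(bytes)); // [240, 159, 146, 169]
});
```

The call is still asynchronous, but the result can be consumed with await or .then() instead of a custom done() callback.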
