
Time: 2021-05-07 07:30:40

Given a byte array that is either a UTF-8 encoded string or arbitrary binary data, what approaches can be used in Java to determine which it is?


The array may be generated by code similar to:


byte[] utf8 = "Hello World".getBytes("UTF-8");

Alternatively it may have been generated by code similar to:


byte[] messageContent = new byte[256];
for (int i = 0; i < messageContent.length; i++) {
    messageContent[i] = (byte) i;
}

The key point is that we don't know what the array contains but need to find out in order to fill in the following function:


public final String getString(final byte[] dataToProcess) {
    // Determine whether dataToProcess contains arbitrary data or a UTF-8 encoded string
    // If dataToProcess contains arbitrary data then we will BASE64 encode it and return.
    // If dataToProcess contains an encoded string then we will decode it and return.
}

How would this be extended to also cover UTF-16 or other encoding mechanisms?


7 solutions



It's not possible to make that decision with full accuracy in all cases, because a UTF-8 encoded string is one kind of arbitrary binary data, but you can look for byte sequences that are invalid in UTF-8. If you find any, you know that it's not UTF-8.


If your array is large enough, this should work out well, since such sequences are very likely to appear in "random" binary data such as compressed data or image files.


However, it is possible to get valid UTF-8 data that decodes to a totally nonsensical string of characters (probably from all kinds of different scripts). This is more likely with short sequences. If you're worried about that, you might have to do a closer analysis to see whether the characters that are letters all belong to the same code chart. Then again, this may yield false negatives when you have valid text input that mixes scripts.
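The "same code chart" idea can be approximated with the JDK's own script tables. The following is only a rough sketch: the class name `ScriptMix` and the method `letterScripts` are made up for illustration, and how many scripts count as "too many" is left to the caller.

```java
import java.util.EnumSet;
import java.util.Set;

final class ScriptMix {
    // Collects the Unicode scripts of all letters in an already decoded string.
    // Digits, punctuation and whitespace are ignored because they are shared
    // between scripts.
    static Set<Character.UnicodeScript> letterScripts(String text) {
        Set<Character.UnicodeScript> scripts = EnumSet.noneOf(Character.UnicodeScript.class);
        text.codePoints()
            .filter(Character::isLetter)
            .forEach(cp -> scripts.add(Character.UnicodeScript.of(cp)));
        return scripts;
    }
}
```

A short input whose letters come from many unrelated scripts is more likely to be binary data that happens to decode cleanly, but as noted above, legitimate mixed-script text will trip the same check.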




Here's a way to use the UTF-8 "binary" regex from the W3C site


import java.io.UnsupportedEncodingException;
import java.util.regex.Pattern;

static boolean looksLikeUTF8(byte[] utf8) throws UnsupportedEncodingException {
  // Pattern.COMMENTS ignores whitespace and treats "#" to end of line as a comment,
  // so every commented line must end with a real newline ("\n").
  Pattern p = Pattern.compile("\\A(\n" +
    "  [\\x09\\x0A\\x0D\\x20-\\x7E]             # ASCII\n" +
    "| [\\xC2-\\xDF][\\x80-\\xBF]               # non-overlong 2-byte\n" +
    "|  \\xE0[\\xA0-\\xBF][\\x80-\\xBF]         # excluding overlongs\n" +
    "| [\\xE1-\\xEC\\xEE\\xEF][\\x80-\\xBF]{2}  # straight 3-byte\n" +
    "|  \\xED[\\x80-\\x9F][\\x80-\\xBF]         # excluding surrogates\n" +
    "|  \\xF0[\\x90-\\xBF][\\x80-\\xBF]{2}      # planes 1-3\n" +
    "| [\\xF1-\\xF3][\\x80-\\xBF]{3}            # planes 4-15\n" +
    "|  \\xF4[\\x80-\\x8F][\\x80-\\xBF]{2}      # plane 16\n" +
    ")*\\z", Pattern.COMMENTS);

  // Map each byte to the char with the same unsigned value so the regex can see it.
  String phonyString = new String(utf8, "ISO-8859-1");
  return p.matcher(phonyString).matches();
}

As originally written, the regex is meant to be used on a byte array, but you can't do that with Java's regexes; the target has to be something that implements the CharSequence interface (so a char[] is out, too). By decoding the byte[] as ISO-8859-1, you create a String in which each char has the same unsigned numeric value as the corresponding byte in the original array.


As others have pointed out, tests like this can only tell you the byte[] could contain UTF-8 text, not that it does. But the regex is so exhaustive, it seems extremely unlikely that raw binary data could slip past it. Even an array of all zeroes wouldn't match, since the regex never matches NUL. If the only possibilities are UTF-8 and binary, I'd be willing to trust this test.


And while you're at it, you could strip the UTF-8 BOM if there is one; otherwise, the UTF-8 CharsetDecoder will pass it through as if it were text.
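Stripping the BOM only requires a check of the first three bytes (EF BB BF). This is a minimal sketch; the class and method names are invented for the example.

```java
import java.util.Arrays;

final class Utf8Bom {
    // Returns a copy without the leading UTF-8 BOM (EF BB BF), or the
    // original array unchanged when no BOM is present.
    static byte[] stripUtf8Bom(byte[] data) {
        boolean hasBom = data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
        return hasBom ? Arrays.copyOfRange(data, 3, data.length) : data;
    }
}
```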


UTF-16 would be much more difficult, because there are very few byte sequences that are always invalid. The only ones I can think of offhand are high-surrogate characters that are missing their low-surrogate companions, or vice versa. Beyond that, you would need some context to decide whether a given sequence is valid. You might have a Cyrillic letter followed by a Chinese ideogram followed by a smiley-face dingbat, but it would be perfectly valid UTF-16.
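The surrogate-pairing check described above could look roughly like this. The sketch assumes big-endian UTF-16 without a BOM, and the class and method names are hypothetical. Passing this check says very little, since almost every even-length byte array will pass.

```java
final class Utf16SurrogateCheck {
    // Reads the big-endian 16-bit code unit that starts at byte offset i.
    private static char unitAt(byte[] data, int i) {
        return (char) (((data[i] & 0xFF) << 8) | (data[i + 1] & 0xFF));
    }

    // True when the array has an even length and every high surrogate is
    // immediately followed by a low surrogate (and no low surrogate stands alone).
    static boolean hasOnlyPairedSurrogates(byte[] data) {
        if (data.length % 2 != 0) {
            return false;
        }
        for (int i = 0; i < data.length; i += 2) {
            char unit = unitAt(data, i);
            if (Character.isHighSurrogate(unit)) {
                if (i + 3 >= data.length || !Character.isLowSurrogate(unitAt(data, i + 2))) {
                    return false; // high surrogate without its companion
                }
                i += 2; // skip the low surrogate that completes the pair
            } else if (Character.isLowSurrogate(unit)) {
                return false; // low surrogate with no preceding high surrogate
            }
        }
        return true;
    }
}
```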




The question assumes that there is a fundamental difference between a string and binary data. While this is intuitively so, it is next to impossible to define precisely what that difference is.


A Java String is a sequence of 16-bit quantities, each corresponding to one of the (almost) 2**16 Unicode basic code points. But if you look at those 16-bit 'characters', each one could equally represent an integer, a pair of bytes, a pixel, and so on. Nothing intrinsic to the bit patterns says what they represent.


Now suppose that you rephrased your question as asking for a way to distinguish UTF-8 encoded TEXT from arbitrary binary data. Does this help? In theory no, because the bit patterns that encode any written text can also be a sequence of numbers. (It is hard to say what "arbitrary" really means here. Can you tell me how to test if a number is "arbitrary"?)


The best we can do here is the following:


  1. Test if the bytes are a valid UTF-8 encoding.
  2. Test if the decoded 16-bit quantities are all legal, "assigned" Unicode code points. (Some 16-bit quantities are illegal (e.g. 0xffff) and others are not currently assigned to correspond to any character.) But what if a text document really uses an unassigned code point?
  3. Test if the Unicode code points belong to the "planes" that you expect based on the assumed language of the document. But what if you don't know what language to expect, or if the document uses multiple languages?
  4. Test if the sequences of code points look like words, sentences, or whatever. But what if we had some "binary data" that happened to include embedded text sequences?

In summary, you can tell that a byte sequence is definitely not UTF-8 if the decode fails. Beyond that, if you make assumptions about language, you can say that a byte sequence is probably or probably not a UTF-8 encoded text document.
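As a rough illustration of the "assigned code point" test from the list above, the JDK exposes `Character.isDefined`. The class and method names below are invented, and the answer depends on the Unicode version bundled with the running JDK, so treat it as a heuristic only.

```java
final class AssignedCodePoints {
    // True when every code point in the decoded text is assigned in the
    // Unicode version known to this JDK.
    static boolean allAssigned(String text) {
        return text.codePoints().allMatch(Character::isDefined);
    }
}
```

This runs on text that has already been decoded, so it belongs after the validity check, not in place of it.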


IMO, the best thing you can do is to avoid getting into a situation where your program needs to make this decision. And if you cannot avoid it, recognize that your program may get it wrong. With thought and hard work, you can make that unlikely, but the probability will never be zero.




If the byte array begins with a Byte Order Mark (BOM) then it will be easy to distinguish what encoding has been used. The standard Java classes for processing text streams will probably deal with this for you automatically.


If you do not have a BOM in your byte data this will be substantially more difficult — .NET classes can perform statistical analysis to try and work out the encoding, but I think this is on the assumption that you know that you are dealing with text data (just don't know which encoding was used).


If you have any control over the format for your input data your best choice would be to ensure that it contains a Byte Order Mark.
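A BOM check can be sketched in a few lines. The names `BomSniffer` and `charsetFromBom` are hypothetical, and the sketch deliberately ignores UTF-32 (whose little-endian BOM, FF FE 00 00, begins with the UTF-16LE BOM), so it is a starting point, not a complete detector.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomSniffer {
    // Returns the charset indicated by a leading BOM, or null when none is found.
    static Charset charsetFromBom(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(data, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        if (startsWith(data, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    // Compares the first bytes of data, as unsigned values, with the prefix.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```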




In the original question: How can I check whether a byte array contains a Unicode string in Java?; I found that the term Java Unicode is essentially referring to Utf16 Code Units. I went through this problem myself and created some code that could help anyone with this type of question on their mind find some answers.


I have created two main methods: one displays UTF-8 code units and the other displays UTF-16 code units. UTF-16 code units are what you will encounter in Java and JavaScript, commonly seen in the form "\ud83d".


For more help with Code Units and conversion try the website;




Here is code...


import java.nio.charset.StandardCharsets;

    // `text` is the input being inspected; it is defined elsewhere.
    // The charset is given explicitly so the bytes really are UTF-8 code units.
    byte[] array_bytes = text.toString().getBytes(StandardCharsets.UTF_8);
    char[] array_chars = text.toString().toCharArray();

// Prints every byte (UTF-8 code unit) of the array in hex.
public static void byteArrayToUtf8CodeUnits(byte[] byte_array) {
    System.out.println("array.length: = " + byte_array.length);
    for (int k = 0; k < byte_array.length; k++) {
        System.out.println("array byte: " + "[" + k + "]" + " converted to hex" + " = " + byteToHex(byte_array[k]));
    }
}

// Prints every char in hex. Utf16 code units are also known as Java Unicode.
public static void charArrayToUtf16CodeUnits(char[] char_array) {
    System.out.println("array.length: = " + char_array.length);
    for (int i = 0; i < char_array.length; i++) {
        System.out.println("array char: " + "[" + i + "]" + " converted to hex" + " = " + charToHex(char_array[i]));
    }
}

// Returns hex String representation of byte b
static public String byteToHex(byte b) {
    char hexDigit[] = {
                    '0', '1', '2', '3', '4', '5', '6', '7',
                    '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'
    };
    char[] array = { hexDigit[(b >> 4) & 0x0f], hexDigit[b & 0x0f] };
    return new String(array);
}

// Returns hex String representation of char c
static public String charToHex(char c) {
    byte hi = (byte) (c >>> 8);
    byte lo = (byte) (c & 0xff);

    return byteToHex(hi) + byteToHex(lo);
}



Try decoding it. If you do not get any errors, then it is a valid UTF-8 string.
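One caveat when trying this in Java: `new String(bytes, "UTF-8")` does not raise an error on malformed input; it silently substitutes U+FFFD. To actually get an error you need a `CharsetDecoder` that reports problems. The sketch below fills in the `getString` function from the question along those lines; the class name and the `decodeStrictUtf8` helper are made up for the example, and it treats "decodes cleanly" as "is text", which the other answers explain is only a heuristic.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class Utf8OrBinary {
    // Decodes strictly; returns null when the bytes are not well-formed UTF-8.
    static String decodeStrictUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(data)).toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    // Returns the decoded text when the input is valid UTF-8,
    // otherwise the Base64 encoding of the raw bytes.
    static String getString(final byte[] dataToProcess) {
        String text = decodeStrictUtf8(dataToProcess);
        return text != null ? text : Base64.getEncoder().encodeToString(dataToProcess);
    }
}
```

With the two arrays from the question, the "Hello World" bytes decode to the original string, while the 0..255 array fails at the first lone continuation byte (0x80) and is returned Base64-encoded.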




I think Michael has explained it well in his answer; this may be the only way to find out whether a byte array contains only valid UTF-8 sequences. I am using the following code in PHP:


function is_utf8($string) {

    return preg_match('%^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
    )*$%xs', $string);
}


Taken from W3.org.




