How can I check whether a byte array contains a Unicode string in Java?

Date: 2021-05-07 07:30:40

Given a byte array that is either a UTF-8 encoded string or arbitrary binary data, what approaches can be used in Java to determine which it is?


The array may be generated by code similar to:


byte[] utf8 = "Hello World".getBytes("UTF-8");

Alternatively it may have been generated by code similar to:


byte[] messageContent = new byte[256];
for (int i = 0; i < messageContent.length; i++) {
    messageContent[i] = (byte) i;
}

The key point is that we don't know what the array contains but need to find out in order to fill in the following function:


public final String getString(final byte[] dataToProcess) {
    // Determine whether dataToProcess contains arbitrary data or a UTF-8 encoded string
    // If dataToProcess contains arbitrary data then we will BASE64 encode it and return.
    // If dataToProcess contains an encoded string then we will decode it and return.
}
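As a sketch of how this skeleton could be filled in (assuming "decodes cleanly as UTF-8" is an acceptable stand-in for "is a string", which the answers below discuss in detail), one could attempt a strict decode and fall back to Base64. The class name `StringSniffer` is hypothetical:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public final class StringSniffer {
    public static String getString(final byte[] dataToProcess) {
        // A strict decoder throws instead of silently substituting
        // U+FFFD for malformed input.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(dataToProcess)).toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: treat as arbitrary binary and Base64-encode it.
            return Base64.getEncoder().encodeToString(dataToProcess);
        }
    }
}
```

Note that this misclassifies binary data that happens to be valid UTF-8, which is exactly the caveat the answers below explore.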

How would this be extended to also cover UTF-16 or other encoding mechanisms?


7 Answers

#1


10  

It's not possible to make that decision with full accuracy in all cases, because a UTF-8 encoded string is one kind of arbitrary binary data, but you can look for byte sequences that are invalid in UTF-8. If you find any, you know that it's not UTF-8.


If your array is large enough, this should work out well, since such sequences are very likely to appear in "random" binary data such as compressed data or image files.


However, it is possible to get valid UTF-8 data that decodes to a totally nonsensical string of characters (probably from all kinds of different scripts). This is more likely with short sequences. If you're worried about that, you might have to do a closer analysis to see whether the characters that are letters all belong to the same code chart. Then again, this may yield false negatives when you have valid text input that mixes scripts.

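The "closer analysis" suggested above can be sketched with `Character.UnicodeScript` from the standard library (the class name `ScriptCheck` is hypothetical). A decoded string whose letters span many unrelated scripts is a hint, not proof, that the bytes were not really text:

```java
import java.util.HashSet;
import java.util.Set;

public final class ScriptCheck {
    // Collects the Unicode scripts of all letters in the string.
    public static Set<Character.UnicodeScript> letterScripts(String s) {
        Set<Character.UnicodeScript> scripts = new HashSet<>();
        s.codePoints()
         .filter(Character::isLetter)
         .forEach(cp -> scripts.add(Character.UnicodeScript.of(cp)));
        return scripts;
    }
}
```

A caller could then flag strings whose letter scripts exceed some small threshold, at the cost of false negatives on legitimately mixed-script text.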

#2


4  

Here's a way to use the UTF-8 "binary" regex from the W3C site:


static boolean looksLikeUTF8(byte[] utf8) throws UnsupportedEncodingException 
{
  Pattern p = Pattern.compile("\\A(\n" +
    "  [\\x09\\x0A\\x0D\\x20-\\x7E]             # ASCII\\n" +
    "| [\\xC2-\\xDF][\\x80-\\xBF]               # non-overlong 2-byte\n" +
    "|  \\xE0[\\xA0-\\xBF][\\x80-\\xBF]         # excluding overlongs\n" +
    "| [\\xE1-\\xEC\\xEE\\xEF][\\x80-\\xBF]{2}  # straight 3-byte\n" +
    "|  \\xED[\\x80-\\x9F][\\x80-\\xBF]         # excluding surrogates\n" +
    "|  \\xF0[\\x90-\\xBF][\\x80-\\xBF]{2}      # planes 1-3\n" +
    "| [\\xF1-\\xF3][\\x80-\\xBF]{3}            # planes 4-15\n" +
    "|  \\xF4[\\x80-\\x8F][\\x80-\\xBF]{2}      # plane 16\n" +
    ")*\\z", Pattern.COMMENTS);

  String phonyString = new String(utf8, "ISO-8859-1");
  return p.matcher(phonyString).matches();
}

As originally written, the regex is meant to be used on a byte array, but you can't do that with Java's regexes; the target has to be something that implements the CharSequence interface (so a char[] is out, too). By decoding the byte[] as ISO-8859-1, you create a String in which each char has the same unsigned numeric value as the corresponding byte in the original array.


As others have pointed out, tests like this can only tell you the byte[] could contain UTF-8 text, not that it does. But the regex is so exhaustive, it seems extremely unlikely that raw binary data could slip past it. Even an array of all zeroes wouldn't match, since the regex never matches NUL. If the only possibilities are UTF-8 and binary, I'd be willing to trust this test.


And while you're at it, you could strip the UTF-8 BOM if there is one; otherwise, the UTF-8 CharsetDecoder will pass it through as if it were text.

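Stripping the UTF-8 BOM mentioned above is a three-byte check (the helper class `BomStripper` is hypothetical):

```java
import java.util.Arrays;

public final class BomStripper {
    // The UTF-8 BOM is the byte sequence EF BB BF.
    public static byte[] stripUtf8Bom(byte[] input) {
        if (input.length >= 3
                && input[0] == (byte) 0xEF
                && input[1] == (byte) 0xBB
                && input[2] == (byte) 0xBF) {
            return Arrays.copyOfRange(input, 3, input.length);
        }
        return input;
    }
}
```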

UTF-16 would be much more difficult, because there are very few byte sequences that are always invalid. The only ones I can think of offhand are high-surrogate characters that are missing their low-surrogate companions, or vice versa. Beyond that, you would need some context to decide whether a given sequence is valid. You might have a Cyrillic letter followed by a Chinese ideogram followed by a smiley-face dingbat, but it would be perfectly valid UTF-16.

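The one structural error UTF-16 does allow, an unpaired surrogate, is easy to check once the bytes are decoded into chars (the class name `Utf16Check` is hypothetical):

```java
public final class Utf16Check {
    // Returns true if the sequence contains no unpaired surrogates,
    // the only structural error possible in UTF-16 code units.
    public static boolean hasNoLoneSurrogates(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false;
                }
                i++; // skip the paired low surrogate
            } else if (Character.isLowSurrogate(c)) {
                return false; // low surrogate with no preceding high surrogate
            }
        }
        return true;
    }
}
```

As the answer notes, passing this check only means the data is structurally valid UTF-16, not that it is meaningful text.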

#3


3  

The question assumes that there is a fundamental difference between a string and binary data. While this is intuitively so, it is next to impossible to define precisely what that difference is.


A Java String is a sequence of 16 bit quantities that correspond to one of the (almost) 2**16 Unicode basic codepoints. But if you look at those 16 bit 'characters', each one could equally represent an integer, a pair of bytes, a pixel, and so on. The bit patterns don't have anything intrinsic about them that says what they represent.


Now suppose that you rephrased your question as asking for a way to distinguish UTF-8 encoded TEXT from arbitrary binary data. Does this help? In theory no, because the bit patterns that encode any written text can also be a sequence of numbers. (It is hard to say what "arbitrary" really means here. Can you tell me how to test if a number is "arbitrary"?)


The best we can do here is the following:


  1. Test if the bytes are a valid UTF-8 encoding.
  2. Test if the decoded 16-bit quantities are all legal, "assigned" Unicode codepoints. (Some 16-bit quantities are illegal (e.g. 0xffff) and others are not currently assigned to any character.) But what if a text document really uses an unassigned codepoint?
  3. Test if the Unicode codepoints belong to the "planes" that you expect based on the assumed language of the document. But what if you don't know what language to expect, or if a document uses multiple languages?
  4. Test if the sequences of codepoints look like words, sentences, or whatever. But what if we had some "binary data" that happened to include embedded text sequences?

In summary, you can tell that a byte sequence is definitely not UTF-8 if the decode fails. Beyond that, if you make assumptions about language, you can say that a byte sequence is probably or probably not a UTF-8 encoded text document.


IMO, the best thing you can do is to avoid getting into a situation where your program needs to make this decision. And if you cannot avoid it, recognize that your program may get it wrong. With thought and hard work, you can make that unlikely, but the probability will never be zero.


#4


0  

If the byte array begins with a Byte Order Mark (BOM) then it will be easy to distinguish what encoding has been used. The standard Java classes for processing text streams will probably deal with this for you automatically.


If you do not have a BOM in your byte data this will be substantially more difficult — .NET classes can perform statistical analysis to try and work out the encoding, but I think this is on the assumption that you know that you are dealing with text data (just don't know which encoding was used).


If you have any control over the format for your input data your best choice would be to ensure that it contains a Byte Order Mark.

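BOM-based detection, as described above, is a prefix check on a few well-known byte sequences (the helper class `BomSniffer` is hypothetical; only the three most common BOMs are covered):

```java
public final class BomSniffer {
    // Returns the charset name suggested by a leading byte-order mark,
    // or null if no recognized BOM is present.
    public static String charsetFromBom(byte[] b) {
        if (b.length >= 3 && b[0] == (byte) 0xEF && b[1] == (byte) 0xBB && b[2] == (byte) 0xBF) {
            return "UTF-8";
        }
        if (b.length >= 2 && b[0] == (byte) 0xFE && b[1] == (byte) 0xFF) {
            return "UTF-16BE";
        }
        if (b.length >= 2 && b[0] == (byte) 0xFF && b[1] == (byte) 0xFE) {
            return "UTF-16LE";
        }
        return null; // no BOM: fall back to content-based heuristics
    }
}
```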

#5


0  

In the original question, "How can I check whether a byte array contains a Unicode string in Java?", the term Java Unicode essentially refers to UTF-16 code units. I went through this problem myself and created some code that could help anyone with this type of question find some answers.


I have created two main methods: one displays UTF-8 code units and the other UTF-16 code units. UTF-16 code units are what you will encounter with Java and JavaScript, commonly seen in the form "\ud83d".


For more help with code units and conversion, try this website:


https://r12a.github.io/apps/conversion/


Here is code...


    byte[] array_bytes = text.toString().getBytes();
    char[] array_chars = text.toString().toCharArray();
    System.out.println();
    byteArrayToUtf8CodeUnits(array_bytes);
    System.out.println();
    charArrayToUtf16CodeUnits(array_chars);


public static void byteArrayToUtf8CodeUnits(byte[] byte_array)
{
    System.out.println("array.length: = " + byte_array.length);
    for (int k = 0; k < byte_array.length; k++)
    {
        System.out.println("array byte: [" + k + "] converted to hex = " + byteToHex(byte_array[k]));
    }
}

public static void charArrayToUtf16CodeUnits(char[] char_array)
{
    /* UTF-16 code units are what Java chars hold */
    System.out.println("array.length: = " + char_array.length);
    for (int i = 0; i < char_array.length; i++)
    {
        System.out.println("array char: [" + i + "] converted to hex = " + charToHex(char_array[i]));
    }
}

public static String byteToHex(byte b)
{
    // Returns the two-digit hex representation of byte b
    char[] hexDigit =
    {
        '0', '1', '2', '3', '4', '5', '6', '7',
        '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'
    };
    char[] array = { hexDigit[(b >> 4) & 0x0f], hexDigit[b & 0x0f] };
    return new String(array);
}

public static String charToHex(char c)
{
    // Returns the four-digit hex representation of char c
    byte hi = (byte) (c >>> 8);
    byte lo = (byte) (c & 0xff);
    return byteToHex(hi) + byteToHex(lo);
}

#6


-1  

Try decoding it. If you do not get any errors, then it is a valid UTF-8 string. Note, however, that by default new String(bytes, "UTF-8") silently replaces malformed input with U+FFFD rather than failing; to actually get errors, use a CharsetDecoder configured with CodingErrorAction.REPORT.


#7


-1  

I think Michael has explained it well in his answer; this may be the only way to find out whether a byte array contains all valid UTF-8 sequences. I am using the following code in PHP:


function is_utf8($string) {

    return preg_match('%^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
    )*$%xs', $string);

} 

Taken from W3.org.

