Why does split in Java 8 sometimes remove empty strings at the start of the result array?

Time: 2021-10-25 17:20:12

Before Java 8, when we split on an empty string like

String[] tokens = "abc".split("");

the split mechanism would split at the places marked with |

|a|b|c|

because the empty string "" exists before and after each character. As a result, it would first generate this array

["", "a", "b", "c", ""]

and would later remove the trailing empty strings (because we didn't explicitly pass a negative value as the limit argument), so it would finally return

["", "a", "b", "c"]
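
The role of the limit argument in dropping trailing empty strings can be checked in isolation. The following is a minimal sketch; the input "a,b,," and the class name LimitDemo are assumptions chosen for illustration, and a comma delimiter is used because a positive-width delimiter is not affected by the Java 8 change discussed here.

```java
import java.util.Arrays;

// Illustrative class name; not part of the original question.
class LimitDemo {
    public static void main(String[] args) {
        String input = "a,b,,";

        // Default limit (0): trailing empty strings are removed.
        System.out.println(Arrays.toString(input.split(",")));      // [a, b]

        // Negative limit: trailing empty strings are kept.
        System.out.println(Arrays.toString(input.split(",", -1)));  // [a, b, , ]
    }
}
```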

In Java 8 the split mechanism seems to have changed. Now when we use

"abc".split("")

we get the array ["a", "b", "c"] instead of ["", "a", "b", "c"], so it looks like empty strings at the start are also removed. But this theory fails because, for instance,

"abc".split("a")

returns an array with an empty string at the start: ["", "bc"].

Can someone explain what is going on here and how the rules of split for these cases have changed in Java 8?

3 Answers

#1


69  

The behavior of String.split (which calls Pattern.split) changed between Java 7 and Java 8.

Documentation

Comparing the documentation of Pattern.split in Java 7 and Java 8, we observe that the following clause was added:

When there is a positive-width match at the beginning of the input sequence then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.

The same clause was also added to the documentation of String.split in Java 8, compared to Java 7.

Reference implementation

Let us compare the code of Pattern.split in the reference implementation of Java 7 and Java 8. The code was retrieved from grepcode, for versions 7u40-b43 and 8-b132.

Java 7

public String[] split(CharSequence input, int limit) {
    int index = 0;
    boolean matchLimited = limit > 0;
    ArrayList<String> matchList = new ArrayList<>();
    Matcher m = matcher(input);

    // Add segments before each match found
    while(m.find()) {
        if (!matchLimited || matchList.size() < limit - 1) {
            String match = input.subSequence(index, m.start()).toString();
            matchList.add(match);
            index = m.end();
        } else if (matchList.size() == limit - 1) { // last one
            String match = input.subSequence(index,
                                             input.length()).toString();
            matchList.add(match);
            index = m.end();
        }
    }

    // If no match was found, return this
    if (index == 0)
        return new String[] {input.toString()};

    // Add remaining segment
    if (!matchLimited || matchList.size() < limit)
        matchList.add(input.subSequence(index, input.length()).toString());

    // Construct result
    int resultSize = matchList.size();
    if (limit == 0)
        while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
            resultSize--;
    String[] result = new String[resultSize];
    return matchList.subList(0, resultSize).toArray(result);
}

Java 8

public String[] split(CharSequence input, int limit) {
    int index = 0;
    boolean matchLimited = limit > 0;
    ArrayList<String> matchList = new ArrayList<>();
    Matcher m = matcher(input);

    // Add segments before each match found
    while(m.find()) {
        if (!matchLimited || matchList.size() < limit - 1) {
            if (index == 0 && index == m.start() && m.start() == m.end()) {
                // no empty leading substring included for zero-width match
                // at the beginning of the input char sequence.
                continue;
            }
            String match = input.subSequence(index, m.start()).toString();
            matchList.add(match);
            index = m.end();
        } else if (matchList.size() == limit - 1) { // last one
            String match = input.subSequence(index,
                                             input.length()).toString();
            matchList.add(match);
            index = m.end();
        }
    }

    // If no match was found, return this
    if (index == 0)
        return new String[] {input.toString()};

    // Add remaining segment
    if (!matchLimited || matchList.size() < limit)
        matchList.add(input.subSequence(index, input.length()).toString());

    // Construct result
    int resultSize = matchList.size();
    if (limit == 0)
        while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
            resultSize--;
    String[] result = new String[resultSize];
    return matchList.subList(0, resultSize).toArray(result);
}

The addition of the following code in Java 8 excludes the zero-length match at the beginning of the input string, which explains the behavior above.

            if (index == 0 && index == m.start() && m.start() == m.end()) {
                // no empty leading substring included for zero-width match
                // at the beginning of the input char sequence.
                continue;
            }
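
To compare the two behaviors on a single JDK, the loop above can be approximated by a small stand-alone method. This is only a sketch under stated assumptions: SplitSketch, its split method, and the java8Rule flag are hypothetical names, it covers only the limit == 0 case, and it is a simplified imitation rather than the JDK code itself.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; imitates Pattern.split for limit == 0 only.
class SplitSketch {
    static List<String> split(String regex, String input, boolean java8Rule) {
        List<String> parts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        int index = 0;

        while (m.find()) {
            // The check added in Java 8: skip a zero-width match at position 0.
            if (java8Rule && index == 0 && m.start() == 0 && m.end() == 0) {
                continue;
            }
            parts.add(input.substring(index, m.start()));
            index = m.end();
        }

        // Same "no match" test as the reference code.
        if (index == 0) {
            return Collections.singletonList(input);
        }

        parts.add(input.substring(index));

        // limit == 0: drop trailing empty strings.
        int size = parts.size();
        while (size > 0 && parts.get(size - 1).isEmpty()) {
            size--;
        }
        return new ArrayList<>(parts.subList(0, size));
    }

    public static void main(String[] args) {
        System.out.println(split("", "abc", false)); // [, a, b, c]  (Java 7 style)
        System.out.println(split("", "abc", true));  // [a, b, c]    (Java 8 style)
        System.out.println(split("a", "abc", true)); // [, bc]       (positive-width match kept)
    }
}
```

With the flag off, the zero-width match at index 0 contributes the leading empty string; with the flag on, that match is skipped, while the positive-width match on "a" still yields the leading empty string.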

Maintaining compatibility

Following behavior in Java 8 and above

To make split behave consistently across versions and compatible with the behavior in Java 8:

  1. If your regex can match a zero-length string, add (?!\A) at the end of the regex and wrap the original regex in a non-capturing group (?:...) (if necessary).
  2. If your regex can't match a zero-length string, you don't need to do anything.
  3. If you don't know whether the regex can match a zero-length string, do both of the actions in step 1.

(?!\A) checks that the match does not end at the beginning of the string. A match that ends at the beginning of the string is necessarily an empty match at the beginning of the string, so exactly that match is rejected.
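
As a rough illustration of step 1, the wrapping can be done by a one-line helper. SplitCompat and portable are hypothetical names introduced here, and the sketch assumes the regex is a complete pattern that can safely be placed inside a group. The expected results in the comments are for Java 8 and later; following the reasoning in this answer, Java 7 should give the same arrays because the regex no longer matches at index 0 with zero width.

```java
import java.util.Arrays;

// Hypothetical helper class for illustration.
class SplitCompat {
    // Wraps the regex in a non-capturing group and forbids a match
    // that ends at the very beginning of the input.
    static String portable(String regex) {
        return "(?:" + regex + ")(?!\\A)";
    }

    public static void main(String[] args) {
        // Zero-width regex: no leading empty string.
        System.out.println(Arrays.toString("abc".split(portable(""))));  // [a, b, c]

        // Positive-width match at the start: leading empty string is kept.
        System.out.println(Arrays.toString("abc".split(portable("a")))); // [, bc]
    }
}
```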

Following behavior in Java 7 and prior

There is no general solution to make split backward-compatible with Java 7 and prior, short of replacing all instances of split with calls to your own custom implementation.

#2


29  

This has been specified in the documentation of split(String regex, int limit).

When there is a positive-width match at the beginning of this string then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.

In "abc".split("") you get a zero-width match at the beginning, so the leading empty substring is not included in the resulting array.

However, in your second snippet, when you split on "a", you get a positive-width match (of width 1 in this case), so the empty leading substring is included as expected.
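
Both cases can be observed directly through the standard API. This is a small sketch that assumes it runs on Java 8 or later; the class name LeadingMatchDemo and the extra input "aabc" are illustrative additions, not part of the question.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Illustrative class name; assumes a Java 8+ runtime.
class LeadingMatchDemo {
    public static void main(String[] args) {
        // Zero-width match at the beginning: no leading empty string.
        System.out.println(Arrays.toString("abc".split("")));                   // [a, b, c]
        System.out.println(Arrays.toString(Pattern.compile("").split("abc")));  // [a, b, c]

        // Positive-width match at the beginning: leading empty string is included.
        System.out.println(Arrays.toString("abc".split("a")));                  // [, bc]

        // Two positive-width matches in a row: both leading empty strings are kept.
        System.out.println(Arrays.toString("aabc".split("a")));                 // [, , bc]
    }
}
```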

(Removed irrelevant source code)

#3


12  

There was a slight change in the docs for split() from Java 7 to Java 8. Specifically, the following statement was added:

When there is a positive-width match at the beginning of this string then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.

(emphasis mine)

Splitting on the empty string generates a zero-width match at the beginning, so an empty string is not included at the start of the resulting array, in accordance with what is specified above. By contrast, your second example, which splits on "a", generates a positive-width match at the start of the string, so an empty string is in fact included at the start of the resulting array.
