Java String.split() sometimes gives blank strings.

Time: 2022-09-29 16:06:43

I'm making a text-based dice roller. It takes in strings like "2d10+5" and returns a string as a result of the roll(s). My problem is showing up in the tokenizer that splits the string into useful parts for me to parse into information.

String[] tokens = message.split("(?=[dk\\+\\-])");

This is yielding strange, unexpected results. I don't know exactly what is causing them. It could be the regex, my misunderstanding, or Java just being Java. Here's what's happening:

  • 3d6+4 yields the string array [3, d6, +4]. This is correct.
  • d% yields the string array [d%]. This is correct.
  • d20 yields the string array [d20]. This is correct.
  • d%+3 yields the string array [, d%, +3]. This is incorrect.
  • d20+2 yields the string array [, d20, +2]. This is incorrect.

In the fourth and fifth examples, something strange is causing an extra empty string to appear at the front of the array. It's not the lack of a number at the front of the string, as the second and third examples disprove that. It's not the presence of the percentage sign, nor the plus sign.

For now I'm just continuing through the for loop on blank strings, but that feels sorta like a band-aid solution. Does anyone have any idea what causes the blank string at the front of the array? How can I fix it?

3 solutions

#1


13  

Digging through the source code, I got the exact issue behind this behaviour.

The String.split() method internally uses Pattern.split(). Before returning the resulting array, that method checks the end index of the last match. If it is 0, then either your pattern matched only an empty string at the beginning of the input or it did not match at all; in both cases the returned array has a single element, the original input string.

Here's the source code:

public String[] split(CharSequence input, int limit) {
        int index = 0;
        boolean matchLimited = limit > 0;
        ArrayList<String> matchList = new ArrayList<String>();
        Matcher m = matcher(input);

        // Add segments before each match found
        while(m.find()) {
            if (!matchLimited || matchList.size() < limit - 1) {
                String match = input.subSequence(index, m.start()).toString();
                matchList.add(match);

                // Consider this assignment. For a single empty string match
                // m.end() will be 0, and hence index will also be 0
                index = m.end();
            } else if (matchList.size() == limit - 1) { // last one
                String match = input.subSequence(index,
                                                 input.length()).toString();
                matchList.add(match);
                index = m.end();
            }
        }

        // If no match was found, return this
        if (index == 0)
            return new String[] {input.toString()};

        // Rest of them is not required

If the last condition in the above code, index == 0, is true, then a single-element array containing the input string is returned.

Now, consider the cases when the index can be 0.

  1. When there is no match at all (as the comment above that condition already says).
  2. When a match is found at the beginning and the matched string has length 0. Then the assignment in the if block (inside the while loop),

    index = m.end();

    leaves index at 0. The only string that can match this way is the empty string (length 0), which is exactly the case here. There must also be no further matches, otherwise index would be updated to a different value.
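The two cases above can be reproduced with a small re-implementation of the quoted logic. This is only a sketch: legacySplit is a hypothetical helper, not a JDK method, it handles only the limit == 0 case, and the part after the index == 0 check (adding the tail and dropping trailing empty strings) is an assumption about the code the excerpt leaves out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LegacySplitSketch {
    /** Mimics the quoted split logic for limit == 0 (hypothetical helper, not a JDK method). */
    static List<String> legacySplit(String regex, String input) {
        int index = 0;
        List<String> parts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);

        // Add the segment before each match, as in the quoted loop.
        while (m.find()) {
            parts.add(input.substring(index, m.start()));
            index = m.end();
        }

        // No match, or only a zero-width match at position 0: return the input unchanged.
        if (index == 0) {
            return Collections.singletonList(input);
        }

        // Assumed remainder of the method: add the tail, then drop trailing empty strings.
        parts.add(input.substring(index));
        int size = parts.size();
        while (size > 0 && parts.get(size - 1).isEmpty()) {
            size--;
        }
        return parts.subList(0, size);
    }

    public static void main(String[] args) {
        String regex = "(?=[dk+-])";
        System.out.println(legacySplit(regex, "d%"));     // [d%]
        System.out.println(legacySplit(regex, "d20+2"));  // [, d20, +2]
        System.out.println(legacySplit(regex, "3d6+4"));  // [3, d6, +4]
    }
}
```

For d% the only match ends at index 0, so the early return fires. For d20+2 the second match moves index to 3, so the list, including the leading empty string, is returned.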

So, considering your cases:

  • For d%, there is just a single match for the pattern, before the first d, so index stays 0. Since there are no further matches, index is never updated, the if condition is true, and the single-element array containing the original string is returned.

  • For d20+2 there are two matches, one before d and one before +. The second match updates index to a non-zero value, so the ArrayList built in the code above is returned. It contains the empty string produced by splitting on a delimiter at the very start of the string, as already explained in @Stema's answer.
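To see where the pattern actually matches in these inputs, you can list the match positions directly. This is a minimal sketch; the class and method names are made up for illustration, and the pattern is the lookahead from the question written without the unnecessary escapes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitMatchPositions {
    // Same lookahead as in the question; + and a trailing - need no escaping inside [].
    private static final Pattern DELIMITER = Pattern.compile("(?=[dk+-])");

    /** Returns "start-end" for every match of the delimiter pattern in the input. */
    static List<String> matchPositions(String input) {
        List<String> positions = new ArrayList<>();
        Matcher m = DELIMITER.matcher(input);
        while (m.find()) {
            positions.add(m.start() + "-" + m.end());
        }
        return positions;
    }

    public static void main(String[] args) {
        System.out.println(matchPositions("d%"));     // [0-0]
        System.out.println(matchPositions("d20+2"));  // [0-0, 3-3]
        System.out.println(matchPositions("3d6+4"));  // [1-1, 3-3]
    }
}
```

Every match is zero-width (start equals end). Only d% has its single match at position 0, which is the situation the index == 0 check catches.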

So, to get the behaviour you want (that is, split on the delimiter only when it is not at the beginning), you can add a negative look-behind to your regex pattern:

"(?<!^)(?=[dk+-])"  // You don't need to escape + and hyphen(when at the end)

This will split on an empty string that is followed by a character from your character class but not preceded by the beginning of the string.
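Here is a small runnable sketch of that pattern applied to the inputs from the question. The class and method names are invented for illustration. Note that on Java 8 and later, split already drops the leading empty string caused by a zero-width match at the start, so the look-behind mainly matters on older runtimes and does no harm on newer ones.

```java
import java.util.Arrays;

class DiceTokenizer {
    /** Splits before d, k, + or -, except at the very start of the string. */
    static String[] tokenize(String message) {
        return message.split("(?<!^)(?=[dk+-])");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("3d6+4")));  // [3, d6, +4]
        System.out.println(Arrays.toString(tokenize("d%")));     // [d%]
        System.out.println(Arrays.toString(tokenize("d20")));    // [d20]
        System.out.println(Arrays.toString(tokenize("d%+3")));   // [d%, +3]
        System.out.println(Arrays.toString(tokenize("d20+2")));  // [d20, +2]
    }
}
```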

Consider the case of splitting the string "ad%" on the regex pattern "a(?=[dk+-])". This will give you an array whose first element is an empty string. The only change here is that the zero-width match at the start is replaced with a match on a:

"ad%".split("a(?=[dk+-])");  // Result: [, d%]

Why? That's because the length of the matched string is 1. So the index value after the first match - m.end() wouldn't be 0 but 1, and hence the single element array won't be returned.

#2


5  

I was surprised that it does not happen for case 2 and 3, so the real question here is

Why is there NO empty string at the start for "d20" and "d%"?

As Rohit Jain explained in his detailed analysis, this happens when only one match is found, at the start of the string, and the match's end index is 0. (This can only happen when the match is zero-width, for example when only a lookaround assertion is used to find it.)

The problem is that d%+3 starts with a char you are splitting on. So your regex matches before the first character and you get an empty string at the start.

You can add a lookbehind to ensure that your expression does not match at the start of the string, so that it is not split there:

String[] tokens = message.split("(?<!^)(?=[dk\\+\\-])");

(?<!^) is a negative lookbehind assertion that succeeds at every position except the start of the string.

#3


0  

I'd recommend simple matching rather than splitting:

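// Groups: 1 = dice count (may be empty), 2 = die such as d20 or d%, 3 = optional modifier.
// The count uses [0-9]* rather than [1-9]* so that counts like 10 are accepted.
Matcher matcher = Pattern.compile("([0-9]*)(d[0-9%]+)([+-][0-9]+)?").matcher(string);
if (matcher.matches()) {
    String first = matcher.group(1);
    // etc
}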

No guarantee for the regex, but I think it will do...
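A fuller sketch of the matching approach might look like this. The class name, the parse method and its return convention (null for no match, empty strings for missing parts) are assumptions made for illustration. The count group uses [0-9]* so that inputs such as 10d6 are accepted, which is also an assumption about what the roller should allow.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DiceExpressionParser {
    // Group 1: optional dice count, group 2: the die (d20, d%), group 3: optional modifier.
    private static final Pattern DICE =
            Pattern.compile("([0-9]*)(d[0-9%]+)([+-][0-9]+)?");

    /** Returns {count, die, modifier}, or null if the whole expression does not match. */
    static String[] parse(String expression) {
        Matcher m = DICE.matcher(expression);
        if (!m.matches()) {
            return null;
        }
        // An unmatched optional group is reported as null; normalise it to "".
        String modifier = m.group(3) == null ? "" : m.group(3);
        return new String[] {m.group(1), m.group(2), modifier};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse("2d10+5")));  // [2, d10, +5]
        System.out.println(Arrays.toString(parse("d20+2")));   // [, d20, +2]
        System.out.println(Arrays.toString(parse("d%")));      // [, d%, ]
        System.out.println(parse("hello"));                    // null
    }
}
```

Because each part lands in a fixed group, there is no stray leading token to skip. The k character from the question's split pattern is not covered by this regex and would need its own group.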

