Match the contents within square brackets, including nested square brackets

Date: 2022-12-01 21:42:57

I am attempting to write a spoiler identification system so that any spoilers in a string are replaced with a specified spoiler character.

I want to match a string surrounded by square brackets, such that the contents within the square brackets are capture group 1 and the whole string, including the surrounding brackets, is the match.

I am currently using \[(.*?]*)\], a slight modification of the expression found in this answer here, as I also want nested square brackets to be a part of capture group 1.

The problem with that expression is that, although it works and matches the following:

  • Jim ate a [sandwich] matches [sandwich] with sandwich as group 1
  • Jim ate a [sandwich with [pickles and onions]] matches [sandwich with [pickles and onions]] with sandwich with [pickles and onions] as group 1
  • [[[[] matches [[[[] with [[[ as group 1
  • []]]] matches []]]] with ]]] as group 1

However, if I want to match the following, it does not work as expected:

  • Jim ate a [sandwich with [pickles] and [onions]] matches both:
    • [sandwich with [pickles] with sandwich with [pickles as group 1
    • [onions]] with onions] as group 1

What expression should I use such that it matches [sandwich with [pickles] and [onions]] with sandwich with [pickles] and [onions] as group 1?
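
The behavior described above can be reproduced with a small sketch. The class and method names here (LazyBracketDemo, describeMatches) are hypothetical and only serve the illustration; the pattern itself is the one quoted in the question. The lazy `.*?` stops at the first `]` that lets the rest of the pattern succeed, which is why the outer span is cut off after `[pickles]`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyBracketDemo {
    // The expression from the question, written as a Java string literal.
    static final Pattern CURRENT = Pattern.compile("\\[(.*?]*)\\]");

    // Hypothetical helper: returns "whole match -> group 1" for every match found.
    static List<String> describeMatches(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = CURRENT.matcher(input);
        while (m.find()) {
            out.add(m.group() + " -> " + m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : describeMatches("Jim ate a [sandwich with [pickles] and [onions]]")) {
            System.out.println(line);
        }
        // [sandwich with [pickles] -> sandwich with [pickles
        // [onions]] -> onions]
    }
}
```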

EDIT:

As it seems impossible to achieve this in Java using regex, is there an alternative solution?

EDIT 2:

I also want to be able to split the string at each match found, so an alternative to regular expressions would be harder to adopt, given how convenient String.split(regex) is. Here's an example:

  • Jim ate a [sandwich] with [pickles] and [dried [onions]] matches all:
    • [sandwich] with sandwich as group 1
    • [pickles] with pickles as group 1
    • [dried [onions]] with dried [onions] as group 1

And the split sentence should look like:

Jim ate a
with
and

1 answer

#1

More direct solution

This solution omits empty or whitespace-only substrings:

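import java.util.ArrayList;
import java.util.List;

// Returns the trimmed, non-empty pieces of s that lie outside top-level markStart...markEnd spans.
public static List<String> getStrsBetweenBalancedSubstrings(String s, char markStart, char markEnd) {
    List<String> subTreeList = new ArrayList<>();
    int level = 0;            // current nesting depth
    int lastCloseBracket = 0; // index just past the most recent matched closing marker
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c == markStart) {
            level++;
            // A top-level opening marker ends the current in-between piece.
            if (level == 1 && i != lastCloseBracket) {
                String piece = s.substring(lastCloseBracket, i).trim();
                if (!piece.isEmpty()) {
                    subTreeList.add(piece);
                }
            }
        } else if (c == markEnd) {
            // A closing marker without a matching opener is treated as plain text.
            if (level > 0) {
                level--;
                lastCloseBracket = i + 1;
            }
        }
    }
    // Whatever follows the last matched closing marker is the final piece.
    String tail = s.substring(lastCloseBracket).trim();
    if (!tail.isEmpty()) {
        subTreeList.add(tail);
    }
    return subTreeList;
}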

Then use it like this:

String input = "Jim ate a [sandwich][ooh] with [pickles] and [dried [onions]] and ] [an[other] match] and more here";
List<String> between_balanced =  getStrsBetweenBalancedSubstrings(input, '[', ']');
System.out.println("Result: " + between_balanced);
// => Result: [Jim ate a, with, and, and ], and more here]
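
The same depth-counting idea can also serve the original goal of hiding spoilers. The following is only a sketch under assumptions: the class name SpoilerMask, the method maskSpoilers, and the choice to replace every character of a top-level bracketed span (brackets included) with one spoiler character are all hypothetical, since the question does not say how the replacement should look. An opening marker that is never closed is left untouched, and a stray closing marker is kept as plain text.

```java
class SpoilerMask {
    // Hypothetical helper: replaces each top-level open...close span, markers included,
    // with spoilerChar repeated once per character of the span.
    static String maskSpoilers(String s, char open, char close, char spoilerChar) {
        StringBuilder out = new StringBuilder();
        int level = 0;  // current nesting depth
        int start = -1; // index of the marker that opened the current top-level span
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == open) {
                if (level == 0) {
                    start = i;
                }
                level++;
            } else if (c == close && level > 0) {
                level--;
                if (level == 0) {
                    // The span start..i is balanced: emit one spoiler character per position.
                    for (int k = start; k <= i; k++) {
                        out.append(spoilerChar);
                    }
                    start = -1;
                }
            } else if (level == 0) {
                // Text outside any span, including a stray closing marker, is copied as-is.
                out.append(c);
            }
        }
        if (level > 0) {
            // Unclosed span: copy the remainder unchanged instead of masking it.
            out.append(s.substring(start));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(maskSpoilers("a [b [c]] d ] e", '[', ']', '*'));
        // a ******* d ] e
    }
}
```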

Original answer (more complex, shows a way to extract nested parentheses)

You can also extract all substrings inside balanced parentheses and then split with them:

String input = "Jim ate a [sandwich] with [pickles] and [dried [onions]] and ] [an[other] match]";
List<String> balanced = getBalancedSubstrings(input, '[', ']', true);
System.out.println("Balanced ones: " + balanced);
List<String> rx_split = new ArrayList<String>();
for (String item : balanced) {
    rx_split.add("\\s*" + Pattern.quote(item) + "\\s*");
}
String rx = String.join("|", rx_split);
System.out.println("In-betweens: " + Arrays.toString(input.split(rx)));

And this function will find all []-balanced substrings:

public static List<String> getBalancedSubstrings(String s, Character markStart, 
                                     Character markEnd, Boolean includeMarkers) {
    List<String> subTreeList = new ArrayList<String>();
    int level = 0;
    int lastOpenBracket = -1;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c == markStart) {
            level++;
            if (level == 1) {
                lastOpenBracket = (includeMarkers ? i : i + 1);
            }
        }
        else if (c == markEnd) {
            if (level == 1) {
                subTreeList.add(s.substring(lastOpenBracket, (includeMarkers ? i + 1 : i)));
            }
            if (level > 0) level--;
        }
    }
    return subTreeList;
}

See IDEONE demo

Result of the code execution:

Balanced ones: [[sandwich], [pickles], [dried [onions]], [an[other] match]]
In-betweens: [Jim ate a, with, and, and ]]

Credits: getBalancedSubstrings is based on peter.murray.rust's answer to the post How to split this “Tree-like” string in Java regex?
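
If String.split(regex) is still preferred and the nesting depth is known in advance, a regex with the nesting spelled out may be enough. java.util.regex has no recursion, so arbitrary depth is out of reach, but one nested level can be written explicitly. This is a sketch under that assumption, not a general solution: the names BoundedDepthBrackets, SPAN, groupOnes and splitAround are hypothetical, spans nested deeper than one level are not matched as a whole, and unbalanced inputs such as [[[[] are not treated the way the expression in the question treats them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundedDepthBrackets {
    // One bracketed span whose contents are either non-bracket characters
    // or bracketed spans that themselves contain no brackets (one nested level).
    static final String SPAN = "\\[((?:[^\\[\\]]|\\[[^\\[\\]]*\\])*)\\]";

    // Hypothetical helper: returns group 1 (the contents) of every span found.
    static List<String> groupOnes(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(SPAN).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    // Hypothetical helper: splits the input around the spans, swallowing surrounding whitespace.
    static String[] splitAround(String input) {
        return input.split("\\s*" + SPAN + "\\s*");
    }

    public static void main(String[] args) {
        System.out.println(groupOnes("Jim ate a [sandwich with [pickles] and [onions]]"));
        // [sandwich with [pickles] and [onions]]
        for (String piece : splitAround("Jim ate a [sandwich] with [pickles] and [dried [onions]]")) {
            System.out.println(piece);
        }
        // Jim ate a
        // with
        // and
    }
}
```

Each additional nesting level needs another copy of the inner alternative, so the depth-counting functions above remain the safer choice when the depth is not bounded.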
