Splitting a string in Java using an advanced regex

I'm trying to use String.split in Java to split a whole document into substrings on tabs, spaces, and newlines, but I want quoted text to stay in one piece even when it contains whitespace.

Example: this file

CATEGORYTYPE1
{
    CATEGORYSUBTYPE1
    {
        OPTION1 “ABcd efg1234”
        OPTION2 ABCdefg12345
        OPTION3 15
    }
    CATEGORYSUBTYPE2
    {
        OPTION1 “Blah Blah 123”
        OPTION2 Blah
        OPTION3 10
        OPTION4 "Blah"
    }
}

splits into these substrings (as shown in the Eclipse debugger):

[CATEGORYTYPE1, {, CATEGORYSUBTYPE1, {, OPTION1, “ABcd, efg1234”, OPTION2....

when I use my current regular expression, which is this:

    String regex = "([\\n\\r\\s\\t]+)";

    String[] tokens = data.split(regex);

but what I want to achieve is to split it like this:

[CATEGORYTYPE1, {, CATEGORYSUBTYPE1, {, OPTION1, “ABcd efg1234”, OPTION2....

(that is, without splitting the contents between quotes)

Is this possible with regular expressions, and if so, how?

3 solutions

#1

Here is one way of doing this:

String str = "CATEGORYTYPE1\n" +
"{\n" + 
"    CATEGORYSUBTYPE1\n" + 
"    {\n" + 
"        OPTION1 \"ABcd efg1234\"\n" + 
"        OPTION2 ABCdefg12345\n" + 
"        OPTION3 15\n" + 
"    }\n" + 
"    CATEGORYSUBTYPE2\n" + 
"    {\n" + 
"        OPTION1 \"Blah Blah 123\"\n" + 
"        OPTION2 Blah\n" + 
"        OPTION3 10\n" + 
"        OPTION4 \"Blah\"\n" + 
"    }\n" + 
"}\n";

String[] arr = str.split("(?s)(?=(([^\"]+\"){2})*[^\"]*$)\\s+");
System.out.println(Arrays.toString(arr));

// OUTPUT
[CATEGORYTYPE1, {, CATEGORYSUBTYPE1, {, OPTION1, "ABcd efg1234", OPTION2, ABCdefg12345, ...

Explanation: the pattern matches a run of whitespace (\s+, which covers spaces, tabs, and newlines) only when it is followed by an EVEN number of double quotes (") up to the end of the input. Whitespace between two double quotes is followed by an odd number of quotes, so it is NOT used as a split point, while whitespace outside quotes is followed by an even number and is matched. This relies on the quotes in the input being balanced.
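To make the even-number-of-quotes rule concrete, here is a minimal sketch that runs both the plain whitespace split and the lookahead split on a single line. The class name `QuoteAwareSplitDemo`, the helper `splitOutsideQuotes`, and the sample line are illustrative assumptions, not part of the original answer.

```java
import java.util.Arrays;

class QuoteAwareSplitDemo {
    // A whitespace run that is followed by an even number of double quotes
    // (possibly zero) before the end of the input.
    static final String OUTSIDE_QUOTES = "(?s)(?=(([^\"]+\"){2})*[^\"]*$)\\s+";

    // Hypothetical helper: split on whitespace that lies outside double quotes.
    static String[] splitOutsideQuotes(String text) {
        return text.split(OUTSIDE_QUOTES);
    }

    public static void main(String[] args) {
        String line = "OPTION1 \"ABcd efg1234\" OPTION2 Blah";

        // Plain whitespace split breaks the quoted value in two.
        System.out.println(Arrays.toString(line.split("\\s+")));
        // prints [OPTION1, "ABcd, efg1234", OPTION2, Blah]

        // Lookahead split keeps the quoted value together.
        System.out.println(Arrays.toString(splitOutsideQuotes(line)));
        // prints [OPTION1, "ABcd efg1234", OPTION2, Blah]
    }
}
```

Note that the lookahead rescans the rest of the input at every whitespace run, so this approach may become slow on large documents.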

#2

Using split here seems complex, or even inadequate; using find is much easier. Try this:

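import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] argv) {
        // Placeholder input; replace with the document to tokenize.
        String yourstring = "OPTION1 \"ABcd efg1234\"";

        List<String> result = new ArrayList<String>();

        // A token is either a double-quoted run (kept whole, quotes included)
        // or a run of non-whitespace characters.
        Pattern pattern = Pattern.compile("\"[^\"]+\"|\\S+");
        Matcher m = pattern.matcher(yourstring);

        while (m.find()) {
            result.add(m.group(0));
        }
    }
}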

If you need to support other types of quotes (for example: “xxxxx xxxxx”), you can easily add them to the pattern:

Pattern pattern = Pattern.compile("“[^”]+”|\"[^\"]+\"|\\S+");
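As a rough sketch of how the combined pattern behaves on input shaped like the question's file, the following wraps it in a small helper. The class name `QuotedTokenizerDemo`, the method `tokenize`, and the sample text are assumptions made for illustration; the curly quotes are written as `\u201C` and `\u201D` escapes so the source does not depend on file encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedTokenizerDemo {
    // Curly-quoted run, straight-quoted run, or a run of non-whitespace.
    // Alternatives are tried left to right, so quoted runs win over \S+.
    private static final Pattern TOKEN =
            Pattern.compile("\u201C[^\u201D]+\u201D|\"[^\"]+\"|\\S+");

    // Hypothetical helper: collect every match of TOKEN in order.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sample = "OPTION1 \u201CBlah Blah 123\u201D\n"
                + "    OPTION2 Blah\n"
                + "    OPTION4 \"Blah\"";
        // Six tokens: the two quoted values each stay in one piece.
        for (String token : tokenize(sample)) {
            System.out.println(token);
        }
    }
}
```

The quote characters remain part of each token; strip them afterwards if only the contents are needed.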

You can allow escaped double quotes ("xxx \"xxx\"") by also accepting, inside the quoted run, a quote that is preceded by a backslash:

Pattern pattern = Pattern.compile("\"(?:[^\"]+|(?<=\\\\)\")+\"|\\S+");
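For comparison, escaped quotes can also be handled without a lookbehind. The sketch below uses a different, commonly seen formulation, `"(?:[^"\\]|\\.)*"`, which consumes a backslash together with the character that follows it; it is not the pattern from this answer. The class name `EscapeAwareTokenizerDemo`, the method `tokenize`, and the sample input are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeAwareTokenizerDemo {
    // Quoted run in which a backslash escapes the next character,
    // or a run of non-whitespace. Regex: "(?:[^"\\]|\\.)*"|\S+
    private static final Pattern TOKEN =
            Pattern.compile("\"(?:[^\"\\\\]|\\\\.)*\"|\\S+");

    // Hypothetical helper: collect every match of TOKEN in order.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Actual text: say "xxx \"xxx\"" done
        String sample = "say \"xxx \\\"xxx\\\"\" done";
        System.out.println(tokenize(sample));
        // prints [say, "xxx \"xxx\"", done]
    }
}
```

Unlike the `[^\"]+` variants, this form also accepts an empty quoted string (`""`) as a token.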

#3

I know I joined the party rather late, but if you are looking for a fancy regex that "understands" escaped quotes (\") as well, this one should work for you:

Pattern p = Pattern.compile("(\\S*?\".*?(?<!\\\\)\")+\\S*|\\S+");
Matcher m = p.matcher(str);
while (m.find()) { ... }

It will also parse something like this:
ab "cd \"ef\" gh" ij "kl \"no pq\"\" rs"
into:
ab, "cd \"ef\" gh", ij, "kl \"no pq\"\" rs"
(without getting confused by the odd number of escaped quotes (\")).

(Probably irrelevant, but this one will also "understand" a " in the middle of a token, so it will parse ab c" "d ef into ab, c" "d, ef. Not that such input is likely to turn up.)
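Both sample inputs above can be checked with a small harness around the same pattern. This is only a sketch: the class name `EscapedQuoteRegexDemo` and the method `tokenize` are assumed names, and the loop body simply collects each match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedQuoteRegexDemo {
    // Same pattern as in this answer: one or more quoted sections glued to
    // adjacent non-whitespace, or else a plain run of non-whitespace.
    private static final Pattern TOKEN =
            Pattern.compile("(\\S*?\".*?(?<!\\\\)\")+\\S*|\\S+");

    // Hypothetical helper: collect every match of TOKEN in order.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Actual text: ab "cd \"ef\" gh" ij "kl \"no pq\"\" rs"
        String first = "ab \"cd \\\"ef\\\" gh\" ij \"kl \\\"no pq\\\"\\\" rs\"";
        System.out.println(tokenize(first).size()); // prints 4

        // Actual text: ab c" "d ef
        String second = "ab c\" \"d ef";
        System.out.println(tokenize(second));
        // prints [ab, c" "d, ef]
    }
}
```

Because `.` does not match line terminators by default, a quoted section that spans several lines would need `Pattern.DOTALL`; the lookbehind also treats a quote that follows an escaped backslash (`\\"`) as escaped.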

Anyway, you can also take a look at this short demo.
