I'm trying to use String.split in Java to split a whole document into substrings at tabs, spaces, and newlines, but I want to exclude whitespace that appears inside quotes.
Example:
This file:
CATEGORYTYPE1
{
CATEGORYSUBTYPE1
{
OPTION1 “ABcd efg1234”
OPTION2 ABCdefg12345
OPTION3 15
}
CATEGORYSUBTYPE2
{
OPTION1 “Blah Blah 123”
OPTION2 Blah
OPTION3 10
OPTION4 "Blah"
}
}
splits into these substrings (as shown in the Eclipse debugger):
[CATEGORYTYPE1, {, CATEGORYSUBTYPE1, {, OPTION1, “ABcd, efg1234”, OPTION2....
when I use my current regular expression, which is:
String regex = "([\\n\\r\\s\\t]+)";
String[] tokens = data.split(regex);
but what I want to achieve is to split it like this:
[CATEGORYTYPE1, {, CATEGORYSUBTYPE1, {, OPTION1, “ABcd efg1234”, OPTION2....
(that is, without splitting the contents between quotes)
Is this possible with regular expressions, and if so, how?
3 answers
#1
2
Here is one way of doing this:
// Requires: import java.util.Arrays;
String str = "CATEGORYTYPE1\n" +
"{\n" +
" CATEGORYSUBTYPE1\n" +
" {\n" +
" OPTION1 \"ABcd efg1234\"\n" +
" OPTION2 ABCdefg12345\n" +
" OPTION3 15\n" +
" }\n" +
" CATEGORYSUBTYPE2\n" +
" {\n" +
" OPTION1 \"Blah Blah 123\"\n" +
" OPTION2 Blah\n" +
" OPTION3 10\n" +
" OPTION4 \"Blah\"\n" +
" }\n" +
"}\n";
String[] arr = str.split("(?s)(?=(([^\"]+\"){2})*[^\"]*$)\\s+");
System.out.println(Arrays.toString(arr));
// OUTPUT
[CATEGORYTYPE1, {, CATEGORYSUBTYPE1, {, OPTION1, "ABcd efg1234", OPTION2, ABCdefg12345, ...
Explanation: the pattern matches whitespace, including newlines (\s), that is followed by an EVEN number of double quotes ("). Hence whitespace between two double-quote characters is NOT used for the split, while whitespace outside quotes is matched (since it is followed by an even number of double-quote characters).
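To make the even-quote lookahead easier to try out, here is a minimal self-contained sketch. The class name QuoteAwareSplit and the helper method split are illustrative names, not part of the answer; the regular expression is the one from the answer, and the sample input is a shortened version of the question's file using straight double quotes.

```java
import java.util.Arrays;

public class QuoteAwareSplit {
    // Whitespace that is followed by an even number of double quotes up to the
    // end of the input, i.e. whitespace that sits outside any quoted run.
    private static final String OUTSIDE_QUOTES =
            "(?s)(?=(([^\"]+\"){2})*[^\"]*$)\\s+";

    // Hypothetical helper: splits on whitespace that is not inside double quotes.
    static String[] split(String text) {
        return text.split(OUTSIDE_QUOTES);
    }

    public static void main(String[] args) {
        String sample = "OPTION1 \"ABcd efg1234\"\nOPTION2 ABCdefg12345\n";
        System.out.println(Arrays.toString(split(sample)));
        // prints [OPTION1, "ABcd efg1234", OPTION2, ABCdefg12345]
    }
}
```

Two things to keep in mind with this approach: the lookahead rescans the rest of the input at every run of whitespace, so it can get slow on large documents, and it only recognizes straight double quotes (not “ and ”).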
#2
1
Using split here seems complex, or even inadequate; using find is much easier. Try this:
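import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] argv) {
        // Sample input for illustration only; substitute your document text.
        String yourstring = "OPTION1 \"ABcd efg1234\" OPTION2 ABCdefg12345";
        List<String> result = new ArrayList<String>();
        // A token is either a double-quoted run or a run of non-whitespace characters.
        Pattern pattern = Pattern.compile("\"[^\"]+\"|\\S+");
        Matcher m = pattern.matcher(yourstring);
        while (m.find()) {
            result.add(m.group(0));
        }
        System.out.println(result);
    }
}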
If you need to support other types of quotes (for example: “xxxxx xxxxx”), you can easily add them to the pattern:
Pattern pattern = Pattern.compile("“[^”]+”|\"[^\"]+\"|\\S+");
You can allow escaped double quotes ("xxx \"xxx\"") with this:
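// A quoted run may contain any character other than " or \, or a backslash followed by any character (such as \").
Pattern pattern = Pattern.compile("\"(?:[^\"\\\\]|\\\\.)*\"|\\S+");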
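The find-based approach with escaped quotes can be sketched as a small reusable tokenizer. This is only an illustration under stated assumptions: backslash is taken to be the sole escape character, the names QuotedTokenizer and tokenize are made up for the example, and the pattern is a common quoted-string idiom (any non-quote, non-backslash character, or a backslash followed by any character).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedTokenizer {
    // Either a double-quoted run (which may contain backslash escapes such as \")
    // or a run of non-whitespace characters.
    private static final Pattern TOKEN =
            Pattern.compile("\"(?:[^\"\\\\]|\\\\.)*\"|\\S+");

    // Hypothetical helper: collects every match of TOKEN, in order.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The Java literal below is the text: OPTION1 "xxx \"xxx\"" OPTION3 15
        System.out.println(tokenize("OPTION1 \"xxx \\\"xxx\\\"\" OPTION3 15"));
        // prints [OPTION1, "xxx \"xxx\"", OPTION3, 15]
    }
}
```

The quotes stay part of each token, as in the question's desired output; strip them afterwards if only the contents are wanted.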
#3
0
I know I joined the party rather late, but if you are looking for a fancy regex that "understands" escaped quotes (\") as well, this one should work for you:
Pattern p = Pattern.compile("(\\S*?\".*?(?<!\\\\)\")+\\S*|\\S+");
Matcher m = p.matcher(str);
while (m.find()) { ... }
It will also parse something like this: ab "cd \"ef\" gh" ij "kl \"no pq\"\" rs"
to: ab, "cd \"ef\" gh", ij, "kl \"no pq\"\" rs"
(without getting confused by the odd number of escaped quotes (\")).
(Probably irrelevant, but this one will also "understand" a " in the middle of a string, so it will parse this: ab c" "d ef
to: ab, c" "d, ef
Not that such a pattern is likely to emerge.)
Anyway, you can also take a look at this short demo.
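For readers who want to check the two examples locally, here is a minimal runnable sketch. The class name EscapedQuoteDemo and the method tokens are illustrative; the pattern is the one given in this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapedQuoteDemo {
    // Pattern from the answer: one or more quoted runs (closed by a quote that is
    // not preceded by a backslash) with optional adjoining non-whitespace text,
    // or else a plain run of non-whitespace characters.
    private static final Pattern P =
            Pattern.compile("(\\S*?\".*?(?<!\\\\)\")+\\S*|\\S+");

    // Hypothetical helper: collects every match of P, in order.
    static List<String> tokens(String str) {
        List<String> result = new ArrayList<>();
        Matcher m = P.matcher(str);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Text: ab "cd \"ef\" gh" ij "kl \"no pq\"\" rs"
        String first = "ab \"cd \\\"ef\\\" gh\" ij \"kl \\\"no pq\\\"\\\" rs\"";
        for (String t : tokens(first)) {
            System.out.println(t);
        }
        // Text: ab c" "d ef
        for (String t : tokens("ab c\" \"d ef")) {
            System.out.println(t);
        }
    }
}
```

One caveat, stated as an observation about the lookbehind rather than a flaw in the answer's examples: because a quote counts as escaped whenever a backslash precedes it, a quoted value that ends in an escaped backslash (such as "a\\") would not be closed where you might expect.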