Extracting pairs of words using String.split()

Time: 2022-07-04 21:38:08

Given:

String input = "one two three four five six seven";

Is there a regex that works with String.split() to grab (up to) two words at a time, such that:

String[] pairs = input.split("some regex");
System.out.println(Arrays.toString(pairs));

results in this:

[one two, three four, five six, seven]

This question is about the split regex. It is not about "finding a work-around" or other "making it work in another way" solutions.

4 answers

#1


77  

Currently (including Java 8) it is possible to do this with split(), but don't use this approach in real-world code, since it appears to rely on a bug: look-behind in Java should have an obvious maximum length, yet this solution uses \w+, which doesn't respect that limitation. Instead, use the Pattern and Matcher classes to avoid overcomplicating things and creating a maintenance hell, since this behaviour may change in later versions of Java or in Java-like environments such as Android.
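
For comparison, here is a minimal sketch of the Pattern and Matcher route recommended above. It does not answer the split() question itself; it collects the same pairs by matching the words rather than the separators. The class name WordPairs, the method pairs and the pattern \S+(?:\s+\S+)? are illustrative assumptions, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordPairs {
    // One word, optionally followed by whitespace and a second word.
    private static final Pattern PAIR = Pattern.compile("\\S+(?:\\s+\\S+)?");

    // Collects every match of PAIR, i.e. groups of up to two words.
    static List<String> pairs(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(pairs("one two three four five six seven"));
        // prints [one two, three four, five six, seven]
    }
}
```

Because this uses no look-behind at all, it should not depend on how a particular Java version handles look-behind length.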

Is this what you are looking for?
(You can replace \\w with \\S to include all non-space characters, but for this example I will keep \\w, since a regex with \\w\\s is easier to read than one with \\S\\s.)

String input = "one two three four five six seven";
String[] pairs = input.split("(?<!\\G\\w+)\\s");
System.out.println(Arrays.toString(pairs));

output:

[one two, three four, five six, seven]

\G is the end of the previous match; (?<!regex) is a negative look-behind.

In split we are trying to

  1. find spaces -> \\s
  2. that are not preceded -> (?<!negativeLookBehind)
  3. by some word -> \\w+
  4. with the previously matched (space) -> \\G
  5. before it -> \\G\\w+.

The only thing that confused me at first was how this would work for the first space, since we want that space to be ignored. The important point is that at the start \\G matches the start of the String, like ^.

So before the first iteration, the regex in the negative look-behind effectively looks like (?<!^\\w+), and since the first space does have ^\\w+ before it, it can't be a match for split. The next space will not have this problem, so it will be matched, and information about it (like its position in the input String) will be stored in \\G and used later in the next negative look-behind.

So for the 3rd space, the regex will check whether the previously matched space \\G and a word \\w+ come directly before it. Since the result of this test will be positive, the negative look-behind won't accept it, so this space won't be matched. The 4th space won't have this problem, because the space before it won't be the one stored in \\G (it will have a different position in the input String).
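
The behaviour of \G described here can be observed without any look-behind. In this sketch (the class LastMatchDemo and the method contiguousWords are made up for illustration), the pattern \G\w+\s? can only match where the previous match ended, or at the start of the input on the first attempt, so matching stops at the first character that breaks the chain.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastMatchDemo {
    // Returns the words that form an unbroken chain from the start of the input.
    static List<String> contiguousWords(String input) {
        // \G anchors each match to the end of the previous one.
        Matcher m = Pattern.compile("\\G\\w+\\s?").matcher(input);
        List<String> words = new ArrayList<>();
        while (m.find()) {
            words.add(m.group().trim());
        }
        return words;
    }

    public static void main(String[] args) {
        // The comma after "two" breaks the chain, so "three" and "four" are never reached.
        System.out.println(contiguousWords("one two, three four"));
        // prints [one, two]
    }
}
```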

Also, if someone would like to split on, let's say, every 3rd space, you can use this form (based on @maybeWeCouldStealAVan's answer, which was deleted when I posted this fragment of the answer):

input.split("(?<=\\G\\w{1,100}\\s\\w{1,100}\\s\\w{1,100})\\s")

Instead of 100 you can use any bigger value, as long as it is at least the length of the longest word in the String.
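
The bounded form generalises to any group size by repeating the word part. The following is only a sketch: the class GroupSplit, the helper splitEvery and its parameters are hypothetical names, and it assumes the bounded look-behind with \G behaves as described in this thread.

```java
import java.util.Arrays;
import java.util.Collections;

class GroupSplit {
    // Hypothetical helper: splits on every n-th whitespace character.
    // maxWordLength must be at least the length of the longest word.
    static String[] splitEvery(String input, int n, int maxWordLength) {
        String word = "\\w{1," + maxWordLength + "}";
        // n bounded words joined by single whitespace characters.
        String group = String.join("\\s", Collections.nCopies(n, word));
        return input.split("(?<=\\G" + group + ")\\s");
    }

    public static void main(String[] args) {
        String input = "one two three four five six seven";
        System.out.println(Arrays.toString(splitEvery(input, 3, 100)));
        // prints [one two three, four five six, seven]
        System.out.println(Arrays.toString(splitEvery(input, 2, 100)));
        // prints [one two, three four, five six, seven]
    }
}
```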

I just noticed that we can also use + instead of {1,maxWordLength} if we want to split on every odd-numbered separator, such as every 3rd, 5th or 7th, for example:

String data = "0,0,1,2,4,5,3,4,6,1,3,3,4,5,1,1";
String[] array = data.split("(?<=\\G\\d+,\\d+,\\d+,\\d+,\\d+),"); // every 5th comma
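
Since the + form depends on the look-behind behaviour this answer warns about, here is a look-behind-free sketch that should produce the same groups by matching the numbers instead of the commas. The class CommaGroups and the method groupsOf are illustrative names, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommaGroups {
    // Collects runs of up to n comma-separated numbers (n >= 1).
    static List<String> groupsOf(String data, int n) {
        // One number, followed by at most n - 1 further ",number" parts.
        Pattern p = Pattern.compile("\\d+(?:,\\d+){0," + (n - 1) + "}");
        Matcher m = p.matcher(data);
        List<String> groups = new ArrayList<>();
        while (m.find()) {
            groups.add(m.group());
        }
        return groups;
    }

    public static void main(String[] args) {
        String data = "0,0,1,2,4,5,3,4,6,1,3,3,4,5,1,1";
        // Print one group per line to keep the commas readable.
        for (String group : groupsOf(data, 5)) {
            System.out.println(group);
        }
    }
}
```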

#2


8  

This will work, but the maximum word length needs to be set in advance:

String input = "one two three four five six seven eight nine ten eleven";
String[] pairs = input.split("(?<=\\G\\S{1,30}\\s\\S{1,30})\\s");
System.out.println(Arrays.toString(pairs));

I like Pshemo's answer better, as it is shorter and works with arbitrary word lengths, but this one (as @Pshemo pointed out) has the advantage of being adaptable to groups of more than 2 words.

#3


0  

This worked for me: (\w+\s*){2}\K\s

  • a required word followed by an optional space (\w+\s*)
  • repeated two times {2}
  • ignore previously matched characters \K
  • the required space \s
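
Note that \K comes from Perl/PCRE-style engines; as far as I know, java.util.regex does not implement it, so this pattern is unlikely to work with String.split(). A quick way to check on your own JDK is to try compiling it. The class EscapeCheck and the method compiles below are hypothetical names used only for this sketch.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class EscapeCheck {
    // Returns true if the running JDK accepts the given regex.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Expected to report false on JDKs without \K support.
        System.out.println(compiles("(\\w+\\s*){2}\\K\\s"));
    }
}
```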

#4


-1  

You can try this:

[a-z]+\s[a-z]+

Updated:

([a-z]+\s[a-z]+)|[a-z]+

Updated:

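 String pattern = "([a-z]+\\s[a-z]+)|[a-z]+";
 String input = "one two three four five six seven";

 // The pattern describes the pairs themselves, so iterate over its matches.
 // Pattern.split() would treat every pair as a delimiter and return only the gaps.
 Matcher matcher = Pattern.compile(pattern).matcher(input);

 while (matcher.find()) {
     System.out.println("Output = \"" + matcher.group() + "\"");
 }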
