I would like to extract the substring between two given words using Java.
For example:
This is an important example about regex for my work.
I would like to extract everything between "an" and "for".
What I have done so far is:
String sentence = "This is an important example about regex for my work and for me";
Pattern pattern = Pattern.compile("(?<=an).*.(?=for)");
Matcher matcher = pattern.matcher(sentence);
boolean found = false;
while (matcher.find()) {
    System.out.println("I found the text: " + matcher.group());
    found = true;
}
if (!found) {
    System.out.println("I didn't find the text");
}
It works well.
But I want to do two additional things:
- If the sentence is:
This is an important example about regex for my work and for me.
I want to extract only up to the first "for", i.e. important example about regex
- Sometimes I want to limit the number of words between the two delimiters to 3, i.e. important example about
Any ideas please?
3 solutions
#1
8
For your first question, make it lazy. You can put a question mark after the quantifier, and the quantifier will then match as little as possible.
(?<=an).*?(?=for)
I have no idea what the additional . at the end of .*. is good for; it's unnecessary.
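To make the difference between the greedy and the lazy quantifier concrete, here is a minimal sketch that runs both patterns against the sentence from the question. The class name `LazyQuantifierDemo` and the helper `firstMatch` are made up for this illustration; only `Pattern` and `Matcher` come from the JDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazyQuantifierDemo {

    // Hypothetical helper: returns the first match of the regex in the text, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String sentence = "This is an important example about regex for my work and for me";

        // Greedy: .* runs to the end of the input and backtracks to the LAST "for".
        System.out.println("[" + firstMatch("(?<=an).*(?=for)", sentence) + "]");
        // prints "[ important example about regex for my work and ]"

        // Lazy: .*? grows one character at a time and stops at the FIRST "for".
        System.out.println("[" + firstMatch("(?<=an).*?(?=for)", sentence) + "]");
        // prints "[ important example about regex ]"
    }
}
```

Note that both results keep the surrounding spaces, because the lookarounds only assert "an" and "for" and do not consume the whitespace next to them.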
For your second question you have to define what a "word" is. Here I would say it is probably just a sequence of non-whitespace characters followed by a whitespace character. Something like this:
\S+\s
and repeat this 3 times like this:
(?<=an)\s(\S+\s){3}(?=for)
To ensure that the pattern matches on whole words, use word boundaries:
(?<=\ban\b)\s(\S+\s){1,5}(?=\bfor\b)
{3} matches exactly 3 repetitions. For a minimum of 1 and a maximum of 3, use {1,3}. Because of the lookahead, the pattern only matches when the number of words between "an" and "for" falls within the quantifier's range.
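The effect of the word-count quantifier can be checked with a small sketch like the one below. The class `WordLimitDemo` and the helper `firstMatch` are hypothetical names. The third pattern, which drops the lookahead for "for", is a variation of my own rather than part of the patterns above; it is one possible way to get just the first three words after "an".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordLimitDemo {

    // Hypothetical helper: returns the first match of the regex in the text, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String sentence = "This is an important example about regex for my work and for me";

        // Exactly 3 words must sit between "an" and "for"; here there are 4, so nothing matches.
        System.out.println("[" + firstMatch("(?<=\\ban\\b)\\s(\\S+\\s){3}(?=\\bfor\\b)", sentence) + "]");
        // prints "[null]"

        // Between 1 and 5 words are allowed, so the 4 words before the first "for" match.
        System.out.println("[" + firstMatch("(?<=\\ban\\b)\\s(\\S+\\s){1,5}(?=\\bfor\\b)", sentence) + "]");
        // prints "[ important example about regex ]"

        // Assumed variation: without the lookahead, take at most the first 3 words after "an".
        System.out.println("[" + firstMatch("(?<=\\ban\\b)\\s(\\S+\\s){1,3}", sentence) + "]");
        // prints "[ important example about ]"
    }
}
```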
Alternative:
As dma_k correctly stated, in your case it's not necessary to use lookbehind and lookahead. See the Matcher documentation about groups.
You can use capturing groups instead. Just put the part you want to extract in parentheses and it will be put into a capturing group.
\ban\b(.*?)\bfor\b
You can then access this group like this:
System.out.println("I found the text: " + matcher.group(1));
You have only one pair of parentheses, so it's simple: just pass 1 to matcher.group(1) to access the first capturing group.
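Put together, the capturing-group variant might look like the following sketch. The class name `CapturingGroupDemo` and the method `between` are invented for the example, and the call to `trim()` is an extra step I added to drop the spaces around the captured text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CapturingGroupDemo {

    // Group 1 holds everything between the whole words "an" and "for" (lazy, so it stops at the first "for").
    private static final Pattern BETWEEN = Pattern.compile("\\ban\\b(.*?)\\bfor\\b");

    // Hypothetical helper: returns the trimmed content of group 1, or null if the pattern does not match.
    static String between(String text) {
        Matcher matcher = BETWEEN.matcher(text);
        return matcher.find() ? matcher.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String sentence = "This is an important example about regex for my work and for me";
        System.out.println("I found the text: " + between(sentence));
        // prints "I found the text: important example about regex"
    }
}
```

Unlike the lookaround version, the overall match here includes "an" and "for" themselves, which is why the result is read from group 1 rather than from `matcher.group()`.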
#2
3
Your regex is "an\\s+(.*?)\\s+for". It extracts all characters between "an" and "for", ignoring the surrounding whitespace (\s+). The question mark makes the quantifier non-greedy (reluctant). It is needed to prevent the pattern .* from consuming everything up to the last "for".
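The same idea can be generalized to arbitrary delimiter words. The sketch below is one possible way to do that, not part of the answer itself: the class `BetweenWordsDemo` and the method `between` are hypothetical, and `Pattern.quote` is used so that delimiters containing regex metacharacters are treated literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BetweenWordsDemo {

    // Hypothetical helper: builds start\s+(.*?)\s+end from the given delimiters
    // and returns group 1 of the first match, or null if there is no match.
    static String between(String text, String start, String end) {
        Pattern pattern = Pattern.compile(
                Pattern.quote(start) + "\\s+(.*?)\\s+" + Pattern.quote(end));
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String sentence = "This is an important example about regex for my work and for me";
        System.out.println(between(sentence, "an", "for"));
        // prints "important example about regex"

        // Caveat: without word boundaries, "an" also matches at the end of "plan".
        System.out.println(between("a plan to wait for you", "an", "for"));
        // prints "to wait"
    }
}
```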
#3
2
public class SubStringBetween {

    // Returns the text between the end of the last occurrence of 'before'
    // and the start of the last occurrence of 'after'.
    public static String subStringBetween(String sentence, String before, String after) {
        int startSub = SubStringBetween.subStringStartIndex(sentence, before);
        int stopSub = SubStringBetween.subStringEndIndex(sentence, after);
        String newWord = sentence.substring(startSub, stopSub);
        return newWord;
    }

    // Index just past the last occurrence of the delimiter, or 0 if it does not occur.
    public static int subStringStartIndex(String sentence, String delimiterBeforeWord) {
        int startIndex = 0;
        String newWord = "";
        int x = 0;
        for (int i = 0; i < sentence.length(); i++) {
            newWord = "";
            if (sentence.charAt(i) == delimiterBeforeWord.charAt(0)) {
                startIndex = i;
                // Collect the characters that match the delimiter position by position.
                for (int j = 0; j < delimiterBeforeWord.length(); j++) {
                    try {
                        if (sentence.charAt(startIndex) == delimiterBeforeWord.charAt(j)) {
                            newWord = newWord + sentence.charAt(startIndex);
                        }
                        startIndex++;
                    } catch (Exception e) {
                        // Ran past the end of the sentence; no match at this position.
                    }
                }
                if (newWord.equals(delimiterBeforeWord)) {
                    x = startIndex;
                }
            }
        }
        return x;
    }

    // Index at which the last occurrence of the delimiter starts, or 0 if it does not occur.
    public static int subStringEndIndex(String sentence, String delimiterAfterWord) {
        int startIndex = 0;
        String newWord = "";
        int x = 0;
        for (int i = 0; i < sentence.length(); i++) {
            newWord = "";
            if (sentence.charAt(i) == delimiterAfterWord.charAt(0)) {
                startIndex = i;
                for (int j = 0; j < delimiterAfterWord.length(); j++) {
                    try {
                        if (sentence.charAt(startIndex) == delimiterAfterWord.charAt(j)) {
                            newWord = newWord + sentence.charAt(startIndex);
                        }
                        startIndex++;
                    } catch (Exception e) {
                        // Ran past the end of the sentence; no match at this position.
                    }
                }
                if (newWord.equals(delimiterAfterWord)) {
                    x = startIndex;
                    x = x - delimiterAfterWord.length();
                }
            }
        }
        return x;
    }
}
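For comparison, a regex-free approach can also be sketched with `String.indexOf`. This is an alternative under my own assumptions, not the code above rewritten: the class `SubstringBetweenSketch` and its method `between` are hypothetical, and unlike the class above it uses the first occurrence of each delimiter and returns null when one of them is missing.

```java
public class SubstringBetweenSketch {

    // Hypothetical helper: text between the first 'before' and the first 'after' that follows it,
    // or null if either delimiter cannot be found.
    static String between(String sentence, String before, String after) {
        int start = sentence.indexOf(before);
        if (start < 0) {
            return null;
        }
        start += before.length();                 // skip past the opening delimiter
        int end = sentence.indexOf(after, start); // search only after the opening delimiter
        if (end < 0) {
            return null;
        }
        return sentence.substring(start, end);
    }

    public static void main(String[] args) {
        String sentence = "This is an important example about regex for my work and for me";
        System.out.println("[" + between(sentence, "an", "for").trim() + "]");
        // prints "[important example about regex]"
    }
}
```

Plain substring search knows nothing about word boundaries, so "an" would also be found inside a word such as "and" if that occurred first; passing delimiters with surrounding spaces (for example " an " and " for ") is one simple way to reduce that risk.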