How to efficiently find the string between two given substrings?

Time: 2020-11-28 19:21:21

I have a string and I know two unique substrings and which one precedes which. What would be the most efficient way of finding the string in between?
Right now I am doing this, which works well:

middleString = line.split(firstSubstr)[1].split(secondSubstr)[0];

I need to do this for every single line in a large number of big files, and I don't find this approach very elegant. I was wondering if there is a more efficient and elegant way to do it.
If this line were evaluated lazily, I assume the code would be very efficient, but I don't think that is the case for this expression. For example, given a string of hundreds of characters that starts with abc, where "a" is the first substring and "c" the second, the code would look for every "a" and every "c" in the whole string before returning "b".
Another possibility would be to write my own method: iterate over the original string character by character until the first substring is found, then append all the characters until the second one is found. But I think there should be a simpler way than this.

1 solution

#1


You can solve this using indexOf instead of split, as follows:

String in = "abcdefghij";
String part1 = "cd";
String part2 = "gh";

int i1 = in.indexOf(part1) + part1.length();  // end of first match
int i2 = in.indexOf(part2, i1);               // start of second match

System.out.println(in.substring(i1, i2));     // "ef"
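
The snippet above assumes both substrings are present. If one is missing, indexOf returns -1 and substring either throws or returns the wrong span. Here is a minimal sketch of a guarded version. The class name SubstringBetween and the method between are hypothetical (they are not from the original answer), and returning null for "not found" is just one possible convention:

```java
final class SubstringBetween {

    // Hypothetical helper: returns the text between the first occurrence of
    // `first` and the next occurrence of `second` after it, or null if either
    // delimiter is missing.
    static String between(String line, String first, String second) {
        int start = line.indexOf(first);
        if (start < 0) {
            return null;                       // first delimiter not found
        }
        start += first.length();               // skip past the first delimiter
        int end = line.indexOf(second, start); // search only after the first match
        if (end < 0) {
            return null;                       // second delimiter not found
        }
        return line.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(between("abcdefghij", "cd", "gh")); // ef
        System.out.println(between("abcdefghij", "cd", "zz")); // null
    }
}
```

Because the second indexOf starts at the end of the first match, the scan stops as soon as the second delimiter is found, so the rest of the line is never examined.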

Here's one solution using regular expressions and capturing groups:

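import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Pattern.quote makes the delimiters match literally; (.*?) is reluctant,
// so the capture stops at the first occurrence of part2.
Pattern p = Pattern.compile(Pattern.quote(part1)
                         + "(.*?)"
                         + Pattern.quote(part2));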

Matcher m = p.matcher(in);

if (m.find()) {
    System.out.println(m.group(1));  // "ef"
}

Regarding which one is fastest, I'd say it depends on various factors. Which JRE are you using? Would the same pattern be used over and over again (can you compile the regex once and reuse it)? Since the code is just a few lines, I suggest you simply experiment with it a bit, and profile if necessary.
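
To illustrate the "compile once and reuse" point, here is a rough sketch that builds the pattern a single time and applies it to many lines. The class name RegexBetween and its method between are made up for this example, and null is again used as an assumed "no match" convention:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexBetween {

    // Compiled once in the constructor; the delimiters are quoted so they
    // are matched literally rather than as regex syntax.
    private final Pattern pattern;

    RegexBetween(String first, String second) {
        this.pattern = Pattern.compile(
                Pattern.quote(first) + "(.*?)" + Pattern.quote(second));
    }

    // Returns the captured middle part, or null when the line does not match.
    String between(String line) {
        Matcher m = pattern.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        RegexBetween extractor = new RegexBetween("cd", "gh");
        String[] lines = {"abcdefghij", "xxcd123ghyy", "no match"};
        for (String line : lines) {
            System.out.println(extractor.between(line));
        }
    }
}
```

Whether this beats the indexOf version still has to be measured on your own data; the sketch only shows how to keep the compilation cost out of the per-line loop.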


Note that the solution you suggest:

middleString = line.split(firstSubstr)[1].split(secondSubstr)[0];

could have a devastating memory footprint. See this Q/A: Java String.split memory leak?
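
Two further details of the split one-liner are worth knowing. String.split treats its argument as a regular expression, so delimiters containing characters such as "." or "(" will not be matched literally. Without a limit, it also splits on every occurrence in the line. If you want to stay close to the original expression, a sketch along these lines addresses both points. The class name SplitBetween and the null return value are assumptions made for this example, and it makes no claim about the memory concern mentioned above:

```java
import java.util.regex.Pattern;

final class SplitBetween {

    // Hypothetical helper: same idea as the original one-liner, but with
    // literal delimiters and at most one split per call (limit = 2).
    static String between(String line, String first, String second) {
        String[] afterFirst = line.split(Pattern.quote(first), 2);
        if (afterFirst.length < 2) {
            return null;                       // first delimiter not found
        }
        String[] beforeSecond = afterFirst[1].split(Pattern.quote(second), 2);
        if (beforeSecond.length < 2) {
            return null;                       // second delimiter not found
        }
        return beforeSecond[0];
    }

    public static void main(String[] args) {
        System.out.println(between("abcdefghij", "cd", "gh"));
        System.out.println(between("x.y(z", ".", "("));
    }
}
```

With a limit of 2 the pattern is applied at most once per call, so each split stops at the first occurrence of its delimiter.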
