I want to find the number of times a word appears in a string, quickly and efficiently, using Java.
The words are separated by spaces, and I am looking for whole-word matches only.
Example:
string: "the colored port should be black or white or brown"
word: "or"
output: 2
In the above example, "colored" and "port" are not counted, but "or" is counted.
I considered using substring() and contains() while iterating over the string, but then we need to check for the surrounding spaces, which I suppose is not efficient. StringUtils.countMatches() does not seem efficient either.
The best way I tried is splitting the string over space and iterating over the words, and then matching them against the given word:
String string = "the colored port should be black or white or brown";
String[] words = string.split(" ");
String word = "or";
int occurrences = 0;
for (int i = 0; i < words.length; i++) {
    if (words[i].equals(word)) {
        occurrences++;
    }
}
System.out.println(occurrences);
But I was hoping for a more efficient way using Matcher and a regex.
So I tested the following code:
String string1 = "the colored port should be black or white or brown or";
//String string2 = "the color port should be black or white or brown or";
String word = "or";
Pattern pattern = Pattern.compile("\\s(" + word + ")|\\s(" + word + ")|(" + word + ")\\s");
Matcher matcher = pattern.matcher(string1);
//Matcher matcher = pattern.matcher(string2);
int count = 0;
while (matcher.find()) {
    String match = matcher.group(); // the matched text, kept for inspection
    count++;
}
System.out.println("The word \"" + word + "\" is mentioned " + count + " times.");
It is supposed to be fast enough, and it gives me the right answer for string1 but not for string2 (commented out). The regex seems to need a small change.
Any ideas?
3 Solutions
#1
0
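import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        String str = "the colored port should be black or white or brown";
        // Matches "or" only when it has a space on both sides, so an
        // occurrence at the very start or end of the string is not counted.
        Pattern pattern = Pattern.compile(" or ");
        Matcher matcher = pattern.matcher(str);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        System.out.println(count);
    }
}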
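The pattern " or " needs a literal space on each side, which is why a word at the start or end of the string is missed. One possible adjustment, offered as a sketch and not as the answerer's method, is to use lookarounds that only require "no non-whitespace character" on either side. The class and method names below are hypothetical, and the word is assumed to be non-empty.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWordRegexCount {

    // Counts occurrences of `word` that are delimited by whitespace
    // or by the start/end of `text`. Assumes `word` is non-empty.
    static int countWholeWord(String text, String word) {
        // (?<!\S) : the match is not preceded by a non-whitespace character
        // (?!\S)  : the match is not followed by a non-whitespace character
        // Pattern.quote keeps any regex metacharacters in `word` literal.
        Pattern pattern = Pattern.compile("(?<!\\S)" + Pattern.quote(word) + "(?!\\S)");
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String string1 = "the colored port should be black or white or brown or";
        String string2 = "the color port should be black or white or brown or";
        System.out.println(countWholeWord(string1, "or"));
        System.out.println(countWholeWord(string2, "or"));
    }
}
```

Because the lookarounds consume no characters, adjacent occurrences such as "or or" are each matched, and the "or" inside "color" is rejected by the lookbehind.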
#2
0
How about this? It assumes the word won't contain spaces.
int count = string.split("\\s" + word + "\\s").length - 1;
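The split-length trick relies on the word having whitespace on both sides, so it can miscount when the word is the first or last token. As a hedged alternative that uses neither a regex nor an intermediate array, the sketch below scans with indexOf and checks the neighbouring characters. The class and method names are hypothetical, only the space character is treated as a separator (as in the question), and whether this is actually faster than split would need measuring.

```java
class WholeWordScanCount {

    // Counts occurrences of `word` in `text` that are bounded by a space
    // or by the start/end of the string.
    static int countWholeWord(String text, String word) {
        if (word.isEmpty()) {
            return 0; // an empty word is treated as never occurring
        }
        int count = 0;
        int from = 0;
        while (true) {
            int start = text.indexOf(word, from);
            if (start < 0) {
                break; // no further candidates
            }
            int end = start + word.length();
            boolean leftOk = start == 0 || text.charAt(start - 1) == ' ';
            boolean rightOk = end == text.length() || text.charAt(end) == ' ';
            if (leftOk && rightOk) {
                count++;
            }
            from = start + 1; // continue searching just past this candidate
        }
        return count;
    }

    public static void main(String[] args) {
        String string = "the colored port should be black or white or brown";
        System.out.println(countWholeWord(string, "or"));
    }
}
```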
#3
0
I benchmarked three approaches: split-based and Matcher-based (both from the question), and Collections.frequency()-based (suggested in a comment above by @4castle). For each one I measured the total time of a loop repeated 10 million times. The split-based approach tends to be the most efficient:
String string = "the colored port should be black or white or brown";
String[] words = string.split(" ");
String word = "or";
int occurrences = 0;
for (int i = 0; i < words.length; i++) {
    if (words[i].equals(word)) {
        occurrences++;
    }
}
System.out.println(occurrences);
Next is the Collections.frequency()-based answer, with a slightly longer running time (~5% slower):
String string = "the colored port should be black or white or brown or";
String word = "or";
int count = Collections.frequency(Arrays.asList(string.split(" ")), word);
System.out.println("The word \"" + word + "\" is mentioned " + count + " times.");
The Matcher-based solution (from the question) is much slower (about 5 times the running time).
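For anyone who wants to reproduce this kind of comparison, here is a minimal timing sketch. It is not the code used for the measurements above, and the class, method names, and iteration count are assumptions. A plain System.nanoTime() loop is sensitive to JIT warm-up, so the numbers are only indicative; a dedicated harness such as JMH would be more reliable.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.function.IntSupplier;

class CountBenchmarkSketch {

    // Split on single spaces and compare each token to the word.
    static int countWithSplit(String text, String word) {
        int occurrences = 0;
        for (String w : text.split(" ")) {
            if (w.equals(word)) {
                occurrences++;
            }
        }
        return occurrences;
    }

    // Same split, but let Collections.frequency do the counting.
    static int countWithFrequency(String text, String word) {
        return Collections.frequency(Arrays.asList(text.split(" ")), word);
    }

    // Runs `task` `iterations` times and returns the elapsed nanoseconds.
    static long time(int iterations, IntSupplier task) {
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += task.getAsInt();
        }
        long elapsed = System.nanoTime() - start;
        if (sink < 0) {
            System.out.println(sink); // keeps the results observably used
        }
        return elapsed;
    }

    public static void main(String[] args) {
        String string = "the colored port should be black or white or brown";
        String word = "or";
        int iterations = 1_000_000; // assumed value; the answer above used 10 million

        // Warm-up pass so the timed runs are less affected by JIT compilation.
        time(iterations / 10, () -> countWithSplit(string, word));
        time(iterations / 10, () -> countWithFrequency(string, word));

        long splitNanos = time(iterations, () -> countWithSplit(string, word));
        long freqNanos = time(iterations, () -> countWithFrequency(string, word));
        System.out.println("split:     " + splitNanos / 1_000_000 + " ms");
        System.out.println("frequency: " + freqNanos / 1_000_000 + " ms");
    }
}
```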