在另一个字符串中搜索字符串数组的最有效方法

时间:2022-10-22 00:18:42

I have a large arrray of strings that looks something like this: String temp[] = new String[200000].

我有一大串字符串,看起来是这样的:String temp[] = new String[200000]。

I have another String, let's call it bigtext. What I need to do is iterate through each entry of temp, checking to see if that entry is found in bigtext and then do some work based on it. So, the skeletal code looks something like this:

我有另一个字符串,我们叫它bigtext。我需要做的是遍历temp的每个条目,检查该条目是否在bigtext中找到,然后根据它做一些工作。骨架代码是这样的

for (int x = 0; x < temp.length; x++) {
  if (bigtext.indexOf(temp[x]) > -1 {

  //do some stuff
  } else continue;
}

Because there are so many entries in temp and there are many instances of bigtext as well, I want to do this in the most efficient way. I am wondering if what I've outlined is the most efficient way to iterate through this search of if there are better ways to do this.

因为在temp中有很多条目,并且也有很多bigtext实例,所以我想以最有效的方式来做这件事。我想知道我所概述的是遍历搜索是否有更好的方法来实现这一点的最有效的方法。

Thanks,

谢谢,

Elliott

艾略特

8 个解决方案

#1


14  

I think you're looking for an algorithm like Rabin-Karp or Aho–Corasick which are designed to search in parallel for a large number of sub-strings in a text.

我认为你正在寻找一种算法,比如Rabin-Karp或Aho-Corasick,它被设计成在文本中并行搜索大量子字符串。

#2


10  

Note that your current complexity is O(|S1|*n), where |S1| is the length of bigtext and n is the number of elements in your array, since each search is actually O(|S1|).

注意,当前的复杂性是O(|S1|*n),其中|S1|是bigtext的长度,n是数组中的元素数量,因为每个搜索实际上是O(|S1|)。

By building a suffix tree from bigtext, and iterating on elements in the array, you could bring this complexity down to O(|S1| + |S2|*n), where |S2| is the length of the longest string in the array. Assuming |S2| << |S1|, it could be much faster!

通过从bigtext构建后缀树,并对数组中的元素进行迭代,可以将这种复杂性降低到O(|S1| + | |*n),其中|S2|是数组中最长字符串的长度。假设|S2| < |S1|,可能会快得多!

Building a suffix tree is O(|S1|), and each search is O(|S2|). You don't have to go through bigtext to find it, just on the relevant piece of the suffix tree. Since it is done n times, you get total of O(|S1| + n*|S2|), which is asymptotically better then the naive implementation.

构建一个后缀树是O(|S1|),每个搜索是O(|S2|)。你不需要通过bigtext找到它,只需要在后缀树的相关部分。因为它被做了n次,你会得到O的总数(|S1| + n*|S2|),这比单纯的实现要好得多。

#3


8  

If you have additional information about temp, you can maybe improve the iteration.

如果您有关于temp的其他信息,您可能可以改进迭代。

You can also reduce the time spent, if you parallelize the iteration.

如果将迭代并行化,还可以减少花费的时间。

#4


5  

Efficency depends heavily on what is valuable to you.

效率在很大程度上取决于什么对你有价值。

Are you willing to increase memory for reduced time? Are you willing to increase time for efficent handling of large data sets? Are you willing to increase contention for CPU cores? Are you willing to do pre-processing (perhaps one or more forms of indexing) to reduce the lookup time in a critical section.

你愿意为了减少时间而增加记忆吗?您是否愿意增加有效处理大型数据集的时间?您是否愿意增加CPU内核的争用?您是否愿意进行预处理(可能是一种或多种形式的索引),以减少关键部分中的查找时间。

With you offering, you indicate the entire portion you want made more efficent, but that means you have excluded any portion of the code or system where the trade-off can be made. This forces one to imagine what you care about and what you don't care about. Odds are excellent that all the posted answers are both correct and incorrect depending on one's point of view.

通过提供,您可以指出您想要使整个部分更有效,但是这意味着您已经排除了代码或系统中可以进行权衡的任何部分。这迫使人们去想象你在乎什么,不关心什么。根据一个人的观点,所有公布的答案都是正确的和错误的可能性是非常大的。

#5


3  

An alternative approach would be to tokenize the text - let's say split by common punctuation. Then put these tokens in to a Set and then find the intersect with the main container.

另一种方法是对文本进行标记——比方说用常见的标点符号进行拆分。然后将这些令牌放入一个集合中,然后找到与主容器的交叉点。

Instead of an array, hold the words in a Set too. The intersection can be calculated by simply doing

将单词保存在一个集合中,而不是数组。这个交点可以通过简单的操作来计算

bidTextSet.retainAll(mainWordsSet);

What remains will be the words that occur in bigText that are in your "dictionary".

剩下的将是在你的“字典”中出现在bigText中的单词。

#6


3  

Use a search algorithm like Boyer-Moore. Google Boyer Moore, and it has lots of links which explain how it works. For instance, there is a Java example.

使用博伊尔-摩尔这样的搜索算法。谷歌Boyer Moore,它有很多链接来解释它是如何工作的。例如,有一个Java示例。

#7


2  

I'm afraid it's not efficient at all in any case!

我担心它在任何情况下都没有效率!

To pick the right algorithm, you need to provide some answers:

要选择正确的算法,您需要提供一些答案:

  1. What can be computed off-line? That is, is bigText known in advance? I guess temp is not, from its name.
  2. 什么可以离线计算?也就是说,bigText是预先知道的吗?我猜temp不是,从它的名字来看。
  3. Are you actually searching for words? If so, index them. Bloom filter can help, too.
  4. 你真的在找单词吗?如果是这样,指数。Bloom filter也有帮助。
  5. If you need a bit of fuzziness, may stem or soundex can do the job?
  6. 如果你需要一点模糊,可以用stem或soundex来做这个工作吗?

Sticking to strict inclusion test, you might build a trie from your temp array. It would prevent searching the same sub-string several times.

坚持严格的包含测试,您可以从临时数组构建一个trie。它将防止多次搜索相同的子字符串。

#8


1  

That is a very efficient approach. You can improve it slightly by only evaluating temp.length once

这是一个非常有效的方法。你可以通过只计算一次时间长度来稍微改进它

for(int x = 0, len = temp.length; x < len; x++)

Although you don't provide enough detail of your program, it's quite possible you can find a more efficent approach by redesigning your program.

尽管您没有提供足够的程序细节,但是通过重新设计程序,您很有可能找到一种更有效的方法。

#1


14  

I think you're looking for an algorithm like Rabin-Karp or Aho–Corasick which are designed to search in parallel for a large number of sub-strings in a text.

我认为你正在寻找一种算法,比如Rabin-Karp或Aho-Corasick,它被设计成在文本中并行搜索大量子字符串。

#2


10  

Note that your current complexity is O(|S1|*n), where |S1| is the length of bigtext and n is the number of elements in your array, since each search is actually O(|S1|).

注意,当前的复杂性是O(|S1|*n),其中|S1|是bigtext的长度,n是数组中的元素数量,因为每个搜索实际上是O(|S1|)。

By building a suffix tree from bigtext, and iterating on elements in the array, you could bring this complexity down to O(|S1| + |S2|*n), where |S2| is the length of the longest string in the array. Assuming |S2| << |S1|, it could be much faster!

通过从bigtext构建后缀树,并对数组中的元素进行迭代,可以将这种复杂性降低到O(|S1| + | |*n),其中|S2|是数组中最长字符串的长度。假设|S2| < |S1|,可能会快得多!

Building a suffix tree is O(|S1|), and each search is O(|S2|). You don't have to go through bigtext to find it, just on the relevant piece of the suffix tree. Since it is done n times, you get total of O(|S1| + n*|S2|), which is asymptotically better then the naive implementation.

构建一个后缀树是O(|S1|),每个搜索是O(|S2|)。你不需要通过bigtext找到它,只需要在后缀树的相关部分。因为它被做了n次,你会得到O的总数(|S1| + n*|S2|),这比单纯的实现要好得多。

#3


8  

If you have additional information about temp, you can maybe improve the iteration.

如果您有关于temp的其他信息,您可能可以改进迭代。

You can also reduce the time spent, if you parallelize the iteration.

如果将迭代并行化,还可以减少花费的时间。

#4


5  

Efficency depends heavily on what is valuable to you.

效率在很大程度上取决于什么对你有价值。

Are you willing to increase memory for reduced time? Are you willing to increase time for efficent handling of large data sets? Are you willing to increase contention for CPU cores? Are you willing to do pre-processing (perhaps one or more forms of indexing) to reduce the lookup time in a critical section.

你愿意为了减少时间而增加记忆吗?您是否愿意增加有效处理大型数据集的时间?您是否愿意增加CPU内核的争用?您是否愿意进行预处理(可能是一种或多种形式的索引),以减少关键部分中的查找时间。

With you offering, you indicate the entire portion you want made more efficent, but that means you have excluded any portion of the code or system where the trade-off can be made. This forces one to imagine what you care about and what you don't care about. Odds are excellent that all the posted answers are both correct and incorrect depending on one's point of view.

通过提供,您可以指出您想要使整个部分更有效,但是这意味着您已经排除了代码或系统中可以进行权衡的任何部分。这迫使人们去想象你在乎什么,不关心什么。根据一个人的观点,所有公布的答案都是正确的和错误的可能性是非常大的。

#5


3  

An alternative approach would be to tokenize the text - let's say split by common punctuation. Then put these tokens in to a Set and then find the intersect with the main container.

另一种方法是对文本进行标记——比方说用常见的标点符号进行拆分。然后将这些令牌放入一个集合中,然后找到与主容器的交叉点。

Instead of an array, hold the words in a Set too. The intersection can be calculated by simply doing

将单词保存在一个集合中,而不是数组。这个交点可以通过简单的操作来计算

bidTextSet.retainAll(mainWordsSet);

What remains will be the words that occur in bigText that are in your "dictionary".

剩下的将是在你的“字典”中出现在bigText中的单词。

#6


3  

Use a search algorithm like Boyer-Moore. Google Boyer Moore, and it has lots of links which explain how it works. For instance, there is a Java example.

使用博伊尔-摩尔这样的搜索算法。谷歌Boyer Moore,它有很多链接来解释它是如何工作的。例如,有一个Java示例。

#7


2  

I'm afraid it's not efficient at all in any case!

我担心它在任何情况下都没有效率!

To pick the right algorithm, you need to provide some answers:

要选择正确的算法,您需要提供一些答案:

  1. What can be computed off-line? That is, is bigText known in advance? I guess temp is not, from its name.
  2. 什么可以离线计算?也就是说,bigText是预先知道的吗?我猜temp不是,从它的名字来看。
  3. Are you actually searching for words? If so, index them. Bloom filter can help, too.
  4. 你真的在找单词吗?如果是这样,指数。Bloom filter也有帮助。
  5. If you need a bit of fuzziness, may stem or soundex can do the job?
  6. 如果你需要一点模糊,可以用stem或soundex来做这个工作吗?

Sticking to strict inclusion test, you might build a trie from your temp array. It would prevent searching the same sub-string several times.

坚持严格的包含测试,您可以从临时数组构建一个trie。它将防止多次搜索相同的子字符串。

#8


1  

That is a very efficient approach. You can improve it slightly by only evaluating temp.length once

这是一个非常有效的方法。你可以通过只计算一次时间长度来稍微改进它

for(int x = 0, len = temp.length; x < len; x++)

Although you don't provide enough detail of your program, it's quite possible you can find a more efficent approach by redesigning your program.

尽管您没有提供足够的程序细节,但是通过重新设计程序,您很有可能找到一种更有效的方法。