C# - Fastest way to find one of a set of strings in another string

I need to check whether a string contains any swear words.

Following some advice from another question here, I made a HashSet containing the words:

HashSet<string> swearWords = new HashSet<string>() { "word_one", "word_two", "etc" };

Now I need to see if any of the values contained in swearWords are in my string.

I've seen it done the other way round, eg:

swearWords.Contains(myString)

But this will return false unless myString is exactly one of the words.

What's the fastest way to check if any of the words in the HashSet are in myString?

NB: I figure I can use a foreach loop to check each word in turn and break if a match is found; I'm just wondering whether there's a faster way.
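
For reference, the foreach baseline mentioned above might look like the following. This is only a sketch: the class and method names (SwearCheck, ContainsAny) are made up for illustration, and the case-insensitive ordinal comparison is an assumption about what the check should do.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not from the original question.
static class SwearCheck
{
    // Returns true as soon as any word occurs as a substring of text.
    public static bool ContainsAny(string text, IEnumerable<string> words)
    {
        foreach (string word in words)
        {
            // Ordinal, case-insensitive substring search (an assumption).
            if (text.IndexOf(word, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                return true; // stop at the first match
            }
        }
        return false;
    }
}
```

It would be called as SwearCheck.ContainsAny(myString, swearWords). Note that this is a substring test, so it also matches words embedded inside longer words.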

5 answers

#1

You could try a regex, but I'm not sure it's faster.

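// Escape each word so regex metacharacters in the list match literally (needs System.Linq).
Regex rx = new Regex(string.Join("|", swearWords.Select(w => Regex.Escape(w))));
bool found = rx.IsMatch(myString);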

#2

If you keep your swear words in a container that implements IEnumerable<>:

var containsSwears = swearWords.Any(w => myString.Contains(w));

Note: HashSet<> implements IEnumerable<>

#3

If you have a really large set of swear words, you could use the Aho–Corasick algorithm: http://tomasp.net/blog/ahocorasick.aspx
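
To give an idea of what that involves, here is a minimal sketch of an Aho–Corasick matcher. It is not the code from the linked article; the class name AhoCorasickMatcher and its members are hypothetical, matching is case-sensitive, and it only reports whether any word occurs, not where.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical class for illustration; not part of .NET or the linked article.
sealed class AhoCorasickMatcher
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Next = new Dictionary<char, Node>();
        public Node Fail;     // node for the longest proper suffix that is also a trie prefix
        public bool IsMatch;  // a word ends here or somewhere along the failure chain
    }

    private readonly Node root = new Node();

    public AhoCorasickMatcher(IEnumerable<string> words)
    {
        // Step 1: build a trie of all words.
        foreach (string word in words)
        {
            if (string.IsNullOrEmpty(word)) continue;
            Node node = root;
            foreach (char c in word)
            {
                Node child;
                if (!node.Next.TryGetValue(c, out child))
                {
                    child = new Node();
                    node.Next[c] = child;
                }
                node = child;
            }
            node.IsMatch = true;
        }

        // Step 2: breadth-first pass that sets the failure links.
        var queue = new Queue<Node>();
        foreach (Node child in root.Next.Values)
        {
            child.Fail = root;
            queue.Enqueue(child);
        }
        while (queue.Count > 0)
        {
            Node node = queue.Dequeue();
            foreach (KeyValuePair<char, Node> edge in node.Next)
            {
                Node fail = node.Fail;
                while (fail != root && !fail.Next.ContainsKey(edge.Key))
                {
                    fail = fail.Fail;
                }
                Node target;
                edge.Value.Fail = fail.Next.TryGetValue(edge.Key, out target) ? target : root;
                // Inherit matches from the suffix, e.g. "abc" also ends with "bc".
                edge.Value.IsMatch |= edge.Value.Fail.IsMatch;
                queue.Enqueue(edge.Value);
            }
        }
    }

    // Scans text once and returns true at the first occurrence of any word.
    public bool ContainsAny(string text)
    {
        Node node = root;
        foreach (char c in text)
        {
            while (node != root && !node.Next.ContainsKey(c)) node = node.Fail;
            Node next;
            if (node.Next.TryGetValue(c, out next)) node = next;
            if (node.IsMatch) return true;
        }
        return false;
    }
}
```

The matcher would be built once, e.g. new AhoCorasickMatcher(swearWords), and reused for every input string. The scan is a single pass over the text regardless of how many words are in the set, which is why it tends to pay off only for large word lists.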

#4

You could split "myString" into an IEnumerable type, and then use "Overlaps" on them?

http://msdn.microsoft.com/en-us/library/bb355623(v=vs.90).aspx
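
A rough sketch of that idea follows. The class name OverlapCheck and the separator list are assumptions made for illustration; real text would need a more complete set of separators.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration.
static class OverlapCheck
{
    // Assumed separators; extend as needed for real input.
    private static readonly char[] Separators =
        { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' };

    public static bool ContainsAny(string text, HashSet<string> swearWords)
    {
        string[] tokens = text.Split(Separators, StringSplitOptions.RemoveEmptyEntries);
        // Overlaps is true if the set shares at least one element with tokens.
        return swearWords.Overlaps(tokens);
    }
}
```

Overlaps uses the comparer the set was created with, so constructing the set with StringComparer.OrdinalIgnoreCase should make the check case-insensitive.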

(P.S. Long time no see...)

EDIT: Just noticed error in my previous answer.

#5

The main problem with such schemes is defining what a word is in the context of the string you want to check.

Naive implementations such as those using input.Contains simply do not have the concept of a word; they will "detect" swear words even when that was not the intent.
Breaking words on whitespace is not going to cut it (consider also punctuation marks, etc).
Breaking on characters other than whitespace is going to raise culture issues: what characters are considered word-characters exactly?
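
To make the first point concrete, here is a small sketch comparing a substring test with a whole-word test. The class and method names are hypothetical, "bad" stands in for a banned word, and splitting on runs of non-letters is just one possible definition of a word.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical demo class for illustration.
static class WordMatchDemo
{
    // Substring test: has no notion of word boundaries.
    public static bool NaiveContains(string input, string banned)
    {
        return input.Contains(banned);
    }

    // Whole-word test: split on runs of non-letter characters (an assumption).
    public static bool WholeWordContains(string input, string banned)
    {
        return Regex.Split(input, @"\P{L}+").Contains(banned);
    }
}
```

With the input "my badge", NaiveContains reports a hit for "bad" while WholeWordContains does not; with "so bad!" both report a hit.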

Assuming that your stopword list only uses the latin alphabet, a practical choice would be to assume that words are sequences consisting of only latin characters. So a reasonable starting solution would be

// Regex.Split takes the input first, then the pattern.
var words = Regex.Split(myString, @"[^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Pc}\p{Lm}]");

The regex above is the standard class \W modified so that digits are also treated as non-word characters; for more info, see http://msdn.microsoft.com/en-us/library/20bw873z.aspx. For other approaches, see this question and possibly the CodeProject link supplied in the accepted answer.

Having split the input string, you can iterate over words and replace those that match anything in your list (use swearWords.Contains(word) to check) or simply detect if there are any matches at all with

var anySwearWords = words.Intersect(swearWords).Any();
