What is the fastest way to check whether any word from one string occurs in another string?

Time: 2021-08-12 01:38:01

I have a string of words; let's call them bad:

bad = "foo bar baz"

I can keep this string as a whitespace separated string, or as a list:

bad = bad.split(" ");

If I have another string, like so:

str = "This is my first foo string"

What's the fastest way to check if any word from the bad string is within my comparison string, and what's the fastest way to remove said word if it's found?

#Find if a word is there
bad.split(" ").each do |word|
  found = str.include?(word)
end

#Remove the word
bad.split(" ").each do |word|
  str.gsub!(/#{word}/, "")
end

8 answers

#1


9  

If the list of bad words gets huge, a hash is a lot faster:

    require 'benchmark'

    bad = ('aaa'..'zzz').to_a    # 17576 words
    str= "What's the fasted way to check if any word from the bad string is within my "
    str += "comparison string, and what's the fastest way to remove said word if it's "
    str += "found" 
    str *= 10

    badex = /\b(#{bad.join('|')})\b/i

    bad_hash = {}
    bad.each{|w| bad_hash[w] = true}

    n = 10
    Benchmark.bm(10) do |x|

      x.report('regex:') {n.times do 
        str.gsub(badex,'').squeeze(' ')
      end}

      x.report('hash:') {n.times do
        str.gsub(/\b\w+\b/){|word| bad_hash[word] ? '': word}.squeeze(' ')
      end}

    end
                user     system      total        real
regex:     10.485000   0.000000  10.485000 ( 13.312500)
hash:       0.000000   0.000000   0.000000 (  0.000000)
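
Note that the hash lookup in this benchmark is case-sensitive while the regex uses the `i` flag, so the two reports are not doing quite the same work. A minimal sketch of the same lookup idea with case folding, using a `Set` instead of a Hash (the helper name `remove_bad_words` and the sample data are made up for illustration, not taken from the answer):

```ruby
require 'set'

# Hypothetical helper: blanks out every word whose downcased form is in bad_set,
# then collapses the runs of spaces left behind.
def remove_bad_words(text, bad_set)
  text.gsub(/\b\w+\b/) { |word| bad_set.include?(word.downcase) ? '' : word }.squeeze(' ')
end

bad_set = Set.new(%w[foo bar baz])
cleaned = remove_bad_words("This is my first Foo string", bad_set)
puts cleaned
```

The lookup cost per word stays roughly constant however large the list grows, which is the point the benchmark is making.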

#2


3  

bad = "foo bar baz"

=> "foo bar baz"

str = "This is my first foo string"

=> "This is my first foo string"

(str.split(' ') - bad.split(' ')).join(' ')

=> "This is my first string"
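
Array subtraction compares whole elements exactly, so a differently cased hit such as `Foo` survives it. A small sketch of a case-folding variant, assuming plain whitespace-separated words (the sample string and variable names are illustrative only):

```ruby
bad = "foo bar baz".split
str = "This is my first Foo string"

# Exact array subtraction leaves "Foo" in place because "Foo" != "foo".
exact = (str.split - bad).join(' ')

# Rejecting on the downcased word also catches differently cased hits.
folded = str.split.reject { |word| bad.include?(word.downcase) }.join(' ')

puts exact
puts folded
```

Neither form handles punctuation attached to a word (`foo,`), which is where a word-boundary regex is more convenient.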

#3


1  

All the solutions have problems with catching the bad words if the case does not match. The regex solution is easiest to fix by adding the ignore-case flag:

badex = /\b(#{bad.split.join('|')})\b/i

In addition, using "String".include?(" String ") will lead to boundary problems with the first and last words in the string, and with strings where the target words are next to punctuation or are hyphenated. Testing for those situations will require a lot of extra code. Because of that, I think the regex solution is the best one. It is not the fastest, but it is more flexible right out of the box, and if the other algorithms are tweaked to handle case folding and compound words, the regex solution might pull ahead.
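
To make the boundary problem concrete, here is a short sketch comparing substring checks with a word-boundary regex. The sample sentence is made up for illustration:

```ruby
str = "This is my first fooo string, not bar."

# Plain substring search: "foo" is found inside "fooo" (a false positive).
substring_hit = str.include?("foo")

# Padding with spaces avoids that, but misses "bar" because it is followed by ".".
padded_hit = str.include?(" bar ")

# Word boundaries handle both cases.
foo_word = str.match?(/\bfoo\b/i)
bar_word = str.match?(/\bbar\b/i)

puts [substring_hit, padded_hit, foo_word, bar_word].inspect
```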

#!/usr/bin/ruby

require 'benchmark'

bad = 'foo bar baz comparison'
badex = /\b(#{bad.split.join('|')})\b/i
str = "What's the fasted way to check if any word from the bad string is within my comparison string, and what's the fastest way to remove said word if it's found?" * 10

n = 10_000
Benchmark.bm(20) do |x|
  x.report('regex:') do 
    n.times { str.gsub(badex,'').gsub('  ',' ') }
  end

  x.report('regex with squeeze:') do 
    n.times{ str.gsub(badex,'').squeeze(' ') }
  end

  x.report('array subtraction') do
    n.times { (str.split(' ') - bad.split(' ')).join(' ') }
  end
end

I made the str variable a lot longer, to make the routines work a bit harder.

                          user     system      total        real
regex:                0.740000   0.010000   0.750000 (  0.752846)
regex with squeeze:   0.570000   0.000000   0.570000 (  0.581304)
array subtraction     1.430000   0.010000   1.440000 (  1.449578)

Doh!, I'm too used to how other languages handle their benchmarks. Now I got it working and looking better!

Just a little comment about what it looks like the OP is trying to do: blacklisted-word removal is easy to fool and a pain to maintain. L33t-sp34k makes it trivial to sneak words through. Depending on the application, people will consider it a game to find ways to push offensive words past the filtering. The best solution I found when I was asked to work on this was to create a generator that would create all the variations on a word and dump them into a database, where some process could check as soon as possible rather than in real time. A million small strings being checked can take a while if you are searching through a long list of offensive words; I'm sure we could come up with quite a list of things that someone would find offensive, but that's an exercise for a different day.
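
As a rough idea of what such a variation generator could look like, here is a sketch that expands a word through a substitution table. The table contents and the method name `variations` are hypothetical; a real table would be far larger and the output would go to a database rather than an array:

```ruby
# Hypothetical substitution table mapping a letter to its accepted spellings.
leet = { 'a' => %w[a 4], 'e' => %w[e 3], 'o' => %w[o 0] }

# Returns every spelling of `word` obtainable from `table`.
def variations(word, table)
  return [word] if word.empty?
  # One list of alternatives per character; unknown characters map to themselves.
  first, *rest = word.chars.map { |c| table.fetch(c, [c]) }
  first.product(*rest).map(&:join)
end

variants = variations("foo", leet)
puts variants.inspect
```

The number of variants grows multiplicatively with word length, which is one reason to precompute them offline rather than per request.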

I haven't seen anything similar in Ruby to Perl's Regexp::Assemble, but that was a good way to go after this sort of problem. You can pass it an array of words, plus options for case-folding and word-boundaries, and it will spit out a regex pattern that matches all the words, factoring out their commonalities to produce the smallest pattern that matches every word in the list. The problem after that is locating which word in the original string matched the hits found by the pattern, so they can be removed. Differences in word case and hits within compound words make that replacement more interesting.
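
Ruby's core `Regexp.union` covers part of this: it escapes and joins a list of words into one alternation, although it does not factor out common prefixes the way Regexp::Assemble is described above. A sketch, with made-up sample data, that also records which text each hit matched:

```ruby
words = %w[foo bar baz]

# Wrap the alternation in word boundaries and make it case-insensitive.
pattern = /\b(?:#{Regexp.union(words).source})\b/i

hits = []
cleaned = "A Foo and a bar-stool, but not fooo".gsub(pattern) do |match|
  hits << match   # remember the exact text that matched, original case included
  ''
end

puts hits.inspect
puts cleaned
```

Note that `bar` is removed from the hyphenated `bar-stool`, since `-` counts as a word boundary. That is the compound-word complication mentioned above.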

And we won't even go into words that are benign or offensive depending on the context.

I added a slightly more comprehensive test for the array-subtraction benchmark, to fit how it would need to work in a real piece of code. The if clause is specified in the answer, so the benchmark now reflects it:

#!/usr/bin/env ruby

require 'benchmark'

bad = 'foo bar baz comparison'
badex = /\b(#{bad.split.join('|')})\b/i
str = "What's the fasted way to check if any word from the bad string is within my comparison string, and what's the fastest way to remove said word if it's found?" * 10

str_split = str.split
bad_split = bad.split

n = 10_000
Benchmark.bm(20) do |x|
  x.report('regex') do 
    n.times { str.gsub(badex,'').gsub('  ',' ') }
  end

  x.report('regex with squeeze') do 
    n.times{ str.gsub(badex,'').squeeze(' ') }
  end

  x.report('bad.any?') do
    n.times { 
      if (bad_split.any? { |bw| str.include?(bw) })
        (str_split - bad_split).join(' ')
      end
    }
  end

  x.report('array subtraction') do
    n.times { (str_split - bad_split).join(' ') }
  end

end

with two test runs:

ruby test.rb 
                          user     system      total        real
regex                 1.000000   0.010000   1.010000 (  1.001093)
regex with squeeze    0.870000   0.000000   0.870000 (  0.873224)
bad.any?              1.760000   0.000000   1.760000 (  1.762195)
array subtraction     1.350000   0.000000   1.350000 (  1.346043)

ruby test.rb 
                          user     system      total        real
regex                 1.000000   0.010000   1.010000 (  1.004365)
regex with squeeze    0.870000   0.000000   0.870000 (  0.868525)
bad.any?              1.770000   0.000000   1.770000 (  1.775567)
array subtraction     1.360000   0.000000   1.360000 (  1.359100)

#4


0  

I usually make a point of not optimizing without measurements, but here's a wag:

To make it fast, you should iterate through each string once. You want to avoid a loop with bad count * str count inner compares. So, you could build a big regexp and gsub with it.

(adding foo variants to test word boundary works)

str = "This is my first foo fooo ofoo string"

=> "This is my first foo fooo ofoo string"

badex = /\b(#{bad.split.join('|')})\b/

=> /\b(foo|bar|baz)\b/

str.gsub(badex,'').gsub('  ',' ')

=> "This is my first fooo ofoo string"

Of course the huge resulting regexp might be as slow as the implied nested iteration in my other answer. Only way to know is to measure.

#5


0  

bad = %w(foo bar baz)
str = "This is my first foo string"

# find the first word in the list
found = bad.find {|word| str.include?(word)}

# remove it
str[found] = '' if found   # guard: str[nil] = '' raises TypeError; str => "This is my first  string"

#6


0  

I'd benchmark this:

bad = "foo bar baz".split(' ')
str = "This is my first foo string".split(' ')

# 1. What's the fastest way to check if any word from the bad string is within my comparison string
p bad.any? { |bw| str.include?(bw) }

# 2. What's the fastest way to remove said word if it's found?
p (str - bad).join(' ')

any? stops checking as soon as it sees a match. If you can order your bad words by their probability, you can save some cycles.
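
A quick way to see the short-circuiting is to record which bad words the block is actually called with. This is only an illustrative sketch; the `checked` array is not part of the answer:

```ruby
bad = %w[foo bar baz]
words = "This is my first foo string".split

checked = []
found = bad.any? do |bw|
  checked << bw          # record which bad words were actually tested
  words.include?(bw)
end

puts found
puts checked.inspect
```

Because `foo` is first in the list and is present, `bar` and `baz` are never tested; putting the most likely words first is what saves the cycles.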

#7


0  

Here's one that will check for words and phrases.

 def checkContent(str)
     bad = ["foo", "bar", "this place sucks", "or whatever"]

     # may be best to map and singularize everything as well. 
     # maybe add some regex to catch those pesky, "How i make $69 dollars each second online..."
     # maybe apply some comparison stuff to check for weird characters in those pesky, "How i m4ke $69 $ollars an hour"


     bad_hash = {}
     bad_phrase_hash = {}

     bad.map(&:downcase).each do |word|
         words = word.split().map(&:downcase)
         if words.length > 1
             words.each do |inner|
                if bad_hash.key?(inner)
                    if bad_hash[inner].is_a?(Hash) && !bad_hash[inner].key?(words.length)
                         bad_hash[inner][words.length] = true
                    elsif bad_hash[inner] === 1
                        bad_hash[inner] = {1=>true,words.length => true}
                    end
                else
                    bad_hash[inner] = {words.length => true}
                end
             end
             bad_phrase_hash[word] = true
         else
             bad_hash[word] = 1
         end
     end

     string = str.split().map(&:downcase)
     string.each_with_index do |word,index|
        if bad_hash.key?(word)
            if bad_hash[word].is_a?(Hash)
                if bad_hash[word].key?(1)
                    return false
                else
                    bad_hash[word].keys.sort.each do |length|
                        value = string[index...(index + length)].join(" ")
                        if bad_phrase_hash.key?(value)
                            return false
                        end
                    end
                end
            else
                return false
            end
        end
     end
     return true
  end

#8


-2  

The include? method is what you need. The Ruby String specification says:

str.include?(string) -> true or false. Returns true if str contains the given string or character.

"hello".include? "lo" -> true

"hello".include? "ol" -> false

"hello".include? ?h -> true

Note that it is O(n), whereas what you proposed is O(n^2).
