Removing excess junk words from a string or array of strings

I have millions of arrays that each contain about five strings. I am trying to remove all of the "junk words" (for lack of a better description) from the arrays, such as articles and other common words like "to", "and", "or", "the", "a" and so on.

For example, one of my arrays has these six strings:

"14000"
"Things"
"to"
"Be"
"Happy"
"About"

I want to remove the "to" from the array.

One solution is to do:

excess_words = ["to","and","or","the","a"]
cleaned_array = dirty_array.reject {|term| excess_words.include? term}

But I am hoping to avoid manually typing every excess word. Does anyone know of a Rails function or helper that would help in this process? Or perhaps an array of "junk words" already written?

2 Answers

#1

Dealing with stopwords is easy, but I'd suggest you do it BEFORE you split the string into the component words.

Building a fairly simple regular expression can make short work of the words:

STOPWORDS = /\b(?:#{ %w[to and or the a].join('|') })\b/i
# => /\b(?:to|and|or|the|a)\b/i

clean_string = 'to into and sandbar or forest the thesis a algebra'.gsub(STOPWORDS, '')
# => " into  sandbar  forest  thesis  algebra"

clean_string.split
# => ["into", "sandbar", "forest", "thesis", "algebra"]

How do you handle them if you get them already split? I'd join(' ') the array to turn it back into a string, then run the above code, which returns the array again.

incoming_array = [
  "14000",
  "Things",
  "to",
  "Be",
  "Happy",
  "About",
]

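# Same pattern as above; skip this line if STOPWORDS is already defined,
# since reassigning a constant makes Ruby print a warning.
STOPWORDS = /\b(?:#{ %w[to and or the a].join('|') })\b/i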
# => /\b(?:to|and|or|the|a)\b/i

incoming_array = incoming_array.join(' ').gsub(STOPWORDS, '').split
# => ["14000", "Things", "Be", "Happy", "About"]
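
One caveat with join/split is that any element containing a space gets broken into several elements. If that matters for your data, a per-element variant is one option. The following is only a sketch: the names STOPWORD_LIST, STOPWORD_PATTERN and strip_stopwords are made up for illustration, and the word list is deliberately tiny.

```ruby
# Hypothetical, short stopword list for illustration only.
STOPWORD_LIST = %w[to and or the a].freeze

# Regexp.union escapes each entry, which helps if a word ever contains
# a regex metacharacter such as "." or "+".
STOPWORD_PATTERN = /\b(?:#{Regexp.union(STOPWORD_LIST).source})\b/i

# Removes stopwords inside each element without re-splitting it,
# collapses the leftover double spaces, and drops elements that end up empty.
def strip_stopwords(strings)
  strings
    .map { |s| s.gsub(STOPWORD_PATTERN, '').squeeze(' ').strip }
    .reject(&:empty?)
end

result = strip_stopwords(['14000', 'Things', 'to', 'Be', 'Happy', 'About', 'Gone with the Wind'])
# => ["14000", "Things", "Be", "Happy", "About", "Gone with Wind"]
```

This runs one gsub per element instead of one per array, so it trades some speed for keeping multi-word elements intact.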

You could try to use Array's set operations, but you'll run afoul of the case sensitivity of the words, forcing you to iterate over the stopwords and the arrays which will run a LOT slower.
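
To make the case-sensitivity problem concrete, here is a small sketch (the variable names are illustrative). Plain array difference compares strings exactly, so a capitalized "To" slips through; getting a case-insensitive match means visiting each word and normalizing it, for example with downcase and a Set lookup.

```ruby
require 'set'

# Illustrative stopword list; not a complete one.
stopwords = %w[to and or the a]
words = %w[14000 Things To Be Happy About]

# Array difference is an exact, case-sensitive comparison, so "To" is kept.
naive = words - stopwords
# => ["14000", "Things", "To", "Be", "Happy", "About"]

# Downcasing each word before the lookup makes the match case-insensitive,
# at the cost of iterating over every word.
stopword_set = stopwords.to_set
cleaned = words.reject { |word| stopword_set.include?(word.downcase) }
# => ["14000", "Things", "Be", "Happy", "About"]
```

This sketch says nothing about speed; with millions of arrays it would be worth benchmarking it against the regex approach on your own data.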

Take a look at these two answers for some added tips on how you can build very powerful patterns making it easy to match thousands of strings:

"How do I ignore file types in a web crawler?"
"Is there an efficient way to perform hundreds of text substitutions in Ruby?"

#2

All you need is a list of English stopwords. You can find it here, or google for 'english stopwords list'
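
Once you have such a list, loading it might look roughly like this. The sketch assumes a plain-text format with one word per line and optional "#" comment lines, which is an assumption about the file rather than a fact about any particular list; a StringIO stands in for the real file so the example is self-contained.

```ruby
require 'set'
require 'stringio'

# Stand-in for a downloaded list (assumed format: one word per line,
# blank lines and "#" comments allowed).
list_io = StringIO.new("# sample list\nto\nand\n\nThe\na\n")

stopwords = list_io.each_line
                   .map(&:strip)
                   .reject { |line| line.empty? || line.start_with?('#') }
                   .map(&:downcase)
                   .to_set

dirty_array = %w[14000 Things to Be Happy About]
cleaned_array = dirty_array.reject { |term| stopwords.include?(term.downcase) }
# => ["14000", "Things", "Be", "Happy", "About"]
```

With a real file you could swap the StringIO for something like File.foreach('stopwords.txt'), where the path is a hypothetical placeholder.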
