I need to remove any instances of 544 full-text stopwords from a user-entered search string, then format it to run a partial match full-text search in boolean mode.
input: "new york city", output: "+york* +city*" ("new" is a stopword).
I have an ugly solution that works: explode the search string into an array of words, look up each word in the array of stopwords, unset them if there is a match, implode the remaining words and finally run a regex to add the boolean mode formatting. There has to be a more elegant solution.
My question has 2 parts.
1) What do you think is the cleanest way to do this?
2) I solved part of the problem using a huge regex but this raised another question.
EDIT: This actually works. I'm embarrassed to say that the memory issue I was having (and believed was my regex) was actually generated later in the code due to the huge number of matches after filtering out stopwords.
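// Remove every stopword that appears as a whole word (\b is a word boundary).
$tmp = preg_replace('/(\b('.implode('|',$stopwords).')\b)+/','',$this->val);
// Wrap each remaining token as +token* for a boolean-mode partial match.
$boolified = preg_replace('/([^\s]+)/','+$1*',$tmp);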
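The two preg_replace() calls above can be wrapped in a small function. The sketch below is one possible hardening, not the original code: the function name boolify_search is hypothetical, and the preg_quote() escaping and the case-insensitive flag are assumptions added in case the stopword list contains characters such as apostrophes or the input is mixed case.

```php
<?php
// Hypothetical helper; $stopwords is assumed to be a flat array of words.
function boolify_search(string $input, array $stopwords): string
{
    // Escape each stopword so characters like "'" or "." are matched literally.
    $alternatives = array_map(
        function (string $w): string { return preg_quote($w, '/'); },
        $stopwords
    );
    // \b...\b matches whole words only; the i flag ignores case (an assumption).
    $pattern = '/\b(?:' . implode('|', $alternatives) . ')\b/i';
    $stripped = preg_replace($pattern, '', $input);
    // Prefix each remaining token with + and suffix it with *.
    return preg_replace('/\S+/', '+$0*', trim($stripped));
}

// Short illustrative list; the real one would hold all 544 words.
echo boolify_search('new york city', ['new', 'the', 'of']), "\n"; // prints: +york* +city*
```

Removing a stopword from the middle of the string leaves a double space behind, which the second replacement carries into its output.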
2 solutions
#1
Split the search string into an array of words, and then do one of the following:
- run array_diff() against the stopwords array
- or make the stopwords a hash and use hash lookups (if (isset($stopwords[$word])) ...)
- or keep the stopwords sorted and binary-search for each word
It's hard to say which will be faster; you might want to profile each option (and if you do, please share the results!).
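As a rough illustration of the first two options, here is a minimal sketch. The short $stopwords list, the lowercasing step, and the variable names are assumptions made for the example; the binary-search option is not shown.

```php
<?php
// Illustrative stopword list; the real one would hold all 544 words.
$stopwords = ['new', 'the', 'of', 'a'];
// Lowercase and split on whitespace, dropping empty pieces.
$words = preg_split('/\s+/', strtolower('new york city'), -1, PREG_SPLIT_NO_EMPTY);

// Option 1: array_diff() drops every word that also appears in $stopwords.
$keptDiff = array_values(array_diff($words, $stopwords));

// Option 2: flip the list once so each word costs a single hash lookup.
$stopSet = array_flip($stopwords);
$keptHash = array_values(array_filter(
    $words,
    function (string $w) use ($stopSet): bool { return !isset($stopSet[$w]); }
));

// Both options keep the same words; add the boolean-mode markers last.
$query = implode(' ', array_map(
    function (string $w): string { return "+{$w}*"; },
    $keptHash
));
echo $query, "\n"; // prints: +york* +city*
```

Building the query with implode() and array_map() also avoids the final regex pass, since the words are already in an array.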
#2
Build a trie (prefix tree) from the 544 words and walk through it with the input string letter by letter, jumping back to the root of the tree at the beginning of every new word. When you find a match at the end of a word, remove that word. This is O(n) in the length of the input string, provided the word list remains static so the tree only has to be built once.
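A minimal sketch of that walk, using nested PHP arrays as the tree. The function names, the 'end' marker key, and the assumption that the input is lowercase and space-separated are all illustrative choices, not part of the answer.

```php
<?php
// Build the tree as nested arrays keyed by single characters.
// The multi-character key 'end' marks a complete stopword, so it
// cannot collide with any single-character key.
function build_trie(array $stopwords): array
{
    $root = [];
    foreach ($stopwords as $word) {
        $node = &$root;
        foreach (str_split($word) as $ch) {
            if (!isset($node[$ch])) {
                $node[$ch] = [];
            }
            $node = &$node[$ch];
        }
        $node['end'] = true;
        unset($node); // break the reference before the next word
    }
    return $root;
}

// Walk the input once, character by character.
function strip_stopwords(string $input, array $trie): string
{
    $kept = [];
    $word = '';
    $node = $trie; // current position in the tree, or null once the word has left it
    $len = strlen($input);
    for ($i = 0; $i <= $len; $i++) {
        $ch = $i < $len ? $input[$i] : ' '; // virtual trailing space flushes the last word
        if ($ch === ' ') {
            // End of a word: keep it unless the walk stopped on an 'end' node.
            if ($word !== '' && !($node !== null && isset($node['end']))) {
                $kept[] = $word;
            }
            $word = '';
            $node = $trie; // jump back to the root for the next word
            continue;
        }
        $word .= $ch;
        $node = ($node !== null && isset($node[$ch])) ? $node[$ch] : null;
    }
    return implode(' ', $kept);
}

$trie = build_trie(['new', 'the', 'of']);
echo strip_stopwords('new york city', $trie), "\n"; // prints: york city
```

The tree would be built once and reused across searches; only the walk runs per query.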