Faster alternative to $.inArray in JavaScript / jQuery when pattern-matching strings

Date: 2021-03-11 16:59:51

I've got a large array of words in Javascript (~100,000), and I'd like to be able to quickly return a subset of them based on a text pattern.

For example, I'd like to return all the words that begin with a pattern so typing hap should give me ["happy", "happiness", "happening", etc, etc], as a result.

If it's possible I'd like to do this without iterating over the entire array.

Something like this is not working fast enough:

// data contains an array of beginnings of words e.g. 'hap'
$.each(data, function(key, possibleWord) {
    // $.inArray only finds exact matches and returns -1 when there is none
    var found = $.inArray(possibleWord, words);
    // do something if found
});

Any ideas on how I could quickly reduce the set to possible matches without iterating over the whole word set? The word array is in alphabetical order if that helps.

6 solutions

#1


3  

If you just want to search for prefixes there are data structures just for that, such as the Trie and Ternary search trees

A quick Google search turns up some promising JavaScript trie and autocomplete implementations:

http://ejohn.org/blog/javascript-trie-performance-analysis/

Autocomplete using a trie

http://odhyan.com/blog/2010/11/trie-implementation-in-javascript/
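
To make the trie idea concrete, here is a minimal sketch of a prefix lookup. It is not taken from either linked post; the names buildTrie and wordsWithPrefix are hypothetical, and the node layout (a children map plus an isWord flag) is just one common way to do it. Results come back in traversal order, which is not guaranteed to be alphabetical.

```javascript
// Build a trie: one node per character, with a flag marking complete words.
function buildTrie(words) {
    var root = { children: {}, isWord: false };
    for (var i = 0; i < words.length; i++) {
        var node = root;
        for (var j = 0; j < words[i].length; j++) {
            var ch = words[i].charAt(j);
            if (!node.children[ch]) {
                node.children[ch] = { children: {}, isWord: false };
            }
            node = node.children[ch];
        }
        node.isWord = true;
    }
    return root;
}

// Return every stored word that starts with `prefix`.
function wordsWithPrefix(root, prefix) {
    var node = root;
    // Walk down one node per prefix character; stop if the path is missing.
    for (var i = 0; i < prefix.length; i++) {
        node = node.children[prefix.charAt(i)];
        if (!node) {
            return [];
        }
    }
    // Collect every complete word below the prefix node (depth-first).
    var results = [];
    (function collect(current, soFar) {
        if (current.isWord) {
            results.push(soFar);
        }
        for (var ch in current.children) {
            collect(current.children[ch], soFar + ch);
        }
    })(node, prefix);
    return results;
}
```

The lookup cost depends on the prefix length and the number of matches rather than on the size of the whole word list, at the price of building the trie once up front and using more memory than a flat array.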

#2


1  

I have absolutely no idea if this is any faster (a jsperf test is probably in order...), but you can do it with one giant string and a RegExp search instead of arrays:

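var giantStringOfWords = giantArrayOfWords.join(' ');
function searchForBeginning(beginning, str) {
    // build the pattern from the prefix (not from the string being searched);
    // the 'g' flag makes match() return every matching word, not just the first
    var pattern = new RegExp('\\b' + beginning + '\\w*', 'g'),
        matches = str.match(pattern);
    return matches;
}

var hapResults = searchForBeginning('hap', giantStringOfWords);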
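
One caveat with building a RegExp from typed input: characters such as "." or "(" are treated as regex syntax. The sketch below shows one way to guard against that. The helper names escapeRegExp and searchForPrefix are hypothetical, and it assumes the words are joined with spaces and contain only characters matched by \w (a word containing an apostrophe or hyphen would be cut short).

```javascript
// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Return every word in the space-joined string that starts with `prefix`.
function searchForPrefix(prefix, joinedWords) {
    var pattern = new RegExp('\\b' + escapeRegExp(prefix) + '\\w*', 'g');
    // match() returns null when nothing matches, so fall back to an empty array.
    return joinedWords.match(pattern) || [];
}
```

Whether this beats an array scan for ~100,000 words is something only a benchmark in the target browsers can answer.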

#3


0  

The best approach is to structure the data better. Make an object with keys like "hap". That member holds an array of words (or word suffixes if you want to save space) or a separated string of words for regexp searching.
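
A rough sketch of that bucketing idea follows, assuming a fixed key length (3 here) and lowercase words. The function names buildPrefixIndex and findByPrefix are hypothetical. Patterns shorter than the key length need a fallback, because a word like "happy" is stored under "hap" and not under "ha".

```javascript
// Group words into buckets keyed by their first `keyLength` characters.
function buildPrefixIndex(words, keyLength) {
    var index = Object.create(null); // no inherited keys such as "constructor"
    for (var i = 0; i < words.length; i++) {
        var key = words[i].slice(0, keyLength);
        (index[key] = index[key] || []).push(words[i]);
    }
    return index;
}

// Return the words starting with `pattern`, touching as few buckets as possible.
function findByPrefix(index, keyLength, pattern) {
    var results = [];
    if (pattern.length < keyLength) {
        // Short pattern: merge every bucket whose key starts with it.
        for (var key in index) {
            if (key.lastIndexOf(pattern, 0) === 0) {
                results = results.concat(index[key]);
            }
        }
        return results;
    }
    // Otherwise only one bucket can contain matches.
    var bucket = index[pattern.slice(0, keyLength)] || [];
    for (var i = 0; i < bucket.length; i++) {
        if (bucket[i].lastIndexOf(pattern, 0) === 0) {
            results.push(bucket[i]);
        }
    }
    return results;
}
```

The expression word.lastIndexOf(pattern, 0) === 0 is a "starts with" test that works in older browsers, since the backwards search begins at index 0 and can therefore only match there.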

This means you will have shorter objects to iterate/search. Another way is to sort the arrays and use a binary search pattern. There's a good conversation about techniques and optimizations here: http://ejohn.org/blog/revised-javascript-dictionary-search/
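
The binary search suggestion could look roughly like the sketch below. It assumes the array is sorted with the default string comparison (for example via words.sort()) and that words and pattern use the same letter case; an array sorted with a locale-aware comparison would need a matching comparison here. The names lowerBound and prefixRange are hypothetical.

```javascript
// Find the first index whose word is not less than `target`.
function lowerBound(sortedWords, target) {
    var lo = 0,
        hi = sortedWords.length;
    while (lo < hi) {
        var mid = (lo + hi) >>> 1;
        if (sortedWords[mid] < target) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return lo;
}

// Return the contiguous run of words that start with `prefix`.
function prefixRange(sortedWords, prefix) {
    var start = lowerBound(sortedWords, prefix),
        end = start;
    // In a sorted array all matches sit next to each other,
    // so walk forward only while the prefix still matches.
    while (end < sortedWords.length &&
           sortedWords[end].lastIndexOf(prefix, 0) === 0) {
        end++;
    }
    return sortedWords.slice(start, end);
}
```

For 100,000 words the binary search needs about 17 comparisons to find the first candidate, after which the work is proportional to the number of matches.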

#4


0  

I suppose that using raw javascript can help a bit, you can do:

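var arr = ["happy", "happiness", "nothere", "notHereEither", "happening"], subset = [];
for(var i = 0, len = arr.length; i < len; i ++) {
     // indexOf(...) === 0 keeps only words that begin with "hap";
     // search("hap") !== -1 would also accept "hap" in the middle of a word
     if(arr[i].indexOf("hap") === 0) {
           subset.push(arr[i]);
     }
}
//subset === ["happy", "happiness","happening"]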

Also, if the array is sorted you can break out of the loop as soon as a word's first letter is greater than the first letter of your search string, instead of looping over the entire array.
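
A small sketch of that early exit, assuming the array is sorted with the default string comparison and everything is lowercase. It compares the whole prefix rather than just the first letter, which lets the loop stop a little sooner. The name scanSorted is hypothetical.

```javascript
// Linear scan over a sorted array that stops once words sort past the prefix.
function scanSorted(sortedWords, prefix) {
    var subset = [];
    for (var i = 0; i < sortedWords.length; i++) {
        var head = sortedWords[i].slice(0, prefix.length);
        if (head === prefix) {
            subset.push(sortedWords[i]);
        } else if (head > prefix) {
            // Every later word sorts after this one, so none can match.
            break;
        }
    }
    return subset;
}
```

This still walks every word that sorts before the prefix, so it only helps with the tail of the array.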

#5


0  

var data = ['foo', 'happy', 'happiness', 'foohap'];
jQuery.each(data, function(i, item) {
    // ^ anchors the match to the start of the word, so 'foohap' is skipped
    if (item.match(/^hap/)) {
        console.log(item);
    }
});

If you have the data in an array, you're going to have to loop through the whole thing.

#6


0  

A really simple optimization is, on page load, to go through your big words array and note which index range applies to each starting letter. E.g., in my example below the "a" words go from 0 to 2, "b" words go from 3 to 4, etc. Then, when actually doing a pattern match, only look through the applicable range. Although obviously some letters will have more words than others, a given search will only have to look through an average of about 100,000/26 (roughly 3,850) words.

// words array assumed to be lowercase and in alphabetical order
var words = ["a","an","and","be","blue","cast","etc."];

// figure out the index for the first and last word starting with
// each letter of the alphabet, so that later searches can use
// just the appropriate range instead of searching the whole array
var letterIndexes = {},
    i,
    l,
    letterIndex = 0,
    firstLetter;
for (i=0, l=words.length; i<l; i++) {
    if (words[i].charAt(0) === firstLetter)
       continue;
    if (firstLetter)
        letterIndexes[firstLetter] = {first : letterIndex, last : i-1};
    letterIndex = i;
    firstLetter = words[i].charAt(0);
}
// the loop only records a range when the letter changes,
// so store the range for the final letter here
if (firstLetter)
    letterIndexes[firstLetter] = {first : letterIndex, last : l-1};

function getSubset(pattern) {
    pattern = pattern.toLowerCase();
    var subset = [],
        fl = pattern.charAt(0),
        matched = false;
    // look up the range by the pattern's own first letter
    if (letterIndexes[fl]) {
        for (var i = letterIndexes[fl].first, l = letterIndexes[fl].last; i <= l; i++) {
            if (pattern === words[i].substr(0, pattern.length)) {
                subset.push(words[i]);
                matched = true;
            } else if (matched) {
                break;
            }
        }
    }
    return subset;
}

Note also that when searching through the (range within the) words array, once a match is found I set a flag, which indicates we've gone past all of the words that are alphabetically before the pattern and are now making our way through the matching words. That way as soon as the pattern no longer matches we can break out of the loop. If the pattern doesn't match at all we still end up going through all the words for that first letter though.

Also, if you're doing this as a user types, when letters are added to the end of the pattern you only have to search through the previous subset, not through the whole list.
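
That narrowing step might be sketched like this. The name createNarrowingSearch is hypothetical, and the sketch assumes words and patterns share the same letter case. It reuses the previous result only when the new pattern extends the old one, and falls back to the full list otherwise (for example after a backspace).

```javascript
// Returns a search function that remembers the last pattern and its matches.
function createNarrowingSearch(words) {
    var lastPattern = '',
        lastSubset = words;
    return function search(pattern) {
        // If the new pattern starts with the old one, its matches must be
        // a subset of the previous matches, so search only those.
        var source = pattern.lastIndexOf(lastPattern, 0) === 0 ? lastSubset : words,
            subset = [];
        for (var i = 0; i < source.length; i++) {
            if (source[i].lastIndexOf(pattern, 0) === 0) {
                subset.push(source[i]);
            }
        }
        lastPattern = pattern;
        lastSubset = subset;
        return subset;
    };
}
```

Typical use would be to create one search function per input box and call it from the keyup handler with the current value.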

P.S. Of course if you want to break the word list up by first letter you could easily do that server-side.
