The requirements are:
要求是:
- Find strings from an array (from here on called options) that include ALL the terms in ANY order
- 从数组中查找包含所有项的所有项的字符串。
- Correctly highlight the matching terms - ie. insert a string before and after each matching term - I'm using
<u>
and</u>
here - 正确地高亮显示匹配项。在每个匹配项前后插入一个字符串——我在这里使用和
- Both the search query and options can be "anything"
- 搜索查询和选项都可以是“任何东西”
For the sake of simplicity answers can concentrate on case-insensitive search through a list including only ASCII characters and assume that the term delimiter is a plain space - ie. a query entered as "Foo bar baz" means the search terms are ['foo', 'bar', 'baz']
.
为了简单起见,答案可以集中在不区分大小写的搜索列表中,其中只包含ASCII字符,并假设术语分隔符是一个纯空格——即。输入“Foo bar baz”的查询表示搜索条件为[' Foo '、'bar'、'baz']。
To clarify:
澄清:
- All terms must exist separately from each other in matched options - ie. shorter terms should not exist only as substrings of longer terms and no two terms should overlap
- 所有的术语必须在匹配的选项中分开存在。较短的项不应该只作为较长的项的子字符串存在,也不应该有两个项重叠
- Repeated search terms must exist in the option at least as many times as in the query
- 重复搜索词必须在选项中存在,至少与查询中存在的次数相同
The final application is (perhaps not surprisingly) an autocomplete of sorts.
最终的应用程序是(也许并不奇怪)自动完成的。
TL;DR
Most recent fiddle comparing the proposed algorithms side by side:
https://jsfiddle.net/Mikk3lRo/ndeuqn02/7/
(feel free to update this link if you add a new algorithm)最新的比较提议的算法:https://jsfiddle.net/Mikk3lRo/ndeuqn02/7/
jsPerf comparing algorithms in a somewhat more realistic / representative way - a few strings are basically "entered" one character at a time on each rep:
https://jsperf.com/comparison-of-algorithms-to-search-and-highlightjsPerf比较了一些比较现实/有代表性的算法——一些字符串基本上是“输入”一个字符,每个代表都有一个字符:https://jsperf.com/比对算法-搜索和高亮。
At this point it is clear (thanks to trincot's excellent base-comparison) that the majority of time used by the original implementations was spent on DOM-output. Its significance has been minimized as much as possible in the fiddle.
在这一点上,很明显(多亏了trincot出色的基础比较),原始实现使用的大部分时间都花在了DOM-output上。它的重要性在小提琴中被尽可能地最小化。
There is still a clear difference in performance between the algorithms in each search, but not one of them is consistently fastest on every keystroke. After revisiting and cleaning up my own "Divide and Conquer" it does outperform the others consistently in any realistic scenario I try though.
在每次搜索中,算法的性能仍然存在明显的差异,但没有一种算法在每次击键时都是最快的。在重新审视和清理我自己的“分而治之”之后,在我尝试的任何现实场景中,它的表现都比其他的要好。
Tigregalis introduced the idea of a pre-run optimization, which seems reasonable given that options are unlikely to change between keystrokes. I have added (a function for) this to all methods here. The only algorithm where I saw an obvious benefit from it was in Skirtle's Permutations, but I'll encourage each answerer to consider if it might be useful for their own algorithms.
Tigregalis引入了预运行优化的概念,这似乎是合理的,因为在击键之间的选项不太可能改变。我已经为这里的所有方法添加了(一个函数)。我看到的唯一一个明显的好处是避开了所有的排列,但是我鼓励每个回答者考虑它是否对他们自己的算法有用。
Some algorithms will be much easier to adapt than others. It is still my opinion that this will be more important than the minor performance differences in a real implementation.
有些算法比其他算法更容易适应。我的观点是,这将比实际执行中的次要性能差异更重要。
Note that the current version of Tigregalis' Shrinking Set has a bug - I've excluded it from fiddle and jsperf until that is fixed.
请注意,当前版本的Tigregalis的缩小集有一个缺陷——我把它排除在了小提琴和jsperf之外,直到它被修复。
Viral Permutations
In theory this can be solved by "manually" constructing a RegExp that contains every possible permutation of the search terms separated by a capturing group to catch anything between terms - a search for "foo bar" results in (foo)(.*?)(bar)|(bar)(.*?)(foo)
.
理论上,这可以通过“手工”构造一个RegExp来解决,该RegExp包含由捕获组分隔的每一个搜索术语的可能排列,以捕获术语之间的任何内容——在(foo)(.*?)(bar) (bar)(.*?)(foo) (. ?)(foo)中搜索“foo bar”结果。
The highlighting is then done in one pass with string.replace()
. If there is any change in the string we have a match.
然后用string.replace()进行一次突出显示。如果字符串有任何变化,我们有一个匹配。
Demo:
演示:
var options = ['United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D\'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', '*', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People\'s Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People\'s Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', '*', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', '*, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'];
var terms, terms_esc;
function viral_permutations() {
var t0, t1, i, permuted, res_elm, meta_elm, regex_string, regex, li, option, match_groups, highlighted;
meta_elm = document.getElementById('viral_permutations_meta');
res_elm = document.getElementById('viral_permutations_result');
res_elm.innerHTML = meta_elm.innerHTML = '';
t0 = performance.now();
//Split query in terms at delimiter and lowercase them
terms = document.getElementById('viral_permutations').value.split(/\s/).filter(function(n) {
return n;
}).map(function(term){
return term.toLowerCase();
});
meta_elm.innerHTML += 'Terms: ' + JSON.stringify(terms) + '<br>';
//Escape terms
terms_esc = terms.map(function(term) {
return term.replace(/[-[\]{}()*+?.,\\^$|#\s]/g, "\\$&");
});
//Wrap terms in in individual capturing groups
terms_esc = terms.map(function(term) {
return '(' + term + ')';
});
//Find all permutations
permuted = permutate_array(terms_esc);
//Construct a group for each permutation
match_groups = [];
for (var i in permuted) {
match_groups.push(permuted[i].join('(.*?)'));
}
try {
//Construct the final regex
regex_string = match_groups.join('|');
//Display it
document.getElementById('viral_permutations_regex').innerHTML = regex_string;
meta_elm.innerHTML += 'RegExp length: ' + regex_string.length + '<br>';
regex = new RegExp(regex_string, 'i');
//Loop through each option
for (i = 0; i < options.length; i++) {
option = options[i];
//Replace the terms with highlighted terms
highlighted = option.replace(regex, viral_permutations_replace);
//If anything was changed (or the query is empty) we have a match
if (highlighted !== option || terms.length === 0) {
//Append it to the result list
li = document.createElement('li');
li.innerHTML = highlighted;
res_elm.appendChild(li);
}
}
//Print some meta
t1 = performance.now();
meta_elm.innerHTML += 'Time: ' + (Math.round((t1 - t0) * 100) / 100) + 'ms';
} catch(e) {
meta_elm.innerHTML += '<span style="color:red">' + e.message + '</span>';
}
}
//The replacement function
function viral_permutations_replace() {
var i, m, j, retval, m_cased, unmatched;
retval = '';
//Make a clone of the terms array (that we can modify without destroying the original)
unmatched = terms.slice(0);
//Loop arguments between the first (which is the full match) and
//the last 2 (which are the offset and the full option)
for (i = 1; i < arguments.length - 1; i++) {
m = arguments[i];
//Check that we have a string - most of the arguments will be undefined
if (typeof m !== 'string') continue;
//Lowercase the match
m_cased = m.toLowerCase();
//Append it to the return value - highlighted if it is among our terms
j = unmatched.indexOf(m_cased);
if (j >= 0) {
//Remove it from the unmatched terms array
unmatched.splice(j, 1);
retval += '<u>' + m + '</u>';
} else {
retval += m;
}
}
return retval;
}
//Helper function to return all possible permutations of an array
function permutate_array(arr) {
var perm, perm_intern;
perm_intern = function(perm, pre, post, n) {
var elem, i, j, ref, rest;
if (n > 0) {
for (i = j = 0, ref = post.length; 0 <= ref ? j < ref : j > ref; i = 0 <= ref ? ++j : --j) {
rest = post.slice(0);
elem = rest.splice(i, 1);
perm_intern(perm, pre.concat(elem), rest, n - 1);
}
} else {
perm.push(pre);
}
};
perm = [];
perm_intern(perm, [], arr, arr.length);
return perm;
}
viral_permutations();
<input type="text" id="viral_permutations" onkeyup="viral_permutations()">
<p id="viral_permutations_meta"></p>
<pre style="width:100%;overflow:auto;background:#eee" id="viral_permutations_regex"></pre>
<ul style="height:300px;overflow:auto" id="viral_permutations_result"></ul>
Thanks to trincot for pointing out that my original version would occasionally highlight a recurring term twice - which is fixed in this snippet.
感谢trincot指出我的原始版本偶尔会突出重复出现的术语两次——这是在这段代码中固定的。
Fails because:
失败原因:
- The regex grows exponentially as terms are added. At 7 terms (even single letters) it is past 250kb and my browser gives up with
Error: regexp too big
... - regex随着项的添加而呈指数级增长。在7个术语(甚至单个字母),它超过250kb,我的浏览器放弃了错误:regexp太大……
A few other noteworthy strategies that didn't work:
Capturing groups with all terms in each group:
(foo|bar)(.*)(foo|bar)
Fails because:
失败原因:
- Will match options that include repeated terms - fx.
The food in the food court
would match though it obviously shouldn't. - 将匹配包括重复条款- fx的选项。食物区里的食物很适合,但显然不适合。
- If we "double-check" that all terms were, in fact, found it will fail to find valid matches - fx.
The food in the food bar
would findfoo
twice and never get tobar
. - 如果我们“再三检查”所有的术语,事实上,它将无法找到有效的匹配- fx。食品酒吧里的食物会有两次,而且永远不会去酒吧。
Negative lookaheads and backreferences:
(foo|bar|baz)(.*?)((?!\1)(?:foo|bar|baz))(.*?)((?!\1|\3)(?:foo|bar|baz))
Fails because:
失败原因:
- Whenever terms are repeated in the query it will reach impossible conditions like "Find me a
foo
,bar
orbar
which is not afoo
nor abar
". - 每当查询中重复出现术语时,它将达到不可能的条件,如“为我找到一个不是foo或bar的foo、bar或bar”。
- I'm fairly certain it has other issues, but I stopped pursuing it when I realized it was logically flawed.
- 我很确定它还有其他的问题,但当我意识到它在逻辑上有缺陷时,我就停止了追求它。
Positive lookaheads
(?=.*foo)(?=.*bar)(?=.*baz)
Fails because:
失败原因:
- It is very difficult (if not impossible) to reliably highlight the found matches.
- 要可靠地突出显示找到的匹配项是非常困难的(如果不是不可能的话)。
- I haven't found any way to ensure all terms actually exist - ie. they may all be individually present in the option, but shorter terms may only exist as substrings of longer terms - or terms may overlap.
- 我还没有找到任何方法来确保所有的术语都存在。它们可能都单独存在于选项中,但更短的术语可能只作为更长的项的子字符串存在——或者术语可能有重叠。
5 个解决方案
#1
6
I gave it a go but I'm not sure how much help this will be. My approach is similar to your Divide and Conquer technique.
我试了一下,但我不确定这能帮多少忙。我的方法类似于你的分治法。
Instead of biting off bits of the string I search for each term ahead of time and store up a collection of all the matches, recording the start and finish positions. If there aren't enough matches for a particular search term the algorithm immediately bails for that 'option'.
而不是撕咬字符串,我搜索每一个词提前,并储存所有的比赛,记录开始和结束的位置。如果一个特定的搜索词没有足够的匹配项,算法会立即发出“选项”的呼叫。
Once it has gathered together all of the possible matches it recursively attempts to find a combination that doesn't overlap. There's a lot of copying of data structures going on in that recursion and I suspect it could be optimized a lot better than I have it here. I can also only apologize for some of the variable names, I was struggling to come up with names that made any sense at all.
一旦它收集了所有可能的匹配项,它就会递归地尝试找到一个不重叠的组合。在这个递归中有很多数据结构的复制我怀疑它可以比我这里的优化得更好。我也只能为一些变量的名字道歉,我正在努力想出一些有意义的名字。
For certain test searches, such as a n a n a n a n ...
it seems to cope better than the original Divide and Conquer technique but I suspect that may be just because of the early bail-out that is performed when insufficient matches are found for a particular search term. Without a large quantity of real data it's difficult to know where the really valuable optimizations would be.
对于某些测试搜索,例如a n n n n n a n a n a n n…它似乎比最初的分治技巧处理得更好,但我怀疑,这可能仅仅是因为早期的纾困措施,即在发现某一特定搜索词的匹配不足时进行纾困。如果没有大量的真实数据,就很难知道真正有价值的优化在哪里。
function search() {
var options = [
'ababababa',
'United States',
'United Kingdom',
'Afghanistan',
'Aland Islands',
'Albania',
'Algeria',
'American Samoa',
'Andorra',
'Angola',
'Anguilla',
'Antarctica',
'Antigua and Barbuda',
'Argentina',
'Armenia',
'Aruba',
'Australia',
'Austria',
'Azerbaijan',
'Bahamas',
'Bahrain',
'Bangladesh',
'Barbados',
'Belarus',
'Belgium',
'Belize',
'Benin',
'Bermuda',
'Bhutan',
'Bolivia, Plurinational State of',
'Bonaire, Sint Eustatius and Saba',
'Bosnia and Herzegovina',
'Botswana',
'Bouvet Island',
'Brazil',
'British Indian Ocean Territory',
'Brunei Darussalam',
'Bulgaria',
'Burkina Faso',
'Burundi',
'Cambodia',
'Cameroon',
'Canada',
'Cape Verde',
'Cayman Islands',
'Central African Republic',
'Chad',
'Chile',
'China',
'Christmas Island',
'Cocos (Keeling) Islands',
'Colombia',
'Comoros',
'Congo',
'Congo, The Democratic Republic of The',
'Cook Islands',
'Costa Rica',
'Cote D\'ivoire',
'Croatia',
'Cuba',
'Curacao',
'Cyprus',
'Czech Republic',
'Denmark',
'Djibouti',
'Dominica',
'Dominican Republic',
'Ecuador',
'Egypt',
'El Salvador',
'Equatorial Guinea',
'Eritrea',
'Estonia',
'Ethiopia',
'Falkland Islands (Malvinas)',
'Faroe Islands',
'Fiji',
'Finland',
'France',
'French Guiana',
'French Polynesia',
'French Southern Territories',
'Gabon',
'Gambia',
'Georgia',
'Germany',
'Ghana',
'Gibraltar',
'Greece',
'Greenland',
'Grenada',
'Guadeloupe',
'Guam',
'Guatemala',
'Guernsey',
'Guinea',
'Guinea-bissau',
'Guyana',
'Haiti',
'Heard Island and Mcdonald Islands',
'Holy See (Vatican City State)',
'Honduras',
'*',
'Hungary',
'Iceland',
'India',
'Indonesia',
'Iran, Islamic Republic of',
'Iraq',
'Ireland',
'Isle of Man',
'Israel',
'Italy',
'Jamaica',
'Japan',
'Jersey',
'Jordan',
'Kazakhstan',
'Kenya',
'Kiribati',
'Korea, Democratic People\'s Republic of',
'Korea, Republic of',
'Kuwait',
'Kyrgyzstan',
'Lao People\'s Democratic Republic',
'Latvia',
'Lebanon',
'Lesotho',
'Liberia',
'Libya',
'Liechtenstein',
'Lithuania',
'Luxembourg',
'Macao',
'Macedonia, The Former Yugoslav Republic of',
'Madagascar',
'Malawi',
'Malaysia',
'Maldives',
'Mali',
'Malta',
'Marshall Islands',
'Martinique',
'Mauritania',
'Mauritius',
'Mayotte',
'Mexico',
'Micronesia, Federated States of',
'Moldova, Republic of',
'Monaco',
'*',
'Montenegro',
'Montserrat',
'Morocco',
'Mozambique',
'Myanmar',
'Namibia',
'Nauru',
'Nepal',
'Netherlands',
'New Caledonia',
'New Zealand',
'Nicaragua',
'Niger',
'Nigeria',
'Niue',
'Norfolk Island',
'Northern Mariana Islands',
'Norway',
'Oman',
'Pakistan',
'Palau',
'Palestinian Territory, Occupied',
'Panama',
'Papua New Guinea',
'Paraguay',
'Peru',
'Philippines',
'Pitcairn',
'Poland',
'Portugal',
'Puerto Rico',
'Qatar',
'Reunion',
'Romania',
'Russian Federation',
'Rwanda',
'Saint Barthelemy',
'Saint Helena, Ascension and Tristan da Cunha',
'Saint Kitts and Nevis',
'Saint Lucia',
'Saint Martin (French part)',
'Saint Pierre and Miquelon',
'Saint Vincent and The Grenadines',
'Samoa',
'San Marino',
'Sao Tome and Principe',
'Saudi Arabia',
'Senegal',
'Serbia',
'Seychelles',
'Sierra Leone',
'Singapore',
'Sint Maarten (Dutch part)',
'Slovakia',
'Slovenia',
'Solomon Islands',
'Somalia',
'South Africa',
'South Georgia and The South Sandwich Islands',
'South Sudan',
'Spain',
'Sri Lanka',
'Sudan',
'Suriname',
'Svalbard and Jan Mayen',
'Swaziland',
'Sweden',
'Switzerland',
'Syrian Arab Republic',
'*, Province of China',
'Tajikistan',
'Tanzania, United Republic of',
'Thailand',
'Timor-leste',
'Togo',
'Tokelau',
'Tonga',
'Trinidad and Tobago',
'Tunisia',
'Turkey',
'Turkmenistan',
'Turks and Caicos Islands',
'Tuvalu',
'Uganda',
'Ukraine',
'United Arab Emirates',
'United Kingdom',
'United States',
'United States Minor Outlying Islands',
'Uruguay',
'Uzbekistan',
'Vanuatu',
'Venezuela, Bolivarian Republic of',
'Viet Nam',
'Virgin Islands, British',
'Virgin Islands, U.S.',
'Wallis and Futuna',
'Western Sahara',
'Yemen',
'Zambia',
'Zimbabwe'
];
var terms = document.getElementById('search').value.trim().toLowerCase().split(/\s+/);
if (!terms[0]) {
terms = [];
}
document.getElementById('terms').innerText = 'Terms: ' + JSON.stringify(terms);
var startTime = performance.now();
// Term counts is a map storing how many times each search term appears in the query
var termCounts = {};
terms.forEach(function(term) {
termCounts[term] = (termCounts[term] || 0) + 1;
});
// An array of search terms with the duplicates removed
var uniqueTerms = Object.keys(termCounts);
// Loop through each option and map to either a highlight version or null
options = options.map(function(optionText) {
var matches = {},
lastMatchIndex = {},
option = optionText.toLowerCase();
uniqueTerms.forEach(function(term) {
// This array will be populated with start/end position of each match for this term
matches[term] = [];
// The index of the last match... which could be deduced from the matches but this is slightly easier
lastMatchIndex[term] = -1;
});
var incompleteMatchTerms = uniqueTerms.slice(),
nextMatchTerm;
// This is probably a premature optimization but doing it this
// way ensures we check that each search term occurs at least
// once as quickly as possible.
while (nextMatchTerm = incompleteMatchTerms.shift()) {
var nextMatchIndex = option.indexOf(nextMatchTerm, lastMatchIndex[nextMatchTerm] + 1);
if (nextMatchIndex === -1) {
// Haven't found enough matches for this term, so the option doesn't match
if (termCounts[nextMatchTerm] > matches[nextMatchTerm].length) {
return null;
}
}
else {
// Found another match, put the term back on the queue
// for another round
incompleteMatchTerms.push(nextMatchTerm);
lastMatchIndex[nextMatchTerm] = nextMatchIndex;
matches[nextMatchTerm].push({
start: nextMatchIndex,
end: nextMatchIndex + nextMatchTerm.length
});
}
}
// Pass in the original array of terms... we attempt to highlight in the order of the original query
var highlights = performHighlight(terms, matches);
if (!highlights) {
return null;
}
// We need the highlights sorted so that we do the replacing from the end of the string
highlights.sort(function(h1, h2) {
return h2.start - h1.start;
});
highlights.forEach(function(highlight) {
optionText = optionText.slice(0, highlight.start)
+ '<u>' + optionText.slice(highlight.start, highlight.end) + '</u>'
+ optionText.slice(highlight.end);
});
return optionText;
function performHighlight(terms, allMatches) {
// If there are no terms left to match we've got a hit
if (terms.length === 0) {
return [];
}
var nextTerms = terms.slice(),
term = nextTerms.shift(),
matches = allMatches[term].slice(),
match;
while (match = matches.shift()) {
var nextMatches = {};
// We need to purge any entries from nextMatches that overlap the current match
uniqueTerms.forEach(function(nextTerm) {
var nextMatch = term === nextTerm ? matches : allMatches[nextTerm];
nextMatches[nextTerm] = nextMatch.filter(function(match2) {
return match.start >= match2.end || match.end <= match2.start;
});
});
var highlights = performHighlight(nextTerms, nextMatches);
if (highlights) {
highlights.push(match);
return highlights;
}
}
return null;
}
});
document.getElementById('results').innerHTML = options.map(function(option) {
if (option) {
return '<li>' + option + '</li>';
}
return '';
}).join('');
document.getElementById('time').innerText = Math.round((performance.now() - startTime) * 100) / 100 + 'ms';
}
<h1>Permutations</h1>
<input type="text" id="search" onkeyup="search()" autocomplete="off">
<p id="terms"></p>
<p id="time"></p>
<ul id="results"></ul>
Update:
更新:
Based on feedback from Mikk3lRo in the comments I've done a bit of performance tuning and come up with this:
根据Mikk3lRo在评论中给出的反馈,我做了一些性能调整,得出了以下结论:
https://jsfiddle.net/skirtle/ndeuqn02/1/
https://jsfiddle.net/skirtle/ndeuqn02/1/
The core algorithm is the same but I've made it much more difficult to understand, all in the name of performance. Most of the changes relate to avoiding the creation of new objects wherever possible.
核心算法是一样的,但我已经使它更加难以理解,所有这些都是以性能的名义。大多数变化都与尽可能避免创建新对象有关。
As the algorithm does a lot of searching upfront for things it might never need there will always be opportunities for other algorithms to be quicker, especially in simple cases. Many of those cases could be handled separately but I haven't attempted that kind of optimisation.
由于算法在前面做了大量的搜索,它可能永远不需要的东西,总是有机会让其他算法更快,特别是在简单的情况下。其中许多案例可以单独处理,但我还没有尝试过这种优化。
In Chrome it now outperforms the other implementations in a lot of different scenarios, though that is an unfair comparison as they haven't yet been tuned in the same way. The other implementations tend to be slightly faster in Firefox for simple searches but times are all in the same ballpark.
在Chrome中,它在很多不同的场景中都优于其他实现,尽管这是不公平的比较,因为它们还没有以同样的方式进行调优。对于简单的搜索来说,其他的实现在Firefox中会稍微快一点,但是时间都是一样的。
Some particularly interesting searches:
一些特别有趣的搜索:
-
a ab ba baba
. I've added a new 'option' and adjusted the CSS to demonstrate this. The algorithms differ in their chosen way to perform the highlighting. My algorithm favours the order of the terms in the query rather than basing it on the length of the terms. There are further optimisations available if I don't worry about the ordering but I think they'd only help in extreme cases of overlaps. - ab英航巴巴。我添加了一个新的“选项”并调整了CSS以演示这一点。算法在执行突出显示的方式上有所不同。我的算法更喜欢查询中术语的顺序,而不是基于术语的长度。如果我不担心排序的话,还有进一步的优化,但是我认为它们只在重叠的极端情况下有用。
-
t r i s t a n d a c u n h a
. Note the spaces between the letters, these are 14 separate search terms. If you type this one term at a time Divide and Conquer will quickly start to struggle but it does recover towards the end. Wipe and Shadow cope well for longer but when you type the letterc
they will fall off a cliff. I think it's an exponential explosion in the backtracking but I haven't confirmed that. I'm sure with a bit of work it could be addressed in simple cases but it might be trickier to fix in cases where the backtracking is caused by an unresolvable overlap. - 注意字母之间的空格,这是14个单独的搜索词。如果你一次输入这一项,分治和征服将很快开始斗争,但它确实在最后恢复。擦拭和阴影处理较长时间,但当你键入字母c时,它们会掉下悬崖。我认为这是回溯的指数爆炸,但我还没有证实。我相信通过一些工作,它可以在简单的情况下得到解决,但在由于不可解重叠而导致回溯的情况下,修复它可能会更加困难。
I'm sure all the implementations could be sped up with a bit more tuning and a few carefully crafted bits of special-case handling. Which one is actually 'best' for real scenarios I'm not sure but my current feeling is that my algorithm probably only has a narrow sweet spot where it would outperform the others in a truly fair test. An algorithm that doesn't do all that searching upfront seems hard to beat for real searches.
我确信所有的实现都可以通过更多的调优和一些精心设计的特殊案例处理来加速。我不确定哪种算法在真实场景中是“最好的”,但我目前的感觉是,我的算法可能只有一个很小的最佳点,在一个真正公平的测试中,它会比其他算法表现得更好。一个不做所有搜索的算法似乎很难被真正的搜索打败。
Update 2
更新2
I've tried a different implementation of my earlier approach:
我尝试了一种不同的方法来实现我之前的方法:
https://jsfiddle.net/skirtle/ndeuqn02/9/
https://jsfiddle.net/skirtle/ndeuqn02/9/
Note that I've only updated my own implementation, the others remain out of date.
注意,我只更新了自己的实现,其他的仍然过时。
I thought I'd try to avoid unnecessary searches by performing them lazily rather than doing them all upfront. It still caches them so that they can be reused if the algorithm needs to backtrack. I don't know whether this makes a significant difference as performing small numbers of extra searches on short strings probably wasn't adding much overhead.
我想我应该尽量避免不必要的搜索,懒惰地执行它们,而不是预先执行它们。它仍然缓存它们,以便在算法需要回溯时可以重用它们。我不知道这是否会产生重大影响,因为在短字符串上执行少量的额外搜索可能不会增加太多开销。
I also experimented with cutting out the function recursion. While it does seem to work I feel there's a high risk of bugs (it would need a lot of unit tests to be sure it actually does work). I'm not convinced this part was really a success because the data structures involved make it really difficult to follow. It does seem to be fast but not by enough to justify the complexity.
我还尝试了切掉函数递归。虽然它看起来确实有效,但我觉得存在很大的缺陷(它需要大量的单元测试以确保它确实有效)。我不相信这部分真的是成功的,因为所涉及的数据结构使它很难遵循。它看起来确实很快,但不足以证明它的复杂性。
I also experimented with alternative ways to build up the final highlights. All that sorting and slicing seemed to be a performance drain but, again, the code gets more complicated trying to avoid it. Some of these gains might be applicable to the other algorithms though.
我还尝试了其他方法来建立最后的亮点。所有的排序和切片看起来都是性能消耗,但是,代码在试图避免这种情况时变得更加复杂。不过,其中一些收益可能适用于其他算法。
Another idea I've introduced here is a pre-search analysis of the query terms (dependent only on the query and not on the options). It checks whether terms can overlap and for any terms where an overlap is impossible (e.g. cat dog
) it uses a much simpler algorithm to just grab the matches. This idea could potentially be applied to the other algorithms too.
我在这里介绍的另一个想法是对查询条件进行预搜索分析(仅依赖于查询,而不依赖于选项)。它检查术语是否可以重叠,对于任何不可能重叠的术语(例如cat dog),它使用一个更简单的算法来抓取匹配项。这个想法也可以应用到其他算法中。
As mentioned in the comments the idea of running some sort of pre-search analysis of the options is also possible but I haven't actually implemented that here. It's difficult to know what sort of search index would be most beneficial as it depends on things like memory usage and the specifics of the options. However, it might be more practical to try carrying over small amounts of information from one search to the next.
正如评论中提到的,对选项进行某种预先搜索分析的想法也是可能的,但我并没有在这里实现。很难知道哪种搜索索引最有用,因为它取决于内存使用和选项的细节。然而,尝试将少量的信息从一个搜索传递到下一个搜索可能更实际。
For example, if someone searches for united states
there's a good chance that the last thing they typed was the final s
and their previous search was united state
. Two potential optimisations based on this are:
例如,如果有人搜索美国,很有可能他们最后输入的是最后的s,而他们之前的搜索是美国。基于此的两个可能的优化是:
- The matching options for
united states
will be a subset of those forunited state
, so if we remember that list we could save having to check all the other options. This could be used for any of the algorithms. - 美国的匹配选项将是美国的子集,所以如果我们记得这个列表,我们可以省去检查其他选项的麻烦。这可以用于任何算法。
- In the case of my algorithm the match caches could be retained from one search to the next. While the cache entry for
state
wouldn't be any use the entry forunited
would be exactly the same from one search to the next, allowing us to skip an expensive part of the algorithm for that term. - 在我的算法中,匹配缓存可以从一个搜索保留到下一个搜索。虽然state的缓存条目不会有任何用处,但是united的条目从一个搜索到下一个搜索是完全相同的,这允许我们跳过这个术语的算法的昂贵部分。
#2
6
I would suggest a slight variant on the divide and conquer idea: instead of splitting the string into chunks (bites), you could "wipe out" the characters that have been matched, and perform further searches on that one string. The character to wipe out with would be the separator, as it is guaranteed to not occur in any of the terms.
我建议对“分治法”的思想做一个细微的改变:不要将字符串分割成块(片段),您可以“删除”已匹配的字符,并对该字符串执行进一步的搜索。要清除的字符将是分隔符,因为它保证不会出现在任何术语中。
Here it is:
这里是:
function trincotWipeSearch(query, options, separator) {
// Split query in terms at delimiter
const terms = query.split(separator).filter(Boolean);
if (!terms.length) return options;
// Sort terms by descending size
terms.sort( (a,b) => b.length - a.length );
// Escape terms, and enrich with size of original term
// and a string of the same length filled with the separator char
const items = terms.map(term => ({
size: term.length,
wipe: separator.repeat(term.length),
regex: new RegExp(term.replace(/[-[\]{}()*+?.,\\^$|#\s]/g, "\\$&"), 'gi')
}));
function getOffsets(termIndex, text) {
// All terms found?
if (termIndex >= terms.length) return [];
let match;
const { regex, size, wipe } = items[termIndex];
regex.lastIndex = 0;
while (match = regex.exec(text)) {
let index = match.index;
// Wipe characters and recurse to find other terms
let offsets = getOffsets(termIndex+1,
text.substr(0, index) + wipe + text.substr(index + size));
if (offsets !== undefined) {
// Solution found, backtrack all the way
return offsets.concat([index, index + size]);
}
regex.lastIndex = match.index + 1;
}
}
// Loop through each option
return options.map( option => {
// Get the offsets of the matches
let offsets = getOffsets(0, option);
if (offsets) {
// Apply the offsets to add the markup
offsets
.sort( (a,b) => b - a )
.map((index, i) => {
option = option.substr(0, index)
+ (i%2 ? "<u>" : "</u>")
+ option.substr(index);
});
return option;
}
}).filter(Boolean); // get only the non-empty results
}
var options = ['United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D\'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', '*', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People\'s Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People\'s Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', '*', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', '*, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'];
/*
* I/O and performance measurements
*/
function processInput() {
var query = this.value.toLowerCase();
const t0 = performance.now();
const matches = trincotWipeSearch(query, options, ' ');
const spentTime = performance.now() - t0;
// Output the time spent
time.textContent = spentTime.toFixed(2);
// Output the matches
result.innerHTML = '';
for (var match of matches) {
// Append it to the result list
var li = document.createElement('li');
li.innerHTML = match;
result.appendChild(li);
}
}
findTerms.addEventListener('keyup', processInput);
processInput.call(findTerms);
ul {
height:300px;
font-size: smaller;
overflow: auto;
}
Input terms: <input type="text" id="findTerms"><br>
<h3>Trincot's Wipe Search</h3>
Time: <span id="time"></span>ms<br>
<ul id="result"></ul>
I have excluded DOM I/O from the time measurement.
我已经将DOM I/O排除在时间度量之外。
Here is a jsfiddle comparing the two algorithms side by side. It should not be difficult to add a third algorithm to compare with other algorithms still.
这里有一个jsfiddle在比较这两种算法。添加第三种算法与其他算法进行比较应该不难。
When the separator could be any regular expression
...then the above function cannot be used. One way to overcome this is to introduce a "shadow" string, just as long as the option string, but with only 2 different possible characters in it (e.g. .
and x
):
…然后不能使用上面的函数。解决这个问题的一种方法是引入一个“影子”字符串,和选项字符串一样长,但是其中只有两个不同的可能字符(例如。和x):
-
One of the two would indicate that the corresponding character (i.e. at the same position) in the option string has been matched with a term, and therefore is no longer available for another term's match.
其中之一将表示选项字符串中对应的字符(即在相同位置)已与项匹配,因此不再适用于其他项的匹配。
-
The other character would indicate that the corresponding character in the option string is still available for being included in a term's match.
另一个字符将指示选项字符串中的相应字符仍然可用,以便包含在术语的匹配中。
Obviously this makes the function a tad slower, as there might be matches that need to be rejected after checking against this shadow string:
显然,这使得函数稍微慢了一点,因为在对这个影子字符串进行检查之后,可能会有匹配项需要被拒绝:
function trincotShadowMarks (query, options, separator) {
// Split query in terms at delimiter
const terms = query.split(separator).filter(Boolean);
if (!terms.length) return options;
// Sort terms by descending size
terms.sort( (a,b) => b.length - a.length );
// Escape terms, and enrich with size of original term
// and a string of the same length filled with the separator char
const items = terms.map(term => ({
size: term.length,
used: 'x'.repeat(term.length),
free: '.'.repeat(term.length),
regex: new RegExp(term.replace(/[-[\]{}()*+?.,\\^$|#\s]/g, "\\$&"), 'gi')
}));
function getOffsets(termIndex, text, shadow) {
// All terms found?
if (termIndex >= terms.length) return [];
let match;
const { regex, size, used, free } = items[termIndex];
regex.lastIndex = 0;
while (regex.lastIndex > -1 && (match = regex.exec(text))) {
let index = match.index;
// Is this match not overlapping with another match?
if (!shadow.substr(index, size).includes('x')) {
// Mark position as used and recurse to find other terms
let offsets = getOffsets(termIndex+1, text,
shadow.substr(0, index) + used + shadow.substr(index + size));
if (offsets !== undefined) {
// Solution found, backtrack all the way
return offsets.concat([index, index + size]);
}
}
regex.lastIndex = shadow.indexOf(free, match.index + 1);
}
}
// Loop through each option
return options.map( option => {
// Get the offsets of the matches
let offsets = getOffsets(0, option, '.'.repeat(option.length));
if (offsets) {
// Apply the offsets to add the markup
offsets
.sort( (a,b) => b - a )
.map((index, i) => {
option = option.substr(0, index)
+ (i%2 ? "<u>" : "</u>")
+ option.substr(index);
});
return option;
}
}).filter(Boolean); // get only the non-empty results
}
var options = ['United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D\'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', '*', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People\'s Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People\'s Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', '*', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', '*, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'];
/*
* I/O and performance measurements
*/
function processInput() {
var query = this.value.toLowerCase();
const t0 = performance.now();
const matches = trincotShadowMarks(query, options, ' ');
const spentTime = performance.now() - t0;
// Output the time spent
time.textContent = spentTime.toFixed(2);
// Output the matches
result.innerHTML = '';
for (var match of matches) {
// Append it to the result list
var li = document.createElement('li');
li.innerHTML = match;
result.appendChild(li);
}
}
findTerms.addEventListener('keyup', processInput);
processInput.call(findTerms);
ul {
height:300px;
font-size: smaller;
overflow: auto;
}
Input terms: <input type="text" id="findTerms"><br>
<h3>Trincot's Wipe Search</h3>
Time: <span id="time"></span>ms<br>
<ul id="result"></ul>
#3
3
Divide and Conquer
Somewhat more complicated than the single-regex Viral Permutations strategy - this recursive algorithm searches for each term one after the other, starting with the longest term.
比单regex病毒排列策略要复杂一些——这个递归算法从最长的项开始,一个接一个地搜索每一项。
Each time a match is found it divides that "bite" into three (unless at the beginning or end), marking the matched "bite" as consumed, and attempts to match the next-longest term in any of the unconsumed "bites".
每次找到匹配项时,它将“bite”分为三种(除非在开始或结束时),将匹配的“bite”标记为消费,并尝试在任何未消费的“bite”中匹配下一个最长的术语。
When it is unable to find the longest unmatched term, it will backtrack and attempt to match the previous term at a different position (even in a different "bite").
当它找不到最长的不匹配项时,它将回溯并尝试在不同的位置(即使是在不同的“咬”位置)匹配前一项。
If it comes back to the longest term and cannot find another position to match it at, it will return false.
如果它返回到最长的项并且找不到另一个位置来匹配它,它将返回false。
This means that in most cases it can return non-matches pretty quickly, simply because they don't even contain the longest term.
这意味着在大多数情况下,它可以很快地返回非匹配项,因为它们甚至不包含最长的项。
Of course if it runs out of terms - ie. successfully matches the shortest - it will return the highlighted match by joining all the "bites" back together.
当然,如果它用完了。成功匹配最短的-它将返回突出显示的匹配,加入所有的“咬”在一起。
Demo:
Updated for improved performance: The base algorithm is exactly the same, but there were some pretty expensive calls to arr.slice()
that could be avoided completely.
为了提高性能而更新:基本算法完全相同,但是对arr.slice()的一些调用非常昂贵,可以完全避免。
function divide_and_conquer_replace(query, options, separator) {
var terms, terms_esc;
//The inner replacement function
function divide_and_conquer_inner(bites, depth) {
var this_term, i, bite, match, new_bites, found_all_others;
depth = depth ? depth : 1;
//Get the longest remaining term
this_term = terms_esc[terms_esc.length - depth];
//Loop all the bites
for (i = 0; i < bites.length; i++) {
bite = bites[i];
//Reset the lastIndex since we're reusing the RegExp objects
this_term.lastIndex = 0;
//Check that we have a string (ie. do not attempt to match bites
//that are already consumed)
if (typeof bite === 'string') {
//Find the next matching position (if any)
while (match = this_term.exec(bite)) {
new_bites = (i > 0) ? bites.slice(0, i) : [];
if (match.index > 0) {
new_bites.push(bite.slice(0, match.index));
}
new_bites.push(['<u>' + match[0] + '</u>']);
if (this_term.lastIndex < bite.length) {
new_bites.push(bite.slice(this_term.lastIndex));
}
if (i < bites.length - 1) {
new_bites = new_bites.concat(bites.slice(i + 1));
}
if (terms_esc.length > depth) {
//Attempt to find all other terms
found_all_others = divide_and_conquer_inner(new_bites, depth + 1);
//If we found all terms we'll pass the modified string all the
//way up to the original callee
if (found_all_others) {
return found_all_others;
}
//Otherwise try to match current term somewhere else
this_term.lastIndex = match.index + 1;
} else {
//If no terms remain we have a match
return new_bites.join('');
}
}
}
}
//If we reach this point at least one term was not found
return null;
};
// Split query in terms at delimiter
terms = query.split(separator).filter(Boolean);
if (!terms.length) return options;
//Sort terms according to length - longest term last
terms.sort(function(a, b) {
return a.length - b.length;
});
//Escape terms
//And store RegExp's instead of strings
terms_esc = terms.map(function (term) {
return term.replace(/[-[\]{}()*+?.,\\^$|#\s]/g, "\\$&");
}).map(function (term) {
return new RegExp(term, 'gi');
});
//Loop through each option
return options.map(function(option){
return divide_and_conquer_inner([option]);
}).filter(Boolean);
}
var options = ['United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D\'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', '*', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People\'s Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People\'s Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', '*', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', '*, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'];
var separator = ' ';
function divide_and_conquer(){
var query = document.getElementById('divide_and_conquer').value;
var res_elm = document.getElementById('divide_and_conquer_result');
var t0 = performance.now();
var results = divide_and_conquer_replace(query, options, separator);
var t1 = performance.now();
document.getElementById('divide_and_conquer_meta').innerHTML = 'Time: ' + (t1 - t0).toFixed(2) + 'ms';
res_elm.innerHTML = '';
for (var result of results) {
res_elm.innerHTML += '<li>' + result + '</li>';
}
};
divide_and_conquer();
<input type="text" id="divide_and_conquer" onkeyup="divide_and_conquer()">
<p id="divide_and_conquer_meta"></p>
<ul style="height:300px;overflow:auto" id="divide_and_conquer_result"></ul>
This strategy has performance issues when the query consists exclusively of (usually very short) strings that are all / most present in many of the options - such as a a a a a a a a
...
当查询只包含(通常是非常短的)字符串时,这种策略就会出现性能问题,这些字符串在许多选项中都是/大部分是存在的——例如a a a a a a a a a a a…
In realistic scenarios it is currently outperforming the other proposed algorithms - see the link to jsperf added to the question.
在现实的场景中,它目前的表现超过了其他提出的算法——请参阅添加到问题中的jsperf链接。
#4
2
Here is an entirely different approach than taken in my previous answer -- which I could not add all the below to (size restriction), so... it's a separate answer.
这里有一种与我之前的答案完全不同的方法——我不能把下面所有的都加到(尺寸限制)中,所以……这是一个单独的答案。
Generalized Suffix Tree: Preprocessing the options
A generalized suffix tree is a structure that in theory allows searching substrings in a set of strings in an efficient way. So I thought I would have a go at it.
广义后缀树是一种结构,理论上允许以有效的方式在一组字符串中搜索子字符串。所以我想我可以试试。
Building such a tree in an efficient way is far from trivial, as can be seen from this awesome explanation of Ukkonen's algorithm, which concerns building a suffix tree for one phrase (option).
用一种有效的方式构建这样的树绝非易事,这可以从Ukkonen算法的这个令人敬畏的解释中看出,它涉及到为一个短语(选项)构建一个后缀树。
I have taken inspiration from the implementation found here, which needed some adaptation to:
我从这里的实现中获得了灵感,需要一些适应:
- Apply a better coding style (e.g. get rid of non-explicitly declared global variables)
- 应用更好的编码风格(例如,去掉非显式声明的全局变量)
- Make it work without needing to add delimiters after texts. This was really tricky, and I hope I did not miss out on some border conditions
- 使它工作,不需要在文本之后添加分隔符。这真的很棘手,我希望我没有错过一些边界条件
- Make it work for multiple strings (i.e. generalized)
- 使它适用于多个字符串(即广义的)
So here it is:
所以在这里:
"use strict";
// Implementation of a Generalized Suffix Tree using Ukkonen's algorithm
// See also: https://*.com/q/9452701/5459839
class Node {
constructor() {
this.edges = {};
this.suffixLink = null;
}
addEdge(ch, textId, start, end, node) {
this.edges[ch] = { textId, start, end, node };
}
}
class Nikkonen extends Node {
constructor() {
super(); // root node of the tree
this.texts = [];
}
findNode(s) {
if (!s.length) return;
let node = this,
len,
suffixSize = 0,
edge;
for (let i = 0; i < s.length; i += len) {
edge = node.edges[s.charAt(i)];
if (!edge) return;
len = Math.min(edge.end - edge.start, s.length - i);
if (this.texts[edge.textId].substr(edge.start, len) !== s.substr(i, len)) return;
node = edge.node;
}
return { edge, len };
}
findAll(term, termId = 1) {
const { edge, len } = this.findNode(term) || {};
if (!edge) return {}; // not found
// Find all leaves
const matches = new Map;
(function recurse({ node, textId, start, end }, suffixLen) {
suffixLen += end - start;
const edges = Object.values(node.edges);
if (!edges.length) { // leaf node: calculate the match
if (!(matches.has(textId))) matches.set(textId, []);
matches.get(textId).push({ offset: end - suffixLen, termId });
return;
}
edges.forEach( edge => recurse(edge, suffixLen) );
})(edge, term.length - len);
return matches;
}
addText(text) {
// Implements Nikkonen's algorithm for building the tree
// Inspired by https://felix-halim.net/misc/suffix-tree/
const root = this,
active = {
node: root,
textId: this.texts.length,
start: 0,
end: 0,
},
texts = this.texts;
// Private functions
function getChar(textId, i) {
return texts[textId].charAt(i) || '$' + textId;
}
function addEdge(fromNode, textId, start, end, node) {
fromNode.addEdge(getChar(textId, start), textId, start, end, node);
}
function testAndSplit() {
const ch = getChar(active.textId, active.end);
if (active.start < active.end) {
const edge = active.node.edges[getChar(active.textId, active.start)],
splitPoint = edge.start + active.end - active.start;
if (ch === getChar(edge.textId, splitPoint)) return;
const newNode = new Node();
addEdge(active.node, edge.textId, edge.start, splitPoint, newNode);
addEdge(newNode, edge.textId, splitPoint, edge.end, edge.node);
return newNode;
}
if (!(ch in active.node.edges)) return active.node;
}
function canonize() {
while (active.start < active.end) {
const edge = active.node.edges[getChar(active.textId, active.start)];
if (edge.end - edge.start > active.end - active.start) break;
active.start += edge.end - edge.start;
active.node = edge.node;
}
}
function update() {
let prevNewNode = root,
newNode;
while (newNode = testAndSplit()) {
addEdge(newNode, active.textId, active.end, text.length+1, new Node());
// Rule 2: add suffix-link from previously inserted node
if (prevNewNode !== root) {
prevNewNode.suffixLink = newNode;
}
prevNewNode = newNode;
// Rule 3: follow suffixLink after split
active.node = active.node.suffixLink || root;
canonize(); // because active.node changed
}
if (prevNewNode !== root) {
prevNewNode.suffixLink = active.node;
}
}
texts.push(text);
if (!root.suffixLink) root.suffixLink = new Node();
for (let i = 0; i < text.length; i++) {
addEdge(root.suffixLink, active.textId, i, i+1, root);
}
// Main Ukkonen loop: add each character from left to right to the tree
while (active.end <= text.length) {
update();
active.end++;
canonize(); // because active.end changed
}
}
}
function trincotSuffixTree(query, options, suffixTree, separator) {
// Split query in terms at delimiter
const terms = query.split(separator).filter(Boolean);
if (!terms.length) return options;
// Sort terms by descending size
terms.sort( (a,b) => b.length - a.length );
// create Map keyed by term with count info
const termMap = new Map(terms.map( (term, termId) => [term, { termId, count: 0, leftOver: 0, size: term.length }] ));
terms.forEach( (term) => termMap.get(term).count++ );
function getNonOverlaps(offsets, leftOver, lastIndex = 0, offsetIndex = 0) {
// All terms found?
if (!leftOver) return [];
let giveUpAt = Infinity;
// While still enough matches left over:
while (offsetIndex + leftOver <= offsets.length) {
const { termId, offset } = offsets[offsetIndex++];
if (offset < lastIndex) continue; // overlap, try next
if (offset >= giveUpAt) break; // Looking further makes no sense
const termInfo = termMap.get(terms[termId]);
//console.log('termId', termId, 'offset', offset, 'size', termInfo.size, 'lastIndex', lastIndex);
if (!termInfo.leftOver) continue; // too many of the same term, try next
termInfo.leftOver--;
const result = getNonOverlaps(offsets, leftOver - 1, offset + termInfo.size, offsetIndex);
// If success, then completely backtrack out of recursion.
if (result) return result.concat([offset + termInfo.size, offset]);
termInfo.leftOver++; // restore after failed recursive search and try next
// If a term-match at a given offset could not lead to a solution (in recursion),
// and if we keep those matched character postions all unmatched and only start matching after
// the end of that location, it will certainly not lead to a solution either.
giveUpAt = Math.min(giveUpAt, offset + termInfo.size);
}
}
let allTermsAllOptionsOffsets;
// Loop through the unique terms:
for (let [term, termInfo] of termMap) {
// Get the offsets of the matches of this term in all options (in the preprocessed tree)
const thisTermAllOptionsOffsets = suffixTree.findAll(term, termInfo.termId);
//console.log('findAll:', JSON.stringify(Array.from(thisTermAllOptionsOffsets)));
if (!thisTermAllOptionsOffsets.size) return []; // No option has this term, so bail out
if (!allTermsAllOptionsOffsets) {
allTermsAllOptionsOffsets = thisTermAllOptionsOffsets;
} else {
// Merge with all previously found offsets for other terms (intersection)
for (let [optionId, offsets] of allTermsAllOptionsOffsets) {
let newOffsets = thisTermAllOptionsOffsets.get(optionId);
if (!newOffsets || newOffsets.length < termInfo.count) {
// this option does not have enough occurrences of this term
allTermsAllOptionsOffsets.delete(optionId);
} else {
allTermsAllOptionsOffsets.set(optionId, offsets.concat(newOffsets));
}
}
if (!allTermsAllOptionsOffsets.size) return []; // No option has all terms, so bail out
}
}
// Per option, see if (and where) the offsets can serve non-overlapping matches for each term
const matches = Array.from(allTermsAllOptionsOffsets, ([optionId, offsets]) => {
// Indicate how many of each term must (still) be matched:
termMap.forEach( obj => obj.leftOver = obj.count );
return [optionId, getNonOverlaps(offsets.sort( (a, b) => a.offset - b.offset ), terms.length)];
})
// Remove options that could not provide non-overlapping offsets
.filter( ([_, offsets]) => offsets )
// Sort the remaining options in their original order
.sort( (a,b) => a[0] - b[1] )
// Replace optionId, by the corresponding text and apply mark-up at the offsets
.map( ([optionId, offsets]) => {
let option = options[optionId];
offsets.map((index, i) => {
option = option.substr(0, index)
+ (i%2 ? "<u>" : "</u>")
+ option.substr(index);
});
return option;
});
//console.log(JSON.stringify(matches));
return matches;
}
function trincotPreprocess(options) {
const nikkonen = new Nikkonen();
// Add all the options (lowercased) to the suffic tree
options.map(option => option.toLowerCase()).forEach(nikkonen.addText.bind(nikkonen));
return nikkonen;
}
const options = ['abbbba', 'United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D\'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', '*', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People\'s Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People\'s Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', '*', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', '*, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'];
/*
* I/O and performance measurements
*/
let preprocessed;
function processInput() {
if (!preprocessed) { // Only first time
const t0 = performance.now();
preprocessed = trincotPreprocess(options);
const spentTime = performance.now() - t0;
// Output the time spent on preprocessing
pretime.textContent = spentTime.toFixed(2);
}
var query = this.value.toLowerCase();
const t0 = performance.now();
const matches = trincotSuffixTree(query, options, preprocessed, ' ');
const spentTime = performance.now() - t0;
// Output the time spent
time.textContent = spentTime.toFixed(2);
// Output the matches
result.innerHTML = '';
for (var match of matches) {
// Append it to the result list
var li = document.createElement('li');
li.innerHTML = match;
result.appendChild(li);
}
}
findTerms.addEventListener('keyup', processInput);
processInput.call(findTerms);
ul {
height:300px;
font-size: smaller;
overflow: auto;
}
Input terms: <input type="text" id="findTerms"><br>
<h3>Trincot's Suffix Tree Search</h3>
Preprocessing Time: <span id="pretime"></span>ms (only done once)<br>
Time: <span id="time"></span>ms<br>
<ul id="result"></ul>
This method has quite some code behind it, so I suppose it might not show interesting performance for small data sets, while for larger data sets it will be memory consuming: the tree takes much more memory than the original options array.
这个方法背后有相当多的代码,所以我认为它可能不会对小数据集显示出有趣的性能,而对于大数据集,它会消耗内存:树比原始的选项数组占用更多的内存。
#5
0
Update 2
Abandoned the concept of shrinking the set due to issues with restoring the working strings in Vue.
由于在Vue中恢复工作字符串的问题,放弃了收缩集合的概念。
Now, the method is simply as follows:
现在,方法简单如下:
- Pre-process the option set to keep the display in sync with the working.
- 预处理选项设置,以保持显示与工作同步。
- Process the terms.
- 处理条件。
- Reduce (filter) the option set by iterating over it, and looping over terms, short-circuiting when there is a mismatch.
- 减少(筛选)选项集,遍历选项集,并对项进行循环,在不匹配时进行短路。
- Using the reduced set, iterating over each option, find the match ranges.
- 使用减少的集合,遍历每个选项,找到匹配范围。
- Insert HTML strings around each match range.
- 在每个匹配范围内插入HTML字符串。
The code is commented.
代码的注释。
Raw javascript (logs the filtered/manipulated options array): https://jsfiddle.net/pvLj9uxe/14/
原始javascript(记录过滤/操作的选项数组):https://jsfiddle.net/pvLj9uxe/14/
New Vue implementation: https://jsfiddle.net/15prcpxn/30/
新Vue实现:https://jsfiddle.net/15prcpxn/30/
The calculation seems reasonably fast - the DOM update is what kills it.
计算速度似乎相当快,因为DOM更新会杀死它。
Added to comparison*: https://jsfiddle.net/ektyx133/4/
添加到比较*:https://jsfiddle.net/ektyx133/4/
*Caveat: Pre-processing the options (treated as "static") is part of the strategy, so it has been processed outside of the benchmark.
*注意:对选项进行预处理(作为“静态”处理)是策略的一部分,因此它在基准之外进行处理。
var separator = /\s|\*|,/;
// this function enhances the raw options array
function enhanceOptions(options) {
return options.map(option => ({
working: option.toLowerCase(), // for use in filtering the set and matching
display: option // for displaying
}))
}
// this function changes the input to lower case, splits the input into terms, removes empty strings from the array, and enhances the terms with the size and wiping string
function processInput(input) {
return input.trim().toLowerCase().split(separator).filter(term => term.length).map(term => ({
value: term.toLowerCase(),
size: term.length,
wipe: " ".repeat(term.length)
})).sort((a, b) => b.size - a.size);
}
// this function filters the data set, then finds the match ranges, and finally returns an array with HTML tags inserted
function filterAndHighlight(terms, enhancedOptions) {
let options = enhancedOptions,
l = terms.length;
// filter the options - consider recursion instead
options = options.filter(option => {
let i = 0,
working = option.working,
term;
while (i < l) {
if (!~working.indexOf((term = terms[i]).value)) return false;
working = working.replace(term.value, term.wipe);
i++;
}
return true;
})
// generate the display string array
let displayOptions = options.map(option => {
let rangeSet = [],
working = option.working,
display = option.display;
// find the match ranges
terms.forEach(term => {
working = working.replace(term.value, (match, offset) => { // duplicate the wipe string replacement from the filter, but grab the offsets
rangeSet.push({
start: offset,
end: offset + term.size
});
return term.wipe;
})
})
// sort the match ranges, last to first
rangeSet.sort((a, b) => b.start - a.start);
// insert the html tags within the string around each match range
rangeSet.forEach(range => {
display = display.slice(0, range.start) + '<u>' + display.slice(range.start, range.end) + '</u>' + display.slice(range.end)
})
return display;
})
return displayOptions;
}
Old Attempt
https://jsfiddle.net/15prcpxn/25/
https://jsfiddle.net/15prcpxn/25/
My attempt, using Vue for rendering (the methods are sequential, so you can probably put it all into one monolithic function without much effort - inputs will be terms, and full option set; outputs will be filtered option set, and highlighted ranges).
我的尝试,使用Vue进行渲染(这些方法是顺序的,所以你可以不费多大力气就把它们都放到一个独立的函数中——输入将是terms,并设置完整的选项;输出将被过滤选项集和突出显示范围)。
- Split the input into individual terms
- 将输入分成单独的项
- Sort terms by length (longest term first, so that when you have an option such as
"abc ab"
, and the terms"a abc"
, i.e. a term is a substring of another term, it will be able to match"abc"
) - 按长度对项进行排序(最长项优先,以便当您有一个选项,如“abc ab”和“a abc”,即一个项是另一个项的子字符串,它将能够匹配“abc”)
- Copy/change terms to lower case
- 抄写/更改条款为小写
- Copy options (our "display set") to lower case (our "working set")
- 复制选项(我们的“显示集”)到小写(我们的“工作集”)
- For each term, remove working options without matches from the "working set", and in parallel remove display options from the "display set" - when doing so, remove the matched term strings from the surviving working option strings, e.g. matching term
"a"
in option"abc"
yields"bc"
[actual implementation is the inverse: For each term, recreate "working set" when there are matches, and in parallel add display options to the "display set", then pass these sets to the next term] - this gives us the filtered display set - 对于每一项,删除工作选择不匹配的“工作集”,和并行删除显示选项“显示设置”——当这样做时,删除从幸存的工作选择字符串匹配项的字符串,如匹配项选择“abc”收益率的“a”“公元前”[实际实现逆:对于每个术语,当有匹配时重新创建“工作集”,并在“显示集”中并行地添加显示选项,然后将这些设置传递给下一个术语]——这将给我们过滤显示集
- Copy the filtered display set to lower case, to give us a new filtered working set
- 将过滤后的显示集复制到小写,以给我们一个新的过滤工作集
- For each working option in the remaining filtered working set, create a range-set, by recording the ranges (i.e. start and end, e.g. matching term
"a"
in option"abc"
:start = 0, end = 1
) where each term matches by taking the offset (start) of the match and the length of the term/match. Replace the matched string with empty spaces (or other unused character) of equal length to the term and feed this into the next term, e.g. matching term"a"
in option"abc"
yields" bc"
- this preserves the length of the working option, ensuring that the filtered working set (in lower case) matches the filtered display set (original case). The total number of range-sets will be equal to the number of remaining options in the filtered option set. - 为每个工作选择剩下的过滤工作集,创建一个值域范围,通过记录的范围(即开始和结束,如匹配项选项“abc”:“一”= 0开始,结束= 1),每一项比赛通过比赛的抵消(start)和词的长度/匹配。匹配的字符串替换为空的空间(或其他未使用的字符)同等长度的术语和饲料这进入下一项,如匹配项选择“abc”收益率的“a”“公元前”——这个保留的长度选择工作,确保过滤工作集(小写)匹配过滤显示设置(原)。距离集的总数将等于过滤选项集中的剩余选项的数目。
- Also, sort the ranges within each range-set (in descending order, to work in reverse), to allow for string insertion.
- 此外,对每个范围集中的范围进行排序(按降序排列,以便反向工作),以允许字符串插入。
- For each option in the filtered display set, (working in reverse so as not to disturb the index when manipulating the string), insert the
<u>
</u>
tags around the matched ranges by slicing up the display option, e.g. matching term"a"
in option"abc"
:new option = "<u>" + "a" + "</u>" + "bc"
- 过滤显示每个选项的设置,(工作在反向,以免打扰索引时操纵字符串),插入 <你> < / u >标签的匹配范围通过切片显示选项,如匹配术语“选项“abc”:“新选项= < u >”+“a”+“< / u >”+“公元前”
- Render it
- 使它
Performance is poor when there are many matches / non-useful terms (e.g. when you input a single character). For end-use, I'd probably put in an input-calculation delay.
当有许多匹配项/无用项时(例如,当您输入单个字符时),性能会很差。对于最终用途,我可能会加入一个输入计算延迟。
I should be able to roll up some of these steps into fewer steps which may improve performance. I'll revisit tomorrow.
我应该能够将其中的一些步骤简化为更少的步骤,从而提高性能。我明天会重新审视。
Vue presumably also takes care of some of the optimisations through virtual DOM etc. so it won't necessarily be reflective of vanilla Javascript / DOM rendering.
Vue可能还会通过虚拟DOM等来处理一些优化,所以它不一定是普通Javascript / DOM渲染的反映。
#1
6
I gave it a go but I'm not sure how much help this will be. My approach is similar to your Divide and Conquer technique.
我试了一下,但我不确定这能帮多少忙。我的方法类似于你的分治法。
Instead of biting off bits of the string I search for each term ahead of time and store up a collection of all the matches, recording the start and finish positions. If there aren't enough matches for a particular search term the algorithm immediately bails for that 'option'.
而不是撕咬字符串,我搜索每一个词提前,并储存所有的比赛,记录开始和结束的位置。如果一个特定的搜索词没有足够的匹配项,算法会立即发出“选项”的呼叫。
Once it has gathered together all of the possible matches it recursively attempts to find a combination that doesn't overlap. There's a lot of copying of data structures going on in that recursion and I suspect it could be optimized a lot better than I have it here. I can also only apologize for some of the variable names, I was struggling to come up with names that made any sense at all.
一旦它收集了所有可能的匹配项,它就会递归地尝试找到一个不重叠的组合。在这个递归中有很多数据结构的复制我怀疑它可以比我这里的优化得更好。我也只能为一些变量的名字道歉,我正在努力想出一些有意义的名字。
For certain test searches, such as a n a n a n a n ...
it seems to cope better than the original Divide and Conquer technique but I suspect that may be just because of the early bail-out that is performed when insufficient matches are found for a particular search term. Without a large quantity of real data it's difficult to know where the really valuable optimizations would be.
对于某些测试搜索,例如a n n n n n a n a n a n n…它似乎比最初的分治技巧处理得更好,但我怀疑,这可能仅仅是因为早期的纾困措施,即在发现某一特定搜索词的匹配不足时进行纾困。如果没有大量的真实数据,就很难知道真正有价值的优化在哪里。
function search() {
var options = [
'ababababa',
'United States',
'United Kingdom',
'Afghanistan',
'Aland Islands',
'Albania',
'Algeria',
'American Samoa',
'Andorra',
'Angola',
'Anguilla',
'Antarctica',
'Antigua and Barbuda',
'Argentina',
'Armenia',
'Aruba',
'Australia',
'Austria',
'Azerbaijan',
'Bahamas',
'Bahrain',
'Bangladesh',
'Barbados',
'Belarus',
'Belgium',
'Belize',
'Benin',
'Bermuda',
'Bhutan',
'Bolivia, Plurinational State of',
'Bonaire, Sint Eustatius and Saba',
'Bosnia and Herzegovina',
'Botswana',
'Bouvet Island',
'Brazil',
'British Indian Ocean Territory',
'Brunei Darussalam',
'Bulgaria',
'Burkina Faso',
'Burundi',
'Cambodia',
'Cameroon',
'Canada',
'Cape Verde',
'Cayman Islands',
'Central African Republic',
'Chad',
'Chile',
'China',
'Christmas Island',
'Cocos (Keeling) Islands',
'Colombia',
'Comoros',
'Congo',
'Congo, The Democratic Republic of The',
'Cook Islands',
'Costa Rica',
'Cote D\'ivoire',
'Croatia',
'Cuba',
'Curacao',
'Cyprus',
'Czech Republic',
'Denmark',
'Djibouti',
'Dominica',
'Dominican Republic',
'Ecuador',
'Egypt',
'El Salvador',
'Equatorial Guinea',
'Eritrea',
'Estonia',
'Ethiopia',
'Falkland Islands (Malvinas)',
'Faroe Islands',
'Fiji',
'Finland',
'France',
'French Guiana',
'French Polynesia',
'French Southern Territories',
'Gabon',
'Gambia',
'Georgia',
'Germany',
'Ghana',
'Gibraltar',
'Greece',
'Greenland',
'Grenada',
'Guadeloupe',
'Guam',
'Guatemala',
'Guernsey',
'Guinea',
'Guinea-bissau',
'Guyana',
'Haiti',
'Heard Island and Mcdonald Islands',
'Holy See (Vatican City State)',
'Honduras',
'*',
'Hungary',
'Iceland',
'India',
'Indonesia',
'Iran, Islamic Republic of',
'Iraq',
'Ireland',
'Isle of Man',
'Israel',
'Italy',
'Jamaica',
'Japan',
'Jersey',
'Jordan',
'Kazakhstan',
'Kenya',
'Kiribati',
'Korea, Democratic People\'s Republic of',
'Korea, Republic of',
'Kuwait',
'Kyrgyzstan',
'Lao People\'s Democratic Republic',
'Latvia',
'Lebanon',
'Lesotho',
'Liberia',
'Libya',
'Liechtenstein',
'Lithuania',
'Luxembourg',
'Macao',
'Macedonia, The Former Yugoslav Republic of',
'Madagascar',
'Malawi',
'Malaysia',
'Maldives',
'Mali',
'Malta',
'Marshall Islands',
'Martinique',
'Mauritania',
'Mauritius',
'Mayotte',
'Mexico',
'Micronesia, Federated States of',
'Moldova, Republic of',
'Monaco',
'*',
'Montenegro',
'Montserrat',
'Morocco',
'Mozambique',
'Myanmar',
'Namibia',
'Nauru',
'Nepal',
'Netherlands',
'New Caledonia',
'New Zealand',
'Nicaragua',
'Niger',
'Nigeria',
'Niue',
'Norfolk Island',
'Northern Mariana Islands',
'Norway',
'Oman',
'Pakistan',
'Palau',
'Palestinian Territory, Occupied',
'Panama',
'Papua New Guinea',
'Paraguay',
'Peru',
'Philippines',
'Pitcairn',
'Poland',
'Portugal',
'Puerto Rico',
'Qatar',
'Reunion',
'Romania',
'Russian Federation',
'Rwanda',
'Saint Barthelemy',
'Saint Helena, Ascension and Tristan da Cunha',
'Saint Kitts and Nevis',
'Saint Lucia',
'Saint Martin (French part)',
'Saint Pierre and Miquelon',
'Saint Vincent and The Grenadines',
'Samoa',
'San Marino',
'Sao Tome and Principe',
'Saudi Arabia',
'Senegal',
'Serbia',
'Seychelles',
'Sierra Leone',
'Singapore',
'Sint Maarten (Dutch part)',
'Slovakia',
'Slovenia',
'Solomon Islands',
'Somalia',
'South Africa',
'South Georgia and The South Sandwich Islands',
'South Sudan',
'Spain',
'Sri Lanka',
'Sudan',
'Suriname',
'Svalbard and Jan Mayen',
'Swaziland',
'Sweden',
'Switzerland',
'Syrian Arab Republic',
'*, Province of China',
'Tajikistan',
'Tanzania, United Republic of',
'Thailand',
'Timor-leste',
'Togo',
'Tokelau',
'Tonga',
'Trinidad and Tobago',
'Tunisia',
'Turkey',
'Turkmenistan',
'Turks and Caicos Islands',
'Tuvalu',
'Uganda',
'Ukraine',
'United Arab Emirates',
'United Kingdom',
'United States',
'United States Minor Outlying Islands',
'Uruguay',
'Uzbekistan',
'Vanuatu',
'Venezuela, Bolivarian Republic of',
'Viet Nam',
'Virgin Islands, British',
'Virgin Islands, U.S.',
'Wallis and Futuna',
'Western Sahara',
'Yemen',
'Zambia',
'Zimbabwe'
];
var terms = document.getElementById('search').value.trim().toLowerCase().split(/\s+/);
if (!terms[0]) {
terms = [];
}
document.getElementById('terms').innerText = 'Terms: ' + JSON.stringify(terms);
var startTime = performance.now();
// Term counts is a map storing how many times each search term appears in the query
var termCounts = {};
terms.forEach(function(term) {
termCounts[term] = (termCounts[term] || 0) + 1;
});
// An array of search terms with the duplicates removed
var uniqueTerms = Object.keys(termCounts);
// Loop through each option and map to either a highlight version or null
options = options.map(function(optionText) {
var matches = {},
lastMatchIndex = {},
option = optionText.toLowerCase();
uniqueTerms.forEach(function(term) {
// This array will be populated with start/end position of each match for this term
matches[term] = [];
// The index of the last match... which could be deduced from the matches but this is slightly easier
lastMatchIndex[term] = -1;
});
var incompleteMatchTerms = uniqueTerms.slice(),
nextMatchTerm;
// This is probably a premature optimization but doing it this
// way ensures we check that each search term occurs at least
// once as quickly as possible.
while (nextMatchTerm = incompleteMatchTerms.shift()) {
var nextMatchIndex = option.indexOf(nextMatchTerm, lastMatchIndex[nextMatchTerm] + 1);
if (nextMatchIndex === -1) {
// Haven't found enough matches for this term, so the option doesn't match
if (termCounts[nextMatchTerm] > matches[nextMatchTerm].length) {
return null;
}
}
else {
// Found another match, put the term back on the queue
// for another round
incompleteMatchTerms.push(nextMatchTerm);
lastMatchIndex[nextMatchTerm] = nextMatchIndex;
matches[nextMatchTerm].push({
start: nextMatchIndex,
end: nextMatchIndex + nextMatchTerm.length
});
}
}
// Pass in the original array of terms... we attempt to highlight in the order of the original query
var highlights = performHighlight(terms, matches);
if (!highlights) {
return null;
}
// We need the highlights sorted so that we do the replacing from the end of the string
highlights.sort(function(h1, h2) {
return h2.start - h1.start;
});
highlights.forEach(function(highlight) {
optionText = optionText.slice(0, highlight.start)
+ '<u>' + optionText.slice(highlight.start, highlight.end) + '</u>'
+ optionText.slice(highlight.end);
});
return optionText;
function performHighlight(terms, allMatches) {
// If there are no terms left to match we've got a hit
if (terms.length === 0) {
return [];
}
var nextTerms = terms.slice(),
term = nextTerms.shift(),
matches = allMatches[term].slice(),
match;
while (match = matches.shift()) {
var nextMatches = {};
// We need to purge any entries from nextMatches that overlap the current match
uniqueTerms.forEach(function(nextTerm) {
var nextMatch = term === nextTerm ? matches : allMatches[nextTerm];
nextMatches[nextTerm] = nextMatch.filter(function(match2) {
return match.start >= match2.end || match.end <= match2.start;
});
});
var highlights = performHighlight(nextTerms, nextMatches);
if (highlights) {
highlights.push(match);
return highlights;
}
}
return null;
}
});
document.getElementById('results').innerHTML = options.map(function(option) {
if (option) {
return '<li>' + option + '</li>';
}
return '';
}).join('');
document.getElementById('time').innerText = Math.round((performance.now() - startTime) * 100) / 100 + 'ms';
}
<h1>Permutations</h1>
<input type="text" id="search" onkeyup="search()" autocomplete="off">
<p id="terms"></p>
<p id="time"></p>
<ul id="results"></ul>
Update:
更新:
Based on feedback from Mikk3lRo in the comments I've done a bit of performance tuning and come up with this:
根据Mikk3lRo在评论中给出的反馈,我做了一些性能调整,得出了以下结论:
https://jsfiddle.net/skirtle/ndeuqn02/1/
https://jsfiddle.net/skirtle/ndeuqn02/1/
The core algorithm is the same but I've made it much more difficult to understand, all in the name of performance. Most of the changes relate to avoiding the creation of new objects wherever possible.
核心算法是一样的,但我已经使它更加难以理解,所有这些都是以性能的名义。大多数变化都与尽可能避免创建新对象有关。
As the algorithm does a lot of searching upfront for things it might never need there will always be opportunities for other algorithms to be quicker, especially in simple cases. Many of those cases could be handled separately but I haven't attempted that kind of optimisation.
由于算法在前面做了大量的搜索,它可能永远不需要的东西,总是有机会让其他算法更快,特别是在简单的情况下。其中许多案例可以单独处理,但我还没有尝试过这种优化。
In Chrome it now outperforms the other implementations in a lot of different scenarios, though that is an unfair comparison as they haven't yet been tuned in the same way. The other implementations tend to be slightly faster in Firefox for simple searches but times are all in the same ballpark.
在Chrome中,它在很多不同的场景中都优于其他实现,尽管这是不公平的比较,因为它们还没有以同样的方式进行调优。对于简单的搜索来说,其他的实现在Firefox中会稍微快一点,但是时间都是一样的。
Some particularly interesting searches:
一些特别有趣的搜索:
-
a ab ba baba
. I've added a new 'option' and adjusted the CSS to demonstrate this. The algorithms differ in their chosen way to perform the highlighting. My algorithm favours the order of the terms in the query rather than basing it on the length of the terms. There are further optimisations available if I don't worry about the ordering but I think they'd only help in extreme cases of overlaps. - ab英航巴巴。我添加了一个新的“选项”并调整了CSS以演示这一点。算法在执行突出显示的方式上有所不同。我的算法更喜欢查询中术语的顺序,而不是基于术语的长度。如果我不担心排序的话,还有进一步的优化,但是我认为它们只在重叠的极端情况下有用。
-
t r i s t a n d a c u n h a
. Note the spaces between the letters, these are 14 separate search terms. If you type this one term at a time Divide and Conquer will quickly start to struggle but it does recover towards the end. Wipe and Shadow cope well for longer but when you type the letterc
they will fall off a cliff. I think it's an exponential explosion in the backtracking but I haven't confirmed that. I'm sure with a bit of work it could be addressed in simple cases but it might be trickier to fix in cases where the backtracking is caused by an unresolvable overlap. - 注意字母之间的空格,这是14个单独的搜索词。如果你一次输入这一项,分治和征服将很快开始斗争,但它确实在最后恢复。擦拭和阴影处理较长时间,但当你键入字母c时,它们会掉下悬崖。我认为这是回溯的指数爆炸,但我还没有证实。我相信通过一些工作,它可以在简单的情况下得到解决,但在由于不可解重叠而导致回溯的情况下,修复它可能会更加困难。
I'm sure all the implementations could be sped up with a bit more tuning and a few carefully crafted bits of special-case handling. Which one is actually 'best' for real scenarios I'm not sure but my current feeling is that my algorithm probably only has a narrow sweet spot where it would outperform the others in a truly fair test. An algorithm that doesn't do all that searching upfront seems hard to beat for real searches.
我确信所有的实现都可以通过更多的调优和一些精心设计的特殊案例处理来加速。我不确定哪种算法在真实场景中是“最好的”,但我目前的感觉是,我的算法可能只有一个很小的最佳点,在一个真正公平的测试中,它会比其他算法表现得更好。一个不做所有搜索的算法似乎很难被真正的搜索打败。
Update 2
更新2
I've tried a different implementation of my earlier approach:
我尝试了一种不同的方法来实现我之前的方法:
https://jsfiddle.net/skirtle/ndeuqn02/9/
https://jsfiddle.net/skirtle/ndeuqn02/9/
Note that I've only updated my own implementation, the others remain out of date.
注意,我只更新了自己的实现,其他的仍然过时。
I thought I'd try to avoid unnecessary searches by performing them lazily rather than doing them all upfront. It still caches them so that they can be reused if the algorithm needs to backtrack. I don't know whether this makes a significant difference as performing small numbers of extra searches on short strings probably wasn't adding much overhead.
我想我应该尽量避免不必要的搜索,懒惰地执行它们,而不是预先执行它们。它仍然缓存它们,以便在算法需要回溯时可以重用它们。我不知道这是否会产生重大影响,因为在短字符串上执行少量的额外搜索可能不会增加太多开销。
I also experimented with cutting out the function recursion. While it does seem to work I feel there's a high risk of bugs (it would need a lot of unit tests to be sure it actually does work). I'm not convinced this part was really a success because the data structures involved make it really difficult to follow. It does seem to be fast but not by enough to justify the complexity.
我还尝试了切掉函数递归。虽然它看起来确实有效,但我觉得存在很大的缺陷(它需要大量的单元测试以确保它确实有效)。我不相信这部分真的是成功的,因为所涉及的数据结构使它很难遵循。它看起来确实很快,但不足以证明它的复杂性。
I also experimented with alternative ways to build up the final highlights. All that sorting and slicing seemed to be a performance drain but, again, the code gets more complicated trying to avoid it. Some of these gains might be applicable to the other algorithms though.
我还尝试了其他方法来建立最后的亮点。所有的排序和切片看起来都是性能消耗,但是,代码在试图避免这种情况时变得更加复杂。不过,其中一些收益可能适用于其他算法。
Another idea I've introduced here is a pre-search analysis of the query terms (dependent only on the query and not on the options). It checks whether terms can overlap and for any terms where an overlap is impossible (e.g. cat dog
) it uses a much simpler algorithm to just grab the matches. This idea could potentially be applied to the other algorithms too.
我在这里介绍的另一个想法是对查询条件进行预搜索分析(仅依赖于查询,而不依赖于选项)。它检查术语是否可以重叠,对于任何不可能重叠的术语(例如cat dog),它使用一个更简单的算法来抓取匹配项。这个想法也可以应用到其他算法中。
As mentioned in the comments the idea of running some sort of pre-search analysis of the options is also possible but I haven't actually implemented that here. It's difficult to know what sort of search index would be most beneficial as it depends on things like memory usage and the specifics of the options. However, it might be more practical to try carrying over small amounts of information from one search to the next.
正如评论中提到的,对选项进行某种预先搜索分析的想法也是可能的,但我并没有在这里实现。很难知道哪种搜索索引最有用,因为它取决于内存使用和选项的细节。然而,尝试将少量的信息从一个搜索传递到下一个搜索可能更实际。
For example, if someone searches for united states
there's a good chance that the last thing they typed was the final s
and their previous search was united state
. Two potential optimisations based on this are:
例如,如果有人搜索美国,很有可能他们最后输入的是最后的s,而他们之前的搜索是美国。基于此的两个可能的优化是:
- The matching options for
united states
will be a subset of those forunited state
, so if we remember that list we could save having to check all the other options. This could be used for any of the algorithms. - 美国的匹配选项将是美国的子集,所以如果我们记得这个列表,我们可以省去检查其他选项的麻烦。这可以用于任何算法。
- In the case of my algorithm the match caches could be retained from one search to the next. While the cache entry for
state
wouldn't be any use the entry forunited
would be exactly the same from one search to the next, allowing us to skip an expensive part of the algorithm for that term. - 在我的算法中,匹配缓存可以从一个搜索保留到下一个搜索。虽然state的缓存条目不会有任何用处,但是united的条目从一个搜索到下一个搜索是完全相同的,这允许我们跳过这个术语的算法的昂贵部分。
#2
6
I would suggest a slight variant on the divide and conquer idea: instead of splitting the string into chunks (bites), you could "wipe out" the characters that have been matched, and perform further searches on that one string. The character to wipe out with would be the separator, as it is guaranteed to not occur in any of the terms.
我建议对“分治法”的思想做一个细微的改变:不要将字符串分割成块(片段),您可以“删除”已匹配的字符,并对该字符串执行进一步的搜索。要清除的字符将是分隔符,因为它保证不会出现在任何术语中。
Here it is:
这里是:
function trincotWipeSearch(query, options, separator) {
// Split query in terms at delimiter
const terms = query.split(separator).filter(Boolean);
if (!terms.length) return options;
// Sort terms by descending size
terms.sort( (a,b) => b.length - a.length );
// Escape terms, and enrich with size of original term
// and a string of the same length filled with the separator char
const items = terms.map(term => ({
size: term.length,
wipe: separator.repeat(term.length),
regex: new RegExp(term.replace(/[-[\]{}()*+?.,\\^$|#\s]/g, "\\$&"), 'gi')
}));
function getOffsets(termIndex, text) {
// All terms found?
if (termIndex >= terms.length) return [];
let match;
const { regex, size, wipe } = items[termIndex];
regex.lastIndex = 0;
while (match = regex.exec(text)) {
let index = match.index;
// Wipe characters and recurse to find other terms
let offsets = getOffsets(termIndex+1,
text.substr(0, index) + wipe + text.substr(index + size));
if (offsets !== undefined) {
// Solution found, backtrack all the way
return offsets.concat([index, index + size]);
}
regex.lastIndex = match.index + 1;
}
}
// Loop through each option
return options.map( option => {
// Get the offsets of the matches
let offsets = getOffsets(0, option);
if (offsets) {
// Apply the offsets to add the markup
offsets
.sort( (a,b) => b - a )
.map((index, i) => {
option = option.substr(0, index)
+ (i%2 ? "<u>" : "</u>")
+ option.substr(index);
});
return option;
}
}).filter(Boolean); // get only the non-empty results
}
var options = ['United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D\'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', '*', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People\'s Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People\'s Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', '*', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', '*, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'];
/*
* I/O and performance measurements
*/
function processInput() {
var query = this.value.toLowerCase();
const t0 = performance.now();
const matches = trincotWipeSearch(query, options, ' ');
const spentTime = performance.now() - t0;
// Output the time spent
time.textContent = spentTime.toFixed(2);
// Output the matches
result.innerHTML = '';
for (var match of matches) {
// Append it to the result list
var li = document.createElement('li');
li.innerHTML = match;
result.appendChild(li);
}
}
findTerms.addEventListener('keyup', processInput);
processInput.call(findTerms);
ul {
height:300px;
font-size: smaller;
overflow: auto;
}
Input terms: <input type="text" id="findTerms"><br>
<h3>Trincot's Wipe Search</h3>
Time: <span id="time"></span>ms<br>
<ul id="result"></ul>
I have excluded DOM I/O from the time measurement.
我已经将DOM I/O排除在时间度量之外。
Here is a jsfiddle comparing the two algorithms side by side. It should not be difficult to add a third algorithm to compare with other algorithms still.
这里有一个jsfiddle在比较这两种算法。添加第三种算法与其他算法进行比较应该不难。
When the separator could be any regular expression
...then the above function cannot be used. One way to overcome this is to introduce a "shadow" string, just as long as the option string, but with only 2 different possible characters in it (e.g. .
and x
):
…然后不能使用上面的函数。解决这个问题的一种方法是引入一个“影子”字符串,和选项字符串一样长,但是其中只有两个不同的可能字符(例如。和x):
-
One of the two would indicate that the corresponding character (i.e. at the same position) in the option string has been matched with a term, and therefore is no longer available for another term's match.
其中之一将表示选项字符串中对应的字符(即在相同位置)已与项匹配,因此不再适用于其他项的匹配。
-
The other character would indicate that the corresponding character in the option string is still available for being included in a term's match.
另一个字符将指示选项字符串中的相应字符仍然可用,以便包含在术语的匹配中。
Obviously this makes the function a tad slower, as there might be matches that need to be rejected after checking against this shadow string:
显然,这使得函数稍微慢了一点,因为在对这个影子字符串进行检查之后,可能会有匹配项需要被拒绝:
function trincotShadowMarks (query, options, separator) {
// Split query in terms at delimiter
const terms = query.split(separator).filter(Boolean);
if (!terms.length) return options;
// Sort terms by descending size
terms.sort( (a,b) => b.length - a.length );
// Escape terms, and enrich with size of original term
// and a string of the same length filled with the separator char
const items = terms.map(term => ({
size: term.length,
used: 'x'.repeat(term.length),
free: '.'.repeat(term.length),
regex: new RegExp(term.replace(/[-[\]{}()*+?.,\\^$|#\s]/g, "\\$&"), 'gi')
}));
function getOffsets(termIndex, text, shadow) {
// All terms found?
if (termIndex >= terms.length) return [];
let match;
const { regex, size, used, free } = items[termIndex];
regex.lastIndex = 0;
while (regex.lastIndex > -1 && (match = regex.exec(text))) {
let index = match.index;
// Is this match not overlapping with another match?
if (!shadow.substr(index, size).includes('x')) {
// Mark position as used and recurse to find other terms
let offsets = getOffsets(termIndex+1, text,
shadow.substr(0, index) + used + shadow.substr(index + size));
if (offsets !== undefined) {
// Solution found, backtrack all the way
return offsets.concat([index, index + size]);
}
}
regex.lastIndex = shadow.indexOf(free, match.index + 1);
}
}
// Loop through each option
return options.map( option => {
// Get the offsets of the matches
let offsets = getOffsets(0, option, '.'.repeat(option.length));
if (offsets) {
// Apply the offsets to add the markup
offsets
.sort( (a,b) => b - a )
.map((index, i) => {
option = option.substr(0, index)
+ (i%2 ? "<u>" : "</u>")
+ option.substr(index);
});
return option;
}
}).filter(Boolean); // get only the non-empty results
}
var options = ['United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D\'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', '*', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People\'s Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People\'s Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', '*', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', '*, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'];
/*
* I/O and performance measurements
*/
function processInput() {
var query = this.value.toLowerCase();
const t0 = performance.now();
const matches = trincotShadowMarks(query, options, ' ');
const spentTime = performance.now() - t0;
// Output the time spent
time.textContent = spentTime.toFixed(2);
// Output the matches
result.innerHTML = '';
for (var match of matches) {
// Append it to the result list
var li = document.createElement('li');
li.innerHTML = match;
result.appendChild(li);
}
}
findTerms.addEventListener('keyup', processInput);
processInput.call(findTerms);
ul {
height:300px;
font-size: smaller;
overflow: auto;
}
Input terms: <input type="text" id="findTerms"><br>
<h3>Trincot's Wipe Search</h3>
Time: <span id="time"></span>ms<br>
<ul id="result"></ul>
#3
3
Divide and Conquer
Somewhat more complicated than the single-regex Viral Permutations strategy - this recursive algorithm searches for each term one after the other, starting with the longest term.
比单regex病毒排列策略要复杂一些——这个递归算法从最长的项开始,一个接一个地搜索每一项。
Each time a match is found it divides that "bite" into three (unless at the beginning or end), marking the matched "bite" as consumed, and attempts to match the next-longest term in any of the unconsumed "bites".
每次找到匹配项时,它将“bite”分为三种(除非在开始或结束时),将匹配的“bite”标记为消费,并尝试在任何未消费的“bite”中匹配下一个最长的术语。
When it is unable to find the longest unmatched term, it will backtrack and attempt to match the previous term at a different position (even in a different "bite").
当它找不到最长的不匹配项时,它将回溯并尝试在不同的位置(即使是在不同的“咬”位置)匹配前一项。
If it comes back to the longest term and cannot find another position to match it at, it will return false.
如果它返回到最长的项并且找不到另一个位置来匹配它,它将返回false。
This means that in most cases it can return non-matches pretty quickly, simply because they don't even contain the longest term.
这意味着在大多数情况下,它可以很快地返回非匹配项,因为它们甚至不包含最长的项。
Of course if it runs out of terms - ie. successfully matches the shortest - it will return the highlighted match by joining all the "bites" back together.
当然,如果它用完了。成功匹配最短的-它将返回突出显示的匹配,加入所有的“咬”在一起。
Demo:
Updated for improved performance: The base algorithm is exactly the same, but there were some pretty expensive calls to arr.slice()
that could be avoided completely.
为了提高性能而更新:基本算法完全相同,但是对arr.slice()的一些调用非常昂贵,可以完全避免。
function divide_and_conquer_replace(query, options, separator) {
var terms, terms_esc;
//The inner replacement function
function divide_and_conquer_inner(bites, depth) {
var this_term, i, bite, match, new_bites, found_all_others;
depth = depth ? depth : 1;
//Get the longest remaining term
this_term = terms_esc[terms_esc.length - depth];
//Loop all the bites
for (i = 0; i < bites.length; i++) {
bite = bites[i];
//Reset the lastIndex since we're reusing the RegExp objects
this_term.lastIndex = 0;
//Check that we have a string (ie. do not attempt to match bites
//that are already consumed)
if (typeof bite === 'string') {
//Find the next matching position (if any)
while (match = this_term.exec(bite)) {
new_bites = (i > 0) ? bites.slice(0, i) : [];
if (match.index > 0) {
new_bites.push(bite.slice(0, match.index));
}
new_bites.push(['<u>' + match[0] + '</u>']);
if (this_term.lastIndex < bite.length) {
new_bites.push(bite.slice(this_term.lastIndex));
}
if (i < bites.length - 1) {
new_bites = new_bites.concat(bites.slice(i + 1));
}
if (terms_esc.length > depth) {
//Attempt to find all other terms
found_all_others = divide_and_conquer_inner(new_bites, depth + 1);
//If we found all terms we'll pass the modified string all the
//way up to the original callee
if (found_all_others) {
return found_all_others;
}
//Otherwise try to match current term somewhere else
this_term.lastIndex = match.index + 1;
} else {
//If no terms remain we have a match
return new_bites.join('');
}
}
}
}
//If we reach this point at least one term was not found
return null;
};
// Split query in terms at delimiter
terms = query.split(separator).filter(Boolean);
if (!terms.length) return options;
//Sort terms according to length - longest term last
terms.sort(function(a, b) {
return a.length - b.length;
});
//Escape terms
//And store RegExp's instead of strings
terms_esc = terms.map(function (term) {
return term.replace(/[-[\]{}()*+?.,\\^$|#\s]/g, "\\$&");
}).map(function (term) {
return new RegExp(term, 'gi');
});
//Loop through each option
return options.map(function(option){
return divide_and_conquer_inner([option]);
}).filter(Boolean);
}
var options = ['United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D\'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', '*', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People\'s Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People\'s Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', '*', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', '*, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'];
var separator = ' ';
function divide_and_conquer(){
var query = document.getElementById('divide_and_conquer').value;
var res_elm = document.getElementById('divide_and_conquer_result');
var t0 = performance.now();
var results = divide_and_conquer_replace(query, options, separator);
var t1 = performance.now();
document.getElementById('divide_and_conquer_meta').innerHTML = 'Time: ' + (t1 - t0).toFixed(2) + 'ms';
res_elm.innerHTML = '';
for (var result of results) {
res_elm.innerHTML += '<li>' + result + '</li>';
}
};
divide_and_conquer();
<input type="text" id="divide_and_conquer" onkeyup="divide_and_conquer()">
<p id="divide_and_conquer_meta"></p>
<ul style="height:300px;overflow:auto" id="divide_and_conquer_result"></ul>
This strategy has performance issues when the query consists exclusively of (usually very short) strings that are all / most present in many of the options - such as a a a a a a a a
...
当查询只包含(通常是非常短的)字符串时,这种策略就会出现性能问题,这些字符串在许多选项中都是/大部分是存在的——例如a a a a a a a a a a a…
In realistic scenarios it is currently outperforming the other proposed algorithms - see the link to jsperf added to the question.
在现实的场景中,它目前的表现超过了其他提出的算法——请参阅添加到问题中的jsperf链接。
#4
2
Here is an entirely different approach than taken in my previous answer -- which I could not add all the below to (size restriction), so... it's a separate answer.
这里有一种与我之前的答案完全不同的方法——我不能把下面所有的都加到(尺寸限制)中,所以……这是一个单独的答案。
Generalized Suffix Tree: Preprocessing the options
A generalized suffix tree is a structure that in theory allows searching substrings in a set of strings in an efficient way. So I thought I would have a go at it.
广义后缀树是一种结构,理论上允许以有效的方式在一组字符串中搜索子字符串。所以我想我可以试试。
Building such a tree in an efficient way is far from trivial, as can be seen from this awesome explanation of Ukkonen's algorithm, which concerns building a suffix tree for one phrase (option).
用一种有效的方式构建这样的树绝非易事,这可以从Ukkonen算法的这个令人敬畏的解释中看出,它涉及到为一个短语(选项)构建一个后缀树。
I have taken inspiration from the implementation found here, which needed some adaptation to:
我从这里的实现中获得了灵感,需要一些适应:
- Apply a better coding style (e.g. get rid of non-explicitly declared global variables)
- 应用更好的编码风格(例如,去掉非显式声明的全局变量)
- Make it work without needing to add delimiters after texts. This was really tricky, and I hope I did not miss out on some border conditions
- 使它工作,不需要在文本之后添加分隔符。这真的很棘手,我希望我没有错过一些边界条件
- Make it work for multiple strings (i.e. generalized)
- 使它适用于多个字符串(即广义的)
So here it is:
所以在这里:
"use strict";
// Implementation of a Generalized Suffix Tree using Ukkonen's algorithm
// See also: https://*.com/q/9452701/5459839
class Node {
constructor() {
this.edges = {};
this.suffixLink = null;
}
addEdge(ch, textId, start, end, node) {
this.edges[ch] = { textId, start, end, node };
}
}
class Nikkonen extends Node {
constructor() {
super(); // root node of the tree
this.texts = [];
}
findNode(s) {
if (!s.length) return;
let node = this,
len,
suffixSize = 0,
edge;
for (let i = 0; i < s.length; i += len) {
edge = node.edges[s.charAt(i)];
if (!edge) return;
len = Math.min(edge.end - edge.start, s.length - i);
if (this.texts[edge.textId].substr(edge.start, len) !== s.substr(i, len)) return;
node = edge.node;
}
return { edge, len };
}
findAll(term, termId = 1) {
const { edge, len } = this.findNode(term) || {};
if (!edge) return {}; // not found
// Find all leaves
const matches = new Map;
(function recurse({ node, textId, start, end }, suffixLen) {
suffixLen += end - start;
const edges = Object.values(node.edges);
if (!edges.length) { // leaf node: calculate the match
if (!(matches.has(textId))) matches.set(textId, []);
matches.get(textId).push({ offset: end - suffixLen, termId });
return;
}
edges.forEach( edge => recurse(edge, suffixLen) );
})(edge, term.length - len);
return matches;
}
addText(text) {
// Implements Nikkonen's algorithm for building the tree
// Inspired by https://felix-halim.net/misc/suffix-tree/
const root = this,
active = {
node: root,
textId: this.texts.length,
start: 0,
end: 0,
},
texts = this.texts;
// Private functions
function getChar(textId, i) {
return texts[textId].charAt(i) || '$' + textId;
}
function addEdge(fromNode, textId, start, end, node) {
fromNode.addEdge(getChar(textId, start), textId, start, end, node);
}
function testAndSplit() {
const ch = getChar(active.textId, active.end);
if (active.start < active.end) {
const edge = active.node.edges[getChar(active.textId, active.start)],
splitPoint = edge.start + active.end - active.start;
if (ch === getChar(edge.textId, splitPoint)) return;
const newNode = new Node();
addEdge(active.node, edge.textId, edge.start, splitPoint, newNode);
addEdge(newNode, edge.textId, splitPoint, edge.end, edge.node);
return newNode;
}
if (!(ch in active.node.edges)) return active.node;
}
function canonize() {
while (active.start < active.end) {
const edge = active.node.edges[getChar(active.textId, active.start)];
if (edge.end - edge.start > active.end - active.start) break;
active.start += edge.end - edge.start;
active.node = edge.node;
}
}
function update() {
let prevNewNode = root,
newNode;
while (newNode = testAndSplit()) {
addEdge(newNode, active.textId, active.end, text.length+1, new Node());
// Rule 2: add suffix-link from previously inserted node
if (prevNewNode !== root) {
prevNewNode.suffixLink = newNode;
}
prevNewNode = newNode;
// Rule 3: follow suffixLink after split
active.node = active.node.suffixLink || root;
canonize(); // because active.node changed
}
if (prevNewNode !== root) {
prevNewNode.suffixLink = active.node;
}
}
texts.push(text);
if (!root.suffixLink) root.suffixLink = new Node();
for (let i = 0; i < text.length; i++) {
addEdge(root.suffixLink, active.textId, i, i+1, root);
}
// Main Ukkonen loop: add each character from left to right to the tree
while (active.end <= text.length) {
update();
active.end++;
canonize(); // because active.end changed
}
}
}
function trincotSuffixTree(query, options, suffixTree, separator) {
// Split query in terms at delimiter
const terms = query.split(separator).filter(Boolean);
if (!terms.length) return options;
// Sort terms by descending size
terms.sort( (a,b) => b.length - a.length );
// create Map keyed by term with count info
const termMap = new Map(terms.map( (term, termId) => [term, { termId, count: 0, leftOver: 0, size: term.length }] ));
terms.forEach( (term) => termMap.get(term).count++ );
function getNonOverlaps(offsets, leftOver, lastIndex = 0, offsetIndex = 0) {
// All terms found?
if (!leftOver) return [];
let giveUpAt = Infinity;
// While still enough matches left over:
while (offsetIndex + leftOver <= offsets.length) {
const { termId, offset } = offsets[offsetIndex++];
if (offset < lastIndex) continue; // overlap, try next
if (offset >= giveUpAt) break; // Looking further makes no sense
const termInfo = termMap.get(terms[termId]);
//console.log('termId', termId, 'offset', offset, 'size', termInfo.size, 'lastIndex', lastIndex);
if (!termInfo.leftOver) continue; // too many of the same term, try next
termInfo.leftOver--;
const result = getNonOverlaps(offsets, leftOver - 1, offset + termInfo.size, offsetIndex);
// If success, then completely backtrack out of recursion.
if (result) return result.concat([offset + termInfo.size, offset]);
termInfo.leftOver++; // restore after failed recursive search and try next
// If a term-match at a given offset could not lead to a solution (in recursion),
// and if we keep those matched character postions all unmatched and only start matching after
// the end of that location, it will certainly not lead to a solution either.
giveUpAt = Math.min(giveUpAt, offset + termInfo.size);
}
}
let allTermsAllOptionsOffsets;
// Loop through the unique terms:
for (let [term, termInfo] of termMap) {
// Get the offsets of the matches of this term in all options (in the preprocessed tree)
const thisTermAllOptionsOffsets = suffixTree.findAll(term, termInfo.termId);
//console.log('findAll:', JSON.stringify(Array.from(thisTermAllOptionsOffsets)));
if (!thisTermAllOptionsOffsets.size) return []; // No option has this term, so bail out
if (!allTermsAllOptionsOffsets) {
allTermsAllOptionsOffsets = thisTermAllOptionsOffsets;
} else {
// Merge with all previously found offsets for other terms (intersection)
for (let [optionId, offsets] of allTermsAllOptionsOffsets) {
let newOffsets = thisTermAllOptionsOffsets.get(optionId);
if (!newOffsets || newOffsets.length < termInfo.count) {
// this option does not have enough occurrences of this term
allTermsAllOptionsOffsets.delete(optionId);
} else {
allTermsAllOptionsOffsets.set(optionId, offsets.concat(newOffsets));
}
}
if (!allTermsAllOptionsOffsets.size) return []; // No option has all terms, so bail out
}
}
// Per option, see if (and where) the offsets can serve non-overlapping matches for each term
const matches = Array.from(allTermsAllOptionsOffsets, ([optionId, offsets]) => {
// Indicate how many of each term must (still) be matched:
termMap.forEach( obj => obj.leftOver = obj.count );
return [optionId, getNonOverlaps(offsets.sort( (a, b) => a.offset - b.offset ), terms.length)];
})
// Remove options that could not provide non-overlapping offsets
.filter( ([_, offsets]) => offsets )
// Sort the remaining options in their original order
.sort( (a,b) => a[0] - b[1] )
// Replace optionId, by the corresponding text and apply mark-up at the offsets
.map( ([optionId, offsets]) => {
let option = options[optionId];
offsets.map((index, i) => {
option = option.substr(0, index)
+ (i%2 ? "<u>" : "</u>")
+ option.substr(index);
});
return option;
});
//console.log(JSON.stringify(matches));
return matches;
}
function trincotPreprocess(options) {
const nikkonen = new Nikkonen();
// Add all the options (lowercased) to the suffic tree
options.map(option => option.toLowerCase()).forEach(nikkonen.addText.bind(nikkonen));
return nikkonen;
}
const options = ['abbbba', 'United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D\'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', '*', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People\'s Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People\'s Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', '*', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', '*, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'];
/*
* I/O and performance measurements
*/
let preprocessed;
function processInput() {
if (!preprocessed) { // Only first time
const t0 = performance.now();
preprocessed = trincotPreprocess(options);
const spentTime = performance.now() - t0;
// Output the time spent on preprocessing
pretime.textContent = spentTime.toFixed(2);
}
var query = this.value.toLowerCase();
const t0 = performance.now();
const matches = trincotSuffixTree(query, options, preprocessed, ' ');
const spentTime = performance.now() - t0;
// Output the time spent
time.textContent = spentTime.toFixed(2);
// Output the matches
result.innerHTML = '';
for (var match of matches) {
// Append it to the result list
var li = document.createElement('li');
li.innerHTML = match;
result.appendChild(li);
}
}
findTerms.addEventListener('keyup', processInput);
processInput.call(findTerms);
ul {
height:300px;
font-size: smaller;
overflow: auto;
}
Input terms: <input type="text" id="findTerms"><br>
<h3>Trincot's Suffix Tree Search</h3>
Preprocessing Time: <span id="pretime"></span>ms (only done once)<br>
Time: <span id="time"></span>ms<br>
<ul id="result"></ul>
This method has quite some code behind it, so I suppose it might not show interesting performance for small data sets, while for larger data sets it will be memory consuming: the tree takes much more memory than the original options array.
这个方法背后有相当多的代码,所以我认为它可能不会对小数据集显示出有趣的性能,而对于大数据集,它会消耗内存:树比原始的选项数组占用更多的内存。
#5
0
Update 2
Abandoned the concept of shrinking the set due to issues with restoring the working strings in Vue.
由于在Vue中恢复工作字符串的问题,放弃了收缩集合的概念。
Now, the method is simply as follows:
现在,方法简单如下:
- Pre-process the option set to keep the display in sync with the working.
- 预处理选项设置,以保持显示与工作同步。
- Process the terms.
- 处理条件。
- Reduce (filter) the option set by iterating over it, and looping over terms, short-circuiting when there is a mismatch.
- 减少(筛选)选项集,遍历选项集,并对项进行循环,在不匹配时进行短路。
- Using the reduced set, iterating over each option, find the match ranges.
- 使用减少的集合,遍历每个选项,找到匹配范围。
- Insert HTML strings around each match range.
- 在每个匹配范围内插入HTML字符串。
The code is commented.
代码的注释。
Raw javascript (logs the filtered/manipulated options array): https://jsfiddle.net/pvLj9uxe/14/
原始javascript(记录过滤/操作的选项数组):https://jsfiddle.net/pvLj9uxe/14/
New Vue implementation: https://jsfiddle.net/15prcpxn/30/
新Vue实现:https://jsfiddle.net/15prcpxn/30/
The calculation seems reasonably fast - the DOM update is what kills it.
计算速度似乎相当快,因为DOM更新会杀死它。
Added to comparison*: https://jsfiddle.net/ektyx133/4/
添加到比较*:https://jsfiddle.net/ektyx133/4/
*Caveat: Pre-processing the options (treated as "static") is part of the strategy, so it has been processed outside of the benchmark.
*注意:对选项进行预处理(作为“静态”处理)是策略的一部分,因此它在基准之外进行处理。
var separator = /\s|\*|,/;
// this function enhances the raw options array
function enhanceOptions(options) {
return options.map(option => ({
working: option.toLowerCase(), // for use in filtering the set and matching
display: option // for displaying
}))
}
// this function changes the input to lower case, splits the input into terms, removes empty strings from the array, and enhances the terms with the size and wiping string
function processInput(input) {
return input.trim().toLowerCase().split(separator).filter(term => term.length).map(term => ({
value: term.toLowerCase(),
size: term.length,
wipe: " ".repeat(term.length)
})).sort((a, b) => b.size - a.size);
}
// this function filters the data set, then finds the match ranges, and finally returns an array with HTML tags inserted
function filterAndHighlight(terms, enhancedOptions) {
let options = enhancedOptions,
l = terms.length;
// filter the options - consider recursion instead
options = options.filter(option => {
let i = 0,
working = option.working,
term;
while (i < l) {
if (!~working.indexOf((term = terms[i]).value)) return false;
working = working.replace(term.value, term.wipe);
i++;
}
return true;
})
// generate the display string array
let displayOptions = options.map(option => {
let rangeSet = [],
working = option.working,
display = option.display;
// find the match ranges
terms.forEach(term => {
working = working.replace(term.value, (match, offset) => { // duplicate the wipe string replacement from the filter, but grab the offsets
rangeSet.push({
start: offset,
end: offset + term.size
});
return term.wipe;
})
})
// sort the match ranges, last to first
rangeSet.sort((a, b) => b.start - a.start);
// insert the html tags within the string around each match range
rangeSet.forEach(range => {
display = display.slice(0, range.start) + '<u>' + display.slice(range.start, range.end) + '</u>' + display.slice(range.end)
})
return display;
})
return displayOptions;
}
Old Attempt
https://jsfiddle.net/15prcpxn/25/
https://jsfiddle.net/15prcpxn/25/
My attempt, using Vue for rendering (the methods are sequential, so you can probably put it all into one monolithic function without much effort - inputs will be terms, and full option set; outputs will be filtered option set, and highlighted ranges).
我的尝试,使用Vue进行渲染(这些方法是顺序的,所以你可以不费多大力气就把它们都放到一个独立的函数中——输入将是terms,并设置完整的选项;输出将被过滤选项集和突出显示范围)。
- Split the input into individual terms
- 将输入分成单独的项
- Sort terms by length (longest term first, so that when you have an option such as
"abc ab"
, and the terms"a abc"
, i.e. a term is a substring of another term, it will be able to match"abc"
) - 按长度对项进行排序(最长项优先,以便当您有一个选项,如“abc ab”和“a abc”,即一个项是另一个项的子字符串,它将能够匹配“abc”)
- Copy/change terms to lower case
- 抄写/更改条款为小写
- Copy options (our "display set") to lower case (our "working set")
- 复制选项(我们的“显示集”)到小写(我们的“工作集”)
- For each term, remove working options without matches from the "working set", and in parallel remove display options from the "display set" - when doing so, remove the matched term strings from the surviving working option strings, e.g. matching term
"a"
in option"abc"
yields"bc"
[actual implementation is the inverse: For each term, recreate "working set" when there are matches, and in parallel add display options to the "display set", then pass these sets to the next term] - this gives us the filtered display set - 对于每一项,删除工作选择不匹配的“工作集”,和并行删除显示选项“显示设置”——当这样做时,删除从幸存的工作选择字符串匹配项的字符串,如匹配项选择“abc”收益率的“a”“公元前”[实际实现逆:对于每个术语,当有匹配时重新创建“工作集”,并在“显示集”中并行地添加显示选项,然后将这些设置传递给下一个术语]——这将给我们过滤显示集
- Copy the filtered display set to lower case, to give us a new filtered working set
- 将过滤后的显示集复制到小写,以给我们一个新的过滤工作集
- For each working option in the remaining filtered working set, create a range-set, by recording the ranges (i.e. start and end, e.g. matching term
"a"
in option"abc"
:start = 0, end = 1
) where each term matches by taking the offset (start) of the match and the length of the term/match. Replace the matched string with empty spaces (or other unused character) of equal length to the term and feed this into the next term, e.g. matching term"a"
in option"abc"
yields" bc"
- this preserves the length of the working option, ensuring that the filtered working set (in lower case) matches the filtered display set (original case). The total number of range-sets will be equal to the number of remaining options in the filtered option set. - 为每个工作选择剩下的过滤工作集,创建一个值域范围,通过记录的范围(即开始和结束,如匹配项选项“abc”:“一”= 0开始,结束= 1),每一项比赛通过比赛的抵消(start)和词的长度/匹配。匹配的字符串替换为空的空间(或其他未使用的字符)同等长度的术语和饲料这进入下一项,如匹配项选择“abc”收益率的“a”“公元前”——这个保留的长度选择工作,确保过滤工作集(小写)匹配过滤显示设置(原)。距离集的总数将等于过滤选项集中的剩余选项的数目。
- Also, sort the ranges within each range-set (in descending order, to work in reverse), to allow for string insertion.
- 此外,对每个范围集中的范围进行排序(按降序排列,以便反向工作),以允许字符串插入。
- For each option in the filtered display set, (working in reverse so as not to disturb the index when manipulating the string), insert the
<u>
</u>
tags around the matched ranges by slicing up the display option, e.g. matching term"a"
in option"abc"
:new option = "<u>" + "a" + "</u>" + "bc"
- 过滤显示每个选项的设置,(工作在反向,以免打扰索引时操纵字符串),插入 <你> < / u >标签的匹配范围通过切片显示选项,如匹配术语“选项“abc”:“新选项= < u >”+“a”+“< / u >”+“公元前”
- Render it
- 使它
Performance is poor when there are many matches / non-useful terms (e.g. when you input a single character). For end-use, I'd probably put in an input-calculation delay.
当有许多匹配项/无用项时(例如,当您输入单个字符时),性能会很差。对于最终用途,我可能会加入一个输入计算延迟。
I should be able to roll up some of these steps into fewer steps which may improve performance. I'll revisit tomorrow.
我应该能够将其中的一些步骤简化为更少的步骤,从而提高性能。我明天会重新审视。
Vue presumably also takes care of some of the optimisations through virtual DOM etc. so it won't necessarily be reflective of vanilla Javascript / DOM rendering.
Vue可能还会通过虚拟DOM等来处理一些优化,所以它不一定是普通Javascript / DOM渲染的反映。