Find the common words that appear in two string values

Suppose I have two strings that look like this:

var tester = "hello I have to ask you a doubt";
var testCase = "hello better explain me the doubt"; // called case in the text; case is a reserved word in JavaScript

In this case both strings contain the common words hello and doubt. Say my default string is tester, and I have a variable case that holds a set of words that can be anything. I want to count the words that are present in both tester and case, and get the result in the form of an object.

Result

{"hello" : 1, "doubt" : 1};

My current implementation is shown below:

var tester = "hello I have to ask you a doubt";
// `case` is a reserved word in JavaScript, so the parameter is named testCase here.
function getMeRepeatedWordsDetails(testCase){
    var defaultWords = tester.split(" ");
    var testWords    = testCase.split(" "), result = {};
    // for...of iterates over the words; for...in would iterate over the array indices.
    for(var testWord of testWords){
        for(var defaultWord of defaultWords){
            if(defaultWord == testWord){
                result[testWord] = (!result[testWord]) ? 1 : (result[testWord] + 1);
            }
        }
    }
    return result;
}

I suspect a regex could make this task easier, since it can find pattern matches, but I'm not sure this can be achieved with a regex. I need to know whether I'm on the right path.

1 solution

#1

You can use a first regular expression as a tokenizer to split the tester string into a list of words, then use those words to build a second regular expression that matches any word in the list. For example:

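var tester = "a string with a lot of words";

function getMeRepeatedWordsDetails ( sentence ) {
  // Append a space so the trailing \W can match after the last word.
  sentence = sentence + " ";
  // Tokenizer: runs of non-whitespace characters.
  var regex = /[^\s]+/g;
  // Alternation of the words in tester, each followed by a non-word character.
  var regex2 = new RegExp ( "(" + tester.match ( regex ).join ( "|" ) + ")\\W", "g" );
  // match() returns null when nothing matches, so fall back to an empty array.
  var matches = sentence.match ( regex2 ) || [];
  var words = {};
  for ( var i = 0; i < matches.length; i++ ) {
    // Strip the non-word character captured by the trailing \W.
    var match = matches [ i ].replace ( /\W/g, "" );
    var w = words [ match ];
    if ( ! w )
      words [ match ] = 1;
    else
      words [ match ]++;
  }
  return words;
}

console.log ( getMeRepeatedWordsDetails ( "another string with some words" ) );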

The tokenizer is the line:

var regex = /[^\s]+/g;

When you do:

tester.match ( regex )

you get the list of words contained in tester:

[ "a", "string", "with", "a", "lot", "of", "words" ]
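
As a quick check, here is a minimal sketch of the tokenizer on its own. The second and third inputs and the name tokens are made up for illustration. It shows that the pattern splits on whitespace only, so punctuation stays attached to a word, and that match returns null rather than an empty array when there is nothing to tokenize.

```javascript
// Tokenizer from the answer: one or more non-whitespace characters.
var regex = /[^\s]+/g;

var tester = "a string with a lot of words";
var tokens = tester.match(regex);
console.log(tokens); // ["a", "string", "with", "a", "lot", "of", "words"]

// Hypothetical input: punctuation is not whitespace, so it stays in the token.
var punctuated = "hello, world!".match(regex);
console.log(punctuated); // ["hello,", "world!"]

// Hypothetical input: no tokens at all, so match() returns null.
var empty = "   ".match(regex);
console.log(empty); // null
```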

With such an array we build a second regular expression that matches all the words; regex2 has the form:

/(a|string|with|a|lot|of|words)\W/g

The \W is added so that a match must end at the end of a word; otherwise the a alternative would also match the a at the start of a longer word such as another. It does not anchor the start of the match, so a still matches the final letter of a word such as extra. The result of applying regex2 to sentence is another array containing only the words matched by regex2, that is, the words contained in both tester and sentence. The for loop then counts the words in the matches array, turning it into the object you requested.
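
To make the effect of the trailing \W concrete, here is a small sketch that builds the pattern both ways and applies it to a made-up sentence. The names alternation, withoutW, withW, and sample are illustrative and not part of the original answer.

```javascript
var tester = "a string with a lot of words";
var alternation = "(" + tester.match(/[^\s]+/g).join("|") + ")";

var withoutW = new RegExp(alternation, "g");      // /(a|string|with|a|lot|of|words)/g
var withW = new RegExp(alternation + "\\W", "g"); // /(a|string|with|a|lot|of|words)\W/g

// Hypothetical sentence; the trailing space lets \W match after the last word.
var sample = "another lot of strings extra ";

// Without \W, fragments of longer words match: the "a" in "another",
// the "string" in "strings", and the "a" that ends "extra".
var loose = sample.match(withoutW);
console.log(loose); // ["a", "lot", "of", "string", "a"]

// With \W, a match must be followed by a non-word character, so "another"
// and "strings" drop out, but the "a" that ends "extra" is still matched.
var strict = sample.match(withW);
console.log(strict); // ["lot ", "of ", "a "]
```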

But beware that:

you have to put at least one space at the end of sentence, otherwise the \W in regex2 doesn't match after the last word: sentence = sentence + " "
you have to remove the extra character that the \W captured from each match: match = matches [ i ].replace ( /\W/g, "" )
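
Both caveats come from using \W as the word delimiter. One possible variation, sketched here under assumptions that go beyond the original answer, uses the zero-width \b word boundary on both sides of the alternation and escapes regex metacharacters in the words. The function name countCommonWords and the helper escapeRegExp are hypothetical, and \b only behaves as expected for words made of ASCII letters, digits, and underscores.

```javascript
// Hypothetical helper: escape characters that have a special meaning in a regex.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical variation: count the words of `sentence` that also occur in `reference`.
function countCommonWords(reference, sentence) {
  var tokens = reference.match(/[^\s]+/g) || [];
  var counts = {};
  if (tokens.length === 0) return counts;

  // \b is zero-width, so no padding space and no cleanup of the match is needed.
  var pattern = new RegExp("\\b(?:" + tokens.map(escapeRegExp).join("|") + ")\\b", "g");
  var matches = sentence.match(pattern) || [];
  for (var i = 0; i < matches.length; i++) {
    var word = matches[i];
    // Own-property check so words like "constructor" are counted correctly.
    var seen = Object.prototype.hasOwnProperty.call(counts, word) ? counts[word] : 0;
    counts[word] = seen + 1;
  }
  return counts;
}

var common = countCommonWords("hello I have to ask you a doubt", "hello better explain me the doubt");
console.log(common); // { hello: 1, doubt: 1 }
```

With the strings from the question this yields the requested object, and a word such as extra no longer contributes a match for a, because \b also requires a boundary before the word.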
