JavaScript regex to match a word within each sentence

Date: 2021-09-20 04:40:57

What should be the regex for matching a specific word in every sentence in JavaScript?

The rule for delimiting a sentence is clear: it ends with a period (.), and the next letter is uppercase.

What I need to achieve, though, is to match a word within each sentence. I suppose I should use groups. Or should I put the word itself inside the regex?

Here is my Java regex for looping over the sentences: [link]

Here is my Java regex for matching a word in a -5/+5 word context: [link]. However, I need a combination of both in JavaScript.

My goal:

Input:

Cliffs have collapsed in New Zealand during an earthquake in the city of Christchurch on the South Island. No serious damage or fatalities were reported in the Valentine's Day quake that struck at 13:13 local time. Based on the med. report everybody were ok.

Output for chosen word "on":

  1. Cliffs have collapsed in New Zealand during an earthquake in the city of Christchurch on the South Island
  2. Based on the med. report everybody were ok.

1 Answer

#1


Update: I provide two solutions below. My original answer only provided the first.

  1. One solution uses a single regex to parse the entire original paragraph. It can be done but, as described below, it may not be the best solution.

  2. An alternative solution uses a more involved algorithm but lighter regexes. It splits the text into sentences and works on each sentence separately. This solution is much more efficient and, might I say, more elegant.

Solution 1: Single Regex

Run the first code snippet below to demo this solution. It finds all sentences (as you defined them) that contain any keyword you want. The complete regex is...

\. +([A-Z]([^.]|\.(?! +[A-Z]))*?" + keyword + "([^.]|\.(?! +[A-Z]))*?\.(?= +[A-Z]))

...but the code breaks it down into much more understandable pieces.

Once you click the 'Run code snippet' button, it takes a few seconds to run.

This is a fairly regex-heavy solution, and it can be slow: with the example paragraph you provided, the routine became intolerably slow. Despite that, it is still not thorough enough, because it cannot tell when the keyword is embedded in another word (e.g. a search for "cats" also finds "catsup"). Avoiding that kind of embedding is possible, but it made the whole thing too slow to even demonstrate.

var text = "I like cats. I really like cats. I also like dogs. Dogs and cats are pets. Approx. half of pets are cats. Approx. half of pets are dogs. Some cats are v. expensive.";

var keyword = "cats";

// Note: inside a JavaScript string literal the backslash must itself be
// escaped, so the regex token \. (a literal period) is written "\\." here.
// A single backslash would leave a bare "." that matches any character.
var reStr =
  "\\. +"                   + // a preceding sentence-ender, i.e. a period
                              //   followed by one or more spaces
  "("                       + // begin remembering the match (i.e. arr[1] below)
    "[A-Z]"                 + // a sentence-starter, i.e. an uppercase letter
    "("                     + // start of a sentence-continuer, which is either
      "[^.]"                + // anything but a period
      "|"                   + // or
      "\\.(?! +[A-Z])"      + // a period not followed by one or more spaces
                              //   and an uppercase letter
    ")"                     + // end of a sentence-continuer
    "*?"                    + // zero or more of the preceding sentence-continuers
                              //   but as few as possible
    keyword                 + // the keyword being sought
    "([^.]|\\.(?! +[A-Z]))" + // a sentence-continuer, as described above
    "*?"                    + // zero or more of them but as few as possible
    "\\."                   + // a sentence-ender, i.e. a period
    "(?= +[A-Z])"           + // followed by one or more spaces and an
                              //   uppercase letter, which is not remembered
  ")";                        // finish remembering the match

// With keyword "cats", that ends up being the following regex:
// \. +([A-Z]([^.]|\.(?! +[A-Z]))*?cats([^.]|\.(?! +[A-Z]))*?\.(?= +[A-Z]))


var re = new RegExp(reStr, "g"); // construct the regular expression

var sentencesWithKeyword = []; // initialize an array to keep the hits
var arr; // prepare an array to temporarily keep 'exec' return values
var expandedText = ". " + text + " A";
// add a sentence-ender (i.e. a period) before the text
//   and a sentence-starter (i.e. an uppercase letter) after the text
//   to facilitate finding the first and last sentences

while ((arr = re.exec(expandedText)) !== null) { // while hits are found
  sentencesWithKeyword.push(arr[1]); // remember the sentence found
  re.lastIndex -= 2; // start the next search two characters back
                     //   to allow for starting the next match
                     //   with the period that ended the current match
}

// show the results
show("Text to search:");
show(text);
show("Query string: " + keyword);
show("Hits:");
for (var num = 0; num < sentencesWithKeyword.length; num += 1) {
  show((num + 1) + ". " + sentencesWithKeyword[num]);
}

// write one paragraph to the page (browser only)
function show(msg) {
  document.write("<p>" + msg + "</p>");
}
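
The embedded-word limitation described for Solution 1 (a search for "cats" also hitting "catsup") can usually be avoided by wrapping the keyword in word boundaries. The following is only a minimal sketch, not part of the original answer: the helper names `escapeRegExp` and `wholeWordRegex` are hypothetical, and it assumes the keyword starts and ends with a letter, digit or underscore, since `\b` only recognizes those as word characters.

```javascript
// Hypothetical helper: escape regex metacharacters so that the keyword
// is always treated as literal text when it is spliced into a pattern.
function escapeRegExp(str) {
  return str.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: build a case-insensitive matcher that only accepts
// the keyword as a whole word. "\\b" in the string becomes \b in the regex.
function wholeWordRegex(keyword) {
  return new RegExp("\\b" + escapeRegExp(keyword) + "\\b", "i");
}

var catsRegex = wholeWordRegex("cats");
console.log(catsRegex.test("Dogs and cats are pets.")); // true
console.log(catsRegex.test("Cats are great."));         // true
console.log(catsRegex.test("Catsup is tasty."));        // false
```

The same two pieces (escaping the keyword and adding `\b` on both sides) could be spliced into `reStr` above in place of the bare `keyword`.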

Solution 2: Divide and Conquer

Here, you do the following:

  • split the original text into an array of sentence elements

  • search each sentence for the keyword

  • keep those that have the keyword, discard those that don't

That way, no single regex has to split the text into sentences, search for the keyword, keep the hits, and discard the non-hits all at once.

var textToSearch = "I like cats. I really like cats. I also like dogs. Cats are great.  Catsup is tasty. Dogs and cats are pets. Approx. half of pets are cats. Approx. half of pets are dogs. Some cats are v. expensive.";

var keyword = "cats";

var sentences = {
  all           : [],
  withKeyword   : [],
  withNoKeyword : []
};

// A sentence boundary is a period followed by one or more spaces and an
// uppercase letter. Replace those spaces with a marker, then split on it.
var sentenceRegex = new RegExp("([.]) +([A-Z])", "g");
var sentenceSeparator = "__SENTENCE SEPARATOR__";
var modifiedText = textToSearch.replace(sentenceRegex, "$1" + sentenceSeparator + "$2");
sentences.all = modifiedText.split(sentenceSeparator);

// The keyword must be preceded by the start of the sentence or by spaces,
// and followed by spaces or a period, so "Catsup" does not count as "cats".
// The "i" flag makes the match case-insensitive. Built once, outside the loop.
var keywordRegex = new RegExp("(^| +)" + keyword + "( +|[.])", "i");

sentences.all.forEach(function(sentence) {
  if (keywordRegex.test(sentence)) {
    sentences.withKeyword.push(sentence);
  } else {
    sentences.withNoKeyword.push(sentence);
  }
});

// show the results (browser only)
document.write("<pre>" + JSON.stringify(sentences, null, 2) + "</pre>");
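
For reference, the same split, search and keep steps can be packed into one small function and tried on the paragraph from the question. This is only a sketch under a few assumptions: the names `escapeRegExp` and `sentencesContaining` are hypothetical, the split relies on a lookbehind (`(?<=...)`, which needs an ES2018-capable engine), and `\b` is used for whole-word matching instead of the space/period test above.

```javascript
// Hypothetical helper: escape regex metacharacters in the keyword.
function escapeRegExp(str) {
  return str.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: return the sentences of `text` that contain
// `keyword` as a whole word (case-insensitive).
function sentencesContaining(text, keyword) {
  // Split on the spaces that sit between a period and an uppercase letter.
  // The lookbehind keeps the period with the preceding sentence and the
  // lookahead keeps the uppercase letter with the following one.
  var sentenceList = text.split(/(?<=\.) +(?=[A-Z])/);

  var keywordRegex = new RegExp("\\b" + escapeRegExp(keyword) + "\\b", "i");

  // Keep the sentences that contain the keyword, discard the rest.
  return sentenceList.filter(function (sentence) {
    return keywordRegex.test(sentence);
  });
}

var paragraph = "Cliffs have collapsed in New Zealand during an earthquake in the city of Christchurch on the South Island. No serious damage or fatalities were reported in the Valentine's Day quake that struck at 13:13 local time. Based on the med. report everybody were ok.";

var hits = sentencesContaining(paragraph, "on");
hits.forEach(function (sentence, index) {
  console.log((index + 1) + ". " + sentence);
});
// 1. Cliffs have collapsed in New Zealand during an earthquake in the city of Christchurch on the South Island.
// 2. Based on the med. report everybody were ok.
```

Because "med." is followed by a lowercase letter, it is not treated as a sentence boundary, which matches the rule stated in the question.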
