Regexp: extracting a set of words from a string

Time: 2022-09-13 09:23:00

Problem Statement

We need to extract runs of sequentially occurring words, all taken from a given set, from a string.

The simplest example, with its input and expected output, is shown below.

set of words => "word1|word2|word3";

Input string => "i m word1 word2 and this is word3 word2 word1+ i am having this word2 word3.";

Output => word1 word2
          word3 word2 word1
          word2 word3

Note: there is no space between the word and the following character in "word1+" and "word3."

Please treat this as the simplest possible input. The complexity can grow to any extent: the set may contain many words (say, 500), and we need to find the runs of those words that occur together in the input string.

I am doing this in JavaScript, so this is what I tried:

var pattern = "word1|word2|word3";
var regobj = new RegExp('((('+pattern+')\\s?)+)', "g");

What is the problem with my solution?

For the input string => "i m word1word2 and this is word3word2 word1+ i am having this word2 word3.";

it gives this output:
word1word2        -- wrong
word3word2 word1  -- wrong 
word2 word3
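
The behaviour described above can be reproduced with a few lines. This is only a sketch of the pattern from the question; the helper name `runOriginalPattern` is made up for illustration. It suggests the cause: `\s?` makes the whitespace optional and nothing anchors the words at word edges, so glued words such as "word1word2" are accepted (note that a trailing space can also end up in a match).

```javascript
// Reproduces the pattern from the question on the problematic input.
// runOriginalPattern is a hypothetical helper name used only here.
function runOriginalPattern(input) {
  var pattern = "word1|word2|word3";
  var regobj = new RegExp('(((' + pattern + ')\\s?)+)', "g");
  // With the "g" flag, match() returns the full matches as a plain array.
  return input.match(regobj) || [];
}

var problematic = "i m word1word2 and this is word3word2 word1+ i am having this word2 word3.";
console.log(runOriginalPattern(problematic));
// → [ 'word1word2 ', 'word3word2 word1', 'word2 word3' ]
```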

Why do I want this? Here is the real-world use case.

I want to extract numbers written as words from a complex expression, say:

"one thousand two+three hundred four+1.3456+log(twenty)"

Here I need to extract

one thousand two
three hundred four
twenty

and replace each with its numerical equivalent.

3 solutions

#1


3  

Use word boundaries:

\b(?:word1|word2|word3)\b

The complete regex in action in Perl:

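use Data::Dump qw(dump);  # assumption: dump() is the one from Data::Dump

my $str = 'i m word1word2 and this is word3 word2 word1+ i am having this word2 word3.';
# Each word must sit between word boundaries and be followed by whitespace or a dot.
my @l = ($str =~ /((?:\b(?:word1|word2|word3)\b(?:\s|\.))+)/g);
dump @l;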

Output:

("word3 word2 ", "word2 word3.")
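
Since the question is about JavaScript, the word-boundary idea above might be carried over roughly as follows. This is a sketch, not a port of the Perl answer: the function names `escapeRegExp`, `buildRunRegex`, and `extractRuns` are invented here, and the pattern deliberately differs in that words in a run must be separated by whitespace and no trailing separator is captured, so "word1" in "word1+" still counts. It assumes every word in the set starts and ends with a word character, otherwise `\b` will not behave as intended.

```javascript
// Escape regex metacharacters so arbitrary words can be placed in a pattern.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// One listed word between word boundaries, followed by any number of
// whitespace-separated listed words. \b rejects glued forms like "word1word2".
function buildRunRegex(words) {
  var alt = words.map(escapeRegExp).join("|");
  var one = "\\b(?:" + alt + ")\\b";
  return new RegExp(one + "(?:\\s+" + one + ")*", "g");
}

// Returns every run of listed words found in the input, in order.
function extractRuns(input, words) {
  return input.match(buildRunRegex(words)) || [];
}

var wordSet = ["word1", "word2", "word3"];

console.log(extractRuns(
  "i m word1 word2 and this is word3 word2 word1+ i am having this word2 word3.",
  wordSet));
// → [ 'word1 word2', 'word3 word2 word1', 'word2 word3' ]

console.log(extractRuns(
  "i m word1word2 and this is word3word2 word1+ i am having this word2 word3.",
  wordSet));
// → [ 'word1', 'word2 word3' ]
```

A single listed word standing alone is reported as a run of length one, which is why "word1" appears in the second result.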

With the expression from the end of the question:

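use Data::Dump qw(dump);  # assumption: dump() is the one from Data::Dump

my $str = 'one thousand two+three hundred four+1.3456+log(twenty)';
# Runs of number words, each between word boundaries, separated by optional whitespace.
my @l = ($str =~ /((?:\b(?:one|two|three|four|twenty|hundred|thousand)\b\s*)+)/g);
dump @l;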

Output:

("one thousand two", "three hundred four", "twenty")

#2


0  

For the second part of your problem, you could use Lingua::EN::Words2Nums:

#!/usr/bin/perl
use strict;
use warnings;
use Lingua::EN::Words2Nums;

my $string = "one thousand two+three hundred four+1.3456+log(twenty)";
my $re = qr(one|thousand|two|three|hundred|four|twenty);
my @groups = split(m/\+/,$string);
for my $group (@groups) {
    my @words = ($group =~ m/\b$re\b/g);
    next unless @words;
    my $number = words2nums("@words");
    print "@words => $number\n";
}

Output:

one thousand two => 1002
three hundred four => 304
twenty => 20
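
For the JavaScript side of the question, the extract-and-replace step shown above could be sketched like this. Nothing here comes from Lingua::EN::Words2Nums: `wordsToNumber` and `replaceNumberWords` are hypothetical helpers, the vocabulary is a small hand-written table, and only lowercase English number words up to "million" are assumed.

```javascript
// Hand-written vocabulary (an assumption for this sketch, not a library table).
const SMALL = new Map(Object.entries({
  zero: 0, one: 1, two: 2, three: 3, four: 4, five: 5, six: 6, seven: 7,
  eight: 8, nine: 9, ten: 10, eleven: 11, twelve: 12, thirteen: 13,
  fourteen: 14, fifteen: 15, sixteen: 16, seventeen: 17, eighteen: 18,
  nineteen: 19, twenty: 20, thirty: 30, forty: 40, fifty: 50, sixty: 60,
  seventy: 70, eighty: 80, ninety: 90
}));
const SCALES = new Map(Object.entries({ thousand: 1000, million: 1000000 }));

// Converts a whitespace-separated run such as "three hundred four" to 304.
function wordsToNumber(phrase) {
  let total = 0;   // finished thousand/million groups
  let current = 0; // group still being accumulated
  for (const word of phrase.trim().split(/\s+/)) {
    if (SMALL.has(word)) {
      current += SMALL.get(word);
    } else if (word === "hundred") {
      current = (current || 1) * 100;
    } else if (SCALES.has(word)) {
      total += (current || 1) * SCALES.get(word);
      current = 0;
    } else {
      throw new Error("unknown number word: " + word);
    }
  }
  return total + current;
}

// Regex for runs of number words: whole words only, separated by whitespace.
const NUMBER_ALT = [...SMALL.keys(), "hundred", ...SCALES.keys()].join("|");
const NUMBER_RUN = new RegExp(
  "\\b(?:" + NUMBER_ALT + ")\\b(?:\\s+\\b(?:" + NUMBER_ALT + ")\\b)*", "g");

// Replaces every run of number words with its numeric value.
function replaceNumberWords(expression) {
  return expression.replace(NUMBER_RUN, (run) => String(wordsToNumber(run)));
}

console.log(replaceNumberWords("one thousand two+three hundred four+1.3456+log(twenty)"));
// → 1002+304+1.3456+log(20)
```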

#3


0  

In Perl you can use split and grep for this:

perl -e '$w="word1|word2|word3"; while(<>){ print join " ", grep { /^(?:$w)$/ } split /\W/, $_ }'
i m word1 word2 and this is word3 word2 word1+ i am having this word2 word3.
word1 word2 word3 word2 word1 word2 word3

The same approach in JavaScript:

var input="i m word1 word2 and this is word3 word2 word1+ i am having this word2 word3.";
var r=new RegExp("^(word1|word2|word3)$");
var wr=new RegExp("\\W");
var out = new Array();
var split = input.split(wr);
for( var i=0; i < split.length; i++) {
  if( split[i].match( r ) ){
    out.push(split[i]);
  }
} 
console.log(out);

The output:

["word1", "word2", "word3", "word2", "word1", "word2", "word3"]
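
The split-based result above is a flat list, so the grouping into runs that the question asks for is lost. One possible way to keep it, sketched under assumptions: split with a capturing group so the separators are retained, look words up in a `Set` (which may suit a large word list such as the 500 words mentioned in the question), and continue a run only when the separator before a word is pure whitespace. The name `groupRuns` is made up for this example.

```javascript
// Groups consecutive listed words into runs, using split plus a Set lookup.
function groupRuns(input, wordList) {
  const wanted = new Set(wordList);
  // The capturing group keeps separators: even indexes hold tokens,
  // odd indexes hold the non-word text between them.
  const parts = input.split(/(\W+)/);
  const runs = [];
  let current = [];
  for (let i = 0; i < parts.length; i += 2) {
    const token = parts[i];
    const separatorBefore = i > 0 ? parts[i - 1] : "";
    const joinedByWhitespace = /^\s+$/.test(separatorBefore);
    // Close the current run on an unlisted token or a non-whitespace separator.
    if (!wanted.has(token) || !joinedByWhitespace) {
      if (current.length > 0) runs.push(current.join(" "));
      current = [];
    }
    if (wanted.has(token)) current.push(token);
  }
  if (current.length > 0) runs.push(current.join(" "));
  return runs;
}

console.log(groupRuns(
  "i m word1 word2 and this is word3 word2 word1+ i am having this word2 word3.",
  ["word1", "word2", "word3"]));
// → [ 'word1 word2', 'word3 word2 word1', 'word2 word3' ]
```

Because a "+" between two listed words ends the run, "one thousand two+three hundred four" would come out as two separate runs.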
