Perl - Regex to extract only comma-separated strings

I have a question I am hoping someone could help with...

I have a variable that contains the content from a webpage (scraped using WWW::Mechanize).

The variable contains data such as these:

$var = "ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig"
$var = "fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf"
$var = "dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew"

The only bits I am interested in from the above examples are:

@array = ("cat_dog","horse","rabbit","chicken-pig")
@array = ("elephant","MOUSE_RAT","spider","lion-tiger") 
@array = ("ANTELOPE-GIRAFFE","frOG","fish","crab","kangaROO-KOALA")

The problem I am having:

I am trying to extract only the comma-separated strings from the variables and then store these in an array for use later on.

But what is the best way to make sure that I get the strings at the start (i.e. cat_dog) and end (i.e. chicken-pig) of the comma-separated list of animals, given that they are not prefixed/suffixed with a comma?

Also, as the variables will contain webpage content, there will inevitably be instances where a comma is immediately followed by a space and then another word, as that is the correct way to use commas in paragraphs and sentences...

For example:

Saturn was long thought to be the only ringed planet, however, this is now known not to be the case. 
                                                     ^        ^
                                                     |        |
                                    note the spaces here and here

I am not interested in any cases where the comma is followed by a space (as shown above).

I am only interested in cases where the comma DOES NOT have a space after it (ie cat_dog,horse,rabbit,chicken-pig)

I have tried a number of ways of doing this but cannot work out the best way to construct the regular expression.

4 Solutions

#1

How about

[^,\s]+(,[^,\s]+)+

which will match one or more characters that are not a space or comma [^,\s]+ followed by a comma and one or more characters that are not a space or comma, one or more times.

Further to comments

To match more than one sequence add the g modifier for global matching.
The following splits each match $& on a , and pushes the results to @matches.

my $str = "sdfds cat_dog,horse,rabbit,chicken-pig then some more pig,duck,goose";
my @matches;

while ($str =~ /[^,\s]+(,[^,\s]+)+/g) {
    push(@matches, split(/,/, $&));
}   

print join("\n",@matches),"\n";
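
As a rough check against the strings from the question, the same loop can be wrapped in a helper. The sub name extract_lists is made up for this sketch, and an explicit capture group stands in for $& (a stylistic choice here, not something the answer above prescribes):

```perl
use strict;
use warnings;

# extract_lists is a hypothetical helper name, not part of the original answer.
# It returns every item of every comma-separated run found in the string.
sub extract_lists {
    my ($str) = @_;
    my @items;
    # A run is: item, then one or more ",item" groups, with no whitespace inside.
    while ($str =~ /([^,\s]+(?:,[^,\s]+)+)/g) {
        push @items, split /,/, $1;
    }
    return @items;
}

my @animals = extract_lists("ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig");
print join("|", @animals), "\n";

# A comma followed by a space never completes a run, so plain prose yields nothing.
my @prose = extract_lists("Saturn was long thought to be the only ringed planet, however, this is now known not to be the case.");
print scalar(@prose), "\n";
```

The first print should list the four animals and the second should report zero items, since "planet, " and "however, " are each followed by a space.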

#2

Though you can probably construct a single regex, a combination of regexes, split, grep and map looks decent:

my @array = map { split /,/ } grep { !/^,/ && !/,$/ && /,/ } split ' ', $var;

Going from right to left:

Split the line on spaces (split)

Keep only the elements that have no comma at either end but at least one inside (grep)

Split each such element into parts (map and split)

That way you can easily adjust the individual steps; e.g., to reject tokens containing two consecutive commas, add && !/,,/ inside the grep.
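
Put together, the three steps might look like the sketch below. The sub name comma_items is hypothetical, and the optional !/,,/ test is included only to show where such an extra condition would go:

```perl
use strict;
use warnings;

# comma_items is a made-up name for this sketch; read the pipeline bottom-up.
sub comma_items {
    my ($text) = @_;
    return
        map  { split /,/ }                        # 3. break each surviving token into items
        grep { !/^,/ && !/,$/ && !/,,/ && /,/ }   # 2. keep tokens with inner commas only
        split ' ', $text;                         # 1. split the text on whitespace
}

my $var   = "fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf";
my @array = comma_items($var);

print join("\n", @array), "\n";
```

Because prose tokens such as "planet," end in a comma, the grep step drops them before anything is split.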

#3

I hope this is clear and suits your needs:

    #!/usr/bin/perl
    use warnings;
    use strict;

    my @strs = ("ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig",
        "fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf",
        "dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew",
        "Saturn was long thought to be the only ringed planet, however, this is now known not to be the case.",
        "Another sentence, although having commas, should not confuse the regex with this: a,b,c,d");

    my $regex = qr/
                \s #From your examples, it seems as if every
                   #comma separated list is preceded by a space.
                (
                    (?:
                        [^,\s]+ #Now, not a comma or a space for the
                                 #terms of the list

                        ,        #followed by a comma
                    )+
                    [^,\s]+     #followed by one last term of the list
                )
                /x;

    my @matches = map {
                    # Test the match itself: $1 is only reliable
                    # after a successful match.
                    if ($_ =~ /$regex/) {
                        [split /,/, $1];
                    }
                    else {
                        [];
                    }
                } @strs;
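
The pattern above assumes the list is preceded by a space, so a list at the very start of a string would be missed. One possible relaxation, offered only as a sketch and not as part of the answer, swaps the leading \s for the lookbehind (?<!\S), which succeeds after whitespace and also at the start of the string:

```perl
use strict;
use warnings;

# (?<!\S) means "not preceded by a non-space character", so it succeeds both
# after whitespace and at the very start of the string.
my $regex = qr/
    (?<!\S)
    (
        (?: [^,\s]+ , )+    # one or more "term," pairs
        [^,\s]+             # the final term
    )
/x;

# The first sample string is invented for this sketch.
my @strs = (
    "cat_dog,horse sits at the start",
    "Another sentence, although having commas, should not confuse the regex with this: a,b,c,d",
);

for my $str (@strs) {
    # Split the captured list only when the match succeeded.
    my @terms = $str =~ $regex ? split(/,/, $1) : ();
    print scalar(@terms), ": @terms\n";
}
```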

#4

my @arr;
$var =~ tr/ //s;    # squeeze runs of spaces down to a single space
while ($var =~ /(?<!, )\b[^, ]+(?=,\S)|(?<=,)[^, ]+(?=,)|(?<=\S,)[^, ]+\b(?! ,)/g) {
    push(@arr, $&);
}

The regular expression matches three cases:

(?<!, )\b[^, ]+(?=,\S) : matches cat_dog
(?<=,)[^, ]+(?=,)      : matches horse & rabbit
(?<=\S,)[^, ]+\b(?! ,) : matches chicken-pig
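
To see the three alternatives working together, the loop can be wrapped in a small sub and run over strings from the question. The name list_items is hypothetical; the regex is copied unchanged from above:

```perl
use strict;
use warnings;

# list_items is a hypothetical wrapper around the loop shown above.
sub list_items {
    my ($var) = @_;
    my @arr;
    $var =~ tr/ //s;    # squeeze runs of spaces down to a single space
    while ($var =~ /(?<!, )\b[^, ]+(?=,\S)|(?<=,)[^, ]+(?=,)|(?<=\S,)[^, ]+\b(?! ,)/g) {
        push @arr, $&;
    }
    return @arr;
}

for my $var (
    "ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig",
    "dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew",
    "Saturn was long thought to be the only ringed planet, however, this is now known not to be the case.",
) {
    my @found = list_items($var);
    print scalar(@found), " item(s): @found\n";
}
```

The Saturn sentence should report zero items, because each of its commas is followed by a space and so fails every lookaround.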
