How to exclude specific text in a Perl regexp match

I am using perl to parse a large report file. I pull out names by looking for last name and first name at the beginning of some lines of the report. I am trying to exclude text following the name. Some of these text fields are numbers, thus easy -- I just look for non-digit characters. But some are fixed text fields that I can list.

E.g. ---

LastNameA, FirstNameA
LastNameB, FirstNameB 345C
LastNameC, FirstNameC BADTEXT
LastNameD, FirstNameD MOREBADTEXT

I have tried the following

/^(\D*)((BADTEXT|MOREBADTEXT|))/
/^(\D*)(BADTEXT|MOREBADTEXT|)/
/^(\D*?)((BADTEXT|MOREBADTEXT|))/
/^(\D*)((BADTEXT|MOREBADTEXT)?)/
/^(\D*)(?:(BADTEXT|MOREBADTEXT|))/

and several other combinations. But I get either no match or a match with BADTEXT or MOREBADTEXT sucked into $1 instead of $2. I either want the bad text in $2 or not matched at all.

Note that the text I don't want appended to the name will be one of a very small list of known text strings, so I can add them to the conditional group.

I have read through perlretut twice but can't find how to do this. Seems like it should be simple! Any help is much appreciated.

1 solution

#1

How about splitting the text on whitespace and only keeping the parts you like?

#!/usr/bin/perl

use strict;
use warnings;

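while (my $line = <DATA>) {
    # Split on whitespace, then drop any word that contains a digit
    # or is exactly one of the known unwanted strings.
    my @name = grep { !/\d|^BADTEXT$|^MOREBADTEXT$/ } split /\s+/, $line;
    print "@name\n";
}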

__DATA__
LastNameA, FirstNameA
LastNameB, FirstNameB 345C
LastNameC, FirstNameC BADTEXT
LastNameD, FirstNameD MOREBADTEXT

Result:

LastNameA, FirstNameA
LastNameB, FirstNameB
LastNameC, FirstNameC
LastNameD, FirstNameD

This of course assumes that no name contains a digit (no Wainright 3, Loudon), that you can list every text you want excluded, and that none of those texts is also a word in someone's name.
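
Under the same assumptions, the capture-group form the question asked for (name in $1, unwanted text in $2) might be sketched as below. This is only an illustration, not a tested drop-in: the names @bad, $bad_re and split_name are made up for the example. The attempts in the question fail because a greedy \D* also swallows BADTEXT (it contains no digits), while a lazy \D*? with nothing forcing the match to reach the end of the line stops at zero characters. Anchoring the end with \z makes the lazy group grow only as far as needed.

```perl
use strict;
use warnings;

# Hypothetical list of unwanted trailing strings; extend as needed.
my @bad    = qw(BADTEXT MOREBADTEXT);
my $bad_re = join '|', map { quotemeta($_) } @bad;

# Returns (name, extra). extra is '' when nothing follows the name.
sub split_name {
    my ($line) = @_;
    # $1: lazy run of non-digits (the name)
    # $2: optionally one listed word, or everything from the first digit on
    # \z: forces the match to reach the end, so the lazy $1 must grow
    if ($line =~ /^(\D*?)\s*((?:\b(?:$bad_re)\b|\d.*)?)\s*\z/) {
        return ($1, $2);
    }
    return;
}

for my $line ("LastNameA, FirstNameA",
              "LastNameB, FirstNameB 345C",
              "LastNameC, FirstNameC BADTEXT",
              "LastNameD, FirstNameD MOREBADTEXT") {
    my ($name, $extra) = split_name($line);
    print "name=[$name] extra=[$extra]\n";
}
```

The \b anchors keep a listed word from matching inside a longer word, so a hypothetical name part such as BADTEXTILE would stay in $1.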

If you know that each line has exactly one last name and one first name, you can just take the first two elements that split() returns.
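
A minimal sketch of that shortcut, assuming every line really does begin with exactly two name words (the helper name first_two_words is invented for the example):

```perl
use strict;
use warnings;

# Assumes each line starts with "Last," followed by "First".
sub first_two_words {
    my ($line) = @_;
    # split ' ' (the special single-space pattern) splits on runs of
    # whitespace and ignores leading whitespace, so the first two
    # fields are always the first two words on the line.
    my ($last, $first) = split ' ', $line;
    return "$last $first";
}

for my $line ("LastNameB, FirstNameB 345C",
              "LastNameD, FirstNameD MOREBADTEXT") {
    print first_two_words($line), "\n";
}
```

Nothing here needs the list of unwanted strings, but a two-word first name or a missing first name would give wrong results.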
