AWK: using field values in a regular expression

Date: 2022-03-09 19:31:08

I'm trying to find records in which field $5 contains the word CONCLUSIONS followed by the value of field $2 and then the value of field $3 from the same record.

For example, my_file.txt uses "|" as the field separator:

1|substance1|substance2|red|CONCLUSIONS: the effect of SUBSTANCE1 and SUBSTANCE2 in humans...|
2|substance3|substance4|red|Conclusions: Substance4 is not harmful...|
3|substance5|substance6|red|Substance5 interacts with substance6...|

So in this example I only want the first record to be printed because it has the word "CONCLUSIONS" followed by substance1 followed by substance2.

This is what I'm trying but it's not working:

awk 'BEGIN{FS="|";IGNORECASE=1}{if ($5 ~ /CONCLUSIONS.*$2.*$3/) {print $0}}' my_file.txt

Any help is much appreciated

1 solution

#1


$ awk 'BEGIN{FS="|";IGNORECASE=1} $5 ~ "conclusions.*" $2 ".*" $3' my_file.txt
1|substance1|substance2|red|CONCLUSIONS: the effect of SUBSTANCE1 and SUBSTANCE2 in humans...|

How It Works

  • BEGIN{FS="|";IGNORECASE=1}

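    This part is unchanged from the code in the question. Note that IGNORECASE is a GNU awk (gawk) extension; in other awks it is an ordinary variable with no effect on matching.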

  • $5 ~ "conclusions.*" $2 ".*" $3

    This is a condition: it is true if $5 matches a regex formed by concatenating four strings: "conclusions.*", $2, ".*", and $3. Because the regex is built as a string rather than written between slashes, awk evaluates $2 and $3 before matching.

    We have specified no action for this condition. Consequently, if the condition is true, awk performs the default action which is to print the line.
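
Since IGNORECASE only has an effect in GNU awk, here is a minimal sketch of a portable variant, assuming one is wanted: the same condition with tolower() applied to both sides. The function name match_conclusions is hypothetical, and the sample records are the ones from the question.

```shell
# Hypothetical helper: reads "|"-separated records from stdin or from files.
match_conclusions() {
  awk -F'|' '
    # Lower-case the text and the pattern so the match ignores case in any POSIX awk.
    tolower($5) ~ ("conclusions.*" tolower($2) ".*" tolower($3))
  ' "$@"
}

printf '%s\n' \
  '1|substance1|substance2|red|CONCLUSIONS: the effect of SUBSTANCE1 and SUBSTANCE2 in humans...|' \
  '2|substance3|substance4|red|Conclusions: Substance4 is not harmful...|' \
  '3|substance5|substance6|red|Substance5 interacts with substance6...|' |
match_conclusions
```

On this sample data it prints only record 1, the same result as the gawk command above.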

Simpler Examples

Consider:

$ echo "aa aa" | awk '$2 ~ /$1/'

This line prints nothing because awk does not expand variables or field references inside a regex constant (a regex written between slashes).

Observe that no match is found here either:

$ echo '$1' | awk '$0 ~ /$1/'

There is no match here because, inside a regex, $ matches only at the end of a line. So /$1/ would match only an end of line followed by a 1, which can never occur. To get a match here, we need to escape the dollar sign:

$ echo '$1' | awk '$0 ~ /\$1/'
$1
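
When the regex is supplied as a string instead of between slashes, the backslash itself has to be escaped, because awk processes string escapes before it compiles the regex. A small sketch of the same match in string form:

```shell
# The awk string "\\$1" holds the text \$1, which as a regex matches
# a literal dollar sign followed by 1.
printf '%s\n' '$1' | awk '$0 ~ "\\$1"'
```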

To build a regex from awk variables, put the variable itself (or a string expression) on the right-hand side of ~. This is the basis for this answer:

$ echo "aa aa" | awk '$2 ~ $1'
aa aa

This does successfully yield a match.
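
One caveat with this approach: the field value is interpreted as a regex, so characters such as . or * in it act as metacharacters. Where a literal substring test is intended, index() can be used instead. The sketch below compares the two; the helper name compare_fields is made up for illustration.

```shell
# Hypothetical helper: for each input line, test field 1 against field 2
# both as a dynamic regex and as a literal substring.
compare_fields() {
  awk '{
    as_regex   = ($2 ~ $1) ? "yes" : "no"            # field 1 used as a regex
    as_literal = (index($2, $1) > 0) ? "yes" : "no"  # field 1 used as plain text
    print "regex=" as_regex, "literal=" as_literal
  }'
}

echo 'aa aa'   | compare_fields   # regex=yes literal=yes
echo 'a.c abc' | compare_fields   # regex=yes literal=no
```

In the second line the dot in a.c matches the b in abc when treated as a regex, while the literal comparison finds no such substring.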

A Further Improvement

As Ed Morton suggests in the comments, it may be important to require that the substances match only as whole words. In that case, we can wrap each substance in GNU awk's word-boundary operators \< and \>, written as \\<...\\> inside a string because the backslash must itself be escaped. Thus:

awk 'BEGIN{FS="|";IGNORECASE=1} $5 ~ "conclusions.*\\<" $2 "\\>.*\\<" $3 "\\>"' my_file.txt

In this way, substance1 will not match substance10.
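
\< and \> are GNU awk extensions. As a rough sketch, assuming a word character means a letter, digit, or underscore, similar whole-word behavior for a single term can be emulated in other awks with bracket expressions. The helper name whole_word and its "term|text" input format are hypothetical.

```shell
# Hypothetical helper: each input line is "term|text"; reports whether
# term occurs in text as a whole word.
whole_word() {
  awk -F'|' '{
    # Assumption: word characters are letters, digits, and underscore.
    re = "(^|[^A-Za-z0-9_])" $1 "([^A-Za-z0-9_]|$)"
    print (($2 ~ re) ? "match" : "no match")
  }'
}

echo 'substance1|effect of substance10 in humans' | whole_word   # no match
echo 'substance1|effect of substance1 in humans'  | whole_word   # match
```

The match is case-sensitive as written; applying tolower() to both the term and the text would make it case-insensitive.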
