Getting specific lines from a repeating range pattern in a text file

Time: 2021-08-06 02:22:55

Wow, this sounds so complicated in the title, but I assume it is not quite so.

I have text files that have basically this layout:

Stimulus ...
...
...
...
Response
Stimulus ...
...
...
...
Response

I used sed to get everything in between and then further extracted information I needed.

sed -n -e '/Stimulus/,/Response/ p'

However, sometimes the participants do not respond, in which case the file looks like this:

Stimulus ...
...
...
...
Stimulus ...
...
...
...
Response

In this special case, my script will not get what I am looking for. So, I am looking for a way to extract the information if and only if the pattern1 is followed by pattern2, not pattern1 again.
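
To make the failure concrete, here is a minimal sketch with made-up sample data (the trial contents are an assumption, not my real files). The range opens at the first Stimulus and only closes at the next Response, so the unanswered trial comes out together with the answered one:

```shell
# Hypothetical sample: trial 1 has no Response, trial 2 does.
sample=$(printf '%s\n' 'Stimulus 1' 'a' 'Stimulus 2' 'b' 'Response 2')

# The plain range keeps everything from the first Stimulus up to the
# first Response, including the unanswered "Stimulus 1" block.
naive=$(printf '%s\n' "$sample" | sed -n -e '/Stimulus/,/Response/ p')
printf '%s\n' "$naive"
```

What I want instead is only the block that starts at "Stimulus 2".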

Let me know if anything is unclear. I am more than happy to provide further information.

6 solutions

#1


7  

One dirty way, although it seemed to work in my test, is to reverse the file content, search from Response to Stimulus, and reverse the result again.

Assuming following input data:

Stimulus 1...
...
...
...
Stimulus 2...
...
...
...
Response 2
Stimulus 3...
...
...
...
Response 3
Stimulus 4...
...
...
...
Stimulus 5...

The command:

tac infile | sed -ne '/Response/,/Stimulus/ p' | tac -

Yields:

Stimulus 2...
...
...
...
Response 2
Stimulus 3...
...
...
...
Response 3

EDIT: For input with isolated Response parts, you have to filter twice (based on a comment from the OP):

tac infile | 
  sed -ne '/Response/,/Stimulus/ p' | 
  tac - | 
  sed -ne '/Stimulus/,/Response/ p'
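
As a rough sketch, the double filter can be wrapped in a function and tried on input that has isolated Response lines. The function name answered_trials and the sample lines are made up for illustration, and tac is assumed to be available (GNU coreutils):

```shell
# Hypothetical wrapper around the pipeline above; reads files or stdin.
answered_trials() {
  tac "$@" |
    sed -n '/Response/,/Stimulus/p' |   # backwards: Response up to nearest Stimulus
    tac |
    sed -n '/Stimulus/,/Response/p'     # forwards: drop isolated Response lines
}

printf '%s\n' 'Response 0' 'Stimulus 1' 'a' 'Stimulus 2' 'b' \
  'Response 2' 'c' 'Response 4' 'Stimulus 5' | answered_trials
```

The first pass alone would still let "Response 0" and "Response 4" through, which is why the second, forward pass is needed.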

#2


5  

This is a pure bash solution:

tmp=()
while IFS= read -r l; do
  [[ $l =~ ^Stimulus ]] && tmp=("$l") && continue
  [ ${#tmp[@]} -eq 0 ] && continue
  tmp+=("$l")
  [[ $l =~ ^Response ]] && printf "%s\n" "${tmp[@]}" && tmp=()
done <infile

It starts filling the array tmp when a line starting with Stimulus is found. If another Stimulus arrives, it clears tmp and starts over. When a Response is found, it prints the contents of tmp. The printf built-in loops implicitly over its arguments, so no explicit loop is needed.
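
To illustrate the implicit loop: printf reuses its format string until all arguments are consumed, which is why one call prints every buffered line. A minimal sketch, where print_lines is a made-up helper name:

```shell
# Hypothetical helper: one printf call, one output line per argument.
print_lines() {
  printf '%s\n' "$@"
}

# Three arguments, so the format is applied three times.
print_lines 'Stimulus 2' '...' 'Response 2'
```

The same happens with "${tmp[@]}", which expands to one argument per array element.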

Input:

cat >infile <<XXX
...
Response 0
...
Stimulus 1
...
Stimulus 2
...
Response 2
...
Stimulus 3
...
Response 3
...
Response 4
XXX

Output:

Stimulus 2
...
Response 2
Stimulus 3
...
Response 3

#3


4  

Another option is to switch to perl and its flip-flop (range operator):

perl -lne '
    BEGIN {
        ## Create regular expression to match the initial and final words.
        ($from_re, $to_re) = map { qr/\A$_/ } qw|Stimulus Response|;
    }
    ## Range, similar to "sed".
    if ( $r = ( m/$from_re/o ... m/$to_re/o ) ) {
        ## If inside the range and found the initial word again, remove 
        ## all lines saved.
        if ( $r > 1 && m/$from_re/o ) {
            @data = ();
        }
        ## Save line.
        push @data, $_;
        ## At the end of the range, print all lines saved.
        if ( $r =~ m/E0\z/ ) {
            printf qq|%s\n|, join qq|\n|, @data;
            @data = ();
        }
    }
' infile

Assuming an input file as:

Stimulus 1...
...
...
...
Stimulus 2...
...
...
...
Response 2
Stimulus 3...
...
...
...
Response 3
Stimulus 4...
...
...
...
Stimulus 5...

It yields:

Stimulus 2...
...
...
...
Response 2
Stimulus 3...
...
...
...
Response 3

#4


4  

Here's a pure bash solution that tries to minimize stupid side effects:

#!/bin/bash

out=()

while read -r l; do
   case "$l" in
       Stimulus*) out=( "$l" ) ;;
       Response*) ((${#out[@]}!=0)) && { printf "%s\n" "${out[@]}" "$l"; out=(); } ;;
       *) ((${#out[@]}!=0)) && out+=( "$l" ) ;;
   esac
done < infile

It also handles the case where there's a Response but no Stimulus.
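
A quick way to check that case is the sketch below. It is not the script above but a variant of the same state machine that buffers in a plain string instead of an array, so it should also run in shells without arrays. The name answered_blocks and the sample lines are assumptions:

```shell
# Hypothetical variant: reads stdin, prints only Stimulus..Response blocks.
answered_blocks() {
  buf=
  while IFS= read -r line; do
    case $line in
      Stimulus*) buf=$line ;;                  # (re)start the buffer
      Response*)
        if [ -n "$buf" ]; then
          printf '%s\n%s\n' "$buf" "$line"     # complete block, print it
        fi
        buf= ;;                                # an isolated Response prints nothing
      *)
        if [ -n "$buf" ]; then
          buf="$buf
$line"
        fi ;;
    esac
  done
}

printf '%s\n' 'Response 0' 'x' 'Stimulus 1' 'a' 'Response 1' 'Response 2' |
  answered_blocks
```

Neither "Response 0" nor "Response 2" has a preceding open Stimulus, so only the middle block is printed.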

#5


4  

Updated to handle isolated Responses

awk '
/Response/ { 
    if (p==1) {
        for(;k<length(a);) {
            print a[++k]
        }
        print $0
    }
    delete a; i=k=p=0
} 
/Stimulus/ {
    if (p==1) {
        delete a; i=0
    }
    p=1
} 
p { a[++i]=$0 }' log

#6


4  

Really nice & easy job for GNU sed: a single pass, no unwanted pipes or extra tools:

sed -n 'H;/^Stimulus/{h;d};/^Response/{x;s/^Response//;tk;p;:k;d}' file

Input File:

Stimulus 1...
bad
bad
bad
Stimulus 2...
...
...
...
Response 2
Stimulus 3...
...
...
...
Response 3
Stimulus 4...
bad
bad
bad
bad
Stimulus 5...
...
...
...
...
Response 5
bad
bad
bad
bad
Response 6
bad
bad
bad

And output:

$ sed -n 'H;/^Stimulus/{h;d};/^Response/{x;s/^Response//;tk;p;:k;d}' file
Stimulus 2...
...
...
...
Response 2
Stimulus 3...
...
...
...
Response 3
Stimulus 5...
...
...
...
...
Response 5
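
For readers who find the one-liner hard to parse, here is the same script spread over several lines with comments. This is only a sketch of my reading of it: the function name extract_answered and the label name skip are made up, and GNU sed is assumed:

```shell
# Hypothetical wrapper; reads the given files or stdin.
extract_answered() {
  sed -n '
    # Append every input line to the hold space.
    H
    # A Stimulus restarts the buffer: overwrite the hold space with this line.
    /^Stimulus/ { h; d; }
    /^Response/ {
      # Swap, so the pattern space now contains the buffered block.
      x
      # A buffer starting with Response means no Stimulus came in between.
      s/^Response//
      t skip
      # Otherwise the block is complete, so print it.
      p
      :skip
      d
    }
  ' "$@"
}

printf '%s\n' 'Stimulus 1' 'bad' 'Stimulus 2' 'x' 'Response 2' 'bad' 'Response 3' |
  extract_answered
```

After each Response the hold space is left containing that Response line, which is what lets the substitution detect a following isolated Response.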

And my code for GNU awk:

awk '{a[++i]=$0};/^Response/ && a[1] !~ /^Response/ {for (k=1; k<=i; k++) {print a[k]}};/^Stimulus|^Response/ { delete a; i=0; a[++i]=$0}' file

As you can see, I need too much awk code ...
