The title makes this sound complicated, but I assume it is not.
I have text files that have basically this layout:
Stimulus ...
...
...
...
Response
Stimulus ...
...
...
...
Response
I used sed to get everything between the two markers and then extracted the information I needed from that:
sed -n -e '/Stimulus/,/Response/ p'
However, sometimes a participant does not respond, in which case the file looks like this:
Stimulus ...
...
...
...
Stimulus ...
...
...
...
Response
In this special case, my script does not get what I am looking for: the range opens at the first Stimulus, so the unanswered block is printed too. So I am looking for a way to extract the information if and only if pattern1 is followed by pattern2, not by pattern1 again.
Let me know if anything is unclear. I am more than happy to provide further information.
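To make the failure easy to reproduce, here is a minimal sketch. The function name naive_range and the five-line sample are made up for illustration; the sed command is the one from the question, reading stdin instead of a file.

```shell
# Naive range from the question: prints from the first Stimulus line
# through the next Response line, whatever lies in between.
naive_range() {
    sed -n -e '/Stimulus/,/Response/ p'
}

# Hypothetical miniature log: Stimulus 1 never gets a Response.
printf '%s\n' 'Stimulus 1' '...' 'Stimulus 2' '...' 'Response 2' | naive_range
```

All five lines come back, including the unanswered Stimulus 1 block, which is exactly the part that should be dropped.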
6 Solutions
#1
7
One dirty way, although it seemed to work in my test, is to reverse the file content, search from Response to Stimulus, and then reverse the result again.
Assuming the following input data:
Stimulus 1...
...
...
...
Stimulus 2...
...
...
...
Response 2
Stimulus 3...
...
...
...
Response 3
Stimulus 4...
...
...
...
Stimulus 5...
The command:
tac infile | sed -ne '/Response/,/Stimulus/ p' | tac -
Yields:
Stimulus 2...
...
...
...
Response 2
Stimulus 3...
...
...
...
Response 3
EDIT: For input with isolated Response parts, you have to filter twice (based on a comment by the OP):
tac infile |
sed -ne '/Response/,/Stimulus/ p' |
tac - |
sed -ne '/Stimulus/,/Response/ p'
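As a rough check of the two-pass version, the sketch below wraps the same pipeline in a function that reads stdin. The name answered_blocks and the sample lines are illustrative only, and it assumes GNU tac (part of GNU coreutils) is available.

```shell
# Two-pass filter, reading stdin. Assumes GNU tac is installed.
# Pass 1 (on the reversed input): keep each Response back to the nearest Stimulus.
# Pass 2 (on the restored order): drop Response lines that have no Stimulus.
answered_blocks() {
    tac |
        sed -n -e '/Response/,/Stimulus/ p' |
        tac |
        sed -n -e '/Stimulus/,/Response/ p'
}

# Hypothetical input with an unanswered Stimulus and two isolated Responses.
printf '%s\n' 'Response 0' 'Stimulus 1' '...' 'Stimulus 2' '...' \
    'Response 2' '...' 'Response 4' | answered_blocks
```

With this input, only the Stimulus 2 block through Response 2 should survive both passes.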
#2
5
This is a pure bash solution:
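tmp=()
while read -r l; do
    # A new Stimulus (re)starts the buffer, discarding any unanswered block.
    [[ $l =~ ^Stimulus ]] && tmp=("$l") && continue
    # Skip lines that are outside a block.
    [ ${#tmp[@]} -eq 0 ] && continue
    tmp+=("$l")
    # A Response completes the block: print it and reset the buffer.
    [[ $l =~ ^Response ]] && printf "%s\n" "${tmp[@]}" && tmp=()
done <infile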
It starts filling the array tmp when a line starting with Stimulus is found. If another Stimulus arrives, it just clears tmp and starts over. If a Response is found, it prints the contents of the tmp array. The printf built-in does an implicit loop here: it reuses the format for each array element.
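The implicit loop in printf can be seen in isolation with a small sketch. The helper name print_lines is made up for this illustration and is not part of the answer.

```shell
# printf reuses its format string until all arguments are consumed,
# so a single call prints one line per argument.
print_lines() {
    printf '%s\n' "$@"
}

print_lines 'Stimulus 2' '...' 'Response 2'
```

This is why the loop above needs no explicit for over the array elements.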
Input:
cat >infile <<XXX
...
Response 0
...
Stimulus 1
...
Stimulus 2
...
Response 2
...
Stimulus 3
...
Response 3
...
Response 4
XXX
Output:
Stimulus 2
...
Response 2
Stimulus 3
...
Response 3
#3
4
Another option is to switch to perl and its flip-flop (range operator):
perl -lne '
    BEGIN {
        ## Create regular expression to match the initial and final words.
        ($from_re, $to_re) = map { qr/\A$_/ } qw|Stimulus Response|;
    }
    ## Range, similar to "sed".
    if ( $r = ( m/$from_re/o ... m/$to_re/o ) ) {
        ## If inside the range and found the initial word again, remove
        ## all lines saved.
        if ( $r > 1 && m/$from_re/o ) {
            @data = ();
        }
        ## Save line.
        push @data, $_;
        ## At the end of the range, print all lines saved.
        if ( $r =~ m/E0\z/ ) {
            printf qq|%s\n|, join qq|\n|, @data;
            @data = ();
        }
    }
' infile
Assuming an input file like:
Stimulus 1...
...
...
...
Stimulus 2...
...
...
...
Response 2
Stimulus 3...
...
...
...
Response 3
Stimulus 4...
...
...
...
Stimulus 5...
It yields:
Stimulus 2...
...
...
...
Response 2
Stimulus 3...
...
...
...
Response 3
#4
4
Here's a pure bash solution that tries to minimize stupid side effects:
#!/bin/bash
out=()
while read -r l; do
    case "$l" in
        Stimulus*) out=( "$l" ) ;;
        Response*) ((${#out[@]}!=0)) && { printf "%s\n" "${out[@]}" "$l"; out=(); } ;;
        *) ((${#out[@]}!=0)) && out+=( "$l" ) ;;
    esac
done < infile
It also handles the case where there's a Response but no Stimulus.
#5
4
Updated to handle isolated Responses
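awk '
/Response/ {
    if (p==1) {
        # Print the buffered block, then the Response line itself.
        for (k=1; k<=i; k++) {
            print a[k]
        }
        print $0
    }
    # Reset the buffer, including its index i.
    delete a; i=k=p=0
}
/Stimulus/ {
    if (p==1) {
        # A new Stimulus before any Response: discard the unanswered block.
        delete a; i=0
    }
    p=1
}
p { a[++i]=$0 }' log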
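For comparison, the same buffer-and-flush idea can be sketched with a single string buffer instead of an array. This is only an illustration under a few assumptions: the names extract_answered, buf and inblock are made up, the markers are assumed to sit at the start of a line, and the function reads stdin.

```shell
# Buffer-and-flush sketch in plain awk, reading stdin.
# "buf" and "inblock" are illustrative names, not taken from the answers.
extract_answered() {
    awk '
        # Start over at every Stimulus, dropping any unanswered block.
        /^Stimulus/ { buf = $0; inblock = 1; next }
        # Collect the lines of the open block (the Response line included).
        inblock     { buf = buf "\n" $0 }
        # Print only blocks that were opened by a Stimulus.
        /^Response/ { if (inblock) print buf; inblock = 0 }
    '
}

# Hypothetical input with stray lines and isolated Responses.
printf '%s\n' 'bad' 'Response 0' 'Stimulus 1' 'bad' 'Stimulus 2' '...' \
    'Response 2' 'bad' 'Response 6' | extract_answered
```

Keeping the block in one string avoids tracking an array index, at the cost of building a longer string for large blocks.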
#6
4
A nice and easy job for GNU sed: a single pass, with no extra pipes or tools:
sed -n 'H;/^Stimulus/{h;d};/^Response/{x;s/^Response//;tk;p;:k;d}' file
Input file:
Stimulus 1...
bad
bad
bad
Stimulus 2...
...
...
...
Response 2
Stimulus 3...
...
...
...
Response 3
Stimulus 4...
bad
bad
bad
bad
Stimulus 5...
...
...
...
...
Response 5
bad
bad
bad
bad
Response 6
bad
bad
bad
And the output:
$ sed -n 'H;/^Stimulus/{h;d};/^Response/{x;s/^Response//;tk;p;:k;d}' file
Stimulus 2...
...
...
...
Response 2
Stimulus 3...
...
...
...
Response 3
Stimulus 5...
...
...
...
...
Response 5
And my code for GNU awk:
awk '{a[++i]=$0};/^Response/ && a[1] !~ /^Response/ {for (k=1; k<=i; k++) {print a[k]}};/^Stimulus|^Response/ { delete a; i=0; a[++i]=$0}' file
As you can see, the awk version needs a lot more code ...