I am new to shell scripting, and it would be great if I could get some help with the question below.

I want to read a text file line by line, and print all matched patterns in that line to a line in a new text file.

For example:

$ cat input.txt

SYSTEM ERROR: EU-1C0A  Report error -- SYSTEM ERROR: TM-0401 DEFAULT Test error
SYSTEM ERROR: MG-7688 DEFAULT error -- SYSTEM ERROR: DN-0A00 Error while getting object -- ERROR: DN-0A52 DEFAULT Error -- ERROR: MG-3218 error occured in HSSL
SYSTEM ERROR: DN-0A00 Error while getting object -- ERROR: DN-0A52 DEFAULT Error
SYSTEM ERROR: EU-1C0A  error Failed to fill in test report -- ERROR: MG-7688

The intended output is as follows:

$ cat output.txt

EU-1C0A TM-0401
MG-7688 DN-0A00 DN-0A52 MG-3218
DN-0A00 DN-0A52
EU-1C0A MG-7688

I tried the following code:

while read p; do
    grep -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' | xargs
done < input.txt > output.txt

which produced this output:

EU-1C0A TM-0401 MG-7688 DN-0A00 DN-0A52 MG-3218 DN-0A00 DN-0A52 EU-1C0A MG-7688 .......

Then I also tried this:

while read p; do
    grep -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' | xargs > output.txt
done < input.txt

But that did not help :(

Maybe there is another way; I am open to awk, sed, cut, or whatever... :)

Note: A single line can contain any number of error codes (i.e. XX-XXXX, the pattern of interest).

8 Solutions

#1

There's always perl! And this will grab any number of matches per line.

perl -nle '@matches = /[A-Z]{2}-[A-Z0-9]{4}/g; print(join(" ", @matches)) if (scalar @matches);' < input.txt > output.txt

-e supplies the Perl code to run, -n runs it once per input line, and -l automatically chomps each line and appends a newline to every print.

The regex implicitly matches against $_. So @matches = $_ =~ //g is overly verbose.

If there is no match, this will not print anything.

#2

% awk 'BEGIN{RS=": "};NR>1{printf "%s%s", $1, ($0~/\n/)?"\n":" "}' input.txt 
EU-1C0A TM-0401
MG-7688 DN-0A00 DN-0A52 MG-3218
DN-0A00 DN-0A52
EU-1C0A MG-7688

Explanation in long form:

awk '
    BEGIN{ RS=": " } # Set the record separator to colon-space
    NR>1 {           # Ignore the first record
        printf("%s%s", # Print two strings:
            $1,      # 1. first field of the record (`$1`)
            ($0~/\n/) ? "\n" : " ")
                     # Ternary expression, read as `if condition (thing
                     # between brackets), then thing after `?`, otherwise
                     # thing after `:`.
                     # So: If the record ($0) matches (`~`) newline (`\n`),
                     # then put a newline. Otherwise, put a space.
    }
' input.txt

Previous answer to the unmodified question:

% awk 'BEGIN{RS=": "};NR>1{printf "%s%s", $1, (NR%2==1)?"\n":" "}' input.txt 
EU-1C0A TM-0401
MG-7688 MG-3218
DN-0A00 DN-0A52
EU-1C0A MG-7688

Edit: added a safeguard against :-injection (thanks @e0k). It tests that the first field after the record separator looks the way we expect.

awk 'BEGIN{RS=": "};NR>1 && $1 ~ /^[A-Z]{2}-[A-Z0-9]{4}$/ {printf "%s%s", $1, ($0~/\n/)?"\n":" "}' input.txt

#3

You could always keep it extremely simple:

$ awk '{o=""; for (i=1;i<=NF;i++) if ($i=="ERROR:") o=o$(i+1)" "; print o}' input.txt
EU-1C0A TM-0401
MG-7688 DN-0A00 DN-0A52 MG-3218
DN-0A00 DN-0A52
EU-1C0A MG-7688

The above adds a trailing blank to each line, which is trivially avoided if you care.
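
The trailing blank can be avoided by emitting the separator before every code except the first. This is a minimal sketch, assuming the same "code follows ERROR:" layout as the question; the function name codes_after_error and the temporary sample file are made up for illustration.

```shell
# Hypothetical two-line sample in the question's format.
sample=$(mktemp)
printf '%s\n' \
  'SYSTEM ERROR: EU-1C0A  Report error -- SYSTEM ERROR: TM-0401 DEFAULT Test error' \
  'SYSTEM ERROR: DN-0A00 Error while getting object -- ERROR: DN-0A52 DEFAULT Error' > "$sample"

codes_after_error() {
  awk '{
    o = ""; sep = ""                 # sep stays empty until the first code is stored
    for (i = 1; i < NF; i++)         # stop before NF so that $(i + 1) always exists
      if ($i == "ERROR:") { o = o sep $(i + 1); sep = " " }
    print o
  }' "$1"
}

codes_after_error "$sample"
```

Lines without any ERROR: field still produce an empty output line here, which keeps input and output line numbers aligned.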

#4

To keep your grep pattern, here's a way:

while IFS='' read -r p; do
    echo $(grep -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' <<<"$p")
done < input.txt > output.txt

while IFS='' read -r p; do is the standard way to read line-by-line into a variable. See, e.g., this answer.

grep -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' <<<"$p" runs your grep and prints the matches. The <<<"$p" is a "here string" that provides the string $p (the line that was read in) as stdin to grep. This means grep will search the contents of $p and print each match on its own line.

echo $(grep ...) converts the newlines in grep's output to spaces, and adds a newline at the end. Since this loop happens for each line, the result is to print each input line's matches on a single line of the output.

done < input.txt > output.txt is correct: you are providing input to, and taking output from, the loop as a whole. You don't need redirection within the loop.
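
It may help to see why the loop in the question misbehaves: a grep with no file argument and no here string reads the loop's own stdin, so it consumes everything left after the first read. The sketch below contrasts the two forms on made-up data. The names broken and fixed, the sample codes, and the temporary file are all hypothetical, and the fixed variant uses a pipe rather than a here string only so that it also runs under a plain POSIX sh.

```shell
# Hypothetical three-line sample; the codes are made up.
sample=$(mktemp)
printf '%s\n' \
  'ERROR: AA-0001 x -- ERROR: BB-0002 y' \
  'ERROR: CC-0003 z' \
  'ERROR: DD-0004 -- ERROR: EE-0005' > "$sample"

# grep has no input of its own here, so it drains the loop's stdin:
# everything left after the first read ends up on one output line.
broken() {
  while IFS= read -r p; do
    grep -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' | xargs
  done < "$1"
}

# Each line is handed to grep explicitly, so the loop runs once per line.
fixed() {
  while IFS= read -r p; do
    printf '%s\n' "$p" | grep -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' | xargs
  done < "$1"
}

broken "$sample"   # a single line of output
fixed "$sample"    # one output line per input line
```

Every sample line contains at least one code on purpose: what xargs does with empty input differs between implementations.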

#5

Another solution, which works only if you know that every line contains exactly two instances of the strings you want to match:

cat input.txt | grep -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' | xargs -L2 > output.txt

#6

Here is a solution with awk that is fairly straightforward, but it is not an elegant one-liner (as many awk solutions tend to be). It should work with any number of your error codes per line, and with an error code defined as a field (white space separated word) that matches a given regex. Since it's not a snazzy one-liner, I stored the program in a file:

codes.awk

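#!/usr/bin/awk -f
{
    m = 0                                 # matches printed so far on this line
    for (i = 1; i <= NF; ++i) {
        if ($i ~ /^[A-Z][A-Z]-[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]$/) {
            if (m > 0) printf "%s", OFS   # separator only between matches
            printf "%s", $i               # explicit format: never use data as the format string
            m++
        }
    }
    if (m > 0) printf "%s", ORS           # end the line only if something was printed
}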

You would run it like this:

$ awk -f codes.awk input.txt

I hope you find it fairly easy to read. It runs the block once for each line of input. It iterates over each field and checks if it matches a regular expression, then prints the field if it does. The variable m keeps track of the number of matched fields on the current line so far. The purpose of this is to print the output field separator OFS (a space by default) between the matched fields only as needed and to use the output record separator ORS (a newline by default) only if there was at least one error code found. This prevents unnecessary white space.

Notice that I have changed your regular expression from [A-Z]{2}-[A-Z0-9]{4} to [A-Z][A-Z]-[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]. This is because old awk will not (or at least may not) support interval expressions (the {n} parts). You could use [A-Z]{2}-[A-Z0-9]{4} with gawk, however. You can tweak the regex as needed. (In both awk and gawk, regular expressions are delimited by /.)

The regex /[A-Z]{2}-[A-Z0-9]{4}/ would match any field that contains your XX-XXXX pattern of letters and digits. You want the field to match the regex in full, not merely to include something that matches the pattern. To do this, ^ and $ mark the beginning and end of the string. For example, /^[A-Z]{2}-[A-Z0-9]{4}$/ (with gawk) would match US-BOTZ, but not USA-ROBOTS. Without the ^ and $, USA-ROBOTS would match because it includes the substring SA-ROBO, which does match the regex.
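
The difference between the anchored and unanchored forms can be checked directly. A small sketch, using the expanded (interval-free) regex so that it should run under any awk; the helper names is_code and contains_code are hypothetical:

```shell
# Hypothetical helpers: report whether a word is exactly a code,
# or merely contains one.
is_code() {
  printf '%s\n' "$1" |
    awk '{ print (($0 ~ /^[A-Z][A-Z]-[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]$/) ? "yes" : "no") }'
}
contains_code() {
  printf '%s\n' "$1" |
    awk '{ print (($0 ~ /[A-Z][A-Z]-[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]/) ? "yes" : "no") }'
}

is_code US-BOTZ            # yes
is_code USA-ROBOTS         # no
contains_code USA-ROBOTS   # yes, via the substring SA-ROBO
```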

#7

Parsing `grep -n` with AWK

grep -n -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' file | awk -F: -vi=0 '{
  printf("%s%s", i ? (i == $1 ? " " : "\n") : "", $2)
  i = $1
}'

The idea is to join the lines from the output of grep -n:

1:EU-1C0A
1:TM-0401
2:MG-7688
2:DN-0A00
2:DN-0A52
2:MG-3218
3:DN-0A00
3:DN-0A52
4:EU-1C0A
4:MG-7688

by the line numbers. AWK initializes the field separator (-F:) and the i variable (-vi=0), then processes the output of the grep command line by line.

It prints a separator chosen by a conditional expression that tests the value of the first field, $1. If i is zero (the first iteration), it prints only the second field, $2. Otherwise, if the first field equals i, it prints a space; if not, a newline ("\n"). After the space or newline, the second field is printed.

After printing the next chunk, the value of the first field is stored into i for the next iterations (lines): i = $1.
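
The same joining logic can be written with the separator decision in its own rule. The variant below is only a sketch: it also terminates the final line with a newline, which the script above leaves unterminated. The function name join_by_line_number and the variable prev are made up for illustration.

```shell
join_by_line_number() {
  awk -F: '
    NR > 1 { printf "%s", ($1 == prev ? " " : "\n") }   # separator before every item but the first
    { printf "%s", $2; prev = $1 }                      # the match itself
    END { if (NR > 0) print "" }                        # terminate the last line
  '
}

# Feed it a few lines shaped like the output of grep -n -o.
printf '%s\n' '1:EU-1C0A' '1:TM-0401' '2:MG-7688' '2:DN-0A00' | join_by_line_number
```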

Perl

Parsing `grep -n` in Perl

use strict;
use warnings;

my $p = 0;

while (<>) {
  /^(\d+):(.*)$/;
  print $p == $1 ? " " : "\n" if $p;
  print $2;
  $p = $1;
}

Usage: grep -n -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' file | perl script.pl.

Single Line

But Perl is actually so flexible and powerful that you can solve the problem completely with a single line:

perl -lne 'print "@_" if @_ = /([A-Z]{2}-[A-Z\d]{4})/g' < file

Interpolating the array in double quotes ("@_") joins the matches with spaces; a bare print @_ would run them together. I've seen a similar solution in one of the answers here, but I decided to post this one because it is more compact.

One of the key ideas is using the -l switch, which

automatically chomps the input record separator $/;

assigns the output record separator $\ the value of $/ (a newline by default).

The value of the output record separator, if defined, is printed after the last argument passed to print. As a result, the script prints all matches (held in @_) followed by a newline.

The @_ variable is normally the array of subroutine parameters. I used it in the script only for the sake of brevity.

#8

In GNU awk. Supports multiple matches on each record:

$ awk '
{
    while(match($0, /[A-Z]{2}-[A-Z0-9]{4}/)) {  # find first match on record
        b=b substr($0,RSTART,RLENGTH) OFS       # buffer the match
        $0=substr($0,RSTART+RLENGTH)            # truncate from start of record
    }
    if(b!="") print b                           # print buffer if not empty
    b=""                                        # empty buffer
}' file
EU-1C0A TM-0401 
MG-7688 DN-0A00 DN-0A52 MG-3218 
DN-0A00 DN-0A52 
EU-1C0A MG-7688

Downside: there will be an extra OFS at the end of each printed record.

If you want to use an awk other than GNU awk, replace the regex match with:

while(match($0, /[A-Z][A-Z]-[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]/))
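
The extra trailing OFS can be avoided by placing the separator in front of every match except the first. Here is a sketch of that variation, using the expanded regex so that it should not depend on GNU awk; the function name extract_codes and the variables out and sep are hypothetical.

```shell
extract_codes() {
  awk '{
    out = ""; sep = ""
    while (match($0, /[A-Z][A-Z]-[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]/)) {
      out = out sep substr($0, RSTART, RLENGTH)   # append the match, separator first
      sep = OFS                                   # later matches get a separator in front
      $0 = substr($0, RSTART + RLENGTH)           # continue after the match
    }
    if (out != "") print out                      # skip lines without any code
  }' "$@"
}

# Reads the named files, or stdin when none are given.
printf '%s\n' 'SYSTEM ERROR: EU-1C0A  error -- ERROR: MG-7688' 'no codes here' | extract_codes
```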
