Grep / Sed between two tags across multiple lines

Time: 2021-10-05 15:45:02

I have many files from which I need to get information.

Example of my files:

first file content:

"test This info i need grep</singleline>"

and

second file content (with two lines):

"test This info=
 i need grep too</singleline>"

In the results I need to extract this text: from the first file, "This info i need grep", and from the second file, "This info= i need grep too".

For the first file I use:

grep -o 'test .*</singleline>' * | sed -e 's/test \(.*\)<\/singleline>/\1/'

and successfully get "This info i need grep", but the same command cannot get the information from the second file.

Please help me rewrite the command, or suggest a different one.

3 solutions

#1


I'd use pcregrep, which can match multiline regexes:

pcregrep -Mo 'test \K((?s).)*?(?=</singleline>)' filename

The tricks are:

  • -M allows pcregrep to match on more than one line,

  • -o makes it print only the match,

  • \K throws away the part of the match that comes before it,

  • (?=</singleline>) is a lookahead term that matches an empty string if (and only if) it is followed by </singleline>, and

  • ((?s).)*? matches any characters non-greedily, which is to say that if </singleline> occurs several times in the file, the match runs to the closest occurrence rather than the furthest. If this is not desired, remove the ?. (?s) enables the s option locally for the term so that . matches newlines in it; it wouldn't do that by default.

Thanks to @CasimiretHippolyte for pointing out the ((?s).) alternative to (.|\n).

#2


Or, if you insist on using grep, you can use:

grep -Pzo 'test(\n|.)*(?=</singleline>)' test.txt 

To understand the meaning of each flag, use grep --help:

  • -P, --perl-regexp

    PATTERN is a Perl regular expression

  • -o, --only-matching

    show only the part of a line matching PATTERN

  • -z, --null-data

    a data line ends in 0 byte, not newline
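
To see how these flags combine, here is a minimal sketch, assuming GNU grep built with -P support. The sample file name and the `multiline` variable are made up for the illustration, and the pattern is a variant of the one above: it uses *? so the match stops at the nearest closing tag, and \K so that "test " is left out of the output.

```shell
#!/bin/sh
# Hypothetical sample file mirroring the two-line case from the question.
tmpdir=$(mktemp -d)
printf 'test This info=\n i need grep too</singleline>\n' > "$tmpdir/second.txt"

# -z: records end at NUL bytes, so a file without NULs is one single record
#     and the pattern can cross the newline via the \n alternative.
# -o: print only the matched part; \K discards "test " from that match.
# With -z each printed match ends in a NUL byte; tr turns it into a newline.
multiline=$(grep -Pzo 'test \K(\n|.)*?(?=</singleline>)' "$tmpdir/second.txt" | tr '\0' '\n')
printf '%s\n' "$multiline"

rm -rf "$tmpdir"
```

The match is still printed on two lines here, because this approach only extracts the text and does not join the lines.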

#3


It looks like you're parsing quoted-printable encoded text, where a "soft" line break (one that is an artifact from fixed-line-width formatting) is indicated with a line-terminating = (directly before the \n).

Since in a later comment you also expressed the desire to print each match as a single line, I suggest the following 2-pass approach:

  • use awk to remove the soft line breaks

  • then use grep on the result

awk '/=$/ { printf "%s", substr($0, 1, length($0)-1); next } 1' file |
  grep -Po 'test .*?(?=</singleline>)'

Tip of the hat to Wintermute's helpful answer for the non-greedy quantifier, *?, and to both Wintermute's and Maroun Maroun's helpful answers for the positive look-ahead assertion, (?=...).

Note that the awk command removes the line-ending = (along with the newline); replace the substr call with just $0 to retain it.
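
As a rough end-to-end check of the two passes, here is a sketch run against two sample files that mirror the question. The file names and the `result` variable are hypothetical, GNU grep with -P support is assumed, and \K is added to the grep pattern (an addition for this illustration) so that the leading "test " is dropped from the output. Since awk's $0 does not include the newline, trimming one character is enough to drop the trailing =.

```shell
#!/bin/sh
# Hypothetical sample files mirroring the two cases from the question.
tmpdir=$(mktemp -d)
printf 'test This info i need grep</singleline>\n' > "$tmpdir/first.txt"
printf 'test This info=\n i need grep too</singleline>\n' > "$tmpdir/second.txt"

# Pass 1 (awk): for lines ending in "=", print the line without that last
# character and without a newline, so the following line is appended to it.
# All other lines are printed unchanged by the trailing "1".
# Pass 2 (grep): plain line-by-line matching on the joined text.
result=$(
  for f in "$tmpdir/first.txt" "$tmpdir/second.txt"; do
    awk '/=$/ { printf "%s", substr($0, 1, length($0) - 1); next } 1' "$f"
  done | grep -Po 'test \K.*?(?=</singleline>)'
)
printf '%s\n' "$result"

rm -rf "$tmpdir"
```

On these two samples this prints "This info i need grep" and "This info i need grep too", one per line.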

Since the strings of interest are first converted back to their original single-line representations:

  • The matches are printed in their original form.

  • You can use regular (GNU) grep with line-by-line matching; contrast this with
    • needing to read the entire file at once, as in Maroun Maroun's helpful answer.
      Note that, as of this writing, * must be replaced with *? in his answer for it to work correctly in files with multiple matches.

    • needing to install another utility, pcregrep, as in Wintermute's helpful answer.

    • additionally, with either alternative the matches would have to be cleaned up to be single-line (something you didn't originally state as a requirement).
