I have many files from which I need to extract information.
Example of my files:
First file content:
"test This info i need grep</singleline>"
and
Second file content (two lines):
"test This info=
i need grep too</singleline>"
In the results I need this text: from the first file, "This info i need grep", and from the second file, "This info= i need grep too".
For the first file I use:
grep -o 'test .*</singleline>' * | sed -e 's/test \(.*\)<\/singleline>/\1/'
and successfully get "This info i need grep", but I cannot get the information from the second file with the same command.
Please help me rewrite the command, or suggest another one.
3 Solutions
#1
I'd use pcregrep, which can match multiline regexes:
pcregrep -Mo 'test \K((?s).)*?(?=</singleline>)' filename
The tricks are:
- -M allows pcregrep to match on more than one line,
- -o makes it print only the match,
- \K throws away the part of the match that comes before it,
- (?=</singleline>) is a lookahead term that matches an empty string if (and only if) it is followed by </singleline>, and
- ((?s).)*? matches any characters non-greedily, which is to say that if you have several occurrences of </singleline> in the file, it will match until the closest rather than the furthest. If this is not desired, remove the ?. (?s) enables the s option locally for the term to make . match newlines in it; it wouldn't do that by default.
Thanks to @CasimiretHippolyte for pointing out the ((?s).) alternative to (.|\n).
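To see what the non-greedy quantifier and the lookahead change in practice, here is a minimal sketch. It assumes GNU grep built with PCRE support (-P) rather than pcregrep, works on a single made-up line, and leaves out \K, so the "test " prefix stays in the output:

```shell
# Made-up sample line containing two closing tags.
line='test first</singleline> junk test second</singleline>'

# Greedy: .* runs to the LAST </singleline> on the line, giving one long match.
greedy=$(printf '%s\n' "$line" | grep -Po 'test .*(?=</singleline>)')

# Non-greedy: .*? stops at the CLOSEST </singleline>, giving one match per tag.
lazy=$(printf '%s\n' "$line" | grep -Po 'test .*?(?=</singleline>)')

printf 'greedy:\n%s\n\nlazy:\n%s\n' "$greedy" "$lazy"
```

In both cases the lookahead keeps the closing tag itself out of the printed match.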
#2
Or, if you insist on using grep, you can:
grep -Pzo 'test(\n|.)*(?=</singleline>)' test.txt
To understand the meaning of each flag, use grep --help:
- -P, --perl-regexp: PATTERN is a Perl regular expression
- -o, --only-matching: show only the part of a line matching PATTERN
- -z, --null-data: a data line ends in 0 byte, not newline
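One hedged way to try this on the two-line sample from the question is sketched below. The temporary file is created here purely for illustration, the pattern uses the non-greedy (\n|.)*? form, and the output is piped through tr because with -z each printed match ends in a NUL byte rather than a newline. It assumes GNU grep with -P support.

```shell
# Recreate the two-line sample from the question in a temporary file.
sample=$(mktemp)
printf 'test This info=\ni need grep too</singleline>\n' > "$sample"

# -z treats the whole file as one "line", so the pattern can span the newline.
# Each match printed by -o ends in a NUL byte; tr turns it into a newline.
match=$(grep -Pzo 'test(\n|.)*?(?=</singleline>)' "$sample" | tr '\0' '\n')

printf '%s\n' "$match"
rm -f "$sample"
```

The match still spans two lines and still starts with "test", so it may need further cleanup depending on what you want to print.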
#3
It looks like you're parsing quoted-printable encoded text, where a "soft" line break (one that is an artifact of fixed-line-width formatting) is indicated with a line-terminating = (directly before the \n).
Since in a later comment you also expressed the desire to print each match as a single line, I suggest the following 2-pass approach:
- use awk to remove the soft line breaks
- then use grep on the result
awk '/=$/ { printf "%s", substr($0, 1, length($0)-1); next } 1' file |
grep -Po 'test .*?(?=</singleline>)'
Tip of the hat to Wintermute's helpful answer for the non-greedy quantifier, *?, and to both Wintermute's and Maroun Maroun's helpful answers for the positive look-ahead assertion, (?=...).
Note that the awk command removes the line-ending = (along with the newline); replace the substr call with just $0 to retain it.
Since strings of interest are first converted back to their original single-line representations:
- The matches are printed in their original form.
- You can use regular (GNU) grep with line-by-line matching; contrast this with:
  - needing to read the entire file at once, as in Maroun Maroun's helpful answer. Note that, as of this writing, * must be replaced with *? in his answer for it to work correctly in files with multiple matches.
  - needing to install another utility, pcregrep, as in Wintermute's helpful answer.
  - additionally, the matches would have to be cleaned up to be single-line (something you didn't originally state as a requirement).
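The two passes can be tried end to end with a small sketch like the one below. The sample file is made up for illustration (one single-line entry and one entry with a soft line break after a space); this version drops exactly one character, the trailing =, and assumes GNU grep with -P support.

```shell
# Made-up sample: one single-line entry, one entry split by a soft line break.
sample=$(mktemp)
printf '%s\n' \
  'test This info i need grep</singleline>' \
  'test This info =' \
  'i need grep too</singleline>' > "$sample"

# Pass 1 (awk): join each line ending in "=" with the next one, dropping the "=".
# Pass 2 (grep): line-by-line, non-greedy match up to the closing tag.
result=$(awk '/=$/ { printf "%s", substr($0, 1, length($0)-1); next } 1' "$sample" |
  grep -Po 'test .*?(?=</singleline>)')

printf '%s\n' "$result"
rm -f "$sample"
```

Each match comes out on a single line. Piping the result through sed 's/^test //' would strip the prefix, much as the sed step in the question's original command does.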