Possible duplicate of Regex - find all lines after a match: although my need is a little different.
I want to parse a plain text file with multiple date/value data separated by specific strings. I want to skip the first half of the file until a specific line where I want to match the results.
Here is an example of the file in question (including the mess of tabs and spaces):
I dont want to capture the following measures. This text is on a single line and contains tabs and spaces is also ends with this token : Token1
05/01/1969 0.01846
15/01/1969 0.16730
25/01/1969 0.33988
05/04/1969 0.81319
15/04/1969 0.76973
25/11/2011 0.24210
05/12/2011 0.25220
15/12/2011 0.31160
25/12/2011 0.36845
End : bla bla bla
This text is also on a single line and marks the beginning of a new series of results. These are the results that I want. it also ends with the following token : Token2
05/01/1969 109.46333
15/01/1969 110.06998 118.18000
25/01/1969 110.82954
05/02/1969 111.51394 118.83000
25/02/1969 112.36483
05/10/2011 114.38798 114.31000
05/10/2011 114.31000 114.38798 114.38798 114.38798 114.38798 114.38798 114.38798
25/12/2011 112.64000 112.41261 112.86301 113.25494 114.06421 115.93219 116.38780
05/01/2012 112.22834 112.92301 113.40561 114.78823 116.62931 117.43421
05/09/2012 110.01410 112.16391 112.88199 115.23640 117.04756 118.04632
15/09/2012 109.97572 112.00809 112.70266 114.91247 116.65256 117.57412
25/09/2012 109.93967 111.87272 112.53305 114.60381 116.26935 117.12756
End : Marks the end of the file
What I wish to do is match every line after the line that ends with Token2. I have tried different solutions from other similar questions, but none of them work. I ended up matching all the results in the file and considered splitting the file before applying the following pattern. Is there a pure regex solution to this?
Here is the pattern that works for the whole file, with named capture groups:
(?P<date>\d\d\/\d\d\/\d\d\d\d)\s*(?P<simul>\d+\.*\d*)[\t ]*(?P<observ>\d+\.*\d*){0,1}[\t ]*(?P<prev_no_rain>\d+\.*\d*){0,1}[\t ]*(?P<prev_10_dry>\d+\.*\d*){0,1}[\t ]*(?P<prev_20_dry>\d+\.*\d*){0,1}[\t ]*(?P<prev_50>\d+\.*\d*){0,1}[\t ]*(?P<prev_20_wet>\d+\.*\d*){0,1}[\t ]*(?P<prev_10_wet>\d+\.*\d*){0,1}
Regex101 link : https://regex101.com/r/a0mCZ2/3
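The splitting workaround mentioned above can be sketched in Python as follows. This is only an illustration under some assumptions: LINE_RE is a simplified stand-in for the full pattern (it collects all values of a line into one group instead of eight named groups), parse_after_token is a hypothetical helper name, and everything after the Token2 line is treated as wanted data, which holds for the sample file because the Token2 series comes last.

```python
import re

# Simplified stand-in for the full pattern above: a date followed by
# one or more numeric values separated by tabs or spaces.
LINE_RE = re.compile(
    r"(?P<date>\d\d/\d\d/\d{4})(?P<values>(?:[\t ]+\d+(?:\.\d+)?)+)"
)


def parse_after_token(text, token="Token2"):
    """Return (date, [values]) for every data line after the token line."""
    # Locate the token at the end of a line and keep only what follows it.
    marker = re.search(re.escape(token) + r"[\t ]*\r?\n", text)
    if marker is None:
        return []
    tail = text[marker.end():]
    return [
        (m.group("date"), [float(v) for v in m.group("values").split()])
        for m in LINE_RE.finditer(tail)
    ]


sample = (
    "Skip these measures : Token1\n"
    "05/01/1969 0.01846\n"
    "End : bla bla bla\n"
    "Wanted results : Token2\n"
    "05/01/1969 109.46333\n"
    "15/01/1969\t110.06998 118.18000\n"
    "End : Marks the end of the file\n"
)
rows = parse_after_token(sample)
print(rows)
```

This works, but it needs two steps (locate the token, then match), which is why a single-pattern solution would still be preferable.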
1 answer
#1
You may leverage the \G operator, which matches either at the start of the string (this can be excluded with a negative lookahead, (?!\A)) or at the end of the previous successful match. With (?:\G(?!\A)|\bToken2[\r\n]+), we tell the regex engine to find the whole word Token2 at the end of a line (together with its line break characters) and then to match the subpatterns that follow only if they come in immediate succession.
A regex that can be used:
(?:\G(?!\A)[\r\n]*|Token2[\r\n]+)\K(?P<date>\d\d\/\d\d\/\d{4})\s*(?P<simul>\d+\.*\d*)[\t ]*(?P<observ>\d+\.*\d*)?[\t ]*(?P<prev_no_rain>\d+(?:\.\d+)*)?[\t ]*(?P<prev_10_dry>\d+\.*\d*)?[\t ]*(?P<prev_20_dry>\d+\.*\d*)?[\t ]*(?P<prev_50>\d+\.*\d*)?[\t ]*(?P<prev_20_wet>\d+\.*\d*)?[\t ]*(?P<prev_10_wet>\d+\.*\d*)?
See the regex demo. Note that I replaced {0,1} with ? to shorten it a bit.
The part you are interested in is (?:\G(?!\A)[\r\n]*|Token2[\r\n]+)\K.
- (?:\G(?!\A)[\r\n]*|Token2[\r\n]+) - one of two alternatives:
  - \G(?!\A)[\r\n]* - the end of the previous successful match, followed by 0+ line break characters
  - | - or
  - Token2[\r\n]+ - Token2 followed by 1+ CR or LF characters. (If you need to match Token2 as a whole word, you can add \b before it.)
- \K - omit the text matched so far from the returned match.
The remainder, (?P<date>\d\d\/\d\d\/\d{4})\s*(?P<simul>\d+\.*\d*)[\t ]*(?P<observ>\d+\.*\d*)?[\t ]*(?P<prev_no_rain>\d+(?:\.\d+)*)?[\t ]*(?P<prev_10_dry>\d+\.*\d*)?[\t ]*(?P<prev_20_dry>\d+\.*\d*)?[\t ]*(?P<prev_50>\d+\.*\d*)?[\t ]*(?P<prev_20_wet>\d+\.*\d*)?[\t ]*(?P<prev_10_wet>\d+\.*\d*)?, is your pattern, which I did not modify much. It matches a single line of data, which is why [\r\n]* is needed after \G(?!\A): it consumes the line break(s) before the next line.
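Note that Python's built-in re module supports neither \G nor \K, so the pattern above cannot be used there as-is. The chaining idea can still be sketched with Pattern.match(text, pos), which only succeeds exactly at pos and so plays the role of \G. The names START_RE, ROW_RE, BREAKS_RE and rows_after_token2 are hypothetical, and ROW_RE is a simplified stand-in for the full named-group pattern.

```python
import re

# Start of the chain: "Token2" followed by the line break(s) ending its line.
START_RE = re.compile(r"Token2[\r\n]+")
# Simplified stand-in for the data-line pattern: a date, then numeric values.
ROW_RE = re.compile(
    r"(?P<date>\d\d/\d\d/\d{4})(?P<values>(?:[\t ]+\d+(?:\.\d+)?)+)[\t ]*"
)
BREAKS_RE = re.compile(r"[\r\n]*")


def rows_after_token2(text):
    """Collect the consecutive data lines that directly follow the Token2 line."""
    start = START_RE.search(text)
    if start is None:
        return []
    rows = []
    pos = start.end()
    while True:
        # match(text, pos) must succeed exactly at pos, like \G: the next
        # row has to begin where the previous one ended.
        row = ROW_RE.match(text, pos)
        if row is None:
            break  # chain broken, e.g. by the "End : ..." line
        rows.append((row.group("date"), row.group("values").split()))
        # Skip the line break(s) between rows, like [\r\n]* after \G(?!\A).
        pos = BREAKS_RE.match(text, row.end()).end()
    return rows


sample = (
    "first block : Token1\n"
    "05/01/1969 0.01846\n"
    "End : bla bla bla\n"
    "second block : Token2\n"
    "05/01/1969 109.46333\n"
    "15/01/1969 110.06998 118.18000\n"
    "End : done\n"
    "25/12/2011 1.0\n"
)
result = rows_after_token2(sample)
print(result)
```

As with the \G-based regex, matching stops at the first line that is not a data line, so the last line of the sample is not collected even though it looks like data.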