I'm using a regex to parse structured text like the sample below, with caret symbols marking what I'm trying to match:
block 1
^^^^^^^
subblock 1.1
attrib a=a1
subblock 1.2
attrib b=b1
^^
block 2
subblock 2.1
attrib a=a2
block 3
^^^^^^^
subblock 3.1
attrib a=a3
subblock 3.2
attrib b=b3
^^
A subblock may or may not appear inside a block; for example, there is no subblock 2.2.
The expected match is [(block1, b1), (block3, b3)]. I tried the following pattern:
/(capture block#)[\s\S]*?attrib\sb=(capture b#)/gm
But this ends up matching [(block1, b1), (block2, b3)].
What is wrong with my regex?
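The behaviour described in the question can be reproduced with a short script. This is only a sketch: the question gives placeholders (`(capture block#)`, `(capture b#)`) rather than a real pattern, so the concrete pattern below is an assumption, and the names `text` and `naive` are made up for illustration.

```python
import re

# Sample text from the question (indentation omitted).
text = """block 1
subblock 1.1
attrib a=a1
subblock 1.2
attrib b=b1
block 2
subblock 2.1
attrib a=a2
block 3
subblock 3.1
attrib a=a3
subblock 3.2
attrib b=b3"""

# Assumed concrete form of the pattern from the question.
naive = re.compile(r'(block\s*\d+)[\s\S]*?attrib\sb=(\w+)')

print(naive.findall(text))
# [('block 1', 'b1'), ('block 2', 'b3')]
```

The lazy `[\s\S]*?` only guarantees that the nearest `attrib b=` is chosen. Nothing stops it from running past the `block 3` line, so `block 2` gets paired with `b3`.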
2 Answers
#1
2
You can use
(?m)(^block\s*\d+).*(?:\n(?!block\s*\d).*)*\battrib\s*b=(\w+)
See the regex demo
The regex is based on the unroll-the-loop technique. Here is an explanation:
- `(?m)` - multiline modifier to make `^` match the beginning of a line
- `(^block\s*\d+)` - match and capture `block` + optional whitespace(s) + 1+ digits (Group 1)
- `.*` - matches the rest of the line (as no DOTALL option should be on)
- `(?:\n(?!block\s*\d).*)*` - match any following lines that do not start with the word `block` followed by optional whitespace(s) and a digit (this way, a boundary is set)
- `\battrib\s*b=(\w+)` - match the whole word `attrib` followed by 0+ whitespaces and a literal `b=`, then match and capture 1+ alphanumerics or underscores with `(\w+)` (note: this can be adjusted as per your real data)
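import re

# (?m) lets ^ match at the start of every line, not only at the start of the string.
p = re.compile(r'(?m)(^block\s*\d+).*(?:\n(?!block\s*\d).*)*\battrib\s*b=(\w+)')
s = "block 1\n subblock 1.1\n attrib a=a1\n subblock 1.2\n attrib b=b1\nblock 2\n subblock 2.1\n attrib a=a2\nblock 3\n subblock 3.1\n attrib a=a3\n subblock 3.2\n attrib b=b3"
# findall returns one (block, b value) tuple per block that contains "attrib b=".
print(p.findall(s))  # [('block 1', 'b1'), ('block 3', 'b3')]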
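For readability, the same pattern can also be written in verbose mode with each part commented. This is a sketch only: the names `BLOCK_WITH_B`, `blocks_with_b` and `sample` are hypothetical, and the flags are passed as arguments instead of the inline `(?m)`.

```python
import re

# Same regex as above, spread out with re.VERBOSE so each part can be commented.
BLOCK_WITH_B = re.compile(r"""
    (^block\s*\d+)          # Group 1: block header at the start of a line
    .*                      # rest of the header line
    (?:                     # any number of following lines...
        \n(?!block\s*\d)    # ...that do not begin a new block
        .*
    )*
    \battrib\s*b=(\w+)      # Group 2: value of attrib b= inside the same block
""", re.MULTILINE | re.VERBOSE)


def blocks_with_b(text):
    """Return (block header, b value) pairs for blocks containing attrib b=."""
    return [(m.group(1), m.group(2)) for m in BLOCK_WITH_B.finditer(text)]


sample = (
    "block 1\n subblock 1.1\n attrib a=a1\n subblock 1.2\n attrib b=b1\n"
    "block 2\n subblock 2.1\n attrib a=a2\n"
    "block 3\n subblock 3.1\n attrib a=a3\n subblock 3.2\n attrib b=b3"
)
print(blocks_with_b(sample))
# [('block 1', 'b1'), ('block 3', 'b3')]
```

Because the negative lookahead refuses any line that starts a new block, a block without `attrib b=` (here `block 2`) simply produces no match instead of borrowing a value from a later block.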
#2
0
What about this regex? https://regex101.com/r/yZ4fL9/1
block (\d).*?attrib b=b(\1)
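A quick way to try this pattern in Python is sketched below. Two things are assumptions rather than part of the answer: the `re.DOTALL` flag (without it `.` cannot cross line breaks, so the pattern would find nothing in multi-line input) and the variable names `s` and `backref`.

```python
import re

s = (
    "block 1\n subblock 1.1\n attrib a=a1\n subblock 1.2\n attrib b=b1\n"
    "block 2\n subblock 2.1\n attrib a=a2\n"
    "block 3\n subblock 3.1\n attrib a=a3\n subblock 3.2\n attrib b=b3"
)

# \1 requires the digit after "b=b" to equal the captured block number.
# re.DOTALL is assumed here so that .*? can span several lines.
backref = re.compile(r'block (\d).*?attrib b=b(\1)', re.DOTALL)

print(backref.findall(s))
# [('1', '1'), ('3', '3')]
```

This approach depends on the value of `b` carrying the same single digit as its block number, as it does in the sample. It does not stop at block boundaries by itself.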