How do I match an unknown number of groups?

Time: 2022-12-01 16:50:05

I want to do a regex match (in Python) on the output log of a program. The log contains some lines that look like this:

... 
VALUE 100 234 568 9233 119
... 
VALUE 101 124 9223 4329 1559
...

I would like to capture the list of numbers that follows the first occurrence of a line starting with VALUE; that is, I want it to return ('100','234','568','9233','119'). The problem is that I do not know in advance how many numbers there will be.

I tried to use this as a regex:

VALUE (?:(\d+)\s)+

This matches the line, but it only captures the last value, so I just get ('119',).
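
The behaviour described above can be reproduced with a minimal sketch. It assumes each log line ends with a newline (so the final \s has something to match); the variable names are illustrative only. In Python's re module, a capturing group inside a repetition keeps only the text from its last iteration.

```python
import re

# Sample line with a trailing newline, as it would look when read from a log file.
line = "VALUE 100 234 568 9233 119\n"

m = re.match(r"VALUE (?:(\d+)\s)+", line)
# The single capturing group is overwritten on every repetition,
# so only the final number survives.
last_only = m.groups()
print(last_only)  # ('119',)
```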

5 answers

#1


17  

What you're looking for is a parser, instead of a regular expression match. In your case, I would consider using a very simple parser, split():

s = "VALUE 100 234 568 9233 119"
a = s.split()
if a[0] == "VALUE":
    print([int(x) for x in a[1:]])

You can first use a regular expression to check that the input line matches the expected format (using the regex in your question). Then you can run the code above without checking for "VALUE", knowing that the int(x) conversion will always succeed, since you've already confirmed that the character groups that follow are all digits.
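
As a rough sketch of that validate-then-split idea: the pattern below is a simplified variant of the question's regex, and it assumes exactly one space between items. The names VALUE_LINE and parse_value_line are hypothetical, chosen only for illustration.

```python
import re

# Assumed line format: the keyword VALUE followed by one or more
# space-separated runs of digits. The pattern and names are illustrative.
VALUE_LINE = re.compile(r"VALUE(?: \d+)+")

def parse_value_line(line):
    """Return the integers on a VALUE line, or None for any other line."""
    if VALUE_LINE.fullmatch(line.strip()) is None:
        return None
    # Validation passed, so every token after the keyword is all digits.
    return [int(x) for x in line.split()[1:]]

print(parse_value_line("VALUE 100 234 568 9233 119"))
print(parse_value_line("VALUE 100 abc"))
```

Lines in any other format come back as None, so the caller can skip them without catching a ValueError from int().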

#2


9  

>>> import re
>>> reg = re.compile(r'\d+')
>>> reg.findall('VALUE 100 234 568 9233 119')
['100', '234', '568', '9233', '119']

That doesn't validate that the keyword 'VALUE' appears at the beginning of the string, and it doesn't validate that there is exactly one space between items, but if you can do that as a separate step (or if you don't need to do that at all), then it will find all digit sequences in any string.
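
One possible way to do the keyword check as a separate step is sketched below. It assumes the log is available as a single string and that only the first VALUE line is wanted, as in the question; the variable names are illustrative.

```python
import re

log = """...
VALUE 100 234 568 9233 119
...
VALUE 101 124 9223 4329 1559
...
"""

# Separate step: locate the first line that starts with the keyword.
first = next((ln for ln in log.splitlines() if ln.startswith("VALUE ")), None)

# Then pull every digit sequence out of that one line.
numbers = tuple(re.findall(r"\d+", first)) if first is not None else ()
print(numbers)  # ('100', '234', '568', '9233', '119')
```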

#3


2  

You could just run your main match regex, then run a secondary regex on those matches to get the numbers:

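import re

log = "VALUE 100 234 568 9233 119\nVALUE 101 124 9223 4329 1559\n"

# First pass: find each VALUE line. Second pass: pull the numbers out of it.
for match in re.finditer(r"^VALUE(?: \d+)+$", log, re.MULTILINE):
    submatches = re.findall(r"\d+", match.group(0))
    print(submatches)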

This also works, of course, if you don't want to write a full parser.

#4


2  

Another option not described here is to have a bunch of optional capturing groups.

VALUE *(\d+)? *(\d+)? *(\d+)? *(\d+)? *(\d+)? *$

This regex captures up to 5 digit groups separated by spaces. If you need more potential groups, just copy and paste more *(\d+)? blocks.
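
A small sketch of how this pattern behaves in Python, using a shorter sample line than the one in the question. Optional groups that do not take part in the match are returned as None, so they need to be filtered out.

```python
import re

pattern = re.compile(r"VALUE *(\d+)? *(\d+)? *(\d+)? *(\d+)? *(\d+)? *$")

m = pattern.match("VALUE 100 234 568")
# Groups that did not participate in the match come back as None.
found = tuple(g for g in m.groups() if g is not None)
print(found)  # ('100', '234', '568')
```

A line with more numbers than there are groups does not match at all, so the number of blocks has to cover the longest line you expect.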

#5


0  

I had this same problem and my solution was to use two regular expressions: the first one to match the whole group I'm interested in and the second one to parse the sub groups. For example in this case, I'd start with this:

VALUE((\s\d+)+)

This should yield three groups: [0] the whole match, [1] everything after VALUE, and [2] the last space+value.

[0] and [2] can be ignored and then [1] can be used with the following:

\s(\d+)

Note: these regexps were not tested, but I hope you get the idea.
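
For what it's worth, here is a short sketch of the two-pass approach in Python, using the two patterns above on the sample line from the question. The variable names are illustrative.

```python
import re

line = "VALUE 100 234 568 9233 119"

# First pass: group(1) is everything after VALUE,
# group(2) holds only the last repetition (" 119").
outer = re.match(r"VALUE((\s\d+)+)", line)
tail = outer.group(1)

# Second pass: pick the individual numbers out of group(1).
numbers = re.findall(r"\s(\d+)", tail)
print(numbers)  # ['100', '234', '568', '9233', '119']
```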


Greg's answer doesn't work for me because the second part of my parsing is more complicated than just some numbers separated by spaces.

However, I would honestly go with Greg's solution for this question (it's probably way more efficient).

I'm just writing this answer in case someone is looking for a more sophisticated solution like I needed.
