.NET offers a Capture collection in its RegularExpression implementation so you can get all instances of a given repeating group rather than just the last instance of it. That's great, but I have a repeating group with subgroups and I'm trying to get at the subgroups as they are related under the group, and can't find a way. Any suggestions?
I've looked at a number of other questions, e.g.:
- Select multiple elements in a regular expression
- Regex .NET attached named group
- How can I get the Regex Groups for a given Capture?
but I have found no applicable answer either affirmative ("Yep, here's how") or negative ("Nope, can't be done.").
For a contrived example say I have an input string:
abc d x 1 2 x 3 x 5 6 e fgh
where the "abc" and "fgh" represent text that I want to ignore in the larger document, "d" and "e" wrap the area of interest, and within that area of interest, "x n [n]" can repeat any number of times. It's those number pairs in the "x" areas that I'm interested in.
So I'm parsing it using this regular expression pattern:
.*d (?<x>x ((?<fir>\d+) )?((?<sec>\d+) )?)*?e.*
which will find exactly one match in the document, but capture the "x" group many times. Here are the three pairs I would want to extract in this example:
- 1, 2
- 3
- 5, 6
but how can I get them? I could do the following (in C#):
using System;
using System.Text;
using System.Text.RegularExpressions;
using System.Windows.Forms; // needed for MessageBox

string input = "abc d x 1 2 x 3 x 5 6 e fgh";
string pattern = @".*d (?<x>x ((?<fir>\d+) )?((?<sec>\d+) )?)*?e.*";
foreach (var x in Regex.Match(input, pattern).Groups["x"].Captures) {
    MessageBox.Show(x.ToString());
}
and since I'm referencing group "x" I get these strings:
- x 1 2
- x 3
- x 5 6
But that doesn't get me at the numbers themselves. So I could do "fir" and "sec" independently instead of just "x":
using System;
using System.Text;
using System.Text.RegularExpressions;
using System.Windows.Forms; // needed for MessageBox

string input = "abc d x 1 2 x 3 x 5 6 e fgh";
string pattern = @".*d (?<x>x ((?<fir>\d+) )?((?<sec>\d+) )?)*?e.*";
Match m = Regex.Match(input, pattern);
foreach (var f in m.Groups["fir"].Captures) {
    MessageBox.Show(f.ToString());
}
foreach (var s in m.Groups["sec"].Captures) {
    MessageBox.Show(s.ToString());
}
to get:
- 1
- 3
- 5
- 2
- 6
but then I have no way of knowing that it's the second pair that's missing the "4", and not one of the other pairs.
So what to do? I know I could easily parse this out in C# or even with a second regex test on the "x" group, but since the first RegEx run has already done all the work and the results ARE known, it seems there ought to be a way to manipulate the Match object to get what I need out of it.
And remember, this is a contrived example, the real world case is somewhat more complex so just throwing extra C# code at it would be a pain. But if the existing .NET objects can't do it, then I just need to know that and I'll continue on my way.
Thoughts?
4 solutions
#1
5
I am not aware of a fully built-in solution and could not find one after a quick search, but that does not rule out the possibility that one exists.
My best suggestion is to use the Index and Length properties to find matching captures. It is not really elegant, but you might be able to come up with some quite nice code after writing some extension methods.
var input = "abc d x 1 2 x 3 x 5 6 e fgh";
var pattern = @".*d (?<x>x ((?<fir>\d+) )?((?<sec>\d+) )?)*?e.*";
var match = Regex.Match(input, pattern);
var xs = match.Groups["x"].Captures.Cast<Capture>();
var firs = match.Groups["fir"].Captures.Cast<Capture>();
var secs = match.Groups["sec"].Captures.Cast<Capture>();
Func<Capture, Capture, Boolean> test = (inner, outer) =>
    (inner.Index >= outer.Index) &&
    (inner.Index < outer.Index + outer.Length);
var result = xs.Select(x => new
                            {
                                Fir = firs.FirstOrDefault(f => test(f, x)),
                                Sec = secs.FirstOrDefault(s => test(s, x))
                            })
               .ToList();
Here is one possible solution using the following extension method.
internal static class Extensions
{
    // Returns the captures of groupName that start inside the span of the given capture.
    internal static IEnumerable<Capture> GetCapturesInside(this Match match,
        Capture capture, String groupName)
    {
        var start = capture.Index;
        var end = capture.Index + capture.Length;
        return match.Groups[groupName]
                    .Captures
                    .Cast<Capture>()
                    .Where(inner => (inner.Index >= start) &&
                                    (inner.Index < end));
    }
}
Now you can rewrite the code as follows.
var input = "abc d x 1 2 x 3 x 5 6 e fgh";
var pattern = @".*d (?<x>x ((?<fir>\d+) )?((?<sec>\d+) )?)*?e.*";
var match = Regex.Match(input, pattern);
foreach (Capture x in match.Groups["x"].Captures)
{
    var fir = match.GetCapturesInside(x, "fir").SingleOrDefault();
    var sec = match.GetCapturesInside(x, "sec").SingleOrDefault();
}
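To make the Index/Length idea concrete, here is a minimal self-contained sketch that returns the pairs in order. The class name PairExtractor and the helper FindInside are made up for illustration and are not part of .NET; the pattern is the one from the question, and a missing number is represented as null.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class (name chosen for this sketch only).
public static class PairExtractor
{
    // Returns one (Fir, Sec) tuple per capture of group "x"; a missing number is null.
    public static List<(string Fir, string Sec)> ExtractPairs(string input)
    {
        const string pattern = @".*d (?<x>x ((?<fir>\d+) )?((?<sec>\d+) )?)*?e.*";
        var pairs = new List<(string Fir, string Sec)>();

        Match match = Regex.Match(input, pattern);
        if (!match.Success)
        {
            return pairs;
        }

        foreach (Capture x in match.Groups["x"].Captures)
        {
            pairs.Add((FindInside(match, "fir", x), FindInside(match, "sec", x)));
        }
        return pairs;
    }

    // Value of the first capture of groupName that starts within outer's span,
    // or null if no such capture exists.
    private static string FindInside(Match match, string groupName, Capture outer)
    {
        return match.Groups[groupName].Captures
            .Cast<Capture>()
            .Where(c => c.Index >= outer.Index && c.Index < outer.Index + outer.Length)
            .Select(c => c.Value)
            .FirstOrDefault();
    }
}
```

Calling PairExtractor.ExtractPairs("abc d x 1 2 x 3 x 5 6 e fgh") should give three tuples, with Sec being null for the second one, which is exactly the information the separate "fir" and "sec" loops lose.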
#2
3
Will it always be a pair versus single? You could use separate capture groups. Of course, you lose the order of items with this method.
var input = "abc d x 1 2 x 3 x 5 6 e fgh";
var re = new Regex(@"d\s(?<x>x\s((?<pair>\d+\s\d+)|(?<single>\d+))\s)*e");
var m = re.Match(input);
foreach (Capture s in m.Groups["pair"].Captures)
{
    Console.WriteLine(s.Value);
}
foreach (Capture s in m.Groups["single"].Captures)
{
    Console.WriteLine(s.Value);
}
Output:
1 2
5 6
3
If you need the order, I'd probably go with Blam's suggestion to use a second regular expression.
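The suggestion referred to above is not reproduced on this page, so the following is only a guess at what a second-regex approach could look like: match once with the question's pattern, then run a small \d+ regex over each capture of "x". The class name SecondPassExtractor is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class (name chosen for this sketch only).
public static class SecondPassExtractor
{
    // First pass: the pattern from the question, which captures each "x ..." block.
    private static readonly Regex Outer =
        new Regex(@".*d (?<x>x ((?<fir>\d+) )?((?<sec>\d+) )?)*?e.*");

    // Second pass: pulls the individual numbers out of one "x ..." block.
    private static readonly Regex Number = new Regex(@"\d+");

    // Returns the numbers of each "x" block, in document order.
    public static List<List<string>> ExtractGroups(string input)
    {
        var groups = new List<List<string>>();

        Match match = Outer.Match(input);
        if (!match.Success)
        {
            return groups;
        }

        foreach (Capture x in match.Groups["x"].Captures)
        {
            groups.Add(Number.Matches(x.Value)
                             .Cast<Match>()
                             .Select(m => m.Value)
                             .ToList());
        }
        return groups;
    }
}
```

This keeps the order of the blocks at the cost of re-scanning each captured substring, which is cheap for short blocks like these.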
#3
2
I suggest you look into balancing groups, a feature unique to the .NET regex engine.
Here is a regex that matches each x together with the numbers that follow it, so that each match ends when the next non-digit token (another x or the closing e) is reached. The numbers are then read from the captures of each match:
string data = "abc d x 1 2 x 3 x 5 6 e fgh";
string pattern =
    @"(?xn)                  # Specify options in the pattern
                             # x - to comment (IgnorePatternWhitespace)
                             # n - Explicit Capture to ignore non named matches
      (?<X>x)                # Push the X on the balanced group
      ((\s)(?<Numbers>\d+))+ # Load up on any numbers into the capture group
      (?(Paren)(?!))         # No group named Paren is defined, so .NET treats this as a
                             # lookahead test for the literal text Paren; a no-op here";
var results = Regex.Matches(data, pattern)
                   .OfType<Match>()
                   .Select((mt, index) => string.Format("Match {0}: {1}",
                                              index,
                                              string.Join(", ",
                                                  mt.Groups["Numbers"]
                                                    .Captures
                                                    .OfType<Capture>()
                                                    .Select(cp => cp.Value))));
results.ToList()
       .ForEach(result => Console.WriteLine(result));
/* Output
Match 0: 1, 2
Match 1: 3
Match 2: 5, 6
*/
#4
1
I have seen OmegaMan's answer and know that you would prefer C# code to a regex solution, but I wanted to present one alternative anyway.
In .NET you can reuse named groups. Every time something is captured with that group, it is pushed onto that group's stack (that's what OmegaMan was referring to by "balancing groups"). You can use this to push an empty capture onto the stack for every x you find:
string pattern = @"d (?<x>x(?<d>) (?:(?<d>\d+) )*)*e";
So now, after matching x, the (?<d>) pushes an empty capture onto the stack. Here is the Console.WriteLine output (one line per capture, with each empty capture shown as "(empty)"):
(empty)
1
2
(empty)
3
(empty)
5
6
Hence, when you walk through Regex.Match(input, pattern).Groups["d"].Captures and take note of the empty strings, you know that each one marks the start of a new group of numbers.
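The walk described above can be sketched as follows. The class name EmptyMarkerGrouper is invented for this example; the pattern is the one from this answer, and the only logic added is starting a new list whenever a zero-length capture is seen.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class (name chosen for this sketch only).
public static class EmptyMarkerGrouper
{
    // Splits the captures of group "d" into one list per x,
    // using the empty captures as "new group starts here" markers.
    public static List<List<string>> GroupNumbers(string input)
    {
        const string pattern = @"d (?<x>x(?<d>) (?:(?<d>\d+) )*)*e";
        var groups = new List<List<string>>();

        Match match = Regex.Match(input, pattern);
        if (!match.Success)
        {
            return groups;
        }

        foreach (Capture c in match.Groups["d"].Captures)
        {
            if (c.Length == 0)
            {
                // Empty capture pushed by (?<d>) right after an x.
                groups.Add(new List<string>());
            }
            else if (groups.Count > 0)
            {
                // A number belonging to the most recently opened x.
                groups[groups.Count - 1].Add(c.Value);
            }
        }
        return groups;
    }
}
```

For the sample input this should yield three lists: 1 and 2, then 3, then 5 and 6. An x with no numbers after it would show up as an empty list rather than being dropped.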