Regex with non-capturing groups in C#

Date: 2021-01-22 22:33:11

I am using the following Regex

JOINTS.*\s*(?:(\d*\s*\S*\s*\S*\s*\S*)\r\n\s*)*

on the following type of data:

 JOINTS               DISPL.-X               DISPL.-Y               ROTATION


     1            0.000000E+00           0.975415E+01           0.616921E+01
     2            0.000000E+00           0.000000E+00           0.000000E+00

The idea is to extract two groups, each containing one line (starting with the joint number: 1, 2, etc.). The C# code is as follows:

string jointPattern = @"JOINTS.*\s*(?:(\d*\s*\S*\s*\S*\s*\S*)\r\n\s*)*";
MatchCollection mc = Regex.Matches(outFileSection, jointPattern );
foreach (Capture c in mc[0].Captures)
{
    JointOutput j = new JointOutput();
    string[] vals = c.Value.Split();
    j.Joint = int.Parse(vals[0]) - 1;
    j.XDisplacement = float.Parse(vals[1]);
    j.YDisplacement = float.Parse(vals[2]);
    j.Rotation = float.Parse(vals[3]);
    joints.Add(j);
}

However, this does not work: rather than returning two captured groups (the inside group), it returns one group: the entire block, including the column headers. Why does this happen? Does C# deal with un-captured groups differently?

Finally, are RegExes the best way to do this? (I really do feel like I have two problems now.)

4 Answers

#1


8  

mc[0].Captures is equivalent to mc[0].Groups[0].Captures. Groups[0] always refers to the whole match, so there will only ever be the one Capture associated with it. The part you're looking for is captured in group #1, so you should be using mc[0].Groups[1].Captures.

But your regex is designed to match the whole input in one attempt, so the Matches() method will always return a MatchCollection with only one Match in it (assuming the match is successful). You might as well use Match() instead:

  Match m = Regex.Match(source, jointPattern);
  if (m.Success)
  {
    foreach (Capture c in m.Groups[1].Captures)
    {
      Console.WriteLine(c.Value);
    }
  }

output:

1            0.000000E+00           0.975415E+01           0.616921E+01
2            0.000000E+00           0.000000E+00           0.000000E+00
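
To go from the captured lines to numbers, something like the following sketch might work. The class and method names are hypothetical, the tuple stands in for the question's JointOutput class, and the joint number is returned as-is (the question subtracts 1 from it). It assumes CRLF line endings, including one after the last row, because the question's pattern requires \r\n after every row. RemoveEmptyEntries is used because the columns are separated by runs of spaces.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class JointCaptures
{
    // Pattern from the question, unchanged.
    const string JointPattern = @"JOINTS.*\s*(?:(\d*\s*\S*\s*\S*\s*\S*)\r\n\s*)*";

    // Hypothetical helper: returns one tuple per captured data row.
    public static List<(int Joint, double X, double Y, double Rotation)> Parse(string source)
    {
        var rows = new List<(int Joint, double X, double Y, double Rotation)>();
        Match m = Regex.Match(source, JointPattern);
        if (!m.Success) return rows;

        // Groups[1] holds one Capture per repetition of the inner group.
        foreach (Capture c in m.Groups[1].Captures)
        {
            // Columns are separated by runs of spaces, so drop empty entries.
            string[] vals = c.Value.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
            if (vals.Length < 4) continue; // skip anything that is not a full row

            // Invariant culture so "0.975415E+01" parses regardless of locale.
            rows.Add((
                int.Parse(vals[0], CultureInfo.InvariantCulture),
                double.Parse(vals[1], NumberStyles.Float, CultureInfo.InvariantCulture),
                double.Parse(vals[2], NumberStyles.Float, CultureInfo.InvariantCulture),
                double.Parse(vals[3], NumberStyles.Float, CultureInfo.InvariantCulture)));
        }
        return rows;
    }
}
```

If the input uses bare \n line endings, this pattern captures nothing, so normalizing the line endings first (or relaxing the \r\n in the pattern) would be needed.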

#2


2  

I would not use Regex for the heavy lifting; I would just split the text into lines and parse those.

var data = @"     JOINTS               DISPL.-X               DISPL.-Y               ROTATION


         1            0.000000E+00           0.975415E+01           0.616921E+01
         2            0.000000E+00           0.000000E+00           0.000000E+00";

var lines = data.Split('\r', '\n').Where(s => !string.IsNullOrWhiteSpace(s));
var regex = new Regex(@"(\S+)");

var dataItems = lines.Select(s => regex.Matches(s)).Select(m => m.Cast<Match>().Select(c => c.Value));
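
The result above still contains the header row, and every item is a string. A possible way to finish the job is sketched below. The class and method names are hypothetical, and the rule "a data row has four tokens and starts with an integer" is an assumption about the file format rather than something the answer states.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

public static class JointLines
{
    // Hypothetical helper: returns { joint, x, y, rotation } for each data row.
    public static List<double[]> Parse(string data)
    {
        var token = new Regex(@"\S+");
        return data.Split('\r', '\n')
            .Where(s => !string.IsNullOrWhiteSpace(s))
            // Tokenize each non-blank line on whitespace.
            .Select(s => token.Matches(s).Cast<Match>().Select(t => t.Value).ToArray())
            // Assumption: data rows have 4 tokens and start with an integer; this drops the header.
            .Where(t => t.Length == 4 &&
                        int.TryParse(t[0], NumberStyles.None, CultureInfo.InvariantCulture, out _))
            // Invariant culture so the exponent notation parses regardless of locale.
            .Select(t => t.Select(v => double.Parse(v, NumberStyles.Float, CultureInfo.InvariantCulture)).ToArray())
            .ToList();
    }
}
```

Because the text is split on both '\r' and '\n', this version does not depend on which line-ending convention the file uses.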

#3


1  

Why not just capture the values and ignore the rest? Here is a regex that gets the values.

string data = @"JOINTS DISPL.-X DISPL.-Y ROTATION
 1 0.000000E+00 0.975415E+01 0.616921E+01
 2 0.000000E+00 0.000000E+00 0.000000E+00";

string pattern = @"^
\s+
 (?<Joint>\d+)
\s+
 (?<ValX>[^\s]+)
\s+
 (?<ValY>[^\s]+)
\s+
 (?<Rotation>[^\s]+)";

var result = Regex.Matches(data, pattern, RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture)
                  .OfType<Match>()
                  .Select (mt => new
                  {
                    Joint = mt.Groups["Joint"].Value,
                    ValX  = mt.Groups["ValX"].Value,
                    ValY  = mt.Groups["ValY"].Value,
                    Rotation = mt.Groups["Rotation"].Value,
                  });
/* result is
IEnumerable<> (2 items)
Joint ValX ValY Rotation
1 0.000000E+00 0.975415E+01 0.616921E+01
2 0.000000E+00 0.000000E+00 0.000000E+00
*/

#4


1  

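There are two problems: the repeating part (?:...) is not matching properly; and the .* is greedy, so it consumes as much input as it can (all of it, if . is allowed to match newlines, e.g. with RegexOptions.Singleline), which can leave nothing for the repeating part to match.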

Use this instead:

JOINTS.*?[\r\n]+(?:\s*(\d+\s*\S*\s*\S*\s*\S*)[\r\n\s]*)*

This has a non-greedy leading part, ensures that the line-matching part starts on a new line (not in the middle of a title), and uses [\r\n\s]* in case the newlines are not exactly as you expect.
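
As a rough illustration of how this pattern could be used from C#, the sketch below reads the data rows out of Groups[1].Captures. The class and method names are hypothetical; only the pattern itself comes from this answer.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RevisedJointPattern
{
    // Pattern from this answer, unchanged.
    const string Pattern = @"JOINTS.*?[\r\n]+(?:\s*(\d+\s*\S*\s*\S*\s*\S*)[\r\n\s]*)*";

    // Hypothetical helper: returns the text of each captured data row.
    public static List<string> DataLines(string source)
    {
        var lines = new List<string>();
        Match m = Regex.Match(source, Pattern);
        if (m.Success)
        {
            // One Capture per repetition of the inner (capturing) group.
            foreach (Capture c in m.Groups[1].Captures)
            {
                lines.Add(c.Value);
            }
        }
        return lines;
    }
}
```

Since the row separator is [\r\n\s]* rather than a literal \r\n, this should behave the same for CRLF and LF input, with or without a trailing newline.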

Personally, I would use regexes for this, but I like regexes :-) If you happen to know that the structure of the string will always be [title][newline][newline][lines] then perhaps it's more straightforward (if less flexible) to just split on newlines and process accordingly.

Finally, you can use regex101.com or one of the many other regex testing sites to help debug your regular expressions.
