Capture a pattern, but ignore it inside quotes

Time: 2020-12-16 22:30:06

So, what I need to do with a C# regex is split a string wherever a certain pattern occurs, but ignore that pattern when it is surrounded by double quotes in the string.

Example:

    string text = "abc , def , a\" , \"d , oioi";
    string pattern = "[ \t]*,[ \t]*";

    string[] result = Regex.Split(text, pattern, RegexOptions.ECMAScript);

Wanted result after split (3 splits, 4 strings):

    {"abc",
     "def",
     "a\" , \"d",
     "oioi"}

Actual result (4 splits, 5 strings):

    {"abc",
     "def",
     "a\"",
     "\"d",
     "oioi"}

Another example:

    string text = "a%2% 6y % \"ad%t6%&\" %(7y) %";
    string pattern = "%";

    string[] result = Regex.Split(text, pattern, RegexOptions.ECMAScript);

Wanted result after split (5 splits, 6 strings):

    {"a",
     "2",
     " 6y ",
     " \"ad%t6%&\" ",
     "(7y) ",
     ""}

Actual result (7 splits, 8 strings):

    {"a",
     "2",
     " 6y ",
     "\"ad",
     "t6",
     "&\" ",
     "(7y) ",
     ""}

A third example, illustrating a tricky split where only the first case should be ignored:

    string text = "!!\"!!\"!!\"";
    string pattern = "!!";

    string[] result = Regex.Split(text, pattern, RegexOptions.ECMAScript);

Wanted result after split (2 splits, 3 strings):

    {"",
     "\"!!\"",
     "\""}

Actual result (3 splits, 4 strings):

    {"",
     "\"",
     "\"",
     "\"",}

So, how do I move from pattern to a new pattern that achieves the desired result?

Sidenote: If you're going to mark someone's question as duplicate (and I have nothing against that), at least point them to the right answer, not to some random post (yes, I'm looking at you, Mr. Avinash Raj)...

2 solutions

#1


2  

The rules are more or less those of a CSV line, except that:

  • the delimiter can be a single character, but it can also be a string or a pattern (in the latter cases, items must be trimmed if they start or end with the last or first possible tokens of the delimiter pattern),

  • an orphan quote is allowed in the last item.

First, when you want to separate items (to split) using slightly advanced rules, the split method is no longer a good choice. It is only handy for simple situations, not for your case. (Even without orphan quotes, splitting with ,(?=(?:[^"]*"[^"]*")*[^"]*$) is a very bad idea, since the lookahead rescans the rest of the string at every candidate comma, so the number of steps needed to parse the string grows much faster than the string size.)
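
For comparison, the lookahead-based split mentioned above might look like the sketch below. The names `LookaheadSplitDemo` and `Split` are made up for illustration, and the delimiter `\s*,\s*` is an assumption borrowed from the example further down. It is shown only to make the limitation concrete, not as a recommended solution.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class LookaheadSplitDemo
{
    // Delimiter \s*,\s* followed by a lookahead that requires an even number
    // of double quotes between the delimiter and the end of the string.
    private const string Pattern = @"\s*,\s*(?=(?:[^""]*""[^""]*"")*[^""]*$)";

    public static string[] Split(string text)
    {
        // The pattern has no capturing groups, so Regex.Split returns only the items.
        return Regex.Split(text, Pattern);
    }
}
```

With balanced quotes, `LookaheadSplitDemo.Split("abc , def , a\" , \"d , oioi")` yields the four wanted items. With an orphan quote, as in `a , "b , c`, the lookahead fails for the comma before the quote and succeeds for the one after it, so the result is `a , "b` and `c`, which is not what the rules above ask for.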

The other approach is to capture the items. It is simpler and faster. (Bonus: it checks the format of the whole string at the same time.)

Here is a general way to do it:

    ^
    (?>
      (?:delimiter | start_of_the_string)
      (
          simple_part
          (?>
              (?: quotes | delim_first_letter_1 | delim_first_letter_2 | etc. )
              simple_part
          )*
      )
    )+
    $

Example with \s*,\s* as delimiter:

    ^
    # atomic group for one delimiter and one item
    (?>
        (?: \s*,\s* | ^ ) # delimiter or start of the string
                          # (optionally change "^" to "^ \s*" to trim the first item)

        # capture group 1 for the item
        (   # simple part of the item (may be empty):
            [^\s,"]* # anything that is neither the quote character nor one of the
                     # possible first characters of the delimiter
            # edge case followed by a simple part
            (?>
                (?: # edge cases
                    " [^"]* (?:"|$) # a quoted part or an orphan quote in the last item (*)
                  |   # OR
                    (?> \s+ ) # start of the delimiter
                    (?!,)     # but not the delimiter
                )

                [^\s,"]* # simple part
            )*
        )
    )+
    $

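The pattern is designed for the Regex.Match method, since it describes the whole string. All items are available in group 1, since the .NET regex flavor stores every capture of a repeated capture group. As written, with whitespace and # comments, the pattern needs RegexOptions.IgnorePatternWhitespace.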

This example can be easily adapted to all cases.
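
As a minimal sketch of how the pattern above could be used from C#, the snippet below runs it with Regex.Match and reads every capture of group 1. The names `QuoteAwareSplitter`, `ItemPattern` and `Split` are hypothetical, and throwing a FormatException on a failed match is an assumption about how one might want to report malformed input.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the answer's pattern for the delimiter \s*,\s*.
public static class QuoteAwareSplitter
{
    // Inside a verbatim string, "" stands for one double quote.
    private static readonly Regex ItemPattern = new Regex(@"
        ^
        (?>
            (?: \s*,\s* | ^ )                # delimiter or start of the string
            (                                # group 1: one item
                [^\s,""]*                    # simple part
                (?>
                    (?: "" [^""]* (?:""|$)   # quoted part or orphan quote
                      | (?> \s+ ) (?!,)      # whitespace that does not start a delimiter
                    )
                    [^\s,""]*                # simple part
                )*
            )
        )+
        $",
        RegexOptions.IgnorePatternWhitespace);

    public static string[] Split(string text)
    {
        Match match = ItemPattern.Match(text);
        if (!match.Success)
        {
            // Assumption: treat a non-matching string as malformed input.
            throw new FormatException("Input does not match the expected item format.");
        }

        // Group 1 is repeated, so all items are in its capture collection.
        return match.Groups[1].Captures
                    .Cast<Capture>()
                    .Select(c => c.Value)
                    .ToArray();
    }
}
```

Called on `abc , def , a" , "d , oioi`, this returns the four items `abc`, `def`, `a" , "d` and `oioi`. Called on `a , "b , c`, it returns `a` and `"b , c`, because the orphan quote is accepted in the last item.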

(*) If you want to allow escaped quotes inside quoted parts, you can apply the same simple_part (?: edge_case simple_part)* construction once more in place of " [^"]* (?:"|$),
i.e.: "[^\\"]* (?: \\. [^\\"]*)* (?:"|$)
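
The escaped-quote variant can be sketched the same way. Only the quoted-part branch changes; the class name `EscapedQuoteSplitter` and its `Split` method are hypothetical, and a backslash is assumed to be the escape character, as in the footnote.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical variant: quoted parts may contain backslash-escaped characters.
public static class EscapedQuoteSplitter
{
    private static readonly Regex ItemPattern = new Regex(@"
        ^
        (?>
            (?: \s*,\s* | ^ )                      # delimiter or start of the string
            (                                      # group 1: one item
                [^\s,""]*                          # simple part
                (?>
                    (?: "" [^\\""]*                # opening quote, plain characters
                           (?: \\. [^\\""]* )*     # escaped character, plain characters
                           (?:""|$)                # closing quote or end of the string
                      | (?> \s+ ) (?!,)            # whitespace that does not start a delimiter
                    )
                    [^\s,""]*                      # simple part
                )*
            )
        )+
        $",
        RegexOptions.IgnorePatternWhitespace);

    public static string[] Split(string text)
    {
        Match match = ItemPattern.Match(text);
        if (!match.Success)
        {
            // Assumption: treat a non-matching string as malformed input.
            throw new FormatException("Input does not match the expected item format.");
        }

        return match.Groups[1].Captures
                    .Cast<Capture>()
                    .Select(c => c.Value)
                    .ToArray();
    }
}
```

For the input `x , "a \" , b" , y`, the escaped quote no longer ends the quoted part, so the items are `x`, `"a \" , b"` and `y`.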

#2


0  

I think this is a two-step process, and trying to make it a one-step regex is overthinking it.


Steps

  1. Simply remove any quotes from the string.

  2. Split on the target character(s).

Example of Process

I will split on the comma (,) for step 2.

    using System.Linq;
    using System.Text.RegularExpressions;

    var data = string.Format("abc , def , a{0}, {0}d , oioi", "\"");

    // `\x22` is the hex escape for a double quote ("), used here for easier reading in C# source.
    var stage1 = Regex.Replace(data, @"\x22", string.Empty);

    // abc , def , a", "d , oioi
    // becomes
    // abc , def , a, d , oioi

    // Capture each run of characters that is neither whitespace nor a comma.
    var items = Regex.Matches(stage1, @"([^\s,]+)[\s,]*")
                     .OfType<Match>()
                     .Select(mt => mt.Groups[1].Value)
                     .ToList();

Result

    {"abc",
     "def",
     "a",
     "d",
     "oioi"}
