In C#, I need to split a string with a regex wherever a certain pattern matches, but ignore any match that is surrounded by double quotes in the string.
Example:
string text = "abc , def , a\" , \"d , oioi";
string pattern = "[ \t]*,[ \t]*";
string[] result = Regex.Split(text, pattern, RegexOptions.ECMAScript);
Wanted result after split (3 splits, 4 strings):
{"abc",
"def",
"a\" , \"d",
"oioi"}
Actual result (4 splits, 5 strings):
{"abc",
"def",
"a\"",
"\"d",
"oioi"}
Another example:
string text = "a%2% 6y % \"ad%t6%&\" %(7y) %";
string pattern = "%";
string[] result = Regex.Split(text, pattern, RegexOptions.ECMAScript);
Wanted result after split (5 splits, 6 strings):
{"a",
"2",
" 6y ",
" \"ad%t6%&\" ",
"(7y) ",
""}
Actual result (7 splits, 8 strings):
{"a",
"2",
" 6y ",
"\"ad",
"t6",
"&\" ",
"(7y) ",
""}
A third example, showing a tricky split where only the first quoted occurrence should be ignored:
string text = "!!\"!!\"!!\"";
string pattern = "!!";
string[] result = Regex.Split(text, pattern, RegexOptions.ECMAScript);
Wanted result after split (2 splits, 3 strings):
{"",
"\"!!\"",
"\""}
Actual result (3 splits, 4 strings):
{"",
"\"",
"\"",
"\""}
So, how do I turn pattern into a new pattern that achieves the desired result?
Sidenote: If you're going to mark someone's question as duplicate (and I have nothing against that), at least point them to the right answer, not to some random post (yes, I'm looking at you, Mr. Avinash Raj)...
2 Answers
#1
2
The rules are more or less those of a CSV line, except that:
- the delimiter can be a single character, but also a string or a pattern (in the latter two cases, items must be trimmed if they start or end with the last or first possible tokens of the pattern delimiter),
- an orphan quote is allowed in the last item.
First, when you want to separate items using slightly more advanced rules, the split method is no longer a good choice. It is only handy for simple situations, not for your case. (Even without orphan quotes, splitting with ,(?=(?:[^"]*"[^"]*")*[^"]*$) is a very bad idea: every candidate delimiter makes the lookahead scan to the end of the string, so the number of steps needed to parse the string grows much faster than the string size.)
The other approach consists of capturing the items. It is simpler and faster. (Bonus: it checks the format of the whole string at the same time.)
Here is a general way to do it:
^
(?>
    (?: delimiter | start_of_the_string )
    (
        simple_part
        (?>
            (?: quotes | delim_first_letter_1 | delim_first_letter_2 | etc. )
            simple_part
        )*
    )
)+
$
Example with \s*,\s* as delimiter:
^
# atomic group for one delimiter and one item
(?>
    (?: \s*,\s* | ^ )   # delimiter or start of the string
                        # (optionally change "^" to "^ \s*" to trim the first item)
    # capture group 1 for the item
    (   # simple part of the item (may be empty):
        [^\s,"]*   # anything that is neither the quote character nor one of the
                   # possible first characters of the delimiter
        # edge case followed by a simple part
        (?>
            (?:   # edge cases
                " [^"]* (?:"|$)   # a quoted part or an orphan quote in the last item (*)
              |   # OR
                (?> \s+ )         # start of the delimiter
                (?!,)             # but not the delimiter
            )
            [^\s,"]*   # simple part
        )*
    )
)+
$
The pattern is designed for the Regex.Match method, since it describes the whole string. All items are available in group 1, because the .NET regex flavor keeps every capture of a repeated group (in the group's Captures collection).
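As a minimal sketch of how the pattern above might be used from C# (assuming the \s*,\s* delimiter): the helper name SplitOutsideQuotes and the decision to throw a FormatException when the string does not match are illustrative assumptions, not part of the answer. RegexOptions.IgnorePatternWhitespace lets the pattern keep its layout and comments.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string text = "abc , def , a\" , \"d , oioi";
string[] items = SplitOutsideQuotes(text);

foreach (string item in items)
    Console.WriteLine("[" + item + "]");
// Prints, one per line: [abc] [def] [a" , "d] [oioi]

// Hypothetical helper: matches the whole string once and returns
// every capture recorded for group 1 (one capture per item).
static string[] SplitOutsideQuotes(string input)
{
    // In a verbatim string a double quote is written as "".
    const string pattern = @"
        ^
        (?>
            (?: \s*,\s* | ^ )          # delimiter or start of the string
            (                          # group 1: one item
                [^\s,""]*              # simple part
                (?>
                    (?: "" [^""]* (?:""|$)   # quoted part or orphan quote
                      | (?> \s+ ) (?!,)      # whitespace not followed by a comma
                    )
                    [^\s,""]*          # simple part
                )*
            )
        )+
        $";

    Match m = Regex.Match(input, pattern, RegexOptions.IgnorePatternWhitespace);
    if (!m.Success)
        throw new FormatException("The input does not match the expected format.");

    // Group.Captures holds all captures of a repeated group, in order.
    return m.Groups[1].Captures.Cast<Capture>().Select(c => c.Value).ToArray();
}
```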
This example can be easily adapted to all cases.
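To illustrate one such adaptation, here is a sketch for the single-character delimiter % from the question's second example. Since the delimiter is a single character, the only edge case left is the quoted part. The pattern below is an assumption derived from the general template, not something given in the answer, and the variable names are illustrative.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string text = "a%2% 6y % \"ad%t6%&\" %(7y) %";

// Assumed adaptation of the template for the delimiter %.
// In a verbatim string a double quote is written as "".
string pattern = @"
    ^
    (?>
        (?: % | ^ )                  # delimiter or start of the string
        (                            # group 1: one item
            [^%""]*                  # simple part
            (?>
                "" [^""]* (?:""|$)   # quoted part or orphan quote
                [^%""]*              # simple part
            )*
        )
    )+
    $";

Match m = Regex.Match(text, pattern, RegexOptions.IgnorePatternWhitespace);

// One capture per item; the collection is empty if the match failed.
string[] items = m.Groups[1].Captures.Cast<Capture>().Select(c => c.Value).ToArray();

foreach (string item in items)
    Console.WriteLine("[" + item + "]");
// Prints, one per line: [a] [2] [ 6y ] [ "ad%t6%&" ] [(7y) ] []
```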
(*) If you want to allow escaped quotes inside quoted parts, you can apply the simple_part (?: edge_case simple_part)* scheme once more and replace " [^"]* (?:"|$) with:
"[^\\"]* (?: \\. [^\\"]*)* (?:"|$)
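A small sketch of that variant with the \s*,\s* delimiter, assuming backslash is the escape character inside quoted parts. The sample input and variable names are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sample input (an assumption): a , "b \" , c" , d
string text = "a , \"b \\\" , c\" , d";

// Same layout as the comma pattern, with the quoted part replaced by the
// escape-aware version. In a verbatim string a double quote is written as "".
string pattern = @"
    ^
    (?>
        (?: \s*,\s* | ^ )
        (
            [^\s,""]*
            (?>
                (?: "" [^\\""]* (?: \\. [^\\""]* )* (?:""|$)   # quoted part with escapes
                  | (?> \s+ ) (?!,)                            # whitespace not followed by a comma
                )
                [^\s,""]*
            )*
        )
    )+
    $";

Match m = Regex.Match(text, pattern, RegexOptions.IgnorePatternWhitespace);
string[] items = m.Groups[1].Captures.Cast<Capture>().Select(c => c.Value).ToArray();

foreach (string item in items)
    Console.WriteLine("[" + item + "]");
// Prints, one per line: [a] ["b \" , c"] [d]
```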
#2
0
I think this is a two-step process, and it has been overthought by trying to make it a one-step regex.
Steps
- Simply remove any quotes from a string.
- Split on the target character(s).
Example of Process
I will split on the , for step 2.
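using System;
using System.Linq;
using System.Text.RegularExpressions;

var data = string.Format("abc , def , a{0}, {0}d , oioi", "\"");

// Step 1: remove every quote. `\x22` is the hex escape for a quote ("),
// which is easier to read than an escaped quote in C# source.
var stage1 = Regex.Replace(data, @"\x22", string.Empty);

// abc , def , a", "d , oioi
// becomes
// abc , def , a, d , oioi

// Step 2: capture each run of characters that is neither whitespace nor a comma.
var items = Regex.Matches(stage1, @"([^\s,]+)[\s,]*")
    .OfType<Match>()
    .Select(mt => mt.Groups[1].Value)
    .ToArray();
// Result: abc, def, a, d, oioi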