Regular expression to split on commas, except when quoted

Date: 2022-02-11 21:40:19

What is the regular expression to split on comma (,) except if surrounded by double quotes? For example:

max,emily,john = ["max", "emily", "john"]

BUT

max,"emily,kate",john = ["max", "emily,kate", "john"]

Looking to use in C#: Regex.Split(string, "PATTERN-HERE");

Thanks.

4 Solutions

#1


14  

Situations like this often call for something other than regular expressions. They are nifty, but patterns for handling this kind of thing are more complicated than they are useful.

You might try something like this instead:

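// Requires: using System; using System.Collections.Generic; using System.Text;
public static IEnumerable<string> SplitCSV(string csvString)
{
    var sb = new StringBuilder();
    bool quoted = false; // true while inside a double-quoted field

    foreach (char c in csvString) {
        if (quoted) {
            if (c == '"')
                quoted = false; // closing quote: leave quoted mode
            else
                sb.Append(c);   // commas inside quotes are kept as data
        } else {
            if (c == '"') {
                quoted = true;
            } else if (c == ',') {
                // Unquoted comma: emit the finished field and start a new one.
                yield return sb.ToString();
                sb.Length = 0;
            } else {
                sb.Append(c);
            }
        }
    }

    if (quoted)
        throw new ArgumentException("Unterminated quotation mark.", "csvString");

    yield return sb.ToString();
}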

It probably needs a few tweaks to follow the CSV spec exactly, but the basic logic is sound.
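
One of those tweaks is probably the doubled-quote escape (`""` inside a quoted field stands for one literal quote). Here is a rough sketch of how the same loop could handle it. The class and method names (`CsvSketch`, `SplitCsvLine`) are made up for illustration, and it returns a list instead of using `yield` so the exception is thrown immediately; treat it as a starting point, not a full CSV parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CsvSketch
{
    // Hypothetical variant of SplitCSV: a doubled quote ("") inside a quoted
    // field is emitted as one literal quote character.
    public static List<string> SplitCsvLine(string line)
    {
        var fields = new List<string>();
        var sb = new StringBuilder();
        bool quoted = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (quoted)
            {
                if (c != '"')
                {
                    sb.Append(c);
                }
                else if (i + 1 < line.Length && line[i + 1] == '"')
                {
                    sb.Append('"'); // escaped quote
                    i++;            // skip the second quote of the pair
                }
                else
                {
                    quoted = false; // closing quote
                }
            }
            else if (c == '"')
            {
                quoted = true;
            }
            else if (c == ',')
            {
                fields.Add(sb.ToString());
                sb.Clear();
            }
            else
            {
                sb.Append(c);
            }
        }

        if (quoted)
            throw new ArgumentException("Unterminated quotation mark.", nameof(line));

        fields.Add(sb.ToString());
        return fields;
    }
}
```

With this, `CsvSketch.SplitCsvLine("a,\"say \"\"hi\"\"\",b")` should give three fields, the middle one being `say "hi"`.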

#2


1  

This is a clear-cut case for a CSV parser, so you should be using .NET's own CSV parsing capabilities or cdhowie's solution.
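
The answer doesn't say which .NET facility it means. My assumption is `TextFieldParser` from the `Microsoft.VisualBasic.FileIO` namespace, which is usable from C# (on .NET Framework you need a reference to the Microsoft.VisualBasic assembly). A minimal sketch for a single line, with a made-up wrapper name `TextFieldParserSketch.SplitLine`:

```csharp
using System;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class TextFieldParserSketch
{
    // Hypothetical helper: parses one CSV line and returns its fields.
    public static string[] SplitLine(string line)
    {
        using (var parser = new TextFieldParser(new StringReader(line)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true; // commas inside quotes stay in the field
            // ReadFields returns null when there is nothing left to read.
            return parser.ReadFields() ?? new string[0];
        }
    }
}
```

The parser strips the enclosing quotes for you, so there is no cleanup step afterwards.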

Purely for your information and not intended as a workable solution, here's what contortions you'd have to go through using regular expressions with Regex.Split():

You could use the regex (please don't!)

(?<=^(?:[^"]*"[^"]*")*[^"]*)  # assert that there is an even number of quotes before...
\s*,\s*                       # the comma to be split on...
(?=(?:[^"]*"[^"]*")*[^"]*$)   # as well as after the comma.

if your quoted strings never contain escaped quotes, and you don't mind the quotes themselves remaining in the split results.

This is horribly inefficient, a pain to read and debug, works only in .NET, and it fails on escaped quotes (at least if you're not using "" to escape a single quote). Of course the regex could be modified to handle that as well, but then it's going to be perfectly ghastly.
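
For the curious, this is roughly how that commented pattern would be passed to Regex.Split(). Because the pattern contains whitespace and `#` comments, it needs `RegexOptions.IgnorePatternWhitespace`. The class name `LookaroundSplitSketch` is made up for illustration, and everything said above about not using this in practice still applies.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundSplitSketch
{
    // The pattern from this answer, written as a C# verbatim string
    // (each " is doubled to escape it).
    private const string Pattern = @"
        (?<=^(?:[^""]*""[^""]*"")*[^""]*)  # even number of quotes before...
        \s*,\s*                            # the comma to be split on...
        (?=(?:[^""]*""[^""]*"")*[^""]*$)   # as well as after the comma.";

    public static string[] Split(string input)
    {
        return Regex.Split(input, Pattern, RegexOptions.IgnorePatternWhitespace);
    }
}
```

For `max,"emily,kate",john` the middle element comes back as `"emily,kate"` with its quotes still attached, so you would have to trim them yourself.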

#3


0  

A little late maybe, but I hope I can help someone else.

String[] cols = Regex.Split("max, emily, john", @"\s*,\s*");
foreach (String s in cols) {
    Console.WriteLine(s);
}

#4


0  

Justin, resurrecting this question because it had a simple regex solution that wasn't mentioned. This situation sounds straight out of "Match (or replace) a pattern except in situations s1, s2, s3 etc."

Here's our simple regex:

"[^"]*"|(,)

The left side of the alternation matches complete "quoted strings". We will ignore these matches. The right side matches commas and captures them in Group 1, and we know they are the right commas because they were not matched by the expression on the left. We replace these commas with SplitHere, then we split on SplitHere.

This program shows how to use the regex (see the results at the bottom of the online demo):

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string s1 = @"max,""emily,kate"",john";
        var myRegex = new Regex(@"""[^""]*""|(,)");
        // Keep quoted strings as they are; replace only commas captured in Group 1.
        string replaced = myRegex.Replace(s1, delegate(Match m) {
            if (m.Groups[1].Value == "") return m.Value;
            else return "SplitHere";
        });
        string[] splits = Regex.Split(replaced, "SplitHere");
        foreach (string split in splits) Console.WriteLine(split);
        Console.WriteLine("\nPress Any Key to Exit.");
        Console.ReadKey();
    } // END Main
} // END Program
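
A possible variation on the same regex, in case the input could itself contain the text SplitHere: walk the matches and cut the string at each Group 1 comma, with no placeholder. The names `AlternationSplitSketch` and `Split` are made up for illustration; the regex is the one from this answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AlternationSplitSketch
{
    // Same pattern as above: quoted strings on the left, captured commas on the right.
    private static readonly Regex Splitter = new Regex(@"""[^""]*""|(,)");

    public static List<string> Split(string input)
    {
        var parts = new List<string>();
        int start = 0;
        foreach (Match m in Splitter.Matches(input))
        {
            // Group 1 is empty for quoted strings, so those are not split points.
            if (!m.Groups[1].Success) continue;
            parts.Add(input.Substring(start, m.Index - start));
            start = m.Index + m.Length;
        }
        parts.Add(input.Substring(start)); // text after the last comma
        return parts;
    }
}
```

As with the program above, the quotes around `"emily,kate"` stay in the output.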

Reference

How to match (or replace) a pattern except in situations s1, s2, s3...
