Regular expression for a CSV-style layout that allows quotes embedded inside quoted strings?

Time: 2022-12-06 20:29:14

I need a regex that will parse a CSV-style file, something like 57 fields wide, with most fields enclosed in double quotes (but maybe not all) and separated by commas. Quoted fields may contain embedded doubled quotes (""), each of which represents a single literal quote character in the evaluated string.

I'm a regex beginner/intermediate, and I think I can get pretty quickly to the basic expression to do the field parsing, but it's the embedded double-quotes (and commas) I can't get my head around.

Anyone? (Not that it matters but specific language is Matlab.)

7 solutions

#1


If you really have to do it with a regex, I would do it in two passes; firstly separate the fields by splitting on the commas with something such as:

regexp(theString, '(?<!\\),', 'split');

This should split on commas only when there isn't a preceding backslash (I'm assuming this is what you mean by escaped commas). (I think in MATLAB the 'split' option gives you the pieces of the original string between the matches, rather than the matches themselves.)

Then you should check each matched field for escaped quotes, and replace them with something like:

regexprep(individualString, '""', '"');

Similarly for commas:

regexprep(individualString, '\\,', ',');

I'm not sure whether the backslashes need to be doubled like this in MATLAB, as I haven't had much experience with it.

As others have said, it's probably better to use a csv library for handling the initial file.
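
For readers who want to try the two-pass idea outside MATLAB, here is a minimal Python sketch of the same steps. It keeps this answer's assumption that commas inside a field are escaped with a backslash, so it will still split on a bare comma inside a quoted field, and it leaves the enclosing quotes in place. The function name `split_two_pass` is made up for illustration.

```python
import re

def split_two_pass(line):
    """Split on unescaped commas, then unescape quotes and commas per field."""
    # Pass 1: split on commas that are NOT preceded by a backslash.
    raw_fields = re.split(r'(?<!\\),', line)
    fields = []
    for field in raw_fields:
        # Pass 2: collapse doubled quotes, then turn "\," back into ",".
        field = field.replace('""', '"').replace('\\,', ',')
        fields.append(field)
    return fields

print(split_two_pass(r'a\,b,c'))  # ['a,b', 'c']
```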

#2


I know there is great hype around regular expressions nowadays, but I would really recommend using a library for tasks that have already been implemented by others. It will be easier to implement, easier to read and easier to maintain (want to read CSVs separated by quotes next time? The library can possibly do it, but your regex will need a rewrite). A quick Google search should give you a good start.
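
For comparison, this is roughly what the library route looks like in Python (not MATLAB, which the question targets). It is a minimal sketch using the standard `csv` module, whose reader by default treats a doubled quote inside a quoted field as one literal quote. The function name `parse_csv_text` and the sample data are made up for illustration.

```python
import csv
import io

def parse_csv_text(text):
    """Parse CSV text into a list of rows (each row is a list of strings)."""
    # csv.reader accepts any iterable of lines; io.StringIO stands in for a file.
    return list(csv.reader(io.StringIO(text)))

sample = 'plain,"has, comma","say ""hi"""\n'
for row in parse_csv_text(sample):
    print(row)  # ['plain', 'has, comma', 'say "hi"']
```

With a real file, iterating over `csv.reader(open(path, newline=''))` row by row, instead of building a list, avoids holding the whole file in memory.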

#3


Escape the quotes; the ? makes the quote optional.

\"?

#4


It took me a while to work this out, since many of the regexps on the net don't handle one part or another. Here is code in F#/.NET. Sorry, but I don't speak MATLAB:

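open System.Text.RegularExpressions

let splitCsv (s:string) =
    // One match per field: optional padding, a quoted or bare value, then a comma or end of input
    let re = new Regex("\\s*((?:\"(?:(?:\"\")|[^\"])*\")|[^\"]*?)\\s*(?:,|$)")

    re.Matches( s + " ")
    |> Seq.cast<Match>
    |> Seq.map (fun m -> m.Groups.[1].Value)
    |> Seq.map (fun s -> s.Replace( "\"\"", "\"" ))   // collapse doubled quotes
    |> Seq.map (fun s -> s.Trim( [| '"'; ' ' |] ))    // drop enclosing quotes and spaces
    |> List.ofSeq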

This version handles quoted strings, quotes escaped as double-quotes, and trims extra (escaped) quotes and spaces around the whole string (original: "Test", double-quoted: """Test"""). It also properly handles an empty field in the last position (hence the s + " ") and it also properly handles commas inside quoted strings.

#5


Thanks for the replies. A classic case of the beginner thinking the problem is easy and the experts knowing the problem is hard.

After reading your posts, I looked for a canned CSV parser library in MATLAB and found a couple, neither of which could get the job done (the first tried to process the whole file at once and ran out of memory; the second failed on my specific bugaboo, doubled quotes in a quoted string).

So we rolled our own, with the help of a regex I found on the web and modified. It still needs to be ported to MATLAB, but the Python code is as follows:

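import re

text = ["<omitted>"]

# Regex: empty before comma OR string w/ no quote or comma OR quote-surrounded string w/ optional doubles
p = re.compile(r'(?=,)|[^",]+|"(?:[^"]|"")*"')

for line in text:
    print('Line: %s' % line)
    # match() anchors at the start of the line, so malformed input ends the loop instead of hanging
    m = p.match(line)
    fld = 1
    while m:
        val = m.group()
        if val.startswith('"'):
            # Drop only the enclosing quotes, then collapse doubled quotes
            val = val[1:-1].replace('""', '"')
        print('Field %d: %s' % (fld, val))
        line = line[m.end():]            # consume the field just read
        if line and line[0] == ',':
            line = line[1:]              # consume the separator
        fld += 1
        m = p.match(line)
    print()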

#6


Page 271 of Friedl's Mastering Regular Expressions has a regular expression for extracting possibly quoted CSV fields, but it requires a bit of postprocessing:

>>> re.findall('(?:^|,)(?:"((?:[^"]|"")*)"|([^",]*))', '"a,b,c",d,e,f')
[('a,b,c', ''), ('', 'd'), ('', 'e'), ('', 'f')]
>>> re.findall('(?:^|,)(?:"((?:[^"]|"")*)"|([^",]*))', '"a,b,c",d,,f')
[('a,b,c', ''), ('', 'd'), ('', ''), ('', 'f')]

Same pattern with the verbose flag:

csv = re.compile(r"""
    (?:^|,)
    (?: # now match either a double-quoted field
        # (inside, paired double quotes are allowed)...
        " # (double-quoted field's opening quote)
          (    (?: [^"] | "" )*    )
        " # (double-quoted field's closing quote)
    |
      # ...or some non-quote/non-comma text...
        ( [^",]* )
    )""", re.X)
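
The postprocessing mentioned above might look like the following sketch: pick whichever of the two groups took part in the match, and collapse doubled quotes only for quoted fields. The pattern is the one quoted in this answer; the names `FIELD` and `parse_line` are made up for illustration.

```python
import re

# Same pattern as above: group 1 is a quoted field's body, group 2 a bare field.
FIELD = re.compile(r'(?:^|,)(?:"((?:[^"]|"")*)"|([^",]*))')

def parse_line(line):
    """Return the fields of one CSV line as a list of strings."""
    fields = []
    for m in FIELD.finditer(line):
        quoted, bare = m.group(1), m.group(2)
        if quoted is not None:
            # Quoted field: a doubled quote stands for one literal quote.
            fields.append(quoted.replace('""', '"'))
        else:
            fields.append(bare)
    return fields

print(parse_line('"a,b,c",d,,f'))  # ['a,b,c', 'd', '', 'f']
```

Using `finditer` instead of `findall` means a group that did not take part in the match is `None` rather than an empty string, which keeps an empty quoted field distinguishable from a bare one.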

#7


It's possible to do this using a single regex with lookahead, illustrated here in Perl:

my @rows;

foreach my $line (@lines) {

    my @cells;
    while ($line =~ /( ("|').+?\2 | [^,]+? ) (?=(,|$))/gx) {
        push @cells, $1;
    }

    push @rows, \@cells;
}
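
As a rough Python rendering of the same lookahead idea (an illustrative port, not a tested equivalent of the Perl), the sketch below uses the verbose flag so the pattern can keep its spacing. The names `CELL` and `split_cells` are made up. As far as I can tell, this pattern leaves the enclosing quotes on each cell and skips empty fields, so it may need adjusting for data like the question describes.

```python
import re

# A quoted run closed by the same quote character, or a bare run,
# each required to be followed by a comma or the end of the line.
CELL = re.compile(r'''( (["']).+?\2 | [^,]+? ) (?=,|$)''', re.X)

def split_cells(line):
    """Return the cells of one line; group 1 keeps any enclosing quotes."""
    return [m.group(1) for m in CELL.finditer(line)]

print(split_cells('"a,b",c'))  # ['"a,b"', 'c']
```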
