I'm having some issues parsing CSV data that contains quotes. My main problem is with quotes within a field. In the following example, lines 1-4 work correctly but lines 5, 6 and 7 don't.
COLLOQ_TYPE,COLLOQ_NAME,COLLOQ_CODE,XDATA
S,"BELT,FAN",003541547,
S,"BELT V,FAN",000324244,
S,SHROUD SPRING SCREW,000868265,
S,"D" REL VALVE ASSY,000771881,
S,"YBELT,"V"",000323030,
S,"YBELT,'V'",000322933,
I'd like to avoid Text::CSV as it isn't installed on the target server. Realising that CSVs are more complicated than they look, I'm using a recipe from the Perl Cookbook.
sub parse_csv {
    my $text = shift;    # record containing comma-separated values
    my @columns = ();
    push(@columns, $+) while $text =~ m{
        # The first part groups the phrase inside quotes
        "([^\"\\]*(?:\\.[^\"\\]*)*)",?
        | ([^,]+),?
        | ,
    }gx;
    push(@columns, undef) if substr($text, -1, 1) eq ',';
    return @columns;    # list of values that were comma-separated
}
Does anyone have a suggestion for improving the regex to handle the above cases?
7 Answers
#1
35
Please, Try Using CPAN
There's no reason you couldn't download a copy of Text::CSV, or any other non-XS implementation of a CSV parser, and install it in your local directory or in a lib/ subdirectory of your project, so it's installed along with your project's rollout.
If you can't store text files in your project, then I'm wondering how it is you are coding your project.
http://novosial.org/perl/life-with-cpan/non-root/
This should be a good guide on how to get these into a working state locally.
Not using CPAN is really a recipe for disaster.
Please consider this before trying to write your own CSV implementation.
Text::CSV is over a hundred lines of code, including fixed bugs and edge cases, and rewriting this from scratch will just make you learn the hard way how awful CSV can be.
Note: I learnt this the hard way. It took me a full day to get a working CSV parser in PHP before I discovered a built-in one had been added in a later version. It really is something awful.
#2
19
You can parse CSV using Text::ParseWords which ships with Perl.
use feature 'say';    # say is not enabled by default
use Text::ParseWords;
while (<DATA>) {
    chomp;
    my @f = quotewords ',', 0, $_;
    say join ":" => @f;
}
__DATA__
COLLOQ_TYPE,COLLOQ_NAME,COLLOQ_CODE,XDATA
S,"BELT,FAN",003541547,
S,"BELT V,FAN",000324244,
S,SHROUD SPRING SCREW,000868265,
S,"D" REL VALVE ASSY,000771881,
S,"YBELT,"V"",000323030,
S,"YBELT,'V'",000322933,
which parses your CSV correctly:
# => COLLOQ_TYPE:COLLOQ_NAME:COLLOQ_CODE:XDATA
# => S:BELT,FAN:003541547:
# => S:BELT V,FAN:000324244:
# => S:SHROUD SPRING SCREW:000868265:
# => S:D REL VALVE ASSY:000771881:
# => S:YBELT,V:000323030:
# => S:YBELT,'V':000322933:
The only issue I've had with Text::ParseWords is when nested quotes in the data aren't escaped correctly. However, this is badly built CSV data and would cause problems with most CSV parsers ;-)
So you may notice that
# S,"YBELT,"V"",000323030,
came out as (i.e. the quotes around "V" were dropped)
# S:YBELT,V:000323030:
however, if it's escaped like so
# S,"YBELT,\"V\"",000323030,
then the quotes will be retained
# S:YBELT,"V":000323030:
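As a minimal sketch of the escaped case above (the variable names are illustrative, not from the original answer), parse_line from Text::ParseWords can be called on a single backslash-escaped row. With the second argument set to 0, the enclosing quotes and the escaping backslashes are removed, and the inner quotes survive:

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

# Row with the inner quotes backslash-escaped, as suggested above.
# In a single-quoted Perl string, \" stays as a backslash plus a quote.
my $escaped_row = 'S,"YBELT,\"V\"",000323030,';

# keep = 0: strip the enclosing quotes and the escaping backslashes.
my @escaped_fields = parse_line(',', 0, $escaped_row);

# The trailing comma yields a final undef element, so map it to ''
# before printing to avoid an "uninitialized" warning.
print join(':', map { defined $_ ? $_ : '' } @escaped_fields), "\n";
```

This only helps if you control how the file is written; data exported elsewhere usually doubles the quotes rather than backslash-escaping them.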
#3
2
This works like a charm.
The line is assumed to be comma-separated, with embedded commas inside quoted fields:
my @columns = Text::ParseWords::parse_line(',', 0, $line);
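The second argument to parse_line is the "keep" flag, and it matters for the question's "D" REL VALVE ASSY row. A small sketch comparing both settings (the variable names are just for illustration):

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

my $row = 'S,"D" REL VALVE ASSY,000771881,';

my @stripped = parse_line(',', 0, $row);    # keep = 0: quotes are removed
my @kept     = parse_line(',', 1, $row);    # keep = 1: quotes are left in place

print "keep=0: $stripped[1]\n";
print "keep=1: $kept[1]\n";
```

Note that with keep set to 1 a fully quoted field such as "BELT,FAN" also keeps its surrounding quotes, so you would have to strip those yourself.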
#4
1
tested; working:-
$_ .= ',';    # fake an ending delimiter
while ($_ =~ /"((?:""|[^"])*)",|([^,]*),/g) {
    $cell = defined($1) ? $1 : $2;
    $cell =~ s/""/"/g;
    print "$cell\n";
}
# The regexp strategy is as follows:
# First - we attempt a match on any quoted part starting the CSV line:-
# "((?:""|[^"])*)",
# It must start with a quote and end with a quote followed by a comma, and is allowed to contain either doubled quotes - "" - or anything except a double-quote character [^"] - this goes into $1
# If we can't match that, we accept anything up to the next comma instead, & put it into $2
# Lastly, we convert "" to " and print out the cell.
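To try the strategy above on individual rows, here is a sketch that wraps the same regex in a sub returning the cells instead of printing them. The name split_csv_line and the test rows with doubled quotes are my own additions, not part of the original answer:

```perl
use strict;
use warnings;

# Hypothetical wrapper around the loop above: returns the cells as a list.
sub split_csv_line {
    my ($line) = @_;
    $line .= ',';    # fake an ending delimiter
    my @cells;
    while ($line =~ /"((?:""|[^"])*)",|([^,]*),/g) {
        # $1 is set for a fully quoted cell, $2 for anything else
        my $cell = defined $1 ? $1 : $2;
        $cell =~ s/""/"/g;    # a doubled quote stands for one literal quote
        push @cells, $cell;
    }
    return @cells;
}

my @plain   = split_csv_line('S,"BELT,FAN",003541547,');
my @partial = split_csv_line('S,"D" REL VALVE ASSY,000771881,');
my @doubled = split_csv_line('S,"YBELT,""V""",000323030,');

print join('|', @$_), "\n" for \@plain, \@partial, \@doubled;
```

The partly quoted "D" REL VALVE ASSY cell comes back with its quotes intact, and the doubled-quote row yields YBELT,"V". The row exactly as written in the question (S,"YBELT,"V"",...) is still split at its inner comma, because its inner quotes are not doubled.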
Be warned that CSV files can contain cells with embedded newlines inside the quotes, so you'll need to do this if reading the data in one line at a time:
if ("$pre$_" =~ /,"[^,]*\z/) {
    $pre .= $_; next;
}
$_ = "$pre$_";
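For a complete loop, here is one possible sketch of joining physical lines into logical records. It uses a different heuristic from the regex check above: it counts double-quote characters and keeps reading while the count is odd. The sub name read_records and the sample data are made up for illustration, and the approach assumes quotes are doubled ("") rather than backslash-escaped:

```perl
use strict;
use warnings;

# Hypothetical helper: join physical lines into logical CSV records.
sub read_records {
    my ($fh) = @_;
    my (@records, $pending);
    while (my $line = <$fh>) {
        $pending = defined $pending ? $pending . $line : $line;
        # tr/"// counts the double quotes without modifying the string
        my $quotes = ($pending =~ tr/"//);
        next if $quotes % 2;    # odd count: still inside a quoted cell
        chomp $pending;
        push @records, $pending;
        undef $pending;
    }
    push @records, $pending if defined $pending;    # unterminated last record
    return @records;
}

# In-memory file containing one record that spans two physical lines.
my $data = qq{a,"line one\nline two",c\nd,e,f\n};
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my @records = read_records($fh);
close $fh;

print scalar(@records), " records\n";
```

Each element of @records can then be handed to whichever field splitter you settle on.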
#5
0
Finding matching pairs using regexes is a non-trivial and generally unsolvable task. There are plenty of examples in Jeffrey Friedl's book Mastering Regular Expressions. I don't have it at hand now, but I remember that he used CSV for some examples, too.
#6
0
You can (try to) use CPAN.pm to simply have your program install/update Text::CSV. As said before, you can even "install" it to a home or local directory and add that directory to @INC (or, if you prefer not to use BEGIN blocks, you can use lib 'dir'; which is probably better).
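A rough sketch of the use lib approach, assuming the module files have been copied into a lib/ directory next to the script (that layout is my assumption, adjust the path to wherever you put them). The require is wrapped in eval so the script can report clearly when the module is missing:

```perl
use strict;
use warnings;
use FindBin qw($Bin);
use lib "$Bin/lib";    # assumed layout: modules unpacked under ./lib beside the script

# Try to load Text::CSV; eval returns undef instead of dying if it is absent.
my $have_text_csv = eval { require Text::CSV; 1 } ? 1 : 0;

my $message = $have_text_csv
    ? "Text::CSV loaded from $INC{'Text/CSV.pm'}"
    : "Text::CSV not found; searched $Bin/lib and the default \@INC";
print "$message\n";
```

Because use lib runs at compile time, the directory is already on @INC before any later use or require statements are processed.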
#7
0
Tested:
use Test::More tests => 2;
use strict;
sub splitCommaNotQuote {
    my ( $line ) = @_;
    my @fields = ();
    while ( $line =~ m/((\")([^\"]*)\"|[^,]*)(,|$)/g ) {
        if ( $2 ) {
            push( @fields, $3 );
        } else {
            push( @fields, $1 );
        }
        last if ( ! $4 );
    }
    return( @fields );
}
is_deeply(
    +[splitCommaNotQuote('S,"D" REL VALVE ASSY,000771881,')],
    +['S', '"D" REL VALVE ASSY', '000771881', ''],
    "Quote in value"
);
is_deeply(
    +[splitCommaNotQuote('S,"BELT V,FAN",000324244,')],
    +['S', 'BELT V,FAN', '000324244', ''],
    "Strip quotes from entire value"
);