Trying to parse a malformed CSV

I'm parsing thousands of lines from a publicly available government CSV file. The problem is that some values contain commas and are wrapped in double quotes, which makes the lines hard to parse consistently. The number of matches should be 251. I've tried negating the double quotes, but that doesn't seem to work either.

Example:

GS08P12VJP0107,0,0,,,,0,5300.00,5300.00,5300.00,2012-09-21,2012-09-21 00:00:00,2012-11-01 00:00:00,2012-11-01 00:00:00,,047,GENERAL SERVICES ADMINISTRATION (GSA),4740,PUBLIC BUILDINGS SERVICE,VJ000,"GSA/PBS/MTN PLAINS SVS CTR, NORTH DAKOTA FIELD OFFICE",047,GENERAL SERVICES ADMINISTRATION (GSA),4740,PUBLIC BUILDINGS SERVICE,VJ000,"GSA/PBS/MTN PLAINS SVS CTR, NORTH DAKOTA FIELD OFFICE",,,043570956,MIKE AUSTFJORD & SONS INC,,MIKE AUSTFJORD & SONS INC,043570956,UNITED STATES,,9469 138TH AVE NE,,,CAVALIER,ND,,582209505,ND00,7012654255,7012653110,USA,UNITED STATES,PEMBINA,PEMBINA,ND,NORTH DAKOTA,582719745,00,,B,PO,,,,,,NAN,J,FIRM FIXED PRICE,"EXCAVATE WETLANDS AS REMEDIATION AT US BORDER STATION, 10980 I-29, PEMBINA, NORTH DAKOTA.",,,,1,Z2AA,REPAIR OR ALTERATION OF OFFICE BUILDINGS,D,NOT A BUNDLED REQUIREMENT,,,238910,SITE PREPARATION CONTRACTORS,A,FAR 52.223-4 INCLUDED,A,U.S. OWNED BUSINESS,,,,,B,JUSTIFICATION - TIME,USA,,C,NOT A MANUFACTURED END PRODUCT,B,PLAN NOT REQUIRED,F,COMPETED UNDER SAP,SP1,SIMPLIFIED ACQUISITION,SBA,SMALL BUSINESS SET ASIDE - TOTAL,NONE,NO PREFERENCE USED,,NAN,,NAN,,,1,D,,f,N,NO,NO,,X,NOT APPLICABLE,N,,,N: NO,,X,NOT APPLICABLE,X,NOT APPLICABLE,Y,YES,X,NOT APPLICABLE,,,,,,,,NONE,NONE,,,,NAN,N,TRANSACTION DOES NOT USE GFE/GFP,,,X,NO,N,NO,N,NO - SERVICE WHERE PBA IS NOT USED.,,,,,N,NO,X,NOT APPLICABLE,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,SMALL BUSINESS,S,t,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,t,t,f,f,f,f,f,f,f,f,f,f,t,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,2012-09-21 00:00:00

Could somebody please assist? I'm doing this with Java's Pattern/Matcher.

1 solution

#1

There are a few different pattern groups to consider. Breaking your example down into its various cases, I came up with the following regex:

(\"(.*?)\")|(.*?(,))|(.*)

The first capture group, (\"(.*?)\") deals with values in quotes.

The second, (.*?(,)) deals with the other cases (no quotes).

The last, (.*) is for the final part of the csv, with no ending comma.
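
To see how the three alternatives interact, it can help to run the regex through a Pattern/Matcher loop and list every match that find() reports. The sketch below does that on a shortened, made-up line; the class name RegexMatchDemo and the helper matches are hypothetical and only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexMatchDemo {
    // The regex from this answer, written as a Java string literal.
    static final Pattern P = Pattern.compile("(\"(.*?)\")|(.*?(,))|(.*)");

    // Collects the full text of every match that find() reports, in order.
    static List<String> matches(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = P.matcher(line);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Shortened, made-up input with one quoted value containing a comma.
        for (String s : matches("VJ000,\"A, B\",047")) {
            System.out.println("[" + s + "]");
        }
    }
}
```

On this input the loop reports five matches for three values: the unquoted value with its trailing comma, the quoted value with its quotes, the lone comma after the closing quote, the last value, and an empty match at the end of the line. Anyone counting matches with this regex would probably need to filter those extra matches out.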

EDIT

This post got more comments than I would have ever expected.

Of course, the solution above has room for improvement: it does not account for double quotes inside a value, and it includes the trailing comma in the matched value. The asker mentioned they were trying to solve the problem with Pattern/Matcher, so with a regex that fits their use case, something like this

Pattern p = Pattern.compile(someRegex);
String line = ... // get line from somewhere
Matcher m = p.matcher(line);

while (m.find()) {
    // do stuff
}

may be sufficient.
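
As one possible way to address the two shortcomings mentioned above, the sketch below uses a different regex (not the one from the original answer) that treats the separator as its own group and accepts doubled quotes inside a quoted value. The class name CsvLineSplitter, the method split, and the pattern itself are assumptions made for illustration. It works on a single line, so quoted values that span several lines are out of scope.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CsvLineSplitter {
    // A field is either quoted (group 1, may contain commas and doubled quotes)
    // or unquoted (group 2). Group 3 is whatever ends it: a comma or end of line.
    private static final Pattern FIELD =
            Pattern.compile("(?:\"((?:[^\"]|\"\")*)\"|([^\",]*))(,|$)");

    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        int expectedStart = 0;
        while (m.find()) {
            // Every match must begin where the previous one ended;
            // a gap means find() skipped text that is not a valid field.
            if (m.start() != expectedStart) {
                throw new IllegalArgumentException("Malformed field at index " + expectedStart);
            }
            String quoted = m.group(1);
            // Inside a quoted value, a doubled quote stands for one literal quote.
            fields.add(quoted != null ? quoted.replace("\"\"", "\"") : m.group(2));
            if (m.group(3).isEmpty()) {
                return fields; // the field ended at end of line, so we are done
            }
            expectedStart = m.end();
        }
        throw new IllegalArgumentException("Malformed field at index " + expectedStart);
    }

    public static void main(String[] args) {
        // Fragment modelled on the example line in the question.
        String line = "VJ000,\"GSA/PBS/MTN PLAINS SVS CTR, NORTH DAKOTA FIELD OFFICE\",,047";
        for (String field : split(line)) {
            System.out.println("[" + field + "]");
        }
    }
}
```

Putting the separator in its own group keeps the comma out of the captured value, and stopping once the separator group is empty avoids an extra empty match at the end of the line. For real data a dedicated CSV parser is still likely the safer choice.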

One user suggested Apache Commons CSV, which can be found at https://mvnrepository.com/artifact/org.apache.commons/commons-csv/1.5 (the latest version at the time of writing).

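import java.io.FileReader;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

// Closing the parser also closes the underlying reader.
try (CSVParser parser = CSVFormat.DEFAULT.parse(new FileReader(source))) {
    for (CSVRecord record : parser) {
        for (String colVal : record) {
            // do stuff
        }
    }
}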

See the documentation at https://commons.apache.org/proper/commons-csv/user-guide.html for practical use cases.
