I have a file formatted as follows:
string1,string2,string3,...
...
I have to analyze the second column, counting the occurrences of each string, and producing a file formatted as follows:
"number of occurrences of x",x
"number of occurrences of y",y
...
I managed to write the following script, which works fine:
#!/bin/bash
> output
regExp='^[[:space:]]*([0-9]+) (.+)$'
while IFS= read -r line; do
    if [[ $line =~ $regExp ]]; then
        printf '%s,%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" >> output
    fi
done <<< "$(gawk -F , '!/^$/ {print $2}' "$1" | sort | uniq -c)"
My question is: is there a better and simpler way to do the job?
In particular, I don't know how to fix this:
gawk -F , '!/^$/ {print $2}' miocsv.csv | sort | uniq -c | gawk '{print $1","$2}'
The problem is that string2 can contain whitespace and, if it does, the second call to gawk will truncate the string. Nor do I know how to print all the fields "from 2 to NF" while keeping the delimiter, which can occur several times in succession.
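One possible way around the truncation (a sketch, not necessarily the best approach; the sample rows here are made up): instead of letting the second awk re-split the `uniq -c` output on whitespace, save the count from $1, strip the leading "count " prefix from $0 with sub(), and print whatever is left verbatim, so internal and leading spaces in the string survive:

```shell
# Save the count, strip "   <count> " from the front of $0, print the rest as-is.
printf '%s\n' 'a, x ,b' 'a, x ,b' 'c,y,d' |
  awk -F, '!/^$/ {print $2}' | LC_ALL=C sort | uniq -c |
  awk '{ n = $1; sub(/^[[:space:]]*[0-9]+ /, ""); print n "," $0 }'
```

Here the string " x " comes through with both its leading and trailing space intact, which the field-based `print $1","$2` version loses.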
Thanks very much, goodbye
EDIT:
As asked, here is some sample data:
(It is an exercise; sorry for the contrived data)
Input:
*,*,*
test, test ,test
prova, * , prova
test,test,test
prova, prova ,prova
leonardo,da vinci,leonardo
in,o u t ,pr
, spaces ,
, spaces ,
leonardo,da vinci,leonardo
leonardo,da vinci,leonardo
leonardo,da vinci,leonardo
in,o u t ,pr
test, test ,test
, tabs ,
, tabs ,
po,po,po
po,po,po
po,po,po
prova, * , prova
prova, * , prova
*,*,*
*,*,*
*,*,*
, spaces ,
, tabs ,
Output:
3, *
4,*
4,da vinci
2,o u t
3,po
1, prova
3, spaces
3, tabs
1,test
2, test
3 Answers
#1 (5 votes)
A one-liner in awk:
awk -F, 'x[$2]++ { } END { for (i in x) print x[i] "," i }' input.csv
It stores the count for each 2nd-column string in the associative array x, and at the end loops through the array and prints the results.
To get the exact output you showed for this example, you need to pipe it to sort(1), setting the field delimiter to "," and the sort key to the 2nd field:
awk -F, 'x[$2]++ { } END { for (i in x) print x[i] "," i }' input.csv | sort -t, -k2,2
The only condition, of course, is that the 2nd column of each line doesn't itself contain a ",".
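As a quick sanity check, here is the same one-liner run on a few made-up rows (LC_ALL=C pins the sort order; the `for (i in x)` loop itself visits keys in an unspecified order, which is why the sort step matters):

```shell
# Count occurrences of the 2nd field; x is keyed by the raw field, spaces included.
printf '%s\n' 'a, x ,b' 'a,y,b' 'a, x ,b' |
  awk -F, 'x[$2]++ { } END { for (i in x) print x[i] "," i }' |
  LC_ALL=C sort -t, -k2,2
```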
#2 (1 vote)
You can make your final awk:
gawk '{ sub(" *","",$0); sub(" ",",",$0); print }'
or use sed for this sort of thing:
sed 's/ *\([0-9]*\) /\1,/'
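Putting it together, the whole pipeline with the sed step above might look like this (the sample rows are made up; plain awk is used here, gawk behaves the same). The sed expression eats the leading blanks and the single space after the count, so a string with its own leading spaces is preserved:

```shell
# Extract column 2, count with sort | uniq -c, then turn "   <n> <s>" into "<n>,<s>".
printf '%s\n' 'a, x ,b' 'a, x ,b' 'c,y,d' |
  awk -F, '!/^$/ {print $2}' | LC_ALL=C sort | uniq -c |
  sed 's/ *\([0-9]*\) /\1,/'
```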
#3 (0 votes)
Here is a Perl one-liner, similar to Filipe's awk solution:
perl -F, -lane '$x{$F[1]}++; END{ for $i (sort keys %x) { print "$x{$i},$i" } }' input.csv
The output is sorted alphabetically according to the second column. Note that the @F autosplit array starts at index $F[0], while awk fields start with $1.