I have a file formatted as follows:
string1,string2,string3,...
...
I have to analyze the second column, counting the occurrences of each string, and producing a file formatted as follows:
"number of occurrences of x",x
"number of occurrences of y",y
...
I managed to write the following script, which works fine:
#!/bin/bash
> output
regExp='^[[:space:]]*([0-9]+) (.+)$'
while IFS= read -r line; do
    if [[ $line =~ $regExp ]]; then
        printf '%s,%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" >> output
    fi
done <<< "$(gawk -F , '!/^$/ {print $2}' "$1" | sort | uniq -c)"
My question is: is there a better and simpler way to do the job?
In particular, I don't know how to fix this:
gawk -F , '!/^$/ {print $2}' miocsv.csv | sort | uniq -c | gawk '{print $1","$2}'
The problem is that string2 can contain whitespace and, if it does, the second call to gawk will truncate the string. Nor do I know how to print all the fields "from 2 to NF" while keeping the delimiter, which can occur several times in succession.
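One possible way around the truncation (a sketch, not necessarily the best approach; the sample rows here are made up): instead of letting the second awk re-split the `uniq -c` output on whitespace, save the count from $1, strip the leading "count " prefix from $0 with sub(), and print whatever is left verbatim, so internal and leading spaces in the string survive:

```shell
# Save the count, strip "   <count> " from the front of $0, print the rest as-is.
printf '%s\n' 'a, x ,b' 'a, x ,b' 'c,y,d' |
  awk -F, '!/^$/ {print $2}' | LC_ALL=C sort | uniq -c |
  awk '{ n = $1; sub(/^[[:space:]]*[0-9]+ /, ""); print n "," $0 }'
```

Here the string " x " comes through with both its leading and trailing space intact, which the field-based `print $1","$2` version loses.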
Thanks very much, goodbye
EDIT:
As asked, here is some sample data:
(It is an exercise; sorry for the contrived data)
Input:
*,*,*
test, test ,test
prova, * , prova
test,test,test
prova, prova ,prova
leonardo,da vinci,leonardo
in,o u t ,pr
, spaces ,
, spaces ,
leonardo,da vinci,leonardo
leonardo,da vinci,leonardo
leonardo,da vinci,leonardo
in,o u t ,pr
test, test ,test
, tabs ,
, tabs ,
po,po,po
po,po,po
po,po,po
prova, * , prova
prova, * , prova
*,*,*
*,*,*
*,*,*
, spaces ,
, tabs ,
Output:
3, *
4,*
4,da vinci
2,o u t
3,po
1, prova
3, spaces
3, tabs
1,test
2, test
3 Answers
#1 (5 votes)
A one-liner in awk:
awk -F, 'x[$2]++ { } END { for (i in x) print x[i] "," i }' input.csv
It stores the count for each 2nd-column string in the associative array x, and at the end loops through the array and prints the results.
To get the exact output you showed for this example, you need to pipe it to sort(1), setting the field delimiter to "," and the sort key to the 2nd field:
awk -F, 'x[$2]++ { } END { for (i in x) print x[i] "," i }' input.csv | sort -t, -k2,2
The only condition, of course, is that the 2nd column of each line doesn't itself contain a ",".
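As a quick sanity check, here is the same one-liner run on a few made-up rows (LC_ALL=C pins the sort order; the `for (i in x)` loop itself visits keys in an unspecified order, which is why the sort step matters):

```shell
# Count occurrences of the 2nd field; x is keyed by the raw field, spaces included.
printf '%s\n' 'a, x ,b' 'a,y,b' 'a, x ,b' |
  awk -F, 'x[$2]++ { } END { for (i in x) print x[i] "," i }' |
  LC_ALL=C sort -t, -k2,2
```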
#2 (1 vote)
You can make your final awk:
gawk '{ sub(" *","",$0); sub(" ",",",$0); print }'
or use sed for this sort of thing:
sed 's/ *\([0-9]*\) /\1,/'
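Putting it together, the whole pipeline with the sed step above might look like this (the sample rows are made up; plain awk is used here, gawk behaves the same). The sed expression eats the leading blanks and the single space after the count, so a string with its own leading spaces is preserved:

```shell
# Extract column 2, count with sort | uniq -c, then turn "   <n> <s>" into "<n>,<s>".
printf '%s\n' 'a, x ,b' 'a, x ,b' 'c,y,d' |
  awk -F, '!/^$/ {print $2}' | LC_ALL=C sort | uniq -c |
  sed 's/ *\([0-9]*\) /\1,/'
```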
#3 (0 votes)
Here is a Perl one-liner, similar to Filipe's awk solution:
perl -F, -lane '$x{$F[1]}++; END{ for $i (sort keys %x) { print "$x{$i},$i" } }' input.csv
The output is sorted alphabetically according to the second column. Note that the @F autosplit array starts at index $F[0], while awk fields start with $1.