I have a large data set with billions of rows in a text file on Unix. I need to provide a specification and a summary example for my C++ developers.
I need to summarize data in the following format:
Example of the raw data table:
Column1 Column2 Column3 Column4 Column5 Column6 Column7 Column8 Column9 Column10 Column11 Column12 Column13 Column14 Column15 Column16 Column17 Column18
Ax1,Ay1 Bx1,By1 C1 D1 E1 F1 G1 H1 Ix1,Iy1 Jx1,Jy1 K1 L1 M1 N1 O1 P1 Ix1,Iy1 Jx1,Jy1
Ax2,Ay2 Bx2,By2 C1 D1 E1 F1 G1 H1 Ix1,Iy1 Jx1,Jy1 K1 L1 M1 N1 O1 P1 Ix1,Iy1 Jx1,Jy1
Ax2,Ay3 Bx2,By3 C3 D3 E3 F3 G3 H1 Ix1,Iy1 Jx1,Jy1 K3 L3 M3 N3 O3 P3 Qx3,Qy3 Rx3,Ry3
Ax4,Ay4 Bx4,By4 C4 D4 E4 F4 G4 H4 Ix4,Iy4 Jx1,Jy4 K4 L4 M4 N4 O4 P4 Qx4,Qy4 Rx4,Ry4
Ax5,Ay5 Bx5,By5 C5 D5 E5 F5 G5 H5 Ix5,Iy5 Jx1,Jy5 K5 L5 M5 N5 O5 H5 Ix5,Iy5 Jx1,Jy5
Ax6,Ay6 Bx6,By6 C2 D2 E3 F3 G3 H3 Ix3,Iy3 Jx1,Jy3 K2 L2 M3 N3 O3 P3 Ix3,Iy3 Jx1,Jy3
Ax7,Ay7 Bx7,By7 C7 D7 E3 F3 G3 H3 Ix3,Iy3 Jx1,Jy3 K7 L7 M3 N3 O3 P3 Ix3,Iy3 Jx1,Jy3
Ax8,Ay8 Bx8,By8 C8 D8 E8 F8 G8 H3 Ix3,Iy3 Jx1,Jy3 K8 L8 M8 N8 O8 P3 Ix3,Iy3 Jx1,Jy3
Ax9,Ay9 Bx9,By9 C9 D9 E9 F9 G9 H9 Ix9,Iy9 Jx1,Jy9 K9 L9 M9 N9 O9 P9 Qx9,Qy9 Rx9,Ry9
Ax10,Ay10 Bx10,By10 C10 D10 E10 F10 G10 H10 Ix10,Iy10 Jx1,Jy10 K10 L10 M10 N10 O10 P10 Qx10,Qy10 Rx10,Ry10
I would like to count the rows for each distinct combination of columns 8, 9, 10, 16, 17, and 18.
The result should be in this format:
Count Column8 Column9 Column10 Column16 Column17 Column18
2 H1 Ix1,Iy1 Jx1,Jy1 P1 Ix1,Iy1 Jx1,Jy1
1 H1 Ix1,Iy1 Jx1,Jy1 P3 Qx3,Qy3 Rx3,Ry3
1 H4 Ix4,Iy4 Jx1,Jy4 P4 Qx4,Qy4 Rx4,Ry4
1 H5 Ix5,Iy5 Jx1,Jy5 H5 Ix5,Iy5 Jx1,Jy5
3 H3 Ix3,Iy3 Jx1,Jy3 P3 Ix3,Iy3 Jx1,Jy3
1 H9 Ix9,Iy9 Jx1,Jy9 P9 Qx9,Qy9 Rx9,Ry9
1 H10 Ix10,Iy10 Jx1,Jy10 P10 Qx10,Qy10 Rx10,Ry10
People have told me that the right way to do this is with AWK. Any suggestions? Is there a faster alternative? I have been googling this, but I do not find it straightforward in AWK. Let me know. Thanks.
1 solution
#1
0
awk 'NR!=1{uniqueSet[$8" "$9" "$10" "$16" "$17" "$18]++}END{print "Count Column8 Column9 Column10 Column16 Column17 Column18"; for(i in uniqueSet) print uniqueSet[i]" "i}' <file_name>
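awk reads every row except the header (NR!=1), builds a key from columns 8, 9, 10, 16, 17, and 18, and increments the counter stored under that key. In the END block it prints the header line and then each count followed by its key. The order in which for(i in uniqueSet) visits the keys is unspecified, so the rows can come out in a different order than in the input.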
Output
Count Column8 Column9 Column10 Column16 Column17 Column18
1 H4 Ix4,Iy4 Jx1,Jy4 P4 Qx4,Qy4 Rx4,Ry4
3 H3 Ix3,Iy3 Jx1,Jy3 P3 Ix3,Iy3 Jx1,Jy3
1 H10 Ix10,Iy10 Jx1,Jy10 P10 Qx10,Qy10 Rx10,Ry10
1 H1 Ix1,Iy1 Jx1,Jy1 P3 Qx3,Qy3 Rx3,Ry3
2 H1 Ix1,Iy1 Jx1,Jy1 P1 Ix1,Iy1 Jx1,Jy1
1 H9 Ix9,Iy9 Jx1,Jy9 P9 Qx9,Qy9 Rx9,Ry9
1 H5 Ix5,Iy5 Jx1,Jy5 H5 Ix5,Iy5 Jx1,Jy5
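Since the specification is meant for C++ developers, the same counting logic could be sketched in C++ roughly as follows. This is only an illustration, not a tuned implementation: the helper names makeKey and countGroups are made up for this example, fields are assumed to be separated by whitespace, and lines with fewer than 18 fields are skipped (the awk command would instead count them under a partly empty key). As with awk, memory use grows with the number of distinct keys, not with the number of rows.

```cpp
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: builds the grouping key from the 1-based columns
// 8, 9, 10, 16, 17 and 18 of one whitespace-separated line.
// Returns false when the line has fewer than 18 fields (an assumption of
// this sketch; awk would still count such a line).
bool makeKey(const std::string& line, std::string& key) {
    static const std::size_t kCols[] = {8, 9, 10, 16, 17, 18};
    std::istringstream fields(line);
    std::vector<std::string> cols;
    std::string field;
    while (fields >> field) cols.push_back(field);
    if (cols.size() < 18) return false;
    key.clear();
    for (std::size_t c : kCols) {
        if (!key.empty()) key += ' ';
        key += cols[c - 1];  // convert 1-based column number to 0-based index
    }
    return true;
}

// Hypothetical helper: discards the header row, then counts how many
// rows share each key.
std::unordered_map<std::string, std::uint64_t> countGroups(std::istream& in) {
    std::unordered_map<std::string, std::uint64_t> counts;
    std::string line;
    std::string key;
    std::getline(in, line);  // skip the header, like NR!=1 in the awk command
    while (std::getline(in, line)) {
        if (makeKey(line, key)) ++counts[key];
    }
    return counts;
}
```

A caller could pass std::cin or a std::ifstream to countGroups and print each entry as the count, a space, and the key. Like the awk array, std::unordered_map has no defined iteration order, so the result would need sorting if a particular row order matters.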