Comparing multiple columns in a single file and counting occurrences of a value in those columns

Date: 2021-11-25 15:42:23

I have a file with 21 columns:

Name    Grade1    Grade2   Grade3   Grade4   Grade5  .... Grade20
Zoe       60        70      NA       NA        NA            67

Now, I want to keep only the names that have more than 5 grades that are NA. Some names contain NA.

I know that awk can do the job. But I'm not sure how to compare all of the columns without having to compare them individually.

I tried:

    more input_file.txt | awk 'total = count20[$2,$3,$4,$5,$6,$7,$8,$9,$10,
    $11,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21]++, if (($2 == "NA" || $3 == "NA" || 
    $4 == "NA" || $5 == "NA" || $6 == "NA" || $7 == "NA" || $8 == "NA" || $9 == "NA" 
    || $10 == "NA" || $11 == "NA" || $12 == "NA" || $13 == "NA" || $14 == "NA" || 
    $15 == "NA" || $16 == "NA" || $17 == "NA" || $18 == "NA" || $19 == "NA" || 
    $20 == "NA" || $21 == "NA") && total > 4) { print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"
    $6"\t"$7"\t"$8"\t"$9"\t"$10"\t"$11"\t"$12"\t"$13"\t"$14"\t"$15"\t"$16"\t"$17"\t"
    $18"\t"$19"\t"$20"\t"$21 }' > test.txt

It doesn't work and I'm not sure why or how to do it more efficiently.

Edit: to be more precise, the desired output is a file containing the name and all columns for every student who has more than 5 columns containing NA.

6 answers

#1


0  

Here's a fairly basic way to do it using awk:

awk '{ count = 0; for (i = 2; i <= NF; ++i) if ($i == "NA") ++count } count > 5' file

This loops through each field and checks whether it is equal to "NA". If so, it adds to the total count for that record. Records are printed when the count is greater than 5.
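
To see the loop in action, here is a small sketch on a made-up, tab-separated sample. The temporary file, the names, and the reduced number of grade columns are assumptions for illustration only. The `-F'\t'` option is added here because the sample is tab-separated; the command above relies on awk's default whitespace splitting.

```shell
# Hypothetical sample: a name followed by 8 grade columns, tab-separated.
tmp=$(mktemp)
printf 'Name\tG1\tG2\tG3\tG4\tG5\tG6\tG7\tG8\n'   > "$tmp"
printf 'Zoe\t60\t70\tNA\tNA\tNA\tNA\tNA\tNA\n'   >> "$tmp"
printf 'NAomi\t55\tNA\tNA\tNA\tNA\tNA\t80\t90\n' >> "$tmp"
printf 'Bob\t60\t70\t80\t90\tNA\tNA\t50\t67\n'   >> "$tmp"

# Count fields 2..NF that are exactly "NA"; print the record when count > 5.
result=$(awk -F'\t' '{ count = 0; for (i = 2; i <= NF; ++i) if ($i == "NA") ++count } count > 5' "$tmp")

printf '%s\n' "$result"   # only the Zoe record has more than 5 NA grades
rm -f "$tmp"
```

Because the loop starts at field 2 and compares whole fields, a name such as "NAomi" does not add to the count.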

#2


1  

awk to the rescue!

$ awk -F'NA' 'NF>5'

This assumes your names column doesn't include NA as a substring. It uses NA as the field delimiter and counts the resulting fields: more than 5 fields means the line contains at least 5 NAs. The default action is to print the line, so there is no need to specify one. As written, this gives you the records with 5 or more NAs; change the threshold to 6 (NF>6) if you want strictly more than 5.
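
The field-count trick can be checked on a single line. In this sketch the two sample lines are made up; the point is only that k occurrences of NA produce k + 1 fields when NA is the separator.

```shell
# Splitting on "NA": 3 occurrences of NA yield 4 fields.
three=$(printf 'Zoe 60 70 NA NA NA 67\n' | awk -F'NA' '{ print NF }')

# A name containing NA adds one extra (empty) leading field, which is
# why the approach assumes names never contain that substring.
with_name=$(printf 'NAomi 60 70 NA NA NA 67\n' | awk -F'NA' '{ print NF }')

echo "$three $with_name"   # 4 5
```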

#3


1  

This command prints all lines containing NA at least six times:

grep -E '(NA.*){6}' infile

It can be tripped up if the name of the student contains NA as well. To fix that, you can use

grep -E '^[^[:blank:]]+[[:blank:]]+.*(NA.*){6,}' infile

which counts the NA only after the first column.
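
A quick way to compare the two patterns is to count matching lines on a made-up sample (the temporary file and its contents are assumptions). The anchored pattern below includes a `.*` after the first column so that non-NA grades may come before the first NA.

```shell
# Made-up sample: "NAomi" has five NA grades plus "NA" inside the name.
tmp=$(mktemp)
printf 'Zoe\t60\t70\tNA\tNA\tNA\tNA\tNA\tNA\n'    > "$tmp"
printf 'NAomi\t55\tNA\tNA\tNA\tNA\tNA\t80\t90\n' >> "$tmp"

# Plain pattern: counts every "NA" on the line, including the one in the name.
plain=$(grep -cE '(NA.*){6}' "$tmp")

# Anchored pattern: consumes the first column before counting.
skip_name=$(grep -cE '^[^[:blank:]]+[[:blank:]]+.*(NA.*){6,}' "$tmp")

echo "plain: $plain, skipping the name column: $skip_name"   # plain: 2, skipping the name column: 1
rm -f "$tmp"
```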

#4


0  

It's tagged perl so a perl answer:

perl -ne 'print if (grep /^NA$/, split ) > 5'

Which:

  • iterates line by line
  • does a whitespace split
  • greps fields that are just NA
  • prints if the count of those is > 5

#5


0  

With GNU awk for word boundaries:

awk -F'\\<NA\\>' 'NF>6'

With any awk, if your fields are tab-separated as they seem to be:

awk -F'\tNA(\t|$)' 'NF>6'

With any POSIX awk, for any type of whitespace between fields:

awk -F'[[:space:]]NA([[:space:]]|$)' 'NF>6'
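
One caveat is worth checking on real data: the two whitespace-based separators consume the whitespace on both sides of each NA, so when NA fields are adjacent the next NA has no leading delimiter left to match and is not counted. The sketch below uses a made-up record and the tab-separated variant, and compares the separator-based count with a field-by-field loop.

```shell
# Made-up record with six adjacent NA fields, tab-separated.
line=$(printf 'Zoe\t60\t70\tNA\tNA\tNA\tNA\tNA\tNA')

# Separator-based count: each match swallows the tab after an NA,
# so every second NA in a run is skipped.
by_sep=$(printf '%s\n' "$line" | awk -F'\tNA(\t|$)' '{ print NF - 1 }')

# Field-by-field count for comparison (default whitespace splitting).
by_loop=$(printf '%s\n' "$line" | awk '{ n = 0; for (i = 2; i <= NF; ++i) if ($i == "NA") ++n; print n }')

echo "separator: $by_sep, loop: $by_loop"   # separator: 3, loop: 6
```

The GNU awk word-boundary version is not affected in this way, because its separator matches only the NA itself and leaves the surrounding whitespace in place.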

#6


0  

In Perl, just count the number of fields that are equal to NA:

perl -ne 'print if 5 < grep { $_ eq "NA" } split' test.txt

or, as Tom Fenech points out in his comment, let -a split the fields into @F:

perl -ane 'print if 5 < grep { $_ eq "NA" } @F' test.txt
