How to get the IDs of rows that match in some columns but have NA in others

Time: 2020-12-08 09:09:10

I have a data frame in which some rows share the same values in some columns and should be identical throughout, but are filled with NA instead.

Example:

     ID   NAME   SURNAME      value1     value2
1     1  Luke Skywalker            1         3 
2     2  Luke Skywalker            NA        NA
3     3  Luke Skywalker            NA        NA
4     6   Han      Solo            NA        NA
5     7   Han      Solo            5         5 
6     8   Han      Solo            4         NA

In reality, this is a large dataset with more than two value columns.

I would like to get a vector of the IDs of the rows that share a NAME and SURNAME with another row but have NA in columns where that other row has actual values. If a group contains mixed data (as is the case here with Han), I only want the ID of the row that consists entirely of NA, unless there is a complete row whose values match those present in the incomplete row; in that case I also want the ID of the incomplete row.

So the expected result for my example would be c(2,3,6).

Edit: As asked, the names and surnames are important: I want an ID if and only if there is a full or more complete row for that name and surname combination. The variables are actually test results, and each test should happen only once per year (in my df I will also group by the testing date; I skipped that here because, as far as I know, the grouping variables should not affect the solution).

4 solutions

#1


1  

This is an example of how to get the "vector of the IDs of the rows that have the same NAME and SURNAME but have NA values in columns" and to "just get the ID of the row that has only NA data":

df <- read.table(header = TRUE, text = " ID   NAME   SURNAME      value1     value2
1     1  Luke Skywalker            1         3 
             2     2  Luke Skywalker            NA        NA
             3     3  Luke Skywalker            NA        NA
             4     6   Han      Solo            NA        NA
             5     7   Han      Solo            5         5 
             6     8   Han      Solo            4         NA ")

df[apply(df[ , c("value1", "value2")], 1, function(x) all(is.na(x))), ]

#2


1  

Another option is to use rowSums on the logical matrix created from the subset of the dataset that contains only the 'value' columns. It is vectorized and should work for any number of 'value' columns in the dataset.

df[!rowSums(!is.na(df[grep("value", names(df))])),]
#  ID NAME   SURNAME value1 value2
#2  2 Luke Skywalker     NA     NA
#3  3 Luke Skywalker     NA     NA
#4  6  Han      Solo     NA     NA
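
The question asks for a vector of IDs rather than the filtered rows. A minimal sketch of that last step, reusing the rowSums idea; the names value_cols, all_na and ids are illustrative, and the sample data is re-entered here without the row-name column. Note that this ignores NAME and SURNAME, so it matches the requested c(2,3,6) only because every all-NA row in the example happens to have a more complete row in its group.

```r
# Sample data from the question (row names omitted)
df <- read.table(header = TRUE, text = "
ID NAME SURNAME value1 value2
1 Luke Skywalker 1 3
2 Luke Skywalker NA NA
3 Luke Skywalker NA NA
6 Han Solo NA NA
7 Han Solo 5 5
8 Han Solo 4 NA
")

# Positions of the columns whose name contains "value"
value_cols <- grep("value", names(df))

# TRUE for rows in which every value column is NA
all_na <- rowSums(!is.na(df[value_cols])) == 0

# Pull the ID column instead of returning whole rows
ids <- df$ID[all_na]
ids
# 2 3 6
```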

#3


0  

A solution using dplyr.

library(dplyr)

df %>% filter_at(vars(-ID, -NAME, -SURNAME), all_vars(is.na(.)))

  ID NAME   SURNAME value1 value2
1  2 Luke Skywalker     NA     NA
2  3 Luke Skywalker     NA     NA
3  6  Han      Solo     NA     NA

filter_at filters rows by applying a condition to multiple columns, and vars(...) selects those columns. In the example above, vars(-ID, -NAME, -SURNAME) means the condition is applied to every column except ID, NAME, and SURNAME. Because you said you need to filter on more than two columns, here are other ways to specify the columns; each of the following also works.

# Select columns to begin with "value"
df %>% filter_at(vars(starts_with("value")), all_vars(is.na(.)))

# Select columns to contain "value"
df %>% filter_at(vars(contains("value")), all_vars(is.na(.)))

# Select columns to match "value" using regular expression
df %>% filter_at(vars(matches("value")), all_vars(is.na(.)))

# Select columns by column index numbers, not using the first three columns
df %>% filter_at(vars(-1:-3), all_vars(is.na(.)))

# Select columns by column index numbers, starting the fourth column to the end
df %>% filter_at(vars(4:ncol(.)), all_vars(is.na(.)))

all_vars(is.na(.)) means all the columns specified need to meet the filtering condition (is.na(.) == TRUE).

Data

df <- read.table(header = TRUE, text = " ID   NAME   SURNAME      value1     value2
1     1  Luke Skywalker            1         3 
                 2     2  Luke Skywalker            NA        NA
                 3     3  Luke Skywalker            NA        NA
                 4     6   Han      Solo            NA        NA
                 5     7   Han      Solo            5         5 
                 6     8   Han      Solo            4         NA ")
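
In dplyr 1.0.4 and later, filter_at() and all_vars() are superseded by if_all() used inside filter(). The following sketch should be equivalent to the starts_with("value") call above, assuming a recent dplyr is installed; the name res is illustrative and the sample data is re-entered without the row-name column.

```r
library(dplyr)

# Sample data from the question (row names omitted)
df <- read.table(header = TRUE, text = "
ID NAME SURNAME value1 value2
1 Luke Skywalker 1 3
2 Luke Skywalker NA NA
3 Luke Skywalker NA NA
6 Han Solo NA NA
7 Han Solo 5 5
8 Han Solo 4 NA
")

# Keep rows where is.na() holds for every column starting with "value"
res <- df %>% filter(if_all(starts_with("value"), is.na))

# The IDs of those rows
res$ID
```

Like the filter_at versions, this selects rows that are NA in every value column and does not look at NAME or SURNAME.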

#4


0  

If I understood correctly :)

df <- read.table(header = TRUE, text = " ID   NAME   SURNAME      value1     value2
1     1  Luke Skywalker            1         3 
             2     2  Luke Skywalker            NA        NA
             3     3  Luke Skywalker            NA        NA
             4     6   Han      Solo            NA        NA
             5     7   Han      Solo            5         5 
             6     8   Han      Solo            4         NA ")

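# Row indices (not IDs), classified by the NA pattern in the value columns (column 4 onward)
# rows with at least one NA
all_or_some_na  <- which(unname(apply(df[4:ncol(df)], 1, anyNA)))
# rows that are NA in every value column
all_na          <- which(unname(apply(df[4:ncol(df)], 1, function(x) all(is.na(x)))))
# rows with some, but not all, values missing
some_na         <- setdiff(all_or_some_na, all_na)
# rows without any NA
complete_rows   <- setdiff(seq_len(nrow(df)), all_or_some_na)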
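
The edit to the question says an ID should be returned only if the same NAME and SURNAME group contains a full or more complete row. One possible reading of that rule, sketched in base R: a row is flagged when another row in its group has strictly more non-NA values and agrees with it on every value the flagged row does have. The helpers is_covered and redundant_ids and the variable result are hypothetical names introduced for this illustration, not part of any package, and this interpretation is an assumption.

```r
# Sample data from the question (row names omitted)
df <- read.table(header = TRUE, text = "
ID NAME SURNAME value1 value2
1 Luke Skywalker 1 3
2 Luke Skywalker NA NA
3 Luke Skywalker NA NA
6 Han Solo NA NA
7 Han Solo 5 5
8 Han Solo 4 NA
")

# TRUE if row s has strictly more non-NA values than row r
# and matches r on every value that r actually has
is_covered <- function(r, s) {
  known <- !is.na(r)
  sum(!is.na(s)) > sum(known) &&
    all(!is.na(s[known]) & s[known] == r[known])
}

# IDs of rows that are covered by some other row of the same key group
redundant_ids <- function(df, value_cols, key_cols = c("NAME", "SURNAME")) {
  vals   <- as.matrix(df[value_cols])
  groups <- split(seq_len(nrow(df)), df[key_cols], drop = TRUE)
  flagged <- unlist(lapply(groups, function(idx) {
    keep <- vapply(idx, function(i) {
      others <- setdiff(idx, i)
      any(vapply(others, function(j) is_covered(vals[i, ], vals[j, ]), logical(1)))
    }, logical(1))
    idx[keep]
  }), use.names = FALSE)
  df$ID[sort(flagged)]
}

result <- redundant_ids(df, grep("^value", names(df), value = TRUE))
result
# 2 3 6
```

ID 8 is not returned because its value1 (4) differs from the complete Han Solo row (5). The pairwise comparison is quadratic within each group, which should be acceptable when groups are small; a testing-date column could be added to key_cols if grouping by date is needed.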
