Find the uniqueness of rows in a data frame?

Time: 2021-04-17 09:16:17

I have a data frame like the one below and would like to find its unique rows. The data contains NA values, though. If all the non-NA values in a row with NAs are the same as in other rows (like rows 1, 2, 5), I want to ignore that row, but if they are not the same (like rows 2, 4), I want to keep it as a unique row. For example, in rows 1, 2 and 6 all values except the NAs are the same, and because the NAs could take the values 1 and 3, I want to remove these rows and keep only row 2. Likewise, in row 6 the values 2 and 3 (excluding the NAs) are the same as in rows c2 and c5, and the NAs in c6 could take the same values as in c2 and c5, so this row is not unique.

Also, @Sotos's solution helped me the most, but in its last step, after removing the NAs and building a pattern for each row, it assigns the same pattern (23) to c8 and c6 and removes both. But c8 is actually unique.
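
To make the collision concrete, here is a minimal sketch of just the pattern step (not @Sotos's full code), applied to the data shown below. The variable name `pattern` is made up for illustration. Pasting the non-NA values together discards which column each value came from, so c6 (a1 = 2, a3 = 3) and c8 (a1 = 2, a2 = 3) end up with the same string:

```r
# Data from the question (rows c1..c8).
df <- data.frame(
  a1 = c(2, 2, 2, 2, 2, 2, 2, 2),
  a2 = c(1, 1, 1, 2, 1, NA, NA, 3),
  a3 = c(3, 3, 4, 3, 3, 3, 0, NA),
  a4 = c(NA, 3, 3, NA, 3, NA, NA, NA),
  row.names = paste0("c", 1:8)
)

# Paste each row's non-NA values together; column positions are lost here.
pattern <- apply(df, 1, function(i) paste(na.omit(i), collapse = ""))

# Both rows print as "23" even though the 3 sits in different columns.
pattern[c("c6", "c8")]
```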

data:

     a1   a2   a3   a4
c1    2    1    3   NA
c2    2    1    3    3
c3    2    1    4    3
c4    2    2    3   NA
c5    2    1    3    3
c6    2   NA    3   NA
c7    2   NA    0   NA
c8    2    3   NA   NA

I would like to have this output:

output:

     a1   a2   a3   a4
c2    2    1    3    3
c3    2    1    4    3
c4    2    2    3   NA
c7    2   NA    0   NA
c8    2    3   NA   NA

2 solutions

#1


2  

library(stringr)
# drop exact duplicate rows first
df <- unique(df)
# paste each row into a single string, omitting NAs
df$new <- apply(df, 1, function(i) paste(na.omit(i), collapse = ''))
# use str_detect to count how many rows' strings contain each row's pattern
df$new2 <- rowSums(sapply(df$new, function(i) str_detect(i, df$new)))
# keep the rows whose pattern is found only once (in the row itself)
new_df <- subset(df, df$new2 == 1)
# drop the helper columns
new_df <- new_df[, !names(new_df) %in% c('new', 'new2')]
new_df
#   V2 V3 V4 V5
#2  2  1  3  3
#3  2  1  4  3
#4  2  2  3 NA

Testing the code with the additional row as per your comment:

new_df
#   a1 a2 a3 a4
#c2  2  1  3  3
#c3  2  1  4  3
#c4  2  2  3 NA
#c7  2 NA  0 NA

DATA

dput(df)
structure(list(a1 = c(2L, 2L, 2L, 2L, 2L, 2L, 2L), a2 = c(1L, 
1L, 1L, 2L, 1L, NA, NA), a3 = c(3L, 3L, 4L, 3L, 3L, 3L, 0L), 
    a4 = c(NA, 3L, 3L, NA, 3L, NA, NA)), .Names = c("a1", "a2", 
"a3", "a4"), class = "data.frame", row.names = c("c1", "c2", 
"c3", "c4", "c5", "c6", "c7"))

#2


0  

My solution would be to:

1) Take all unique rows that do not contain an NA.

2) Among the rows that have NAs, check whether the remaining values are identical to those of another row.

Reproduce data

df<-data.frame(V1 = rep(2,times = 6),
    V2 = c(1,1,1,2,1,NA),
    V3=c(3,3,4,3,3,3),
    V4=c(NA,3,3,NA,3,NA))

Create two data frames of unique rows (one with NAs, the other without)

df1<-unique(df[apply(df,MARGIN=1,FUN =function(z) sum(is.na(z)))==0,])
df2<-unique(df[apply(df,MARGIN=1,FUN =function(z) sum(is.na(z)))>0,])
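
As a side note, the same split can probably be written more compactly with base R's `complete.cases()`, which returns TRUE for rows that contain no NA. This is only a sketch of an equivalent formulation, with `cc` as a made-up helper name, not a change to the approach:

```r
# Data as reproduced in this answer.
df <- data.frame(V1 = rep(2, times = 6),
                 V2 = c(1, 1, 1, 2, 1, NA),
                 V3 = c(3, 3, 4, 3, 3, 3),
                 V4 = c(NA, 3, 3, NA, 3, NA))

# TRUE for rows without any NA.
cc <- complete.cases(df)

df1 <- unique(df[cc, ])   # unique rows without NAs
df2 <- unique(df[!cc, ])  # unique rows with at least one NA
```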

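Add the rows with NAs that match your condition (no other row has the same non-NA values)

for (i in seq_len(nrow(df2))) {
  vec <- df2[i, ]
  w <- as.vector(is.na(vec))  # TRUE for the columns where this row has an NA
  # compare only the non-NA columns; keep the row if nothing in df1 matches it
  if (nrow(merge(vec[, !w, drop = FALSE], df1[, !w, drop = FALSE])) == 0) {
    df1 <- rbind(df1, vec)
  }
}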
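
For the eight-row data in the question, a position-aware comparison may be easier to reason about than pasted patterns. The sketch below is not taken from either answer. It assumes the rule is "drop a row if some other row has the same value in every column where this row is not NA", and the names `u`, `m`, `is_covered`, `covered` and `result` are made up for illustration:

```r
# Data from the question (rows c1..c8).
df <- data.frame(
  a1 = c(2, 2, 2, 2, 2, 2, 2, 2),
  a2 = c(1, 1, 1, 2, 1, NA, NA, 3),
  a3 = c(3, 3, 4, 3, 3, 3, 0, NA),
  a4 = c(NA, 3, 3, NA, 3, NA, NA, NA),
  row.names = paste0("c", 1:8)
)

u <- unique(df)    # drop exact duplicates (c5 duplicates c2)
m <- as.matrix(u)

# TRUE if some other row has a non-NA, equal value in every column
# where row i is not NA (so row i adds no new information).
is_covered <- function(i) {
  known <- !is.na(m[i, ])
  others <- setdiff(seq_len(nrow(m)), i)
  any(vapply(others, function(j) {
    all(!is.na(m[j, known]) & m[j, known] == m[i, known])
  }, logical(1)))
}

covered <- vapply(seq_len(nrow(m)), is_covered, logical(1))
result <- u[!covered, ]
result
```

Because values are compared column by column, c8 (a2 = 3) is no longer confused with c6 (a3 = 3), and on this data the rows that remain are c2, c3, c4, c7 and c8. The loop is quadratic in the number of rows, so it is only meant for small data frames.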
