How to get unique IDs across columns in R?

Time: 2021-11-26 20:37:17

I have legal data that looks like this. I'm using RStudio.

> head(gsu[,107:117])
    HtoODay PAOSLDME DUSHD POELRD XCAB WESDF BILOE HYPERDIF IMPSENS Billing MALLAMP
42        0     <NA>    No     No <NA>  <NA>  <NA>       No    <NA>  Hourly      NA
61        0     <NA>   Yes    Yes <NA>   Yes  <NA>      Yes    <NA>  Hourly      NA
230       0     <NA>    No    Yes <NA>  <NA>  <NA>      Yes    <NA>  Hourly      NA
235       0     <NA>    No     No <NA>  <NA>  <NA>      Yes    <NA>  Hourly      NA
302       0     <NA>    No     No <NA>  <NA>    No       No    <NA>  Hourly      NA
336       3     <NA>    No     No  Yes  <NA>  <NA>       No    <NA> Consult      NA

I want to count the rows that contain at least one Yes. That is, if Yes occurs in any column of a row, the row registers as a count of 1, regardless of the Yes or No values in its other columns.

For example, row 61 would count as 1, even though it contains multiple Yes values across columns, and row 336 would also count as 1 in the overall total, since it has a single instance of Yes.

Essentially, how do I count unique rows of binary instances across columns, without accounting for multiple within-row instances?

2 Solutions

#1


Another option is

# Find the row indices holding at least one 'Yes', then flag each row as 0/1
(1:nrow(gsu) %in% which(gsu == 'Yes', arr.ind = TRUE)[, 1]) + 0L
#[1] 0 1 1 1 0 1

Or

# Treat NA as FALSE, then test each row for any 'Yes'
apply(gsu == 'Yes' & !is.na(gsu), 1, any) + 0L
#   42  61 230 235 302 336 
#   0   1   1   1   0   1 

Or

# OR the columns together, one column at a time
Reduce(`|`, as.data.frame(gsu == 'Yes' & !is.na(gsu))) + 0L
#[1] 0 1 1 1 0 1

Or

# Parallel maximum over the per-column comparisons (TRUE counts as 1)
do.call(`pmax`, c(lapply(gsu, `==`, 'Yes'), na.rm = TRUE))
#[1] 0 1 1 1 0 1
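
To check that the four expressions above agree, they can be run on a small stand-in for the data. This is only a sketch: `gsu_demo` is a hypothetical data frame holding four of the columns from the six printed rows, not the asker's full `gsu`, and `is_yes` and `flags` are names introduced here for illustration.

```r
# Hypothetical stand-in: four columns copied from the six printed rows of gsu
gsu_demo <- data.frame(
  DUSHD    = c("No", "Yes", "No",  "No",  "No", "No"),
  POELRD   = c("No", "Yes", "Yes", "No",  "No", "No"),
  XCAB     = c(NA,   NA,    NA,    NA,    NA,   "Yes"),
  HYPERDIF = c("No", "Yes", "Yes", "Yes", "No", "No"),
  row.names = c("42", "61", "230", "235", "302", "336"),
  stringsAsFactors = FALSE
)

# Logical matrix of 'Yes' cells, with NA treated as FALSE
is_yes <- gsu_demo == "Yes" & !is.na(gsu_demo)

flags <- list(
  which  = 1:nrow(gsu_demo) %in% which(gsu_demo == "Yes", arr.ind = TRUE)[, 1],
  apply  = apply(is_yes, 1, any),
  reduce = Reduce(`|`, as.data.frame(is_yes)),
  pmax   = do.call(pmax, c(lapply(gsu_demo, `==`, "Yes"), na.rm = TRUE))
)

# Coerce every result to plain 0/1 integers so they can be compared directly
flags <- lapply(flags, function(x) as.integer(unname(x)))

# Every approach should flag rows 61, 230, 235 and 336
stopifnot(all(vapply(flags, identical, logical(1), c(0L, 1L, 1L, 1L, 0L, 1L))))
```

One difference worth keeping in mind: as far as I know, `pmax` with `na.rm=TRUE` still returns NA for a row in which every compared column is NA, whereas the other three expressions return 0 for such a row. The sample rows above never hit that case.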

Benchmarks

set.seed(24)
gsu1 <- as.data.frame(matrix(sample(c(NA, 'Yes', 'No', LETTERS), 
    4000*4000, replace=TRUE), ncol=4000), stringsAsFactors=FALSE) 

akrun1 <- function() (1:nrow(gsu1) %in% which(gsu1=='Yes', 
           arr.ind=TRUE)[,1]) +0L
akrun2 <- function() do.call(`pmax`, c(lapply(gsu1, `==`, 'Yes'), 
           na.rm=TRUE))
ExperimenteR <- function() rowSums(gsu1=="Yes", na.rm=TRUE)>=1

library(microbenchmark)
microbenchmark(akrun1(), akrun2(), ExperimenteR(), unit='relative', times=20L)
 #Unit: relative
 #           expr      min       lq     mean   median       uq      max neval cld
 #       akrun1() 1.244682 1.293628 1.293696 1.294336 1.319209 1.277138    20   b
 #       akrun2() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    20  a 
 # ExperimenteR() 1.213802 1.296464 1.276666 1.295421 1.280282 1.209436    20   b

#2


# TRUE for every row with at least one "Yes"; NA cells are skipped
rowSums(gsu == "Yes", na.rm = TRUE) >= 1

gives

#   42    61   230   235   302   336 
#FALSE  TRUE  TRUE  TRUE FALSE  TRUE 
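
Since the question asks for a single count, the logical vector can simply be summed. A minimal sketch, assuming data shaped like the printed rows: `gsu_demo` is a hypothetical stand-in built from four of the columns shown in the question, and `has_yes`, `n_rows_with_yes` and `rows_with_yes` are illustrative names.

```r
# Hypothetical stand-in: four columns copied from the six printed rows of gsu
gsu_demo <- data.frame(
  DUSHD    = c("No", "Yes", "No",  "No",  "No", "No"),
  POELRD   = c("No", "Yes", "Yes", "No",  "No", "No"),
  XCAB     = c(NA,   NA,    NA,    NA,    NA,   "Yes"),
  HYPERDIF = c("No", "Yes", "Yes", "Yes", "No", "No"),
  row.names = c("42", "61", "230", "235", "302", "336"),
  stringsAsFactors = FALSE
)

# TRUE for rows with at least one "Yes"; NA cells are skipped by na.rm = TRUE
has_yes <- rowSums(gsu_demo == "Yes", na.rm = TRUE) >= 1

# Number of rows with at least one "Yes", however many times it occurs per row
n_rows_with_yes <- sum(has_yes)

# Row names of the rows that were counted
rows_with_yes <- names(has_yes)[has_yes]

print(n_rows_with_yes)
print(rows_with_yes)
```

On this sample the count is 4, covering rows 61, 230, 235 and 336. Row 61 contributes 1 even though it has several Yes values.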
