I have Legal data that looks like this. I'm using RStudio.
> head(gsu[,107:117])
    HtoODay PAOSLDME DUSHD POELRD XCAB WESDF BILOE HYPERDIF IMPSENS Billing MALLAMP
42        0     <NA>    No     No <NA>  <NA>  <NA>       No    <NA>  Hourly      NA
61        0     <NA>   Yes    Yes <NA>   Yes  <NA>      Yes    <NA>  Hourly      NA
230       0     <NA>    No    Yes <NA>  <NA>  <NA>      Yes    <NA>  Hourly      NA
235       0     <NA>    No     No <NA>  <NA>  <NA>      Yes    <NA>  Hourly      NA
302       0     <NA>    No     No <NA>  <NA>    No       No    <NA>  Hourly      NA
336       3     <NA>    No     No  Yes  <NA>  <NA>       No    <NA> Consult      NA
>
I want to count the rows in which Yes occurs at least once. By which I mean, if Yes occurs in any one column, the row registers as a count of 1, regardless of the Yes or No values in the other columns.
For example, row 61 would count as 1, even though it contains multiple Yes values across columns, and row 336 would also add 1 to the overall count, since it contains a single Yes.
Essentially, how do I count unique rows of binary instances across columns, without accounting for multiple within-row instances?
2 Solutions
#1
Another option is
(1:nrow(gsu) %in% which(gsu=='Yes', arr.ind=TRUE)[,1])+0L
#[1] 0 1 1 1 0 1
Or
apply(gsu=='Yes' & !is.na(gsu), 1, any) + 0L
#  42  61 230 235 302 336
#   0   1   1   1   0   1
Or
Reduce(`|`,as.data.frame(gsu=='Yes' & !is.na(gsu))) + 0L
#[1] 0 1 1 1 0 1
Or
do.call(`pmax`, c(lapply(gsu,`==`, 'Yes'), na.rm=TRUE))
#[1] 0 1 1 1 0 1
Benchmarks
set.seed(24)
gsu1 <- as.data.frame(matrix(sample(c(NA, 'Yes', 'No', LETTERS),
          4000*4000, replace=TRUE), ncol=4000), stringsAsFactors=FALSE)
akrun1 <- function() (1:nrow(gsu1) %in% which(gsu1=='Yes',
          arr.ind=TRUE)[,1]) +0L
akrun2 <- function() do.call(`pmax`, c(lapply(gsu1, `==`, 'Yes'),
          na.rm=TRUE))
ExperimenteR <- function() rowSums(gsu1=="Yes", na.rm=TRUE)>=1
library(microbenchmark)
microbenchmark(akrun1(), akrun2(), ExperimenteR(), unit='relative', times=20L)
#Unit: relative
#           expr      min       lq     mean   median       uq      max neval cld
#       akrun1() 1.244682 1.293628 1.293696 1.294336 1.319209 1.277138    20   b
#       akrun2() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    20  a
# ExperimenteR() 1.213802 1.296464 1.276666 1.295421 1.280282 1.209436    20   b
#2
rowSums(gsu=="Yes", na.rm=TRUE)>=1
gives
#   42    61   230   235   302   336
#FALSE  TRUE  TRUE  TRUE FALSE  TRUE
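The answers here return one indicator per row, while the question asks for a single count. As a minimal sketch, the row-wise indicator can be summed to get that count. The small data frame below is a hypothetical reconstruction of a few columns from the sample in the question, not the real gsu data, and the names has_yes and n_rows_with_yes are made up for illustration.

```r
# Hypothetical reconstruction of a few columns from the sample in the question
gsu <- data.frame(
  HtoODay  = c(0, 0, 0, 0, 0, 3),
  DUSHD    = c("No", "Yes", "No", "No", "No", "No"),
  POELRD   = c("No", "Yes", "Yes", "No", "No", "No"),
  XCAB     = c(NA, NA, NA, NA, NA, "Yes"),
  HYPERDIF = c("No", "Yes", "Yes", "Yes", "No", "No"),
  row.names = c("42", "61", "230", "235", "302", "336"),
  stringsAsFactors = FALSE
)

# TRUE for each row with "Yes" in at least one column; NA cells are ignored
has_yes <- rowSums(gsu == "Yes", na.rm = TRUE) >= 1

# Summing a logical vector counts its TRUE values,
# so each qualifying row contributes exactly 1 however many "Yes" cells it has
n_rows_with_yes <- sum(has_yes)
print(n_rows_with_yes)
```

On the real data, the same two steps would presumably apply to the column subset shown in the question, for example gsu[,107:117].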