Suppose I have a data frame that looks like this:
ID T X Y Z
1 1 A A NA
1 2 B A NA
1 3 B B NA
1 4 B A NA
2 1 A B NA
2 2 A A NA
2 3 B A NA
2 4 A B NA
3 1 B B NA
3 2 B B NA
3 3 B B NA
3 4 B A NA
And I would like to replace the value of Z based on some conditionals that depend on both row and (previous) column values so that the above ends up looking like this:
ID T X Y Z
1 1 A A 0
1 2 B A 0
1 3 B B 1
1 4 B A NA
2 1 A B 0
2 2 A A 0
2 3 B A 0
2 4 A B 0
3 1 B B 1
3 2 B B NA
3 3 B B NA
3 4 B A NA
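In case it helps anyone reproduce this, here is a minimal sketch of how the input above could be built in R. The name df is the one used in my attempt further down; stringsAsFactors = FALSE is an assumption on my part so that X and Y stay character rather than factor.

```r
# Rebuild the example input: 3 IDs, 4 time points each, Z still unset.
df <- data.frame(
  ID = rep(1:3, each = 4),
  T  = rep(1:4, times = 3),
  X  = c("A", "B", "B", "B",  "A", "A", "B", "A",  "B", "B", "B", "B"),
  Y  = c("A", "A", "B", "A",  "B", "A", "A", "B",  "B", "B", "B", "A"),
  Z  = NA,
  stringsAsFactors = FALSE  # assumption: keep X and Y as character
)
```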
The rules:
- Z takes the value of 1 the first time (in order by T, and within an ID) that both X and Y on that row have the value B.
- Z takes (or retains) the value NA if and only if for any smaller value of T, it has taken the value of 1 already.
- When T = 1, Z takes the value of 0 if X and Y on that row do not both equal B.
- When T > 1, Z takes the value of 0 if X and Y on that row do not both equal B, AND the value of Z on the previous row = zero.
I want the following to work, and it gets me kinda close but no dice:
df$Z <- NA
for (t in 1:4) {
  df$Z[ (df$X=="B" & df$Y=="B") & df$T==1] <- 1
  df$Z[!(df$X=="B" & df$Y=="B") & df$T==1] <- 0
  if (t>1) {
    df$Z[ (df$X=="B" & df$Y=="B") & df$T==t & (!is.na(df$Z[t-1]) & df$Z[t-1]==0)] <- 0
    df$Z[!(df$X=="B" & df$Y=="B") & df$T==t & (!is.na(df$Z[t-1]) & df$Z[t-1]==0)] <- 1
  }
}
On the other hand, I can write a series of nested if ... then statements looping across all observations, but that is excruciatingly slow (at least compared to the Stata program I am translating from).
I am sure I am committing twelve kinds of gaffes in my attempt above, but a few hours of banging my head on this has not resolved it.
So I come to you begging, hat in hand. :)
Edit: it occurs to me that sharing the Stata code (which resolves this so much faster than what I have come up with in R, which is ironic, given my preference for R over Stata's language :) might help with suggestions. This does what I want, and does it fast (even with, say, N=1600, T=11):
replace Z = .
forvalues t = 1(1)4 {
    replace Z = 1 if X == "B" & Y == "B" & T == 1
    replace Z = 0 if !(X == "B" & Y == "B") & T == 1
    replace Z = 1 if X == "B" & Y == "B" & T == `t' & Z[_n-1] == 0 & `t' > 1
    replace Z = 0 if !(X == "B" & Y == "B") & T == `t' & Z[_n-1] == 0 & `t' > 1
}
2 Answers
#1
2
Here's one approach using ave and transform:
transform(dat[order(dat$ID, dat$T), ],
          Z = ave(X == "B" & Y == "B", ID, FUN = function(x) {
            as.integer("is.na<-"(x, (duplicated(x) & cumsum(x)) |
                                    c(0, diff(x)) < 0)) }))
# ID T X Y Z
# 1 1 1 A A 0
# 2 1 2 B A 0
# 3 1 3 B B 1
# 4 1 4 B A NA
# 5 2 1 A B 0
# 6 2 2 A A 0
# 7 2 3 B A 0
# 8 2 4 A B 0
# 9 3 1 B B 1
# 10 3 2 B B NA
# 11 3 3 B B NA
# 12 3 4 B A NA
where dat is the name of your data frame. The reordering (dat[order(dat$ID, dat$T), ]) is not necessary if the rows are already ordered along ID and T.
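Since the one-liner is fairly dense, here is a rough sketch of the same logic split into named steps. The helper name first_hit_flag and the intermediate variable names are made up for illustration, and the example data frame is rebuilt inline so the snippet runs on its own.

```r
# Hypothetical helper: same logic as the anonymous function above.
# `hit` is the logical vector X == "B" & Y == "B" for one ID, ordered by T.
first_hit_flag <- function(hit) {
  # A repeated value after at least one TRUE has already occurred
  seen_before <- duplicated(hit) & cumsum(hit) > 0
  # A TRUE -> FALSE transition relative to the previous row
  dropped <- c(0, diff(hit)) < 0
  z <- as.integer(hit)
  z[seen_before | dropped] <- NA
  z
}

# Example data from the question
dat <- data.frame(
  ID = rep(1:3, each = 4),
  T  = rep(1:4, times = 3),
  X  = c("A", "B", "B", "B",  "A", "A", "B", "A",  "B", "B", "B", "B"),
  Y  = c("A", "A", "B", "A",  "B", "A", "A", "B",  "B", "B", "B", "A"),
  stringsAsFactors = FALSE
)

dat <- dat[order(dat$ID, dat$T), ]
dat$Z <- ave(dat$X == "B" & dat$Y == "B", dat$ID, FUN = first_hit_flag)
```

The first condition blanks every row after the first TRUE that repeats an earlier value, and the second catches the row where TRUE switches back to FALSE; together they leave only the zeros before the first hit and the hit itself.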
#2
1
Another possibility using by:
ll <- by(df, df$ID, function(x){
  x$Z <- cumsum(cumsum(x$X == "B" & x$Y == "B"))
  x$Z[x$Z > 1] <- NA
  x
})
df2 <- do.call(rbind, ll)
df2
# ID T X Y Z
# 1.1 1 1 A A 0
# 1.2 1 2 B A 0
# 1.3 1 3 B B 1
# 1.4 1 4 B A NA
# 2.5 2 1 A B 0
# 2.6 2 2 A A 0
# 2.7 2 3 B A 0
# 2.8 2 4 A B 0
# 3.9 3 1 B B 1
# 3.10 3 2 B B NA
# 3.11 3 3 B B NA
# 3.12 3 4 B A NA
Same function but using ddply instead:
library(plyr)
df2 <- ddply(.data = df, .variables = .(ID), function(x){
  x$Z <- cumsum(cumsum(x$X == "B" & x$Y == "B"))
  x$Z[x$Z > 1] <- NA
  x
})
df2
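To see why the double cumsum does the job, it may help to trace it on a single vector. This is only an illustrative sketch: hit is a made-up example for one ID, first_hit is a hypothetical helper name, and the rows are assumed to be already sorted by T within each ID.

```r
# One ID, ordered by T: TRUE where X == "B" & Y == "B"
hit <- c(FALSE, FALSE, TRUE, FALSE, TRUE)

once  <- cumsum(hit)    # 0 0 1 1 2 : stays 0 until the first hit
twice <- cumsum(once)   # 0 0 1 2 4 : equals 1 only on the first hit itself

z <- twice
z[z > 1] <- NA          # everything after the first hit becomes NA
z                       # 0 0 1 NA NA

# The same idea wrapped as a (hypothetical) helper and applied per ID with ave,
# which avoids the plyr dependency.
first_hit <- function(h) {
  out <- cumsum(cumsum(h))
  out[out > 1] <- NA
  out
}

df <- data.frame(
  ID = rep(1:3, each = 4),
  T  = rep(1:4, times = 3),
  X  = c("A", "B", "B", "B",  "A", "A", "B", "A",  "B", "B", "B", "B"),
  Y  = c("A", "A", "B", "A",  "B", "A", "A", "B",  "B", "B", "B", "A"),
  stringsAsFactors = FALSE
)
df$Z <- ave(as.integer(df$X == "B" & df$Y == "B"), df$ID, FUN = first_hit)
```

The first cumsum is 0 before any hit and at least 1 from the first hit onward, so the second cumsum is exactly 1 on the first hit and strictly greater on every later row.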