My starting point is a data frame like df:
df <- data.frame(id = c(rep(2, 3), rep(4, 2)), year = c(2005:2007, 2005:2006), event = c(1, 0, 0, 0, 1))
  id year event
1  2 2005     1
2  2 2006     0
3  2 2007     0
4  4 2005     0
5  4 2006     1
I have a series of actors (identified by an id) who may experience an event in a given year.
What I am trying to build is a series of additional columns that describe a) the distance from events and b) whether such a distance is observable.
This is what I would like to obtain:
  id year event evm2 evm1 evp1 evp2 ndm2 ndm1 ndp1 ndp2
1  2 2005     1    0    0    0    0    1    1    0    0
2  2 2006     0    0    1    0    0    1    0    0    1
3  2 2007     0    1    0    0    0    0    0    1    1
4  4 2005     0    0    0    1    0    1    1    0    1
5  4 2006     1    0    0    0    0    1    0    1    1
event equals 1 when there is an event in a given year. evm1 equals 1 when an event is observed in the year before. Similarly, evp1 is 1 when the event is in the following year. The letters p and m stand for 'plus' and 'minus', and the numbers give the distance in years from the event.
For some observations the distance is not observable because the available time window is too short. This is the case for df[1,], where we don't know whether an event took place in the previous years; there, ndm1 and ndm2 are coded 1. For df[5,], it is ndp1 and ndp2 that are coded 1.
The ev and nd variables are laid out in exactly the same way, but the former tell whether there is an event at a given distance, while the latter flag (with a 1) the distances that fall outside the observed window.
I tried to accomplish this using the following nested for loops, but I didn't succeed.
lag <- c(-2, -1, 1, 2)
df2 <- df
df2[, 4:11] <- 0
colnames(df2) <- c("id", "year", "event", "evm2", "evm1", "evp1", "evp2", "ndm2", "ndm1", "ndp1", "ndp2")
for (i in length(df2$id)) {
  id <- df2[i, 1]
  yr <- df2[i, 2]
  sta <- 3
  sta2 <- 7
  for (j in lag) {
    sta <- sta + 1
    sta2 <- sta2 + 1
    if !is.null(df2[df2$id==id & df2$year==yr+j])==TRUE {
      rw <- which(df2[df2$id==id & df2$year==yr+j])
      if (df2[rw, 3]==1) df2[i, sta]==1
    } else {
      df2[i, sta2]==1
    }
  }
}
Do you see anything that may be responsible for the errors? I have been going mad for two days trying to make it work and I would be really thankful if you could help.
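For reference, several problems are visible in the loop as posted: `for (i in length(df2$id))` visits only the last row, the `if` condition is not wrapped in parentheses, the subsetting has no comma (so it selects columns, and `is.null()` on the result is never TRUE), `which()` is given a data frame instead of a logical vector, and `==` is used where an assignment `<-` is intended. The block below is a minimal corrected sketch that keeps the nested-loop idea; `ev_cols` and `nd_cols` are names introduced here for illustration, and it assumes at most one row per id and year.

```r
# Example data from the question
df <- data.frame(id = c(rep(2, 3), rep(4, 2)),
                 year = c(2005:2007, 2005:2006),
                 event = c(1, 0, 0, 0, 1))

lag <- c(-2, -1, 1, 2)
ev_cols <- c("evm2", "evm1", "evp1", "evp2")  # hypothetical helper names
nd_cols <- c("ndm2", "ndm1", "ndp1", "ndp2")

df2 <- df
df2[, c(ev_cols, nd_cols)] <- 0               # start with all indicators at 0

for (i in seq_len(nrow(df2))) {               # visit every row
  for (k in seq_along(lag)) {
    # row(s) of the same actor at the lagged year
    rw <- which(df2$id == df2$id[i] & df2$year == df2$year[i] + lag[k])
    if (length(rw) == 0) {
      df2[i, nd_cols[k]] <- 1                 # year outside the observed window
    } else if (df2$event[rw[1]] == 1) {
      df2[i, ev_cols[k]] <- 1                 # event at that distance
    }
  }
}

df2
```

On the five example rows this yields the table requested above. Looping row by row is fine for small data; on larger data a grouped, vectorised approach is usually preferable.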
1 solution
#1
Following my comment, here is what I had in mind as a potential rewrite:
lag.it <- function(x, n = 0L) {
  # Shift x by n positions, padding with NA:
  # n < 0 takes the value |n| rows earlier, n > 0 the value n rows later.
  l <- length(x)
  neg.lag <- min(max(0L, -n), l)
  pos.lag <- min(max(0L, +n), l)
  c(rep(NA, +neg.lag),
    head(x, -neg.lag),
    tail(x, -pos.lag),
    rep(NA, +pos.lag))
}

library(plyr)
# Apply the shifts within each id; rows are assumed sorted by year with no gaps.
ddply(df, "id", transform,
      evm2 = lag.it(event, -2),
      evm1 = lag.it(event, -1),
      evp1 = lag.it(event, +1),
      evp2 = lag.it(event, +2))
#   id year event evm2 evm1 evp1 evp2
# 1  2 2005     1   NA   NA    0    0
# 2  2 2006     0   NA    1    0   NA
# 3  2 2007     0    1    0   NA   NA
# 4  4 2005     0   NA   NA    1   NA
# 5  4 2006     1   NA    0   NA   NA
Notice how I use NAs instead of two sets of variables. While I'd recommend keeping it this way, you can easily get what you asked for by defining e.g. ndm2 as is.na(evm2) and then replacing the NAs with zeroes.
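The conversion described above can be sketched as follows. To keep the example self-contained without plyr, the grouped shift is done here with base R's `ave()` as a stand-in for the `ddply` call; the `lags` vector and the loop structure are assumptions made for illustration, not part of the original answer.

```r
# Example data and helper copied from the post
df <- data.frame(id = c(rep(2, 3), rep(4, 2)),
                 year = c(2005:2007, 2005:2006),
                 event = c(1, 0, 0, 0, 1))

lag.it <- function(x, n = 0L) {
  l <- length(x)
  neg.lag <- min(max(0L, -n), l)
  pos.lag <- min(max(0L, +n), l)
  c(rep(NA, neg.lag), head(x, -neg.lag), tail(x, -pos.lag), rep(NA, pos.lag))
}

# Column name -> shift (illustrative helper, not in the original answer)
lags <- c(evm2 = -2, evm1 = -1, evp1 = 1, evp2 = 2)

res <- df
for (ev in names(lags)) {
  # shift event within each id; NA marks a year outside the window
  res[[ev]] <- ave(df$event, df$id, FUN = function(x) lag.it(x, lags[[ev]]))
}
for (ev in names(lags)) {
  nd <- sub("^ev", "nd", ev)                 # evm2 -> ndm2, and so on
  res[[nd]] <- as.integer(is.na(res[[ev]]))  # 1 when the distance is not observable
  res[[ev]][is.na(res[[ev]])] <- 0           # then replace the NAs by zeroes
}

res
```

On the example data this reproduces the eight ev and nd columns requested in the question. Note that the nd columns must be derived before the NAs are overwritten, since afterwards the two cases can no longer be told apart.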