A more efficient approach than a 'for' loop in R

I'm a relative newcomer to R so I'm sorry if there's an obvious answer to this. I've looked at other questions and I think 'apply' is the answer but I can't work out how to use it in this case.

I've got a longitudinal survey where participants are invited every year. In some years they fail to take part, and sometimes they die. I need to identify which participants have taken part in an unbroken 'streak' from the start of the survey (i.e. if they stop, they stop for good).

I've done this with a 'for' loop, which works fine in the example below. But I have many years and many participants, and the loop is very slow. Is there a faster approach I could use?

In the example, TRUE means they participated in that year. The loop creates two vectors - 'finalyear' for the last year they took part, and 'streak' to show if they completed all years before the finalyear (i.e. cases 1, 3 and 5).

dat <- data.frame(ids = 1:5, "1999" = c(T, T, T, F, T), "2000" = c(T, F, T, F, T), "2001" = c(T, T, T, T, T), "2002" = c(F, T, T, T, T), "2003" = c(F, T, T, T, F))
finalyear <- NULL
streak <- NULL
for (i in 1:nrow(dat)) {
    x <- as.numeric(dat[i,2:6])
    y <- max(grep(1, x))
    finalyear[i] <- y
    streak[i] <- sum(x) == y
}
dat$finalyear <- finalyear
dat$streak <- streak

Thanks!

4 Solutions

#1

We could use max.col and rowSums as a vectorized approach.

dat$finalyear <- max.col(dat[-1], 'last')

If there are rows without TRUE values, we can make sure to return 0 for that row by multiplying with the double negation of rowSums. The FALSE will be coerced to 0 and multiplying with 0 returns 0 for that row.

# use dat[2:6] so a previously added 'finalyear' column is not included
dat$finalyear <- max.col(dat[2:6], 'last') * !!rowSums(dat[2:6])
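
To see why the multiplication matters, here is a minimal sketch on a hypothetical data frame `demo` (not part of the question) whose second row has no TRUE values. The names `demo`, `last_true` and `has_any` are made up for illustration.

```r
# Hypothetical data: the second participant never took part.
demo <- data.frame(ids = 1:2,
                   y1 = c(TRUE, FALSE),
                   y2 = c(TRUE, FALSE),
                   y3 = c(FALSE, FALSE))

m <- as.matrix(demo[-1])          # logical matrix of the year columns

# An all-FALSE row is one big tie, so ties.method "last" picks the final column.
last_true <- max.col(m, "last")   # 2 3

# rowSums() counts the TRUEs; !! turns any non-zero count into TRUE.
has_any <- !!rowSums(m)           # TRUE FALSE

# FALSE is coerced to 0, which zeroes the row without any TRUE.
finalyear <- last_true * has_any  # 2 0
```

Without the `has_any` factor, the second participant would be reported as having a final year of 3 even though they never took part.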

Then, we create the 'streak' column by comparing the rowSums of columns 2:6 with 'finalyear':

dat$streak <-  rowSums(dat[,2:6])==dat$finalyear
dat
#   ids X1999 X2000 X2001 X2002 X2003 finalyear streak
#1   1  TRUE  TRUE  TRUE FALSE FALSE         3   TRUE
#2   2  TRUE FALSE  TRUE  TRUE  TRUE         5  FALSE
#3   3  TRUE  TRUE  TRUE  TRUE  TRUE         5   TRUE
#4   4 FALSE FALSE  TRUE  TRUE  TRUE         5  FALSE
#5   5  TRUE  TRUE  TRUE  TRUE FALSE         4   TRUE

Or as a single dplyr call (it would fit on one line, but is split over two here for readability), as suggested by @ColonelBeauvel:

library(dplyr)
# dat[2:6] selects only the year columns, even if 'finalyear' and 'streak' already exist
mutate(dat, finalyear = max.col(dat[2:6], 'last'),
            streak = rowSums(dat[2:6]) == finalyear)

#2

For-loops are not inherently bad in R, but they are slow if you grow vectors iteratively (like you are doing). There are often better ways to do things. Here is a solution that uses only apply functions:

dat$finalyear <- apply(dat[,2:6],MARGIN=1,function(x){max(which(x))})
dat$streak <-  apply(dat[,2:7],MARGIN=1,function(x){sum(x[1:5])==x[6]})

Or option 2, based on comment by @Spacedman:

dat$finalyear <- apply(dat[,2:6],MARGIN=1,function(x){max(which(x))})
dat$streak <-  apply(dat[,2:6],MARGIN=1,function(x){max(which(x))==sum(x)})

> dat
  ids X1999 X2000 X2001 X2002 X2003 finalyear streak
1   1  TRUE  TRUE  TRUE FALSE FALSE         3   TRUE
2   2  TRUE FALSE  TRUE  TRUE  TRUE         5  FALSE
3   3  TRUE  TRUE  TRUE  TRUE  TRUE         5   TRUE
4   4 FALSE FALSE  TRUE  TRUE  TRUE         5  FALSE
5   5  TRUE  TRUE  TRUE  TRUE FALSE         4   TRUE
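
If you would rather keep an explicit loop, the point about growing vectors can be addressed by allocating the result vectors up front. The following is a minimal sketch, not a benchmarked recommendation. It also converts the year columns to a matrix once, since subsetting a data frame row by row tends to be costly in itself. Treating a participant with no TRUE years as `finalyear` 0 and `streak` FALSE is an assumption that the question does not cover.

```r
# Example data from the question.
dat <- data.frame(ids = 1:5,
                  "1999" = c(TRUE, TRUE, TRUE, FALSE, TRUE),
                  "2000" = c(TRUE, FALSE, TRUE, FALSE, TRUE),
                  "2001" = c(TRUE, TRUE, TRUE, TRUE, TRUE),
                  "2002" = c(FALSE, TRUE, TRUE, TRUE, TRUE),
                  "2003" = c(FALSE, TRUE, TRUE, TRUE, FALSE))

m <- as.matrix(dat[, 2:6])   # convert once; matrix rows are cheap to index
n <- nrow(m)

finalyear <- integer(n)      # preallocated to full length
streak    <- logical(n)

for (i in seq_len(n)) {
  idx <- which(m[i, ])       # positions of the TRUE years in this row
  # Assumption: no participation at all gives finalyear 0 and streak FALSE.
  finalyear[i] <- if (length(idx)) max(idx) else 0L
  streak[i]    <- length(idx) > 0 && length(idx) == max(idx)
}

dat$finalyear <- finalyear
dat$streak    <- streak
```

The streak test is the same idea as in the question: the number of TRUE years equals the position of the last TRUE year only when there are no gaps before it.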

#3

Here is a solution with dplyr and tidyr.

library(dplyr)
library(tidyr)

gather(data = dat, year, value, -ids) %>%
  mutate(year = as.integer(gsub("X", "", year))) %>%
  group_by(ids) %>%
  summarize(finalyear = last(year[value]),
            # every year up to and including finalyear must be TRUE
            streak = all(value[year <= finalyear]))

output

  ids finalyear streak
1   1      2001   TRUE
2   2      2003  FALSE
3   3      2003   TRUE
4   4      2003  FALSE
5   5      2002   TRUE
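
Note that this answer reports `finalyear` as a calendar year, while the other answers report the column position. If you use one of the position-based answers and want calendar years, a base R mapping could look like the sketch below. The vectors `year_cols` and `finalpos` are written out by hand here for illustration; in practice they would come from `names(dat)[2:6]` and `dat$finalyear`.

```r
# Column names as data.frame() creates them from "1999", "2000", ...
year_cols <- c("X1999", "X2000", "X2001", "X2002", "X2003")

# Column positions of the last TRUE, as returned by the position-based answers.
finalpos <- c(3, 5, 5, 5, 4)

# Strip the leading "X" and use the positions as an index into the years.
years <- as.integer(sub("^X", "", year_cols))
years[finalpos]   # 2001 2003 2003 2003 2002
```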

#4

Here's a base version using apply to loop over rows and rle to see how often the state changes. Your condition seems to be equivalent to the state starting as TRUE and changing to FALSE at most once, so I test that the rle has at most two runs and that the first value is TRUE:

> dat$streak = apply(dat[,2:6],1,function(r){r[1] && length(rle(r)$lengths)<=2})
> 
> dat
  ids X1999 X2000 X2001 X2002 X2003 streak
1   1  TRUE  TRUE  TRUE FALSE FALSE   TRUE
2   2  TRUE FALSE  TRUE  TRUE  TRUE  FALSE
3   3  TRUE  TRUE  TRUE  TRUE  TRUE   TRUE
4   4 FALSE FALSE  TRUE  TRUE  TRUE  FALSE
5   5  TRUE  TRUE  TRUE  TRUE FALSE   TRUE
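
To make the run-length idea concrete, here is a small sketch showing what `rle` returns for two rows of the example data. The helper name `is_streak` is made up for this illustration.

```r
r_streak <- c(TRUE, TRUE, TRUE, FALSE, FALSE)  # row 1: takes part, then stops for good
r_gap    <- c(TRUE, FALSE, TRUE, TRUE, TRUE)   # row 2: drops out, then returns

rle(r_streak)$lengths   # 3 2   -> two runs (TRUE, then FALSE)
rle(r_gap)$lengths      # 1 1 3 -> three runs, so the state changed more than once

# A streak starts with TRUE and has at most two runs.
is_streak <- function(r) r[1] && length(rle(r)$lengths) <= 2

is_streak(r_streak)     # TRUE
is_streak(r_gap)        # FALSE
```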

There are probably loads of ways of working out finalyear; this one just finds the last element of each row that is TRUE:

> dat$finalyear = apply(dat[,2:6], 1, function(r){max(which(r))})
> dat
  ids X1999 X2000 X2001 X2002 X2003 streak finalyear
1   1  TRUE  TRUE  TRUE FALSE FALSE   TRUE         3
2   2  TRUE FALSE  TRUE  TRUE  TRUE  FALSE         5
3   3  TRUE  TRUE  TRUE  TRUE  TRUE   TRUE         5
4   4 FALSE FALSE  TRUE  TRUE  TRUE  FALSE         5
5   5  TRUE  TRUE  TRUE  TRUE FALSE   TRUE         4
