Qualitative time series data, duration of states

I have a longitudinal dataset with a time variable and a qualitative variable. My subject can be in one of three states; sometimes the state changes and sometimes it stays the same.

What I would like to produce is a new data frame that gives, for each time the subject enters a state, the time at which it first entered that state and how long it stayed there. My end goal is to see whether state switches occur more or less often under different treatments, whether the length of a stay differs between states, whether it changes over time, and so on.

Example data:

set.seed(1)
Data=data.frame(time=1:100,State=sample(c('a','b','c'),100,replace=TRUE))

The first few lines of Data look like this:

       time       State
1      1          a
2      2          b
3      3          b
4      4          c
5      5          a
6      6          c
7      7          c

I would like to produce this:

       StartTime  State    Duration
1      1          a        1
2      2          b        2
3      4          c        1
4      5          a        1
5      6          c        2

I can probably achieve this with a while loop, but that seems highly inefficient, especially since my actual data has 700000 lines per subject. Is there a better way to do it? Maybe something with the diff function and %in%? I can't figure it out.

1 solution

#1

set.seed(1)
Data=data.frame(time=1:100,State=sample(c('a','b','c'),100,replace=TRUE))

Use data.table with data of that size:

library(data.table)
setDT(Data)
head(Data)
#   time State
#1:    1     a
#2:    2     b
#3:    3     b
#4:    4     c
#5:    5     a
#6:    6     c

Give each state run a number:

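Data[, state_run := cumsum(c(TRUE, diff(as.integer(factor(State))) != 0L))]
#factor() makes this work whether State is a character or a factor variable;
#without it, as.integer() on a character column would return NA
#data.table's rleid(State) produces the same run numbers more directly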
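To see why the cumsum/diff expression numbers the runs, here is a minimal sketch on a hand-written toy vector. The vector is an assumption, copied from the first seven rows shown in the question rather than drawn from the sampled Data, and the names state, is_new_run and state_run are illustrative only.

```r
# Toy vector (assumption: copied from the first rows shown in the question)
state <- factor(c("a", "b", "b", "c", "a", "c", "c"))

# TRUE for the first row and for every row whose state differs from the row before it
is_new_run <- c(TRUE, diff(as.integer(state)) != 0L)

# The running total of those flags gives each run of identical states its own number
state_run <- cumsum(is_new_run)
print(state_run)
# [1] 1 2 2 3 4 5 5
```

Rows 2 and 3 share run number 2 and rows 6 and 7 share run number 5, so grouping by this number collects each uninterrupted stay in a state.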

Find the values of interest for each state run:

Data2 <- Data[, list(StartTime = min(time),
                     State = State[1],
                     Duration = diff(range(time)) + 1), by = state_run]
head(Data2)
#   state_run StartTime State Duration
#1:         1         1     a        1
#2:         2         2     b        2
#3:         3         4     c        1
#4:         4         5     a        1
#5:         5         6     c        2
#6:         6         8     b        2
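For comparison, the same table can probably be built in base R with rle(), which may be handy as a cross-check on a small slice of the data. This is only a sketch, not part of the answer above: the toy data frame and the names toy, runs, start_idx and result are assumptions, and Duration here counts rows, which equals elapsed time only when observations are one time unit apart (as in the example data).

```r
# Toy data (assumption: the first seven rows shown in the question)
toy <- data.frame(time = 1:7,
                  State = c("a", "b", "b", "c", "a", "c", "c"))

# rle() returns the value and length of each run of identical consecutive values
runs <- rle(as.character(toy$State))

# Row index where each run starts: 1 for the first run, then shifted by earlier run lengths
start_idx <- cumsum(c(1L, head(runs$lengths, -1L)))

result <- data.frame(StartTime = toy$time[start_idx],
                     State = runs$values,
                     Duration = runs$lengths)
print(result)
#   StartTime State Duration
# 1         1     a        1
# 2         2     b        2
# 3         4     c        1
# 4         5     a        1
# 5         6     c        2
```

With several subjects, this would have to be applied to each subject separately so that a run does not continue across a subject boundary; in the data.table version the equivalent is to add the subject column to by.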
