I am trying to extract interesting statistics from an irregular time series data set, but I'm coming up short on finding the right tools for the job. Tools for manipulating regularly sampled time series, or index-based series of any kind, are easy to find, but I'm not having much luck with the problems I'm trying to solve.
First, a reproducible data set:
library(zoo)
set.seed(0)
nSamples <- 5000
vecDT <- rexp(nSamples, 3)
vecTimes <- cumsum(c(0,vecDT))
vecDrift <- c(0, rnorm(nSamples, mean = 1/nSamples, sd = 0.01))
vecVals <- cumsum(vecDrift)
vecZ <- zoo(vecVals, order.by = vecTimes)
rm(vecDT, vecDrift)
Assume the times are in seconds. The vecZ series spans almost 1700 seconds (just shy of 30 minutes) and contains 5001 entries. (NB: I'd try using xts, but xts seems to need date information, and I'd rather not use a particular date when it's not relevant.)
My goals are the following:
- Identify the indices of the values 3 minutes before and 3 minutes after each point. As the times are continuous, I doubt that any two points are precisely 3 minutes apart. What I'd like to find are the points that are at most 3 minutes prior to, and at least 3 minutes after, the given point, i.e. something like the following (in pseudocode):
backIX(t, vecZ, tDelta) = min{ix in 1:length(vecZ) : time(t) - time(ix) < tDelta}
forwardIX(t, vecZ, tDelta) = min{ix in 1:length(vecZ) : time(ix) - time(t) > tDelta}
So, for 3 minutes, tDelta = 180. If t = 2500, then the result of forwardIX() would be 3012 (i.e. time(vecZ)[2500] is 860.1462, and time(vecZ)[3012] is 1040.403, or just over 180 seconds later), and the output of backIX() would be 2020 (corresponding to time 680.7162 seconds).
Ideally, I would like to use a function that does not require t, as that would require length(vecZ) calls to the function, which ignores the fact that sliding windows of time can be calculated more efficiently.
- Apply a function to all values in a rolling window of time. I've seen rollapply, which takes a fixed window size (i.e. a fixed number of indices, but not a fixed window of time). I can solve this the naive way, with a loop (or foreach ;-)) that is calculated per index t, but I wondered if there are some simple functions already implemented, e.g. a function to calculate the mean of all values in a given time frame. Since this can be done efficiently via simple summary statistics that slide over a window, it should be computationally cheaper than a function that accesses all of the data multiple times to calculate each statistic. Some fairly natural functions: mean, min, max, and median.
Even if the window isn't varying by time, the ability to vary the window size would be adequate, and I can find that window size using the result of the question above. However, that still seems to require excess calculations, so being able to specify time-based intervals seems more efficient.
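To make the first goal concrete, here is a minimal sketch of computing both endpoints for every point in one vectorised pass, rather than once per t. It relies on base R's findInterval() and on the times being sorted; the helper name windowIX is hypothetical, not from any package, and the times are rebuilt the same way as vecTimes above.

```r
# Hypothetical helper (not from any package): window endpoints for every
# point at once, following the pseudocode definitions of backIX/forwardIX.
windowIX <- function(times, tDelta) {
  stopifnot(!is.unsorted(times))
  n <- length(times)
  # findInterval(x, times) counts the entries of `times` that are <= x,
  # so adding 1 gives the first index whose time is strictly greater than x.
  back <- findInterval(times - tDelta, times) + 1L
  fwd  <- findInterval(times + tDelta, times) + 1L
  fwd[fwd > n] <- NA_integer_   # no point lies more than tDelta ahead
  data.frame(backIX = back, forwardIX = fwd)
}

set.seed(0)
times <- cumsum(c(0, rexp(5000, 3)))   # same construction as vecTimes
ix <- windowIX(times, tDelta = 180)
ix[2500, ]                             # endpoints for the point at index 2500
```

Points near the end of the series have no forward endpoint, so forwardIX is NA there; points near the start simply get backIX = 1.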
Are there packages in R that facilitate such manipulations of data in time-windows, or am I out of luck and I should write my own functions?
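For reference, a minimal sketch of what hand-written versions might look like. It assumes a trailing window (t - tDelta, t] ending at each observation, which is only one of several reasonable window definitions. Both function names are hypothetical. The generic version still calls FUN once per point; the mean-only version avoids that by differencing a cumulative sum.

```r
# Hypothetical helper: apply FUN to the values whose times fall in the
# trailing window (t - tDelta, t] for every observation time t.
rollTimeApply <- function(times, vals, tDelta, FUN = mean) {
  stopifnot(length(times) == length(vals), !is.unsorted(times))
  # First index strictly inside the window; the window ends at the point itself.
  first <- findInterval(times - tDelta, times) + 1L
  vapply(seq_along(times),
         function(i) FUN(vals[first[i]:i]),
         numeric(1))
}

# Hypothetical helper: the mean needs no per-window pass, because each
# window sum is the difference of two entries of a cumulative sum.
rollTimeMean <- function(times, vals, tDelta) {
  stopifnot(length(times) == length(vals), !is.unsorted(times))
  first <- findInterval(times - tDelta, times) + 1L
  last  <- seq_along(vals)
  cs <- c(0, cumsum(vals))
  (cs[last + 1L] - cs[first]) / (last - first + 1L)
}

# Small made-up example: five irregular times and a 3-second window.
tt <- c(0, 1, 2.5, 4, 10)
vv <- c(1, 2, 3, 4, 5)
rollTimeApply(tt, vv, tDelta = 3, FUN = max)
rollTimeMean(tt, vv, tDelta = 3)
```

The cumulative-sum shortcut only works for statistics that decompose this way (sums, means); min, max, and median still need the generic version or a dedicated sliding-window algorithm. On very long series the differenced cumulative sum can also accumulate floating-point error.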
Note 1: This question seeks to do something similar, except over disjoint intervals, rather than rolling windows of time, e.g. I could adapt this to do my analysis on every successive 3 minute block, but I don't see a way to adapt this for rolling 3 minute intervals.
Note 2: I've found that switching from a zoo object to a numeric vector (for the times) has significantly sped up range-finding / window endpoint identification for the first goal. That's still a naive algorithm, but it's worth mentioning that working with zoo objects may not be optimal for the naive approach.
1 Solution
#1
Here's what I was suggesting, but I'm not sure it exactly answers your question:
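# Picking up where your code left off
library(xts)
library(TTR)
# Build an xts object, treating the times as seconds since the epoch
x <- .xts(vecZ, vecTimes)
# Merge onto a regular one-second grid and carry the last observation forward
xx <- na.locf(cbind(xts(, seq.POSIXt(from=start(x), to=end(x), by='sec')), x))
# Running mean over a trailing window of 180 rows of the filled series
x$means <- runMean(xx, n=180)
# Keep only the rows that correspond to original observations
out <- x[!is.na(x[, 1]), ]
tail(out)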
x means
1969-12-31 18:28:17.376141 0.2053531 0.1325938
1969-12-31 18:28:17.379140 0.2101565 0.1329065
1969-12-31 18:28:17.619840 0.2139770 0.1332403
1969-12-31 18:28:17.762765 0.2072574 0.1335843
1969-12-31 18:28:17.866473 0.2065790 0.1339608
1969-12-31 18:28:17.924270 0.2114755 0.1344264