I have data (values) collected over several years by numerous groups. I would like to calculate the 12 month rolling median (using the 12 previous months) for each group. I've looked into the zoo (and other) roll functions and they seem to all be fixed interval (eg 12 months = 12 rows), but the date intervals in my data are not consistent for each group. Data values were collected every month or every other month, but there are additional gaps as well. I guess I need a rolling median function that gathers values for the previous 364 days for each group. I'd appreciate some help!
我的数据(价值)被许多团体收集了好几年。我想计算每个组12个月的滚动中位数(使用前12个月)。我研究过动物园(和其他)滚动函数,它们似乎都是固定的间隔(例如12个月= 12行),但是我的数据中的日期间隔对于每个组来说并不一致。每个月或每隔一个月收集数据值,但也有额外的差距。我想我需要一个滚动中值函数来为每组收集前364天的值。我欣赏一些帮助!
Below is an example of my data:
下面是我的数据示例:
Date Group Value
8/17/2013 A 5
10/2/2013 A 13
1/15/2014 A 11
3/15/2014 A 2
5/22/2014 A 7
7/15/2014 A 1
9/3/2014 A 1
11/15/2014 A 7
7/22/2013 B 13
8/5/2013 B 13
9/7/2013 B 12
10/16/2013 B 6
11/17/2013 B 5
12/9/2013 B 15
1/30/2014 B 1
2/23/2014 B 10
3/24/2014 B 15
4/5/2014 B 3
5/26/2014 B 3
6/16/2014 B 4
8/5/2014 B 6
9/26/2014 B 8
10/16/2014 B 15
11/29/2014 B 12
12/13/2016 B 1
I want to add a column to this table "Rolling Median" that contains the rolling 12 month (or 365 day) median for each group.
我想在这个“滚动中位数”表中添加一列,其中包含每个组12个月(或365天)的中位数。
2 个解决方案
#1
2
This package might be helpful to you:
这个包裹可能对你有帮助:
https://github.com/mgahan/boRingTrees
https://github.com/mgahan/boRingTrees
It is called boRingTrees
and it handles this type of problem. If you don't want to download the package, you can use the following code as well:
它被称为boRingTrees,它处理这类问题。如果您不想下载该软件包,也可以使用以下代码:
##Utilize the data.table package
library(data.table)
setDT(data)
data[, Date2 := as.Date(Date,format="%m/%d/%Y")] #Format date field
#Apply rollingByCalcs function (full function code is below)
data[, Roll_Median := rollingByCalcs(data,bylist=c("Group"),dates="Date2",target="Value",
lower=0,upper=365,incbounds=T,stat=median,na.rm=T,cores=1)]
rollingByCalcs <- function(data,bylist=NULL,dates,target=NULL,
lower,upper,incbounds=T,stat=length,na.rm=T,cores=1){
tic <- Sys.time()
require("data.table")
require("parallel")
data <- data.table(data)
if (is.null(bylist)){
data[, id.filler := 1]
bylist <- "id.filler"
}
if (is.null(target)){
data[,target:=1]
target <- "target"
}
##Create group by variable
data[,Grp.Var:=.GRP,by=bylist]
##Assign variable names
data[,target:=data[,eval(parse(text=target))]]
data[,dates:=data[,eval(parse(text=dates))]]
##Create "list" of comparison dates
Ref <- data[,list(Compare_Value=list(I(target)),Compare_Date=list(I(dates))), by=c("Grp.Var")]
##Compare two lists and see of the compare date is within N days
data$Roll.Val <- mcmapply(FUN = function(RD, NUM) {
d <- as.numeric(RD-Ref$Compare_Date[[NUM]])
true.vals <- between(x=d,lower=lower,upper=upper,incbounds=incbounds)
out <- stat(Ref$Compare_Value[[NUM]][true.vals])
return(out)
}, RD = data$dates,NUM=data$Grp.Var,mc.cores=cores)
print(Sys.time()-tic)
return(data$Roll.Val)
}
#2
0
You could write a helper function. Here's one using the dplyr package:
你可以写一个辅助函数。这里有一个使用dplyr包的:
library(dplyr)
rollingMedian <- function(targetDate, targetGroup) {
dat %>%
mutate(thisDiff = difftime(as.Date(Date), targetDate, unit = "days")) %>%
filter(thisDiff < 0, thisDiff > -366, Group == targetGroup) %>%
summarise(medValue = median(Value))
}
dat$rollingMed <- mapply(rollingMedian, dat$Date, dat$Group)
Result:
结果:
dat
Date Group Value rollingMed
1 2013-08-17 A 5 NA
2 2013-10-02 A 13 5
3 2014-01-15 A 11 9
4 2014-03-15 A 2 11
5 2014-05-22 A 7 8
6 2014-07-15 A 1 7
7 2014-09-03 A 1 7
8 2014-11-15 A 7 2
9 2013-07-22 B 13 NA
10 2013-08-05 B 13 13
...
Data used:
数据使用:
dat <- structure(list(Date = structure(c(1376697600, 1380672000, 1389744000,
1394841600, 1400716800, 1405382400, 1409702400, 1416009600, 1374451200,
1375660800, 1378512000, 1381881600, 1384646400, 1386547200, 1391040000,
1393113600, 1395619200, 1396656000, 1401062400, 1402876800, 1407196800,
1411689600, 1413417600, 1417219200, 1481587200), tzone = "UTC", class = c("POSIXct",
"POSIXt")), Group = c("A", "A", "A", "A", "A", "A", "A", "A",
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B",
"B", "B", "B", "B"), Value = c(5L, 13L, 11L, 2L, 7L, 1L, 1L,
7L, 13L, 13L, 12L, 6L, 5L, 15L, 1L, 10L, 15L, 3L, 3L, 4L, 6L,
8L, 15L, 12L, 1L)), .Names = c("Date", "Group", "Value"), row.names = c(NA,
-25L), class = "data.frame")
#1
2
This package might be helpful to you:
这个包裹可能对你有帮助:
https://github.com/mgahan/boRingTrees
https://github.com/mgahan/boRingTrees
It is called boRingTrees
and it handles this type of problem. If you don't want to download the package, you can use the following code as well:
它被称为boRingTrees,它处理这类问题。如果您不想下载该软件包,也可以使用以下代码:
##Utilize the data.table package
library(data.table)
setDT(data)
data[, Date2 := as.Date(Date,format="%m/%d/%Y")] #Format date field
#Apply rollingByCalcs function (full function code is below)
data[, Roll_Median := rollingByCalcs(data,bylist=c("Group"),dates="Date2",target="Value",
lower=0,upper=365,incbounds=T,stat=median,na.rm=T,cores=1)]
rollingByCalcs <- function(data,bylist=NULL,dates,target=NULL,
lower,upper,incbounds=T,stat=length,na.rm=T,cores=1){
tic <- Sys.time()
require("data.table")
require("parallel")
data <- data.table(data)
if (is.null(bylist)){
data[, id.filler := 1]
bylist <- "id.filler"
}
if (is.null(target)){
data[,target:=1]
target <- "target"
}
##Create group by variable
data[,Grp.Var:=.GRP,by=bylist]
##Assign variable names
data[,target:=data[,eval(parse(text=target))]]
data[,dates:=data[,eval(parse(text=dates))]]
##Create "list" of comparison dates
Ref <- data[,list(Compare_Value=list(I(target)),Compare_Date=list(I(dates))), by=c("Grp.Var")]
##Compare two lists and see of the compare date is within N days
data$Roll.Val <- mcmapply(FUN = function(RD, NUM) {
d <- as.numeric(RD-Ref$Compare_Date[[NUM]])
true.vals <- between(x=d,lower=lower,upper=upper,incbounds=incbounds)
out <- stat(Ref$Compare_Value[[NUM]][true.vals])
return(out)
}, RD = data$dates,NUM=data$Grp.Var,mc.cores=cores)
print(Sys.time()-tic)
return(data$Roll.Val)
}
#2
0
You could write a helper function. Here's one using the dplyr package:
你可以写一个辅助函数。这里有一个使用dplyr包的:
library(dplyr)
rollingMedian <- function(targetDate, targetGroup) {
dat %>%
mutate(thisDiff = difftime(as.Date(Date), targetDate, unit = "days")) %>%
filter(thisDiff < 0, thisDiff > -366, Group == targetGroup) %>%
summarise(medValue = median(Value))
}
dat$rollingMed <- mapply(rollingMedian, dat$Date, dat$Group)
Result:
结果:
dat
Date Group Value rollingMed
1 2013-08-17 A 5 NA
2 2013-10-02 A 13 5
3 2014-01-15 A 11 9
4 2014-03-15 A 2 11
5 2014-05-22 A 7 8
6 2014-07-15 A 1 7
7 2014-09-03 A 1 7
8 2014-11-15 A 7 2
9 2013-07-22 B 13 NA
10 2013-08-05 B 13 13
...
Data used:
数据使用:
dat <- structure(list(Date = structure(c(1376697600, 1380672000, 1389744000,
1394841600, 1400716800, 1405382400, 1409702400, 1416009600, 1374451200,
1375660800, 1378512000, 1381881600, 1384646400, 1386547200, 1391040000,
1393113600, 1395619200, 1396656000, 1401062400, 1402876800, 1407196800,
1411689600, 1413417600, 1417219200, 1481587200), tzone = "UTC", class = c("POSIXct",
"POSIXt")), Group = c("A", "A", "A", "A", "A", "A", "A", "A",
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B",
"B", "B", "B", "B"), Value = c(5L, 13L, 11L, 2L, 7L, 1L, 1L,
7L, 13L, 13L, 12L, 6L, 5L, 15L, 1L, 10L, 15L, 3L, 3L, 4L, 6L,
8L, 15L, 12L, 1L)), .Names = c("Date", "Group", "Value"), row.names = c(NA,
-25L), class = "data.frame")