I am trying to sum the values in one data frame that fall between two dates held in another data frame. The answers I found on Stack Overflow don't seem to work for my case. I have tried data.table, but to no avail, so here goes.
Create the Date Ranges
MeanRemaining <- seq(as.Date("2017-01-01"),as.Date("2017-02-28"),2)
MeanRemaining<-as.data.frame(cbind(MeanRemaining,lag(MeanRemaining)))
colnames(MeanRemaining)<-c("InspDate", "PrevInspDate")
MeanRemaining$InspDate<-as.Date(MeanRemaining$InspDate, origin = "1970/01/01")
MeanRemaining$PrevInspDate<-as.Date(MeanRemaining$PrevInspDate, origin = "1970/01/01")
Note that the real date ranges are not fixed like the ones above: consecutive inspection dates can be anything up to about a week apart.
Create the values to sum
DailyTonnes <- as.data.frame(cbind(as.data.frame(seq(as.Date("2016-12-01"),as.Date("2017-03-28"),1)),(replicate(1,sample(abs(rnorm(118))*1000,replace=TRUE)))))
colnames(DailyTonnes)<-c("date","Vol")
The Aim
I want to sum 'Vol' from 'DailyTonnes' between each of the date ranges in 'MeanRemaining' and return the total 'Vol' to the corresponding row in 'MeanRemaining'.
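To make the target concrete, here is a minimal base R sketch of the calculation on toy data. The tables `daily` and `ranges`, their values, and the column names `start` and `end` are hypothetical, and the sketch assumes each range includes its start date and excludes its end date, which the description above does not settle.

```r
# Toy data (hypothetical values, chosen so the sums are easy to check by hand)
daily <- data.frame(
  date = seq(as.Date("2017-01-01"), as.Date("2017-01-06"), by = 1),
  Vol  = c(10, 20, 30, 40, 50, 60)
)
ranges <- data.frame(
  start = as.Date(c("2017-01-01", "2017-01-03", "2017-01-05")),
  end   = as.Date(c("2017-01-03", "2017-01-05", "2017-01-07"))
)

# For each row of `ranges`, sum Vol over the half-open interval [start, end)
ranges$TotalVol <- vapply(
  seq_len(nrow(ranges)),
  function(k) {
    in_range <- daily$date >= ranges$start[k] & daily$date < ranges$end[k]
    sum(daily$Vol[in_range])
  },
  numeric(1)
)

print(ranges)
```

This loop is only meant to pin down the expected result; it scans `daily` once per range, so it would be slow on large tables.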
With some help from similar questions, I have tried:
library(data.table)
setDT(MeanRemaining)
setDT(DailyTonnes)
MeanRemaining[DailyTonnes[MeanRemaining, sum(Vol), on = .(date >= InspDate, date <= PrevInspDate),
by = .EACHI], TotalVol := V1, on = .(InspDate=date)]
However this returns NA values.
Any advice would be much appreciated.
1 Answer
I believe your question had all the pieces you needed for the answer.
I polished your code a bit and changed the last line (which was the only wrong one). The join in that last line was overly complicated, and I don't think it could bring any memory or performance gain.
library(data.table)
# Create MeanRemaining
MeanRemaining <-
data.table(InspDate = seq(as.Date("2017-01-01"), as.Date("2017-02-28"), 2))
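# I replaced lag with shift, I think it is clearer this way
# (type = "lead" stores the next inspection date in PrevInspDate;
# fill = 1000000L acts as a far-future placeholder for the last row)
MeanRemaining[, PrevInspDate := shift(InspDate, type = "lead", fill = 1000000L)]
# set seed for reproducibility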
set.seed(13)
# Create DailyTonnes; I changed the end date to generate empty intervals
# (2016-12-01 to 2017-01-28 is 59 days, so draw one Vol value per date)
DailyTonnes <- data.table(date = seq(as.Date("2016-12-01"),
                                     as.Date("2017-01-28"), 1),
                          Vol = sample(abs(rnorm(59)) * 1000, replace = TRUE))
# I changed the <= condition to <, I think it fits PrevInspDate better
# This should be your final result if I'm not wrong
SingleCase <-
DailyTonnes[MeanRemaining, sum(Vol), on = .(date >= InspDate, date < PrevInspDate), by = .EACHI]
# SingleCase has two columns called date: in a non-equi join both join columns
# take the name of the x column (date) while holding the InspDate/PrevInspDate values
print(names(SingleCase))
# change the names of the data.table to suit your needs
names(SingleCase) <- c("InspDate", "PrevInspDate", "TotalVol")
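Since the shortened DailyTonnes leaves some intervals without any matching dates, it may help to see what the join returns for them. The following is a small sketch on hypothetical toy tables (`vols` and `ranges` are not the data from the question); it assumes the default `nomatch` behaviour of data.table, where an unmatched row of the inner table is kept with missing values.

```r
library(data.table)

# Hypothetical toy tables, not the data from the question
vols <- data.table(
  date = as.Date("2017-01-01") + 0:3,
  Vol  = c(1, 2, 3, 4)
)
ranges <- data.table(
  start = as.Date(c("2017-01-01", "2017-01-03", "2017-01-10")),
  end   = as.Date(c("2017-01-03", "2017-01-05", "2017-01-12"))
)

# Plain sum(): the interval with no matching dates comes back as NA
with_na <- vols[ranges, sum(Vol),
                on = .(date >= start, date < end), by = .EACHI]

# sum(..., na.rm = TRUE): the empty interval is reported as 0 instead
with_zero <- vols[ranges, sum(Vol, na.rm = TRUE),
                  on = .(date >= start, date < end), by = .EACHI]

print(with_na$V1)
print(with_zero$V1)
```

Whether an empty interval should read as NA or as 0 depends on what TotalVol means downstream, so the choice of `na.rm` is left to the reader.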
Edit: Recover multiple variables from table MeanRemaining
Retrieving several variables from MeanRemaining is trickier. It is easy to solve for a small number of variables:
# Add variables to MeanRemaining
for (i in 1:100) {
MeanRemaining[, paste0("extracol", i) := sample(.N)]
}
# Two variable case
smallmultiple <-
DailyTonnes[MeanRemaining, list(TotalVol = sum(Vol),
extracol1 = i.extracol1 ,
extracol2 = i.extracol2), on = .(date >= InspDate, date < PrevInspDate), by = .EACHI]
# Correct date names
names(smallmultiple)[1:2] <- c("InspDate", "PrevInspDate")
With a large number of variables it becomes hard. There is a feature request on GitHub that would solve your problem, but it is not available at the moment. This question faces a similar issue, but its solution cannot be used in your case.
A workaround for a large number of variables is:
# obtain names of variables to be kept in the later join
joinkeepcols <-
setdiff(names(MeanRemaining), c("InspDate", "PrevInspDate"))
# the "i" indicates the table to take the variables from
joinkeepcols2 <- paste0("i.", joinkeepcols)
# Prepare an expression for the data.table environment
keepcols <-
paste(paste(joinkeepcols, joinkeepcols2, sep = " = "), collapse = ", ")
# Complete expression to be evaluated in data.table
evalexpression <- paste0("list(TotalVol = sum(Vol), ", keepcols, ")")
# The magic comes here (you can assign it to MeanRemaining)
bigmultiple <-
DailyTonnes[MeanRemaining, eval(parse(text = evalexpression)), on = .(date >= InspDate, date < PrevInspDate), by = .EACHI]
# Correct date names
names(bigmultiple)[1:2] <- c("InspDate", "PrevInspDate")
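As a possible alternative to building the expression as text, the sums can be computed on their own and then attached to the table of ranges by reference, which keeps every other column without naming it. This is a sketch on hypothetical toy tables (`vols`, `ranges`, `extra1` and `extra2` are made-up names), and it relies on `by = .EACHI` returning one result row per row of the inner table, in the same order.

```r
library(data.table)

# Hypothetical toy tables, not the data from the question
vols <- data.table(
  date = as.Date("2017-01-01") + 0:3,
  Vol  = c(1, 2, 3, 4)
)
ranges <- data.table(
  start  = as.Date(c("2017-01-01", "2017-01-03")),
  end    = as.Date(c("2017-01-03", "2017-01-05")),
  extra1 = c("a", "b"),
  extra2 = c(10L, 20L)
)

# One sum per row of `ranges`, in the order of `ranges`
totals <- vols[ranges, sum(Vol),
               on = .(date >= start, date < end), by = .EACHI]

# Attach the sums by reference; the extra columns stay untouched
ranges[, TotalVol := totals$V1]

print(ranges)
```

The trade-off is that this modifies the table of ranges in place rather than producing a new joined table, which may or may not suit the rest of your code.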