Find the mean of a column over date ranges given in another data frame in R

I have two data frames that look like this:

> head(y,n=4)
Source: local data frame [6 x 3]

  Start Date   End Date   Length

1 2006-06-08 2006-06-10        3
2 2006-06-12 2006-06-14        3
3 2006-06-18 2006-06-21        4
4 2006-06-24 2006-06-25        2

and

> head(x,n=19)
          Date   Group.Size
413 2006-06-07            6
414 2006-06-08            3
415 2006-06-09            1
416 2006-06-10            3
417 2006-06-11            15
418 2006-06-12            12
419 2006-06-13            NA
420 2006-06-14            4
421 2006-06-15            8
422 2006-06-16            3
423 2006-06-17            1
424 2006-06-18            3
425 2006-06-19            10
426 2006-06-20            2
427 2006-06-21            7
428 2006-06-22            6
429 2006-06-23            2
430 2006-06-24            1
431 2006-06-25            0

I'm looking for a way to add a new column in data frame y that will show the average Group.Size of data frame x (rounded to nearest integer), depending on the given Start Date and End Dates provided in y.

For example, in the first row of y, I have 6/8/06 to 6/10/06. This is a length of 3 days, so I would want the new column to have the number 2, because the corresponding Group.Size values are 3, 1, and 3 for the respective days in data frame x (mean=2.33, rounded to nearest integer is 2).

If there is an NA in my dataframe x, I'd like to consider it a 0.

There are multiple steps involved in this task, and there is probably a straightforward approach... I am relatively new to R, and am having a hard time breaking it down. Please let me know if I should clarify my example.

5 solutions

#1

Assuming that x$Date, y$StartDate, and y$EndDate are of class Date (or character), the following apply() approach should do the trick:

y$AvGroupSize <- apply(y, 1, function(z) {
  # z[1] and z[2] are the start and end date of the current row of y
  round(mean(x$Group.Size[which(x$Date >= z[1] & x$Date <= z[2])], na.rm = TRUE), 0)
})
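
One caveat worth checking: na.rm = TRUE drops missing values, which is not the same as counting them as 0 as the question asks (12, NA, 4 averages to 8 when the NA is dropped, but to 5.33 when it counts as 0). Below is a minimal sketch of a variant that treats NA as 0. It rebuilds the sample data from the question; the column names StartDate and EndDate and the helper name range_mean are assumptions made for illustration.

```r
# Sample data rebuilt from the question (column names are assumptions).
x <- data.frame(
  Date = seq(as.Date("2006-06-07"), as.Date("2006-06-25"), by = "day"),
  Group.Size = c(6, 3, 1, 3, 15, 12, NA, 4, 8, 3, 1, 3, 10, 2, 7, 6, 2, 1, 0)
)
y <- data.frame(
  StartDate = as.Date(c("2006-06-08", "2006-06-12", "2006-06-18", "2006-06-24")),
  EndDate   = as.Date(c("2006-06-10", "2006-06-14", "2006-06-21", "2006-06-25")),
  Length    = c(3, 3, 4, 2)
)

# Hypothetical helper: rounded mean of Group.Size between two dates,
# with NA counted as 0 instead of being dropped.
range_mean <- function(start, end) {
  sizes <- x$Group.Size[x$Date >= start & x$Date <= end]
  sizes[is.na(sizes)] <- 0
  round(mean(sizes))
}

# Apply the helper to each (StartDate, EndDate) pair of y.
y$AvGroupSize <- mapply(range_mean, y$StartDate, y$EndDate)
y$AvGroupSize   # 2 5 6 0
```

With this change the second row gives 5 rather than 8.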

#2

# Replace missing values in x with 0
x[is.na(x)] <- 0

# Label each row of x with the index of the y range it falls in.
# Rows outside every range keep NA, so days between two ranges
# are not counted towards either of them.
x$Group <- NA_integer_
for (j in seq_len(nrow(y))) {
  in_range <- x$Date >= y$StartDate[j] & x$Date <= y$EndDate[j]
  x$Group[in_range] <- j
}

# Use tapply to get the rounded mean of each group
# (rows with an NA group are dropped by tapply)
tapply(x$Group.Size, x$Group, function(z) round(mean(z)))

#3

Here is a different dplyr solution:

library(dplyr)

na2zero <- function(x) ifelse(is.na(x),0,x) # Convert NA to zero
ydf %>%
    group_by(Start_Date, End_Date) %>%
    mutate(avg = round(mean(na2zero(xdf$Group.Size[ between(xdf$Date, Start_Date, End_Date) ])), 0)) %>%
    ungroup

##   Start_Date   End_Date Length   avg
##       (time)     (time)  (int) (dbl)
## 1 2006-06-08 2006-06-10      3     2
## 2 2006-06-12 2006-06-14      3     5
## 3 2006-06-18 2006-06-21      4     6
## 4 2006-06-24 2006-06-25      2     0

#4

This is a solution that applies over the rows of the data frame y:

library(dplyr)
get_mean_size <- function(start, end, length) {
   s <- sum(filter(x, Date >= start, Date <= end)$Group.Size, na.rm = TRUE)
   round(s/length)
}
y$Mean.Size = Map(get_mean_size, y$Start_Date, y$End_Date, y$Length)
y
##   Start_Date   End_Date Length Mean.Size
## 1 2006-06-08 2006-06-10      3         2
## 2 2006-06-12 2006-06-14      3         5
## 3 2006-06-18 2006-06-21      4         6
## 4 2006-06-24 2006-06-25      2         0

It uses filter() from the dplyr package together with Map() from base R.

First I define the function get_mean_size, which is supplied with three values from a row of y: Start_Date, End_Date and Length. It first selects the relevant rows from x using filter() and sums the column Group.Size. Using na.rm = TRUE tells sum() to ignore NA values, which is the same as setting them to zero. Then the average is calculated by dividing by length and rounding. Note that round() rounds half to even, thus 0.5 is rounded to 0, while 1.5 is rounded to 2.
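
The rounding behaviour matters here because the last row of y has a mean of exactly 0.5. A quick sketch of what round() does with halves, plus a hypothetical helper (round_half_up, not part of the answer) for anyone who would rather round halves upwards; the helper assumes non-negative input:

```r
# round() in R rounds halves to the nearest even integer.
round(0.5)   # 0
round(1.5)   # 2
round(2.5)   # 2
round(5.5)   # 6

# Hypothetical helper: round halves upwards (non-negative input only).
round_half_up <- function(v) floor(v + 0.5)
round_half_up(c(0.5, 1.5, 2.5))   # 1 2 3
```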

This function is then applied to all rows of y using Map() and added as a new column to y.
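
One detail to be aware of: Map() always returns a list, so the assignment above stores a list column in y. If a plain numeric column is preferred, mapply() or unlist() can be used instead. The sketch below uses made-up totals and a hypothetical helper avg purely to show the difference:

```r
# Hypothetical helper and toy inputs, for illustration only.
avg <- function(total, n) round(total / n)
totals <- c(7, 16, 22, 1)
n_days <- c(3, 3, 4, 2)

as_list <- Map(avg, totals, n_days)      # a list with one element per row
as_vec  <- mapply(avg, totals, n_days)   # simplified to a numeric vector

is.list(as_list)      # TRUE
is.numeric(as_vec)    # TRUE
unlist(as_list)       # flattens the list into a numeric vector
```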

A final note regarding the dates in x and y: this solution assumes that the dates are stored as Date objects. You can check this using, e.g.,

is(x$Date, "Date")

If they do not have class Date, you can convert them using

x$Date <- as.Date(x$Date)

(and similarly for y$Start_Date and y$End_Date).
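
As a small illustration of the check and the conversion, here is a sketch with made-up character dates. The format string in the last step is an assumption for data written like 6/8/06, as in the question text; ISO-style strings need no format argument.

```r
# Dates read from a file often arrive as character strings.
dates <- c("2006-06-08", "2006-06-10")
inherits(dates, "Date")   # FALSE

# ISO-formatted strings convert directly.
dates <- as.Date(dates)
inherits(dates, "Date")   # TRUE

# Other layouts need an explicit format (assumed here: month/day/2-digit year).
parsed <- as.Date("6/8/06", format = "%m/%d/%y")
format(parsed)            # "2006-06-08"
```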

#5

There are many ways, but here is one. We can first create a list of date positions with lapply (side note: be sure that the dates are in chronological order). Then we map the function round(mean(Group.Size)) to each pair of positions:

lst <- lapply(y[1:2], function(.x) match(.x, x[,"Date"]))
y$avg <- mapply(function(i,j) round(mean(x$Group.Size[i:j], na.rm=TRUE)), lst[[1]],lst[[2]])
y
#    StartDate    EndDate Length avg
# 1 2006-06-08 2006-06-10      3   2
# 2 2006-06-12 2006-06-14      3   8
# 3 2006-06-18 2006-06-21      4   6
# 4 2006-06-24 2006-06-25      2   0
