Create 15-minute time intervals from per-minute data in R?

Time: 2021-01-26 19:14:06

I have some data which is formatted in the following way:

time     count 
00:00    17
00:01    62
00:02    41

So I have data from 00:00 to 23:59, with one count per minute. I'd like to group the data into 15-minute intervals, such that:

time           count
00:00-00:15    148   
00:16-00:30    284

I have tried to do it manually, but this is exhausting. I am sure there must be a function or something that does this easily, but I haven't figured out how yet.

I'd really appreciate some help!!

Thank you very much!

2 solutions

#1


10  

For data that's in POSIXct format, you can use the cut function to create 15-minute groupings, and then aggregate by those groups. The code below shows how to do this in base R and with the dplyr and data.table packages.

First, create some fake data:

set.seed(4984)
dat = data.frame(time=seq(as.POSIXct("2016-05-01"), as.POSIXct("2016-05-01") + 60*99, by=60),
                 count=sample(1:50, 100, replace=TRUE))

Base R

Cut the data into 15-minute groups:

dat$by15 = cut(dat$time, breaks="15 min")
                   time count                by15
1   2016-05-01 00:00:00    22 2016-05-01 00:00:00
2   2016-05-01 00:01:00    11 2016-05-01 00:00:00
3   2016-05-01 00:02:00    31 2016-05-01 00:00:00
...
98  2016-05-01 01:37:00    20 2016-05-01 01:30:00
99  2016-05-01 01:38:00    29 2016-05-01 01:30:00
100 2016-05-01 01:39:00    37 2016-05-01 01:30:00
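
One detail worth checking is which bin a boundary minute falls into. The following is a minimal sketch (the vector `x` and the chosen minutes are made up for illustration, and the time zone is fixed to UTC as an assumption) suggesting that cut treats each bin as closed on the left, so 00:15 belongs to the second group rather than the first:

```r
# Illustrative timestamps around the 15-minute boundaries (UTC is an assumption)
x <- as.POSIXct("2016-05-01 00:00:00", tz = "UTC") + 60 * c(0, 14, 15, 16, 29, 30)

# cut() on date-times uses right = FALSE by default, i.e. bins are [start, start + 15 min)
bins <- cut(x, breaks = "15 min")

# Show the clock time next to the integer code of its bin
boundary_check <- data.frame(time = format(x, "%H:%M"), bin = as.integer(bins))
print(boundary_check)
```

With this convention the groups are 00:00-00:14, 00:15-00:29, and so on, each covering exactly 15 one-minute rows.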

Now aggregate by the new grouping column, using sum as the aggregation function:

dat.summary = aggregate(count ~ by15, FUN=sum, data=dat)
                 by15 count
1 2016-05-01 00:00:00   312
2 2016-05-01 00:15:00   395
3 2016-05-01 00:30:00   341
4 2016-05-01 00:45:00   318
5 2016-05-01 01:00:00   349
6 2016-05-01 01:15:00   397
7 2016-05-01 01:30:00   341

dplyr

library(dplyr)

dat.summary = dat %>% group_by(by15=cut(time, "15 min")) %>%
  summarise(count=sum(count))

data.table

library(data.table)

dat.summary = setDT(dat)[ , list(count=sum(count)), by=cut(time, "15 min")]

UPDATE: To answer the comment, for this case the end point of each grouping interval is as.POSIXct(as.character(dat$by15)) + 60*15 - 1. In other words, the endpoint of the grouping interval is 15 minutes minus one second from the start of the interval. We add 60*15 - 1 because POSIXct is denominated in seconds. The as.POSIXct(as.character(...)) is because cut returns a factor and this just converts it back to date-time so that we can do math on it.

If you want the end point to be the last whole minute before the next interval (instead of the last second), you could do as.POSIXct(as.character(dat$by15)) + 60*14.

If you don't know the break interval, for example, because you chose the number of breaks and let R pick the interval, you could find the number of seconds to add by doing max(unique(diff(as.POSIXct(as.character(dat$by15))))) - 1.
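
If the data arrive as "HH:MM" strings, as in the question, the end-point arithmetic can be folded into the bin labels up front. This is only a sketch under assumptions: the data frame `raw`, the column `stamp`, the attached date 2016-05-01, and the UTC time zone are all placeholders chosen for illustration, not part of the original question.

```r
# Hypothetical input in the question's layout: one "HH:MM" string per minute of a day
set.seed(1)
raw <- data.frame(
  time  = format(as.POSIXct("2016-05-01", tz = "UTC") + 60 * (0:1439), "%H:%M"),
  count = sample(1:50, 1440, replace = TRUE),
  stringsAsFactors = FALSE
)

# Attach an arbitrary date so the strings become POSIXct (only the clock time matters)
raw$stamp <- as.POSIXct(paste("2016-05-01", raw$time),
                        format = "%Y-%m-%d %H:%M", tz = "UTC")

# 97 break points delimit the 96 quarter-hour bins of one day
breaks <- seq(as.POSIXct("2016-05-01 00:00", tz = "UTC"), by = "15 min", length.out = 97)
starts <- breaks[-length(breaks)]

# Label each bin "start-end", where end is the last whole minute (start + 60*14 seconds)
labels <- paste0(format(starts, "%H:%M"), "-", format(starts + 60 * 14, "%H:%M"))

# Bins are closed on the left, so 00:15 goes into "00:15-00:29"
raw$interval <- cut(raw$stamp, breaks = breaks, labels = labels, right = FALSE)
res <- aggregate(count ~ interval, FUN = sum, data = raw)
head(res)
```

Supplying explicit breaks and labels avoids converting the factor levels back to date-times afterwards, at the cost of having to know the covered time range in advance.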

#2


0  

The cut approach is handy but slow with large data frames. The following approach was approximately 1,000x faster than the cut approach in a test with 400k records.

  #     Function: Truncate (floor) POSIXct to time interval (specified in seconds)
  #       Author: Stephen McDaniel @ PowerTrip Analytics
  #        Date : 2017MAY
  #    Copyright: (C) 2017 by Freakalytics, LLC
  #      License: MIT

  floor_datetime <- function(date_var, floor_seconds = 60, 
        origin = "1970-01-01") { # defaults to minute rounding
     # inherits() is in base R, so no extra package is needed for the class check
     if(!inherits(date_var, "POSIXct")) stop("Please pass in a POSIXct variable")
     # floor() is vectorized and propagates NA, so whole columns (including
     # missing values) can be passed in without a separate NA branch
     return(as.POSIXct(floor(as.numeric(date_var) / 
        (floor_seconds))*(floor_seconds), origin = origin))
  }

Sample output:

test <- data.frame(good = as.POSIXct(Sys.time()), 
   bad1 = as.Date(Sys.time()),
   bad2 = as.POSIXct(NA))

test$good_15 <- floor_datetime(test$good, 15 * 60)
test$bad1_15 <- floor_datetime(test$bad1, 15 * 60)
Error in floor_datetime(test$bad1, 15 * 60) : 
  Please pass in a POSIXct variable
test$bad2_15 <- floor_datetime(test$bad2, 15 * 60)

test

                        good       bad1 bad2             good_15 bad2_15
    1 2017-05-06 13:55:34.48 2017-05-06 <NA> 2017-05-06 13:45:00    <NA>
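
To get from floored timestamps to the 15-minute totals the question asks for, the floored column can serve as the grouping key. This is a rough sketch rather than the answer author's code: `floor_to_bin` is a hypothetical stand-in for the function above, the UTC time zone is an assumption, and the fake data mirror the first answer.

```r
# Hypothetical helper: vectorized floor of POSIXct values to a bin width in seconds
floor_to_bin <- function(x, bin_seconds = 15 * 60, tz = "UTC") {
  stopifnot(inherits(x, "POSIXct"))
  as.POSIXct(floor(as.numeric(x) / bin_seconds) * bin_seconds,
             origin = "1970-01-01", tz = tz)
}

# Fake per-minute data: 100 minutes starting at midnight
set.seed(4984)
dat <- data.frame(time  = as.POSIXct("2016-05-01", tz = "UTC") + 60 * (0:99),
                  count = sample(1:50, 100, replace = TRUE))

# Floor every timestamp to the start of its 15-minute bin, then sum per bin
dat$by15 <- floor_to_bin(dat$time)
dat.summary <- aggregate(count ~ by15, FUN = sum, data = dat)
dat.summary
```

Because the grouping key is computed with plain arithmetic instead of a factor, the same column also works as a grouping key in dplyr or data.table.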
