I have data that looks like this:
library(plyr)
dates<-data.frame(datecol=as.POSIXct(c(
"2010-04-03 03:02:38 UTC",
"2010-04-03 03:03:14 UTC",
"2010-04-20 03:05:52 UTC",
"2010-04-20 03:07:42 UTC",
"2010-04-21 03:09:38 UTC",
"2010-04-21 03:10:14 UTC",
"2010-04-21 03:12:52 UTC",
"2010-04-23 03:13:42 UTC",
"2010-04-23 03:15:42 UTC",
"2010-04-23 03:16:38 UTC",
"2010-04-23 03:18:14 UTC",
"2010-04-24 03:21:52 UTC",
"2010-04-24 03:22:42 UTC",
"2010-04-24 03:24:19 UTC",
"2010-04-24 03:25:19 UTC"
)), x = cumsum(runif(15)*10),y=cumsum(runif(15)*20))
I want to group my data into 5-day intervals, so that all points that are 5 days or less apart end up in one group. I tried what was suggested here:
gr<-ddply(dates,.(cut(datecol,"5 day",include.lowest = TRUE)),"[")
But for some reason I end up with three groups instead of two: the points from 04/21 and 04/23 fall into separate groups even though they are less than 5 days apart.
This is what I'd like to get:
   group             datecol         x          y
1      1 2010-04-03 03:02:38  8.112423   4.790036
2      1 2010-04-03 03:03:14 11.184709  22.903475
3      2 2010-04-20 03:05:52 17.306835  32.286891
4      2 2010-04-20 03:07:42 24.071488  38.941709
5      2 2010-04-21 03:09:38 26.451493  48.378477
6      2 2010-04-21 03:10:14 33.090645  53.148149
7      2 2010-04-21 03:12:52 38.536416  64.346574
8      2 2010-04-23 03:13:42 40.911074  79.419002
9      2 2010-04-23 03:15:42 41.977579  89.760210
10     2 2010-04-23 03:16:38 46.838773  95.266709
11     2 2010-04-23 03:18:14 48.367159 112.619969
12     2 2010-04-24 03:21:52 57.470412 113.594423
13     2 2010-04-24 03:22:42 63.202005 123.653370
14     2 2010-04-24 03:24:19 65.615348 137.184153
15     2 2010-04-24 03:25:19 75.177633 137.559003
2 Solutions
#1
5
How about a cumsum that checks the lagged values and updates if necessary? We use the shift() function from the data.table library for the lags.
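library(data.table)
# Gap in days between each timestamp and the previous one (the first row is compared with itself, so its gap is 0)
gap <- difftime(dates$datecol,
                shift(dates$datecol, fill = dates$datecol[1]),
                units = "days")
# Start a new group only when the gap exceeds 5 days, so points exactly 5 days apart stay together
dates$group <- cumsum(gap > 5) + 1
head(dates)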
# datecol x y group
#1 2010-04-03 03:02:38 4.776196 5.160336 1
#2 2010-04-03 03:03:14 13.388291 14.731241 1
#3 2010-04-20 03:05:52 17.769262 30.057454 2
#4 2010-04-20 03:07:42 20.217235 31.742392 2
#5 2010-04-21 03:09:38 20.924025 49.248819 2
#6 2010-04-21 03:10:14 21.918687 56.030278 2
This assumes your data is sorted by time in ascending order.
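If the rows might not be sorted, the same gap-based idea can be sketched in base R by ordering first and using diff(). This is only an illustration under stated assumptions: the five shuffled timestamps below are a hypothetical subset of the question's data, and the names gap_days and group are made up for the example.

```r
# Hypothetical sample: a few of the question's timestamps, deliberately shuffled.
datecol <- as.POSIXct(c(
  "2010-04-23 03:13:42", "2010-04-03 03:02:38", "2010-04-24 03:25:19",
  "2010-04-20 03:05:52", "2010-04-21 03:09:38"
), tz = "UTC")
dates <- data.frame(datecol = datecol)

# Sort first, because the gap test only makes sense on ordered timestamps.
dates <- dates[order(dates$datecol), , drop = FALSE]

# Gaps between consecutive rows, in days; the first row has no predecessor.
gap_days <- c(0, as.numeric(diff(dates$datecol), units = "days"))

# Start a new group whenever the gap exceeds 5 days.
dates$group <- cumsum(gap_days > 5) + 1
print(dates)
```

Note that this groups by the gap to the previous point, so a long chain of closely spaced points stays in one group even if its first and last points are more than 5 days apart.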
#2
1
You can set the breaks manually so that they are referenced to whatever baseline date you wish. For example:
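library(lubridate)
# Baseline date that the 5-day bins are anchored to
start.date = ymd_hms("2010-04-20 00:00:00")
# Breaks every 5 days, from 30 days before to 30 days after the baseline
breaks = seq(start.date - 30*3600*24, start.date + 30*3600*24, "5 days")
dates$group5 = cut(dates$datecol, breaks=breaks)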
               datecol         x         y     group5
1  2010-04-03 03:02:38  7.265758  10.80777 2010-03-31
2  2010-04-03 03:03:14 15.632081  13.57187 2010-03-31
3  2010-04-20 03:05:52 19.219491  19.76293 2010-04-20
4  2010-04-20 03:07:42 20.605199  37.22687 2010-04-20
5  2010-04-21 03:09:38 26.533445  53.90345 2010-04-20
6  2010-04-21 03:10:14 33.449645  56.27885 2010-04-20
7  2010-04-21 03:12:52 39.050517  71.74788 2010-04-20
8  2010-04-23 03:13:42 39.499227  76.92669 2010-04-20
9  2010-04-23 03:15:42 44.827766  79.49207 2010-04-20
10 2010-04-23 03:16:38 54.206473  89.60895 2010-04-20
11 2010-04-23 03:18:14 54.982695  94.37855 2010-04-20
12 2010-04-24 03:21:52 64.414931 104.24174 2010-04-20
13 2010-04-24 03:22:42 64.659980 113.77616 2010-04-20
14 2010-04-24 03:24:19 67.343105 128.06813 2010-04-20
15 2010-04-24 03:25:19 71.060741 138.43512 2010-04-20
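To see why the default call in the question gave three groups, it may help to compare the two binning schemes side by side. The sketch below uses a reduced, assumed sample (one timestamp per day from the question, with the time zone forced to UTC) and only base R. With a string like "5 days", cut() anchors its bins at the start of the earliest day, so a bin boundary lands on 04-23; anchoring the breaks at 04-20 avoids that.

```r
# Assumed sample: one timestamp per day from the question's data.
datecol <- as.POSIXct(c(
  "2010-04-03 03:02:38", "2010-04-20 03:05:52", "2010-04-21 03:09:38",
  "2010-04-23 03:13:42", "2010-04-24 03:21:52"
), tz = "UTC")

# Default behaviour: 5-day bins counted from the start of the earliest day
# (04-03, 04-08, 04-13, 04-18, 04-23, ...), so 04-23 opens a new bin.
default_bins <- cut(datecol, "5 days")
used_default <- as.character(droplevels(default_bins))

# Manual breaks anchored at 2010-04-20 keep 04-20 through 04-24 together.
anchor <- as.POSIXct("2010-04-20 00:00:00", tz = "UTC")
breaks <- seq(anchor - 30 * 86400, anchor + 30 * 86400, by = "5 days")
manual_bins <- cut(datecol, breaks = breaks)
used_manual <- as.character(droplevels(manual_bins))

# Number of distinct bins actually occupied under each scheme.
print(length(unique(used_default)))
print(length(unique(used_manual)))
```

Fixed breaks still depend on the chosen baseline: points less than 5 days apart can land in different bins whenever a boundary falls between them, which is the trade-off against the gap-based approach in the first answer.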