I have a data frame like this:
tmpdf <- data.frame(licensePlate = c("Y80901", "Y80901", "Y80901", "AMG-999", "AMG-999", "W3188", "W3188"),
starttime= c("2015-09-18 09:55", "2015-09-18 23:00", "2015-09-20 15:00", "2015-09-17 15:42", "2015-09-21 09:22", "2015-09-17 09:00", "2015-09-21 14:00"),
endtime = c("2015-09-18 17:55", "2015-09-20 11:00", "2015-09-21 12:00", "2015-09-18 13:00", "2015-09-21 14:22", "2015-09-21 12:00", "2015-09-21 16:00"))
tmpdf
  licensePlate        starttime          endtime
1       Y80901 2015-09-18 09:55 2015-09-18 17:55
2       Y80901 2015-09-18 23:00 2015-09-20 11:00
3       Y80901 2015-09-20 15:00 2015-09-21 12:00
4      AMG-999 2015-09-17 15:42 2015-09-18 13:00
5      AMG-999 2015-09-21 09:22 2015-09-21 14:22
6        W3188 2015-09-17 09:00 2015-09-21 12:00
7        W3188 2015-09-21 14:00 2015-09-21 16:00
For each licensePlate, I want to compute the time used per day (in hours) over the last n days (for example, the last 5 days, from 9/17 to 9/21). My expected result is:
       Period LicensePlate Used Time
1  2015-09-17       Y80901         0
2  2015-09-17      AMG-999       8.3
3  2015-09-17        W3188        15
4  2015-09-18       Y80901         9
5  2015-09-18      AMG-999        13
6  2015-09-18        W3188        24
7  2015-09-19       Y80901        24
8  2015-09-19      AMG-999         0
9  2015-09-19        W3188        24
10 2015-09-20       Y80901        20
11 2015-09-20      AMG-999         0
12 2015-09-20        W3188        24
13 2015-09-21       Y80901        12
14 2015-09-21      AMG-999         5
15 2015-09-21        W3188        14
I think dplyr/data.table and lubridate can be used to get this result, and I may need to measure the time period in days, but I don't know how to split each start/end interval at day boundaries when the start and end differ per row.
4 Answers
#1
3
Here's something to get you started. This is almost your desired output, except that it doesn't show the licensePlate values that have no usage in a given period.
The first step is to convert your dates to a valid POSIXct class, then expand the data to a per-minute level (probably the most costly part of this solution), and then aggregate by licensePlate and by Period while summing up the results. (I'm not using as.Date here because it handles POSIXct values between midnight and 1 a.m. badly.)
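library(data.table)
# Convert both time columns to POSIXct by reference
setDT(tmpdf)[, `:=`(starttime = as.POSIXct(starttime), endtime = as.POSIXct(endtime))]
# One row per minute of each interval (both endpoints included)
res <- tmpdf[, .(licensePlate, Period = seq(starttime, endtime, by = "1 min")), by = 1:nrow(tmpdf)]
# Count the minutes per day and plate, then convert to hours
res[, .(Used_Time = round(.N/60L, 1L)), keyby = .(Period = substr(Period, 1L, 10L), licensePlate)]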
#         Period licensePlate Used_Time
#  1: 2015-09-17      AMG-999       8.3
#  2: 2015-09-17        W3188      15.0
#  3: 2015-09-18      AMG-999      13.0
#  4: 2015-09-18        W3188      24.0
#  5: 2015-09-18       Y80901       9.0
#  6: 2015-09-19        W3188      24.0
#  7: 2015-09-19       Y80901      24.0
#  8: 2015-09-20        W3188      24.0
#  9: 2015-09-20       Y80901      20.0
# 10: 2015-09-21      AMG-999       5.0
# 11: 2015-09-21        W3188      14.0
# 12: 2015-09-21       Y80901      12.0
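The result above leaves out plate/day combinations with no usage. One possible way to add them back with a value of 0 is sketched below in base R. This is not part of the answer's data.table code: the `res` data frame is retyped by hand from the first few rows of the output above, and `all_combos` and `filled` are names made up for this illustration.

```r
# A few rows of the aggregated output, retyped by hand for illustration only.
res <- data.frame(
  Period       = c("2015-09-17", "2015-09-17", "2015-09-18", "2015-09-18", "2015-09-18"),
  licensePlate = c("AMG-999", "W3188", "AMG-999", "W3188", "Y80901"),
  Used_Time    = c(8.3, 15, 13, 24, 9),
  stringsAsFactors = FALSE
)

# Every day paired with every plate, whether or not the plate was used that day.
all_combos <- expand.grid(
  Period       = unique(res$Period),
  licensePlate = unique(res$licensePlate),
  stringsAsFactors = FALSE
)

# Left join: combinations missing from res get NA, which is then replaced by 0.
filled <- merge(all_combos, res, by = c("Period", "licensePlate"), all.x = TRUE)
filled$Used_Time[is.na(filled$Used_Time)] <- 0
filled
```

With the full result, the same merge should produce the 15 rows of the expected output, assuming every day of interest appears at least once in `res`.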
#2
1
Take a deep breath. Here is my solution:
Initializing Data
tmpdf <- data.frame(licensePlate = c("Y80901", "Y80901", "Y80901", "AMG-999", "AMG-999", "W3188", "W3188"),
starttime= c("2015-09-18 09:55", "2015-09-18 23:00", "2015-09-20 15:00", "2015-09-17 15:42", "2015-09-21 09:22", "2015-09-17 09:00", "2015-09-21 14:00"),
endtime = c("2015-09-18 17:55", "2015-09-20 11:00", "2015-09-21 12:00", "2015-09-18 13:00", "2015-09-21 14:22", "2015-09-21 12:00", "2015-09-21 16:00"))
library(lubridate)  # provides hours(), minutes() and seconds(), used further down
# Convert to POSIXct for better date/time handling
tmpdf$starttime <- as.POSIXct(tmpdf$starttime, tz = "GMT")
tmpdf$endtime <- as.POSIXct(tmpdf$endtime, tz = "GMT")
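Fixing the time zone at parse time matters because the calendar date of a POSIXct value depends on the zone it is read in. The small sketch below illustrates this; the "Asia/Tokyo" zone is an arbitrary choice for the demonstration and does not come from the question.

```r
# Half past midnight on 2015-09-18 in Tokyo (UTC+9), i.e. 15:30 on 2015-09-17 in UTC.
x <- as.POSIXct("2015-09-18 00:30", tz = "Asia/Tokyo")

date_utc   <- as.Date(x, tz = "UTC")         # date of that instant in UTC
date_local <- as.Date(x, tz = "Asia/Tokyo")  # date in the value's own zone
date_text  <- format(x, "%Y-%m-%d")          # formatted in the zone stored on x

date_utc    # 2015-09-17
date_local  # 2015-09-18
date_text   # "2015-09-18"
```

Parsing everything as GMT, as this answer does, sidesteps the mismatch because the stored zone and the UTC conversion agree.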
Data Preparation
To perform the required operations, the complete usage data has to be converted to daily usage data. So I wrote the following functions to prepare the data in the required format.
# Splits each multi-day usage row into two rows at the first midnight
splitToTwo <- function(list){
  newList <- NULL
  for (i in seq_len(nrow(list))){  # seq_len() is safe when list has no rows
    tmp <- list[i,]
    # set the end time of the first split as 23:59:59
    list[i,]$endtime <- as.Date(list[i,]$starttime) + hours(23) + minutes(59) + seconds(59)
    # set the start time of the second split as 00:00:01
    tmp$starttime <- list[i,]$endtime + seconds(2)
    # add both pieces to the result
    tmp <- rbind(tmp,list[i,])
    newList <- rbind(newList,tmp)
  }
  return(newList)
}
# Recursive function: keeps splitting the usage data in two until every row is within a single day
setDailyUsage <- function(tmpdf){
  # subset of rows where the usage spans more than one day
  list <- tmpdf[as.Date(tmpdf$endtime) - as.Date(tmpdf$starttime) > 0, ]
  # replace tmpdf with the usage that started and ended on the same day
  tmpdf <- tmpdf[ as.Date(tmpdf$endtime) - as.Date(tmpdf$starttime) == 0,]
  # call our split function on the rows spanning more than one day
  split <- splitToTwo(list)
  # add the split rows back to the same-day rows
  tmpdf <- rbind(tmpdf,split)
  # recurse while any row still spans more than one day
  if (nrow(tmpdf[as.Date(tmpdf$endtime) - as.Date(tmpdf$starttime) > 0, ])>0){
    tmpdf <- setDailyUsage(tmpdf)
  }
  return(tmpdf)
}
Prepared Data
Our prepared data:
preparedData <- setDailyUsage(tmpdf)
    licensePlate           starttime             endtime
1         Y80901 2015-09-18 09:55:00 2015-09-18 17:55:00
5        AMG-999 2015-09-21 09:22:00 2015-09-21 14:22:00
7          W3188 2015-09-21 14:00:00 2015-09-21 16:00:00
21        Y80901 2015-09-18 23:00:00 2015-09-18 23:59:59
3         Y80901 2015-09-21 00:00:01 2015-09-21 12:00:00
31        Y80901 2015-09-20 15:00:00 2015-09-20 23:59:59
4        AMG-999 2015-09-18 00:00:01 2015-09-18 13:00:00
41       AMG-999 2015-09-17 15:42:00 2015-09-17 23:59:59
61         W3188 2015-09-17 09:00:00 2015-09-17 23:59:59
2         Y80901 2015-09-20 00:00:01 2015-09-20 11:00:00
211       Y80901 2015-09-19 00:00:01 2015-09-19 23:59:59
611        W3188 2015-09-18 00:00:01 2015-09-18 23:59:59
612        W3188 2015-09-19 00:00:01 2015-09-19 23:59:59
6          W3188 2015-09-21 00:00:01 2015-09-21 12:00:00
613        W3188 2015-09-20 00:00:01 2015-09-20 23:59:59
Data Manipulation
Now we create a new data frame that represents the data in the required format. Its UsedTime column is initially filled with zeros.
# Duration of each daily piece in minutes (explicit units, so that dividing by 60 later reliably yields hours)
preparedData$duration <- as.numeric(difftime(preparedData$endtime, preparedData$starttime, units = "mins"))
noOfUniquePlates <- length(unique(preparedData$licensePlate))
# Every calendar day in the data, repeated once per plate
Period <- rep(seq(from = min(as.Date(preparedData$starttime)), to = max(as.Date(preparedData$starttime)), by = "day"), noOfUniquePlates)
noOfUniqueDays <- length(unique(Period))
LicensePlate <- rep(unique(preparedData$licensePlate), each = noOfUniqueDays)
UsedTime <- 0
newDF <- data.frame(Period, LicensePlate, UsedTime)
Now apply a simple mapply() call over the rows of newDF, looking up the matching usage data in the preparedData data frame.
# Total duration of all pieces for plate l that start and end on day p
findUsage <- function(p,l){
  sum(preparedData[as.Date(preparedData$starttime) == p & as.Date(preparedData$endtime) == p & preparedData$licensePlate == l , ]$duration)
}
newDF$UsedTime <- mapply( findUsage, newDF$Period, newDF$LicensePlate)
newDF$UsedTime <- newDF$UsedTime/60  # minutes to hours
> newDF[with(newDF,order(Period)),]
       Period LicensePlate  UsedTime
1  2015-09-17       Y80901  0.000000
6  2015-09-17      AMG-999  8.299722
11 2015-09-17        W3188 14.999722
2  2015-09-18       Y80901  8.999722
7  2015-09-18      AMG-999 12.999722
12 2015-09-18        W3188 23.999444
3  2015-09-19       Y80901 23.999444
8  2015-09-19      AMG-999  0.000000
13 2015-09-19        W3188 23.999444
4  2015-09-20       Y80901 19.999444
9  2015-09-20      AMG-999  0.000000
14 2015-09-20        W3188 23.999444
5  2015-09-21       Y80901 11.999722
10 2015-09-21      AMG-999  5.000000
15 2015-09-21        W3188 13.999722
I had to limit the explanations to keep the answer short. Let me know in the comments if you need any clarification.
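For comparison, the per-day split can also be computed without recursion, by clipping every interval to each calendar day and summing the overlaps. The sketch below is an alternative technique, not this answer's method. It assumes all times are read as UTC and that the five days of interest are given explicitly; `hours_used`, `grid`, `start_sec` and `end_sec` are names made up for the illustration.

```r
tmpdf <- data.frame(
  licensePlate = c("Y80901", "Y80901", "Y80901", "AMG-999", "AMG-999", "W3188", "W3188"),
  starttime = c("2015-09-18 09:55", "2015-09-18 23:00", "2015-09-20 15:00", "2015-09-17 15:42",
                "2015-09-21 09:22", "2015-09-17 09:00", "2015-09-21 14:00"),
  endtime = c("2015-09-18 17:55", "2015-09-20 11:00", "2015-09-21 12:00", "2015-09-18 13:00",
              "2015-09-21 14:22", "2015-09-21 12:00", "2015-09-21 16:00"),
  stringsAsFactors = FALSE
)

# Interval bounds as seconds since the epoch (UTC is an assumption).
start_sec <- as.numeric(as.POSIXct(tmpdf$starttime, tz = "UTC"))
end_sec   <- as.numeric(as.POSIXct(tmpdf$endtime, tz = "UTC"))

# One row per (day, plate), so that unused days appear with 0 hours.
days <- seq(as.Date("2015-09-17"), as.Date("2015-09-21"), by = "day")
grid <- expand.grid(Period = days, LicensePlate = unique(tmpdf$licensePlate),
                    stringsAsFactors = FALSE)

# Hypothetical helper: hours of one plate's intervals that fall inside one day.
hours_used <- function(day, plate) {
  day_start <- as.numeric(day) * 86400   # midnight UTC, in seconds
  day_end   <- day_start + 86400
  own <- tmpdf$licensePlate == plate
  # Clip each interval to [day_start, day_end]; negative values mean no overlap.
  overlap <- pmin(end_sec[own], day_end) - pmax(start_sec[own], day_start)
  sum(pmax(overlap, 0)) / 3600
}

grid$UsedTime <- vapply(
  seq_len(nrow(grid)),
  function(i) hours_used(grid$Period[i], grid$LicensePlate[i]),
  numeric(1)
)
grid$UsedTime <- round(grid$UsedTime, 1)
grid[order(grid$Period), ]
```

Because the days are clipped at exact midnights, no seconds are lost at the day boundaries, so full days come out as 24 rather than 23.999444.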
#3
0
You are correct that plyr can be used to solve this problem. One possible solution:
library(plyr)  # provides ddply() and .()
tmpdf$starttime <- as.POSIXct(tmpdf$starttime)  # convert the date/time columns to date/time values in R
tmpdf$endtime <- as.POSIXct(tmpdf$endtime)
# Note: diffdays is the full length of each interval in days, attributed to its start date; intervals are not split at midnight
newdf <- ddply(tmpdf, .(as.Date(starttime), licensePlate), function(df){
  df$diffdays <- as.double(difftime(df$endtime, df$starttime, units = 'days'))
  df
})
#If you want to only have the Period, LicensePlate, and Used Time columns remaining:
newdf <- subset(newdf,select=c(1,2,5))
colnames(newdf) <- c('Period','LicensePlate','UsedTime')
Hope it helps!
#4
0
Try this. Does it help?
tmpdf <- data.frame(licensePlate = c("Y80901", "Y80901", "Y80901", "AMG-999", "AMG-999", "W3188", "W3188"),
starttime= c("2015-09-18 09:55", "2015-09-18 23:00", "2015-09-20 15:00", "2015-09-17 15:42", "2015-09-21 09:22", "2015-09-17 09:00", "2015-09-21 14:00"),
endtime = c("2015-09-18 17:55", "2015-09-20 11:00", "2015-09-21 12:00", "2015-09-18 13:00", "2015-09-21 14:22", "2015-09-21 12:00", "2015-09-21 16:00"))
tmpdf
str(tmpdf)
library(lubridate)
tmpdf$starttime <- ymd_hm(paste(tmpdf$starttime))
tmpdf$endtime <- ymd_hm(paste(tmpdf$endtime))
tmpdf$Period <- day(tmpdf$starttime)
tmpdf$diff <- difftime(tmpdf$endtime, tmpdf$starttime)
tmpdf