I have been searching the R section of * for quite a while looking for a proper answer, but nothing I have seen seems to apply to my problem. I have a dataset in this format (I have adapted it to what seems to be the easiest form to work with; the stop_sequence values are normally just incremental numbers for each stop):
route_short_name  trip_id                      direction_id  departure_time  stop_sequence
33A               1.1598.0-33A-b12-1.451.I     1             16:15:00        start
33A               1.1598.0-33A-b12-1.451.I     1             16:57:00        end
41C               10.3265.0-41C-b12-1.277.I    1             08:35:00        start
41C               10.3265.0-41C-b12-1.277.I    1             09:26:00        end
41C               100.3260.0-41C-b12-1.276.I   1             09:40:00        start
41C               100.3260.0-41C-b12-1.276.I   1             10:53:00        end
114               1000.987.0-114-b12-1.86.O    0             21:35:00        start
114               1000.987.0-114-b12-1.86.O    0             22:02:00        end
39                10000.2877.0-39-b12-1.242.I  1             11:15:00        start
39                10000.2877.0-39-b12-1.242.I  1             12:30:00        end
It is basically a dataset of bus trips. All I want is to get the duration of each trip in minutes, so something like this:
route_short_name  trip_id                      direction_id  duration
33A               1.1598.0-33A-b12-1.451.I     1             42
41C               10.3265.0-41C-b12-1.277.I    1             51
41C               100.3260.0-41C-b12-1.276.I   1             73
114               1000.987.0-114-b12-1.86.O    0             27
39                10000.2877.0-39-b12-1.242.I  1             75
I have tried a lot of things, but I have never managed to group the data by trip_id and then work on the two values of each trip. I must have misunderstood something, but I do not know what.
Does anyone have a clue?
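For anyone who wants to try the answers, the sample above can be rebuilt as a data frame. This is only a sketch: the name df1 is an assumption, and the columns are typed as character and integer to match the printed sample.

```r
# Sample rows copied from the question; df1 is an assumed name.
df1 <- data.frame(
  route_short_name = rep(c("33A", "41C", "41C", "114", "39"), each = 2),
  trip_id = rep(c("1.1598.0-33A-b12-1.451.I",
                  "10.3265.0-41C-b12-1.277.I",
                  "100.3260.0-41C-b12-1.276.I",
                  "1000.987.0-114-b12-1.86.O",
                  "10000.2877.0-39-b12-1.242.I"), each = 2),
  direction_id = rep(c(1L, 1L, 1L, 0L, 1L), each = 2),
  departure_time = c("16:15:00", "16:57:00", "08:35:00", "09:26:00",
                     "09:40:00", "10:53:00", "21:35:00", "22:02:00",
                     "11:15:00", "12:30:00"),
  # each trip has one 'start' row followed by one 'end' row
  stop_sequence = rep(c("start", "end"), times = 5),
  stringsAsFactors = FALSE
)

str(df1)
```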
2 Answers
#1
1
We can also do this without converting to 'wide' format (assuming that 'stop_sequence' is 'start' followed by 'end' within each combination of 'route_short_name', 'trip_id', and 'direction_id').
Convert 'departure_time' to a datetime column, group by 'route_short_name', 'trip_id', and 'direction_id', and get the difftime between the last 'departure_time' and the first 'departure_time' of each group.
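library(dplyr)
df1 %>%
  # parse the HH:MM:SS strings; the date that gets attached cancels out in the difference
  mutate(departure_time = as.POSIXct(departure_time, format = '%H:%M:%S', tz = 'UTC')) %>%
  group_by(route_short_name, trip_id, direction_id) %>%
  # minutes between the last and the first departure of each trip
  summarise(duration = as.numeric(difftime(last(departure_time), first(departure_time), units = 'mins')))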
# A tibble: 5 x 4
# Groups: route_short_name, trip_id [?]
#   route_short_name  trip_id                      direction_id  duration
#   <chr>             <chr>                        <int>         <dbl>
#1  114               1000.987.0-114-b12-1.86.O    0             27
#2  33A               1.1598.0-33A-b12-1.451.I     1             42
#3  39                10000.2877.0-39-b12-1.242.I  1             75
#4  41C               10.3265.0-41C-b12-1.277.I    1             51
#5  41C               100.3260.0-41C-b12-1.276.I   1             73
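The question mentions that stop_sequence is normally an incremental number per stop rather than 'start'/'end'. A minimal sketch for that case follows. It assumes a hypothetical data frame stops whose rows and intermediate stop are made up for illustration. It takes the departure times at the smallest and largest stop_sequence of each trip, so it does not rely on the rows being in order.

```r
library(dplyr)

# Hypothetical data: two trips with numeric stop_sequence, rows deliberately shuffled.
stops <- data.frame(
  trip_id = c("A", "A", "A", "B", "B"),
  stop_sequence = c(2L, 1L, 3L, 1L, 2L),
  departure_time = c("16:30:00", "16:15:00", "16:57:00", "08:35:00", "09:26:00"),
  stringsAsFactors = FALSE
)

durations <- stops %>%
  mutate(departure_time = as.POSIXct(departure_time, format = "%H:%M:%S", tz = "UTC")) %>%
  group_by(trip_id) %>%
  summarise(
    # departure at the last stop minus departure at the first stop, in minutes
    duration = as.numeric(difftime(
      departure_time[which.max(stop_sequence)],
      departure_time[which.min(stop_sequence)],
      units = "mins"
    ))
  )

durations
```

For this toy input the durations are 42 minutes for trip A and 51 minutes for trip B. Note that times parsed with '%H:%M:%S' must be valid clock times, so a trip that runs past midnight would need separate handling.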
#2
1
Try this. Right now your data frame is in "long" format, but it is easier to calculate the time difference in "wide" format. The spread function from tidyr (loaded with the tidyverse package) takes your data from long to wide. From there you can use the mutate function to add the new column you want. If departure_time is stored as text, convert start and end to date-times first; as.numeric(difftime(end, start, units = "mins")) then returns the difference in minutes regardless of its size.
library(tidyverse)
wide_df <-
  spread(your_df, key = stop_sequence, value = departure_time) %>%
  # parse the HH:MM:SS strings so that difftime() can subtract them
  mutate(start = as.POSIXct(start, format = "%H:%M:%S", tz = "UTC"),
         end = as.POSIXct(end, format = "%H:%M:%S", tz = "UTC"),
         timediff = as.numeric(difftime(end, start, units = "mins")))
If you want to learn more about "tidy" data (and spreading and gathering), see this link to Hadley's book.
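In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). A sketch of the same long-to-wide step with the newer function follows. It assumes a small data frame your_df that mirrors two trips from the question and stores departure_time as character.

```r
library(dplyr)
library(tidyr)

# Two trips copied from the question; your_df is an assumed name.
your_df <- data.frame(
  trip_id = rep(c("1.1598.0-33A-b12-1.451.I", "10.3265.0-41C-b12-1.277.I"), each = 2),
  departure_time = c("16:15:00", "16:57:00", "08:35:00", "09:26:00"),
  stop_sequence = rep(c("start", "end"), times = 2),
  stringsAsFactors = FALSE
)

wide_df <- your_df %>%
  # one row per trip, with 'start' and 'end' as columns
  pivot_wider(names_from = stop_sequence, values_from = departure_time) %>%
  mutate(
    start = as.POSIXct(start, format = "%H:%M:%S", tz = "UTC"),
    end = as.POSIXct(end, format = "%H:%M:%S", tz = "UTC"),
    timediff = as.numeric(difftime(end, start, units = "mins"))
  )

wide_df
```

The two trips come out with a timediff of 42 and 51 minutes, matching the durations asked for in the question.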