I've looked around for something similar, but couldn't find anything. I have an airport data set which looks something like this (I rounded the hours):
Date Arrival_Time Departure_Time ...
2017-01-01 13:00 14:00 ...
2017-01-01 16:00 17:00 ...
2017-01-01 17:00 18:00 ...
2017-01-01 11:00 12:00 ...
The problem is that for some months there isn't a flight at a specific hour, which means I have missing data for some hours. How can I extract the number of arrivals for each hour of every month so that there are no missing values?
I've tried using dplyr and doing the following:
arrivals <- allFlights %>% group_by(month(Date), Arrival_Time) %>%
summarise(n()) %>%
na.omit()
but the problem clearly arises because group_by cannot fill in my missing data. I end up with data for every month, but no entries for some hours (e.g. no entry for month 1, hour 22:00).
I could currently get my answer by filtering out every month in its own list, and then fully merging them with a complete list of hours, but that's really slow as I have to do this 12 times. Ideally I'm trying to end up with something like this:
Hour Month January February March ... December
00:00 1 ### ### ### ... ###
01:00 1 ### ### ### ... ###
...
00:00 12 ### ### ### ... ###
23:00 12 ### ### ### ... ###
where ### is the number of flights for that hour of that month. Is there a nice way of doing this?
Note: I was thinking if I could somehow join every month's hours with my complete list of hours, and replace all na's with 0's, then that would work, but I couldn't figure out how to do it properly.
Hopefully the question makes sense. I'd gladly clarify if anything is unclear.
EDIT: If you want to try it with the nycflights13 package, you could reproduce my attempt with the following code:
allFlights <- nycflights13::flights
allFlights$arr_time <- format(strptime(substr(as.POSIXct(sprintf("%04.0f", allFlights$arr_time), format="%H%M"), 12, 16), '%H:%M'), '%H:00')
arrivals <- allFlights %>% filter(carrier == "MQ") %>% group_by(month, arr_time) %>% summarise(n()) %>% na.omit()
Notice how arrivals doesn't have anything for month 1, hour 02:00, 03:00, etc. What I'm trying to do is have this be a complete data set with the missing hours filled in as 0.
2 Solutions
#1
0
I think you can use the code below to generate what you need.
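library(dplyr)    # left_join
library(stringr)  # str_pad
# All 24 hour labels ("00:00" to "23:00") crossed with every month in the data
hours <- paste(str_pad(0:23, 2, "left", "0"), "00", sep = ":")
dim_month_hour <- expand.grid(hour = hours, month = sort(unique(allFlights$month)), stringsAsFactors = FALSE)
# Keep every hour/month pair, then set the counts of unmatched pairs to 0
arrivals_full <- left_join(dim_month_hour, arrivals, by = c("hour" = "arr_time", "month" = "month"))
arrivals_full[is.na(arrivals_full$`n()`), "n()"] <- 0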
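The same grid-and-join idea can be tried without any packages. The following is only a minimal sketch in base R on a made-up table: arrivals_toy, its counts, and the column name n are assumptions for illustration, not the asker's real data.

```r
# Toy stand-in for the summarised arrivals table (hypothetical values)
arrivals_toy <- data.frame(
  month    = c(1L, 1L, 2L),
  arr_time = c("13:00", "16:00", "13:00"),
  n        = c(5, 2, 7),
  stringsAsFactors = FALSE
)

# Every hour label crossed with every month of interest
all_hours <- sprintf("%02d:00", 0:23)
grid <- expand.grid(arr_time = all_hours, month = 1:2, stringsAsFactors = FALSE)

# all.x = TRUE keeps every grid row; hour/month pairs without flights get NA in n
arrivals_full <- merge(grid, arrivals_toy, by = c("month", "arr_time"), all.x = TRUE)

# Replace the missing counts with 0
arrivals_full$n[is.na(arrivals_full$n)] <- 0
```

Because the grid is built once for all months, the join runs a single time instead of once per month.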
#2
0
Is this what you're trying to do? I'm not sure if I'm aggregating exactly how you want, but the !is.na should do what you're looking for.
library(dplyr)
library(lubridate)  # month()
# Count the non-missing arrivals in each month/hour group
arrivals <- allFlights %>% group_by(month = month(Date), Arrival_Time) %>%
summarise(n = sum(!is.na(Arrival_Time)))
Edit: I may not be clear. Do you want a zero to show for hours where there are no data?
So I'm circling back to it. There's a cool package called padr that will "pad" the date/time entries with NAs for missing values. Because there is a time_hour field, you can use pad.
library(padr)
allFlightsPad <- allFlights %>% pad
Then you can summarize from there. See this page for info.
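For the summarising step, one possible way to get the hour-by-month layout from the question is base R's table() with explicit factor levels. This is a sketch that swaps in a different technique and does not use padr; flights_toy and its values are made up for illustration.

```r
# Toy flight records (hypothetical): one row per arrival
flights_toy <- data.frame(
  month    = c(1L, 1L, 1L, 2L, 12L),
  arr_time = c("13:00", "13:00", "16:00", "13:00", "23:00"),
  stringsAsFactors = FALSE
)

all_hours <- sprintf("%02d:00", 0:23)

# Explicit factor levels make table() report unseen hours and months as 0
counts <- table(
  Hour  = factor(flights_toy$arr_time, levels = all_hours),
  Month = factor(month.name[flights_toy$month], levels = month.name)
)

# Convert to a plain data frame: one row per hour, one column per month
wide <- as.data.frame.matrix(counts)
wide <- cbind(Hour = rownames(wide), wide, stringsAsFactors = FALSE)
rownames(wide) <- NULL
```

Note that table() drops rows whose arr_time is NA by default, so flights without a recorded arrival time are not counted.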