R: Combine two data frames by nearest time

Time: 2022-02-08 22:58:55

I have two dataframes: one contains a year's worth of hourly temperatures and the other contains flight information. Below is an extract from the temperature dataframe:

  Time <- c("2000-01-01 00:53:00","2000-01-01 06:53:00","2000-01-01 10:53:00")
  Time <- as.POSIXct(Time)
  Temp <- c(20,30,10)
  Temperature <- data.frame(Time,Temp)
  Temperature
                 Time Temp
1 2000-01-01 00:53:00   20
2 2000-01-01 06:53:00   30
3 2000-01-01 10:53:00   10

Below is an extract from the flight information dataframe:

  DepartureTime <- c("2000-01-01 03:01:00","2000-01-01 10:00:00","2000-01-01 14:00:00")
  DepartureTime <- as.POSIXct(DepartureTime)
  FlightInformation <- data.frame(DepartureTime)
  FlightInformation
        DepartureTime
1 2000-01-01 03:01:00
2 2000-01-01 10:00:00
3 2000-01-01 14:00:00

My goal is to take each row of FlightInformation$DepartureTime and find the closest time in the whole column Temperature$Time. I then want to add the corresponding temperature to the FlightInformation dataframe. The desired output should look like this:

FlightInformation
        DepartureTime Temp
1 2000-01-01 03:01:00 20
2 2000-01-01 10:00:00 10
3 2000-01-01 14:00:00 10

My attempts so far have come up with this:

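  FlightInformation$Temp <- NA   # create the column before filling it
  i <- 1
  while(i <= nrow(Temperature)){
    j <- 1                       # restart the inner scan for each temperature row
    while(j <= nrow(FlightInformation)){
      # exact match only, so both time columns must already be rounded to the hour
      if(Temperature$Time[i] == FlightInformation$DepartureTime[j]){
        FlightInformation$Temp[j] <- Temperature$Temp[i]
      }
      j <- j + 1
    }
    i <- i + 1
  }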

This involves first rounding all times to the nearest hour. This method is not as accurate as I would like it to be and seems VERY inefficient! Is there an easy way to find the nearest POSIXct time to give my desired output?

2 solutions

#1


1  

Some assumptions:

  • you have temperature data before and after all flight information; otherwise you'll see NA

  • temperature data is continuous-enough, meaning with the interpolation this presents, you don't grab something from 3 months prior (not useful)

  • temperature data is ordered (easy enough to fix if not)

We'll use cut, which finds the interval that each value falls into, given a series of breaks:

(ind <- cut(FlightInformation$DepartureTime, Temperature$Time, labels = FALSE))
# [1]  1  2 NA
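
For comparison, base R's findInterval gives the same interval indices when the times are converted to numeric seconds, but it reports a value past the last break as the last index rather than NA. A minimal sketch, assuming the sample data from the question; the variable names and the fixed UTC time zone are my own choices for reproducibility, not part of the original answer:

```r
# Sample data from the question; tz = "UTC" is an assumption for reproducibility
temp_time <- as.POSIXct(c("2000-01-01 00:53:00", "2000-01-01 06:53:00",
                          "2000-01-01 10:53:00"), tz = "UTC")
dep_time  <- as.POSIXct(c("2000-01-01 03:01:00", "2000-01-01 10:00:00",
                          "2000-01-01 14:00:00"), tz = "UTC")

# Index of the last temperature reading at or before each departure.
# A result of 0 would mean the departure precedes every reading.
ind_fi <- findInterval(as.numeric(dep_time), as.numeric(temp_time))
ind_fi
# [1] 1 2 3
```

Whether the NA from cut or the clamped index from findInterval is more convenient depends on how you want to treat departures after the last reading.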

These indicate rows within Temperature from which we should retrieve the $Temp. Unfortunately, cut always returns the reading at the start of the interval, even when the departure is closer to the next reading, so we can compensate for that:

(ind <- ind + (abs(Temperature$Time[ind] - FlightInformation$DepartureTime) >
                 abs(Temperature$Time[1+ind] - FlightInformation$DepartureTime)))
# [1]  1  3 NA
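
One way to cross-check the adjusted indices is a brute-force search that compares every departure against every reading. This is only a sketch under the same sample data (variable names and the UTC time zone are my assumptions); it costs one pass over the temperatures per flight, so it suits small inputs or spot checks rather than a full year of data:

```r
# Sample data from the question; tz = "UTC" is an assumption for reproducibility
temp_time <- as.POSIXct(c("2000-01-01 00:53:00", "2000-01-01 06:53:00",
                          "2000-01-01 10:53:00"), tz = "UTC")
dep_time  <- as.POSIXct(c("2000-01-01 03:01:00", "2000-01-01 10:00:00",
                          "2000-01-01 14:00:00"), tz = "UTC")

# For each departure, the index of the reading with the smallest absolute gap
nearest <- vapply(as.numeric(dep_time),
                  function(d) which.min(abs(as.numeric(temp_time) - d)),
                  integer(1))
nearest
# [1] 1 3 3
```

Note that the brute-force version never returns NA: the third departure simply maps to the last reading, however far away it is.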

Okay, now that NA: it indicates that the latest $DepartureTime is outside of the known times. This violates my first assumption above, but it can be fixed. I use a magic constant of "6 hours" here to decide whether the data is close enough to use; there are certainly many other heuristics which will be less wrong. For departures within that window, we can just assume the latest temperature:

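# compare in seconds explicitly; the automatically chosen difftime units may be minutes or hours
(is_recoverable <- is.na(ind) &
   abs(difftime(FlightInformation$DepartureTime, max(Temperature$Time), units = "secs")) < 60*60*6)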
# [1] FALSE FALSE  TRUE
ind[is_recoverable] <- nrow(Temperature)
ind
# [1] 1 3 3

The results:

FlightInformation$Temp <- Temperature$Temp[ ind ]
FlightInformation
#         DepartureTime Temp
# 1 2000-01-01 03:01:00   20
# 2 2000-01-01 10:00:00   10
# 3 2000-01-01 14:00:00   10

Though definitely quicker than double while loops, it will be a problem if you have large gaps in your temperature data. That is, if you have a 3-year gap in your data, the most-recent temperature will be used, which might be 2.99 years ago. For a double-check, use this:

FlightInformation$TempTime <- Temperature$Time[ ind ]
FlightInformation$TimeDelta <- with(FlightInformation, abs(TempTime - DepartureTime))
FlightInformation
#         DepartureTime Temp            TempTime TimeDelta
# 1 2000-01-01 03:01:00   20 2000-01-01 00:53:00  128 mins
# 2 2000-01-01 10:00:00   10 2000-01-01 10:53:00   53 mins
# 3 2000-01-01 14:00:00   10 2000-01-01 10:53:00  187 mins

You can use different units for the time delta and check for problems with:

units(FlightInformation$TimeDelta) <- "secs"
which(FlightInformation$TimeDelta > 60*60*6)
# integer(0)

(where integer(0) says you have none that are outside of my magic window of 6 hours.)
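
The window check can also be folded into the lookup itself, so that a departure with no reading inside the window gets NA instead of a stale temperature. A rough sketch: nearest_temp and its max_gap_secs argument are hypothetical names of mine, the fourth departure is an extra made-up row added only to trigger the NA case, and the UTC time zone is assumed for reproducibility:

```r
# Hypothetical helper (not from the answer): temperature of the nearest
# reading, or NA when that reading is more than max_gap_secs away
nearest_temp <- function(dep, time, temp, max_gap_secs = 6 * 60 * 60) {
  dep_s  <- as.numeric(dep)    # seconds since the epoch
  time_s <- as.numeric(time)
  idx <- vapply(dep_s, function(d) which.min(abs(time_s - d)), integer(1))
  gap <- abs(time_s[idx] - dep_s)
  ifelse(gap <= max_gap_secs, temp[idx], NA_real_)
}

temp_time <- as.POSIXct(c("2000-01-01 00:53:00", "2000-01-01 06:53:00",
                          "2000-01-01 10:53:00"), tz = "UTC")
temp <- c(20, 30, 10)
# The last departure is an invented row, about 22 hours after the final reading
dep_time <- as.POSIXct(c("2000-01-01 03:01:00", "2000-01-01 10:00:00",
                         "2000-01-01 14:00:00", "2000-01-02 09:00:00"),
                       tz = "UTC")

result <- nearest_temp(dep_time, temp_time, temp)
result
# [1] 20 10 10 NA
```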

#2


1  

Here's a way! Time is easiest to work with for this if you convert it to a numeric value. Then you can compare the numeric values to find the closest times before/after your reference time (FlightInformation$time_num in the below example). Once you have the closest time before and after your reference value, figure out which is really the closest to your reference. Use that time value to look up (index) the correct temperature value and add it to your data frame.

#convert time to numeric (seconds since origin of time)
Temperature$time_num <- as.numeric(Temperature$Time) 
FlightInformation$time_num <- as.numeric(FlightInformation$DepartureTime)

#make sure time data is in correct order so that indexes for time are in correct order 
Temperature <- Temperature[with(Temperature, order(time_num)), ] #sort data

for (i in seq_len(nrow(FlightInformation))) #for each row of data in flight (seq_len is safe when there are zero rows)
{
  #find the time in Temp that is closest + prior to Flight time
  #create a logical vector saying which Temperature$time_num are <= to FlightInformation$time_num. 
  #pull the max row index from the logical vector where value == TRUE (this is the closest time for Temp that is prior to Flight Time)
  #use that row index to look up the Temperature$time_num value that is closest + prior to Flight time
  #will return NA/warning message if no time in Temp is before time in Flight
  temptime_prior <- Temperature[max(which(Temperature$time_num <= FlightInformation$time_num[i])), "time_num"] 

  #find the time in Temp that is closest + after to Flight time
  #will return NA/warning message if no time in Temp is after time in Flight
  temptime_after <- Temperature[min(which(Temperature$time_num > FlightInformation$time_num[i])), "time_num"] 

  #compare times before and after to see which is closest to flight time. If no before/after time was found (e.g., NA was returned), always use the other time value
  temptime_closest <- ifelse(is.na(temptime_prior), temptime_after, 
                             ifelse(is.na(temptime_after), temptime_prior, 
                                    ifelse((FlightInformation$time_num[i] - temptime_prior) <= (temptime_after - FlightInformation$time_num[i]),
                                           temptime_prior, temptime_after)))

  #look up the right temp by finding the row index of right Temp$time_num value and add it to Flight info
  FlightInformation$Temp[i] <- Temperature[which(Temperature$time_num == temptime_closest), "Temp"]
}

#get rid of numeric time column, you don't need it anymore
FlightInformation <- FlightInformation[,!(names(FlightInformation) %in% c("time_num"))]

Output

        DepartureTime Temp
1 2000-01-01 03:01:00   20
2 2000-01-01 10:00:00   10
3 2000-01-01 14:00:00   10
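
The same before/after comparison can probably be done without a loop. The sketch below is my own vectorised variant, not the answerer's code: it uses findInterval to locate the reading at or before each departure, clamps the neighbouring indices to the valid range, and keeps whichever neighbour is closer. It assumes the temperature times are sorted in ascending order, and the variable names and UTC time zone are illustrative assumptions:

```r
# Sample data from the question; tz = "UTC" is an assumption for reproducibility
temp_time <- as.POSIXct(c("2000-01-01 00:53:00", "2000-01-01 06:53:00",
                          "2000-01-01 10:53:00"), tz = "UTC")
temp <- c(20, 30, 10)
flights <- data.frame(DepartureTime = as.POSIXct(
  c("2000-01-01 03:01:00", "2000-01-01 10:00:00", "2000-01-01 14:00:00"),
  tz = "UTC"))

t_s <- as.numeric(temp_time)              # must be sorted ascending
d_s <- as.numeric(flights$DepartureTime)
n   <- length(t_s)

before <- findInterval(d_s, t_s)          # 0 when no reading is at or before
lo <- pmax(before, 1L)                    # nearest candidate at or before
hi <- pmin(before + 1L, n)                # nearest candidate after
# Prefer the earlier reading on ties, matching the <= in the loop version
idx <- ifelse(d_s - t_s[lo] <= t_s[hi] - d_s, lo, hi)

flights$Temp <- temp[idx]
flights$Temp
# [1] 20 10 10
```

Because findInterval does a binary search per value, this should scale far better than the loop when both tables are large.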

If you have subsets of data in each data frame you need to match up to (e.g., match df1$group1 time values only to df2$group1 time values), you can use survival::neardate. It's a nice function for this that does basically what the above code does, but has some additional parameters if you need them.

Hope this helps! The code's a lot shorter without all the comments =)
