Cleaning origin-destination data in R

Time: 2021-04-23 15:37:41

I have trip data that looks something like this

ClientID <- c("45675")
Date <- c("10/10/2016")
PickUpAddress <- c("123 Street", "45 Way", "66 Blvd")
DropOffAddress <- c("45 Way", "66 Blvd", "123 Street")
PickUpTime <- c("08:00", "17:00", "18:00")
DropOffTime <- c("8:30", "17:30", "19:00")

df <- data.frame(ClientID, Date, PickUpAddress, DropOffAddress, PickUpTime, DropOffTime)

df
  ClientID       Date PickUpAddress DropOffAddress PickUpTime DropOffTime
1    45675 10/10/2016    123 Street         45 Way      08:00        8:30
2    45675 10/10/2016        45 Way        66 Blvd      17:00       17:30
3    45675 10/10/2016       66 Blvd     123 Street      18:00       19:00

But with thousands of records and a varying number of trips per client through the year.

The third row in this example is the return trip (the trip back to the original origin). I would like to remove all return trips from the data.

Any suggestions?

1 solution

#1


You can try the following solution, which is based on defining a home address for each client.

library(dplyr)
library(lubridate)

# create date/time format variables
df$Date_PickUpTime <- paste(df$Date, df$PickUpTime, sep = " ")
df$Date_DropOffTime <- paste(df$Date, df$DropOffTime, sep = " ")

df$Date_PickUpTime <- mdy_hm(df$Date_PickUpTime)
df$Date_DropOffTime <- mdy_hm(df$Date_DropOffTime)

str(df) # as you can see Date_PickUpTime and Date_DropOffTime are in POSIXct format

# define the client home address
df %>%
  group_by(ClientID) %>%                 # group by client
  arrange(Date_PickUpTime) %>%           # order the data by Date_PickUpTime
  mutate(HomeAddress = PickUpAddress[1]) # client home address is the first PickUpAddress

# ... then add a filter to the code above

df %>%
  group_by(ClientID) %>%                      # group by client
  arrange(Date_PickUpTime) %>%                # order the data by Date_PickUpTime
  mutate(HomeAddress = PickUpAddress[1]) %>%  # client home address
  filter(DropOffAddress != HomeAddress)       # keep only trips that do not end at home,
                                              # so the return trip (3rd row) is dropped
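
Since the data covers a whole year, one possible variation is to define the origin per client and per day instead of once per client. The following is only a minimal sketch, assuming that each day's first pick-up address is that day's origin. The `trips` data (including its second day) and the names `PickUp` and `outbound` are made up for illustration and are not part of the question.

```r
library(dplyr)
library(lubridate)

# Hypothetical sample: one client, two days, each day ending with a trip
# back to where that day started
trips <- data.frame(
  ClientID       = "45675",
  Date           = c("10/10/2016", "10/10/2016", "10/10/2016",
                     "10/11/2016", "10/11/2016"),
  PickUpAddress  = c("123 Street", "45 Way", "66 Blvd", "9 Lane", "45 Way"),
  DropOffAddress = c("45 Way", "66 Blvd", "123 Street", "45 Way", "9 Lane"),
  PickUpTime     = c("08:00", "17:00", "18:00", "09:00", "12:00"),
  stringsAsFactors = FALSE
)

outbound <- trips %>%
  mutate(PickUp = mdy_hm(paste(Date, PickUpTime))) %>%  # parse pick-up date/time
  group_by(ClientID, Date) %>%                          # one group per client and day
  arrange(PickUp, .by_group = TRUE) %>%                 # chronological within each group
  mutate(HomeAddress = first(PickUpAddress)) %>%        # assumed origin for that day
  filter(DropOffAddress != HomeAddress) %>%             # drop trips ending at the origin
  ungroup()

outbound
```

With a single home address per client, the second day's trip back to "9 Lane" would be kept, because it does not end at "123 Street". Which definition is appropriate depends on how the real data is structured.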
