Self-join using a subquery in Google BigQuery

Time: 2022-10-12 19:19:51

I am a SQL noob in need of some help with a specific query using the NYC 2013 Taxi Trips Dataset located here.

I want to analyze dropoffs at JFK Airport, but I want to build my query so that it also includes the next pickup a taxi makes after dropping someone off at the airport.

This query gets me all the trips picked up at the airport on a given day:

SELECT * FROM [833682135931:nyctaxi.trip_data] 
WHERE DATE(pickup_datetime) = '2013-05-01'
  AND FLOAT(pickup_latitude) < 40.651381
  AND FLOAT(pickup_latitude) > 40.640668
  AND FLOAT(pickup_longitude) < -73.776283
  AND FLOAT(pickup_longitude) > -73.794694

I want to join the dataset with itself to add next_pickup_time, next_pickup_lat, and next_pickup_lon values for each row.

To do this, I assume I need a correlated subquery, but don't know where to start building it out because the subquery is based on the outer query.

It needs to search for trips with the same medallion, on the same day, and with a pickup time later than the current airport dropoff, then limit 1... Any help is much appreciated!
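
To make the lookup described above concrete, here is a minimal Python sketch of the same logic on a handful of made-up rows. The sample trips, the dictionary keys, and the helper names (`ts`, `dropped_off_at_jfk`, `next_pickup`) are all assumptions for illustration; only the bounding box comes from the query above. It is not BigQuery code, just the per-row search spelled out.

```python
from datetime import datetime

# JFK bounding box, copied from the WHERE clause in the question.
LAT_MIN, LAT_MAX = 40.640668, 40.651381
LON_MIN, LON_MAX = -73.794694, -73.776283


def ts(text):
    """Parse a 'YYYY-MM-DD HH:MM:SS' timestamp string."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


# Hypothetical toy rows; only the columns needed for the lookup are kept.
TRIPS = [
    {"medallion": "A", "pickup": ts("2013-05-01 08:00:00"),
     "dropoff": ts("2013-05-01 08:40:00"),
     "dropoff_lat": 40.645, "dropoff_lon": -73.785},
    {"medallion": "A", "pickup": ts("2013-05-01 09:10:00"),
     "dropoff": ts("2013-05-01 09:50:00"),
     "dropoff_lat": 40.760, "dropoff_lon": -73.980},
    {"medallion": "A", "pickup": ts("2013-05-01 11:00:00"),
     "dropoff": ts("2013-05-01 11:20:00"),
     "dropoff_lat": 40.730, "dropoff_lon": -73.990},
    {"medallion": "C", "pickup": ts("2013-05-01 23:00:00"),
     "dropoff": ts("2013-05-01 23:40:00"),
     "dropoff_lat": 40.645, "dropoff_lon": -73.785},
    {"medallion": "C", "pickup": ts("2013-05-02 00:10:00"),
     "dropoff": ts("2013-05-02 00:30:00"),
     "dropoff_lat": 40.700, "dropoff_lon": -73.950},
]


def dropped_off_at_jfk(trip):
    """True when the dropoff point falls inside the JFK bounding box."""
    return (LAT_MIN < trip["dropoff_lat"] < LAT_MAX
            and LON_MIN < trip["dropoff_lon"] < LON_MAX)


def next_pickup(trip, trips):
    """Earliest later pickup by the same medallion on the same day, or None."""
    candidates = [
        other["pickup"] for other in trips
        if other["medallion"] == trip["medallion"]
        and other["pickup"].date() == trip["dropoff"].date()
        and other["pickup"] > trip["dropoff"]
    ]
    return min(candidates, default=None)  # the "limit 1" step


results = [(t["medallion"], t["dropoff"], next_pickup(t, TRIPS))
           for t in TRIPS if dropped_off_at_jfk(t)]
for row in results:
    print(row)
```

In this sketch, medallion A's 08:40 airport dropoff is paired with its 09:10 pickup, while medallion C's 23:40 dropoff gets `None` because its next pickup falls on the following day. Scanning every trip for every dropoff is what a correlated subquery would do, which is why it gets expensive on a large table.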

4 solutions

#1


1  

This should give you all the dropoffs with their next pickups:

SELECT *
FROM
  (SELECT medallion,
          dropoff_datetime,
          dropoff_longitude,
          dropoff_latitude,
          LEAD(pickup_datetime, 1, "") OVER (PARTITION BY medallion
                                             ORDER BY pickup_datetime) AS next_datetime,
          LEAD(pickup_longitude, 1, "0.0") OVER (PARTITION BY medallion
                                                 ORDER BY pickup_datetime) AS next_longitude,
          LEAD(pickup_latitude, 1, "0.0") OVER (PARTITION BY medallion
                                                ORDER BY pickup_datetime) AS next_latitude
   FROM [833682135931:nyctaxi.trip_data]) d
WHERE date(next_datetime)=date(dropoff_datetime)
  AND DATE(dropoff_datetime) = '2013-05-01'
  AND FLOAT(dropoff_latitude) < 40.651381
  AND FLOAT(dropoff_latitude) > 40.640668
  AND FLOAT(dropoff_longitude) < -73.776283
  AND FLOAT(dropoff_longitude) > -73.794694

#2


1  

Consider using the LAG window function instead of a self-join.
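
For readers unfamiliar with these window functions, the following is a small sketch of what LAG and LEAD return. It runs against an in-memory SQLite database through Python's `sqlite3` module (window functions need SQLite 3.25 or newer), not against BigQuery, and the `pickups` table and its four rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pickups (medallion TEXT, pickup_datetime TEXT)")
conn.executemany(
    "INSERT INTO pickups VALUES (?, ?)",
    [
        ("A", "2013-05-01 08:00:00"),
        ("A", "2013-05-01 09:10:00"),
        ("A", "2013-05-01 11:00:00"),
        ("B", "2013-05-01 08:30:00"),
    ],
)

# LAG looks one row back and LEAD one row forward within each medallion's
# time-ordered rows; both return NULL when no such row exists.
rows = conn.execute("""
    SELECT medallion,
           pickup_datetime,
           LAG(pickup_datetime)  OVER w AS prev_pickup,
           LEAD(pickup_datetime) OVER w AS next_pickup
    FROM pickups
    WINDOW w AS (PARTITION BY medallion ORDER BY pickup_datetime)
    ORDER BY medallion, pickup_datetime
""").fetchall()

for row in rows:
    print(row)
```

Each row sees its neighbors without any join: LAG yields the previous pickup for the same taxi and LEAD yields the following one.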

#3


1  

I think that N.N. has the right idea, except that you want LEAD instead of LAG to get the next pickup. For example, this query will produce the next pickup time, latitude, and longitude after a pickup at JFK.

SELECT
    medallion,
    pickup_datetime,
    pickup_longitude,
    pickup_latitude,
    LEAD(pickup_datetime, 1, "") OVER (PARTITION BY medallion ORDER BY pickup_datetime) AS next_datetime,
    LEAD(pickup_longitude, 1, "0.0") OVER (PARTITION BY medallion ORDER BY pickup_datetime) AS next_longitude,
    LEAD(pickup_latitude, 1, "0.0") OVER (PARTITION BY medallion ORDER BY pickup_datetime) AS next_latitude
FROM [833682135931:nyctaxi.trip_data]
WHERE DATE(pickup_datetime) = '2013-05-01'
  AND FLOAT(pickup_latitude) < 40.651381
  AND FLOAT(pickup_latitude) > 40.640668
  AND FLOAT(pickup_longitude) < -73.776283
  AND FLOAT(pickup_longitude) > -73.794694;

Any time you can avoid a self-join, it's good to do so.
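
One caveat worth checking with the query above: in standard SQL, a WHERE clause is applied before window functions are evaluated, so filtering to JFK rows in the same SELECT as LEAD means LEAD only sees the rows that survived the filter. The sketch below illustrates this with Python's `sqlite3` module (SQLite 3.25 or newer) on a made-up three-row table, where the hypothetical `at_jfk` flag stands in for the latitude/longitude conditions. BigQuery was not run here, so treat it as an illustration of general SQL behavior rather than a verified BigQuery result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trips (medallion TEXT, pickup_datetime TEXT, at_jfk INTEGER)"
)
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [
        ("A", "2013-05-01 08:00:00", 1),  # pickup at JFK
        ("A", "2013-05-01 09:00:00", 0),  # pickup elsewhere
        ("A", "2013-05-01 10:00:00", 1),  # pickup at JFK again
    ],
)

# Filter and LEAD in the same SELECT: the 09:00 row is removed before
# LEAD runs, so it can never be reported as the next pickup.
filtered_first = conn.execute("""
    SELECT pickup_datetime,
           LEAD(pickup_datetime, 1, '') OVER (PARTITION BY medallion
                                              ORDER BY pickup_datetime)
    FROM trips
    WHERE at_jfk = 1
    ORDER BY pickup_datetime
""").fetchall()

# LEAD inside a subquery over all rows, filter applied outside.
windowed_first = conn.execute("""
    SELECT pickup_datetime, next_datetime
    FROM (SELECT pickup_datetime,
                 at_jfk,
                 LEAD(pickup_datetime, 1, '') OVER (PARTITION BY medallion
                                                    ORDER BY pickup_datetime)
                     AS next_datetime
          FROM trips) d
    WHERE at_jfk = 1
    ORDER BY pickup_datetime
""").fetchall()

print(filtered_first)
print(windowed_first)
```

With the filter in the same SELECT, the 08:00 row is paired with the 10:00 JFK pickup; with the window computed in a subquery first, it is paired with the 09:00 pickup, which is the taxi's actual next trip. Computing LEAD over the unfiltered rows and filtering afterwards is the shape used by the subquery-based answers on this page.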

#4


0  

This is what finally worked, modified from Pentium10's answer:

SELECT *
FROM
  (SELECT medallion,
          dropoff_datetime,
          dropoff_longitude,
          dropoff_latitude,
          LEAD(pickup_datetime, 1, "") OVER (PARTITION BY medallion
                                             ORDER BY pickup_datetime) AS next_datetime,
          LEAD(pickup_longitude, 1, "0.0") OVER (PARTITION BY medallion
                                                 ORDER BY pickup_datetime) AS next_longitude,
          LEAD(pickup_latitude, 1, "0.0") OVER (PARTITION BY medallion
                                                ORDER BY pickup_datetime) AS next_latitude
   FROM [833682135931:nyctaxi.trip_data]) d
WHERE date(next_datetime)=date(dropoff_datetime)
  AND DATE(dropoff_datetime) = '2013-05-01'
  AND FLOAT(dropoff_latitude) < 40.651381
  AND FLOAT(dropoff_latitude) > 40.640668
  AND FLOAT(dropoff_longitude) < -73.776283
  AND FLOAT(dropoff_longitude) > -73.794694
