google big query sql中的性能增强

时间:2020-12-10 15:25:55

In the below google big query, I join two tables "Data" and "Location" on Id, StartTime and StopTime.

在下面的google big查询中,我在Id,StartTime和StopTime上加入了两个表“Data”和“Location”。

Since Data is partitioned by Date, I have the condition based on PartitionTime in the WHERE clauase.

由于数据按日期分区,因此我在WHERE clauase中具有基于PartitionTime的条件。

The query runs for a very long time (~20 mins), just wondering whether i am missing some performace techniques for the query efficiency to be improved.

查询运行了很长时间(约20分钟),只是想知道我是否缺少一些性能技术来提高查询效率。

Any help would be appreciated. Thanks !!

任何帮助,将不胜感激。谢谢 !!

  SELECT
    *
  FROM (
      SELECT
          A.Id AS Id, A.Id1 AS Id1, StartTime, StopTime, Latitude, Longitude, DateTime
      FROM
          `Data` AS A
      JOIN
        (SELECT * FROM `Location` WHERE _TABLE_SUFFIX IN ("01","02","03","04","05","06","07","08","09","10","11","12","13","14","15","16","17","18",
        "19","20","21", "22", "23","24", "26", "27", "28","29","30","31" )) AS B
      ON
        A.StartTime < B.DateTime
        AND A.StopTime >= B.DateTime
        AND A.Id = B.Id
  WHERE
    (A._PARTITIONTIME BETWEEN TIMESTAMP('2016-11-01')
      AND TIMESTAMP('2016-11-30'))
  ORDER BY
    B.Id,
    A.Id1,
    B.DateTime )
ORDER BY
  Id,
  Id1,
  DateTime

2 个解决方案

#1


0  

I would also remove outer ORDER BY as I think it is the main killer of performance for your query.
Moving _PARTITIONTIME up to respective table is another item to consider.
Using SELECT * in sub-selects doesn't affect performance and cost (as it is final outer SELECT which defines which columns are used in addition to those used in WHERE and other clauses), but as a good practice I think better to list explicitly needed columns/fields

我也会删除外部ORDER BY,因为我认为它是查询性能的主要杀手。将_PARTITIONTIME移动到相应的表是另一个要考虑的项目。在子选择中使用SELECT *不会影响性能和成本(因为它是最终的外部SELECT,它定义除了WHERE和其他子句中使用的列之外还使用哪些列),但作为一种好的做法,我认为最好明确列出需要的列/字段

#standardSQL
SELECT
  A.Id AS Id, A.Id1 AS Id1, StartTime, StopTime, Latitude, Longitude, DateTime
FROM (
  SELECT Id, Id1, StartTime, StopTime 
  FROM `Data` 
  WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2016-11-01') AND TIMESTAMP('2016-11-30')
) AS A
JOIN (
  SELECT Latitude, Longitude, DateTime 
  FROM `Location` 
  WHERE _TABLE_SUFFIX IN ("01","02","03","04","05","06","07","08","09","10","11","12","13","14","15","16","17","18",
"19","20","21", "22", "23","24", "26", "27", "28","29","30","31" )
) AS B
ON  A.StartTime < B.DateTime
AND A.StopTime >= B.DateTime
AND A.Id = B.Id   

You may also consider "compressing" below statement as Elliott suggested,

您也可以考虑Elliott建议的“压缩”以下声明,

WHERE _TABLE_SUFFIX IN ("01","02","03","04","05","06","07","08","09","10","11","12","13","14","15","16","17","18",
"19","20","21", "22", "23","24", "26", "27", "28","29","30","31" )  

but be careful as this can bring unwanted tables to be involved (in case if you have such in your dataset). like for example those with suffixes as '011' or '046' etc.

但要小心,因为这会带来不必要的表格(万一你在数据集中有这样的表格)。例如那些后缀为'011'或'046'等的那些。

Yet another option is - you potentially have some logical relation between partitions in Data and suffix in Location. If so you can employ it to narrow down JOIN thus making it more performant

另一种选择是 - 您可能在数据中的分区和位置中的后缀之间存在某种逻辑关系。如果是这样,您可以使用它来缩小JOIN,从而使其更具性能

#2


1  

A couple of thoughts:

几点想法:

  • There is no need for the inner ORDER BY, since only a top-level ORDER BY has an effect on the results of the query.
  • 不需要内部ORDER BY,因为只有*ORDER BY才会影响查询结果。
  • If you want to query over all suffixes except "25", you can use _TABLE_SUFFIX BETWEEN "01" AND "31" AND _TABLE_SUFFIX != "25".
  • 如果要查询除“25”以外的所有后缀,可以使用_TABLE_SUFFIX BETWEEN“01”和“31”AND _TABLE_SUFFIX!=“25”。
  • Depending on the type of JOIN, the filter on _PARTITIONTIME might not be "pushed down" to avoid reading extra data automatically, e.g. if you are actually using a RIGHT JOIN. If this is the case, use a subquery such as (SELECT * FROM YourTable WHERE _PARTITIONTIME BETWEEN ...) AS A RIGHT JOIN ... instead.
  • 根据JOIN的类型,_PARTITIONTIME上的过滤器可能不会被“下推”以避免自动读取额外数据,例如如果你实际上正在使用RIGHT JOIN。如果是这种情况,请使用子查询,例如(SELECT * FROM YourTable WHERE _PARTITIONTIME BETWEEN ...)作为一个正确的JOIN ...而是。

If you'd like a BigQuery engineer to take a more detailed look at where the time went, you could include a sample job ID in your question and someone may be able to help.

如果您希望BigQuery工程师更详细地了解时间,您可以在问题中包含一个示例作业ID,并且有人可能会提供帮助。

#1


0  

I would also remove outer ORDER BY as I think it is the main killer of performance for your query.
Moving _PARTITIONTIME up to respective table is another item to consider.
Using SELECT * in sub-selects doesn't affect performance and cost (as it is final outer SELECT which defines which columns are used in addition to those used in WHERE and other clauses), but as a good practice I think better to list explicitly needed columns/fields

我也会删除外部ORDER BY,因为我认为它是查询性能的主要杀手。将_PARTITIONTIME移动到相应的表是另一个要考虑的项目。在子选择中使用SELECT *不会影响性能和成本(因为它是最终的外部SELECT,它定义除了WHERE和其他子句中使用的列之外还使用哪些列),但作为一种好的做法,我认为最好明确列出需要的列/字段

#standardSQL
SELECT
  A.Id AS Id, A.Id1 AS Id1, StartTime, StopTime, Latitude, Longitude, DateTime
FROM (
  SELECT Id, Id1, StartTime, StopTime 
  FROM `Data` 
  WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2016-11-01') AND TIMESTAMP('2016-11-30')
) AS A
JOIN (
  SELECT Latitude, Longitude, DateTime 
  FROM `Location` 
  WHERE _TABLE_SUFFIX IN ("01","02","03","04","05","06","07","08","09","10","11","12","13","14","15","16","17","18",
"19","20","21", "22", "23","24", "26", "27", "28","29","30","31" )
) AS B
ON  A.StartTime < B.DateTime
AND A.StopTime >= B.DateTime
AND A.Id = B.Id   

You may also consider "compressing" below statement as Elliott suggested,

您也可以考虑Elliott建议的“压缩”以下声明,

WHERE _TABLE_SUFFIX IN ("01","02","03","04","05","06","07","08","09","10","11","12","13","14","15","16","17","18",
"19","20","21", "22", "23","24", "26", "27", "28","29","30","31" )  

but be careful as this can bring unwanted tables to be involved (in case if you have such in your dataset). like for example those with suffixes as '011' or '046' etc.

但要小心,因为这会带来不必要的表格(万一你在数据集中有这样的表格)。例如那些后缀为'011'或'046'等的那些。

Yet another option is - you potentially have some logical relation between partitions in Data and suffix in Location. If so you can employ it to narrow down JOIN thus making it more performant

另一种选择是 - 您可能在数据中的分区和位置中的后缀之间存在某种逻辑关系。如果是这样,您可以使用它来缩小JOIN,从而使其更具性能

#2


1  

A couple of thoughts:

几点想法:

  • There is no need for the inner ORDER BY, since only a top-level ORDER BY has an effect on the results of the query.
  • 不需要内部ORDER BY,因为只有*ORDER BY才会影响查询结果。
  • If you want to query over all suffixes except "25", you can use _TABLE_SUFFIX BETWEEN "01" AND "31" AND _TABLE_SUFFIX != "25".
  • 如果要查询除“25”以外的所有后缀,可以使用_TABLE_SUFFIX BETWEEN“01”和“31”AND _TABLE_SUFFIX!=“25”。
  • Depending on the type of JOIN, the filter on _PARTITIONTIME might not be "pushed down" to avoid reading extra data automatically, e.g. if you are actually using a RIGHT JOIN. If this is the case, use a subquery such as (SELECT * FROM YourTable WHERE _PARTITIONTIME BETWEEN ...) AS A RIGHT JOIN ... instead.
  • 根据JOIN的类型,_PARTITIONTIME上的过滤器可能不会被“下推”以避免自动读取额外数据,例如如果你实际上正在使用RIGHT JOIN。如果是这种情况,请使用子查询,例如(SELECT * FROM YourTable WHERE _PARTITIONTIME BETWEEN ...)作为一个正确的JOIN ...而是。

If you'd like a BigQuery engineer to take a more detailed look at where the time went, you could include a sample job ID in your question and someone may be able to help.

如果您希望BigQuery工程师更详细地了解时间,您可以在问题中包含一个示例作业ID,并且有人可能会提供帮助。