Inner join on large tables in BigQuery

Time: 2022-02-02 19:19:53

I am trying to perform an inner join between two big tables, each with almost 30 million records. When I run a simple INNER JOIN between them, I get the error below telling me to use the JOIN EACH syntax, but I couldn't find any proper documentation for JOIN EACH in Google's references. Can somebody share their thoughts on this? Here is the error:

Error: Table too large for JOIN. Consider using JOIN EACH. For more details, please see https://developers.google.com/bigquery/docs/query-reference#joins

1 solution

#1


Looking at your question, it seems all you need is to read up a bit on the available documentation.

Now, having read Jordan Tigani's book, I can tell you that when you do a plain JOIN, the system actually sends the smaller table to every shard that handles your query. Since neither of your tables is under 8 MB, it cannot simply send one of them to every shard (it's too big).
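
To make the "send the smaller table to every shard" idea concrete, here is a minimal Python sketch of that kind of broadcast join. It is only an illustration of the concept, not BigQuery's actual implementation; the function name `broadcast_join` and the sample tables are made up for the example.

```python
from collections import defaultdict


def broadcast_join(large_shards, small_table, key):
    """Inner-join every shard of the large table against the whole small table.

    Hypothetical helper for illustration only: the small table is assumed to be
    small enough that each shard can hold a full copy of it.
    """
    # Index the small table by join key; conceptually each shard gets this copy.
    lookup = defaultdict(list)
    for row in small_table:
        lookup[row[key]].append(row)

    results = []
    for shard in large_shards:  # each shard could run independently
        for left in shard:
            for right in lookup.get(left[key], []):
                results.append({**left, **right})
    return results


# The large table is already split across two shards; the small one is not.
orders_shards = [
    [{"user_id": 1, "order": "a"}, {"user_id": 2, "order": "b"}],
    [{"user_id": 1, "order": "c"}, {"user_id": 3, "order": "d"}],
]
users = [{"user_id": 1, "name": "Ann"}, {"user_id": 2, "name": "Bob"}]

joined = broadcast_join(orders_shards, users, "user_id")
```

This only works because every shard can afford to keep the full `users` table. Once both sides are too big for that, copying one of them everywhere stops being an option.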

The way "JOIN EACH" works is that it tells the system "hash the join key on both tables, and send a subset of each table to a specific shard". Hashing means that rows from both tables with the same value for the inner join criterion end up on the same shard. It has an impact on performance, but it's the only thing that can make a JOIN go through when both tables are bigger than 8 MB.
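
As a rough sketch of that hashing step, the Python below partitions both tables by a hash of the join key and then joins each pair of partitions locally. This is an assumption-laden illustration, not BigQuery's code: `shard_for`, `partition`, `hash_partitioned_join`, the shard count, and the use of Python's built-in `hash()` are all stand-ins chosen for the example.

```python
from collections import defaultdict


def shard_for(key_value, num_shards):
    # Python's built-in hash() stands in for whatever hash the real system uses.
    return hash(key_value) % num_shards


def partition(table, key, num_shards):
    """Split a table into num_shards lists based on the hash of the join key."""
    shards = [[] for _ in range(num_shards)]
    for row in table:
        shards[shard_for(row[key], num_shards)].append(row)
    return shards


def hash_partitioned_join(left, right, key, num_shards=4):
    """Inner join where each shard only sees its own slice of both tables."""
    left_parts = partition(left, key, num_shards)
    right_parts = partition(right, key, num_shards)

    results = []
    # Rows with equal keys hash to the same shard index, so each pair of
    # partitions can be joined without looking at any other shard.
    for l_part, r_part in zip(left_parts, right_parts):
        lookup = defaultdict(list)
        for row in r_part:
            lookup[row[key]].append(row)
        for row in l_part:
            for match in lookup.get(row[key], []):
                results.append({**row, **match})
    return results


orders = [
    {"user_id": 1, "order": "a"},
    {"user_id": 2, "order": "b"},
    {"user_id": 1, "order": "c"},
    {"user_id": 3, "order": "d"},
]
users = [{"user_id": 1, "name": "Ann"}, {"user_id": 2, "name": "Bob"}]

joined = hash_partitioned_join(orders, users, "user_id")
```

No shard ever needs a full copy of either table, which is why this approach still works when both sides are large. The price is the extra pass to hash and redistribute every row of both tables.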
