Any ideas how to make this query return results on Google BigQuery? I'm getting a "resources exceeded" error... There are about 2B rows in the dataset. I'm trying to get the artist ID that appears most often for each user_id.
SELECT user_id, artist, COUNT(*) AS count
FROM [legacy20130831.merged_data] AS d
GROUP EACH BY user_id, artist
ORDER BY user_id ASC, count DESC
1 Answer
An equivalent query on public data, that throws the same error:
SELECT actor, repository_name, COUNT(*) AS count
FROM [githubarchive:github.timeline] AS d
GROUP EACH BY actor, repository_name
ORDER BY actor, count DESC
Compare with the same query, plus a limit on the results to be returned. This one works (14 seconds for me):
SELECT actor, repository_name, COUNT(*) AS count
FROM [githubarchive:github.timeline] AS d
GROUP EACH BY actor, repository_name
ORDER BY actor, count DESC
LIMIT 100
Instead of using a LIMIT, you could go through a fraction of the user_ids at a time. In my case, a 1/3 fraction works (run it three times, comparing the hash against 0, 1, and 2, to cover everyone):
SELECT actor, repository_name, COUNT(*) AS count
FROM [githubarchive:github.timeline] AS d
WHERE ABS(HASH(actor) % 3) = 0
GROUP EACH BY actor, repository_name
But what you really want is "to get the artist ID that appears the most for each user_id". Let's go further, and get that:
SELECT actor, repository_name, count FROM (
  SELECT actor, repository_name, count,
         ROW_NUMBER() OVER (PARTITION BY actor ORDER BY count DESC) rank
  FROM (
    SELECT actor, repository_name, COUNT(*) AS count
    FROM [githubarchive:github.timeline] AS d
    WHERE ABS(HASH(actor) % 10) = 0
    GROUP EACH BY actor, repository_name
  )
)
WHERE rank = 1
Note that this time I used % 10, as it returns results faster. But you might be wondering: "I want to get my results with one query, not 10."
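Mapped back to the table from the question (user_id/artist instead of actor/repository_name; this assumes user_id hashes and groups the same way the GitHub sample does), the sketch would be:

SELECT user_id, artist, count FROM (
  SELECT user_id, artist, count,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY count DESC) rank
  FROM (
    SELECT user_id, artist, COUNT(*) AS count
    FROM [legacy20130831.merged_data]
    WHERE ABS(HASH(user_id) % 10) = 0
    GROUP EACH BY user_id, artist
  )
)
WHERE rank = 1

Run it ten times, comparing the hash against 0 through 9, to cover every user_id.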
There are two things you can do for that:
- Union the partitioned tables (in BigQuery, a comma in the FROM clause does a union, not a join).
- If you are still exceeding resources, you might need to materialize the table: run the original GROUP EACH BY query and save the result to a new table, then run the ROW_NUMBER() ranking step over that table instead of over an in-memory GROUP.
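For the first option, a sketch of the comma union. The per-partition table names here are hypothetical; you would create one per hash bucket by saving each partition's GROUP EACH BY result. Since each user_id lands in exactly one bucket, ranking after the union still works:

SELECT user_id, artist, count FROM (
  SELECT user_id, artist, count,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY count DESC) rank
  FROM [mydataset.counts_part0], [mydataset.counts_part1], [mydataset.counts_part2]
)
WHERE rank = 1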
If you are willing to share your dataset with me, I could provide dataset specific advice (a lot depends on cardinality).