How are the comments and posts tables related in the Reddit dataset available on BigQuery? It doesn't seem obvious.
2 solutions
#1
2
The query below is for BigQuery Standard SQL:
#standardSQL
SELECT posts.title, comments.body
FROM `fh-bigquery.reddit_comments.2016_01` AS comments
JOIN `fh-bigquery.reddit_posts.2016_01` AS posts
ON posts.id = SUBSTR(comments.link_id, 4)
WHERE posts.id = '43go1r'
If you are still using BigQuery Legacy SQL, consider migrating to BigQuery Standard SQL.
Btw, performance-wise it took 2 sec in Standard SQL vs. 18 sec in Legacy SQL.
#2
1
Using the advice from u/Infamous_Blue, we can join comments to their parent posts by applying SUBSTR() to the link_id column and matching the result against the post's id. For example, each comment has a link_id that looks something like t3_43go1r, so to match the post's id of 43go1r we must call SUBSTR(link_id, 4).
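To see the prefix stripping in isolation, here is a minimal sketch that runs the same join locally. It assumes SQLite (through Python's sqlite3 module) as a stand-in for BigQuery, which is reasonable here only because SUBSTR is 1-based in both. The table contents, including the second post id 'aaaaaa', are invented for illustration and are not taken from the real dataset.

```python
import sqlite3

# Toy stand-ins for the two BigQuery tables; every row below is made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id TEXT, title TEXT);
    CREATE TABLE comments (link_id TEXT, body TEXT);
    INSERT INTO posts VALUES
        ('43go1r', 'Example post'),
        ('aaaaaa', 'Other post');
    INSERT INTO comments VALUES
        ('t3_43go1r', 'first comment'),
        ('t3_43go1r', 'second comment'),
        ('t3_aaaaaa', 'unrelated comment');
""")

# SUBSTR is 1-based, so position 4 is the first character after "t3_".
stripped = conn.execute("SELECT SUBSTR('t3_43go1r', 4)").fetchone()[0]

# Same join condition as in the answers: post id = link_id without its prefix.
rows = conn.execute("""
    SELECT posts.title, comments.body
    FROM comments
    JOIN posts ON posts.id = SUBSTR(comments.link_id, 4)
    WHERE posts.id = '43go1r'
    ORDER BY comments.body
""").fetchall()

print(stripped)
for title, body in rows:
    print(title, "|", body)
```

Only the two comments whose link_id ends in 43go1r come back, each paired with the title of that post.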
Here is a complete query that joins the post's title with each comment's body:
select posts.title, comments.body --grab any columns you like
from (select SUBSTR(link_id, 4) as lnk, body --strip the "t3_" prefix
      from [fh-bigquery:reddit_comments.2016_01]) as comments
join [fh-bigquery:reddit_posts.2016_01] as posts
on posts.id = comments.lnk
where posts.id = '43go1r'; --id of one example post
This completed in 40.3 seconds and processed 11.9 GB when run.