Joining posts with comments in the BigQuery Reddit dataset

How are the comments and posts tables related in the Reddit dataset available on BigQuery? It doesn't seem obvious.

2 answers

#1

The query below is for BigQuery Standard SQL:

#standardSQL
SELECT posts.title, comments.body
FROM `fh-bigquery.reddit_comments.2016_01` AS comments
JOIN `fh-bigquery.reddit_posts.2016_01`  AS posts
ON posts.id = SUBSTR(comments.link_id, 4) 
WHERE posts.id = '43go1r'

If you are still using BigQuery Legacy SQL, consider migrating to BigQuery Standard SQL.

Btw, performance-wise, the Standard SQL query took 2 sec vs. 18 sec in Legacy SQL.

#2

Using the advice from u/Infamous_Blue, we can join comments to their parent posts by using SUBSTR() on the column link_id and matching it with the post's id. For example, each comment will have a link_id looking something like t3_43go1r, so to match the post's id of 43go1r we must call SUBSTR(link_id, 4).
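
To see the prefix stripping in isolation, here is a minimal sketch in BigQuery Standard SQL that runs against two inline, made-up rows instead of the real tables. The `body` and `title` values are placeholders assumed for illustration; only the `t3_43go1r` / `43go1r` pair comes from the example above.

```sql
#standardSQL
-- Inline sample rows (hypothetical values) standing in for the real tables.
WITH comments AS (
  SELECT 't3_43go1r' AS link_id, 'example comment body' AS body
),
posts AS (
  SELECT '43go1r' AS id, 'example post title' AS title
)
SELECT posts.title, comments.body
FROM comments
JOIN posts
-- SUBSTR is 1-based, so starting at position 4 skips the 't3_' prefix.
ON posts.id = SUBSTR(comments.link_id, 4)
```

Because the sample rows are literals, this should scan no table data, which makes it a cheap way to check the join condition before running it on a full month of comments.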

Here is a complete query where we join the post's title with each comment's body:

select posts.title, comments.body --grab anything you like
from (select SUBSTR(link_id, 4) as lnk, body 
      from [fh-bigquery:reddit_comments.2016_01]) as comments
join [fh-bigquery:reddit_posts.2016_01]  as posts
on posts.id = comments.lnk
where posts.id = '43go1r'; --id of an arbitrary post

When run, this completed in 40.3 seconds and processed 11.9 GB.
