I am trying to accomplish a task in Google's BigQuery that may require logic I am not sure SQL can handle natively.
I have 2 tables:
- The first table has a single column where each row is a single lowercase word
- The second table is a database of comments (with data such as who made the comment, the comment itself, the timestamp, etc.)
I want to sort the comments in the second table by the number of occurrences of the words in the first table.
Here is a basic example of what I want to do, in Python, using letters instead of words, but you get the idea:
words = ['a','b','c','d','e']
comments = ['this is the first sentence', 'this is another comment', 'look another sentence, which is also a comment', 'nope', 'no', 'run']
wordcount = {}
for comment in comments:
    for word in words:
        if word in comment:
            if comment in wordcount:
                wordcount[comment] += 1
            else:
                wordcount[comment] = 1
print(sorted(wordcount.items(), key = lambda k: k[1], reverse=True))
Output:
[('look another sentence, which is also a comment', 3), ('this is another comment', 3), ('this is the first sentence', 2), ('nope', 1)]
The best approach I have seen so far for generating an SQL query is something like the following:
SELECT
  COUNT(*)
FROM
  table
WHERE
  comment_col like '%word1%'
  OR comment_col like '%word2%'
  OR ...
But there are over 2000 words... it just doesn't feel right. Any tips?
2 solutions
#1
2
The following is for BigQuery Standard SQL:
#standardSQL
SELECT comment, COUNT(word) AS cnt
FROM comments
JOIN words
ON STRPOS(comment, word) > 0
GROUP BY comment
-- ORDER BY cnt DESC
Alternatively, you can use a regular expression if you wish:
#standardSQL
SELECT comment, COUNT(word) AS cnt
FROM comments
JOIN words
ON REGEXP_CONTAINS(comment, word)
GROUP BY comment
-- ORDER BY cnt DESC
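One caveat with the regular-expression variant: REGEXP_CONTAINS treats each word as a pattern, so a word containing characters such as . or + would be interpreted as regex syntax rather than literal text. The Python sketch below illustrates the same effect with the re module. BigQuery uses the RE2 library, so details may differ, but the principle is the same. The sample word 'a.c' is hypothetical and not from the question.

```python
import re

comment = "abc"
word = "a.c"  # hypothetical word containing a regex metacharacter

# Used directly as a pattern, "." matches any character, so "a.c" matches "abc".
matches_as_pattern = re.search(word, comment) is not None

# Escaped, the word is matched literally, like a plain substring test.
matches_as_literal = re.search(re.escape(word), comment) is not None

print(matches_as_pattern, matches_as_literal)  # True False
```

For a table of plain lowercase words this makes no difference, and the STRPOS version avoids the question entirely because it only ever does a literal substring test.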
You can test the queries above using the dummy data from your question:
#standardSQL
WITH words AS (
SELECT word
FROM UNNEST(['a','b','c','d','e']) word
),
comments AS (
SELECT comment
FROM UNNEST(['this is the first sentence', 'this is another comment', 'look another sentence, which is also a comment', 'nope', 'no', 'run']) comment
)
SELECT comment, COUNT(word) AS cnt
FROM comments
JOIN words
ON STRPOS(comment, word) > 0
GROUP BY comment
ORDER BY cnt DESC
Update for the follow-up question:
Do you have any quick suggestions for matching full strings only?
Wrap each word in \b (word boundary) anchors so that only whole words match:
#standardSQL
WITH words AS (
SELECT word
FROM UNNEST(['a','no','is','d','e']) word
),
comments AS (
SELECT comment
FROM UNNEST(['this is the first sentence', 'this is another comment', 'look another sentence, which is also a comment', 'nope', 'no', 'run']) comment
)
SELECT comment, COUNT(word) AS cnt
FROM comments
JOIN words
ON REGEXP_CONTAINS(comment, CONCAT(r'\b', word, r'\b'))
GROUP BY comment
ORDER BY cnt DESC
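For comparison, the whole-word version can be sketched in Python in the same style as the example in the question. This is only an illustration of what the query above computes, assuming Python's \b behaves like RE2's for these ASCII words. The function name whole_word_counts is hypothetical, and re.escape is an addition (not in the query) so that each word is matched literally.

```python
import re

words = ['a', 'no', 'is', 'd', 'e']
comments = [
    'this is the first sentence',
    'this is another comment',
    'look another sentence, which is also a comment',
    'nope',
    'no',
    'run',
]

def whole_word_counts(comments, words):
    """Count, per comment, how many of the words occur as whole words."""
    counts = {}
    for comment in comments:
        n = sum(
            1 for word in words
            if re.search(r'\b' + re.escape(word) + r'\b', comment)
        )
        if n > 0:  # mirrors the inner JOIN: comments with no match are dropped
            counts[comment] = n
    return counts

counts = whole_word_counts(comments, words)
for comment, cnt in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(cnt, comment)
```

With whole-word matching, 'nope' no longer matches the word 'no', while the comment 'no' does; 'run' still matches nothing.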
#2
1
If I understand correctly, I think you need a query like this:
select comment, count(*) cnt
from comments
join words
on comment like '% ' + word + ' %' --this checks for `... word ..`; a word between spaces
or comment like word + ' %' --this checks for `word ..`; a word at the start of comment
or comment like '% ' + word --this checks for `.. word`; a word at the end of comment
or comment = word --this checks for `word`; whole comment is the word
group by comment
order by count(*) desc
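SQL Server Fiddle Demo as a sample. Note that concatenating strings with + is SQL Server syntax; in BigQuery Standard SQL, use CONCAT(...) or the || operator instead.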
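The four LIKE conditions treat a word as a match only when it is delimited by spaces or by the start or end of the comment. A rough Python equivalent, assuming the words contain no spaces and no LIKE wildcards (% or _), is to split each comment on single spaces and test membership. The helper name space_delimited_count and the sample word list are hypothetical.

```python
def space_delimited_count(comment, words):
    """Number of words that appear in the comment as space-delimited tokens."""
    tokens = set(comment.split(' '))
    return sum(1 for word in words if word in tokens)

words = ['a', 'no', 'is', 'sentence']
print(space_delimited_count('look another sentence, which is also a comment', words))  # 2
print(space_delimited_count('this is the first sentence', words))  # 2
print(space_delimited_count('no', words))  # 1
```

The first call shows one consequence of this approach: 'sentence,' carries a trailing comma, so it does not count as the word 'sentence', whereas a \b word-boundary regex would match it.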