Given a Google BigQuery table with columns col_1 ... col_m, how can I use BigQuery SQL to return the rows with no duplicates in a chosen subset of columns, say [col1, col3, col7]? When several rows share the same values in [col1, col3, col7], only the first of those rows should be returned, and the remaining rows with the same values in those columns should be removed.
Example: removeDuplicates([col1, col3])
     col1  col2  col3
     ----  ----  ----
r1:  20    25    30
r2:  20    70    30
r3:  40    70    30
returns
     col1  col2  col3
     ----  ----  ----
r1:  20    25    30
r3:  40    70    30
This is easy to do with Python pandas: for a DataFrame (i.e. a table), you call drop_duplicates(subset=[field1, field2, ...]). However, there is no equivalent removeDuplicates function in Google BigQuery SQL.
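For reference, the pandas behaviour described above can be sketched as follows. This is a minimal illustration built on the example table from the question; the variable names df and deduped are made up for the sketch, and keep="first" is spelled out even though it is the pandas default.

```python
import pandas as pd

# The example table from the question, with r1..r3 as the row labels.
df = pd.DataFrame(
    {"col1": [20, 20, 40], "col2": [25, 70, 70], "col3": [30, 30, 30]},
    index=["r1", "r2", "r3"],
)

# Keep the first row of every (col1, col3) combination and drop the rest.
deduped = df.drop_duplicates(subset=["col1", "col3"], keep="first")

print(deduped)
```

Rows r1 and r2 share (col1, col3) = (20, 30), so only r1 survives along with r3, which matches the expected result shown in the question.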
My best guess at how to do this in Google BigQuery is to use the rank() function:
https://cloud.google.com/bigquery/query-reference#rank
I am looking for a concise solution if one exists.
1 Answer
#1
5
You can group by all of the columns that you want to deduplicate on, and use FIRST() for the others. That is, removeDuplicates([col1, col3]) would translate to:
SELECT col1, FIRST(col2) as col2, col3
FROM table
GROUP EACH BY col1, col3
Note that in BigQuery SQL, if you have more than a million distinct values for col1 and col3, you'll need the EACH keyword.
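FIRST() and GROUP EACH BY belong to BigQuery's legacy SQL dialect. The window-function route hinted at in the question can be sketched with ROW_NUMBER() rather than rank(), because rank() gives tied rows the same number while ROW_NUMBER() always numbers them 1, 2, 3, ... within each group. The sketch below is not BigQuery code: it runs the query against an in-memory SQLite database as a stand-in so that it can be executed locally. The table name t and the ordering column row_id are assumptions made for the illustration; "first row" only has a meaning once some column defines the order.

```python
import sqlite3

# SQLite stands in for BigQuery here (assumption); window functions
# require SQLite 3.25 or newer.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (row_id INTEGER, col1 INTEGER, col2 INTEGER, col3 INTEGER)"
)
# The example table from the question; row_id is a hypothetical column
# that records the original row order (r1, r2, r3).
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [(1, 20, 25, 30), (2, 20, 70, 30), (3, 40, 70, 30)],
)

DEDUP_SQL = """
SELECT col1, col2, col3
FROM (
  SELECT col1, col2, col3, row_id,
         -- number the rows inside each (col1, col3) group by row_id
         ROW_NUMBER() OVER (PARTITION BY col1, col3 ORDER BY row_id) AS rn
  FROM t
) AS numbered
WHERE rn = 1          -- keep only the first row of each group
ORDER BY row_id
"""

deduped = conn.execute(DEDUP_SQL).fetchall()
print(deduped)  # [(20, 25, 30), (40, 70, 30)]
```

Unlike the GROUP BY approach, this keeps whole rows intact, so every non-key column comes from the same source row without listing each one inside an aggregate.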