Merging similar pairs in a Redshift table

Date: 2021-09-18 23:07:02

I have my data in Amazon Redshift which looks as follows:

   Q1          Q2        Occ           Prob               Q1ID    Q2ID
fe349344    f821b6e1    1280    6.62226553857608E-7      AC.122  AC.124
f821b6e1    fe349344    1127    5.830697860918158E-7     AC.124  AC.122
fe349344    fb13cd0e    967     5.002914668596148E-7     AC.122  AC.124
1208bf29    02174133    945     4.889094479651871E-7     AC.831  AC.356

As we can see in the first 2 rows, the pair values for Q1 and Q2 are the same, only swapped. For my case, I don't care whether a pair appears as Q1-Q2 or Q2-Q1, so I would like to keep only 1 row for each such pair of rows. I'm having a hard time coming up with language to describe this, so here's the result that I want:

   Q1          Q2        Occ           Prob               Q1ID    Q2ID
fe349344    f821b6e1    2407    1.245296339949424E-6     AC.122  AC.124
fe349344    fb13cd0e    967     5.002914668596148E-7     AC.122  AC.124
1208bf29    02174133    945     4.889094479651871E-7     AC.831  AC.356

Here, I have collapsed rows 1 and 2 into a single row and summed their values for the Occ and Prob columns.

My question is: How do I achieve this using a query? I believe it requires a self join, but I'm not sure how to write one to achieve this task.

Any help would be much appreciated.

TIA.

2 answers

#1


1  

You can use least and greatest (as you don't care if the pair appears as q1-q2 or q2-q1) to get one row per symmetric pair (if it exists) and sum the other columns.

select least(q1, q2) as q1, greatest(q1, q2) as q2,
       sum(occ) as occ, sum(prob) as prob,
       least(q1id, q2id) as q1id, greatest(q1id, q2id) as q2id
from t
group by least(q1, q2), greatest(q1, q2),
         least(q1id, q2id), greatest(q1id, q2id);
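
One way to check this query against the sample rows from the question is to run it in an in-memory SQLite database from Python. This is only a sketch and not Redshift itself: SQLite has no least/greatest, so two-argument stand-ins are registered with create_function. The helper name merge_pairs, the occ/prob aliases and the ORDER BY are assumptions added for the demo.

```python
import sqlite3

# Sample rows copied from the question: (q1, q2, occ, prob, q1id, q2id).
ROWS = [
    ("fe349344", "f821b6e1", 1280, 6.62226553857608e-7, "AC.122", "AC.124"),
    ("f821b6e1", "fe349344", 1127, 5.830697860918158e-7, "AC.124", "AC.122"),
    ("fe349344", "fb13cd0e", 967, 5.002914668596148e-7, "AC.122", "AC.124"),
    ("1208bf29", "02174133", 945, 4.889094479651871e-7, "AC.831", "AC.356"),
]

# The least/greatest query; aliases on the sums and the ORDER BY are
# additions that make the demo output deterministic.
MERGE_SQL = """
select least(q1, q2) as q1, greatest(q1, q2) as q2,
       sum(occ) as occ, sum(prob) as prob,
       least(q1id, q2id) as q1id, greatest(q1id, q2id) as q2id
from t
group by least(q1, q2), greatest(q1, q2),
         least(q1id, q2id), greatest(q1id, q2id)
order by 3 desc
"""


def merge_pairs(rows=ROWS):
    """Load the rows into SQLite and return one summed row per unordered pair."""
    conn = sqlite3.connect(":memory:")
    try:
        # SQLite has no least()/greatest(), so register two-argument stand-ins.
        conn.create_function("least", 2, min)
        conn.create_function("greatest", 2, max)
        conn.execute(
            "create table t (q1 text, q2 text, occ integer, prob real,"
            " q1id text, q2id text)"
        )
        conn.executemany("insert into t values (?, ?, ?, ?, ?, ?)", rows)
        return conn.execute(MERGE_SQL).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in merge_pairs():
        print(row)
```

On the four sample rows this returns three rows, with Occ for the fe349344/f821b6e1 pair summed to 2407. Because least picks the smaller string, that pair comes out as f821b6e1, fe349344, the reverse of the order shown in the question's desired output.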

If q1id and q2id belong to the values in q1 and q2 respectively, and each ID should stay in the same column as its value after the swap, use

select least(q1, q2) as q1, greatest(q1, q2) as q2,
       sum(occ) as occ, sum(prob) as prob,
       case when least(q1, q2) = q1 then q1id else q2id end as q1id,
       case when greatest(q1, q2) = q2 then q2id else q1id end as q2id
from t
group by least(q1, q2), greatest(q1, q2),
         case when least(q1, q2) = q1 then q1id else q2id end,
         case when greatest(q1, q2) = q2 then q2id else q1id end;
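
To see how the CASE version differs, the same kind of check can be run on the question's sample rows. Again this is a sketch using Python's sqlite3 as a stand-in for Redshift, with least/greatest registered as two-argument helpers. The name merge_pairs_keep_ids, the occ/prob aliases and the ORDER BY are assumptions for the demo.

```python
import sqlite3

# Sample rows copied from the question: (q1, q2, occ, prob, q1id, q2id).
ROWS = [
    ("fe349344", "f821b6e1", 1280, 6.62226553857608e-7, "AC.122", "AC.124"),
    ("f821b6e1", "fe349344", 1127, 5.830697860918158e-7, "AC.124", "AC.122"),
    ("fe349344", "fb13cd0e", 967, 5.002914668596148e-7, "AC.122", "AC.124"),
    ("1208bf29", "02174133", 945, 4.889094479651871e-7, "AC.831", "AC.356"),
]

# CASE-based query: each ID moves together with the value it belongs to.
KEEP_IDS_SQL = """
select least(q1, q2) as q1, greatest(q1, q2) as q2,
       sum(occ) as occ, sum(prob) as prob,
       case when least(q1, q2) = q1 then q1id else q2id end as q1id,
       case when greatest(q1, q2) = q2 then q2id else q1id end as q2id
from t
group by least(q1, q2), greatest(q1, q2),
         case when least(q1, q2) = q1 then q1id else q2id end,
         case when greatest(q1, q2) = q2 then q2id else q1id end
order by 3 desc
"""


def merge_pairs_keep_ids(rows=ROWS):
    """Return one summed row per unordered pair, with IDs following their values."""
    conn = sqlite3.connect(":memory:")
    try:
        # SQLite has no least()/greatest(), so register two-argument stand-ins.
        conn.create_function("least", 2, min)
        conn.create_function("greatest", 2, max)
        conn.execute(
            "create table t (q1 text, q2 text, occ integer, prob real,"
            " q1id text, q2id text)"
        )
        conn.executemany("insert into t values (?, ?, ?, ?, ?, ?)", rows)
        return conn.execute(KEEP_IDS_SQL).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in merge_pairs_keep_ids():
        print(row)
```

Here each ID follows its value: f821b6e1 keeps AC.124 and fe349344 keeps AC.122. Applying least/greatest to the ID columns instead sorts the IDs independently of the values, which for this pair would put AC.122 next to f821b6e1.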

#2


0  

Here is one method:

select t.*
from t
where t.q1 < t.q2
union all
select t.*
from t
where t.q1 > t.q2 and
      not exists (select 1 from t t2 where t2.q1 = t.q2 and t2.q2 = t.q1);
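
A quick way to see what this UNION ALL form returns is to run it on the question's sample rows. The sketch below uses Python's sqlite3 as a stand-in for Redshift and assumes the correlated subquery refers to the outer table by its alias t. The helper name keep_one_per_pair and the ORDER BY are additions for the demo.

```python
import sqlite3

# Sample rows copied from the question: (q1, q2, occ, prob, q1id, q2id).
ROWS = [
    ("fe349344", "f821b6e1", 1280, 6.62226553857608e-7, "AC.122", "AC.124"),
    ("f821b6e1", "fe349344", 1127, 5.830697860918158e-7, "AC.124", "AC.122"),
    ("fe349344", "fb13cd0e", 967, 5.002914668596148e-7, "AC.122", "AC.124"),
    ("1208bf29", "02174133", 945, 4.889094479651871e-7, "AC.831", "AC.356"),
]

# Keep every q1 < q2 row, plus q1 > q2 rows that have no mirrored partner.
DEDUPE_SQL = """
select t.*
from t
where t.q1 < t.q2
union all
select t.*
from t
where t.q1 > t.q2 and
      not exists (select 1 from t t2 where t2.q1 = t.q2 and t2.q2 = t.q1)
order by 3 desc
"""


def keep_one_per_pair(rows=ROWS):
    """Return one original row per unordered (q1, q2) pair, without summing."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "create table t (q1 text, q2 text, occ integer, prob real,"
            " q1id text, q2id text)"
        )
        conn.executemany("insert into t values (?, ?, ?, ?, ?, ?)", rows)
        return conn.execute(DEDUPE_SQL).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in keep_one_per_pair():
        print(row)
```

Note that this keeps one existing row per pair rather than adding the rows together: for the fe349344/f821b6e1 pair it returns the q1 < q2 row with Occ 1127, not the summed 2407 the question asks for.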

In Redshift, I might be more inclined to do this with window functions:

select t.*
from (select t.*,
             count(*) over (partition by least(q1, q2), greatest(q1, q2)) as cnt
      from t
      ) t
where q1 < q2 or
      (q1 > q2 and cnt = 1);

Note: This assumes that there are no duplicate rows for values of q1, q2.
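
The window-function idea can be tried on the question's sample rows as well. This sketch assumes the intent is to keep every q1 < q2 row plus any q1 > q2 row whose pair occurs only once (cnt = 1). It uses Python's sqlite3 in place of Redshift, which needs SQLite 3.25 or later for window functions, and registers least/greatest as two-argument helpers. The name keep_one_per_pair_window and the ORDER BY are assumptions for the demo.

```python
import sqlite3

# Sample rows copied from the question: (q1, q2, occ, prob, q1id, q2id).
ROWS = [
    ("fe349344", "f821b6e1", 1280, 6.62226553857608e-7, "AC.122", "AC.124"),
    ("f821b6e1", "fe349344", 1127, 5.830697860918158e-7, "AC.124", "AC.122"),
    ("fe349344", "fb13cd0e", 967, 5.002914668596148e-7, "AC.122", "AC.124"),
    ("1208bf29", "02174133", 945, 4.889094479651871e-7, "AC.831", "AC.356"),
]

# cnt counts how many rows share the same unordered (q1, q2) pair.
WINDOW_SQL = """
select t.*
from (select t.*,
             count(*) over (partition by least(q1, q2), greatest(q1, q2)) as cnt
      from t
      ) t
where q1 < q2 or
      (q1 > q2 and cnt = 1)
order by occ desc
"""


def keep_one_per_pair_window(rows=ROWS):
    """Return one original row per unordered pair, with the pair count appended."""
    conn = sqlite3.connect(":memory:")
    try:
        # SQLite has no least()/greatest(), so register two-argument stand-ins.
        conn.create_function("least", 2, min)
        conn.create_function("greatest", 2, max)
        conn.execute(
            "create table t (q1 text, q2 text, occ integer, prob real,"
            " q1id text, q2id text)"
        )
        conn.executemany("insert into t values (?, ?, ?, ?, ?, ?)", rows)
        return conn.execute(WINDOW_SQL).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in keep_one_per_pair_window():
        print(row)
```

Each returned row carries the extra cnt column at the end. As with the UNION ALL form, the rows are kept as they are and not summed, and the no-duplicates assumption matters: a pair that appears twice in the same orientation would also have cnt = 2.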

(More inclined means: I'm scared of doing a correlated subquery in Redshift. It supports them grammatically but I don't know how well they work.)
