Filtering data based on a condition in Redshift

Time: 2021-09-22 23:07:13

I came across one more issue while resolving the previous problem. I have this data:

For each route, I want to get only those rows where ob exists in rb. Hence, this output:

I know this also needs to be worked through a temp table. Earlier I was doing this as suggested by @smb:

select *
from table_name as a
inner join (select load, rb from table_name group by load, rb) as b
    on a.load = b.load
   and a.ob = b.rb

But this solution will give me:

And this is incorrect as it doesn’t take into account the route.

It’d be great if you guys could help :)

Thanks

2 Solutions

#1


2  

Updated to take route into account.

The answer is a join against a nested subquery. The concept is:

  1. Get a list of distinct combinations of route, ob, and rb
  2. Join to the original data where ob = ob, lane = rb, and route = route

Code as follows:

-- b holds one row per distinct (route, ob, rb) combination
select *
from table_name as a
inner join (
    select route, ob, rb
    from table_name
    group by route, ob, rb
) as b
    on a.ob = b.ob
   and a.lane = b.rb
   and a.route = b.route

I have done an example using a temp table here so you can see it in action.
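
A minimal sketch of such a temp-table example, written in Python with the standard-library sqlite3 module standing in for Redshift so it can be run anywhere. The sample rows and the helper name run_route_aware_join are hypothetical and are not the asker's data. The sketch selects only a's columns rather than select *, to avoid returning route, ob, and rb twice.

```python
import sqlite3

# Hypothetical sample rows as (route, ob, rb, lane); not the asker's data.
ROWS = [
    ("R1", "A", "X", "X"),
    ("R1", "A", "Y", "Z"),
    ("R1", "A", "X", "Y"),
    ("R2", "A", "Z", "X"),
    ("R2", "B", "W", "W"),
]

# Same join shape as the answer's query, restricted to a's columns.
QUERY = """
select a.route, a.ob, a.rb, a.lane
from table_name as a
inner join (
    select route, ob, rb
    from table_name
    group by route, ob, rb
) as b
    on a.ob = b.ob
   and a.lane = b.rb
   and a.route = b.route
order by a.route, a.ob, a.rb, a.lane
"""


def run_route_aware_join(rows):
    """Load rows into a temp table and return the rows kept by the join."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "create temp table table_name "
            "(route text, ob text, rb text, lane text)"
        )
        conn.executemany("insert into table_name values (?, ?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in run_route_aware_join(ROWS):
        print(row)
```

With this sample data, the row ("R2", "A", "Z", "X") is dropped even though X appears as an rb value on route R1, because the join also requires the route to match.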

Note that if your data is large, you should make sure your dist key is part of the join. This lets Redshift know that no rows need to be joined across different compute nodes, so it can execute multiple local joins and therefore be more efficient.

#2


1  

There are a few ways. An IN statement is simple but often slower on larger sets:

select *
from table_name
where lane in (select rb from table_name)

Or use EXISTS (I find EXISTS faster on larger sets, but try both):

select *
from table_name t_outer
where exists (select 'x' from table_name t_inner
              where t_inner.rb = t_outer.lane)

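Either way, in a database that supports indexes, create an index on the rb column for speed. Redshift itself has no conventional indexes; a sort key on rb is the closest equivalent there.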
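
The IN and EXISTS forms can be checked against each other on a small data set before timing them on real data. The following is a minimal sketch using Python's standard-library sqlite3 module as a stand-in for Redshift. The sample rows, the t_outer alias, and the helper name compare_in_and_exists are hypothetical.

```python
import sqlite3

# Hypothetical sample rows as (route, rb, lane); not the asker's data.
ROWS = [
    ("R1", "X", "X"),
    ("R1", "Y", "Z"),
    ("R2", "Z", "Y"),
    ("R2", "W", "Q"),
]

IN_QUERY = """
select route, rb, lane
from table_name
where lane in (select rb from table_name)
order by route, rb, lane
"""

EXISTS_QUERY = """
select route, rb, lane
from table_name t_outer
where exists (select 'x' from table_name t_inner
              where t_inner.rb = t_outer.lane)
order by route, rb, lane
"""


def compare_in_and_exists(rows):
    """Run both filters on the same temp table and return both results."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "create temp table table_name (route text, rb text, lane text)"
        )
        conn.executemany("insert into table_name values (?, ?, ?)", rows)
        in_rows = conn.execute(IN_QUERY).fetchall()
        exists_rows = conn.execute(EXISTS_QUERY).fetchall()
        return in_rows, exists_rows
    finally:
        conn.close()


if __name__ == "__main__":
    in_rows, exists_rows = compare_in_and_exists(ROWS)
    print(in_rows == exists_rows)
    for row in in_rows:
        print(row)
```

As written, both queries match lane against rb across the whole table rather than per route: in this sample, the R1 row with lane Z is kept because Z appears as an rb value on route R2.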

