Filtering a large table by date

Time: 2021-09-28 22:21:07

I have a table, VISIT_INFO, with these columns:

pers_key - unique identifier for each person
pers_name - name of person
visit_date - date at which they visited a business

And another table, VALID_DATES, with these columns:

condition - string
start_date - date
end_date - date 

I currently have the following query:

select pers_key, pers_name from VISIT_INFO a
CROSS JOIN
(select start_date, end_date from VALID_DATES where condition = 'condition1') b
WHERE (a.visit_date >= b.start_date and a.visit_date <= b.end_date)
GROUP BY a.pers_key, a.pers_name

So 'condition1' has a specific start_date and end_date. I need to filter VISIT_INFO for visits that are between the two dates. I'm wondering if there is a more efficient way to do this. From my current understanding, it currently has to go through the entire table (millions of rows) and add start_date and end_date to each row. Then does it have to go through each row again and test against the WHERE condition?

I ask this because when I remove the cross join and hardcode the start_date and end_date for condition1, it takes substantially less time. I'm trying to avoid hardcoding in the dates because it will lead to serious tedium down the road.

So to reiterate, is there a better way to filter VISIT_INFO by specific dates in VALID_DATES?

Edit: I just realized I left out a crucial piece of information: this is all in Hive. So EXISTS and joins on (a BETWEEN b AND c) are out of the question.

3 solutions

#1


1  

How about:

SELECT DISTINCT pers_key, pers_name
FROM visit_info
WHERE EXISTS
(
    SELECT 1
    FROM valid_dates
    WHERE condition = 'condition1'
    AND visit_date BETWEEN start_date AND end_date
);

?

#2


0  

with dt as (select start_date, end_date from VALID_DATES where condition = 'condition1')
select a.pers_key, a.pers_name 
from VISIT_INFO a
JOIN dt on a.visit_date between dt.start_date and dt.end_date
GROUP BY a.pers_key, a.pers_name

#3


0  

Trying the EXISTS version is definitely a possibility. However, you might be better off expanding the VALID_DATES table so that there is one row per date.

Then, the query:

select vi.*
from VISIT_INFO vi JOIN
     VALID_DATES_expanded vde
     ON vi.visit_date = vde.valid_date
where vde.condition = 'condition1';

can make use of an index on VISIT_INFO(visit_date) and on VALID_DATES_expanded(condition, valid_date). This is likely to be the fastest approach to solving this problem, if VISIT_INFO is very large and relatively few rows are being selected by the query.
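
The answer does not show how VALID_DATES_expanded would be built. The following is a minimal sketch of one way it might be done in Hive, not the answerer's own method. It assumes that start_date and end_date are DATE values (or 'yyyy-MM-dd' strings), that end_date is never earlier than start_date, and that the table has the layout (condition, valid_date) implied by the query above. The aliases d, pos and pad are made up for illustration. Because the join is then a plain equality, it is the kind of condition Hive accepts in ON; the second statement uses LEFT SEMI JOIN, Hive's counterpart to EXISTS, so a visit is returned once even if two ranges for the same condition overlap.

```sql
-- Assumed layout: VALID_DATES_expanded(condition, valid_date),
-- one row per valid day for each condition.
CREATE TABLE VALID_DATES_expanded AS
SELECT v.condition,
       date_add(v.start_date, d.pos) AS valid_date  -- start_date plus 0..n days
FROM VALID_DATES v
LATERAL VIEW posexplode(
    -- space(n) is a string of n spaces; splitting it on ' ' yields n + 1
    -- empty strings, so pos runs from 0 to n = datediff(end_date, start_date).
    split(space(datediff(v.end_date, v.start_date)), ' ')
) d AS pos, pad;

-- Equality join against the expanded table. With LEFT SEMI JOIN the
-- right-hand table may only be referenced inside the ON clause.
SELECT vi.pers_key, vi.pers_name, vi.visit_date
FROM VISIT_INFO vi
LEFT SEMI JOIN VALID_DATES_expanded vde
  ON (vi.visit_date = vde.valid_date AND vde.condition = 'condition1');
```

On some Hive versions date_add returns a string rather than a DATE, so a cast on valid_date may be needed for the comparison with visit_date. Since the expanded table holds only one row per valid day, it is likely small enough for Hive to broadcast it as a map-side join, which avoids shuffling VISIT_INFO.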
