I have a table, VISIT_INFO, with these columns:
pers_key - unique identifier for each person
pers_name - name of person
visit_date - date at which they visited a business
And another table, VALID_DATES, with these columns:
condition - string
start_date - date
end_date - date
I currently have the following query:
select pers_key, pers_name from VISIT_INFO a
CROSS JOIN
(select start_date, end_date from VALID_DATES where condition = 'condition1') b
WHERE (a.visit_date >= b.start_date and a.visit_date <= b.end_date)
GROUP BY a.pers_key, a.pers_name
So 'condition1' has a specific start_date and end_date. I need to filter VISIT_INFO for visits that are between the two dates. I'm wondering if there is a more efficient way to do this. From my current understanding, it currently has to go through the entire table (millions of rows) and add start_date and end_date to each row. Then does it have to go through each row again and test against the WHERE condition?
I ask this because when I remove the cross join and hardcode the start_date and end_date for condition1, it takes substantially less time. I'm trying to avoid hardcoding in the dates because it will lead to serious tedium down the road.
So to reiterate, is there a better way to filter VISIT_INFO by specific dates in VALID_DATES?
Edit: I just realized I left out a pretty important piece of information: this is all in Hive. So EXISTS and joins on (a between b and c) are out of the question.
3 Answers
#1
1
How about:
SELECT DISTINCT pers_key, pers_name
FROM visit_info
WHERE EXISTS
(
SELECT 1
FROM valid_dates
WHERE condition = 'condition1'
AND visit_date BETWEEN start_date AND end_date
);
?
#2
0
with dt as (select start_date, end_date from VALID_DATES where condition = 'condition1')
select a.pers_key, a.pers_name
from VISIT_INFO a
JOIN dt on a.visit_date between dt.start_date and dt.end_date
GROUP BY a.pers_key, a.pers_name
#3
0
Trying the exists version is definitely a possibility. However, you might be better off expanding the VALID_DATES table, so there is one row per date.
Then, the query:
select vi.*
from VISIT_INFO vi JOIN
VALID_DATES_expanded vde
ON vi.visit_date = vde.valid_date
where vde.condition = 'condition1';
can make use of an index on VISIT_INFO(visit_date) and on VALID_DATES_expanded(condition, valid_date). This is likely to be the fastest approach to solving this problem if VISIT_INFO is very large and relatively few rows are being selected by the query.
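Since the question is about Hive, here is a rough sketch of how the one-row-per-date table might be built there. The table and column names (VALID_DATES_expanded, valid_date) follow the ones used in this answer. The space/split/posexplode trick for generating the day offsets is my assumption about how to do it, not something from the original post, and it assumes start_date <= end_date on every row and a Hive version that has posexplode.

```sql
-- Sketch only: expand each (start_date, end_date) range into one row per date.
-- space(n) returns n spaces and splitting on ' ' yields n + 1 elements,
-- so pos runs from 0 to n, where n = datediff(end_date, start_date).
-- The CAST keeps valid_date a DATE on Hive versions where date_add returns a string.
CREATE TABLE VALID_DATES_expanded AS
SELECT
    vd.condition,
    CAST(date_add(vd.start_date, pe.pos) AS DATE) AS valid_date
FROM VALID_DATES vd
LATERAL VIEW
    posexplode(split(space(datediff(vd.end_date, vd.start_date)), ' ')) pe AS pos, val;
```

With this table in place the filter becomes a plain equality join, which Hive accepts. If one condition can have overlapping date ranges, the same valid_date will appear more than once and will duplicate visit rows, so a SELECT DISTINCT in the CREATE TABLE may be needed. Because the expanded table is tiny compared with VISIT_INFO, Hive can usually treat it as a map-side join (see the hive.auto.convert.join setting), which avoids shuffling the large table.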