I'm using SAS University Edition to analyze the following table (the real one has 2.5 million rows):
p_id c_id startyear endyear
0001 3201 2008 2013
0001 2131 2013 2015
0013 3201 2006 2010
where p_id is the person ID and c_id is the company ID.
I want to get the number of colleagues (the number of people who worked at the same companies during an overlapping span) in a certain year, so I created a table with the distinct p_ids and ran the following query:
PROC SQL;
    UPDATE no_colleagues AS t1
    SET c2007 = (
        SELECT COUNT(DISTINCT t2.p_id) - 1
        FROM table AS t2
        INNER JOIN table AS t3
            ON t3.p_id = t1.p_id
            AND t3.c_id = t2.c_id
            AND t3.startyear <= t2.endyear   /* checks overlapping criteria */
            AND t3.endyear >= t2.startyear   /* checks overlapping criteria */
            AND t3.startyear <= 2007         /* limits number of returns */
            AND t2.startyear <= 2007         /* limits number of returns */
    );
QUIT;
A single indexed lookup on (p_id, c_id, startyear, endyear) takes 0.04 seconds. The query above takes about 1.8 seconds for a single update and does not use any indexes.
So my question is:
How can I improve the query, and/or how can I set up indexes so that the self join actually uses them?
Thanks in advance.
2 Solutions
#1
1
Based on your data, I'd do something like this, though you may need to tweak the code to fit your needs.
First, create a table with p_id, c_id, and year. Your first guy working at company 3201 will then have 6 observations in this table, one for each year worked.
data have_count;
    set have;
    /* write one row per year worked */
    do i = startyear to endyear;
        worked_in = i;
        output;
    end;
    drop i startyear endyear;
run;
Now you just count and aggregate:
proc sql;
    select
        worked_in as year
        ,c_id
        ,count(distinct p_id) as no_colleagues
    from have_count
    group by 1, 2;
quit;
Result:
year c_id no_colleagues
2006 3201 1
2007 3201 1
2008 3201 2
2009 3201 2
2010 3201 2
2011 3201 1
2012 3201 1
2013 2131 1
2013 3201 1
2014 2131 1
2015 2131 1
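The result above counts people per company and year. To turn that into a colleague count per person, one option is to join the expanded table to itself on company and year. Here is a minimal sketch, written in Python with the standard sqlite3 module only because that is easy to run anywhere; I'd expect the SQL itself to carry over to PROC SQL with little change, but I have not tested that. The names have_count and worked_in come from the code above, the function name colleagues_per_person is made up for this example, and the IDs are stored as text to keep the leading zeros.

```python
import sqlite3

# Sample rows from the question: (p_id, c_id, startyear, endyear).
ROWS = [("0001", "3201", 2008, 2013),
        ("0001", "2131", 2013, 2015),
        ("0013", "3201", 2006, 2010)]

def colleagues_per_person(rows):
    """Return {(p_id, year): number of distinct colleagues that year}."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE have_count (p_id TEXT, c_id TEXT, worked_in INT)")
    # Same expansion as the data step: one row per person, company and year.
    for p_id, c_id, start, end in rows:
        con.executemany("INSERT INTO have_count VALUES (?, ?, ?)",
                        [(p_id, c_id, year) for year in range(start, end + 1)])
    # Self-join on company and year. The LEFT JOIN keeps people who had no
    # colleagues; COUNT(DISTINCT ...) ignores the resulting NULLs and gives 0.
    query = """
        SELECT a.p_id, a.worked_in, COUNT(DISTINCT b.p_id)
        FROM have_count AS a
        LEFT JOIN have_count AS b
               ON b.c_id = a.c_id
              AND b.worked_in = a.worked_in
              AND b.p_id <> a.p_id
        GROUP BY a.p_id, a.worked_in
    """
    result = {(p_id, year): n for p_id, year, n in con.execute(query)}
    con.close()
    return result

counts = colleagues_per_person(ROWS)
print(counts[("0001", 2008)])  # person 0001 shares company 3201 with 0013
print(counts[("0013", 2007)])  # 0013 is alone at 3201 in 2007
```

On 2.5M rows the expanded table gets large, so an index on (c_id, worked_in) is probably worth trying before running the self-join.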
#2
0
A more efficient method:
1) Create a long-format table for the results rather than a wide-format one. This will be both easier to populate and easier to work with later.
create table colleagues_by_year (
    p_id int,
    year int,
    colleagues int
);
Now this can be populated with a single insert statement. The only trick is getting the full list of years you want in the final table. There are a few options, but since I'm not too familiar with SAS SQL I'm going to go with a very simple one: a lookup table of years, to which you can join.
create table years (
    year int
);
insert into years
    values (2007),(2008),...
(A more sophisticated approach would be a recursive query that finds the range of all years in the input data.)
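To make the recursive idea concrete, here is a rough sketch that fills the years table from the smallest startyear to the largest endyear. It runs in SQLite through Python's standard sqlite3 module, which I picked only because it is easy to try out; whether the same WITH RECURSIVE syntax is available in your SAS setup is something I have not checked, and a loop like the DO loop in the first answer may be the simpler route there. The CTE name span and its column last are made up for this example; colleagues is the input table name used in the final insert.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE colleagues "
            "(p_id TEXT, c_id TEXT, startyear INT, endyear INT)")
con.executemany("INSERT INTO colleagues VALUES (?, ?, ?, ?)",
                [("0001", "3201", 2008, 2013),
                 ("0001", "2131", 2013, 2015),
                 ("0013", "3201", 2006, 2010)])
con.execute("CREATE TABLE years (year INT)")

# Start at the earliest startyear and add one year per recursion step
# until the latest endyear is reached.
con.execute("""
    WITH RECURSIVE span(year, last) AS (
        SELECT MIN(startyear), MAX(endyear) FROM colleagues
        UNION ALL
        SELECT year + 1, last FROM span WHERE year < last
    )
    INSERT INTO years SELECT year FROM span
""")

years = [year for (year,) in con.execute("SELECT year FROM years ORDER BY year")]
print(years)
```

With the three sample rows this should give every year from 2006 through 2015.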
Now the final insert:
insert into colleagues_by_year
select p_id,
       year,
       count(*)
from colleagues
join years on
    years.year between colleagues.startyear and colleagues.endyear
group by p_id, year;
This won't have any rows where the number of colleagues for the year would be 0. If you want those rows, you could use a left join against years and count only the rows where years.year is not null.
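One thing to watch: as written, the insert joins colleagues only to years, so count(*) counts a person's own rows for each year. To count other people at the same company in that year, the table has to be joined to itself as well. Below is a sketch of that variant, again in SQLite via Python's standard sqlite3 module rather than SAS, so treat it as an illustration of the join logic and not as tested PROC SQL. The aliases me and other are made up, the IDs are stored as text to keep the leading zeros, and the years table holds just three sample years.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE colleagues (p_id TEXT, c_id TEXT, startyear INT, endyear INT);
    INSERT INTO colleagues VALUES
        ('0001', '3201', 2008, 2013),
        ('0001', '2131', 2013, 2015),
        ('0013', '3201', 2006, 2010);
    CREATE TABLE years (year INT);
    INSERT INTO years VALUES (2007), (2008), (2011);
    CREATE TABLE colleagues_by_year (p_id TEXT, year INT, colleagues INT);
""")

# "me" supplies the person and the years they worked; "other" supplies
# different people at the same company in the same year. The LEFT JOIN
# keeps person-years without a match, and COUNT(DISTINCT ...) skips the
# NULLs, so those rows get a count of 0.
con.execute("""
    INSERT INTO colleagues_by_year
    SELECT me.p_id, years.year, COUNT(DISTINCT other.p_id)
    FROM colleagues AS me
    JOIN years
      ON years.year BETWEEN me.startyear AND me.endyear
    LEFT JOIN colleagues AS other
      ON other.c_id = me.c_id
     AND other.p_id <> me.p_id
     AND years.year BETWEEN other.startyear AND other.endyear
    GROUP BY me.p_id, years.year
""")

result = con.execute(
    "SELECT p_id, year, colleagues FROM colleagues_by_year ORDER BY p_id, year"
).fetchall()
for row in result:
    print(row)
```

This yields a row with 0 for anyone who was employed in a given year but had no colleagues. People who were not employed at all in that year still get no row; covering them would need a cross join of all p_ids with years as the starting point.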