How to index a self-join

I'm using SAS University Edition to analyze the following table (the real table has 2.5M rows):

p_id c_id startyear endyear
0001 3201      2008    2013
0001 2131      2013    2015
0013 3201      2006    2010

where p_id is person_id and c_id is companyid.

I want to get the number of colleagues (the number of people who worked at the same company during an overlapping span) in a certain year, so I created a table with the distinct p_ids and ran the following query:

PROC SQL;

UPDATE no_colleagues AS t1
SET c2007 = (
    SELECT COUNT(DISTINCT t2.p_id) - 1
    FROM table AS t2
    INNER JOIN table AS t3
    ON  t3.p_id = t1.p_id
    AND t3.c_id = t2.c_id
    AND t3.startyear <= t2.endyear   /* checks overlapping criteria */
    AND t3.endyear >= t2.startyear   /* checks overlapping criteria */
    AND t3.startyear <= 2007         /* limits number of returns */
    AND t2.startyear <= 2007         /* limits number of returns */
);

QUIT;

A single indexed lookup on (p_id, c_id, startyear, endyear) takes 0.04 seconds. The query above takes about 1.8 seconds for a single update and does not use any indexes.

So my question is:

How can I improve the query, and/or how can I set up indexes so that the self-join can use them?

Thanks in advance.

2 Solutions

#1

Based on your data, I'd do something like this, but maybe you need to tweak the code to fit your needs.

First, create a table with p_id, c_id and the year (called worked_in in the code). The first person, who worked at company 3201 from 2008 to 2013, will then have 6 observations in this table, one for each year worked.

data have_count;
    set have;

    do i = startyear to endyear;
        worked_in = i;
        output;
    end;

    drop i startyear endyear;
run;

Now you just count and aggregate:

proc sql;
select
    worked_in as year
    ,c_id
    ,count(distinct p_id) as no_colleagues
from have_count
group by 1,2;
quit;

Result:

year c_id no_colleagues 
2006 3201 1 
2007 3201 1 
2008 3201 2 
2009 3201 2 
2010 3201 2 
2011 3201 1 
2012 3201 1 
2013 2131 1 
2013 3201 1 
2014 2131 1 
2015 2131 1
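
Note that this result is the headcount per company and year, so it includes the person themselves. To get colleagues per person, as the question asks, one option is to join have_count to itself on company and year. The following is only a sketch under assumptions: the output name colleagues_per_person and the index name c_year are made up here, and whether SAS actually uses the index for the join is up to the optimizer (MSGLEVEL=I asks it to report index usage in the log).

```sas
options msglevel=i;  /* ask SAS to note index usage in the log */

proc sql;
    /* composite index on the join keys; c_year is a made-up name */
    create index c_year on have_count(c_id, worked_in);

    create table colleagues_per_person as
    select a.p_id
          ,a.worked_in as year
          /* b.p_id is missing when nobody else matched, so the count is 0 */
          ,count(distinct b.p_id) as no_colleagues
    from have_count as a
    left join have_count as b
        on  a.c_id = b.c_id
        and a.worked_in = b.worked_in
        and a.p_id ne b.p_id
    group by a.p_id, a.worked_in;
quit;
```

On the three sample rows, 0001 and 0013 each get 1 colleague for 2008 through 2010 and 0 for their other years.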

#2

A more efficient method:

1) Create a long-format table for the results rather than a wide-format one. This will be both easier to populate and easier to work with later.

create table colleagues_by_year (
    p_id int,
    year int,
    colleagues int
);

Now this can be populated with a single insert statement. The only trick is getting the full list of years you want in the final table. There are a few options, but since I'm not too familiar with SAS SQL I'm going to go with a very simple one: a lookup table of years, to which you can join.

create table years (
   year int
);

insert into years 
   values (2007),(2008),...

(A more sophisticated approach would be a recursive query that found the range of all years in the input data).
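
As far as I know, PROC SQL does not offer recursive queries, so in SAS a DATA step loop is probably the simpler route to the same years table. A minimal sketch, assuming the input dataset is called have (the name used in the first answer) and using two made-up macro variable names, first_year and last_year:

```sas
/* find the overall range of years in the input (dataset name have is an assumption) */
proc sql noprint;
    select min(startyear), max(endyear)
        into :first_year trimmed, :last_year trimmed
    from have;
quit;

/* write one row per year in that range */
data years;
    do year = &first_year to &last_year;
        output;
    end;
run;
```

This replaces both the create table and the hand-written insert for years, and it picks up new years automatically when the input changes.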

Now the final insert:

insert into colleagues_by_year
    select p_id,
           year,
           count(*)
       from colleagues
       join years on
          years.year between colleagues.startyear and colleagues.endyear
       group by p_id,year

The result won't contain rows for years in which a person had 0 colleagues. If you want those, make the join to years a left join and count only the rows where years.year is not null.
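
Putting the pieces together in PROC SQL might look like the sketch below. It assumes the raw spans live in a dataset called have (a name borrowed from the first answer, since the colleagues table above is not defined in this answer), that the years table already exists, and it creates the result table directly instead of inserting into a pre-created one. Here the left join is on the colleague side, so a person-year with no colleagues shows up with a count of 0.

```sas
proc sql;
    create table colleagues_by_year as
    select t1.p_id
          ,y.year
          /* t2.p_id is missing when nobody else matched, so the count is 0 */
          ,count(distinct t2.p_id) as colleagues
    from have as t1
    /* one row per year in which the person was employed */
    inner join years as y
        on y.year between t1.startyear and t1.endyear
    /* other people at the same company in that same year */
    left join have as t2
        on  t2.c_id = t1.c_id
        and t2.p_id ne t1.p_id
        and y.year between t2.startyear and t2.endyear
    group by t1.p_id, y.year;
quit;
```

Counting distinct t2.p_id rather than count(*) keeps a colleague from being counted twice when spans overlap more than once in the same year.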
