Looking for ways to count distinct entries across multiple columns/variables with PROC SQL, all I come across is how to count combinations of values. However, I would like to search through two character columns (within rows that meet a certain condition) and count the number of distinct values that appear in either of the two.
Consider a dataset that looks like this:
DATA have;
INPUT A_ID C C_ID1 $ C_ID2 $;
DATALINES;
1 1 abc .
2 0 . .
3 1 efg abc
4 0 . .
5 1 abc kli
6 1 hij .
;
RUN;
I now want a table containing the number of unique values in C_ID1 and C_ID2 across the rows where C = 1. The result should be 4 (abc, efg, hij, kli):
nr_distinct_C_IDs
4
So far, I have only been able to process one column (C_ID1):
PROC SQL;
CREATE TABLE try AS
SELECT
COUNT (DISTINCT
(CASE WHEN C=1 THEN C_ID1 ELSE ' ' END)) AS nr_distinct_C_IDs
FROM have;
QUIT;
(Note that I use CASE processing instead of a WHERE clause since my actual PROC SQL also processes other cases within the same query).
This gives me:
nr_distinct_C_IDs
3
How can I extend this to two variables (C_ID1 and C_ID2 in my example)?
2 Solutions
#1
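It is hard to extend your method to two or more variables. Try stacking the variables first, then counting the distinct values. UNION removes duplicate rows, so a plain COUNT over the stacked column is already a distinct count. The C = 1 condition from the question is applied in each branch. Like this:
proc sql;
  create table want as
  select count(ID) as nr_distinct_C_IDs
  from (/* stack both ID columns; UNION drops duplicate values */
        select C_ID1 as ID from have where C = 1
        union
        select C_ID2 as ID from have where C = 1)
  where not missing(ID);  /* ignore the blank ID */
quit;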
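The question mentions that the real query relies on CASE expressions rather than a WHERE clause. A minimal sketch of the same stacking idea written that way follows; it assumes the `have` table from the question, and `want_case` is a hypothetical output name. It relies on COUNT(DISTINCT ...) ignoring missing values, which is the same behaviour that produces 3 for the single-column attempt in the question.

```sas
proc sql;
  create table want_case as
  /* COUNT(DISTINCT ...) skips missing values, so the blanks
     produced by the ELSE branches are not counted */
  select count(distinct ID) as nr_distinct_C_IDs
  from (select case when C = 1 then C_ID1 else ' ' end as ID from have
        union all
        select case when C = 1 then C_ID2 else ' ' end as ID from have);
quit;
```

For the sample data this should give 4 (abc, efg, hij, kli). UNION ALL is used because de-duplication is left to COUNT(DISTINCT ...); a plain UNION would also work here.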
#2
I think a data step may be a better fit here if your priority is something that extends easily to a large number of variables, e.g.:
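data _null_;
  length ID $3;  /* wide enough for the sample values; widen for longer IDs */
  /* hash object keyed on ID: holds each distinct value once */
  declare hash h();
  rc = h.definekey('ID');
  rc = h.definedone();
  array IDs $ C_ID1-C_ID2;
  do until(eof);
    set have(where = (C = 1)) end = eof;
    do i = 1 to dim(IDs);
      if not(missing(IDs[i])) then do;
        ID = IDs[i];
        rc = h.add();  /* returns 0 only when the key was not already present */
        if rc = 0 then COUNT + 1;
      end;
    end;
  end;
  put "Total distinct values found: " COUNT;
run;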
To accommodate a further variable, all you need to do is add it to the array.
N.B. as this uses a hash object, you will need sufficient memory to hold all of the distinct values you expect to find. On the other hand, it only reads the input dataset once, with no sorting required, so it might be faster than SQL approaches that require multiple internal reads and sorts.
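Since the question asks for a table rather than a log message, the same hash approach can be sketched so that it writes the count to a dataset. This is only a sketch under assumptions: `want` is a hypothetical output name, the ID variables are assumed to be at most 8 characters long (the default for the `$` list input used in the question), and the count is read from the hash object's NUM_ITEMS attribute instead of a running counter.

```sas
data want(keep = nr_distinct_C_IDs);
  length ID $8;                      /* assumed maximum ID length */
  declare hash h();
  rc = h.definekey('ID');
  rc = h.definedone();
  array IDs $ C_ID1-C_ID2;
  do until(eof);
    set have(where = (C = 1)) end = eof;
    do i = 1 to dim(IDs);
      if not missing(IDs[i]) then do;
        ID = IDs[i];
        rc = h.replace();            /* adds the key if new, overwrites it otherwise */
      end;
    end;
  end;
  nr_distinct_C_IDs = h.num_items;   /* number of distinct keys stored in the hash */
  output;
  stop;
run;
```

Using REPLACE rather than ADD means duplicate keys need no special handling, because the hash size alone gives the answer.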