Sum of multiple variables by group

Time: 2022-03-31 21:30:43

I have a dataset with over 900 observations. Each observation represents the population of a sub-geographical area for a given year, broken down by sex (male, female, all) and 20 different age groups.

I have dropped the variable for the sub-geographical area and I want to collapse the data into the greater geographical area (called Geo).

I am having a difficult time doing this with SUM or PROC MEANS because I have so many age groups to sum up, and I am trying to avoid writing them all out. I want to collapse by year, geo and sex so that I only have 3 observations per Geo (my raw data could have as many as 54 observations per Geo).

This is an example of what a tiny section of the raw data looks like:

Year    Geo     Sex     Age0005     Age0610     Age1115     (etc)
2010    1       1       92          73          75
2010    1       2       57          81          69
2010    1       3       159         154         144
2010    1       1       41          38          43
2010    1       2       52          41          39
2010    1       3       93          79          82
2010    2       1       71          66          68
2010    2       2       63          64          70
2010    2       3       134         130         138
2010    2       1       32          35          34
2010    2       2       29          31          36
2010    2       3       61          66          70

This is how I want it to look:

Year    Group   Sex     Age0005     Age0610     Age1115     (etc)
2010    1       1       133         111         118
2010    1       2       109         122         108
2010    1       3       252         233         226
2010    2       1       103         101         102
2010    2       2       92          95          106
2010    2       3       195         196         208 

Any ideas? Please help!

2 solutions

#1


You don't have to write out each variable name individually - there are ways of getting around that. For example, if all of the age group variables that need to be summed start with age, you can use the : wildcard (as in age:) to match them:

proc summary nway data = have;
  var age:;
  class year geo sex;
  output out = want sum=;
run;

If your variables don't have a common prefix, but are all next to each other in one big horizontal group in your dataset, you can use a double dash list instead:

proc summary nway data = have;
  var age0005--age1115; /* Includes all variables positioned between these two in the dataset */
  class year geo sex;
  output out = want sum=;
run;

Note also the use of sum= - this means that each summarised variable is reproduced with its original name in the output dataset.
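
One more thing to be aware of: PROC SUMMARY also writes the automatic variables _TYPE_ and _FREQ_ to the output dataset. If you only want the columns shown in the question, a minimal sketch (assuming the same have dataset and variable names as above) is to drop them with a dataset option:

```sas
proc summary nway data = have;
  var age:;
  class year geo sex;
  /* drop= removes the automatic _TYPE_ and _FREQ_ columns from the output */
  output out = want (drop = _type_ _freq_) sum=;
run;
```

_FREQ_ holds the number of input rows that went into each output row, so it can be worth keeping for a sanity check before dropping it.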

#2


I personally like to use proc sql for this, since it makes it very clear what you're summing and grouping by.

    data old ; 
    input Year Geo Sex Age0005 Age0610 Age1115 ; 
    datalines;
    2010 1 1 92 73 75
    2010 1 2 57 81 69
    2010 1 3 159 154 144
    2010 1 1 41 38 43
    2010 1 2 52 41 39
    2010 1 3 93 79 82
    2010 2 1 71 66 68
    2010 2 2 63 64 70
    2010 2 3 134 130 138
    2010 2 1 32 35 34
    2010 2 2 29 31 36
    2010 2 3 61 66 70
    ;
    run;

    proc sql ; 
     create table new as select 
      year
        , geo label = 'Group'
        , sex
        , sum(age0005) as age0005
        , sum(age0610) as age0610
        , sum(age1115) as age1115
        from old 
        group by geo, year, sex ; 
    quit; 
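
With 20 age groups, typing out every sum() gets tedious, and PROC SQL does not accept variable lists such as age: inside sum(). One possible workaround, sketched here under the assumption that the table is WORK.OLD and that every column to be summed starts with Age, is to build the select list from dictionary.columns into a macro variable (sumlist is a name made up for this example):

```sas
proc sql noprint;
  /* Build "sum(Age0005) as Age0005, sum(Age0610) as Age0610, ..." */
  select 'sum(' || strip(name) || ') as ' || strip(name)
    into :sumlist separated by ', '
    from dictionary.columns
    where libname = 'WORK'            /* libname and memname are stored in upper case */
      and memname = 'OLD'
      and upcase(name) like 'AGE%';   /* assumption: only the age columns match */

  /* Use the generated list in the aggregating query */
  create table new as
    select year, geo, sex, &sumlist
    from old
    group by year, geo, sex;
quit;
```

Running %put &sumlist; before the second query is a quick way to check that the generated list contains exactly the columns you expect.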
