I have a dataset with over 900 observations. Each observation represents the population of a sub-geographical area for a given year, broken down by gender (male, female, all) and 20 different age groups.
I have dropped the variable for the sub-geographical area and I want to collapse the data into the greater geographical area (called Geo).
I am having a difficult time doing a SUM or PROC MEANS because I have so many age groups to sum up and I am trying to avoid writing them all out. I want to collapse by Year, Geo and Sex so that I only have 3 observations per Geo (my raw data could have as many as 54 observations).
This is an example of what a tiny section of the raw data looks like:
Year Geo Sex Age0005 Age0610 Age1115 (etc)
2010 1 1 92 73 75
2010 1 2 57 81 69
2010 1 3 159 154 144
2010 1 1 41 38 43
2010 1 2 52 41 39
2010 1 3 93 79 82
2010 2 1 71 66 68
2010 2 2 63 64 70
2010 2 3 134 130 138
2010 2 1 32 35 34
2010 2 2 29 31 36
2010 2 3 61 66 70
This is how I want it to look:
Year Group Sex Age0005 Age0610 Age1115 (etc)
2010 1 1 133 111 118
2010 1 2 109 122 108
2010 1 3 252 233 226
2010 2 1 103 101 102
2010 2 2 92 95 106
2010 2 3 195 196 208
Any ideas? Please help!
2 solutions
#1
You don't have to write out each variable name individually; there are ways of getting around that. For example, if all of the age group variables that need to be summed start with age, you can use the : (colon) wildcard to match them:
proc summary nway data = have;
var age:;
class year geo sex;
output out = want sum=;
run;
If your variables don't have a common prefix, but are all next to each other in one big horizontal group in your dataset, you can use a double dash list instead:
proc summary nway data = have;
var age0005--age1115; /*Includes all variables between these two*/
class year geo sex;
output out = want sum=;
run;
Note also the use of sum= with nothing after the equals sign: this means that each summarised variable keeps its original name in the output dataset.
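One more thing about the output dataset: PROC SUMMARY also writes the automatic variables _TYPE_ and _FREQ_ to it. If you only want the class variables and the sums, a minimal sketch is to drop them with a dataset option. The dataset names have and want are the same placeholders used above, and the data step just recreates a few rows from the question so the example runs on its own.

```sas
/* A few rows from the question; "have" is a placeholder dataset name */
data have;
    input Year Geo Sex Age0005 Age0610 Age1115;
    datalines;
2010 1 1 92 73 75
2010 1 2 57 81 69
2010 1 3 159 154 144
2010 1 1 41 38 43
2010 1 2 52 41 39
2010 1 3 93 79 82
;
run;

proc summary nway data = have;
    class year geo sex;
    var age:;
    /* drop= removes the automatic _TYPE_ and _FREQ_ columns from the output */
    output out = want (drop = _type_ _freq_) sum=;
run;
```

For this sample, want should contain one row per Year/Geo/Sex combination, with only the class variables and the summed age columns.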
#2
I personally like to use proc sql for this, since it makes it very clear what you're summing and grouping by.
data old ;
input Year Geo Sex Age0005 Age0610 Age1115 ;
datalines;
2010 1 1 92 73 75
2010 1 2 57 81 69
2010 1 3 159 154 144
2010 1 1 41 38 43
2010 1 2 52 41 39
2010 1 3 93 79 82
2010 2 1 71 66 68
2010 2 2 63 64 70
2010 2 3 134 130 138
2010 2 1 32 35 34
2010 2 2 29 31 36
2010 2 3 61 66 70
;
run;
proc sql ;
create table new as select
year
, geo label = 'Group'
, sex
, sum(age0005) as age0005
, sum(age0610) as age0610
, sum(age1115) as age1115
from old
group by geo, year, sex ;
quit;
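With 20 age groups, typing every sum() by hand gets tedious. One possible way around that, sketched here under the assumption that the table is WORK.OLD as created above and that every variable to be summed is numeric and starts with "age", is to let PROC SQL build the select list from DICTIONARY.COLUMNS. The macro variable name sum_list is made up for this example.

```sas
proc sql noprint;
    /* Build "sum(Age0005) as Age0005, sum(Age0610) as Age0610, ..." */
    select catx(' ', cats('sum(', name, ')'), 'as', name)
        into :sum_list separated by ', '
        from dictionary.columns
        where libname = 'WORK'            /* library names are stored in upper case */
          and memname = 'OLD'             /* so are dataset names */
          and type = 'num'
          and upcase(name) like 'AGE%';

    /* Use the generated list in the aggregation query */
    create table new as
        select year, geo, sex, &sum_list
        from old
        group by year, geo, sex;
quit;
```

This keeps the explicit GROUP BY that makes the SQL version easy to read, while the list of summed columns follows whatever age variables are actually in the table.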