一、 在SAS中进行随机抽样:
1、 在实际数据处理中常常需要进行样本抽样,在实践中主要有两种情况:
(1)简单无重复抽样
(2)分层抽样 a.等比例分层抽样 b. 不等比例分层抽样;
2、SAS 中可以利用PROC suveryselect 过程实现各种抽样:
其一般形式是:
PROC SURVEYSELECT data=<源数据集名> method = <srslursl sys > out=<抽取样本存放的数据集> n=<抽取数量>(or samprate=抽样比例) seed =n;
strata <指定分层变量>;
id <指定抽取的样本所保留的源数据集变量>;
run;
说明:method用来指定随机抽样方法的,其中SRS是指不放回简单随机抽样(Simple Random Samping);urs是指放回简单随机抽样(Unrestricted Random Sampling);sys是指系统抽样(Systematic Sampling)。seed用来指定随机种子数,为非负整数,取0则每次抽取的样本不同,若取大于0的整数,则下次抽样时若输入相同值即可得到相同的样本;id是指定从源数据集复制到样本数据集的变量,若缺省,则复制所有变量。
3、简单无重复随机抽样举例:
/*按30%的比例从test数据集中抽取样本,并把样本输出到results数据集中*/
proc surveyselect data=test1 out=results1 method=srs samprate=0.3;
run;
4、分层等比例随机抽样举例;
proc sort data=test2;
by 分层变量;
run; /**先用分层变量对总体样本进行排序/
proc surveyselect data=test2 out=results2 method=srs samprate=0.1;
strata 分层变量;
run; /*根据分层变量等比例从总体中抽取样本*/
5、分层不等比例抽样举例;
(1)手工设置抽样比例或者抽样数
proc sort data=test3;
by 分层变量;
run; /**先用分层变量对总体样本进行排序/
proc surveyselect data=test3 out=results3 method=srs
samprate=(0.1,0.3,0.5,0.2); /*根据分层情况设置每一层要抽取的比例*/
strata 分层变量;
run; /*根据分层变量不等比例从总体中抽取样本*/
proc surveyselect data=test3 out=results3 method=srs
n=(30,20,50,40); /*根据分层情况设置每一层要抽取的样本数*/
strata 分层变量;
run;
(2)根据抽样表进行不等比例抽样
proc sort data=test3;
by 分层变量;
run; /**先用分层变量对总体样本进行排序/
proc surveyselect data=test3 out=results3 method=SRS
samprate=samp_table; /*通过抽样比例数据集进行抽样,samp_table数据集中要包括分层变量 以及每一分层对应的抽样比例或者数量,如果按比例抽样变量必须用_rate_来命名抽样比例,如果是按数量抽样必须用_nsize_来命名抽样数量*/
strata 分层变量;
run;
6、关于surveyselect过程的更多内容详见SAS帮助
在命令栏输入 help surveyselect 然后按enter键即可。
二、把数据分成训练集和测试集的程序:
1、data train(drop=u) validate(drop=u);
set develop1;
u=ranuni(27513);
if u<=.67 then output train;
else output validate;
run;
A DATA step is used to split DE VELOP into TRAIN and VALIDATE. The
variable U is created using the RANUNI function, which generates pseudo-random numbers from a uniform distribution on the interval (0,1). The
RANUNI function argument is an initialization seed. Using a particular
number, greater than zero, will produce the same split each time the DATA
step is run. If the seed was zero, then the data would be split differently each
time the DATA step was run. The IF statement puts approximately 67% of
the data into TRAIN and approximately 33% of the data into VALIDATE
because on average 33% of the values of a uniform random variable are
greater than 0.67.
三、 SAS logistic中class选项的使用:
Class语句的作用其实是设置虚拟变量。
- 如果是两分类变量,其实分类不分类效果都是一样的;
- 多分类变量,如果是数值型,不指定分类,SAS会当做连续变量处理(分类变量时有序分类变量时,比如年龄段,工资水平等,记成1,2,3,4等等,这样处理是有意义的,如果是无序分类变量,这样处理就没有任何意 义);如果是非数值型,SAS会报错;
- 对待多分类无序变量,是必须要设置成分类变量的,否则SAS会报错;
SAS对分类变量,采用param=ref;
即分类变量 Response Profile是
[1 0
0 1
0 0]
的形式;
作为参照对象的参数估计为0;这样在估计oddratios比较方便,分类水平vs参照的水平的优势比=exp(该水平参数估计) - 不指定ref参数,SAS会以最大值最为参照水平,比如sex会以1作为参照水平;在没有设置param参数时,SAS默认是param=effect;
Response Profile
[1 0
0 1
-1 -1]
的形式;
5、ref=看你想把什么设置成参照水平:
Ref=first、ref=last、或ref=”某类别赋值”,表示以第一类、最后一类或其中的某一类作为参照组;
param=ref 主要强调参数估计,计算优势比oddsratio比较方便;
param=effect主要强调假设检验,方便交互分析。
总的来说,这两个参数的设置只是编码方式不同,不同的设置只是为了方便后面的估计或检验的计算,最后的运算结果应该是一样的。
四、logistic in SAS
1、计算score:
The SCORE procedure multiplies values from two SAS data sets, one
containing coefficients (SCORE=) and the other containing the data to be
scored (DATA=). The data set to be scored typically would not have a target
variable. The OUT= option specifies the name of the scored data set created
by PROC SCORE. The TYPE=PARMS option is required for scoring
regression models.
proc score data=read.new out=scored score=betas1
type=parms;
var dda ddabal dep depamt cashbk checks;
run;
Data can also be scored directly in PROC LOGISTIC using the OUTPUT
statement. This has several disadvantages over using PROC SCORE: it does
not scale well with large data sets, it requires a target variable (or some
proxy), and the adjustments for oversampling, discussed in a later section,
are not automatically applied.
2、 缺失值的填充:
The STDIZE procedure with the REPONLY option can be used to replace missing values. The METHOD= option allows you to choose several different location measures such as the mean, median, and midrange. The output data set created by the OUT= option contains all the variables in the input data set where the variables listed in the VAR statement are imputed. Only numeric input variables should be used in PROC STDIZE.
proc stdize data=develop1 reponly method=median out=imputed;
var &inputs;
run;
proc print data=imputed(obs=20);
var ccbal miccbal ccpurc miccpurc income miincome
hmown mihmown;
run;
PROC STANDARD with the REPLACE option can be used to replace missing values with the mean of that variable on the non-missing cases.
3、 模型验证及评价部分(ROC及logistic procedure验证):
The INEST= option on the PROC LOGISTIC statement names the data set that contains initial parameter estimates for starting the iterative ML estimation algorithm. The MAXITER= option in the MODEL statement specifies the maximum number of iterations to perform. The combination of DATA=validation data, INEST= final estimates from training data , and MAXITER=0 causes PROC LOGISTIC to score, not refit, the validation data. The OFFSET= option is also needed since the offset variable was used when creating the final parameter estimates from the training data set.
The OUTROC= option creates an output data set with sensitivity (_SENSIT_) and one minus specificity (_1MSPEC_) calculated for a full range of cutoff probabilities (_PROB_). The other statistics in the OUTROC= data set are not useful when the data is oversampled. The two variables _SENSIT_ and _1MSPEC_ in the OUTROC= data set are correct whether or not the validation data is oversampled. The variable _PROB_ is correct, provided the INEST= parameter estimates were corrected for oversampling using sampling weights. If they were not corrected or if they were corrected with an offset, then _PROB_ needs to be adjusted using the formula (shown in Section 2.2).
proc logistic data=validate des inest=betas;
model ins=&selected / maxiter=0 outroc=roc offset=off;
run;
However, this model should be assessed using the validation data set because the inclusion of many higher order terms may increase the risk of overfitting.
proc logistic data=train1 des outest=betas;
model ins=miphone CHECKS MM CD brclus1 DDABAL TELLER
SAVBAL CASHBK brclus3 ACCTAGE SAV DDA ATMAMT
PHONE INV ATM savbal*savbal ddabal*ddabal
ddabal*savbal atmamt*atmamt
savbal*dda brclus1*atmamt mm*savbal
acctage*acctage miphone*brclus1
checks*ddabal ddabal*phone ddabal*brclus3
mm*phone sav*dda mm*dda
cashbk*acctage;
run;
proc logistic data=validate des inest=betas;
model ins=miphone CHECKS MM CD brclus1 DDABAL TELLER
SAVBAL CASHBK brclus3 ACCTAGE SAV DDA ATMAMT
PHONE INV ATM savbal*savbal ddabal*ddabal
ddabal*savbal atmamt*atmamt
savbal*dda brclus1*atmamt mm*savbal
acctage*acctage miphone*brclus1
checks*ddabal ddabal*phone ddabal*brclus3
mm*phone sav*dda mm*dda
cashbk*acctage / maxiter=0 ;
run;
4、 CROSS-VALIDATION
(1)K-fold
%let k=5;
data xx10f;
do replicate=1 to &k ;
do rec=1 to numrecs;
set mylib.stu nobs=numrecs point=rec;
%let m=floor(numrecs/&k);
/* if replicate ^= rec then output;*/
if replicate ^=ceil(rec/&m) then do;
new_y=y;
selected=1;
end;
else do;
new_y=.;
selected=0;
end;
output;
end;
end;
stop;
run;
(2)LOOCV
data xx;
do replicate=1 to numrecs ;
do rec=1 to numrecs;
set mylib.stu nobs=numrecs point=rec;
/* if replicate ^= rec then output;*/
if replicate ^= rec then new_y=y;
else new_y=.;
output;
end;
end;
stop;
run;
(3)bootstrapping
%let K=3;
%let rate=%sysevalf((&k-1)/&k);
proc surveyselect data=temp1 out=xv seed=7589747 method=urs
samprate=&rate outall rep=k
run;
data xv;
set xv;
if selected then new_y=y;
run;
五、运行程序时,日志窗口老是提示满了而中断运行,怎么办?
(1)option nonotes; 可以让SAS不输出notes。
也可以用proc printto;来指定log的内容到外部文件:
*** point log to an external file.;
proc printto log="c:\test.txt";
run;
--Your program---;
*** Point the log to its default destination;
proc printto;run;
(2)不生成日志的设置:
You can do as,
proc printto log=_null_;
run;
proc print data=sashelp.class;
run;
%put 'not thing show' _all_;
proc printto log=log;
run;
proc print data=sashelp.class;
run;
%put _all_;
(3)不生成log,可运行:
- options nosource nonotes errors=0;
不生成log可提高运行速度
六、利用ROC曲线寻找合理的界值(cut-off 值):
ROC曲线可结合灵敏度和特异度寻找一个界值,使灵敏度和特异度结合得最优。通常有两种方式:一是根据各个灵敏度和特异度,计算使(灵敏度+特异度-1)取值最大的一个点,作为界值;二是从ROC曲线图中寻找一个最靠近左上角的点作为界值。通过这两种方式寻找的点在通常情况下是一致的。
七、exist of maximum likelihood的解决方法:
(1)Exact语句对指定的变量执行精确检验。在例数不多、模型不大的情况下,当结果不稳定时,可以用到这一选项(分类变量须添加param=ref这一参数),指定需执行精确检验的变量;如果样本量大、模型复杂,执行这个语句就会提示内存不足了;
(2)Strata语句是SAS 9.0版本以后加上的,专门用于匹配设计的logistic回归分析。该语句实现了应用proc logistic命令对1:1、1:m、m:n等多配比资料进行分析。Strata语句主要指定匹配组变量,在病例交叉研究中,每个个体是一个匹配组,因此个体编号就是strata语句需指定的匹配组变量。
(3)SAS9.2以后的版本可使用firth选项解决这个问题。
八、分层等比例抽样:
使用surveyselect中的strata语句可对变量进行分层抽样,这样,生成的cv数据集中各组的0和1的比例是相同的。
九、logistic 过程中“检验全局零假设”这部分是模型总体检验结果。
Logistic回归的单因素分析结果与卡方检验结果一致。有的文章中采用logistic回归进行单因素分析,有的采用卡方检验进行单因素分析,实际上结果是相同的。
十、秩和检验的适用范围
如果两个样本来自两个独立的但非正态或形态不清的两总体,要检验两样本之间的差异是否显著,不应运用参数检验中的T检验,而需采用秩和检验。
秩和检验
应用条件
①总体分布形式未知或分布类型不明;
②偏态分布的资料:
③等级资料:不能精确测定,只能以严重程度、优劣等级、次序先后等表示;
④不满足参数检验条件的资料:各组方差明显不齐。
⑤数据的一端或两端是不确定数值,如“
>50mg
”等。
十一、proc logistic 中model statement后面的选项aggregate scale= 及rsquare(引用:《医学案例统计分析与SAS应用》P172-178)
Model y=chage rs2 rs3 lc mr/aggregate scale=none;
/*选项aggregate 和scale输出Pearson 卡方和Deviance值,用于拟合优度评价,rsquare输出广义R2。*/
如果Deviance和Pearson 卡方的P值均较低,提示模型拟合不充分。Deviance和Pearson 卡方值均大于1,提示可能存在过离散现象。(即两者的值应较小,P值应较大才好?)
若去掉无统计意义的变量后,Pearson 卡方和Deviance值仍大于1,提示过离散现象的存在。我们可以采用Pearson 卡方和Deviance统计量来进行调整。这里我们采用Pearson 卡方进行调整,只要将选项改为 scale=pearson,则结果中协方差矩阵就会乘以异质因子(即Pearson 卡方值与其*度之比)。
十二、关于输出结果缺少了最终选择的模型的ROC曲线的问题:
因为outroc这个选项与aggregate scale这个选项有冲突,后者会改变了数据矩阵,因此使得最终的结果数据无法画出选出模型的ROC曲线了;