在给学生上课或者与学生交流过程中,发现很多学生往往在导入数据第一步就“卡壳”了,因为现实的数据和教科书配套的习题集是完全不一样的。现实的数据集就好比一只没有杀好的猪,看你有没有本事从中提取你要的部位。而教科书配套的习题集已经是超市里面封装好的一块块“猪颈肉”或者“五花肉”。
其实总结来说无外乎就这么几种方法,把这些代码保存起来,遇到相匹配的数据就直接把相应的代码拿来用。当然对其中的一些语法还是要熟悉,否则情况有一点变法都没有办法应对了。在《Data Preparation for Data Mining》里面还有详细介绍了数据清理的很多技巧,以后再慢慢补充,这里就只总结数据导入。
注:程序中使用的数据集来源于SAS TRAINING KITS PROG I中的数据,网上大把都是,自己去下载。
一、几种常用文本格式数据导入方法:
/*导入不带格式的制表位分隔的数据*/
data work.dfwlax;
infile 'D:\data&program\dfwlax.dat';
input Flight $ 1-3 Date $ 4-11
Dest $ 12-14 FirstClass 15-17
Economy 18-20;
run;
/*导入带格式的制表位分隔的数据*/
data work.dfwlax2;
infile 'D:\data&program\dfwlax.dat';
input Flight $ 1-3
@4 Date mmddyy8.
Dest $ 12-14
FirstClass 15-17
Economy 18-20;
format date date9.;
run;
/*导入带格式的逗号分隔的数据*/
data dfwlax3;
infile 'D:\data&program\dfwlax.csv' dlm=',' firstobs=2 ;
input Flight $ Date Dest $ FirstClass Economy;
informat date mmddyy8.;
format date date9.;
run;
/*导入EXCEL数据*/
PROC IMPORT OUT= WORK.SANFRAN4
DATAFILE= "C:\sfosch.xls"
DBMS=EXCEL REPLACE;
RANGE="SFODATA";
GETNAMES=YES;
MIXED=NO;
SCANTEXT=YES;
USEDATE=YES;
SCANTIME=YES;
RUN;
二、数据导入参数解读
主要包括 @ @@ DSD MISSOVER TRUNCOVER DLM等常用参数。资料来源:A LITTLE SAS BOOK (CHAPTER 2)
相信大家英文水平都不错,就不翻译了。8-)
Reading Raw Data Not in Standard Format
input name $16. age 3. +1 type $1. +1 date mmddyy10. (score1 score2 score3) (4.1);
Mixing Input Styles
INPUT ParkName $ 1-22 State $ Year @40 Acreage COMMA9.;
Reading Messy Raw Data
INPUT @’Breed:’ DogBreed $20.;
Reading Multiple Lines of Raw Data per Observation
INPUT City $ State $ / NormalHigh NormalLow #3 RecordHigh RecordLow;
Reading Multiple Observations per Line of Raw Data
INPUT City $ State $ NormalRain MeanDaysRain @@;
/*@@ SAS will hold that line of data, continuing to read observations until it either runs out of data or reaches an INPUTstatement that does not end with a double trailing @.*/
Reading Part of a Raw Data File
/*Trailing @ versus double trailing @ The double trailing @ is similar to the trailing @. Both are line-hold specifiers; the difference is how long they hold a line of data for input. The trailing @ holds a line of data for subsequent INPUT statements, but releases that line of data when SAS returns to the top of the DATA step to begin building the next observation. The double trailing @ holds a line of data for subsequent INPUT statements even when SAS starts building a new observation. In both cases, the line of data is released if SAS reaches a subsequent INPUT statement that does not contain a line-hold specifier.*/
Controlling Input with Options in the INFILE Statement
INFILE ’c:\MyRawData\Sales.dat’ FIRSTOBS = 3 obs=5;
MISSOVER By default, SAS will go to the next data line to read more data if SAS has reached
the end of the data line and there are still more variables in the INPUT statement that have not been assigned values. The MISSOVER option tells SAS that if it runs out of data, don’t go to the next data line. Instead, assign missing values to any remaining variables.
Reading Delimited Files with the DATA Step
INFILE ’c:\MyRawData\Books.dat’ DLM = ’,’;
The DSD option The DSD (Delimiter-Sensitive Data) option for the INFILE statement does three things for you. First, it ignores delimiters in data values enclosed in quotation marks. Second, it does not read quotation marks as part of the data value. Third, it treats two delimiters in a row as a missing value.
TRUNCOVER You need the TRUNCOVER option when you are reading data using column
or formatted input and some data lines are shorter than others. If a variable’s field extends past the end of the data line, then, by default, SAS will go to the next line to start reading the variable’s value. This option tells SAS to read data for the variable until it reaches the end of the data line, or the last column specified in the format or column range, whichever comes first.
Reading PC Files with DDE
* Read an Excel spreadsheet using DDE;
FILENAME baseball DDE 'CLIPBOARD';
DATA sales;
INFILE baseball NOTAB DLM='09'x DSD MISSOVER;
LENGTH VisitingTeam $ 20;
INPUT VisitingTeam CSales BSales OurHits TheirHits OurRuns TheirRuns;
RUN;