写在前面:本文为本人所做数据分析关于信用评分卡的习作,使用的是一个多年前kaggle的一个数据集,所以已经有人做过相关的分析。正在学习增强中,水平有限,文中不当之处望各位多多指点。
一、 数据介绍
SeriousDlqin2yrs |
是否有超过90天的逾期 |
RevolvingUtilizationOfUnsecuredLines |
循环贷款无抵押额度 |
Age |
借款人年龄 |
NumberOfTime30-59DaysPastDueNotWorse |
有逾期30-59天,但在过去2年没有更糟的情况出现的次数 |
DebtRatio |
每月债务支付,赡养费,生活费用除以月总收入 |
MonthlyIncome |
月收入 |
NumberOfOpenCreditLinesAndLoans |
开放式贷款的数量(如汽车贷款或抵押贷款等分期付款)和信用额度(如信用卡) |
NumberOfTimes90DaysLate |
借款人逾期90天或以上的次数 |
NumberRealEstateLoansOrLines |
按揭及房地产贷款数目,包括房屋净值信贷额度。 |
NumberOfTime60-89DaysPastDueNotWorse |
借款人逾期60-89天的次数,但过去两年更糟的情况出现 |
NumberOfDependents |
不包括自己在内的家属(配偶,子女等)数量。 |
二、 数据获取和整合
1. 数据获取
adress1<-"E:\\project\\pfk\\cs-training.csv"
train1<-read.table(adress1,header=TRUE,sep=",",row.names="id")
2. 变量名整理
由于变量名都太长了,进行重新命名
names(train1)[1:11]<-c("y","x1","x2","x3","x4","x5","x6","x7","x8","x9","x10")
3. 缺失值分析及处理
> md.pattern(train1)
y x1 x2 x3 x4 x6 x7 x8 x9 x10 x5
120269 1 1 1 1 1 1 1 1 1 1 1 0
25807 1 1 1 1 1 1 1 1 1 1 0 1
3924 1 1 1 1 1 1 1 1 1 0 0 2
0 0 0 0 0 0 0 0 0 3924 29731 33655
> matrixplot(train1)
存在缺失x5(月收入)的情况,x5(月收入)与x10(家庭成员情况)同时缺失的情况。
由于缺失值数量较多,不宜采用直接删除的方法,从缺失的位置可以看出,x10的缺失只出现在x5也缺失的情况下,为非随机缺失,也不宜直接删除。这里选择对缺失值进行填补,使用k临近方法进行插值。此处缺失值x10(家庭成员数量)为整数,将k设置为奇数,采用中位数进行插值。
> train2<-knnImputation(train1[,2:11],k=11, scale =T,meth = "median", distData =NULL)
缺失值在数据中显示为NA,但是在使用knnImputation函数的时候将distData设置为NA反而会出现Not sufficient complete cases for computing neighbors.错误,不进行指定反而可以正常运行。迷
4. 异常值的分析处理
(1) 单变量异常检测
> summary(train2)
x1 x2 x3 x4 x5
Min.
: 0.00 Min. : 0.0
Min. : 0.000 Min.
: 0.0 Min.
: 0
1st Qu.: 0.03
1st Qu.: 41.0 1st Qu.:
0.000 1st Qu.: 0.2
1st Qu.: 2500
Median : 0.15
Median : 52.0 Median : 0.000 Median :
0.4 Median :
4500
Mean
: 6.05 Mean : 52.3
Mean : 0.421 Mean
: 353.0 Mean
: 5658
3rd Qu.: 0.56
3rd Qu.: 63.0 3rd Qu.:
0.000 3rd Qu.: 0.9
3rd Qu.: 7416
Max. :50708.00 Max. :109.0 Max.
:98.000 Max. :329664.0
Max. :3008750
x6 x7 x8 x9 x10
Min.
: 0.000 Min. : 0.000
Min. : 0.000 Min.
: 0.0000 Min. : 0.0000
1st Qu.: 5.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.0000
Median : 8.000 Median : 0.000 Median : 1.000 Median : 0.0000 Median : 0.0000
Mean
: 8.453 Mean : 0.266
Mean : 1.018 Mean
: 0.2404 Mean : 0.7451
3rd Qu.:11.000 3rd Qu.: 0.000 3rd Qu.: 2.000 3rd Qu.: 0.0000 3rd Qu.: 1.0000
Max.
:58.000 Max. :98.000
Max. :54.000 Max.
:98.0000 Max. :20.0000
根据变量的定义
X1是已经使用的信用额度与总的信用额度之比,正常情况下在(0,1)之间,但是数据出现了50708这样的异常值,综合考虑现实中可能会出现超额的情况和银行对贷款应有的风险管理,定义正常的数据范围为(0,1.5),超过1.5的为异常数据,予以删除。
X2(年龄)有0的情况 ,0为异常值。
par(mfrow=c(1,3))
boxplot(train1$x3,xlab="x3")
boxplot(train1$x7,xlab="x7")
boxplot(train1$x9,xlab="x9")
X3,x7,x9在大于80位置,均有异常值
其他变量没有异常点或者无法通过经验对数据范围进行限定,暂不做处理。
处理方法:
> train3<-train21[which(train21$x2!=0&train21$x3<=80&train21$x7<80&train21$x9<80&train21$x1<=1.5),]
(2) 多变量异常值检测
多变量异常值检测,使用LOF检测异常值。在此处可采用多个异常值检验模型进行检验,后根据多个模型的结论投票最终决定一个值是否为异常值,由于水平有限暂时做不了。
a <-names(train3)%in%c("y")
train31<-train3[!a]
train31<-scale(train31)
yc <- lofactor(train31, k=5)
train32<-cbind(train3,yc)
plot(density(train32[which(train32$yc>0&train32$yc<3),]$yc))
定义大于1.5的部分为异常值。
#对异常值进行筛选
train33<-na.omit(train32[which(train32$yc>0&train32$yc<1.5),])
train33<-train33[,1:11]
三、 数据探索(EDA)与变量描述性统计
> stat.desc(train33,basic=TRUE,desc=TRUE,norm=FALSE,p=0.95)
1. 单变量分析
X1:
p<-ggplot(train33,aes(x=x1))#
p+geom_histogram(bins = 50,aes(fill=factor(y)),position = "stack")+scale_color_manual(values = c("red4","blue4"))
p<-ggplot(train33,aes(x=x1))#
p+geom_histogram(bins = 50,aes(fill=factor(y)),position = "fill")+scale_color_manual(values = c("red4","blue4"))
评析:循环贷款无抵押额度在人数分布上集中于0与1两个点附近,比率越大,人数分布越少;违约概率与循环贷款无抵押额度在(0,1)区间内呈现比较标准的线性正相关,而大于1后呈现出非常高的违约率。
X2:
p<-ggplot(train33,aes(x=x2))#
p+geom_histogram(binwidth = 2,aes(fill=factor(y)),position = "stack")+scale_color_manual(values = c("red4","blue4"))
p<-ggplot(train33,aes(x=x2))#
p+geom_histogram(binwidth = 2,aes(fill=factor(y)),position = "fill")+scale_color_manual(values = c("red4","blue4"))
评析:年纪的人数分布大致呈现正态分布,年纪98岁以前呈现出随着年纪的增加,违约率下降,而到了98岁之后呈现异常的违约率增加的现象。
X3:
p<-ggplot(train33,aes(x=x3))
p+geom_histogram(binwidth = 1,aes(fill=factor(y)),position = "stack")+scale_color_manual(values = c("red4","blue4"))
p<-ggplot(train33,aes(x=x3))
p+geom_histogram(binwidth = 1,aes(fill=factor(y)),position = "fill")+scale_color_manual(values = c("red4","blue4"))
评析:35-59天违约的次数集中在0次,有过本类违约的人是较少的,随着违约次数的增加,违约率先增加,然后在6-7次出出现违约率减少,而在10次以后出现了违约率猛然上升的现象。
X4:
未发现规律性
X5:
p<-ggplot(train33[which(train33$x5<100000&train33$x1>0),],aes(x=x5))
p+geom_histogram(bins = 50,aes(fill=factor(y)),position = "stack")+scale_color_manual(values = c("red4","blue4"))
p<-ggplot(train33[which(train33$x5<100000&train33$x1>0),],aes(x=x5))
p+geom_histogram(bins = 50,aes(fill=factor(y)),position = "fill")+scale_color_manual(values = c("red4","blue4"))
评析:人群主要分布在10000月收入以下,在10000月收入以下,违约率呈现收入越高违约率越低的趋势,但是在更高月收入的数据当真呈现混乱的现象,初步判断是少部分顾客将月收入错填为年收入而产生的混乱。
X6:
p<-ggplot(train33,aes(x=x6))
p+geom_histogram(binwidth =4,aes(fill=factor(y)),position = "stack")+scale_color_manual(values = c("red4","blue4"))
p<-ggplot(train33,aes(x=x6))
p+geom_histogram(binwidth = 4,aes(fill=factor(y)),position = "fill")+scale_color_manual(values = c("red4","blue4"))
评析:人群在不同的信贷数量下呈现正态的分布,违约率呈现出持有较少的信贷数量和较多信贷数量的客户有较高的违约概率,而持有较多数量信贷的客户比持有较少的信贷数量的客户的违约率更高。
X7:
p<-ggplot(train33,aes(x=x7))
p+geom_histogram(binwidth = 2,aes(fill=factor(y)),position = "stack")+scale_color_manual(values = c("red4","blue4"))
p<-ggplot(train33,aes(x=x7))
p+geom_histogram(binwidth = 2,aes(fill=factor(y)),position = "fill")+scale_color_manual(values = c("red4","blue4"))
评析:x7与x3类似
X8:
p<-ggplot(train33,aes(x=x8))
p+geom_histogram(binwidth = 4,aes(fill=factor(y)),position = "stack")+scale_color_manual(values = c("red4","blue4"))
p<-ggplot(train33,aes(x=x8))
p+geom_histogram(binwidth = 4,aes(fill=factor(y)),position = "fill")+scale_color_manual(values = c("red4","blue4"))
评析:随着贷款数量的增加,人数越来越少,大多数的人的贷款数量在4个以下,随着贷款数量的增加,违约率也呈现着增加的趋势。
X9:
与x3,x7类似
X10:
p<-ggplot(train33,aes(x=x10))
p+geom_histogram(binwidth = 1,aes(fill=factor(y)),position = "stack")+scale_color_manual(values = c("red4","blue4"))
p<-ggplot(train33,aes(x=x10))
p+geom_histogram(binwidth = 1,aes(fill=factor(y)),position = "fill")+scale_color_manual(values = c("red4","blue4"))
评析:家庭成员越多,客户越少,单身的客户占了大半;随着家庭成员的增多,违约率变化不大,当家庭成员在8个以上时,违约率低。
2. 变量间的相关性分析
library(corrplot)
cor1<-cor(train33)
corrplot(cor1)
geom_tile(aes(fill=value))
corrplot(corr=cor1, method = "number",addCoef.col = "grey")
评析:各变量间相关性小,初步判断符合logistic回归的要求。
四、 模型开发
1. 数据切分
library(caret)
set.seed(1111)
inTrain <- createDataPartition(y=train33$y,p=0.5,list=FALSE)
training <- train33[inTrain, ]
testing <- train33[-inTrain, ]
summary(testing)
2. 建立模型
glm1<-glm(y~.,family = binomial(link = "logit"),data = training)
summary(glm1)
Call:
glm(formula = y ~ ., family = binomial(link = "logit"), data =
training)
Deviance
Residuals:
Min 1Q
Median 3Q Max
-4.2404 -0.3241 -0.2230
-0.1725 3.2137
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.310338803 0.080252123
-41.249 < 0.0000000000000002 ***
x1 1.958402994 0.048926358
40.028 < 0.0000000000000002 ***
x2 -0.019645433 0.001371451 -14.325 < 0.0000000000000002
***
x3 0.431973025 0.016448532
26.262 < 0.0000000000000002 ***
x4 -0.000062736 0.000017252
-3.636 0.000276 ***
x5 -0.000028903 0.000004622
-6.253 0.000000000402647570 ***
x6 0.030681183 0.003769002
8.140 0.000000000000000394 ***
x7 0.696080668 0.024405712
28.521 < 0.0000000000000002 ***
x8 0.128121160 0.015487068
8.273 < 0.0000000000000002 ***
x9 0.591574598 0.032721639
18.079 < 0.0000000000000002 ***
x10 0.049479163 0.014527829
3.406 0.000660 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null
deviance: 35353 on 73821 degrees of freedom
Residual deviance: 27127 on 73811 degrees of freedom
AIC: 27149
Number of Fisher Scoring iterations: 6
模型变量都通过检验。
3. 模型评估
使用AUC值对模型进行评估。
library(pROC)
library(e1071)
pre <- predict(glm1,testing)
modelroc <- roc(testing$y,pre)
plot(modelroc, print.auc=TRUE, auc.polygon=TRUE, grid=c(0.1, 0.2),grid.col=c("green", "red"), max.auc.polygon=TRUE,auc.polygon.col="skyblue", print.thres=TRUE)
最优点FPR=1-TNR=0.771,TPR=0.773,AUC值为0.851,说明模型预测效果不错。
五、 评分卡实施
1. 数据分箱
数据分箱方法分为手动(等距分箱和等宽分箱)与自动分箱(基于决策树算法的最优分箱)。本例采用最优分箱与手动分箱相结合。
library(smbinning)
#该列数据小数位数较长,全部纳入计算会出现内存不足。
xx1<-round(train33$x1,4)
a <-names(train3)%in%c("x1")
train341<-train33[!a]
train341<-cbind(train341,xx1)
smbin1<-smbinning(df=train341, y="y", x="xx1", p = 0.05)
smbin1$bands
smbinning.plot(smbin1,option="WoE",sub="xx1")
>
smbin1$bands
[1] 0.0000 0.1318 0.2177 0.3008 0.3852 0.4949 0.6980 0.9032 1.4995
smbin2<-smbinning(df=train33, y="y", x="x2", p = 0.05)
smbin2$bands
smbinning.plot(smbin2,option="WoE",sub="x2")
>
smbin2$bands
[1]
21 29 36
42 47 52
55 59 62 67
105
smbin3<-smbinning(df=train33, y="y", x="x3", p = 0.05)
smbin3$bands
smbinning.plot(smbin3,option="WoE",sub="x3")
>
smbin3$bands
[1] 0
0 1 13
xx4<-round(train33$x4,4)
a <-names(train3)%in%c("x4")
train344<-train33[!a]
train344<-cbind(train344,xx4)
smbin4<-smbinning(df=train344, y="y", x="xx4", p = 0.05)
smbin4$bands
smbinning.plot(smbin4,option="WoE",sub="xx4")
>
smbin4$bands
[1] 0.0000 0.0192 0.1371 0.4232 0.5041 0.6536 3.9723
[8] 329664.0000
smbin5<-smbinning(df=train33, y="y", x="x5", p = 0.05)
smbin5$bands
smbinning.plot(smbin5,option="WoE",sub="x5")
>
smbin5$bands
[1] 0 391
3331 4833 6643 835040
smbin6<-smbinning(df=train33, y="y", x="x6", p = 0.05)
smbin6$bands
smbinning.plot(smbin6,option="WoE",sub="x6")
>
smbin6$bands
[1] 0
2 3 13 57
smbin7<-smbinning(df=train33, y="y", x="x7", p = 0.01)
smbin7$bands
smbinning.plot(smbin7,option="WoE",sub="x7")
>
smbin7$bands
[1] 0
0 1 17
smbin10<-smbinning(df=train33, y="y", x="x10", p = 0.05)
smbin10$bands
smbinning.plot(smbin10,option="WoE",sub="x10")
>
smbin10$bands
[1] 0
0 1 2 10
在最优分箱的过程中,x8,x9数据不满足smbinning函数的输入数据要求,需要进行手东的分箱。
getwoe<-function(xx,a,b)
{
good<-nrow(train33[which(train33$y==0&train33[,xx]>=a&train33[,xx]<b),])
bad<-nrow(train33[which(train33$y==1&train33[,xx]>=a&train33[,xx]<b),])
Tgood<-nrow(train33[which(train33$y==0),])
Tbad<-nrow(train33[which(train33$y==1),])
woe<-log((bad*Tgood)/(good*Tbad))
return(woe)
}
table(train33$x8[duplicated(train33$x8)])
barplot(c(getwoe(9,0,1),getwoe(9,1,2),getwoe(9,2,3),getwoe(9,3,4),getwoe(9,4,5),getwoe(9,5,6),getwoe(9,6,7),getwoe(9,7,70)))
c(getwoe(9,0,1),getwoe(9,1,2),getwoe(9,2,3),getwoe(9,3,4),getwoe(9,4,5),getwoe(9,5,6),getwoe(9,6,7),getwoe(9,7,70))
第二项与第三项水平相当,可以合并。
barplot(c(getwoe(9,0,1),getwoe(9,1,3),getwoe(9,3,4),getwoe(9,4,5),getwoe(9,5,6),getwoe(9,6,7),getwoe(9,7,70)))
分区为:[0,2,3,4,5,6]
[1] 0.2162265 -0.2093599 0.0216662 0.3377388 0.6708343 0.9103778 1.2689269
table(train33$x9[duplicated(train33$x9)])
barplot(c(getwoe(10,0,1),getwoe(10,1,2),getwoe(10,2,3),getwoe(10,3,4),getwoe(10,4,5),getwoe(10,5,6),getwoe(10,6,7),getwoe(10,7,70)))
c(getwoe(10,0,1),getwoe(10,1,2),getwoe(10,2,3),getwoe(10,3,4),getwoe(10,4,5),getwoe(10,5,6),getwoe(10,6,7),getwoe(10,7,70))
第三到第六相当合并
barplot(c(getwoe(10,0,1),getwoe(10,1,2),getwoe(10,2,6),getwoe(10,6,7),getwoe(10,7,70)))
c(getwoe(10,0,1),getwoe(10,1,2),getwoe(10,2,6),getwoe(10,6,7),getwoe(10,7,70))
分区为:[0,1,5,6]
[1] -0.272719 1.848474 2.766116 3.764645 2.666032
2. 评分卡刻度
主要工作为:
(1)指定某个特定比率的预期分值;
(2)指定翻倍比率的分值
设定1:20比率时分数为600分,违约率翻倍时减少的分数为20分。
B=20/ln(2)
B=28.8539
A=600+28.8539*ln(0.05)
A=513.5614
评分卡刻度:
3. 生成评分卡
base |
|
609 |
RevolvingUtilizationOfUnsecuredLines |
<= 0.13 |
38 |
|
<= 0.22 |
22 |
|
<= 0.30 |
16 |
|
<= 0.39 |
7 |
|
<= 0.50 |
0 |
|
<= 0.70 |
-12 |
|
<= 0.90 |
-26 |
|
> 0.90 |
-41 |
Age |
<= 29 |
-17 |
|
<= 36 |
-14 |
|
<= 42 |
-10 |
|
<= 47 |
-7 |
|
<= 52 |
-4 |
|
<= 55 |
-1 |
|
<= 59 |
7 |
|
<= 62 |
10 |
|
<= 67 |
21 |
|
> 67 |
32 |
NumberOfTime30-59DaysPastDueNotWorse |
<= 0 |
15 |
|
<= 1 |
-26 |
|
> 1 |
-54 |
DebtRatio |
<= 0.02 |
14 |
|
<= 0.14 |
-1 |
|
<= 0.42 |
4 |
|
<= 0.50 |
-3 |
|
<= 0.65 |
-10 |
|
<= 3.97 |
-18 |
|
> 3.97 |
6 |
MonthlyIncome |
<= 391 |
14 |
|
<= 3331 |
-10 |
|
<= 4833 |
-6 |
|
<= 6643 |
-1 |
|
> 6643 |
9 |
NumberOfOpenCreditLinesAndLoans |
<= 2 |
-19 |
|
<= 3 |
-4 |
|
<= 13 |
4 |
|
> 13 |
-2 |
NumberOfTimes90DaysLate |
<= 0 |
11 |
|
<= 1 |
-57 |
|
> 1 |
-83 |
NumberRealEstateLoansOrLines |
<= 0 |
-6 |
|
<= 2 |
6 |
|
<= 3 |
-1 |
|
<= 4 |
-10 |
|
<= 5 |
-19 |
|
<= 6 |
-26 |
|
> 6 |
-37 |
NumberOfTime60-89DaysPastDueNotWorse |
<= 0 |
-8 |
|
<= 1 |
-53 |
|
<= 5 |
-80 |
|
<= 6 |
-109 |
|
> 6 |
-77 |
NumberOfDependents |
<= 0 |
5 |
|
<= 1 |
-3 |
|
<= 2 |
-6 |
|
> 2 |
-11 |
4. 分值分配
建立一个用户数据自动进行打分的模块:
owoe<-function(x,n)#除x8,x9的woe
{
woe1<-((get(paste("smbin",n,sep="")))$ivtable)[table(c((get(paste("smbin",n,sep=""))$cuts<x),"TRUE"))[["TRUE"]],13]
return(woe1)
}
owoe1<-function(x,n)#x8,x9的woe
{
woe2<-(get(paste("c",n,sep="")))[table(c((get(paste("smbin",n,sep=""))<x),"TRUE"))[["TRUE"]]]
return(woe2)
}
twoe=c()
for (i in 1:150000)
{ twoe1<-(513.5614-28.8539*(-3.310338803))-28.8539*(owoe(train2[i,1],1)+owoe(train2[i,2],2)+owoe(train2[i,3],3)+owoe(train2[i,4],4)+owoe(train2[i,5],5)+owoe(train2[1,6],6)+owoe(train2[i,7],7)+owoe1(train2[i,7],8)+owoe1(train2[i,7],9)+owoe(train2[i,10],10))
twoe<-c(twoe,twoe1)
}
JG<-cbind(train2,twoe)
最后对评分结果与违约率关系进行可视化展现
p<-ggplot(JG,aes(x=twoe))
p+geom_histogram(binwidth =20,aes(fill=factor(y)),position = "stack")+scale_color_manual(values = c("red4","blue4"))
p<-ggplot(JG,aes(x=twoe))
p+geom_histogram(binwidth =20,aes(fill=factor(y)),position = "fill")+scale_color_manual(values = c("red4","blue4"))
由于低分段人数较少,导致实际结果比率不稳定,但是总体上对不同违约比率的人群做了较好的划分。
六、 番外
在完成评分卡之余,把数据来源的kaggle竞赛结果提交后,有84.5%的预测正确,还不错。
#导入数据
adress1<-"E:\\project\\pfk\\cs-test.csv"
text1<-read.table(adress1,header=TRUE,sep=",",row.names="id")
#重命名列
names(text1)[1:11]<-c("y","x1","x2","x3","x4","x5","x6","x7","x8","x9","x10")
#缺失值信息
md.pattern(text1)
#缺失值图
matrixplot(text1)
#填补缺失值
text2<-knnImputation(text1[,2:11],k=11, scale =T,meth = "median", distData =NULL)
#建立模型
glm2<-glm(y~.,family = binomial(link = "logit"),data = train33)
summary(glm2)
#输出预测
library(pROC)
library(e1071)
pre1 <- predict(glm2,text2)
pre1 <- exp(pre1)/(1+exp(pre1))
text4<-cbind(text2,pre1)
adress2<-"E:\\project\\pfk\\jg.csv"
write.csv(pre1,file=adress2,quote=T,row.name=T)