Machine Learning in Action: Logistic Regression

Date: 2021-01-20 23:51:58
Book: Machine Learning in Action (Chinese edition)
IDE: PyCharm Edu 4.02

Environment: Anaconda3, Python 3.6


Keywords: sigmoid function, batch gradient ascent, stochastic gradient ascent


from numpy import *
import matplotlib.pyplot as plt

def loadDataSet():
    dataMat = []
    labelMat = []
    with open('testSet.txt') as fr:
        for line in fr.readlines():
            lineArr = line.strip().split()
            dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
            labelMat.append(int(lineArr[2]))
    return dataMat, labelMat

dataMat, labelMat = loadDataSet()

def sigmoid(inX):
    return 1.0 / (1 + exp(-inX))
# Batch gradient ascent (computationally expensive)
def gradAscent(dataMatIn, classLabels):
    dataMatrix = mat(dataMatIn)              # convert to NumPy matrix, 100 x 3
    labelMat = mat(classLabels).transpose()  # 100 x 1
    m, n = shape(dataMatrix)
    alpha = 0.001
    maxCycles = 500                          # number of iterations
    weights = ones((n, 1))                   # 3 x 1 matrix
    for k in range(maxCycles):
        h = sigmoid(dataMatrix * weights)    # both operands are matrices, so * is matrix multiplication
        error = labelMat - h
        weights = weights + alpha * dataMatrix.transpose() * error  # batch gradient ascent update
    return weights

weights1 = gradAscent(dataMat, labelMat)
# print(weights1)  # print(weights1.getA())
# Plot the data set and the logistic regression best-fit line
def plotBestFit(weights):
    dataArr = array(dataMat)   # for 2-D input, array() and mat() lay the data out the same way
    n = shape(dataArr)[0]      # number of rows
    xcord1 = []; ycord1 = []
    xcord2 = []; ycord2 = []
    for i in range(n):
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i, 1])
            ycord1.append(dataArr[i, 2])
        else:
            xcord2.append(dataArr[i, 1])
            ycord2.append(dataArr[i, 2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')
    ax.scatter(xcord2, ycord2, s=30, c='green')
    x = arange(-3.0, 3.0, 0.1)  # array of shape (60,)
    # sigmoid(z) = 0.5 at z = 0, so z = 0 is the decision boundary
    # z = w0*x0 + w1*x1 + w2*x2; set z = 0 with x0 = 1 and solve for x2 in terms of x1
    y = (-weights[0] - weights[1] * x) / weights[2]  # shape (1, 60) when weights is a matrix
    # book's version: ax.plot(x, y)
    ax.plot(x, y.transpose())
    plt.xlabel('X1'); plt.ylabel('X2')
    plt.show()

# book's version: plotBestFit(weights1.getA())
# print(plotBestFit(weights1))
# Stochastic gradient ascent
def stocGradAscent0(dataMatrix, classLabels):
    m, n = shape(dataMatrix)
    alpha = 0.01
    weights = ones(n)  # 1-D ndarray
    for i in range(m):
        h = sigmoid(sum(dataMatrix[i] * weights))  # element-wise product, then sum: w0x0 + w1x1 + w2x2
        error = classLabels[i] - h
        weights = weights + alpha * error * dataMatrix[i]
    return weights

weights2 = stocGradAscent0(array(dataMat), labelMat)
# print(weights2)
# print(plotBestFit(weights2))
# Improved stochastic gradient ascent
# alpha decreases as the iterations progress
def stocGradAscent1(dataMatrix, classLabels, numIter=150):
    m, n = shape(dataMatrix)
    weights = ones(n)               # 1-D ndarray
    dataMatrix = array(dataMatrix)  # convert to a NumPy array
    for j in range(numIter):
        # book's version: dataIndex = range(m)
        dataIndex = list(range(m))
        for i in range(m):
            # update the weights with one randomly chosen sample
            alpha = 4 / (1.0 + j + i) + 0.001  # alpha shrinks with iteration but never reaches 0,
                                               # thanks to the constant term
            randIndex = int(random.uniform(0, len(dataIndex)))
            # book's version indexed dataMatrix[randIndex] directly; going through
            # dataIndex ensures each sample is visited exactly once per pass
            sampleIndex = dataIndex[randIndex]
            h = sigmoid(sum(dataMatrix[sampleIndex] * weights))
            error = classLabels[sampleIndex] - h
            weights = weights + alpha * error * dataMatrix[sampleIndex]
            del(dataIndex[randIndex])
    return weights

weights3 = stocGradAscent1(dataMat, labelMat)
print(plotBestFit(weights3))


Notes:

1. NumPy: converting between matrix and array

np.mat(x): convert an object to a matrix.

matrix.getA(): convert a matrix back to an ndarray.

Example: batch gradient ascent returns a matrix, weights1, while plotBestFit(weights)
expects an array, so the call should be plotBestFit(weights1.getA()).


Calling plotBestFit(weights1) directly raises:
x and y must have same first dimension, but have shapes (60,) and (1, 60)

Workaround: change the book's ax.plot(x, y) to ax.plot(x, y.transpose()).
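The shape mismatch can be reproduced in a few lines. A minimal sketch (toy weight values, not from the book) of why matrix-typed weights yield a (1, 60) y while the .getA() ndarray yields (60,):

```python
import numpy as np

# Indexing a matrix keeps two dimensions, while rows of a .getA() ndarray
# broadcast down to 1-D.
w_mat = np.mat([[4.0], [0.5], [-0.6]])   # same shape as gradAscent's result: 3 x 1 matrix
x = np.arange(-3.0, 3.0, 0.1)            # shape (60,)

y_mat = (-w_mat[0] - w_mat[1] * x) / w_mat[2]
print(y_mat.shape)   # (1, 60): still a matrix, mismatches x in ax.plot

w_arr = w_mat.getA()                     # the same numbers as a (3, 1) ndarray
y_arr = (-w_arr[0] - w_arr[1] * x) / w_arr[2]
print(y_arr.shape)   # (60,): broadcasts down to match x
```

This is why either weights1.getA() or y.transpose() fixes the plotting call.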

  

2. Distinguishing Python lists from NumPy matrices and arrays

(1)

A Python list prints with commas: [1, 1, 1, 1, 1]

print(ones(5))       # 1-D array
print(ones((5, 1)))  # 5 x 1 column (2-D array)
[ 1.  1.  1.  1.  1.]
[[ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]]

(2)

matrix objects: * means matrix multiplication.

ndarray objects: * means element-wise multiplication; dot(A, B) performs matrix multiplication.

(Under dot(), a 2-D ndarray behaves the same as a matrix.)

Python lists: print([1, 2, 3] * 2) gives [1, 2, 3, 1, 2, 3], i.e. repetition, not scalar multiplication.

To scale every element of a list, use a list comprehension.
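A short sketch (with assumed toy values) contrasting what * does on each of the three types:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[1, 0], [0, 1]])   # identity, so the matrix product equals A

print(A * B)                   # element-wise: [[1 0] [0 4]]
print(np.dot(A, B))            # matrix product: [[1 2] [3 4]]
print(np.mat(A) * np.mat(B))   # matrix objects: * is the matrix product

lst = [1, 2, 3]
print(lst * 2)                 # [1, 2, 3, 1, 2, 3]: repetition
print([2 * v for v in lst])    # [2, 4, 6]: scalar multiplication via comprehension
```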


(3) Remember the update rule behind stochastic gradient ascent:

weights = weights + alpha * error * dataMatrix[randIndex]
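This update falls out of maximizing the log-likelihood of logistic regression. Sketched for a single sample (x, y) with hypothesis h_w(x) = sigmoid(wᵀx):

```latex
\ell(w) = y \log h_w(x) + (1 - y)\log\bigl(1 - h_w(x)\bigr),
\qquad h_w(x) = \sigma(w^{\top} x)

% using \sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr):
\frac{\partial \ell}{\partial w_j} = \bigl(y - h_w(x)\bigr)\, x_j
\quad\Longrightarrow\quad
w \leftarrow w + \alpha \bigl(y - h_w(x)\bigr)\, x
```

The factor (y − h_w(x)) is exactly the `error` variable in the code, and x is the randomly chosen sample row.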