I am new to machine learning with Python. I've managed to draw a straight decision boundary for logistic regression using matplotlib. However, I am having difficulty plotting a curved decision boundary to illustrate overfitting on a sample dataset.
I am trying to build a logistic regression model and use regularization to control overfitting on my dataset.
I am aware of the sklearn library, but I prefer to write the code myself.
The sample data I am working with is given below:
x=np.matrix('2,300;4,600;7,300;5,500;5,400;6,400;3,400;4,500;1,200;3,400;7,700;3,550;2.5,650')
y=np.matrix('0;1;1;1;0;1;0;0;0;0;1;1;0')
The decision boundary I am expecting is given in the graph below:
Any help would be appreciated.
I could plot a straight decision boundary using the code below:
# plot of x 2D
plt.figure()
pos=np.where(y==1)
neg=np.where(y==0)
plt.plot(X[pos[0],0], X[pos[0],1], 'ro')
plt.plot(X[neg[0],0], X[neg[0],1], 'bo')
plt.xlim([min(X[:,0]),max(X[:,0])])
plt.ylim([min(X[:,1]),max(X[:,1])])
plt.show()
# plot of the decision boundary
plt.figure()
pos=np.where(y==1)
neg=np.where(y==0)
plt.plot(x[pos[0],1], x[pos[0],2], 'ro')
plt.plot(x[neg[0],1], x[neg[0],2], 'bo')
plt.xlim([x[:, 1].min()-2 , x[:, 1].max()+2])
plt.ylim([x[:, 2].min()-2 , x[:, 2].max()+2])
plot_x = np.array([x[:, 1].min()-2, x[:, 1].max()+2]) # Extends the decision line slightly beyond the data
plot_y = (-1/theta_NM[2])*(theta_NM[1]*plot_x +theta_NM[0])
plt.plot(plot_x, plot_y)
And my decision boundary looks like this:
In an ideal scenario the above decision boundary is good, but I would like to plot a curved decision boundary that fits my training data very well and therefore overfits, meaning it would generalize poorly to test data. Something similar to what is shown in the first plot.
1 solution
#1
This can be done by gridding the feature space (the x-y plane the data lives in), setting each grid point to the value of the closest data point, and then running a contour plot on this grid.
There are numerous variations, such as setting each grid point to a distance-weighted average of the data values, or smoothing the final contour.
Here's an example for finding the initial contour:
import numpy as np
import matplotlib.pyplot as plt
# get the data as numpy arrays
xys = np.array(np.matrix('2,300;4,600;7,300;5,500;5,400;6,400;3,400;4,500;1,200;3,400;7,700;3,550;2.5,650'))
vals = np.array(np.matrix('0;1;1;1;0;1;0;0;0;0;1;1;0'))[:,0]
N = len(vals)
# some basic spatial stuff
xs = np.linspace(min(xys[:,0])-2, max(xys[:,0])+1, 10)
ys = np.linspace(min(xys[:,1])-100, max(xys[:,1])+100, 10)
xr = max(xys[:,0]) - min(xys[:,0]) # ranges so distances can weight x and y equally
yr = max(xys[:,1]) - min(xys[:,1])
X, Y = np.meshgrid(xs, ys) # meshgrid for contour and distance calcs
# set each gridpoint to the value of the closest data point:
Z = np.zeros(X.shape + (N,)) # meshgrid arrays have shape (len(ys), len(xs))
for n in range(N):
    # stack arrays of squared, range-normalized distances to each data point
    Z[:,:,n] = ((X-xys[n,0])/xr)**2 + ((Y-xys[n,1])/yr)**2
z = np.argmin(Z, axis=2) # which data point is the closest to each grid point
v = vals[z] # set the grid value to the data point value
# do the contour plot (use only the level 0.5 since values are 0 and 1)
plt.contour(X, Y, v, cmap=plt.cm.gray, levels=[.5]) # contour the data point values
# now plot the data points
pos = vals == 1 # boolean masks select the rows of each class
neg = vals == 0
plt.plot(xys[pos,0], xys[pos,1], 'ro')
plt.plot(xys[neg,0], xys[neg,1], 'bo')
plt.show()
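One of the variations mentioned above, the distance-weighted average, could look roughly like the sketch below. It is only an illustration under some assumptions: the helper name `idw_grid`, the `power` exponent, the small `eps` that avoids division by zero, and the finer 100x100 grid are all choices made here, not part of the original example. Each grid point gets a weighted mean of the 0/1 labels, and the 0.5 contour is drawn as the boundary.

```python
import numpy as np
import matplotlib.pyplot as plt

# Same sample data as in the example above
xys = np.array([[2, 300], [4, 600], [7, 300], [5, 500], [5, 400], [6, 400],
                [3, 400], [4, 500], [1, 200], [3, 400], [7, 700], [3, 550],
                [2.5, 650]], dtype=float)
vals = np.array([0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0], dtype=float)


def idw_grid(xys, vals, xs, ys, power=2.0, eps=1e-12):
    """Inverse-distance-weighted average of vals at every grid point.

    `power` and `eps` are illustrative defaults (assumptions).
    """
    xr = np.ptp(xys[:, 0])  # ranges, so x and y are weighted equally
    yr = np.ptp(xys[:, 1])
    X, Y = np.meshgrid(xs, ys)
    # squared normalized distances, shape (len(ys), len(xs), N)
    d2 = (((X[..., None] - xys[:, 0]) / xr) ** 2
          + ((Y[..., None] - xys[:, 1]) / yr) ** 2)
    w = 1.0 / (d2 ** (power / 2.0) + eps)  # closer points get larger weights
    V = (w * vals).sum(axis=2) / w.sum(axis=2)
    return X, Y, V


xs = np.linspace(xys[:, 0].min() - 2, xys[:, 0].max() + 1, 100)
ys = np.linspace(xys[:, 1].min() - 100, xys[:, 1].max() + 100, 100)
X, Y, V = idw_grid(xys, vals, xs, ys)

if __name__ == "__main__":
    # the 0.5 level separates "mostly class 1" from "mostly class 0"
    plt.contour(X, Y, V, levels=[0.5], colors="k")
    plt.plot(xys[vals == 1, 0], xys[vals == 1, 1], "ro")
    plt.plot(xys[vals == 0, 0], xys[vals == 0, 1], "bo")
    plt.show()
```

A larger `power` makes the result behave more like the nearest-point version, while a smaller one blends more neighbours and tends to give a smoother boundary.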
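Since the question is about logistic regression with regularization, a curved boundary can also come from the model itself, by feeding it polynomial features and varying the regularization strength. The sketch below is a different technique from the nearest-point contour and is not the asker's code. The polynomial degree (`DEGREE = 6`), the two `lam` values, the learning rate, the iteration count, and all helper names are assumptions chosen for illustration. The boundary is drawn as the zero contour of the linear score over the mapped features.

```python
import numpy as np
import matplotlib.pyplot as plt

# Sample data from the question: columns are x1, x2; y holds the labels
X_raw = np.array([[2, 300], [4, 600], [7, 300], [5, 500], [5, 400], [6, 400],
                  [3, 400], [4, 500], [1, 200], [3, 400], [7, 700], [3, 550],
                  [2.5, 650]], dtype=float)
y = np.array([0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0], dtype=float)

DEGREE = 6  # assumed polynomial degree; higher values allow wigglier boundaries
lo, hi = X_raw.min(axis=0), X_raw.max(axis=0)


def scale(X):
    """Min-max scale each feature to [-1, 1] so high powers stay bounded."""
    return 2.0 * (X - lo) / (hi - lo) - 1.0


def map_features(x1, x2, degree=DEGREE):
    """All monomials x1**i * x2**j with i + j <= degree (constant term first)."""
    cols = [x1 ** (t - j) * x2 ** j
            for t in range(degree + 1) for j in range(t + 1)]
    return np.stack(cols, axis=-1)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def cost(theta, Phi, y, lam):
    """Mean logistic loss plus an L2 penalty (constant term not penalized)."""
    z = Phi @ theta
    loss = np.mean(np.logaddexp(0.0, z) - y * z)
    return loss + lam / (2 * len(y)) * np.sum(theta[1:] ** 2)


def fit(Phi, y, lam, alpha=0.1, iters=20000):
    """Plain batch gradient descent on the regularized cost."""
    m, n = Phi.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = Phi.T @ (sigmoid(Phi @ theta) - y) / m
        grad[1:] += lam / m * theta[1:]
        theta -= alpha * grad
    return theta


Xs = scale(X_raw)
Phi = map_features(Xs[:, 0], Xs[:, 1])
theta_loose = fit(Phi, y, lam=0.0)    # no penalty: free to bend around points
theta_tight = fit(Phi, y, lam=10.0)   # strong penalty: smoother boundary


def plot_boundary(theta, color):
    """Draw the curve where the linear score theta . phi(x) equals zero."""
    g1, g2 = np.meshgrid(np.linspace(lo[0] - 1, hi[0] + 1, 200),
                         np.linspace(lo[1] - 50, hi[1] + 50, 200))
    G = scale(np.stack([g1, g2], axis=-1))
    score = map_features(G[..., 0], G[..., 1]) @ theta
    plt.contour(g1, g2, score, levels=[0.0], colors=color)


if __name__ == "__main__":
    plot_boundary(theta_loose, "g")  # green: unregularized fit
    plot_boundary(theta_tight, "k")  # black: heavily regularized fit
    plt.plot(X_raw[y == 1, 0], X_raw[y == 1, 1], "ro")
    plt.plot(X_raw[y == 0, 0], X_raw[y == 0, 1], "bo")
    plt.show()
```

Scaling the inputs to [-1, 1] before taking powers keeps the feature values bounded, which is why a fixed learning rate of 0.1 is workable here. How curved the unregularized boundary ends up depends on the degree and the number of iterations, so those are worth experimenting with.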