I am trying to implement batch gradient descent on a data set with a single feature and multiple training examples (m
).
我正在尝试在一个具有单个特性和多个训练示例(m)的数据集上实现批处理梯度下降。
When I try using the normal equation, I get the right answer but the wrong one with this code below which performs batch gradient descent in MATLAB.
当我尝试使用正常的方程时,我得到了正确的答案,但错误的答案是下面的代码,在MATLAB中执行批量梯度下降。
function [theta] = gradientDescent(X, y, theta, alpha, iterations)
m = length(y);
delta=zeros(2,1);
for iter =1:1:iterations
for i=1:1:m
delta(1,1)= delta(1,1)+( X(i,:)*theta - y(i,1)) ;
delta(2,1)=delta(2,1)+ (( X(i,:)*theta - y(i,1))*X(i,2)) ;
end
theta= theta-( delta*(alpha/m) );
computeCost(X,y,theta)
end
end
y
is the vector with target values, X
is a matrix with the first column full of ones and second columns of values (variable).
y是带有目标值的向量,X是一个矩阵,第一列是满的,第二列是值(变量)。
I have implemented this using vectorization, i.e
我已经用矢量化来实现它,也就是。
theta = theta - (alpha/m)*delta
... where delta is a 2 element column vector initialized to zeroes.
…其中,delta是初始化为零的两个元素列向量。
The cost function J(Theta)
is 1/(2m)*(sum from i=1 to m [(h(theta)-y)^2])
.
成本函数J()是1/(2m)*(从i=1到m [(h()-y))。
1 个解决方案
#1
24
The error is very simple. Your delta
declaration should be inside the first for
loop. Every time you accumulate the weighted differences between the training sample and output, you should start accumulating from the beginning.
这个错误非常简单。您的delta声明应该在第一个for循环中。每次你积累了训练样本和输出之间的加权差异,你就应该从一开始就开始积累。
By not doing this, what you're doing is accumulating the errors from the previous iteration which takes the error of the the previous learned version of theta
into account which isn't correct. You must put this at the beginning of the first for
loop.
通过不这样做,你所做的是积累之前迭代的错误,它将以前学过的版本的错误考虑进去,这是不正确的。您必须在第一个for循环的开始时将它放在前面。
In addition, you seem to have an extraneous computeCost
call. I'm assuming this calculates the cost function at every iteration given the current parameters, and so I'm going to create a new output array called cost
that shows you this at each iteration. I'm also going to call this function and assign it to the corresponding elements in this array:
此外,您似乎有一个额外的computeCost调用。我假设这是根据当前参数计算每个迭代的成本函数,所以我将创建一个新的输出数组,叫做cost,它在每次迭代中向你展示这个。我还将调用这个函数并将它赋给数组中的相应元素:
function [theta, costs] = gradientDescent(X, y, theta, alpha, iterations)
m = length(y);
costs = zeros(m,1); %// New
% delta=zeros(2,1); %// Remove
for iter =1:1:iterations
delta=zeros(2,1); %// Place here
for i=1:1:m
delta(1,1)= delta(1,1)+( X(i,:)*theta - y(i,1)) ;
delta(2,1)=delta(2,1)+ (( X(i,:)*theta - y(i,1))*X(i,2)) ;
end
theta= theta-( delta*(alpha/m) );
costs(iter) = computeCost(X,y,theta); %// New
end
end
A note on proper vectorization
FWIW, I don't consider this implementation completely vectorized. You can eliminate the second for
loop by using vectorized operations. Before we do that, let me cover some theory so we're on the same page. You are using gradient descent here in terms of linear regression. We want to seek the best parameters theta
that are our linear regression coefficients that seek to minimize this cost function:
FWIW,我不认为这个实现是完全向量化的。您可以使用矢量化操作消除第二个for循环。在此之前,让我来介绍一些理论,这样我们就可以在同一个页面上了。你使用的是线性回归的梯度下降。我们想要寻找最好的参数这是我们的线性回归系数寻求最小化这个代价函数:
m
corresponds to the number of training samples we have available and x^{i}
corresponds to the ith training example. y^{i}
corresponds to the ground truth value we have associated with the ith training sample. h
is our hypothesis, and it is given as:
m对应于我们现有的训练样本的数量,x {i}对应于第i个训练样本。y ^ {我}对应于地面真值与第i个训练样本。h是我们的假设,它是
Note that in the context of linear regression in 2D, we only have two values in theta
we want to compute - the intercept term and the slope.
注意,在二维线性回归的情况下,我们只需要计算两个值——截距项和斜率。
We can minimize the cost function J
to determine the best regression coefficients that can give us the best predictions that minimize the error of the training set. Specifically, starting with some initial theta
parameters... usually a vector of zeroes, we iterate over iterations from 1 up to as many as we see fit, and at each iteration, we update our theta
parameters by this relationship:
我们可以最小化成本函数J来确定最佳的回归系数,它能给我们最好的预测,从而最小化训练集的误差。通常一个0向量,我们迭代从1到尽可能多的迭代,在每次迭代中,我们通过这种关系更新我们的参数:
For each parameter we want to update, you need to determine the gradient of the cost function with respect to each variable and evaluate what that is at the current state of theta
. If you work this out using Calculus, we get:
对于我们想要更新的每个参数,您需要确定每个变量的成本函数的梯度,并计算在当前状态下的值。如果你用微积分算出来,我们得到:
If you're unclear with how this derivation happened, then I refer you to this nice Mathematics Stack Exchange post that talks about it:
如果你不清楚这个推导过程是如何发生的,那么我推荐你到这个漂亮的数学堆栈交换站,谈论它:
https://math.stackexchange.com/questions/70728/partial-derivative-in-gradient-descent-for-two-variables
Now... how can we apply this to our current problem? Specifically, you can calculate the entries of delta
quite easily analyzing all of the samples together in one go. What I mean is that you can just do this:
现在…我们如何将其应用于当前的问题?具体地说,你可以很容易地计算出所有的样本。我的意思是你可以这样做:
function [theta, costs] = gradientDescent(X, y, theta, alpha, iterations)
m = length(y);
costs = zeros(m,1);
for iter = 1 : iterations
delta1 = theta(1) - (alpha/m)*(sum((theta(1)*X(:,1) + theta(2)*X(:,2) - y).*X(:,1)));
delta2 = theta(2) - (alpha/m)*(sum((theta(1)*X(:,1) + theta(2)*X(:,2) - y).*X(:,2)));
theta = [delta1; delta2];
costs(iter) = computeCost(X,y,theta);
end
end
The operations on delta(1)
and delta(2)
can completely be vectorized in a single statement for both. What you are doing theta^{T}*X^{i}
for each sample i
from 1, 2, ..., m
. You can conveniently place this into a single sum
statement.
对于这两种情况,对delta(1)和(2)的运算可以完全向矢量化。你所做的θ^ { T } * X ^ {我}为每个样本我从1,2,…m,你可以方便地把它放到一个求和语句中。
We can go even further and replace this with purely matrix operations. First off, what you can do is compute theta^{T}*X^{i}
for each input sample X^{i}
very quickly using matrix multiplication. Suppose if:
我们可以更进一步,用纯矩阵运算来代替它。首先,你可以做的是计算θ^ { T } * X ^ {我}对每个输入样本X ^ {我}很快使用矩阵乘法。假设如果:
Here, X
is our data matrix which composes of m
rows corresponding to m
training samples and n
columns corresponding to n
features. Similarly, theta
is our learned weight vector from gradient descent with n+1
features accounting for the intercept term.
在这里,X是我们的数据矩阵,它构成了m行对应的m训练样本和n列对应的n个特征。类似地,是我们从梯度下降到n+1的学习权向量,计算了截距项。
If we compute X*theta
, we get:
如果我们计算X*,得到:
As you can see here, we have computed the hypothesis for each sample and have placed each into a vector. Each element of this vector is the hypothesis for the ith training sample. Now, recall what the gradient term of each parameter is in gradient descent:
正如你在这里看到的,我们已经计算了每个样本的假设,并将它们分别放到一个向量中。这个向量的每个元素都是第i个训练样本的假设。现在,回忆一下梯度下降时每个参数的梯度项是什么
We want to implement this all in one go for all of the parameters in your learned vector, and so putting this into a vector gives us:
我们想要把所有的参数都应用到你学过的向量中,把它放到一个向量中
Finally:
最后:
Therefore, we know that y
is already a vector of length m
, and so we can very compactly compute gradient descent at each iteration by:
因此,我们知道y已经是长度为m的向量,因此我们可以非常紧凑地计算每次迭代时的梯度下降:
theta = theta - (alpha/m)*X'*(X*theta - y);
.... so your code is now just:
....所以你的代码现在是:
function [theta, costs] = gradientDescent(X, y, theta, alpha, iterations)
m = length(y);
costs = zeros(m, 1);
for iter = 1 : iterations
theta = theta - (alpha/m)*X'*(X*theta - y);
costs(iter) = computeCost(X,y,theta);
end
end
#1
24
The error is very simple. Your delta
declaration should be inside the first for
loop. Every time you accumulate the weighted differences between the training sample and output, you should start accumulating from the beginning.
这个错误非常简单。您的delta声明应该在第一个for循环中。每次你积累了训练样本和输出之间的加权差异,你就应该从一开始就开始积累。
By not doing this, what you're doing is accumulating the errors from the previous iteration which takes the error of the the previous learned version of theta
into account which isn't correct. You must put this at the beginning of the first for
loop.
通过不这样做,你所做的是积累之前迭代的错误,它将以前学过的版本的错误考虑进去,这是不正确的。您必须在第一个for循环的开始时将它放在前面。
In addition, you seem to have an extraneous computeCost
call. I'm assuming this calculates the cost function at every iteration given the current parameters, and so I'm going to create a new output array called cost
that shows you this at each iteration. I'm also going to call this function and assign it to the corresponding elements in this array:
此外,您似乎有一个额外的computeCost调用。我假设这是根据当前参数计算每个迭代的成本函数,所以我将创建一个新的输出数组,叫做cost,它在每次迭代中向你展示这个。我还将调用这个函数并将它赋给数组中的相应元素:
function [theta, costs] = gradientDescent(X, y, theta, alpha, iterations)
m = length(y);
costs = zeros(m,1); %// New
% delta=zeros(2,1); %// Remove
for iter =1:1:iterations
delta=zeros(2,1); %// Place here
for i=1:1:m
delta(1,1)= delta(1,1)+( X(i,:)*theta - y(i,1)) ;
delta(2,1)=delta(2,1)+ (( X(i,:)*theta - y(i,1))*X(i,2)) ;
end
theta= theta-( delta*(alpha/m) );
costs(iter) = computeCost(X,y,theta); %// New
end
end
A note on proper vectorization
FWIW, I don't consider this implementation completely vectorized. You can eliminate the second for
loop by using vectorized operations. Before we do that, let me cover some theory so we're on the same page. You are using gradient descent here in terms of linear regression. We want to seek the best parameters theta
that are our linear regression coefficients that seek to minimize this cost function:
FWIW,我不认为这个实现是完全向量化的。您可以使用矢量化操作消除第二个for循环。在此之前,让我来介绍一些理论,这样我们就可以在同一个页面上了。你使用的是线性回归的梯度下降。我们想要寻找最好的参数这是我们的线性回归系数寻求最小化这个代价函数:
m
corresponds to the number of training samples we have available and x^{i}
corresponds to the ith training example. y^{i}
corresponds to the ground truth value we have associated with the ith training sample. h
is our hypothesis, and it is given as:
m对应于我们现有的训练样本的数量,x {i}对应于第i个训练样本。y ^ {我}对应于地面真值与第i个训练样本。h是我们的假设,它是
Note that in the context of linear regression in 2D, we only have two values in theta
we want to compute - the intercept term and the slope.
注意,在二维线性回归的情况下,我们只需要计算两个值——截距项和斜率。
We can minimize the cost function J
to determine the best regression coefficients that can give us the best predictions that minimize the error of the training set. Specifically, starting with some initial theta
parameters... usually a vector of zeroes, we iterate over iterations from 1 up to as many as we see fit, and at each iteration, we update our theta
parameters by this relationship:
我们可以最小化成本函数J来确定最佳的回归系数,它能给我们最好的预测,从而最小化训练集的误差。通常一个0向量,我们迭代从1到尽可能多的迭代,在每次迭代中,我们通过这种关系更新我们的参数:
For each parameter we want to update, you need to determine the gradient of the cost function with respect to each variable and evaluate what that is at the current state of theta
. If you work this out using Calculus, we get:
对于我们想要更新的每个参数,您需要确定每个变量的成本函数的梯度,并计算在当前状态下的值。如果你用微积分算出来,我们得到:
If you're unclear with how this derivation happened, then I refer you to this nice Mathematics Stack Exchange post that talks about it:
如果你不清楚这个推导过程是如何发生的,那么我推荐你到这个漂亮的数学堆栈交换站,谈论它:
https://math.stackexchange.com/questions/70728/partial-derivative-in-gradient-descent-for-two-variables
Now... how can we apply this to our current problem? Specifically, you can calculate the entries of delta
quite easily analyzing all of the samples together in one go. What I mean is that you can just do this:
现在…我们如何将其应用于当前的问题?具体地说,你可以很容易地计算出所有的样本。我的意思是你可以这样做:
function [theta, costs] = gradientDescent(X, y, theta, alpha, iterations)
m = length(y);
costs = zeros(m,1);
for iter = 1 : iterations
delta1 = theta(1) - (alpha/m)*(sum((theta(1)*X(:,1) + theta(2)*X(:,2) - y).*X(:,1)));
delta2 = theta(2) - (alpha/m)*(sum((theta(1)*X(:,1) + theta(2)*X(:,2) - y).*X(:,2)));
theta = [delta1; delta2];
costs(iter) = computeCost(X,y,theta);
end
end
The operations on delta(1)
and delta(2)
can completely be vectorized in a single statement for both. What you are doing theta^{T}*X^{i}
for each sample i
from 1, 2, ..., m
. You can conveniently place this into a single sum
statement.
对于这两种情况,对delta(1)和(2)的运算可以完全向矢量化。你所做的θ^ { T } * X ^ {我}为每个样本我从1,2,…m,你可以方便地把它放到一个求和语句中。
We can go even further and replace this with purely matrix operations. First off, what you can do is compute theta^{T}*X^{i}
for each input sample X^{i}
very quickly using matrix multiplication. Suppose if:
我们可以更进一步,用纯矩阵运算来代替它。首先,你可以做的是计算θ^ { T } * X ^ {我}对每个输入样本X ^ {我}很快使用矩阵乘法。假设如果:
Here, X
is our data matrix which composes of m
rows corresponding to m
training samples and n
columns corresponding to n
features. Similarly, theta
is our learned weight vector from gradient descent with n+1
features accounting for the intercept term.
在这里,X是我们的数据矩阵,它构成了m行对应的m训练样本和n列对应的n个特征。类似地,是我们从梯度下降到n+1的学习权向量,计算了截距项。
If we compute X*theta
, we get:
如果我们计算X*,得到:
As you can see here, we have computed the hypothesis for each sample and have placed each into a vector. Each element of this vector is the hypothesis for the ith training sample. Now, recall what the gradient term of each parameter is in gradient descent:
正如你在这里看到的,我们已经计算了每个样本的假设,并将它们分别放到一个向量中。这个向量的每个元素都是第i个训练样本的假设。现在,回忆一下梯度下降时每个参数的梯度项是什么
We want to implement this all in one go for all of the parameters in your learned vector, and so putting this into a vector gives us:
我们想要把所有的参数都应用到你学过的向量中,把它放到一个向量中
Finally:
最后:
Therefore, we know that y
is already a vector of length m
, and so we can very compactly compute gradient descent at each iteration by:
因此,我们知道y已经是长度为m的向量,因此我们可以非常紧凑地计算每次迭代时的梯度下降:
theta = theta - (alpha/m)*X'*(X*theta - y);
.... so your code is now just:
....所以你的代码现在是:
function [theta, costs] = gradientDescent(X, y, theta, alpha, iterations)
m = length(y);
costs = zeros(m, 1);
for iter = 1 : iterations
theta = theta - (alpha/m)*X'*(X*theta - y);
costs(iter) = computeCost(X,y,theta);
end
end