After getting to the library this morning, I started preparing for training. First I copied Andrew Ng's code, starting with the sigmoid function:
function g = sigmoid(z)
%SIGMOID Compute sigmoid function
%   g = SIGMOID(z) computes the sigmoid of z.
% You need to return the following variables correctly
g = zeros(size(z));
% ====================== YOUR CODE HERE ======================
% Instructions: Compute the sigmoid of each value of z (z can be a matrix,
% vector or scalar).
g = 1.0 ./ (1.0 + exp(-z));
% =============================================================
end
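A quick sanity check at the prompt: sigmoid should map 0 to 0.5 and saturate toward 0 and 1 at the extremes, for example:
octave> sigmoid([-10 0 10])   % should come out roughly as [0.0000 0.5000 1.0000]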
And the costFunctionReg function, which computes the objective J and its gradient:
function [J, grad] = costFunctionReg(theta, X, y, lambda)
%COSTFUNCTIONREG Compute cost and gradient for logistic regression with regularization
% J = COSTFUNCTIONREG(theta, X, y, lambda) computes the cost of using
% theta as the parameter for regularized logistic regression and the
% gradient of the cost w.r.t. the parameters.
% Initialize some useful values
m = length(y); % number of training examples
% You need to return the following variables correctly
J = 0;
grad = zeros(size(theta));
% ====================== YOUR CODE HERE ======================
% Instructions: Compute the cost of a particular choice of theta.
% You should set J to the cost.
% Compute the partial derivatives and set grad to the partial
% derivatives of the cost w.r.t. each parameter in theta
n = size(theta, 1); % number of dimensions
J = (1 / m) * sum(-y' * log(sigmoid(X * theta)) - (1 - y') * log(1 - sigmoid(X * theta))) ...
    + (lambda / (2 * m)) * (sum(theta .^ 2) - theta(1) ^ 2);
grad = (1 / m) * (X' * (sigmoid(X * theta) - y)) + (lambda / m) * theta;
grad(1) = grad(1) - (lambda / m) * theta(1); % the intercept term theta(1) is not regularized
% =============================================================
end
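Since fminunc will rely on this gradient, it is worth checking it once against a finite-difference approximation on a tiny example before running the real training; a minimal sketch (all toy values below are invented purely for illustration):
% toy data, invented purely for checking the gradient
X_toy = [1 0.5; 1 -1.2; 1 2.0; 1 0.1]; % 4 examples, intercept column included
y_toy = [1; 0; 1; 0];
theta_toy = [0.1; -0.2];
lambda = 1;
[J, grad] = costFunctionReg(theta_toy, X_toy, y_toy, lambda);
% two-sided finite-difference approximation of the gradient
step = 1e-4;
num_grad = zeros(size(theta_toy));
for j = 1:numel(theta_toy)
  e = zeros(size(theta_toy));
  e(j) = step;
  num_grad(j) = (costFunctionReg(theta_toy + e, X_toy, y_toy, lambda) ...
                 - costFunctionReg(theta_toy - e, X_toy, y_toy, lambda)) / (2 * step);
end
disp([grad num_grad]) % the two columns should agree to several decimal places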
And the predict function, used to make predictions once theta has been trained:
function p = predict(theta, X)
%PREDICT Predict whether the label is 0 or 1 using learned logistic
%regression parameters theta
% p = PREDICT(theta, X) computes the predictions for X using a
% threshold at 0.5 (i.e., if sigmoid(theta'*x) >= 0.5, predict 1)
m = size(X, 1); % Number of training examples
% You need to return the following variables correctly
p = zeros(m, 1);
% ====================== YOUR CODE HERE ======================
% Instructions: Complete the following code to make predictions using
% your learned logistic regression parameters.
% You should set p to a vector of 0's and 1's
%
p = sigmoid(X * theta);
for i = 1:m
  if (p(i) >= 0.5)
    p(i) = 1;
  else
    p(i) = 0;
  end
end
% =========================================================================
end
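The same thresholding can also be written without the loop; a one-line equivalent using the same >= 0.5 rule:
p = double(sigmoid(X * theta) >= 0.5);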
I also wrote my own accuracy function to compute the classification accuracy:
function [accuracy] = accuracy(p, label)
accuracy = mean(double(p == label)) * 100;
end
At first I tried to train on the whole of norm_1, but I quickly realized how foolish that was, since I had no idea when the training would ever finish. So I started from 50 examples and worked my way up: 500 examples took about 5 minutes, and 5000 examples took a bit over two hours:
octave> [theta, J, exit_flag] = fminunc(@(t)(costFunctionReg(t, norm_1(1:5000, :), label_1(1:5000, :), 0.001)), initial_theta, options);
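For this call to work, initial_theta and options have to be defined beforehand; a minimal sketch along the lines of the course exercise (assuming norm_1 already contains the intercept column, and the MaxIter value is only a placeholder):
initial_theta = zeros(size(norm_1, 2), 1);           % one parameter per feature
options = optimset('GradObj', 'on', 'MaxIter', 400); % use the analytic gradient from costFunctionReg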
Then, using the resulting theta, I measured the accuracy on the norm_2 data, and it turned out to be only 43.580:
octave> accuracy(predict(theta, norm_2), label_2)
ans = 43.580
I did a rough calculation: a coin-flip classifier assigns 50% of examples to the positive class and 50% to the negative class. For data in which 10% of examples are positive, it classifies 10% * 50% of all examples correctly as positives and 90% * 50% correctly as negatives, so the coin-flip accuracy should be:
(10% * 50% + 90% * 50%) / 1 = 50%
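A quick simulation confirms this figure (the labels below are randomly generated just for illustration, with a 10% positive rate):
n = 100000;
labels = double(rand(n, 1) < 0.1);  % made-up labels, roughly 10% positive
guesses = double(rand(n, 1) < 0.5); % coin-flip classifier
mean(guesses == labels) * 100       % should come out close to 50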
Then I went to look at the LeaderBoard. At the moment first place is maternaj with a score of 0.62376, although that score covers both the regression and the classification parts, and the evaluation metric is MAE (Mean Absolute Error), computed as follows:
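MAE is the mean of the absolute differences between the predictions and the true values, i.e. MAE = (1/n) * sum(|predicted_i - actual_i|); in Octave that is simply something like:
score = mean(abs(predicted - actual)); % variable names here are hypothetical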
In other words, the lower the score, the higher the ranking.
After that setback, I decided to switch to SVM. While looking for an SVM toolbox I came across this site:
http://www.support-vector-machines.org/SVM_soft.html
It lists a number of well-regarded SVM implementations.
I remember Andrew Ng recommending LIBSVM in his lectures, so I will try that one first.
Then I started training:
load("~/norm_1.mat");load("~/label_1.mat");
model_1 = svmtrain(label_1, norm_1, "-s 2 -d 3 -b 0 -q 0")
To my surprise, a whole day went into the training and there was still no result.
In the afternoon I finally came to my senses: once again I had not started with a small dataset. So I reran everything from scratch and found that something was seriously wrong. While training step by step, I also recorded the time taken, the training accuracy, and the test accuracy:
% train
octave> model_1 = svmtrain(label_1(1:50), norm_1(1:50, :), "-s 2 -d 3 -b 0 -q 0")
% training accuracy
octave> [predicted_label, accuracy, prob_estimates] = svmpredict(label_1(1:50), norm_1(1:50, :), model_1);
% test accuracy
octave> [predicted_label, accuracy, prob_estimates] = svmpredict(label_2, norm_2, model_1);
The results were as follows:
+-------------+-------+----------------+---------------+
| data amount | time  | train accuracy | test accuracy |
+-------------+-------+----------------+---------------+
| 50          | 5s    | 4%             | 2.145%        |
| 100         | 15s   | 6%             | 2.54%         |
| 200         | 20s   | 6.5%           | 3.54%         |
| 400         | 1m    | 5.25%          | 3.275%        |
| 1000        | 4m    | 5%             | 4.225%        |
| 2000        | 7m    | 4.9%           | 4.44%         |
| 4000        | 18m   | 4.15%          | 4.31%         |
| 10000       | 1h48m | 4.5%           | 4.37%         |
+-------------+-------+----------------+---------------+
Note that the figures recorded here are accuracies, and they did not even reach double digits. Right now I feel really uneasy and have absolutely no idea what is going on...
A side note: while the training was running I had nothing to do, and while browsing Weibo I found that Mitchell, the author of Machine Learning, also has a public lecture video on semi-supervised learning. I plan to watch it when I have time:
http://videolectures.net/mlas06_mitchell_sla/