2014.2.16 - Loan Prediction, Day 5

Posted: 2021-09-25 20:44:56

After getting to the library in the morning, I started preparing to train. The first step was copying Andrew Ng's code, beginning with the sigmoid function:

function g = sigmoid(z)
%SIGMOID Compute sigmoid function
%   J = SIGMOID(z) computes the sigmoid of z.

% You need to return the following variables correctly
g = zeros(size(z));

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the sigmoid of each value of z (z can be a matrix,
%               vector or scalar).

g = 1.0 ./ (1.0 + exp(-z)); % elementwise; works for scalars, vectors, and matrices

% =============================================================

end
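
A quick sanity check of the expected behavior (my own test, not part of the exercise):

octave> sigmoid(0)         % expect exactly 0.5
octave> sigmoid([-5 0 5])  % expect roughly [0.0067 0.5 0.9933], elementwise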

Next, the costFunctionReg function, which computes the objective J and its gradient:


function [J, grad] = costFunctionReg(theta, X, y, lambda)
%COSTFUNCTIONREG Compute cost and gradient for logistic regression with regularization
%   J = COSTFUNCTIONREG(theta, X, y, lambda) computes the cost of using
%   theta as the parameter for regularized logistic regression and the
%   gradient of the cost w.r.t. to the parameters.

% Initialize some useful values
m = length(y); % number of training examples

% You need to return the following variables correctly
J = 0;
grad = zeros(size(theta));

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the cost of a particular choice of theta.
%               You should set J to the cost.
%               Compute the partial derivatives and set grad to the partial
%               derivatives of the cost w.r.t. each parameter in theta

h = sigmoid(X * theta); % hypothesis values for all m examples
% cross-entropy cost plus L2 penalty; theta(1), the bias term, is not regularized
J = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h)) + (lambda / (2 * m)) * sum(theta(2:end) .^ 2);
grad = (1 / m) * (X' * (h - y));
grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end); % penalize every weight except the bias

% =============================================================

end
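
For reference, this is what the function computes: the regularized cost and its gradient, with the bias term excluded from the penalty (indices follow the 1-based Octave convention, and $h_\theta(x) = \mathrm{sigmoid}(\theta^T x)$):

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log h_\theta(x^{(i)}) - (1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=2}^{n}\theta_j^2$$

$$\frac{\partial J}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j \quad (j \ge 2),$$

with no $\frac{\lambda}{m}\theta_j$ term for $j = 1$.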

And the predict function, used to make predictions once theta has been trained:

function p = predict(theta, X)
%PREDICT Predict whether the label is 0 or 1 using learned logistic
%regression parameters theta
%   p = PREDICT(theta, X) computes the predictions for X using a
%   threshold at 0.5 (i.e., if sigmoid(theta'*x) >= 0.5, predict 1)

m = size(X, 1); % Number of training examples

% You need to return the following variables correctly
p = zeros(m, 1);

% ====================== YOUR CODE HERE ======================
% Instructions: Complete the following code to make predictions using
%               your learned logistic regression parameters.
%               You should set p to a vector of 0's and 1's
%

p = sigmoid(X * theta); % probability estimates for all m examples

% threshold at 0.5, as the spec above says
% (equivalently: p = double(sigmoid(X * theta) >= 0.5);)
for i = 1:m
  if (p(i) >= 0.5)
    p(i) = 1;
  else
    p(i) = 0;
  end
end

% =========================================================================


end

I also wrote an accuracy function of my own to compute classification accuracy:

function [accuracy] = accuracy(p, label)
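%ACCURACY Percentage of predictions in p that match the true labels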

  accuracy = mean(double(p == label)) * 100;

end
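
Before burning hours on the real data, a tiny smoke test can confirm the pieces fit together. This is my own sanity check on synthetic data, not part of the original session:

X = [ones(4, 1), [0; 1; 2; 3]]; % four examples plus a bias column
y = [0; 0; 1; 1];
theta0 = zeros(2, 1);
[J, grad] = costFunctionReg(theta0, X, y, 1)
% with theta = 0 every hypothesis is sigmoid(0) = 0.5, so J should be
% -log(0.5) ~= 0.6931 regardless of lambda
accuracy(predict(theta0, X), y) % 0.5 >= 0.5 predicts all ones -> 50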


At first I tried training on all of norm_1, but quickly realized how foolish that was, since I had no idea when it would ever finish. So I started over with 50 examples and scaled up: 500 examples took about 5 minutes, and 5000 took over two hours:

octave> [theta, J, exit_flag] = fminunc(@(t)(costFunctionReg(t, norm_1(1:5000, :), label_1(1:5000, :), 0.001)), initial_theta, options);
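
For the record, initial_theta and options are not shown above; the standard course-style setup would be something like this (assumed, not verbatim from my session):

initial_theta = zeros(size(norm_1, 2), 1);           % one weight per feature, assuming norm_1 already carries a bias column
options = optimset('GradObj', 'on', 'MaxIter', 400); % use our gradient; cap the iterations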

Then I took the learned theta and measured the accuracy against norm_2; it came out to a dismal 43.580:

octave> accuracy(predict(theta, norm_2), label_2)

ans =  43.580

I did some rough math. A coin-flip classifier labels 50% of examples positive and 50% negative. On data where 10% of examples are positive, it classifies 10% × 50% of the data correctly as positives and 90% × 50% correctly as negatives, so its expected accuracy is:

10% × 50% + 90% × 50% = 50%
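
The same arithmetic as a one-liner, assuming a 10% positive rate:

octave> 0.10 * 0.50 + 0.90 * 0.50   % expected coin-flip accuracy = 0.50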

Then I went to the LeaderBoard to see how everyone else was doing. The current leader is maternaj with a score of 0.62376, though that reflects regression plus classification combined. The scoring metric is called MAE (Mean Absolute Error), computed as:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$

where $y_i$ is the true value, $\hat{y}_i$ the prediction, and $n$ the number of rows.

In other words, the lower the score, the better the ranking.
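
A minimal Octave sketch of that metric (my own helper for local checking, not from the competition kit):

function mae = mean_absolute_error(y_hat, y)
  % average absolute deviation between predictions and true values
  mae = mean(abs(y_hat - y));
end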


After that blow, I decisively switched to SVM. While hunting for an SVM toolbox, I found this site:

http://www.support-vector-machines.org/SVM_soft.html

It lists a number of well-regarded SVM implementations.

I remembered that Andrew Ng mentioned LIBSVM in his lectures, so I tried that one first.

Then I started training:

load("~/norm_1.mat");load("~/label_1.mat");

model_1 = svmtrain(label_1, norm_1, "-s 2 -d 3 -b 0 -q 0")
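
For the record, here is what those flags mean according to the LIBSVM README (they decide what actually gets trained):

% -s 2 : svm_type = one-class SVM (0 = C-SVC, 1 = nu-SVC,
%        2 = one-class SVM, 3 = epsilon-SVR, 4 = nu-SVR)
% -d 3 : polynomial degree 3 (only used when the kernel is polynomial;
%        the default kernel, -t 2, is RBF and ignores -d)
% -b 0 : do not train a model for probability estimates
% -q   : quiet mode (in stock LIBSVM this flag takes no argument)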


Who knew: a whole day of training went by and it still had not produced a result.


In the afternoon I finally came to my senses: yet again I had not started with a small dataset. So I redid the runs and found that something was seriously wrong. As I trained on progressively larger subsets, I recorded the elapsed time, training accuracy, and test accuracy along the way:

% train
octave> model_1 = svmtrain(label_1(1:50), norm_1(1:50, :), "-s 2 -d 3 -b 0 -q 0")

% train accuracy
octave> [predicted_label, accuracy, prob_estimates] = svmpredict(label_1(1:50), norm_1(1:50, :), model_1);

% test accuracy
octave> [predicted_label, accuracy, prob_estimates] = svmpredict(label_2, norm_2, model_1);
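
One detail worth noting (my observation): in LIBSVM's Octave interface the second output of svmpredict is a 3-element vector, [classification accuracy; mean squared error; squared correlation coefficient], and naming it accuracy shadows the accuracy() function defined earlier. Renaming avoids the clash:

octave> [pred, acc_vec, probs] = svmpredict(label_2, norm_2, model_1);
octave> acc_vec(1)   % classification accuracy, in percent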


The results looked like this:

+-------------+-------+----------------+---------------+
| data amount | time  | train accuracy | test accuracy |
+-------------+-------+----------------+---------------+
| 50          | 5s    | 4%             | 2.145%        |
| 100         | 15s   | 6%             | 2.54%         |
| 200         | 20s   | 6.5%           | 3.54%         |
| 400         | 1m    | 5.25%          | 3.275%        |
| 1000        | 4m    | 5%             | 4.225%        |
| 2000        | 7m    | 4.9%           | 4.44%         |
| 4000        | 18m   | 4.15%          | 4.31%         |
| 10000       | 1h48m | 4.5%           | 4.37%         |
+-------------+-------+----------------+---------------+

Note that these numbers are accuracies, and they do not even reach double digits. Right now I feel sick about it and have absolutely no idea what is going on...

A side note: with nothing to do while training was running, I was browsing Weibo and discovered that Mitchell, the author of "Machine Learning", also has a public lecture video, on semi-supervised learning. I plan to watch it when I have time:

http://videolectures.net/mlas06_mitchell_sla/