Spam classification: SVM, logistic regression, and SAE-Logistic (deep network) classification

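This post covers spam classification with three methods: the SVM algorithm, logistic regression, and an SAE-Logistic deep network classifier. The sections below explain how each one is applied to spam classification.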

The data set is spamData.mat. The training set has 3065 samples, the test set has 1536 samples, and each sample has 57 features.

Data set download: https://github.com/probml/pmtkdata/tree/master/spamData

I. SVM classification

II. Logistic classification

III. SAE-Logistic classification

I. SVM classification

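I will not cover the theory here: earlier posts discuss it, there is plenty of material online, and others explain it better than I could. I will only show how to classify spam with libsvm and with the SVM toolbox that ships with MATLAB.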

1. Spam classification with libsvm

libsvm download: http://www.csie.ntu.edu.tw/~cjlin/libsvm/

Detailed guide: http://www.matlabsky.com/thread-11925-1-1.html

After downloading libsvm, add it to the MATLAB path: File -> Set Path -> Add with Subfolders, then select the libsvm-3.11 folder.

1. In the MATLAB Command Window, enter: mex -setup

2. At the prompt "Would you like mex to locate installed compilers [y]/n?", answer y.

3. Select a compiler:
[1] Microsoft Visual C++ 2010 in E:\VS2010
[0] None
Choose 1.

4. At the prompt "Are these correct [y]/n?", answer y.

libsvm is now ready to use:

load('spamData.mat');
model = svmtrain(ytrain,Xtrain,'-t 0');
[predict_label,accuracy] = svmpredict(ytest,Xtest,model);

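The kernel is selected with the -t x option (highlighted in red in the original post), where x can be 0, 1, 2, 3, or 4. If -t is omitted, the default is 2.

0) linear kernel
1) polynomial kernel
2) RBF kernel
3) sigmoid kernel
4) precomputed (user-supplied) kernel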
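
For reference, the four built-in kernels in the list above can be written out as below. This is a minimal sketch based on the kernel formulas given in the libsvm README; the values of gamma, coef0, and degree, and the vectors u and v, are arbitrary illustrative choices and are not taken from the experiments in this post.

```matlab
% Kernel formulas as listed in the libsvm README (u and v are column vectors).
% gamma, coef0 and degree are illustrative values chosen for this sketch.
gamma  = 0.5;
coef0  = 1;
degree = 3;

linearKernel  = @(u, v) u' * v;                             % -t 0
polyKernel    = @(u, v) (gamma * (u' * v) + coef0)^degree;  % -t 1
rbfKernel     = @(u, v) exp(-gamma * sum((u - v).^2));      % -t 2
sigmoidKernel = @(u, v) tanh(gamma * (u' * v) + coef0);     % -t 3

u = [1; 2];
v = [3; 4];
kLin  = linearKernel(u, v);    % 1*3 + 2*4 = 11
kPoly = polyKernel(u, v);      % (0.5*11 + 1)^3 = 6.5^3 = 274.625
kRbf  = rbfKernel(u, v);       % exp(-0.5 * 8) = exp(-4)
kSig  = sigmoidKernel(u, v);   % tanh(6.5)
```

With -t 0 the decision function stays linear in the 57 input features, which is the setting used in the libsvm call shown earlier.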

According to the results table in the original post, the linear kernel performs best, reaching an accuracy of 91.1458%.

Before training, the labels need to be remapped so that class 0 becomes -1:

ytrain(ytrain==0) = -1;

ytest(ytest==0) = -1;
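
As a toy illustration of this remapping, the sketch below applies the same logical indexing to a made-up label vector (the vector y is hypothetical and stands in for ytrain or ytest).

```matlab
% Hypothetical labels in {0, 1}; the real ytrain/ytest come from spamData.mat.
y = [1; 0; 0; 1; 0];

% Logical indexing selects only the entries equal to 0 and overwrites them.
y(y == 0) = -1;

% Accuracy can then be computed by comparing predictions with these labels.
predicted = [1; -1; 1; 1; -1];          % made-up predictions
accuracy  = 100 * mean(predicted == y); % percentage of matching entries
```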

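2. Spam classification with MATLAB's built-in SVM toolbox

MATLAB ships with its own SVM toolbox, and this section shows how to use it for spam classification. The program comes first, followed by an explanation. Because libsvm also defines a function named svmtrain (with the label and data arguments in the opposite order), make sure the toolbox version is the one found first on the MATLAB path when running this program.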

load spamData
svmStruct = svmtrain(Xtrain,ytrain,'showplot',true);
classes=svmclassify(svmStruct,Xtest,'showplot',true);
nCorrect=sum(classes==ytest);
accuracy = nCorrect/length(classes);
accuracy = 100*accuracy;
accuracy = double(accuracy);
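fprintf('accuracy=%.2f%%\n',accuracy);

Running the program displays the output shown in the screenshot in the original post. It can be ignored, as it has no effect on the result.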


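Right-click svmtrain in the program and open svmtrain.m. In that file, find the line dflts = {'linear', ...} (line 287 in my copy). The default kernel set there can be changed to any of the names in okfuns = {'linear','quadratic', 'radial','rbf','polynomial','mlp'};

linear is the linear kernel.

quadratic is the quadratic kernel.

radial: I am not sure which kernel this is; it appears to be an alias for rbf.

rbf is the radial basis function kernel, usually called the Gaussian kernel, although Ng says it has little to do with the Gaussian distribution.

polynomial is the polynomial kernel.

mlp is the multilayer perceptron kernel.

The recognition rates obtained with each kernel on the spam data are compared next.

The quadratic kernel performs best, reaching 85.55%.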


The prepared data sets, preprocessed in three different ways, are available in the resources.

Results for Exercise 8.1:

[Figure: logistic regression results]

[Figure: SVM results]

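II. Logistic classification

See this reference: http://www.docin.com/p-160363677.html

An earlier post covered softmax classification, which can be adapted to logistic regression: http://blog.csdn.net/hlx371240/article/details/40015395

That post was written for the course "Optimization Methods" (《最优化计算方法》), where I optimized the parameters with L-BFGS and steepest descent (SD). Here I use the toolbox directly to do the classification.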
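
The reason a softmax classifier can stand in for logistic regression is that, with two classes, the softmax probability depends only on the difference between the two weight rows. The sketch below checks this numerically; theta and x are made-up values, not parameters learned from the spam data.

```matlab
% Two-class softmax versus logistic regression on one made-up example.
theta = [ 0.2 -0.5 1.0;    % row 1: weights for class 1
         -0.3  0.4 0.1];   % row 2: weights for class 2
x = [1.5; -2.0; 0.5];      % a single 3-dimensional input

scores   = theta * x;
pSoftmax = exp(scores) / sum(exp(scores));   % softmax over the two classes

w = (theta(1, :) - theta(2, :))';            % only the difference matters
pLogistic = 1 / (1 + exp(-(w' * x)));        % sigmoid of the score difference

gap = abs(pSoftmax(1) - pLogistic);          % agrees up to rounding error
```

This is why the listing below sets numClasses = 2 and reuses the softmax code unchanged.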

%% STEP 0: Initialise constants and parameters
inputSize = 57; % Size of input vector 
numClasses = 2;     % Number of classes 
lambda = 1e-4; % Weight decay parameter
%%=====================================================================
%% STEP 1: Load data
load('D:\机器学习课程\作业三\spamData.mat');
Xtrain=Xtrain';
ytrain(ytrain==0) = 2; % Remap label 0 to 2 so that labels lie in 1..numClasses
inputData = Xtrain;
DEBUG = false;
if DEBUG
    inputSize = 8;
    inputData = randn(8, 100);
    ytrain = randi(numClasses, 100, 1); % random labels that match the debug inputs
end
% Randomly initialise theta
theta = 0.005 * randn(numClasses * inputSize, 1); % theta is a column vector
%%======================================================================
%% STEP 2: Implement softmaxCost
[cost, grad] = softmaxCost(theta, numClasses, inputSize, lambda, inputData, ytrain);                                   
%%======================================================================
%% STEP 3: Learning parameters
options.maxIter = 100;
softmaxModel = softmaxTrain(inputSize, numClasses, lambda, ...
                            inputData, ytrain, options);
%%======================================================================
%% STEP 4: Testing
Xtest=Xtest';
ytest(ytest==0) = 2; % Remap label 0 to 2, as for the training labels
size(softmaxModel.optTheta)
size(inputData)
[pred] = softmaxPredict(softmaxModel, Xtest);
acc = mean(ytest(:) == pred(:));
fprintf('Accuracy: %0.3f%%\n', acc * 100);

softmaxTrain.m

function [softmaxModel] = softmaxTrain(inputSize, numClasses, lambda, inputData, labels, options)
if ~exist('options', 'var')
    options = struct;
end
if ~isfield(options, 'maxIter')
    options.maxIter = 400;
end
% initialize parameters
theta = 0.005 * randn(numClasses * inputSize, 1);
% Use minFunc to minimize the function
addpath minFunc/
options.Method = 'lbfgs'; % Here, we use L-BFGS to optimize our cost
                          % function. Generally, for minFunc to work, you
                          % need a function pointer with two outputs: the
                          % function value and the gradient. In our problem,
                          % softmaxCost.m satisfies this.
options.display = 'on';   % set on the struct that is actually passed to minFunc
[softmaxOptTheta, cost] = minFunc( @(p) softmaxCost(p, ...
                                   numClasses, inputSize, lambda, ...
                                   inputData, labels), ...                                   
                              theta, options);
% Fold softmaxOptTheta into a nicer format
softmaxModel.optTheta = reshape(softmaxOptTheta, numClasses, inputSize);
softmaxModel.inputSize = inputSize;
softmaxModel.numClasses = numClasses;                       
end

softmaxCost.m

function [cost, grad] = softmaxCost(theta, numClasses, inputSize, lambda, data, labels)

% numClasses - the number of classes 
% inputSize - the size N of the input vector
% lambda - weight decay parameter
% data - the N x M input matrix, where each column data(:, i) corresponds to
%        a single example
% labels - an M x 1 matrix containing the labels corresponding to the input data
%

% Unroll the parameters from theta
theta = reshape(theta, numClasses, inputSize); % turn the parameter column vector into a matrix

numCases = size(data, 2); % number of input examples
groundTruth = full(sparse(labels, 1:numCases, 1)); % sparse builds a matrix whose nonzero entries all equal the third argument, 1;
                                                    % their (row, column) indices are the pairs (labels(j), j)
cost = 0;
thetagrad = zeros(numClasses, inputSize);
M = bsxfun(@minus,theta*data,max(theta*data, [], 1)); % subtract each column's maximum for numerical stability
M = exp(M);
p = bsxfun(@rdivide, M, sum(M, 1)); % column-wise class probabilities
cost = -1/numCases * groundTruth(:)' * log(p(:)) + lambda/2 * sum(theta(:) .^ 2);
thetagrad = -1/numCases * (groundTruth - p) * data' + lambda * theta;

grad = [thetagrad(:)];
end
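
Two steps in softmaxCost are easy to misread: the one-hot groundTruth matrix built with sparse, and the subtraction of each column's maximum before exp. The sketch below illustrates both on made-up labels and scores; none of the numbers come from the spam data.

```matlab
% (1) One-hot "groundTruth" matrix from labels, built as in softmaxCost.
labels   = [1; 2; 2; 1];                     % made-up labels in 1..numClasses
numCases = numel(labels);
groundTruth = full(sparse(labels, 1:numCases, 1));
% Column j has a single 1 in row labels(j): [1 0 0 1; 0 1 1 0]

% (2) Why the column-wise maximum is subtracted before exp().
scores = [1000 1; 1001 2];                   % made-up scores, one column per example
naive  = exp(scores);                        % exp(1000) overflows to Inf
pNaive = bsxfun(@rdivide, naive, sum(naive, 1));   % Inf/Inf gives NaN in column 1

shifted    = bsxfun(@minus, scores, max(scores, [], 1));
expShifted = exp(shifted);                   % largest entry per column is exp(0) = 1
pStable    = bsxfun(@rdivide, expShifted, sum(expShifted, 1));
```

Subtracting a per-column constant leaves the softmax probabilities unchanged, so the shifted version computes the same quantity without overflow.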

softmaxPredict.m

function [pred] = softmaxPredict(softmaxModel, data)
% Unroll the parameters from theta
theta = softmaxModel.optTheta;  % this provides a numClasses x inputSize matrix
pred = zeros(1, size(data, 2));
[nop, pred] = max(theta * data);
end

initializeParameters.m

function theta = initializeParameters(hiddenSize, visibleSize)
%% Initialize parameters randomly based on layer sizes.
r  = sqrt(6) / sqrt(hiddenSize+visibleSize+1);   % we'll choose weights uniformly from the interval [-r, r]
W1 = rand(hiddenSize, visibleSize) * 2 * r - r;
W2 = rand(visibleSize, hiddenSize) * 2 * r - r;
b1 = zeros(hiddenSize, 1);
b2 = zeros(visibleSize, 1);
% Convert weights and bias gradients to the vector form.
% This step will "unroll" (flatten and concatenate together) all 
% your parameters into a vector, which can then be used with minFunc. 
theta = [W1(:) ; W2(:) ; b1(:) ; b2(:)];
end

sigmoidInv.m

function sigmInv = sigmoidInv(x)
    % Despite its name, this returns the derivative of the sigmoid, s(x).*(1-s(x)).
    s = 1 ./ (1 + exp(-x));
    sigmInv = s .* (1 - s);
end
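
The name sigmoidInv is slightly misleading: the expression sigmoid(x).*(1-sigmoid(x)) is the derivative of the sigmoid, not its inverse. The sketch below checks this against a central finite difference; the test points and step size h are arbitrary choices.

```matlab
sigmoid     = @(x) 1 ./ (1 + exp(-x));
sigmoidGrad = @(x) sigmoid(x) .* (1 - sigmoid(x));   % same formula as sigmoidInv

x = [-2 -0.5 0 0.5 2];                               % arbitrary test points
h = 1e-5;                                            % finite-difference step
numericGrad = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h);

maxDiff = max(abs(numericGrad - sigmoidGrad(x)));    % should be tiny

% For comparison, the actual inverse of the sigmoid is the logit function.
logit = @(p) log(p ./ (1 - p));
roundTrip = max(abs(logit(sigmoid(x)) - x));         % also tiny
```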

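This algorithm uses L-BFGS, a quasi-Newton method. Instead of computing the exact Hessian matrix, it builds an approximation from recent gradient differences, and the limited-memory variant saves memory by storing only a few vector pairs rather than a full matrix. I learned this from my instructor, and I recommend that fellow students take her Numerical Optimization (《数值优化》) course next semester; I think she teaches it very well, and you will certainly pay attention in class.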
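
To make the idea concrete, here is a minimal sketch of L-BFGS with the standard two-loop recursion, applied to a small made-up quadratic. It is not the minFunc implementation: the matrix A, the vector b, the memory size m, and the exact line search (which only works because the objective is quadratic) are all assumptions made for this illustration.

```matlab
% Toy problem: minimize f(x) = 0.5*x'*A*x - b'*x, whose gradient is A*x - b.
A = [4 1 0; 1 3 1; 0 1 2];       % made-up symmetric positive definite matrix
b = [1; 2; 3];
gradf = @(x) A * x - b;

m = 5;                           % number of (s, y) pairs kept in memory
x = zeros(3, 1);
g = gradf(x);
S = zeros(3, 0);                 % columns are steps s_k = x_{k+1} - x_k
Y = zeros(3, 0);                 % columns are gradient changes y_k = g_{k+1} - g_k

for iter = 1:50
    % Two-loop recursion: computes q = H*g without ever forming the matrix H.
    q = g;
    k = size(S, 2);
    alpha = zeros(k, 1);
    for i = k:-1:1
        alpha(i) = (S(:, i)' * q) / (Y(:, i)' * S(:, i));
        q = q - alpha(i) * Y(:, i);
    end
    if k > 0
        % Initial Hessian guess: a scaled identity from the newest pair.
        q = q * (S(:, k)' * Y(:, k)) / (Y(:, k)' * Y(:, k));
    end
    for i = 1:k
        beta = (Y(:, i)' * q) / (Y(:, i)' * S(:, i));
        q = q + S(:, i) * (alpha(i) - beta);
    end
    d = -q;                                  % search direction

    t = -(g' * d) / (d' * A * d);            % exact step length for a quadratic
    xNew = x + t * d;
    gNew = gradf(xNew);

    S = [S, xNew - x];                       % store the newest pair
    Y = [Y, gNew - g];
    if size(S, 2) > m                        % drop the oldest pair when full
        S(:, 1) = [];
        Y(:, 1) = [];
    end
    x = xNew;
    g = gNew;
    if norm(g) < 1e-10
        break;
    end
end
```

The memory saving comes from S and Y: only m pairs of vectors are stored, instead of a dense matrix whose size grows with the square of the number of parameters.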

The minFunc toolbox is also required; it can be downloaded from the resources.

The final recognition rate is 92.057%. Not every run reproduces this exact figure, however, because the parameters are initialized randomly.

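III. SAE-Logistic classification
SAE stands for Sparse Auto Encoder. It adds a self-taught learning layer that extracts further features before classification, motivated by the fact that the occurrences of different words in an e-mail may be correlated. In other words, a neural network is used for feature extraction.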
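
The "sparse" part refers to a penalty that pushes the average activation of each hidden unit toward a small target value. The sketch below evaluates the same KL-divergence expression that appears as Jsparse in sparseAutoencoderCost.m, using the target of 0.1 from main.m; the average-activation vectors are made up for illustration.

```matlab
sparsityParam = 0.1;     % target average activation, as in main.m

% KL-divergence penalty summed over hidden units (same form as Jsparse).
klPenalty = @(rhoHat) sum(sparsityParam .* log(sparsityParam ./ rhoHat) + ...
    (1 - sparsityParam) .* log((1 - sparsityParam) ./ (1 - rhoHat)));

rhoOnTarget = [0.1; 0.1; 0.1];   % made-up averages that match the target
rhoTooDense = [0.5; 0.6; 0.4];   % made-up averages that are far too active

penaltyOnTarget = klPenalty(rhoOnTarget);   % no penalty when the target is met
penaltyTooDense = klPenalty(rhoTooDense);   % positive penalty otherwise
```

In the cost function this penalty is multiplied by beta and added to the reconstruction error and the weight decay term.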

See this reference: http://nlp.stanford.edu/~socherr/sparseAutoencoder_2011new.pdf

Blog post: http://blog.csdn.net/hlx371240/article/details/40201499
The network structure is shown in the figure below:
[Figure: network structure]
main.m

%STEP 1: initialize parameters and load the data
clear all;
clc;
load('D:\机器学习课程\作业三\spamData.mat');
Xtrain=Xtrain';
ytrain(ytrain==0) = 2; % Remap label 0 to 2 so that labels lie in 1..numClasses
Xtest=Xtest';
ytest(ytest == 0) = 2; % Remap label 0 to 2
inputSize  = 57;
numLabels  = 2;
a=[57 50 45 40 35 30 25 20 15 10 5]; % hidden layer sizes to try
sparsityParam = 0.1; % target average activation of the hidden units
beta = 3;            % weight of sparsity penalty term
numClasses = 2;     % Number of classes (spam / not spam)
lambda = 1e-4; % Weight decay parameter
result = zeros(1, numel(a)); % test accuracy for each hidden layer size
%% ======================================================================
%STEP 2: train the self-taught learning layer (SAE)
for i=1:numel(a)
    hiddenSize = a(i);
    theta = initializeParameters(hiddenSize, inputSize);
    %-------------------------------------------------------------------
    opttheta = theta;
    addpath minFunc/
    options.Method = 'lbfgs';
    options.maxIter = 400;
    options.display = 'on';
    [opttheta, loss] = minFunc( @(p) sparseAutoencoderCost(p, ...
        inputSize, hiddenSize, ...
        lambda, sparsityParam, ...
        beta, Xtrain), ...
        theta, options);
    trainFeatures = feedForwardAutoencoder(opttheta, hiddenSize, inputSize, ...
        Xtrain);
    %% ================================================
    %STEP 3: train the softmax classifier
    saeSoftmaxTheta = 0.005 * randn(hiddenSize * numClasses, 1);
    softmaxLambda = 1e-4;
    numClasses = 2;
    softoptions = struct;
    softoptions.maxIter = 500;
    softmaxModel = softmaxTrain(hiddenSize,numClasses,softmaxLambda,...
        trainFeatures,ytrain,softoptions);
    theta_new = softmaxModel.optTheta(:);
    %% ============================================================
    % Fine-tune the whole network, starting from the pretrained weights
    stack = cell(1,1);
    stack{1}.w = reshape(opttheta(1:hiddenSize * inputSize), hiddenSize, inputSize);
    stack{1}.b =opttheta(2*hiddenSize*inputSize+1:2*hiddenSize*inputSize+hiddenSize);
    [stackparams, netconfig] = stack2params(stack);
    stackedAETheta = [theta_new;stackparams];
    addpath minFunc/;
    options = struct;
    options.Method = 'lbfgs';
    options.maxIter = 400;
    options.display = 'on';
    [stackedAEOptTheta,cost] =  minFunc(@(p)stackedAECost(p,inputSize,hiddenSize,numClasses, netconfig,lambda, Xtrain, ytrain),stackedAETheta,options);
    %% =================================================================
    %STEP 4: test
    [pred] = stackedAEPredict(stackedAEOptTheta, inputSize, hiddenSize, ...
        numClasses, netconfig, Xtest);
    acc = mean(ytest(:) == pred(:));
    fprintf('Accuracy = %0.3f%%\n', acc * 100);
    result(i)=acc * 100;
end

stackedAEPredict.m

function [pred] = stackedAEPredict(theta, inputSize, hiddenSize, numClasses, netconfig, data)                     
softmaxTheta = reshape(theta(1:hiddenSize*numClasses), numClasses, hiddenSize);
stack = params2stack(theta(hiddenSize*numClasses+1:end), netconfig);
depth = numel(stack);
z = cell(depth+1,1);
a = cell(depth+1, 1);
a{1} = data;
for layer = (1:depth)
  z{layer+1} = stack{layer}.w * a{layer} + repmat(stack{layer}.b, [1, size(a{layer},2)]);
  a{layer+1} = sigmoid(z{layer+1});
end
[~, pred] = max(softmaxTheta * a{depth+1});
end
% You might find this useful
function sigm = sigmoid(x)
    sigm = 1 ./ (1 + exp(-x));
end

stackedAECost.m

function [ cost, grad ] = stackedAECost(theta, inputSize, hiddenSize, ...
                                              numClasses, netconfig, ...
                                              lambda, data, labels)                                       
% stackedAECost: Takes a trained softmaxTheta and a training data set with labels,
% and returns cost and gradient using a stacked autoencoder model. Used for
% finetuning.                                         
% theta: trained weights from the autoencoder
% inputSize:   the number of input units
% hiddenSize:  the number of hidden units *at the 2nd layer*
% numClasses:  the number of categories
% netconfig:   the network configuration of the stack
% lambda:      the weight regularization penalty
% data: Our matrix containing the training data as columns.  So, data(:,i) is the i-th training example. 
% labels: A vector containing labels, where labels(i) is the label for the
%         i-th training example
softmaxTheta = reshape(theta(1:hiddenSize*numClasses), numClasses, hiddenSize);
stack = params2stack(theta(hiddenSize*numClasses+1:end), netconfig);
% You will need to compute the following gradients
softmaxThetaGrad = zeros(size(softmaxTheta));
stackgrad = cell(size(stack));
for d = 1:numel(stack)
    stackgrad{d}.w = zeros(size(stack{d}.w));
    stackgrad{d}.b = zeros(size(stack{d}.b));
end
cost = 0; % You need to compute this
numCases = size(data, 2); % number of input examples
groundTruth = full(sparse(labels, 1:numCases, 1));
% Forward pass through the stacked layers
depth = numel(stack);
z = cell(depth+1,1);
a = cell(depth+1, 1);
a{1} = data;
for layer = (1:depth)
  z{layer+1} = stack{layer}.w * a{layer} + repmat(stack{layer}.b, [1, size(a{layer},2)]);
  a{layer+1} = sigmoid(z{layer+1});
end
% Softmax on the top-layer features (column maximum subtracted for stability)
M = softmaxTheta * a{depth+1};
M = bsxfun(@minus, M, max(M, [], 1));
expM = exp(M);
p = bsxfun(@rdivide, expM, sum(expM, 1));
% Average the data term over the numCases training examples
cost = -1/numCases * groundTruth(:)' * log(p(:)) + lambda/2 * sum(softmaxTheta(:) .^ 2);
softmaxThetaGrad = -1/numCases * (groundTruth - p) * a{depth+1}' + lambda * softmaxTheta;
% Backpropagate the error terms through the stack
d = cell(depth+1, 1);
d{depth+1} = -(softmaxTheta' * (groundTruth - p)) .* a{depth+1} .* (1-a{depth+1});
for layer = (depth:-1:2)
  d{layer} = (stack{layer}.w' * d{layer+1}) .* a{layer} .* (1-a{layer});
end
for layer = (depth:-1:1)
  stackgrad{layer}.w = (1/numCases) * d{layer+1} * a{layer}';
  stackgrad{layer}.b = (1/numCases) * sum(d{layer+1}, 2);
end
% -------------------------------------------------------------------------
%% Roll gradient vector
grad = [softmaxThetaGrad(:) ; stack2params(stackgrad)];
end
% You might find this useful
function sigm = sigmoid(x)
    sigm = 1 ./ (1 + exp(-x));
end

feedForwardAutoencoder.m

function [activation] = feedForwardAutoencoder(theta, hiddenSize, visibleSize, data)
% theta: trained weights from the autoencoder
% visibleSize: the number of input units (57 for the spam data) 
% hiddenSize: the number of hidden units 
% data: Our matrix containing the training data as columns.  So, data(:,i) is the i-th training example. 
% We first convert theta to the (W1, W2, b1, b2) matrix/vector format, so that this 
% follows the notation convention of the lecture notes. 
W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
b1 = theta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);
%  Instructions: Compute the activation of the hidden layer for the Sparse Autoencoder.
activation  = sigmoid(W1*data+repmat(b1,[1,size(data,2)]));
end
function sigm = sigmoid(x)
    sigm = 1 ./ (1 + exp(-x));
end
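
As a small illustration of what feedForwardAutoencoder returns, the sketch below computes hidden-layer features for a made-up 3-feature data set with 2 hidden units. The weights, biases, and data are hypothetical. It also shows that bsxfun(@plus, ...) gives the same result as the repmat call without copying the bias vector.

```matlab
sigmoid = @(z) 1 ./ (1 + exp(-z));

W1   = [0.1 -0.2 0.3; 0.0 0.5 -0.4];   % made-up 2 x 3 weights (2 hidden units)
b1   = [0.05; -0.1];                   % one bias per hidden unit
data = [1 0 2 1; 0 1 1 3; 2 2 0 1];    % 3 features x 4 examples (columns)

% Same computation as in feedForwardAutoencoder: one feature column per example.
featRepmat = sigmoid(W1 * data + repmat(b1, 1, size(data, 2)));

% Equivalent form that broadcasts the bias instead of replicating it.
featBsxfun = sigmoid(bsxfun(@plus, W1 * data, b1));

maxGap = max(abs(featRepmat(:) - featBsxfun(:)));
```

The resulting hiddenSize x numExamples matrix is what main.m passes to softmaxTrain as trainFeatures.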

sparseAutoencoderCost.m

function [cost,grad] = sparseAutoencoderCost(theta, visibleSize, hiddenSize, ...
                                             lambda, sparsityParam, beta, data)
% visibleSize: the number of input units (57 for the spam data) 
% hiddenSize: the number of hidden units 
% lambda: weight decay parameter
% sparsityParam: The desired average activation for the hidden units (denoted in the lecture
%                           notes by the greek alphabet rho, which looks like a lower-case "p").
% beta: weight of sparsity penalty term
% data: Our visibleSize x m matrix containing the training data.  So, data(:,i) is the i-th training example. 
% The input theta is a vector (because minFunc expects the parameters to be a vector). 
% We first convert theta to the (W1, W2, b1, b2) matrix/vector format, so that this 
% follows the notation convention of the lecture notes. 
W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
W2 = reshape(theta(hiddenSize*visibleSize+1:2*hiddenSize*visibleSize), visibleSize, hiddenSize);
b1 = theta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);
b2 = theta(2*hiddenSize*visibleSize+hiddenSize+1:end);
% Cost and gradient variables (your code needs to compute these values). 
% Here, we initialize them to zeros. 
cost = 0;
W1grad = zeros(size(W1)); 
W2grad = zeros(size(W2));
b1grad = zeros(size(b1)); 
b2grad = zeros(size(b2));

Jcost = 0;   % reconstruction error
Jweight = 0; % weight decay penalty
Jsparse = 0; % sparsity penalty
[n m] = size(data); % m is the number of examples, n the number of features
% Forward pass: compute the linear combination and the activation at each layer
z2 = W1*data+repmat(b1,1,m); % b1 must be replicated into a matrix with m columns
a2 = sigmoid(z2);
z3 = W2*a2+repmat(b2,1,m);
a3 = sigmoid(z3);
% Reconstruction error
Jcost = (0.5/m)*sum(sum((a3-data).^2));
% Weight decay term
Jweight = (1/2)*(sum(sum(W1.^2))+sum(sum(W2.^2)));
% Sparsity term
rho = (1/m).*sum(a2,2); % average activation of each hidden unit over the examples
Jsparse = sum(sparsityParam.*log(sparsityParam./rho)+ ...
        (1-sparsityParam).*log((1-sparsityParam)./(1-rho)));
% Total cost
cost = Jcost+lambda*Jweight+beta*Jsparse;
% Backward pass: compute the error term at each node
d3 = -(data-a3).*sigmoidInv(z3);
sterm = beta*(-sparsityParam./rho+(1-sparsityParam)./(1-rho)); % extra term in the partial derivative
                                                            % that comes from the sparsity penalty
d2 = (W2'*d3+repmat(sterm,1,m)).*sigmoidInv(z2); 
% W1grad 
W1grad = W1grad+d2*data';
W1grad = (1/m)*W1grad+lambda*W1;
% W2grad  
W2grad = W2grad+d3*a2';
W2grad = (1/m).*W2grad+lambda*W2;
% b1grad 
b1grad = b1grad+sum(d2,2);
b1grad = (1/m)*b1grad; % the bias gradient is a vector, so sum each row over the examples
% b2grad 
b2grad = b2grad+sum(d3,2);
b2grad = (1/m)*b2grad;
grad = [W1grad(:) ; W2grad(:) ; b1grad(:) ; b2grad(:)];
end

function sigm = sigmoid(x)
    sigm = 1 ./ (1 + exp(-x));
end
% Despite its name, sigmoidInv returns the derivative of the sigmoid
function sigmInv = sigmoidInv(x)
    sigmInv = sigmoid(x).*(1-sigmoid(x));
end

stack2params.m

function [params, netconfig] = stack2params(stack)

params = [];
for d = 1:numel(stack)
    params = [params ; stack{d}.w(:) ; stack{d}.b(:) ];
    assert(size(stack{d}.w, 1) == size(stack{d}.b, 1), ...
        ['The bias should be a *column* vector of ' ...
         int2str(size(stack{d}.w, 1)) 'x1']);
    if d < numel(stack)
        assert(size(stack{d}.w, 1) == size(stack{d+1}.w, 2), ...
            ['The adjacent layers L' int2str(d) ' and L' int2str(d+1) ...
             ' should have matching sizes.']);
    end
end
if nargout > 1
    % Setup netconfig
    if numel(stack) == 0
        netconfig.inputsize = 0;
        netconfig.layersizes = {};
    else
        netconfig.inputsize = size(stack{1}.w, 2);
        netconfig.layersizes = {};
        for d = 1:numel(stack)
            netconfig.layersizes = [netconfig.layersizes ; size(stack{d}.w,1)];
        end
    end
end
end

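Experimental results:

As the figure in the original post shows, the recognition rate is above 90% for most hidden layer sizes, and it reaches 93.36% when the hidden layer has 25 neurons.
Recognition rates of the methods compared:

SAE-Logistic (93.36%) > Logistic (92.057%) > libsvm (91.1458%) > MATLAB SVM toolbox (85.55%). SAE-Logistic achieves the highest recognition rate, which illustrates the appeal of deep networks.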
