This post covers spam classification with three approaches: an SVM, logistic regression, and an SAE-Logistic deep network. Below I explain how each algorithm is applied to the spam classification task.
The dataset is spamData.mat: 3065 training samples and 1536 test samples, each with 57 features.
Dataset download: https://github.com/probml/pmtkdata/tree/master/spamData
I. SVM classification
II. Logistic classification
III. SAE-Logistic classification
I. SVM classification
I will not go over the theory here; earlier posts cover it and there is already plenty of material online that explains it better than I could. This section only shows how to classify spam with LIBSVM and with MATLAB's built-in SVM toolbox.
1. Spam classification with LIBSVM
LIBSVM download: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
A detailed guide (in Chinese): http://www.matlabsky.com/thread-11925-1-1.html
After downloading LIBSVM, add it to the MATLAB path: File -> Set Path -> Add with Subfolders -> select the libsvm-3.11 folder.
1. In the MATLAB Command Window, type: mex -setup
2. When asked "Would you like mex to locate installed compilers [y]/n?", enter y.
3. Select a compiler:
   [1] Microsoft Visual C++ 2010 in E:\VS2010
   [0] None
   Enter 1.
4. When asked "Are these correct [y]/n?", enter y.
The compiler is now configured and LIBSVM is ready to use.
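With the compiler set up, the LIBSVM MEX binaries still have to be compiled once. A minimal sketch, assuming the standard folder layout of the libsvm-3.11 download (the install path below is only an example):

cd('E:\libsvm-3.11\matlab')   % example path -- point this at your own libsvm matlab/ folder
make                          % builds the svmtrain / svmpredict MEX files shipped with LIBSVM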
load('spamData.mat');
model = svmtrain(ytrain, Xtrain, '-t 0');
[predict_label, accuracy] = svmpredict(ytest, Xtest, model);

The '-t x' option selects the kernel type; x can be 0, 1, 2, 3 or 4, and the default is 2 if '-t' is omitted:
0) linear kernel
1) polynomial kernel
2) RBF kernel
3) sigmoid kernel
4) precomputed (user-defined) kernel
Comparing the test accuracy of these kernels, the linear kernel performs best, reaching 91.1458%.
Note that the labels must first be remapped from {0, 1} to {-1, +1}:
ytrain(ytrain==0) = -1;
ytest(ytest==0) = -1;
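Putting the pieces together, here is a minimal sketch that compares the four built-in kernels in one run (assuming spamData.mat is on the path and the LIBSVM MEX files are compiled; variable names follow the dataset):

load('spamData.mat');
ytrain(ytrain==0) = -1;          % remap labels from {0,1} to {-1,+1}
ytest(ytest==0)  = -1;
for t = 0:3                      % 0 linear, 1 polynomial, 2 RBF, 3 sigmoid
    model = svmtrain(ytrain, Xtrain, sprintf('-t %d', t));
    [~, acc, ~] = svmpredict(ytest, Xtest, model);   % acc(1) is the accuracy in percent
    fprintf('-t %d: accuracy = %.4f%%\n', t, acc(1));
end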
2. Spam classification with MATLAB's built-in SVM toolbox
MATLAB ships with its own SVM functions; here is how to use them for spam classification. The program comes first, then the explanation.
load spamData
svmStruct = svmtrain(Xtrain, ytrain, 'showplot', true);
classes = svmclassify(svmStruct, Xtest, 'showplot', true);
nCorrect = sum(classes == ytest);
accuracy = 100 * nCorrect / length(classes);
fprintf('accuracy = %.2f%%\n', accuracy);
Running this prints a warning; just ignore it, it has no effect on the result.
To change the kernel, right-click svmtrain in the editor and open svmtrain.m. Find the line dflts = {'linear', ...} (line 287 in my copy). The kernels the toolbox accepts are listed in okfuns = {'linear','quadratic','radial','rbf','polynomial','mlp'}; changing the default entry in dflts = {'linear', ...} switches the kernel.
linear: linear kernel.
quadratic: quadratic kernel.
radial: I am not sure which kernel this corresponds to.
rbf: radial basis function kernel, usually called the Gaussian kernel (although, as Andrew Ng points out, it has little to do with the Gaussian distribution).
polynomial: polynomial kernel.
mlp: multilayer-perceptron kernel.
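Rather than editing svmtrain.m, the kernel can also be selected through the 'kernel_function' name-value pair, which takes the names listed above. A minimal sketch of the comparison loop (assuming this MATLAB version's svmtrain supports that option and that the default solver settings converge on this data):

load spamData
kernels = {'linear', 'quadratic', 'rbf', 'polynomial', 'mlp'};
for k = 1:numel(kernels)
    svmStruct = svmtrain(Xtrain, ytrain, 'kernel_function', kernels{k});
    classes   = svmclassify(svmStruct, Xtest);
    fprintf('%-10s accuracy = %.2f%%\n', kernels{k}, 100 * mean(classes == ytest));
end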
Below are the recognition rates obtained with each kernel on the spam classification task.
With the built-in toolbox, the quadratic kernel works best, reaching 85.55%.
The cleaned datasets (three different preprocessing variants) are available in the resources.
Results for Exercise 8.1, for logistic regression and for the SVM (result figures not reproduced here).
II. Logistic classification
A useful reference: http://www.docin.com/p-160363677.html
An earlier post covered softmax classification, which can be turned into logistic regression: http://blog.csdn.net/hlx371240/article/details/40015395
That post was written for the Optimization Methods course, where I optimized the parameters with L-BFGS and steepest descent; here I simply use the toolbox to classify.
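A quick aside (my own note, not part of the course material): with only two classes, softmax regression reduces exactly to logistic regression, because

\[ p(y = 1 \mid x) = \frac{e^{\theta_1^\top x}}{e^{\theta_1^\top x} + e^{\theta_2^\top x}} = \frac{1}{1 + e^{-(\theta_1 - \theta_2)^\top x}} , \]

so the two-class softmax model trained below is just an over-parameterized logistic regression with effective weight vector \(\theta_1 - \theta_2\).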
%% STEP 0: Initialise constants and parameters
inputSize  = 57;     % Size of input vector
numClasses = 2;      % Number of classes
lambda     = 1e-4;   % Weight decay parameter
%%=====================================================================
%% STEP 1: Load data
load('D:\机器学习课程\作业三\spamData.mat');
Xtrain = Xtrain';
ytrain(ytrain==0) = 2;        % Remap label 0 to 2

inputData = Xtrain;

DEBUG = false;
if DEBUG
    inputSize = 8;
    inputData = randn(8, 100);
    labels = randi(10, 100, 1);
end

% Randomly initialise theta (a column vector)
theta = 0.005 * randn(numClasses * inputSize, 1);
%%======================================================================
%% STEP 2: Implement softmaxCost
[cost, grad] = softmaxCost(theta, numClasses, inputSize, lambda, inputData, ytrain);
%%======================================================================
%% STEP 3: Learning parameters
options.maxIter = 100;
softmaxModel = softmaxTrain(inputSize, numClasses, lambda, ...
                            inputData, ytrain, options);
%%======================================================================
%% STEP 4: Testing
Xtest = Xtest';
ytest(ytest==0) = 2;          % Remap label 0 to 2
size(softmaxModel.optTheta)
size(inputData)
[pred] = softmaxPredict(softmaxModel, Xtest);
acc = mean(ytest(:) == pred(:));
fprintf('Accuracy: %0.3f%%\n', acc * 100);

softmaxTrain.m
function [softmaxModel] = softmaxTrain(inputSize, numClasses, lambda, inputData, labels, options)

if ~exist('options', 'var')
    options = struct;
end
if ~isfield(options, 'maxIter')
    options.maxIter = 400;
end

% Initialize parameters
theta = 0.005 * randn(numClasses * inputSize, 1);

% Use minFunc to minimize the function
addpath minFunc/
options.Method = 'lbfgs'; % Here, we use L-BFGS to optimize our cost
                          % function. Generally, for minFunc to work, you
                          % need a function pointer with two outputs: the
                          % function value and the gradient. In our problem,
                          % softmaxCost.m satisfies this.
minFuncOptions.display = 'on';

[softmaxOptTheta, cost] = minFunc( @(p) softmaxCost(p, ...
                                   numClasses, inputSize, lambda, ...
                                   inputData, labels), ...
                                   theta, options);

% Fold softmaxOptTheta into a nicer format
softmaxModel.optTheta   = reshape(softmaxOptTheta, numClasses, inputSize);
softmaxModel.inputSize  = inputSize;
softmaxModel.numClasses = numClasses;

end
softmaxCost.m
function [cost, grad] = softmaxCost(theta, numClasses, inputSize, lambda, data, labels)
% numClasses - the number of classes
% inputSize  - the size N of the input vector
% lambda     - weight decay parameter
% data       - the N x M input matrix, where each column data(:, i) is a single training example
% labels     - an M x 1 vector containing the labels for the input data

% Unroll the parameter vector into a numClasses x inputSize matrix
theta = reshape(theta, numClasses, inputSize);
numCases = size(data, 2);                            % number of training examples
% groundTruth(c, i) = 1 if example i has label c, and 0 otherwise
groundTruth = full(sparse(labels, 1:numCases, 1));

M = bsxfun(@minus, theta*data, max(theta*data, [], 1));   % subtract the column max for numerical stability
M = exp(M);
p = bsxfun(@rdivide, M, sum(M));                          % class probabilities

cost = -1/numCases * groundTruth(:)' * log(p(:)) + lambda/2 * sum(theta(:) .^ 2);
thetagrad = -1/numCases * (groundTruth - p) * data' + lambda * theta;

grad = thetagrad(:);
end
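If you want to sanity-check the gradient returned by softmaxCost before training on the real data, a quick finite-difference check on a tiny random problem works (my own sketch, not part of the original assignment; it only assumes softmaxCost.m above is on the path):

% Compare the analytic gradient with central finite differences
inputSize = 8; numClasses = 2; lambda = 1e-4;
data   = randn(inputSize, 100);
labels = randi(numClasses, 100, 1);
theta  = 0.005 * randn(numClasses * inputSize, 1);
[~, grad] = softmaxCost(theta, numClasses, inputSize, lambda, data, labels);

epsilon = 1e-4;
numgrad = zeros(size(theta));
for i = 1:numel(theta)
    e = zeros(size(theta)); e(i) = epsilon;
    numgrad(i) = (softmaxCost(theta + e, numClasses, inputSize, lambda, data, labels) - ...
                  softmaxCost(theta - e, numClasses, inputSize, lambda, data, labels)) / (2 * epsilon);
end
fprintf('relative difference: %g (should be around 1e-9 or smaller)\n', ...
        norm(numgrad - grad) / norm(numgrad + grad));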
softmaxPredict.m
function [pred] = softmaxPredict(softmaxModel, data)

% Unroll the parameters from theta
theta = softmaxModel.optTheta;      % a numClasses x inputSize matrix
pred = zeros(1, size(data, 2));
[nop, pred] = max(theta * data);    % predicted class = row with the largest score

end
initializeParameters.m
function theta = initializeParameters(hiddenSize, visibleSize)
%% Initialize parameters randomly based on layer sizes.
r  = sqrt(6) / sqrt(hiddenSize + visibleSize + 1);   % choose weights uniformly from the interval [-r, r]
W1 = rand(hiddenSize, visibleSize) * 2 * r - r;
W2 = rand(visibleSize, hiddenSize) * 2 * r - r;
b1 = zeros(hiddenSize, 1);
b2 = zeros(visibleSize, 1);

% Convert weights and bias gradients to the vector form.
% This step will "unroll" (flatten and concatenate together) all
% your parameters into a vector, which can then be used with minFunc.
theta = [W1(:) ; W2(:) ; b1(:) ; b2(:)];
end

sigmoidInv.m
function sigmInv = sigmoidInv(x)
% Derivative of the sigmoid function
sigmInv = sigmoid(x) .* (1 - sigmoid(x));
end
The optimizer used here is L-BFGS, a quasi-Newton method: it saves memory by replacing the exact Hessian with an approximation built from recent gradient information. This is something my Numerical Optimization teacher covered; I recommend that classmates take her course next semester, since she explains it very well and the lectures are absolutely worth paying attention to.
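For reference (my own summary of the standard textbook formula, not from the course slides): BFGS maintains a Hessian approximation \(B_k\) and updates it after each step using only gradient differences,

\[ B_{k+1} = B_k - \frac{B_k s_k s_k^\top B_k}{s_k^\top B_k s_k} + \frac{y_k y_k^\top}{y_k^\top s_k}, \qquad s_k = x_{k+1} - x_k, \quad y_k = \nabla f(x_{k+1}) - \nabla f(x_k). \]

L-BFGS, the limited-memory variant that minFunc uses here, never stores \(B_k\) explicitly; it keeps only the last m pairs \((s_i, y_i)\) and reconstructs the search direction from them, so the memory cost stays linear in the number of parameters.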
The minFunc toolbox needed here can also be downloaded from the resources. The final recognition rate is 92.057%, though it varies a little from run to run because the parameters are initialized randomly.
III. SAE-Logistic classification
SAE stands for Sparse Auto-Encoder: an extra self-taught layer is added to extract higher-level features before classification. Since the occurrences of different words in an e-mail may be correlated, the hidden layer can capture those relationships; this is feature learning with a neural network.
Reference: http://nlp.stanford.edu/~socherr/sparseAutoencoder_2011new.pdf
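Concretely, the autoencoder below is trained with the usual sparse-autoencoder objective (this summarizes what sparseAutoencoderCost.m at the end of this section computes; \(\rho\), \(\lambda\), \(\beta\) correspond to sparsityParam, lambda, beta in the code): reconstruction error plus weight decay plus a KL-divergence sparsity penalty,

\[ J(W,b) = \frac{1}{2m}\sum_{i=1}^{m}\lVert a^{(3)}_i - x_i\rVert^2 + \frac{\lambda}{2}\left(\lVert W_1\rVert_F^2 + \lVert W_2\rVert_F^2\right) + \beta \sum_{j} \mathrm{KL}\left(\rho \,\Vert\, \hat{\rho}_j\right), \]
\[ \mathrm{KL}(\rho \Vert \hat{\rho}_j) = \rho\log\frac{\rho}{\hat{\rho}_j} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j}, \qquad \hat{\rho}_j = \frac{1}{m}\sum_{i=1}^{m} a^{(2)}_{j,i}, \]

where \(\hat{\rho}_j\) is the average activation of hidden unit j over the m training e-mails.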
%% STEP 1: Initialize parameters and load the data
clear all; clc;
load('D:\机器学习课程\作业三\spamData.mat');
Xtrain = Xtrain';
ytrain(ytrain==0) = 2;
Xtest = Xtest';
ytest(ytest==0) = 2;              % Remap label 0 to 2

inputSize = 57;
numLabels = 2;
a = [57 50 45 40 35 30 25 20 15 10 5];   % candidate hidden-layer sizes
sparsityParam = 0.1;
lambda = 3e-3;                    % weight decay parameter
beta = 3;                         % weight of sparsity penalty term
numClasses = 2;                   % Number of classes
lambda = 1e-4;                    % Weight decay parameter (overrides the value above)
%% ======================================================================
%% STEP 2: Train the self-taught (sparse autoencoder) layer
for i = 1:11
    hiddenSize = a(i);
    theta = initializeParameters(hiddenSize, inputSize);
    %-------------------------------------------------------------------
    opttheta = theta;
    addpath minFunc/
    options.Method = 'lbfgs';
    options.maxIter = 400;
    options.display = 'on';
    [opttheta, loss] = minFunc( @(p) sparseAutoencoderCost(p, ...
                                inputSize, hiddenSize, ...
                                lambda, sparsityParam, ...
                                beta, Xtrain), ...
                                theta, options);
    trainFeatures = feedForwardAutoencoder(opttheta, hiddenSize, inputSize, Xtrain);
    %% ==================================================================
    %% STEP 3: Train the softmax classifier on the autoencoder features
    saeSoftmaxTheta = 0.005 * randn(hiddenSize * numClasses, 1);
    softmaxLambda = 1e-4;
    numClasses = 2;
    softoptions = struct;
    softoptions.maxIter = 500;
    softmaxModel = softmaxTrain(hiddenSize, numClasses, softmaxLambda, ...
                                trainFeatures, ytrain, softoptions);
    theta_new = softmaxModel.optTheta(:);
    %% ==================================================================
    % Stack the autoencoder and softmax layers, then fine-tune the whole network
    stack = cell(1,1);
    stack{1}.w = reshape(opttheta(1:hiddenSize * inputSize), hiddenSize, inputSize);
    stack{1}.b = opttheta(2*hiddenSize*inputSize+1 : 2*hiddenSize*inputSize+hiddenSize);
    [stackparams, netconfig] = stack2params(stack);
    stackedAETheta = [theta_new; stackparams];

    addpath minFunc/;
    options = struct;
    options.Method = 'lbfgs';
    options.maxIter = 400;
    options.display = 'on';
    [stackedAEOptTheta, cost] = minFunc(@(p) stackedAECost(p, inputSize, hiddenSize, ...
                                numClasses, netconfig, lambda, Xtrain, ytrain), ...
                                stackedAETheta, options);
    %% ==================================================================
    %% STEP 4: Testing
    [pred] = stackedAEPredict(stackedAEOptTheta, inputSize, hiddenSize, ...
                              numClasses, netconfig, Xtest);
    acc = mean(ytest(:) == pred(:));
    fprintf('Accuracy = %0.3f%%\n', acc * 100);
    result(i) = acc * 100;
end

stackedAEPredict.m
function [pred] = stackedAEPredict(theta, inputSize, hiddenSize, numClasses, netconfig, data)

softmaxTheta = reshape(theta(1:hiddenSize*numClasses), numClasses, hiddenSize);
stack = params2stack(theta(hiddenSize*numClasses+1:end), netconfig);

depth = numel(stack);
z = cell(depth+1, 1);
a = cell(depth+1, 1);
a{1} = data;

% Feed-forward through the stacked layers
for layer = (1:depth)
    z{layer+1} = stack{layer}.w * a{layer} + repmat(stack{layer}.b, [1, size(a{layer},2)]);
    a{layer+1} = sigmoid(z{layer+1});
end

[~, pred] = max(softmaxTheta * a{depth+1});

end

% You might find this useful
function sigm = sigmoid(x)
    sigm = 1 ./ (1 + exp(-x));
end

stackedAECost.m
function [ cost, grad ] = stackedAECost(theta, inputSize, hiddenSize, ...
                                        numClasses, netconfig, ...
                                        lambda, data, labels)
% stackedAECost: Takes a trained softmaxTheta and a training data set with labels,
% and returns cost and gradient using a stacked autoencoder model. Used for finetuning.
% theta:      trained weights from the autoencoder
% hiddenSize: the number of hidden units *at the 2nd layer*
% numClasses: the number of categories
% netconfig:  the network configuration of the stack
% lambda:     the weight regularization penalty
% data:       matrix containing the training data as columns, so data(:,i) is the i-th training example
% labels:     vector containing the labels, where labels(i) is the label for example i

softmaxTheta = reshape(theta(1:hiddenSize*numClasses), numClasses, hiddenSize);
stack = params2stack(theta(hiddenSize*numClasses+1:end), netconfig);

% Gradients to compute
softmaxThetaGrad = zeros(size(softmaxTheta));
stackgrad = cell(size(stack));
for d = 1:numel(stack)
    stackgrad{d}.w = zeros(size(stack{d}.w));
    stackgrad{d}.b = zeros(size(stack{d}.b));
end

cost = 0;
numCases = size(data, 2);                          % number of training examples
groundTruth = full(sparse(labels, 1:numCases, 1));

% Feed-forward pass through the stack
depth = numel(stack);
z = cell(depth+1, 1);
a = cell(depth+1, 1);
a{1} = data;
for layer = (1:depth)
    z{layer+1} = stack{layer}.w * a{layer} + repmat(stack{layer}.b, [1, size(a{layer},2)]);
    a{layer+1} = sigmoid(z{layer+1});
end

% Softmax cost and gradient on the top-layer activations
M = softmaxTheta * a{depth+1};
M = bsxfun(@minus, M, max(M));
p = bsxfun(@rdivide, exp(M), sum(exp(M)));
cost = -1/numClasses * groundTruth(:)' * log(p(:)) + lambda/2 * sum(softmaxTheta(:) .^ 2);
softmaxThetaGrad = -1/numClasses * (groundTruth - p) * a{depth+1}' + lambda * softmaxTheta;

% Back-propagate the error through the stack
d = cell(depth+1);
d{depth+1} = -(softmaxTheta' * (groundTruth - p)) .* a{depth+1} .* (1-a{depth+1});
for layer = (depth:-1:2)
    d{layer} = (stack{layer}.w' * d{layer+1}) .* a{layer} .* (1-a{layer});
end
for layer = (depth:-1:1)
    stackgrad{layer}.w = (1/numClasses) * d{layer+1} * a{layer}';
    stackgrad{layer}.b = (1/numClasses) * sum(d{layer+1}, 2);
end

%% Roll gradient vector
grad = [softmaxThetaGrad(:) ; stack2params(stackgrad)];

end

% You might find this useful
function sigm = sigmoid(x)
    sigm = 1 ./ (1 + exp(-x));
end

feedForwardAutoencoder.m
function [activation] = feedForwardAutoencoder(theta, hiddenSize, visibleSize, data)
% theta:       trained weights from the autoencoder
% visibleSize: the number of input units
% hiddenSize:  the number of hidden units
% data:        matrix containing the training data as columns, so data(:,i) is the i-th training example

% Convert theta to the (W1, b1) matrix/vector format used in the lecture notes
W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
b1 = theta(2*hiddenSize*visibleSize+1 : 2*hiddenSize*visibleSize+hiddenSize);

% Activation of the hidden layer of the sparse autoencoder
activation = sigmoid(W1*data + repmat(b1, [1, size(data,2)]));

end

function sigm = sigmoid(x)
    sigm = 1 ./ (1 + exp(-x));
end

sparseAutoencoderCost.m
function [cost, grad] = sparseAutoencoderCost(theta, visibleSize, hiddenSize, ...
                                              lambda, sparsityParam, beta, data)
% visibleSize:   the number of input units
% hiddenSize:    the number of hidden units
% lambda:        weight decay parameter
% sparsityParam: the desired average activation for the hidden units (rho in the lecture notes)
% beta:          weight of the sparsity penalty term
% data:          matrix containing the training data as columns, so data(:,i) is the i-th training example

% The input theta is a vector (because minFunc expects the parameters to be a vector).
% Convert it to the (W1, W2, b1, b2) matrix/vector format of the lecture notes.
W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
W2 = reshape(theta(hiddenSize*visibleSize+1:2*hiddenSize*visibleSize), visibleSize, hiddenSize);
b1 = theta(2*hiddenSize*visibleSize+1 : 2*hiddenSize*visibleSize+hiddenSize);
b2 = theta(2*hiddenSize*visibleSize+hiddenSize+1 : end);

% Cost and gradient variables, initialized to zeros
cost = 0;
W1grad = zeros(size(W1));
W2grad = zeros(size(W2));
b1grad = zeros(size(b1));
b2grad = zeros(size(b2));

Jcost = 0;      % reconstruction error
Jweight = 0;    % weight decay penalty
Jsparse = 0;    % sparsity penalty
[n, m] = size(data);   % m is the number of examples, n the number of features

% Forward pass: linear combinations and activations of each layer
z2 = W1*data + repmat(b1, 1, m);   % b1 must be replicated to m columns
a2 = sigmoid(z2);
z3 = W2*a2 + repmat(b2, 1, m);
a3 = sigmoid(z3);

% Reconstruction error
Jcost = (0.5/m) * sum(sum((a3-data).^2));

% Weight decay term
Jweight = (1/2) * (sum(sum(W1.^2)) + sum(sum(W2.^2)));

% Sparsity penalty
rho = (1/m) .* sum(a2, 2);   % average activation of the hidden layer
Jsparse = sum(sparsityParam .* log(sparsityParam./rho) + ...
              (1-sparsityParam) .* log((1-sparsityParam)./(1-rho)));

% Total cost
cost = Jcost + lambda*Jweight + beta*Jsparse;

% Backward pass: error term of each layer
d3 = -(data-a3) .* sigmoidInv(z3);
sterm = beta * (-sparsityParam./rho + (1-sparsityParam)./(1-rho));   % extra term from the sparsity penalty
d2 = (W2'*d3 + repmat(sterm, 1, m)) .* sigmoidInv(z2);

% Gradients
W1grad = W1grad + d2*data';
W1grad = (1/m)*W1grad + lambda*W1;

W2grad = W2grad + d3*a2';
W2grad = (1/m).*W2grad + lambda*W2;

b1grad = b1grad + sum(d2, 2);
b1grad = (1/m)*b1grad;   % the bias gradient is a vector: sum over the columns

b2grad = b2grad + sum(d3, 2);
b2grad = (1/m)*b2grad;

grad = [W1grad(:) ; W2grad(:) ; b1grad(:) ; b2grad(:)];

end

function sigm = sigmoid(x)
    sigm = 1 ./ (1 + exp(-x));
end

function sigmInv = sigmoidInv(x)
    sigmInv = sigmoid(x).*(1-sigmoid(x));
end

stack2params.m
function [params, netconfig] = stack2params(stack)

params = [];
for d = 1:numel(stack)
    params = [params ; stack{d}.w(:) ; stack{d}.b(:)];

    assert(size(stack{d}.w, 1) == size(stack{d}.b, 1), ...
        ['The bias should be a *column* vector of ' ...
         int2str(size(stack{d}.w, 1)) 'x1']);
    if d < numel(stack)
        assert(size(stack{d}.w, 1) == size(stack{d+1}.w, 2), ...
            ['The adjacent layers L' int2str(d) ' and L' int2str(d+1) ...
             ' should have matching sizes.']);
    end
end

if nargout > 1
    % Setup netconfig
    if numel(stack) == 0
        netconfig.inputsize = 0;
        netconfig.layersizes = {};
    else
        netconfig.inputsize = size(stack{1}.w, 2);
        netconfig.layersizes = {};
        for d = 1:numel(stack)
            netconfig.layersizes = [netconfig.layersizes ; size(stack{d}.w, 1)];
        end
    end
end

end
Experimental results:
The recognition rate is essentially above 90% for all of the hidden-layer sizes tested; with 25 hidden units it reaches 93.36%.