This series of articles is edited by 云端暮雪. When reposting, please credit the source:
http://blog.csdn.net/lyunduanmuxue/article/details/20068781
Thank you!
Introduction
Today we introduce a simple and efficient classifier: the Naive Bayes classifier (朴素贝叶斯分类器).
Anyone who has studied probability theory should find the name Bayes familiar, because one of the subject's most important results bears his name. This is Bayes' formula:

P(C|X) = P(X|C) * P(C) / P(X)

The Bayes classifier is built on this formula. The word "naive" is added because the classifier makes a strong assumption about the data: given the class, the features of a sample are mutually independent. This assumption rarely holds exactly, but it does little harm to the classifier's practical usefulness. In 1997, Domingos and Pazzani demonstrated experimentally that the classifier still performs well even when its independence assumption is violated. One explanation for this is that the classifier has relatively few parameters to train, so it largely avoids overfitting.
Implementation Notes
Let us now implement the Naive Bayes classifier step by step.
Training the classifier takes two steps:
- compute the prior probabilities;
- compute the likelihoods.
To apply the classifier, we simply use the priors and likelihoods obtained during training to compute the posterior probability of each class.
The prior probability is just the probability that each class occurs. Estimating it is a simple counting problem: compute the proportion of the training set that belongs to each class.
Training the likelihoods is similar: for each feature, estimate the probability of each of its values conditioned on each class.
As for the posterior, we usually do not compute it in full. Instead we compute only the numerator on the right-hand side of Bayes' formula, because the denominator is just a normalizing factor, which is a constant for any particular problem.
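Concretely, for a record with feature values x_1, ..., x_n, the predicted class is the one that maximizes the product of the prior and the per-feature likelihoods (this is just the numerator of Bayes' formula, expanded under the independence assumption):

c* = argmax over classes c of P(c) * P(x_1|c) * P(x_2|c) * ... * P(x_n|c)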
Code Examples
Now that we have a basic understanding of the Naive Bayes classifier, let us try to build one in MATLAB.
First, we compute the prior probabilities:
function priors = nbc_Priors(training)
%NBC_PRIORS calculates the priors for each class by using the training data
%set.
%% priors = nbc_Priors(training)
%% Input:
%  training - a struct representing the training data set
%    training.class - the class of each data record
%    training.features - the features of each data record
%% Output:
%  priors - a struct representing the priors of each class
%    priors.class - the class labels
%    priors.value - the priors of the corresponding classes
%% Run this code to get some examples:
% nbc_mushroom
%% Edited by X. Sun
% My homepage: http://pamixsun.github.io/
%%
% Check the input arguments
if nargin < 1
    error(message('MATLAB:UNIQUE:NotEnoughInputs'));
end
% Extract the class labels
priors.class = unique(training.class);
% Initialize priors.value
priors.value = zeros(1, length(priors.class));
% Calculate the priors as the class frequencies in the training set
for i = 1 : length(priors.class)
    priors.value(i) = sum(training.class == priors.class(i)) / length(training.class);
end
% Check the results (allow for floating-point rounding error)
if abs(sum(priors.value) - 1) > 1e-10
    error('Prior error');
end
end
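As a quick sanity check, nbc_Priors can be run on a toy data set like the one below (the values here are made up purely for illustration and are not part of the mushroom example):

% A toy training set with two classes, 'a' and 'b' (hypothetical data)
training.class = ['a'; 'a'; 'b'; 'a'];
training.features = ['xy'; 'xz'; 'yz'; 'yy'];
priors = nbc_Priors(training)
% Expected: priors.class = ['a'; 'b'] and priors.value = [0.75 0.25]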
Next, we train the complete Naive Bayes classifier:
function [likelihood, priors] = train_nbc(training, featureValues, addOne)
%TRAIN_NBC trains a naive Bayes classifier using the training data set.
%% [likelihood, priors] = train_nbc(training, featureValues, addOne)
%% Input:
%  training - a struct representing the training data set
%    training.class - the class of each data record
%    training.features - the features of each data record
%  featureValues - a cell array containing the possible values of each feature
%  addOne - whether to use add-one (Laplace) smoothing;
%           1 indicates yes, 0 otherwise.
%% Output:
%  likelihood - a struct representing the likelihoods
%    likelihood.matrixColnames - the feature values
%    likelihood.matrixRownames - the class labels
%    likelihood.matrix - the likelihood values
%  priors - a struct representing the priors of each class
%    priors.class - the class labels
%    priors.value - the priors of the corresponding classes
%% Run this code to get some examples:
% nbc_mushroom
%% Edited by X. Sun
% My homepage: http://pamixsun.github.io/
%%
% Check the input arguments
if nargin < 2
    error(message('MATLAB:UNIQUE:NotEnoughInputs'));
end
% Set the default value for addOne if it is not given
if nargin == 2
    addOne = 0;
end
% Calculate the priors
priors = nbc_Priors(training);
% Learn the features by calculating the likelihoods
for i = 1 : size(training.features, 2)
    uniqueFeatureValues = featureValues{i};
    trainingFeatureValues = training.features(:, i);
    likelihood.matrixColnames{i} = uniqueFeatureValues;
    likelihood.matrixRownames{i} = priors.class;
    likelihood.matrix{i} = zeros(length(priors.class), length(uniqueFeatureValues));
    for j = 1 : length(uniqueFeatureValues)
        item = uniqueFeatureValues(j);
        for k = 1 : length(priors.class)
            % Select the training values of feature i among records of class k
            featureValuesInclass = ...
                trainingFeatureValues(training.class == priors.class(k));
            % Relative frequency of value j in class k, optionally smoothed
            likelihood.matrix{i}(k, j) = ...
                (length(featureValuesInclass(featureValuesInclass == item)) + 1 * addOne)...
                / (length(featureValuesInclass) + addOne * length(uniqueFeatureValues));
        end
    end
end
end
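Note the expression in the innermost loop: when addOne is 1, it implements add-one (Laplace) smoothing. Writing N(v, c) for the number of training records of class c whose feature takes the value v, N(c) for the number of training records of class c, and |V| for the number of possible values of the feature, the estimated likelihood is

P(x = v | c) = (N(v, c) + 1) / (N(c) + |V|)

This prevents a feature value that never co-occurs with some class in the training data from driving the entire posterior product to zero.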
Finally, we apply the classifier we have trained:
function [predictive, posterior] = predict_nbc(test, priors, likelihood)
%PREDICT_NBC uses a naive Bayes classifier to predict the class labels of
%the test data set.
%% [predictive, posterior] = predict_nbc(test, priors, likelihood)
%% Input:
%  test - a struct representing the test data set
%    test.class - the class of each data record
%    test.features - the features of each data record
%  priors - a struct representing the priors of each class
%    priors.class - the class labels
%    priors.value - the priors of the corresponding classes
%  likelihood - a struct representing the likelihoods
%    likelihood.matrixColnames - the feature values
%    likelihood.matrixRownames - the class labels
%    likelihood.matrix - the likelihood values
%% Output:
%  predictive - the predictive results on the test data set
%    predictive.class - the predicted class of each data record
%  posterior - a struct representing the posteriors of each class
%    posterior.class - the class labels
%    posterior.value - the posteriors of the corresponding classes
%% Run this code to get some examples:
% nbc_mushroom
%% Edited by X. Sun
% My homepage: http://pamixsun.github.io/
%%
% Check the input arguments
if nargin < 3
    error(message('MATLAB:UNIQUE:NotEnoughInputs'));
end
posterior.class = priors.class;
% Calculate the posteriors for each test data record
predictive.class = zeros(size(test.features, 1), 1);
posterior.value = zeros(size(test.features, 1), length(priors.class));
for i = 1 : size(test.features, 1)
    record = test.features(i, :);
    % Calculate the (unnormalized) posterior for each possible class
    for j = 1 : length(priors.class)
        % Initialize the posterior with the prior of class j
        posteriorValue = priors.value(j);
        % Multiply in the likelihood of each feature value
        for k = 1 : length(record)
            item = record(k);
            likelihoodValue = ...
                likelihood.matrix{k}(j, likelihood.matrixColnames{k}(:) == item);
            posteriorValue = posteriorValue * likelihoodValue;
        end
        posterior.value(i, j) = posteriorValue;
    end
    % Get the predicted class (the first maximum in case of ties)
    [~, maxIndex] = max(posterior.value(i, :));
    predictive.class(i) = posterior.class(maxIndex);
end
% Convert the numeric labels back to characters, as a column vector
predictive.class = char(predictive.class);
predictive.class = predictive.class(:);
end
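One caveat about predict_nbc (an observation, not a change to the code above): the posterior is computed as a product of a prior and one likelihood per feature, and for data sets with many features such products of small probabilities can underflow to zero. A common remedy is to accumulate log-probabilities instead; a minimal sketch of how the inner loop could be rewritten:

% Accumulate log-probabilities to avoid numerical underflow
posteriorValue = log(priors.value(j));
for k = 1 : length(record)
    item = record(k);
    likelihoodValue = ...
        likelihood.matrix{k}(j, likelihood.matrixColnames{k}(:) == item);
    posteriorValue = posteriorValue + log(likelihoodValue);
end

Because the logarithm is monotonic, the predicted class is unchanged; only the stored posterior values would then be on a log scale.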
To verify that our classifier works correctly, we test it on the mushroom data set from the UCI repository.
The test code is as follows (save it as nbc_mushroom.m):
%% Initialize the environment
close all;
clear all;
clc;
%% Import data from file
originalData = importdata('agaricus-lepiota.data');
featureValues = importdata('featureValues');
%% Retrieve class and feature
N = length(originalData);
predata = zeros(N, 23);
for i = 1 : N
    originalData{i} = strrep(originalData{i}, ',', '');
    predata(i, :) = originalData{i}(:)';
end
for i = 1 : length(featureValues)
    featureValues{i} = strrep(featureValues{i}, ',', '');
end
predata = char(predata);
data.class = predata(:, 1);
data.features = predata(:, 2:end);
clear originalData;
clear predata;
%% Visualize the data to gain an intuitive understanding
figure('color', 'white');
visualData_mushroom(data);
%% Train and test Naive Bayes
% Set the seed to make the results reproducible
seed = 1;
rng(seed);
% Randomly permute the data
dataSize = length(data.class);
permIndex = randperm(dataSize);
% Construct the training data set
training.class = data.class(permIndex(5001 : end));
training.features = data.features(permIndex(5001 : end), :);
% Construct the test data set
test.class = data.class(permIndex(1 : 5000));
test.features = data.features(permIndex(1 : 5000), :);
% Train a NBC
[likelihood, priors] = train_nbc(training, featureValues);
% Apply a NBC
[predictive, posterior] = predict_nbc(test, priors, likelihood);
% Calculate the accuracy
accuracy = sum(predictive.class == test.class) / length(test.class)
The data visualization is shown below, and the accuracy on the test set comes out to 99.94%.
Closing Remarks
All of the source code and data sets can be downloaded from my download page:
http://download.csdn.net/detail/longyindiyi/7994137
Of course, the code above is not perfect; it still has some flaws and shortcomings, which readers are invited to find for themselves. Attentive readers may also have noticed that the code only handles discrete feature values. How, then, should continuous feature values be handled? One common starting point is sketched below, and further discussion in the comments is welcome.
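A frequently used option (not implemented in the code above, and offered here only as a sketch) is to assume that, within each class, a continuous feature follows a normal distribution: estimate a per-class mean and standard deviation during training, then evaluate the Gaussian density in place of the discrete likelihood lookup. In illustrative MATLAB terms, where x is the feature value of the test record and featureValuesInclass holds the (numeric) training values of that feature in one class:

% Hypothetical sketch of a Gaussian likelihood for one continuous feature
mu = mean(featureValuesInclass);
sigma = std(featureValuesInclass);
likelihoodValue = exp(-(x - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi));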
If you have any other questions, please describe them in a reply.