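How naive Bayes classification works:
1. Let D be a set of training tuples and their associated class labels.
2. Suppose there are m classes C1, C2, C3, ..., Cm. Given a tuple X, the classifier predicts that X belongs to the class with the highest posterior probability conditioned on X. That is, naive Bayes predicts class Ci if and only if P(Ci|X) > P(Cj|X) for every j ≠ i.
Bayes' theorem: P(Ci|X) = P(X|Ci)P(Ci)/P(X)
3. The problem therefore reduces to comparing P(X|Ci)P(Ci)/P(X) across the classes. P(X) is the same for every class, so only the numerator P(X|Ci)P(Ci) has to be compared. First estimate the prior P(Ci) as the fraction of training tuples in D that belong to class Ci.
4. Assuming class-conditional independence, P(X|Ci) = P(x1|Ci) * P(x2|Ci) * ... * P(xn|Ci). Compare the resulting values across the classes to decide which class X belongs to.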
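Steps 3 and 4 can be sketched in a few lines of MATLAB. The numbers below are hypothetical (two classes, one tuple with two attribute values) and the variable names prior, lik, score and posterior are chosen only for illustration; this is a minimal sketch of the decision rule, not the listing further down.

```matlab
% Hypothetical probabilities, for illustration only.
prior = [0.4; 0.6];              % P(C1), P(C2)
lik   = [0.5 0.2;                % P(x1|C1), P(x2|C1)
         0.1 0.7];               % P(x1|C2), P(x2|C2)

% Naive independence: P(X|Ci) is the product along each row.
score = prior .* prod(lik, 2);   % P(Ci) * P(X|Ci), proportional to P(Ci|X)
[~, best] = max(score);          % index of the predicted class

% Dividing by P(X) = sum(score) rescales the scores but cannot change the winner.
posterior = score / sum(score);
fprintf('predicted class: C%d\n', best);
```

With these numbers the scores are 0.040 for C1 and 0.042 for C2, so C2 is chosen even though C1 has the larger value of P(x1|Ci).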
Training set (train.txt, one tuple per line; the last column is the class label):
<30 high no fair no
<30 high no excellent no
30-40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
30-40 low yes excellent yes
<30 medium no fair no
<30 low yes fair yes
>40 medium yes fair yes
<30 medium yes excellent yes
30-40 medium yes excellent yes
30-40 high yes fair yes
>40 medium no excellent no
Test set (predict1.txt; the class label is what has to be predicted):
<30 medium yes fair
>40 high no excellent
30-40 low no excellent
>40 high no fair
<30 medium no fair
Source code (MATLAB):
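%function out=my_bayes(X,Y)
%X is the training data set, Y is the data to predict, out returns the predictions
function out=bayes()
%%%%%%%%%%%%%%%%%%%%%% read train.txt
%note: textread and strread are marked "not recommended" in current MATLAB releases; textscan is the documented replacement
clc;
file = textread('train.txt','%s','delimiter','\n','whitespace','');% read one line per element; whitespace is set to '' so the spaces inside a line are kept
[m,n]=size(file);
for i=1:m
    words=strread(file{i},'%s','delimiter',' ');% split line i on spaces into a cell array of strings
    words=words';
    X{i}=words;
end% X is now 1*14; each element is itself a cell row of strings, e.g. X{1} is '<30' 'high' 'no' 'fair' 'no'
X=X';% transpose to 14*1
%%%%%%%%%%%%%%%%%%%%% read predict1.txt
file = textread('predict1.txt','%s','delimiter','\n','whitespace','');
[m,n]=size(file);
for i=1:m
    words=strread(file{i},'%s','delimiter',' ');
    words=words';
    Y{i}=words;
end
Y=Y';% transpose to a column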
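%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% training
%attribute, probality and post_prob are helper functions that are not included in this listing
[M,N]=size(X);
[m,n]=size(X{1});
decision=attribute(X,n); % extract the decision attribute, i.e. the class-label column
[ProName,Pro]=probality(decision);% prior probability of each value of the decision attribute
for i=1:n-1
    [post_pro{i},post_name{i}]=post_prob(attribute(X,i),decision); % class-conditional probabilities P(xk|Ci) for condition attribute i
end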
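%%%%%%%%%%%%%%%%%%%%%%%% prediction
uniq_decis=unique(decision); % distinct class labels
P_X=ones(size(uniq_decis,1),1); % running product of conditional probabilities, one row per class
[M,N]=size(Y);
k=1;
for i=1:M
    for j=1:n-1
        [temp,loc]=ismember(attribute({Y{i}},j),unique(attribute(X,j)));% position of the test tuple's value among the distinct training values of attribute j (loc is 0 if the value never occurs in training, and the indexing below then fails)
        P_X=post_pro{j}(:,loc).*P_X;% multiply in the conditional probability for attribute j (naive independence assumption)
        %post_pro{j}(:,loc): j is the attribute index and loc is the index of the attribute value; the column holds that value's conditional probability under each class (no and yes)
    end
    %P_X has two rows: the product of the conditional probabilities under each class
    P_X=P_X.*Pro;% multiply by the class priors
    [MAX,I]=max(P_X);% find the largest score
    out{k}=uniq_decis{I};% the sample is assigned to the class whose posterior score is largest
    k=k+1;
    P_X=ones(size(uniq_decis,1),1);% reset P_X for the next sample
end
out=out'; % return the result as a column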
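The listing above calls attribute, probality and post_prob without showing them. The script below is a hedged, self-contained sketch of the same computation, not the author's helpers: it inlines the training and test tuples instead of reading train.txt and predict1.txt, and builds the prior and conditional tables with the built-in unique and accumarray. The names trainSet, querySet, condProb, vals and pred are assumptions introduced only for this example.

```matlab
% Sketch under stated assumptions: data inlined, hypothetical variable names.
trainSet = { ...
  '<30'   'high'   'no'  'fair'      'no';  '<30'   'high'   'no'  'excellent' 'no';  ...
  '30-40' 'high'   'no'  'fair'      'yes'; '>40'   'medium' 'no'  'fair'      'yes'; ...
  '>40'   'low'    'yes' 'fair'      'yes'; '>40'   'low'    'yes' 'excellent' 'no';  ...
  '30-40' 'low'    'yes' 'excellent' 'yes'; '<30'   'medium' 'no'  'fair'      'no';  ...
  '<30'   'low'    'yes' 'fair'      'yes'; '>40'   'medium' 'yes' 'fair'      'yes'; ...
  '<30'   'medium' 'yes' 'excellent' 'yes'; '30-40' 'medium' 'yes' 'excellent' 'yes'; ...
  '30-40' 'high'   'yes' 'fair'      'yes'; '>40'   'medium' 'no'  'excellent' 'no'};
querySet = { ...
  '<30'   'medium' 'yes' 'fair';      '>40' 'high' 'no' 'excellent'; ...
  '30-40' 'low'    'no'  'excellent'; '>40' 'high' 'no' 'fair'; ...
  '<30'   'medium' 'no'  'fair'};

nAttr = size(trainSet, 2) - 1;                 % last column is the class label
[classes, ~, ci] = unique(trainSet(:, end));   % sorted labels: 'no', 'yes'
prior = accumarray(ci(:), 1) / numel(ci);      % P(Ci) as relative frequencies

% condProb{k}(i, v) = P(attribute k takes its v-th distinct value | class i)
condProb = cell(1, nAttr);
vals = cell(1, nAttr);
for k = 1:nAttr
    [vals{k}, ~, vi] = unique(trainSet(:, k));
    counts = accumarray([ci(:), vi(:)], 1, [numel(classes), numel(vals{k})]);
    condProb{k} = bsxfun(@rdivide, counts, sum(counts, 2));
end

pred = cell(size(querySet, 1), 1);
for t = 1:size(querySet, 1)
    score = prior;                             % start from P(Ci)
    for k = 1:nAttr
        [found, loc] = ismember(querySet{t, k}, vals{k});
        assert(found, 'value %s was not seen in training', querySet{t, k});
        score = score .* condProb{k}(:, loc);  % times P(xk|Ci)
    end
    [~, best] = max(score);
    pred{t} = classes{best};
end
disp(pred')
```

The sketch uses raw relative frequencies, as the listing appears to, so a value that never occurs with a class gives that class a score of exactly zero (the third test tuple, whose first attribute value 30-40 never occurs with class no). Laplace smoothing is the usual remedy, but it is not part of the original code.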
Result:
>> out
out =
'yes'
'no'
'yes'
'no'
'no'
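The first prediction can be checked by hand. For the first test tuple (<30 medium yes fair), counting in the training set gives 9 tuples of class yes and 5 of class no. Writing x1 to x4 for the four attribute values of the tuple, the two scores from step 4 work out as follows (P(X) is omitted because it is the same for both classes):

```latex
\begin{aligned}
P(\text{yes})\prod_{k=1}^{4} P(x_k \mid \text{yes})
  &= \frac{9}{14}\cdot\frac{2}{9}\cdot\frac{4}{9}\cdot\frac{7}{9}\cdot\frac{6}{9}
   = \frac{8}{243} \approx 0.0329 \\
P(\text{no})\prod_{k=1}^{4} P(x_k \mid \text{no})
  &= \frac{5}{14}\cdot\frac{3}{5}\cdot\frac{2}{5}\cdot\frac{1}{5}\cdot\frac{2}{5}
   = \frac{6}{875} \approx 0.0069
\end{aligned}
```

The score for yes is the larger one, which agrees with the first line of the output.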