The naive Bayes classifier is built on Bayes' theorem, which is why this family of methods is collectively called Bayesian classification. Naive Bayes is simple and efficient, and is one of the first methods worth considering for a classification problem.
Bayes' theorem answers a question that comes up constantly in practice: given one conditional probability, how do we obtain the probability with the two events exchanged? That is, knowing P(A|B), how do we compute P(B|A)?
I. Preliminaries
1. Conditional probability
For events A and B, the conditional probability \(P(B|A)\) is the probability that B occurs given that A has occurred. It is related to the unconditional probabilities as follows:
\[P(B|A)=\frac{P(AB)}{P(A)}\]
Example I. A box holds five table-tennis balls (3 new, 2 old). Balls are drawn one at a time, twice, without replacement. Find the probability that the second ball drawn is new, given that the first ball drawn was new.
Solution:
Let A = "the first ball drawn is new" and B = "the second ball drawn is new". The probability that the second ball is new given that the first was new is
\[P(B|A)=\frac{P(AB)}{P(A)}=\frac{\frac{3}{5}\times\frac{2}{4}}{\frac{3}{5}}=\frac{2}{4}=0.5\]
2. Independence of events
Events A and B are said to be mutually independent if the occurrence of either one does not affect the probability of the other, i.e.,
\[P(AB)=P(A)\times P(B)\]
When the two events are independent,
\[P(B|A)=\frac{P(AB)}{P(A)}=\frac{P(A)\times P(B)}{P(A)}=P(B)\]
Example II. A box holds five table-tennis balls (3 new, 2 old). Balls are drawn one at a time, twice, with replacement. Find the probability that the second ball drawn is new, given that the first ball drawn was new.
Solution: Because the sampling is with replacement, whether the first draw yields a new or an old ball has no effect on the probabilities for the second draw.
Let A = "the first ball drawn is new" and B = "the second ball drawn is new". The probability that the second ball is new given that the first was new is
\[P(B|A)=\frac{P(AB)}{P(A)}=\frac{P(A)\times P(B)}{P(A)}=P(B)=\frac{3}{5}\]
3. Law of total probability
If the events \(A_1,A_2,\dots,A_n\) satisfy:
a. \(A_1,A_2,\dots,A_n\) are mutually exclusive (no two can occur simultaneously), and \(P(A_i)>0\ (i=1,2,\dots,n)\);
b. \(A_1+A_2+\dots+A_n=U\) (completeness: together they exhaust the sample space),
then for any event B,
\[P(B)=\sum^n_{i=1}P(A_i)\times P(B|A_i)\]
Example III. A box holds five table-tennis balls (3 new, 2 old). Balls are drawn one at a time, twice, without replacement. Find the probability that the second ball drawn is new.
Solution: Because the sampling is without replacement, the first draw affects the probabilities for the second draw, so we must account for not knowing whether the first ball drawn was new or old.
Let A = "the first ball drawn is new", \(\overline{A}\) = "the first ball drawn is old", and B = "the second ball drawn is new". Since
\[B=BA+B\overline{A}\]
and \(BA\) and \(B\overline{A}\) are mutually exclusive, we have
\[P(B)=P(BA)+P(B\overline{A})\]
The probability that the second ball drawn is new is therefore
\[P(B)=P(BA)+P(B\overline{A})=\frac{3}{5}\times\frac{2}{4}+\frac{2}{5}\times\frac{3}{4}=\frac{3}{5}\]
Example IV. Three shooters fire at the same aircraft, with hit probabilities 0.4, 0.5, and 0.7 respectively. If exactly one shooter hits, the aircraft crashes with probability 0.2; if exactly two hit, it crashes with probability 0.6; if all three hit, it is certain to crash. Find the probability that the aircraft crashes.
Solution: Let B = "the aircraft crashes", \(A_0\) = "no shooter hits", \(A_1\) = "exactly one shooter hits", \(A_2\) = "exactly two shooters hit", \(A_3\) = "all three shooters hit". Clearly \(A_0,A_1,A_2,A_3\) form a complete group of events (they cover every possible outcome). By the addition and multiplication rules for probabilities:
\(P(A_0)=(1-0.4)\times (1-0.5)\times (1-0.7)=0.6\times 0.5\times 0.3=0.09\)
\(P(A_1)=0.4\times 0.5\times 0.3+0.6\times 0.5\times 0.3+0.6\times 0.5\times 0.7=0.36\)
\(P(A_2)=0.6\times 0.5\times 0.7+0.4\times 0.5\times 0.7+0.4\times 0.5\times 0.3=0.41\)
\(P(A_3)=0.4\times 0.5\times 0.7=0.14\)
From the problem statement,
\[P(B|A_0)=0,\hspace{1cm}P(B|A_1)=0.2,\hspace{1cm}P(B|A_2)=0.6,\hspace{1cm}P(B|A_3)=1\]
The law of total probability then gives
\[P(B)=\sum^3_{i=0}P(A_i)\times P(B|A_i)=0.09\times 0+0.36\times 0.2+0.41\times 0.6+0.14\times 1=0.458\]
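As a concrete cross-check of Example IV, here is a minimal plain-JavaScript sketch (illustrative only, not part of the webTJ library) that enumerates the eight hit/miss outcomes, accumulates \(P(A_0),\dots,P(A_3)\), and applies the law of total probability:

```javascript
// Enumerate the 2^3 hit/miss outcomes of the three shooters,
// accumulate P(A_0..A_3), then apply the law of total probability.
var pHit = [0.4, 0.5, 0.7];              // hit probabilities of the three shooters
var pCrashGivenHits = [0, 0.2, 0.6, 1];  // P(B|A_0), ..., P(B|A_3)
var pA = [0, 0, 0, 0];                   // P(A_0), ..., P(A_3)
for (var mask = 0; mask < 8; mask++) {
  var p = 1, hits = 0;
  for (var i = 0; i < 3; i++) {
    if (mask & (1 << i)) { p *= pHit[i]; hits++; }
    else { p *= 1 - pHit[i]; }
  }
  pA[hits] += p;
}
var pB = 0;
for (var k = 0; k <= 3; k++) { pB += pA[k] * pCrashGivenHits[k]; }
console.log(pA); // [0.09, 0.36, 0.41, 0.14]
console.log(pB); // 0.458
```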
4. Bayes' formula (the inverse probability formula)
Let \(A_1,A_2,\dots,A_n\) be a complete group of events. Then for any event B,
\[P(A_j|B)=\frac{P(A_jB)}{P(B)}=\frac{P(A_j)\times P(B|A_j)}{\sum^n_{i=1}P(A_i)\times P(B|A_i)}\]
Example V. As in Example IV, three shooters fire at the same aircraft with hit probabilities 0.4, 0.5, and 0.7; the aircraft crashes with probability 0.2 if exactly one shooter hits, 0.6 if exactly two hit, and 1 if all three hit. Given that the aircraft crashed, find the probability that only the first shooter hit it.
Solution: Let B = "the aircraft crashes" and S = "only the first shooter hits". Given that the aircraft crashed, the probability that only the first shooter hit it is
\[P(S|B)=\frac{P(SB)}{P(B)}=\frac{P(SB)}{\sum^3_{i=0}P(A_i)\times P(B|A_i)}=\frac{0.4\times 0.5\times 0.3\times 0.2}{0.458}=\frac{0.012}{0.458}=0.0262\]
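Continuing the sketch above, the posterior probability in Example V is a single line of arithmetic:

```javascript
// P(S|B) = P(S) * P(B|S) / P(B), with S = "only the first shooter hits".
var pS = 0.4 * (1 - 0.5) * (1 - 0.7);   // = 0.06
console.log((pS * 0.2) / 0.458);         // ≈ 0.0262
```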
5. The naive Bayes classifier formula
Suppose a sample set has n features, \(F_1, F_2, \dots, F_n\), and the class variable C takes m categories, \(C_1, C_2, \dots, C_m\). The Bayes classifier computes, for given values of the features \(F_1, F_2, \dots, F_n\), the probability of each category of C, i.e.,
\[P(C|F_1F_2\dots F_n)=\frac{P(F_1F_2\dots F_n|C)\times P(C)}{P(F_1F_2\dots F_n)}\]
This uses conditional probability (with multiple conditions) together with the inverse probability formula. Since \(P(F_1F_2\dots F_n)\) is the same for every category, it can be dropped, and the problem reduces to maximizing the numerator \(P(F_1F_2\dots F_n|C)\times P(C)\) (finding the most probable class).
The defining assumption of the naive Bayes classifier is that all features are independent of one another given the class, so that
\[P(F_1F_2\dots F_n|C)\times P(C)=P(F_1|C)\times P(F_2|C)\dots P(F_n|C)\times P(C)\]
Every factor on the right-hand side of this equation can be estimated from the training data, so the score of each category can be computed and the category with the largest score selected. Although the assumption that all features are mutually independent rarely holds exactly in practice, it greatly simplifies the computation, and studies have shown that it usually has little effect on classification accuracy.
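To make the formula concrete, here is a minimal discrete naive Bayes classifier in plain JavaScript (a sketch for exposition, independent of the webTJ implementation). It scores each class by its prior times the product of per-feature conditional probabilities, all estimated by counting, and returns the highest-scoring class:

```javascript
// samples: 2-D array of discrete feature values; labels: class of each row;
// x: one query sample. Returns the class maximizing P(C) * Π P(F_i = x_i | C).
function naiveBayesClassify(samples, labels, x) {
  var counts = {};                                   // rows per class
  labels.forEach(function (c) { counts[c] = (counts[c] || 0) + 1; });
  var best = null, bestScore = -1;
  Object.keys(counts).forEach(function (c) {
    var score = counts[c] / labels.length;           // prior P(C=c)
    for (var i = 0; i < x.length; i++) {
      var match = 0;
      for (var j = 0; j < samples.length; j++) {
        if (labels[j] === c && samples[j][i] === x[i]) { match++; }
      }
      score *= match / counts[c];                    // P(F_i = x_i | C=c)
    }
    if (score > bestScore) { bestScore = score; best = c; }
  });
  return { label: best, score: bestScore };
}
```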
II. The naive Bayes probability model
1. Discrete feature attributes
The discrete feature data are given below (Table 1: buys-computer data):
No. | Age | Income | Student | Credit-rating | Buy-computer |
---|---|---|---|---|---|
1 | $\leq 30$ | high | no | fair | no |
2 | $\leq 30$ | high | no | excellent | no |
3 | 31-40 | high | no | fair | yes |
4 | >40 | medium | no | fair | yes |
5 | >40 | low | yes | fair | yes |
6 | >40 | low | yes | excellent | no |
7 | 31-40 | low | yes | excellent | yes |
8 | $\leq 30$ | medium | no | fair | no |
9 | $\leq 30$ | low | yes | fair | yes |
10 | >40 | medium | yes | fair | yes |
11 | $\leq 30$ | medium | yes | excellent | yes |
12 | 31-40 | medium | no | excellent | yes |
13 | 31-40 | high | yes | fair | yes |
14 | >40 | medium | no | excellent | no |
Suppose a person has the following data:
\(Age\leq 30\), Income = medium, Student = yes, Credit-rating = fair. Use the naive Bayes classifier to decide whether this person buys a computer (Buy-computer = ?).
To keep the probability formulas compact, the table is simplified as follows (Table 2):
No. | Age | Income | Student | Credit | Buy |
---|---|---|---|---|---|
1 | A | H | N | F | N |
2 | A | H | N | E | N |
3 | B | H | N | F | Y |
4 | C | M | N | F | Y |
5 | C | L | Y | F | Y |
6 | C | L | Y | E | N |
7 | B | L | Y | E | Y |
8 | A | M | N | F | N |
9 | A | L | Y | F | Y |
10 | C | M | Y | F | Y |
11 | A | M | Y | E | Y |
12 | B | M | N | E | Y |
13 | B | H | Y | F | Y |
14 | C | M | N | E | N |
In simplified form the person's data become:
\(Age = A, Income = M, Student = Y, Credit = F\).
Use the naive Bayes classifier to decide whether this person buys a computer (Buy = ?).
The prior probability of each class of the decision variable (Buy) is
\[P(Buy=Y)=\frac{9}{14}=0.642857, \hspace{1cm}P(Buy=N)=\frac{5}{14}=0.357143\]
The conditional probability of each observed feature value given the class Buy = Y:
\[P(Age=A|Buy=Y)=\frac{P(Age=A\hspace{0.2cm} and\hspace{0.2cm} Buy=Y)}{P(Buy=Y)}=\frac{2/14}{9/14}=\frac{2}{9}=0.222222\]
\[P(Income=M|Buy=Y)=\frac{P(Income=M\hspace{0.2cm} and\hspace{0.2cm} Buy=Y)}{P(Buy=Y)}=\frac{4/14}{9/14}=\frac{4}{9}=0.444444\]
\[P(Student=Y|Buy=Y)=\frac{P(Student=Y\hspace{0.2cm} and\hspace{0.2cm} Buy=Y)}{P(Buy=Y)}=\frac{6/14}{9/14}=\frac{6}{9}=0.666667\]
\[P(Credit=F|Buy=Y)=\frac{P(Credit=F\hspace{0.2cm} and\hspace{0.2cm} Buy=Y)}{P(Buy=Y)}=\frac{6/14}{9/14}=\frac{6}{9}=0.666667\]
The conditional probability of each observed feature value given the class Buy = N:
\[P(Age=A|Buy=N)=\frac{P(Age=A\hspace{0.2cm} and\hspace{0.2cm} Buy=N)}{P(Buy=N)}=\frac{3/14}{5/14}=\frac{3}{5}=0.6\]
\[P(Income=M|Buy=N)=\frac{P(Income=M\hspace{0.2cm} and\hspace{0.2cm} Buy=N)}{P(Buy=N)}=\frac{2/14}{5/14}=\frac{2}{5}=0.4\]
\[P(Student=Y|Buy=N)=\frac{P(Student=Y\hspace{0.2cm} and\hspace{0.2cm} Buy=N)}{P(Buy=N)}=\frac{1/14}{5/14}=\frac{1}{5}=0.2\]
\[P(Credit=F|Buy=N)=\frac{P(Credit=F\hspace{0.2cm} and\hspace{0.2cm} Buy=N)}{P(Buy=N)}=\frac{2/14}{5/14}=\frac{2}{5}=0.4\]
For the person with data
\(Age = A, Income = M, Student = Y, Credit = F\),
and assuming all features are mutually independent given the class, Bayes' formula gives the probability of buying a computer as
\[P(Buy=Y|Age=A\hspace{0.1cm}and\hspace{0.1cm}Income=M\hspace{0.1cm}and\hspace{0.1cm}Student=Y\hspace{0.1cm}and\hspace{0.1cm}Credit=F)\]
\[\small{=\frac{P(Age=A\hspace{0.1cm}and\hspace{0.1cm}Income=M\hspace{0.1cm}and\hspace{0.1cm}Student=Y\hspace{0.1cm}and\hspace{0.1cm}Credit=F|Buy=Y)\times P(Buy=Y)}{P(Age=A\hspace{0.1cm}and\hspace{0.1cm}Income=M\hspace{0.1cm}and\hspace{0.1cm}Student=Y\hspace{0.1cm}and\hspace{0.1cm}Credit=F)}}\]
\[\small{=\frac{P(Age=A|Buy=Y)\times P(Income=M|Buy=Y)\times P(Student=Y|Buy=Y)\times P(Credit=F|Buy=Y)\times P(Buy=Y)}{P(Age=A\hspace{0.1cm}and\hspace{0.1cm}Income=M\hspace{0.1cm}and\hspace{0.1cm}Student=Y\hspace{0.1cm}and\hspace{0.1cm}Credit=F)}}\hspace{1cm}(1)\]
The probability that this person does not buy a computer is
\[P(Buy=N|Age=A\hspace{0.1cm}and\hspace{0.1cm}Income=M\hspace{0.1cm}and\hspace{0.1cm}Student=Y\hspace{0.1cm}and\hspace{0.1cm}Credit=F)\]
\[\small{=\frac{P(Age=A\hspace{0.1cm}and\hspace{0.1cm}Income=M\hspace{0.1cm}and\hspace{0.1cm}Student=Y\hspace{0.1cm}and\hspace{0.1cm}Credit=F|Buy=N)\times P(Buy=N)}{P(Age=A\hspace{0.1cm}and\hspace{0.1cm}Income=M\hspace{0.1cm}and\hspace{0.1cm}Student=Y\hspace{0.1cm}and\hspace{0.1cm}Credit=F)}}\]
\[\small{=\frac{P(Age=A|Buy=N)\times P(Income=M|Buy=N)\times P(Student=Y|Buy=N)\times P(Credit=F|Buy=N)\times P(Buy=N)}{P(Age=A\hspace{0.1cm}and\hspace{0.1cm}Income=M\hspace{0.1cm}and\hspace{0.1cm}Student=Y\hspace{0.1cm}and\hspace{0.1cm}Credit=F)}}\hspace{1cm}(2)\]
Since formulas (1) and (2) have the same denominator, their relative magnitude is determined by their numerators.
The numerator of formula (1) evaluates to
\[\small{P(Age=A|Buy=Y)\times P(Income=M|Buy=Y)\times P(Student=Y|Buy=Y)\times P(Credit=F|Buy=Y)\times P(Buy=Y)=\frac{2}{9}\times\frac{4}{9}\times\frac{6}{9}\times\frac{6}{9}\times\frac{9}{14}=0.028219}\]
The numerator of formula (2) evaluates to
\[\small{P(Age=A|Buy=N)\times P(Income=M|Buy=N)\times P(Student=Y|Buy=N)\times P(Credit=F|Buy=N)\times P(Buy=N)=\frac{3}{5}\times\frac{2}{5}\times\frac{1}{5}\times\frac{2}{5}\times\frac{5}{14}=0.006857}\]
Finally, since 0.028219 > 0.006857, the person with feature data (\(Age = A, Income = M, Student = Y, Credit = F\)) is classified as buying a computer.
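As a cross-check, the naiveBayesClassify sketch from section I.5 reproduces both scores (plain JavaScript, independent of webTJ):

```javascript
var oArrs = [['A','H','N','F'],['A','H','N','E'],['B','H','N','F'],
             ['C','M','N','F'],['C','L','Y','F'],['C','L','Y','E'],
             ['B','L','Y','E'],['A','M','N','F'],['A','L','Y','F'],
             ['C','M','Y','F'],['A','M','Y','E'],['B','M','N','E'],
             ['B','H','Y','F'],['C','M','N','E']];
var oCrr = ['N','N','Y','Y','Y','N','Y','N','Y','Y','Y','Y','Y','N'];
console.log(naiveBayesClassify(oArrs, oCrr, ['A','M','Y','F']));
// { label: 'Y', score: 0.028219... }  (the losing score for 'N' is 0.006857)
```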
2. Continuous feature attributes
The continuous feature data are given below (Table 3):
No. | Height (feet) | Weight (pounds) | Foot size (inches) | Sex |
---|---|---|---|---|
1 | 6 | 180 | 12 | male |
2 | 5.92 | 190 | 11 | male |
3 | 5.58 | 170 | 12 | male |
4 | 5.92 | 165 | 10 | male |
5 | 5 | 100 | 6 | female |
6 | 5.5 | 150 | 8 | female |
7 | 5.42 | 130 | 7 | female |
8 | 5.75 | 150 | 9 | female |
Suppose a person has the following data:
Height = 6, Weight = 130, Foot size = 8. Use the naive Bayes classifier to determine this person's sex (Sex = ?).
To keep the probability formulas compact, the table is simplified as follows (Table 4):
No. | H | W | F | S |
---|---|---|---|---|
1 | 6 | 180 | 12 | M |
2 | 5.92 | 190 | 11 | M |
3 | 5.58 | 170 | 12 | M |
4 | 5.92 | 165 | 10 | M |
5 | 5 | 100 | 6 | F |
6 | 5.5 | 150 | 8 | F |
7 | 5.42 | 130 | 7 | F |
8 | 5.75 | 150 | 9 | F |
In simplified form the person's data become:
H = 6, W = 130, F = 8. Use the naive Bayes classifier to determine this person's sex (S = ?).
Unlike the discrete case, height, weight, and foot size are continuous variables, so their class-conditional probabilities cannot be obtained by counting; and with so few samples, binning the values into intervals is not feasible either. Instead, we assume that within each sex, height, weight, and foot size are normally distributed, and estimate the mean and variance of each from the samples, which yields the normal density functions. Each density can then be evaluated at the query values.
The per-class mean and variance of each feature are:
Sex | Mean (height) | Var (height) | Mean (weight) | Var (weight) | Mean (foot size) | Var (foot size) |
---|---|---|---|---|---|---|
Male | 5.855 | 0.0350 | 176.25 | 122.9200 | 11.25 | 0.9167 |
Female | 5.4175 | 0.097225 | 132.5 | 558.3300 | 7.5 | 1.6667 |
For compact notation, this is abbreviated as
S | Hm | Hv | Wm | Wv | Fm | Fv |
---|---|---|---|---|---|---|
M | 5.855 | 0.0350 | 176.25 | 122.9200 | 11.25 | 0.9167 |
F | 5.4175 | 0.097225 | 132.5 | 558.3300 | 7.5 | 1.6667 |
By the Bayes classifier formula, the (unnormalized) score for classifying the query sample as male is
\[P(H=6|S=M)\times P(W=130|S=M)\times P(F=8|S=M)\times P(S=M)\]
and the score for classifying it as female is
\[P(H=6|S=F)\times P(W=130|S=F)\times P(F=8|S=F)\times P(S=F)\]
The decision variable is discrete, and the prior probability of each class is
\[P(S=M)=P(S=F)=\frac{4}{8}=0.5\]
Given Sex = M, the conditional density values of height, weight, and foot size are
\[P(H=6|S=M)=\frac{1}{\sqrt{2\times \pi\times 0.035}}exp[-\frac{(6-5.855)^{^2}}{2\times 0.035}]\approx1.5789\]
\[P(W=130|S=M)=\frac{1}{\sqrt{2\times \pi\times 122.92}}exp[-\frac{(130-176.25)^{^2}}{2\times 122.92}]\approx0.0000059881\]
\[P(F=8|S=M)=\frac{1}{\sqrt{2\times \pi\times 0.9167}}exp[-\frac{(8-11.25)^{^2}}{2\times 0.9167}]\approx0.001311472\]
Given Sex = F, the conditional density values of height, weight, and foot size are
\[P(H=6|S=F)=\frac{1}{\sqrt{2\times \pi\times 0.097225}}exp[-\frac{(6-5.4175)^{^2}}{2\times 0.097225}]\approx0.223459\]
\[P(W=130|S=F)=\frac{1}{\sqrt{2\times \pi\times 558.33}}exp[-\frac{(130-132.5)^{^2}}{2\times 558.33}]\approx0.01678935\]
\[P(F=8|S=F)=\frac{1}{\sqrt{2\times \pi\times 1.6667}}exp[-\frac{(8-7.5)^{^2}}{2\times 1.6667}]\approx0.28668826\]
The score for classifying the query sample as male evaluates to
\(P(H=6|S=M)\times P(W=130|S=M)\times P(F=8|S=M)\times P(S=M)\)
\(=1.5789\times 0.0000059881\times 0.001311472\times 0.5=0.0000000062\)
The score for classifying it as female evaluates to
\(P(H=6|S=F)\times P(W=130|S=F)\times P(F=8|S=F)\times P(S=F)\)
\(=0.223459\times 0.01678935\times 0.28668826\times 0.5=0.000537789\)
The query sample is therefore classified as female (the female score is far larger than the male score).
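A quick numeric check of the continuous case, evaluating the assumed normal densities at the query values with the means and variances above (plain JavaScript, independent of webTJ):

```javascript
// Normal density evaluated at x for the given mean and variance.
function gauss(x, mean, variance) {
  return Math.exp(-Math.pow(x - mean, 2) / (2 * variance)) /
         Math.sqrt(2 * Math.PI * variance);
}
var maleScore = gauss(6, 5.855, 0.035) * gauss(130, 176.25, 122.92) *
                gauss(8, 11.25, 0.9167) * 0.5;
var femaleScore = gauss(6, 5.4175, 0.097225) * gauss(130, 132.5, 558.33) *
                  gauss(8, 7.5, 1.6667) * 0.5;
console.log(maleScore);   // ≈ 6.2e-9
console.log(femaleScore); // ≈ 5.4e-4  → classified as female
```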
III. Sample code
The sample uses the iris data set (Table 5).
Iris data (the classic clustering and classification example data set from R)
ID | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|---|
1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
7 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
8 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
9 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
10 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |
11 | 5.4 | 3.7 | 1.5 | 0.2 | setosa |
12 | 4.8 | 3.4 | 1.6 | 0.2 | setosa |
13 | 4.8 | 3.0 | 1.4 | 0.1 | setosa |
14 | 4.3 | 3.0 | 1.1 | 0.1 | setosa |
15 | 5.8 | 4.0 | 1.2 | 0.2 | setosa |
16 | 5.7 | 4.4 | 1.5 | 0.4 | setosa |
17 | 5.4 | 3.9 | 1.3 | 0.4 | setosa |
18 | 5.1 | 3.5 | 1.4 | 0.3 | setosa |
19 | 5.7 | 3.8 | 1.7 | 0.3 | setosa |
20 | 5.1 | 3.8 | 1.5 | 0.3 | setosa |
21 | 5.4 | 3.4 | 1.7 | 0.2 | setosa |
22 | 5.1 | 3.7 | 1.5 | 0.4 | setosa |
23 | 4.6 | 3.6 | 1.0 | 0.2 | setosa |
24 | 5.1 | 3.3 | 1.7 | 0.5 | setosa |
25 | 4.8 | 3.4 | 1.9 | 0.2 | setosa |
26 | 5.0 | 3.0 | 1.6 | 0.2 | setosa |
27 | 5.0 | 3.4 | 1.6 | 0.4 | setosa |
28 | 5.2 | 3.5 | 1.5 | 0.2 | setosa |
29 | 5.2 | 3.4 | 1.4 | 0.2 | setosa |
30 | 4.7 | 3.2 | 1.6 | 0.2 | setosa |
31 | 4.8 | 3.1 | 1.6 | 0.2 | setosa |
32 | 5.4 | 3.4 | 1.5 | 0.4 | setosa |
33 | 5.2 | 4.1 | 1.5 | 0.1 | setosa |
34 | 5.5 | 4.2 | 1.4 | 0.2 | setosa |
35 | 4.9 | 3.1 | 1.5 | 0.2 | setosa |
36 | 5.0 | 3.2 | 1.2 | 0.2 | setosa |
37 | 5.5 | 3.5 | 1.3 | 0.2 | setosa |
38 | 4.9 | 3.6 | 1.4 | 0.1 | setosa |
39 | 4.4 | 3.0 | 1.3 | 0.2 | setosa |
40 | 5.1 | 3.4 | 1.5 | 0.2 | setosa |
41 | 5.0 | 3.5 | 1.3 | 0.3 | setosa |
42 | 4.5 | 2.3 | 1.3 | 0.3 | setosa |
43 | 4.4 | 3.2 | 1.3 | 0.2 | setosa |
44 | 5.0 | 3.5 | 1.6 | 0.6 | setosa |
45 | 5.1 | 3.8 | 1.9 | 0.4 | setosa |
46 | 4.8 | 3.0 | 1.4 | 0.3 | setosa |
47 | 5.1 | 3.8 | 1.6 | 0.2 | setosa |
48 | 4.6 | 3.2 | 1.4 | 0.2 | setosa |
49 | 5.3 | 3.7 | 1.5 | 0.2 | setosa |
50 | 5.0 | 3.3 | 1.4 | 0.2 | setosa |
51 | 7.0 | 3.2 | 4.7 | 1.4 | versicolor |
52 | 6.4 | 3.2 | 4.5 | 1.5 | versicolor |
53 | 6.9 | 3.1 | 4.9 | 1.5 | versicolor |
54 | 5.5 | 2.3 | 4.0 | 1.3 | versicolor |
55 | 6.5 | 2.8 | 4.6 | 1.5 | versicolor |
56 | 5.7 | 2.8 | 4.5 | 1.3 | versicolor |
57 | 6.3 | 3.3 | 4.7 | 1.6 | versicolor |
58 | 4.9 | 2.4 | 3.3 | 1.0 | versicolor |
59 | 6.6 | 2.9 | 4.6 | 1.3 | versicolor |
60 | 5.2 | 2.7 | 3.9 | 1.4 | versicolor |
61 | 5.0 | 2.0 | 3.5 | 1.0 | versicolor |
62 | 5.9 | 3.0 | 4.2 | 1.5 | versicolor |
63 | 6.0 | 2.2 | 4.0 | 1.0 | versicolor |
64 | 6.1 | 2.9 | 4.7 | 1.4 | versicolor |
65 | 5.6 | 2.9 | 3.6 | 1.3 | versicolor |
66 | 6.7 | 3.1 | 4.4 | 1.4 | versicolor |
67 | 5.6 | 3.0 | 4.5 | 1.5 | versicolor |
68 | 5.8 | 2.7 | 4.1 | 1.0 | versicolor |
69 | 6.2 | 2.2 | 4.5 | 1.5 | versicolor |
70 | 5.6 | 2.5 | 3.9 | 1.1 | versicolor |
71 | 5.9 | 3.2 | 4.8 | 1.8 | versicolor |
72 | 6.1 | 2.8 | 4.0 | 1.3 | versicolor |
73 | 6.3 | 2.5 | 4.9 | 1.5 | versicolor |
74 | 6.1 | 2.8 | 4.7 | 1.2 | versicolor |
75 | 6.4 | 2.9 | 4.3 | 1.3 | versicolor |
76 | 6.6 | 3.0 | 4.4 | 1.4 | versicolor |
77 | 6.8 | 2.8 | 4.8 | 1.4 | versicolor |
78 | 6.7 | 3.0 | 5.0 | 1.7 | versicolor |
79 | 6.0 | 2.9 | 4.5 | 1.5 | versicolor |
80 | 5.7 | 2.6 | 3.5 | 1.0 | versicolor |
81 | 5.5 | 2.4 | 3.8 | 1.1 | versicolor |
82 | 5.5 | 2.4 | 3.7 | 1.0 | versicolor |
83 | 5.8 | 2.7 | 3.9 | 1.2 | versicolor |
84 | 6.0 | 2.7 | 5.1 | 1.6 | versicolor |
85 | 5.4 | 3.0 | 4.5 | 1.5 | versicolor |
86 | 6.0 | 3.4 | 4.5 | 1.6 | versicolor |
87 | 6.7 | 3.1 | 4.7 | 1.5 | versicolor |
88 | 6.3 | 2.3 | 4.4 | 1.3 | versicolor |
89 | 5.6 | 3.0 | 4.1 | 1.3 | versicolor |
90 | 5.5 | 2.5 | 4.0 | 1.3 | versicolor |
91 | 5.5 | 2.6 | 4.4 | 1.2 | versicolor |
92 | 6.1 | 3.0 | 4.6 | 1.4 | versicolor |
93 | 5.8 | 2.6 | 4.0 | 1.2 | versicolor |
94 | 5.0 | 2.3 | 3.3 | 1.0 | versicolor |
95 | 5.6 | 2.7 | 4.2 | 1.3 | versicolor |
96 | 5.7 | 3.0 | 4.2 | 1.2 | versicolor |
97 | 5.7 | 2.9 | 4.2 | 1.3 | versicolor |
98 | 6.2 | 2.9 | 4.3 | 1.3 | versicolor |
99 | 5.1 | 2.5 | 3.0 | 1.1 | versicolor |
100 | 5.7 | 2.8 | 4.1 | 1.3 | versicolor |
101 | 6.3 | 3.3 | 6.0 | 2.5 | virginica |
102 | 5.8 | 2.7 | 5.1 | 1.9 | virginica |
103 | 7.1 | 3.0 | 5.9 | 2.1 | virginica |
104 | 6.3 | 2.9 | 5.6 | 1.8 | virginica |
105 | 6.5 | 3.0 | 5.8 | 2.2 | virginica |
106 | 7.6 | 3.0 | 6.6 | 2.1 | virginica |
107 | 4.9 | 2.5 | 4.5 | 1.7 | virginica |
108 | 7.3 | 2.9 | 6.3 | 1.8 | virginica |
109 | 6.7 | 2.5 | 5.8 | 1.8 | virginica |
110 | 7.2 | 3.6 | 6.1 | 2.5 | virginica |
111 | 6.5 | 3.2 | 5.1 | 2.0 | virginica |
112 | 6.4 | 2.7 | 5.3 | 1.9 | virginica |
113 | 6.8 | 3.0 | 5.5 | 2.1 | virginica |
114 | 5.7 | 2.5 | 5.0 | 2.0 | virginica |
115 | 5.8 | 2.8 | 5.1 | 2.4 | virginica |
116 | 6.4 | 3.2 | 5.3 | 2.3 | virginica |
117 | 6.5 | 3.0 | 5.5 | 1.8 | virginica |
118 | 7.7 | 3.8 | 6.7 | 2.2 | virginica |
119 | 7.7 | 2.6 | 6.9 | 2.3 | virginica |
120 | 6.0 | 2.2 | 5.0 | 1.5 | virginica |
121 | 6.9 | 3.2 | 5.7 | 2.3 | virginica |
122 | 5.6 | 2.8 | 4.9 | 2.0 | virginica |
123 | 7.7 | 2.8 | 6.7 | 2.0 | virginica |
124 | 6.3 | 2.7 | 4.9 | 1.8 | virginica |
125 | 6.7 | 3.3 | 5.7 | 2.1 | virginica |
126 | 7.2 | 3.2 | 6.0 | 1.8 | virginica |
127 | 6.2 | 2.8 | 4.8 | 1.8 | virginica |
128 | 6.1 | 3.0 | 4.9 | 1.8 | virginica |
129 | 6.4 | 2.8 | 5.6 | 2.1 | virginica |
130 | 7.2 | 3.0 | 5.8 | 1.6 | virginica |
131 | 7.4 | 2.8 | 6.1 | 1.9 | virginica |
132 | 7.9 | 3.8 | 6.4 | 2.0 | virginica |
133 | 6.4 | 2.8 | 5.6 | 2.2 | virginica |
134 | 6.3 | 2.8 | 5.1 | 1.5 | virginica |
135 | 6.1 | 2.6 | 5.6 | 1.4 | virginica |
136 | 7.7 | 3.0 | 6.1 | 2.3 | virginica |
137 | 6.3 | 3.4 | 5.6 | 2.4 | virginica |
138 | 6.4 | 3.1 | 5.5 | 1.8 | virginica |
139 | 6.0 | 3.0 | 4.8 | 1.8 | virginica |
140 | 6.9 | 3.1 | 5.4 | 2.1 | virginica |
141 | 6.7 | 3.1 | 5.6 | 2.4 | virginica |
142 | 6.9 | 3.1 | 5.1 | 2.3 | virginica |
143 | 5.8 | 2.7 | 5.1 | 1.9 | virginica |
144 | 6.8 | 3.2 | 5.9 | 2.3 | virginica |
145 | 6.7 | 3.3 | 5.7 | 2.5 | virginica |
146 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
147 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
148 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
149 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
150 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
## Function - Naive Bayes classification algorithm
webTJ.Datamining.setNaive_Bayes(arrs,crr,srrs);
## Parameters
【arrs, crr, srrs】
【training sample array, decision (class) variable array, query sample array】
Code sample
var oTxt="5.1,3.5,1.4,0.2,setosa|4.9,3,1.4,0.2,setosa|4.7,3.2,1.3,0.2,setosa|4.6,3.1,1.5,0.2,setosa|5,3.6,1.4,0.2,setosa|5.4,3.9,1.7,0.4,setosa|4.6,3.4,1.4,0.3,setosa|5,3.4,1.5,0.2,setosa|4.4,2.9,1.4,0.2,setosa|4.9,3.1,1.5,0.1,setosa|5.4,3.7,1.5,0.2,setosa|4.8,3.4,1.6,0.2,setosa|4.8,3,1.4,0.1,setosa|4.3,3,1.1,0.1,setosa|5.8,4,1.2,0.2,setosa|5.7,4.4,1.5,0.4,setosa|5.4,3.9,1.3,0.4,setosa|5.1,3.5,1.4,0.3,setosa|5.7,3.8,1.7,0.3,setosa|5.1,3.8,1.5,0.3,setosa|5.4,3.4,1.7,0.2,setosa|5.1,3.7,1.5,0.4,setosa|4.6,3.6,1,0.2,setosa|5.1,3.3,1.7,0.5,setosa|4.8,3.4,1.9,0.2,setosa|5,3,1.6,0.2,setosa|5,3.4,1.6,0.4,setosa|5.2,3.5,1.5,0.2,setosa|5.2,3.4,1.4,0.2,setosa|4.7,3.2,1.6,0.2,setosa|4.8,3.1,1.6,0.2,setosa|5.4,3.4,1.5,0.4,setosa|5.2,4.1,1.5,0.1,setosa|5.5,4.2,1.4,0.2,setosa|4.9,3.1,1.5,0.2,setosa|5,3.2,1.2,0.2,setosa|5.5,3.5,1.3,0.2,setosa|4.9,3.6,1.4,0.1,setosa|4.4,3,1.3,0.2,setosa|5.1,3.4,1.5,0.2,setosa|5,3.5,1.3,0.3,setosa|4.5,2.3,1.3,0.3,setosa|4.4,3.2,1.3,0.2,setosa|5,3.5,1.6,0.6,setosa|5.1,3.8,1.9,0.4,setosa|4.8,3,1.4,0.3,setosa|5.1,3.8,1.6,0.2,setosa|4.6,3.2,1.4,0.2,setosa|5.3,3.7,1.5,0.2,setosa|5,3.3,1.4,0.2,setosa|7,3.2,4.7,1.4,versicolor|6.4,3.2,4.5,1.5,versicolor|6.9,3.1,4.9,1.5,versicolor|5.5,2.3,4,1.3,versicolor|6.5,2.8,4.6,1.5,versicolor|5.7,2.8,4.5,1.3,versicolor|6.3,3.3,4.7,1.6,versicolor|4.9,2.4,3.3,1,versicolor|6.6,2.9,4.6,1.3,versicolor|5.2,2.7,3.9,1.4,versicolor|5,2,3.5,1,versicolor|5.9,3,4.2,1.5,versicolor|6,2.2,4,1,versicolor|6.1,2.9,4.7,1.4,versicolor|5.6,2.9,3.6,1.3,versicolor|6.7,3.1,4.4,1.4,versicolor|5.6,3,4.5,1.5,versicolor|5.8,2.7,4.1,1,versicolor|6.2,2.2,4.5,1.5,versicolor|5.6,2.5,3.9,1.1,versicolor|5.9,3.2,4.8,1.8,versicolor|6.1,2.8,4,1.3,versicolor|6.3,2.5,4.9,1.5,versicolor|6.1,2.8,4.7,1.2,versicolor|6.4,2.9,4.3,1.3,versicolor|6.6,3,4.4,1.4,versicolor|6.8,2.8,4.8,1.4,versicolor|6.7,3,5,1.7,versicolor|6,2.9,4.5,1.5,versicolor|5.7,2.6,3.5,1,versicolor|5.5,2.4,3.8,1.1,versicolor|5.5,2.4,3.7,1,versicolor|5.8,2.7,3.9,1.2,versicolor|6,2.7,5.1,1.6,versicolor|5.4,3,4.5,1.5,versicolor|6,3.4,4.5,1.6,versicolor|6.7,3.1,4.7,1.5,versicolor|6.3,2.3,4.4,1.3,versicolor|5.6,3,4.1,1.3,versicolor|5.5,2.5,4,1.3,versicolor|5.5,2.6,4.4,1.2,versicolor|6.1,3,4.6,1.4,versicolor|5.8,2.6,4,1.2,versicolor|5,2.3,3.3,1,versicolor|5.6,2.7,4.2,1.3,versicolor|5.7,3,4.2,1.2,versicolor|5.7,2.9,4.2,1.3,versicolor|6.2,2.9,4.3,1.3,versicolor|5.1,2.5,3,1.1,versicolor|5.7,2.8,4.1,1.3,versicolor|6.3,3.3,6,2.5,virginica|5.8,2.7,5.1,1.9,virginica|7.1,3,5.9,2.1,virginica|6.3,2.9,5.6,1.8,virginica|6.5,3,5.8,2.2,virginica|7.6,3,6.6,2.1,virginica|4.9,2.5,4.5,1.7,virginica|7.3,2.9,6.3,1.8,virginica|6.7,2.5,5.8,1.8,virginica|7.2,3.6,6.1,2.5,virginica|6.5,3.2,5.1,2,virginica|6.4,2.7,5.3,1.9,virginica|6.8,3,5.5,2.1,virginica|5.7,2.5,5,2,virginica|5.8,2.8,5.1,2.4,virginica|6.4,3.2,5.3,2.3,virginica|6.5,3,5.5,1.8,virginica|7.7,3.8,6.7,2.2,virginica|7.7,2.6,6.9,2.3,virginica|6,2.2,5,1.5,virginica|6.9,3.2,5.7,2.3,virginica|5.6,2.8,4.9,2,virginica|7.7,2.8,6.7,2,virginica|6.3,2.7,4.9,1.8,virginica|6.7,3.3,5.7,2.1,virginica|7.2,3.2,6,1.8,virginica|6.2,2.8,4.8,1.8,virginica|6.1,3,4.9,1.8,virginica|6.4,2.8,5.6,2.1,virginica|7.2,3,5.8,1.6,virginica|7.4,2.8,6.1,1.9,virginica|7.9,3.8,6.4,2,virginica|6.4,2.8,5.6,2.2,virginica|6.3,2.8,5.1,1.5,virginica|6.1,2.6,5.6,1.4,virginica|7.7,3,6.1,2.3,virginica|6.3,3.4,5.6,2.4,virginica|6.4,3.1,5.5,1.8,virginica|6,3,4.8,1.8,virginica|6.9,3.1,5.4,2.1,virginica|6.7,3.1,5.6,2.4,virginica|6.9,3.1,5.1,2.3,virginica|5.8,2.7,5.1,1.9,virginica|6.8,3.2,5.9,2.3,virginica|6.7,3.3,5.7,2.5,virginica|6.7,3,5
.2,2.3,virginica|6.3,2.5,5,1.9,virginica|6.5,3,5.2,2,virginica|6.2,3.4,5.4,2.3,virginica|5.9,3,5.1,1.8,virginica";
var oArrs=webTJ.getArrs(oTxt,"|",",");
var oColData=webTJ.Array.getColData(oArrs,4);
oArrs=webTJ.Matrix.getRemoveCol(oArrs,4);
var oSrrs=[[6.9,3.2,5.7,2.3],[4.9,3,1.4,0.2],[5.9,3,5.1,1.8]];
webTJ.Datamining.setNaive_Bayes(oArrs,oColData,oSrrs);
In the function webTJ.Datamining.setNaive_Bayes, the training samples, the decision variable array, and the query samples are all expressed as arrays. Even when there is only one query sample, it should be given as a two-dimensional array, e.g. [[6.9,3.2,5.7,2.3]] (see the note in section IV.1 below).
IV. Case studies
1. Discrete feature attributes (see Table 2)
Code sample
var oTxt="A,H,N,F|A,H,N,E|B,H,N,F|C,M,N,F|C,L,Y,F|C,L,Y,E|B,L,Y,E|A,M,N,F|A,L,Y,F|C,M,Y,F|A,M,Y,E|B,M,N,E|B,H,Y,F|C,M,N,E";
var oArrs=webTJ.getArrs(oTxt,"|",",");
var oCrr=['N','N','Y','Y','Y','N','Y','N','Y','Y','Y','Y','Y','N'];
var oSrrs=[['A','M','Y','F']];
webTJ.Datamining.setNaive_Bayes(oArrs,oCrr,oSrrs);
Note: even a single query sample must be written as a two-dimensional array, e.g. [['A','M','Y','F']]; two samples would be [['A','M','Y','F'],['C','M','Y','F']].
2. Continuous feature attributes (see Table 4)
Code sample
var oTxt="6,180,12|5.92,190,11|5.58,170,12|5.92,165,10|5,100,6|5.5,150,8|5.42,130,7|5.75,150,9";
var oArrs=webTJ.getArrs(oTxt,"|",",");
var oCrr=['M','M','M','M','F','F','F','F'];
var oSrrs=[[6,130,8]];
webTJ.Datamining.setNaive_Bayes(oArrs,oCrr,oSrrs);