Python: How do I normalize a confusion matrix?

Time: 2020-12-05 15:58:18

I calculated a confusion matrix for my classifier using the method confusion_matrix() from the sklearn package. The diagonal elements of the confusion matrix represent the number of points for which the predicted label is equal to the true label, while off-diagonal elements are those that are mislabeled by the classifier.

I would like to normalize my confusion matrix so that it contains only numbers between 0 and 1. I would like to read the percentage of correctly classified samples from the matrix.

I found several ways to normalize a matrix (row and column normalization), but I don't know much about maths and am not sure whether this is the correct approach. Can someone please help?

4 answers

#1


6  

I'm assuming that M[i,j] stands for "an element of real class i was classified as j". If it's the other way around, you will need to transpose everything I say. I'm also going to use the following matrix for concrete examples:

1 2 3
4 5 6
7 8 9

There are essentially two things you can do:

Finding how each class has been classified

The first thing you can ask is what percentage of the elements of real class i were classified as each class. To do so, we take the row for that i and divide each element by the sum of the elements in the row. In our example, objects from class 2 are classified as class 1 four times, correctly as class 2 five times, and as class 3 six times. To find the percentages we just divide everything by the sum 4 + 5 + 6 = 15:

4/15 of the class 2 objects are classified as class 1
5/15 of the class 2 objects are classified as class 2
6/15 of the class 2 objects are classified as class 3
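
The row division above can be applied to the whole matrix at once. This is only a sketch with NumPy, assuming the same convention as the answer (rows are real classes, columns are predicted classes); the names M and row_normalized are made up for illustration:

```python
import numpy as np

# Example matrix from the answer: M[i, j] counts objects of real class i+1
# that were classified as class j+1 (assumed convention).
M = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# keepdims=True keeps the row sums as a (3, 1) column, so each row is
# divided by its own total rather than by another row's total.
row_sums = M.sum(axis=1, keepdims=True)
row_normalized = M / row_sums

# Second row corresponds to real class 2: 4/15, 5/15, 6/15.
print(row_normalized[1])
```

Every row of row_normalized sums to 1, so each row can be read as "where did the objects of this real class end up".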

Finding what classes are responsible for each classification

The second thing you can do is to look at each result from your classifier and ask how many of those results originate from each real class. It's going to be similar to the other case, but with columns instead of rows. In our example, our classifier returns "1" once when the original class is 1, 4 times when the original class is 2, and 7 times when the original class is 3. To find the percentages we divide by the sum 1 + 4 + 7 = 12:

1/12 of the objects classified as class 1 were from class 1
4/12 of the objects classified as class 1 were from class 2
7/12 of the objects classified as class 1 were from class 3
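
The column version can be sketched the same way. Again this assumes rows are real classes and columns are predictions, and the variable names are only illustrative:

```python
import numpy as np

# Same example matrix: rows = real class, columns = predicted class (assumed).
M = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# Column sums have shape (1, 3), so each column is divided by its own total.
col_sums = M.sum(axis=0, keepdims=True)
col_normalized = M / col_sums

# First column corresponds to "classified as class 1": 1/12, 4/12, 7/12.
print(col_normalized[:, 0])
```

Here it is the columns that sum to 1, so each column answers "which real classes make up this prediction".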

--


Of course, both of the methods I gave apply to only a single row or column at a time, and I'm not sure whether it would be a good idea to actually modify your confusion matrix in this form. However, this should give the percentages you are looking for.

#2


14  

Suppose that

>>> import numpy as np
>>> from sklearn.metrics import confusion_matrix
>>> y_true = [0, 0, 1, 1, 2, 0, 1]
>>> y_pred = [0, 1, 0, 1, 2, 2, 1]
>>> C = confusion_matrix(y_true, y_pred)
>>> C
array([[1, 1, 1],
       [1, 2, 0],
       [0, 0, 1]])

Then, to find out what fraction of the samples in each class received the correct label, you need to divide each row by its sum:

>>> C.astype(float) / C.sum(axis=1, keepdims=True)
array([[ 0.33333333,  0.33333333,  0.33333333],
       [ 0.33333333,  0.66666667,  0.        ],
       [ 0.        ,  0.        ,  1.        ]])

The diagonal contains the required values. Another way to compute these is to realize that what you're computing is the recall per class:

>>> from sklearn.metrics import precision_recall_fscore_support
>>> _, recall, _, _ = precision_recall_fscore_support(y_true, y_pred)
>>> recall
array([ 0.33333333,  0.66666667,  1.        ])

Similarly, if you divide by the sum over axis=0, you get the precision (fraction of class-k predictions that have ground truth label k):

>>> C.astype(float) / C.sum(axis=0)
array([[ 0.5       ,  0.33333333,  0.5       ],
       [ 0.5       ,  0.66666667,  0.        ],
       [ 0.        ,  0.        ,  0.5       ]])
>>> prec, _, _, _ = precision_recall_fscore_support(y_true, y_pred)
>>> prec
array([ 0.5       ,  0.66666667,  0.5       ])
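
If you only care about the diagonal, you can skip building the full normalized matrices. The following is a sketch rather than the one true way to do it: it checks the diagonal-over-row-sum and diagonal-over-column-sum values against sklearn's recall_score and precision_score with average=None, and also computes the overall fraction of correctly classified samples, which is what the question asks for:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

y_true = [0, 0, 1, 1, 2, 0, 1]
y_pred = [0, 1, 0, 1, 2, 2, 1]
C = confusion_matrix(y_true, y_pred)

# Per-class recall: correct predictions divided by the number of true samples.
recall_from_cm = np.diag(C) / C.sum(axis=1)
# Per-class precision: correct predictions divided by the number of predictions.
precision_from_cm = np.diag(C) / C.sum(axis=0)
# Overall accuracy: all correct predictions divided by all samples.
accuracy_from_cm = np.trace(C) / C.sum()

# These agree with the dedicated sklearn metrics.
assert np.allclose(recall_from_cm, recall_score(y_true, y_pred, average=None))
assert np.allclose(precision_from_cm, precision_score(y_true, y_pred, average=None))
assert np.isclose(accuracy_from_cm, accuracy_score(y_true, y_pred))
```

Note that a class with no true samples (or no predictions) gives a zero denominator here, so real data may need a guard for that case.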

#3


7  

The matrix output by sklearn's confusion_matrix() is such that

C_{i, j} is equal to the number of observations known to be in group i but predicted to be in group j

so to get the percentages for each class (often called specificity and sensitivity in binary classification), you need to normalize by row: replace each element in a row by itself divided by the sum of the elements of that row.
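
For the binary case, a small sketch may make the link to specificity and sensitivity concrete. The counts below are invented for illustration, and the layout assumes label 0 is the negative class and label 1 the positive class:

```python
import numpy as np

# Hypothetical binary confusion matrix in sklearn's layout:
# rows = true class (0 = negative, 1 = positive), columns = predicted class.
cm = np.array([[50, 10],   # true negatives, false positives
               [5, 35]])   # false negatives, true positives

# Normalize by row: each row is divided by the number of samples in that true class.
row_normalized = cm / cm.sum(axis=1, keepdims=True)

specificity = row_normalized[0, 0]  # TN / (TN + FP)
sensitivity = row_normalized[1, 1]  # TP / (TP + FN)
print(specificity, sensitivity)
```

So the diagonal of the row-normalized matrix holds the specificity and the sensitivity, in that order, under the label assumption above.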

Note that sklearn has a summary function, classification_report, that computes per-class metrics directly from the true and predicted labels. It outputs precision and recall rather than specificity and sensitivity, but those are often regarded as more informative in general (especially for imbalanced multi-class classification).
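
As a rough usage sketch, here is classification_report on the small label lists from answer #2. The output_dict=True form assumes a reasonably recent scikit-learn (the option was added in 0.20); on older versions only the text form is available:

```python
from sklearn.metrics import classification_report

# Labels reused from answer #2 for illustration.
y_true = [0, 0, 1, 1, 2, 0, 1]
y_pred = [0, 1, 0, 1, 2, 2, 1]

# Human-readable table with precision, recall, f1-score and support per class.
print(classification_report(y_true, y_pred))

# Same numbers as a nested dict keyed by the class label as a string.
report = classification_report(y_true, y_pred, output_dict=True)
print(report["0"]["recall"], report["2"]["precision"])
```

The recall column matches the diagonal of the row-normalized confusion matrix, and the precision column matches the diagonal of the column-normalized one.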

#4


2  

From the sklearn documentation (plot example):

cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

where cm is the confusion matrix as provided by sklearn.
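
For what it's worth, newer scikit-learn releases (0.22 and later, as far as I know) let confusion_matrix do this itself via the normalize parameter. The sketch below, reusing the labels from answer #2, compares it with the manual one-liner:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Labels reused from answer #2 for illustration.
y_true = [0, 0, 1, 1, 2, 0, 1]
y_pred = [0, 1, 0, 1, 2, 2, 1]

# Manual row normalization, as in the documentation snippet.
cm = confusion_matrix(y_true, y_pred)
manual = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

# Built-in normalization: 'true' divides by row sums (true-class totals),
# 'pred' by column sums, and 'all' by the total number of samples.
by_true = confusion_matrix(y_true, y_pred, normalize='true')
by_pred = confusion_matrix(y_true, y_pred, normalize='pred')
by_all = confusion_matrix(y_true, y_pred, normalize='all')

assert np.allclose(manual, by_true)
print(by_true)
```

With normalize='all' the diagonal entries add up to the overall accuracy, which is the single "percentage of correctly classified samples" number.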
