This is a follow-up question to How to know what classes are represented in return array from predict_proba in Scikit-learn.
In that question, I quoted the following code:
>>> import sklearn
>>> sklearn.__version__
'0.13.1'
>>> from sklearn import svm
>>> model = svm.SVC(probability=True)
>>> X = [[1,2,3], [2,3,4]] # feature vectors
>>> Y = ['apple', 'orange'] # classes
>>> model.fit(X, Y)
>>> model.predict_proba([1,2,3])
array([[ 0.39097541, 0.60902459]])
I discovered in that question that this result represents the probability of the point belonging to each class, in the order given by model.classes_:
>>> zip(model.classes_, model.predict_proba([1,2,3])[0])
[('apple', 0.39097541289393828), ('orange', 0.60902458710606167)]
So... this answer, if interpreted correctly, says that the point is probably an 'orange' (with fairly low confidence, due to the tiny amount of data). But intuitively, this result is obviously incorrect, since the point given was identical to the training data for 'apple'. Just to be sure, I tested the reverse as well:
>>> zip(model.classes_, model.predict_proba([2,3,4])[0])
[('apple', 0.60705475211840931), ('orange', 0.39294524788159074)]
Again, obviously incorrect, but in the other direction.
Finally, I tried it with points that were much further away.
>>> X = [[1,1,1], [20,20,20]] # feature vectors
>>> model.fit(X, Y)
>>> zip(model.classes_, model.predict_proba([1,1,1])[0])
[('apple', 0.33333332048410247), ('orange', 0.66666667951589786)]
Again, the model predicts the wrong probabilities. BUT, the model.predict function gets it right!
>>> model.predict([1,1,1])[0]
'apple'
Now, I remember reading something in the docs about predict_proba being inaccurate for small datasets, though I can't seem to find it again. Is this the expected behaviour, or am I doing something wrong? If this IS the expected behaviour, then why do predict and predict_proba disagree on the output? And importantly, how big does the dataset need to be before I can trust the results from predict_proba?
-------- UPDATE --------
Ok, so I did some more 'experiments' into this: the behaviour of predict_proba is heavily dependent on 'n', but not in any predictable way!
>>> def train_test(n):
... X = [[1,2,3], [2,3,4]] * n
... Y = ['apple', 'orange'] * n
... model.fit(X, Y)
... print "n =", n, zip(model.classes_, model.predict_proba([1,2,3])[0])
...
>>> train_test(1)
n = 1 [('apple', 0.39097541289393828), ('orange', 0.60902458710606167)]
>>> for n in range(1,10):
... train_test(n)
...
n = 1 [('apple', 0.39097541289393828), ('orange', 0.60902458710606167)]
n = 2 [('apple', 0.98437355278112448), ('orange', 0.015626447218875527)]
n = 3 [('apple', 0.90235408180319321), ('orange', 0.097645918196806694)]
n = 4 [('apple', 0.83333299908143665), ('orange', 0.16666700091856332)]
n = 5 [('apple', 0.85714254878984497), ('orange', 0.14285745121015511)]
n = 6 [('apple', 0.87499969631893626), ('orange', 0.1250003036810636)]
n = 7 [('apple', 0.88888844127886335), ('orange', 0.11111155872113669)]
n = 8 [('apple', 0.89999988018127364), ('orange', 0.10000011981872642)]
n = 9 [('apple', 0.90909082368682159), ('orange', 0.090909176313178491)]
How should I use this function safely in my code? At the very least, is there any value of n for which it will be guaranteed to agree with the result of model.predict?
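One way to probe that last question directly is to check, per sample, whether argmax over predict_proba (mapped through model.classes_) matches the hard label from predict. This is only a diagnostic sketch; nothing in libsvm's probability calibration guarantees agreement for any particular n:

```python
import numpy as np
from sklearn import svm

def proba_agrees_with_predict(model, samples):
    """Per-sample check: does classes_[argmax(predict_proba)]
    match the hard label returned by predict?"""
    proba_labels = model.classes_[np.argmax(model.predict_proba(samples), axis=1)]
    return proba_labels == model.predict(samples)

# the duplicated toy dataset from the question, at n = 5
n = 5
X = [[1, 2, 3], [2, 3, 4]] * n
Y = ['apple', 'orange'] * n
model = svm.SVC(probability=True).fit(X, Y)
print(proba_agrees_with_predict(model, [[1, 2, 3], [2, 3, 4]]))
```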
3 Answers
#1
12
If you use svm.LinearSVC() as the estimator, you can use .decision_function() (which is like svm.SVC's .predict_proba()) to sort the results from the most probable class to the least probable one. This agrees with the .predict() function. Plus, this estimator is faster and gives almost the same results as svm.SVC().
The only drawback for you might be that .decision_function() gives a signed value, something like between -1 and 3, instead of a probability value. But it agrees with the prediction.
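A minimal sketch of this suggestion. For the binary case, decision_function returns one signed score per sample, and predict follows its sign, so ranking by the score always agrees with the hard prediction:

```python
from sklearn import svm

X = [[1, 2, 3], [2, 3, 4]]  # feature vectors
Y = ['apple', 'orange']     # classes

model = svm.LinearSVC().fit(X, Y)

# A signed distance to the separating hyperplane, not a probability.
score = model.decision_function([[1, 2, 3]])

# For two classes: score <= 0 -> classes_[0], score > 0 -> classes_[1].
print(score, model.predict([[1, 2, 3]]))
```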
#2
15
predict_proba uses the Platt scaling feature of libsvm to calibrate probabilities; see:
- How does sklearn.svm.svc's function predict_proba() work internally?
So indeed the hyperplane predictions and the proba calibration can disagree, especially if you only have 2 samples in your dataset. It's weird that the internal cross validation done by libsvm for scaling the probabilities does not fail (explicitly) in this case. Maybe this is a bug. One would have to dive into the Platt scaling code of libsvm to understand what's happening.
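For intuition, Platt scaling passes the raw decision value f through a fitted sigmoid, p = 1 / (1 + exp(A·f + B)), where A and B are fitted by libsvm using an internal cross-validation; that internal fit is exactly what becomes unstable on tiny datasets. A sketch with made-up A and B (the real values come from libsvm's fit):

```python
import math

def platt_sigmoid(f, A, B):
    """Map a raw SVM decision value f to a probability via
    Platt's sigmoid: p = 1 / (1 + exp(A * f + B))."""
    return 1.0 / (1.0 + math.exp(A * f + B))

# Illustrative parameters only; libsvm fits A and B internally.
print(platt_sigmoid(0.0, A=-2.0, B=0.0))  # at the decision boundary -> 0.5
print(platt_sigmoid(1.5, A=-2.0, B=0.0))  # well inside the positive class
```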
#3
-1
There is some confusion as to what predict_proba actually does. It does not predict probabilities, as its name suggests, but outputs distances. In the apple vs orange example (0.39097541, 0.60902459), the shortest distance, 0.39097541, is the apple class, which is counter-intuitive: you are looking for the highest probability, but that's not the case.
Another source of confusion stems from the fact that predict_proba does match hard labels, just not in the order of classes from 0..n sequentially. Scikit seems to shuffle the classes, but it is possible to map them.
Here is how it works.
Say we have 5 classes with labels:
classifier.classes_ = [0 1 2 3 4]
target names = ['1', '2', '3', '6', '8']
predicted labels [2 0 1 0 4]
classifier.predict_proba
[[ 0.20734121 0.20451986 0.17262553 0.20768649 0.20782692]
[ 0.19099348 0.2018391 0.20222314 0.20136784 0.20357644]
[ 0.19982284 0.19497121 0.20399376 0.19824784 0.20296435]
[ 0.19884577 0.1999416 0.19998889 0.20092702 0.20029672]
[ 0.20328893 0.2025956 0.20500402 0.20383255 0.1852789 ]]
Confusion matrix:
[[1 0 0 0 0]
[0 1 0 0 0]
[0 0 1 0 0]
[1 0 0 0 0]
[0 0 0 0 1]]
y_test [2 0 1 3 4]
pred [2 0 1 0 4]
classifier.classes_ = [0 1 2 3 4]
Anything but the third class is a match. According to the predicted labels in the confusion matrix, class 0 is predicted and the actual class is 0, argmax(pred_prob). But it's mapped to
y_test [2 0 1 3 4]
so find the second class
0 1 2 3 4
[ 0.20734121 0.20451986 0.17262553 0.20768649 0.20782692]
and the winner is **0.17262553**
Let's do it again. Look at misclassification result number 4, where the actual label is 4 and the predicted is 1, according to the confusion matrix.
BUT y_test [2 0 1 3 4] pred [2 0 1 0 4]
which translates to actual label 3 predicted label 0
0 1 2 3 4
[ 0.19884577 0.1999416 0.19998889 0.20092702 0.20029672]
look at label number 0, and the winner is **0.19884577**
These are my $0.02.
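For comparison, scikit-learn's documented contract is that column j of predict_proba corresponds to classifier.classes_[j], so the label per row is simply classes_[argmax(row)]. A sketch applying that mapping to the first two probability rows from this answer:

```python
import numpy as np

# scikit-learn's contract: column j of predict_proba corresponds
# to classifier.classes_[j], so the label per row is classes_[argmax].
classes = np.array([0, 1, 2, 3, 4])
proba = np.array([
    [0.20734121, 0.20451986, 0.17262553, 0.20768649, 0.20782692],
    [0.19099348, 0.2018391,  0.20222314, 0.20136784, 0.20357644],
])
labels = classes[np.argmax(proba, axis=1)]
print(labels)
```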