This is a follow-up question from How to know what classes are represented in return array from predict_proba in Scikit-learn
In that question, I quoted the following code:
>>> import sklearn
>>> sklearn.__version__
>>> from sklearn import svm
>>> model = svm.SVC(probability=True)
>>> X = [[1,2,3], [2,3,4]] # feature vectors
>>> Y = ['apple', 'orange'] # classes
>>> model.fit(X, Y)
>>> model.predict_proba([1,2,3])
array([[ 0.39097541, 0.60902459]])
I discovered in that question that this result represents the probability of the point belonging to each class, in the order given by model.classes_:
>>> zip(model.classes_, model.predict_proba([1,2,3])[0])
[('apple', 0.39097541289393828), ('orange', 0.60902458710606167)]
So... this answer, if interpreted correctly, says that the point is probably an 'orange' (with a fairly low confidence, due to the tiny amount of data). But intuitively, this result is obviously incorrect, since the point given was identical to the training data for 'apple'. Just to be sure, I tested the reverse as well:
>>> zip(model.classes_, model.predict_proba([2,3,4])[0])
[('apple', 0.60705475211840931), ('orange', 0.39294524788159074)]
Again, obviously incorrect, but in the other direction.
Finally, I tried it with points that were much further away.
>>> X = [[1,1,1], [20,20,20]] # feature vectors
>>> model.fit(X, Y)
>>> zip(model.classes_, model.predict_proba([1,1,1])[0])
[('apple', 0.33333332048410247), ('orange', 0.66666667951589786)]
Again, the model predicts the wrong probabilities. BUT, the model.predict function gets it right!
>>> model.predict([1,1,1])[0]
Now, I remember reading something in the docs about predict_proba being inaccurate for small datasets, though I can't seem to find it again. Is this the expected behaviour, or am I doing something wrong? If this IS the expected behaviour, then why do the predict and predict_proba functions disagree on the output? And importantly, how big does the dataset need to be before I can trust the results from predict_proba?
-------- UPDATE --------
Ok, so I did some more 'experiments' into this: the behaviour of predict_proba is heavily dependent on 'n', but not in any predictable way!
>>> def train_test(n):
...     X = [[1,2,3], [2,3,4]] * n
...     Y = ['apple', 'orange'] * n
...     model.fit(X, Y)
...     print "n =", n, zip(model.classes_, model.predict_proba([1,2,3])[0])
>>> train_test(1)
n = 1 [('apple', 0.39097541289393828), ('orange', 0.60902458710606167)]
>>> for n in range(1,10):
... train_test(n)
n = 1 [('apple', 0.39097541289393828), ('orange', 0.60902458710606167)]
n = 2 [('apple', 0.98437355278112448), ('orange', 0.015626447218875527)]
n = 3 [('apple', 0.90235408180319321), ('orange', 0.097645918196806694)]
n = 4 [('apple', 0.83333299908143665), ('orange', 0.16666700091856332)]
n = 5 [('apple', 0.85714254878984497), ('orange', 0.14285745121015511)]
n = 6 [('apple', 0.87499969631893626), ('orange', 0.1250003036810636)]
n = 7 [('apple', 0.88888844127886335), ('orange', 0.11111155872113669)]
n = 8 [('apple', 0.89999988018127364), ('orange', 0.10000011981872642)]
n = 9 [('apple', 0.90909082368682159), ('orange', 0.090909176313178491)]
How should I use this function safely in my code? At the very least, is there any value of n for which it will be guaranteed to agree with the result of model.predict?
3 Answers
If you use svm.LinearSVC() as the estimator and .decision_function() (which is like svm.SVC's .predict_proba()) to sort the results from the most probable class to the least probable one, this agrees with the .predict() function. Plus, this estimator is faster and gives almost the same results as svm.SVC().
The only drawback for you might be that .decision_function() gives a signed value, something like between -1 and 3, instead of a probability value. But it agrees with the prediction.
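As a rough sketch of this suggestion (the toy data is borrowed from the question, repeated a few times; the variable names are my own and nothing here is tuned), you can check that the sign of the decision value picks the same class as .predict() in the two-class case:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy data from the question, repeated so there are a few samples per class.
X = np.array([[1, 2, 3], [2, 3, 4]] * 5)
y = np.array(["apple", "orange"] * 5)

model = LinearSVC()
model.fit(X, y)

# For two classes, decision_function returns one signed score per sample:
# a positive score means classes_[1], otherwise classes_[0].
scores = model.decision_function(X)
from_scores = model.classes_[(scores > 0).astype(int)]
labels = model.predict(X)

for label, score in zip(labels, scores):
    print(label, score)

# The ranking derived from the scores matches the hard predictions.
print((from_scores == labels).all())
```

The scores are not probabilities, so they are only useful for ranking or thresholding, not for reporting a confidence percentage.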
predict_proba is using the Platt scaling feature of libsvm to calibrate probabilities, see:
- How does sklearn.svm.svc's function predict_proba() work internally?
So indeed the hyperplane predictions and the proba calibration can disagree, especially if you only have 2 samples in your dataset. It's weird that the internal cross validation done by libsvm for scaling the probabilities does not fail (explicitly) in this case. Maybe this is a bug. One would have to dive into the Platt scaling code of libsvm to understand what's happening.
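To illustrate how such a disagreement can arise, here is a minimal sketch of Platt's sigmoid, P(positive | f) = 1 / (1 + exp(A*f + B)), where f is the signed decision value. The parameters A and B below are made up for illustration and are not what libsvm would fit on the question's data. Whenever B is not zero, the point where the probability crosses 0.5 is no longer f = 0, so a sample close to the hyperplane can get a hard label on one side and a calibrated probability on the other:

```python
import math


def platt_probability(decision_value, a, b):
    """Platt's sigmoid: P(positive | f) = 1 / (1 + exp(a * f + b))."""
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


# Hypothetical fitted parameters, chosen only to show the effect.
A, B = -2.0, 1.0

for f in (-1.0, 0.2, 1.0):
    # Hard label: sign of the decision value (what predict uses).
    hard_label = "positive" if f > 0 else "negative"
    # Soft label: which side of 0.5 the calibrated probability falls on.
    p = platt_probability(f, A, B)
    soft_label = "positive" if p > 0.5 else "negative"
    print(f, hard_label, round(p, 3), soft_label)
```

With these made-up values the two labels differ for f = 0.2 and agree for the other two points. With only a handful of training samples, A and B are estimated from very little data, which is one plausible reason the shift can be large.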
There is some confusion as to what predict_proba actually does. It does not predict probabilities as the title suggests, but outputs distances. In the apple vs orange example (0.39097541, 0.60902459), the shortest distance, 0.39097541, is the apple class, which is counterintuitive. You are looking for the highest probability, but that's not the case.
Another source of confusion stems from the fact that predict_proba does match hard labels, just not in the sequential order of classes from 0..n. Scikit seems to shuffle the classes, but it is possible to map them.
Here is how it works.
Say we have 5 classes with labels:
classifier.classes_ = [0 1 2 3 4]
target names = ['1', '2', '3', '6', '8']
predicted labels [2 0 1 0 4]
[[ 0.20734121 0.20451986 0.17262553 0.20768649 0.20782692]
[ 0.19099348 0.2018391 0.20222314 0.20136784 0.20357644]
[ 0.19982284 0.19497121 0.20399376 0.19824784 0.20296435]
[ 0.19884577 0.1999416 0.19998889 0.20092702 0.20029672]
[ 0.20328893 0.2025956 0.20500402 0.20383255 0.1852789 ]]
Confusion matrix:
[[1 0 0 0 0]
[0 1 0 0 0]
[0 0 1 0 0]
[1 0 0 0 0]
[0 0 0 0 1]]
y_test [2 0 1 3 4]
pred [2 0 1 0 4]
classifier.classes_ = [0 1 2 3 4]
Anything but the third class is a match. According to the predicted labels in the confusion matrix (cm), class 0 is predicted and the actual class is 0, argmax(pred_prob). But it's mapped to
y_test [2 0 1 3 4]
so find the second class
0 1 2 3 4
[ 0.20734121 0.20451986 0.17262553 0.20768649 0.20782692]
and the winner is **0.17262553**
Let's do it again. Look at misclassification result number 4, where the actual label is 4 and the predicted label is 1 according to cm.
BUT y_test [2 0 1 3 4] pred [2 0 1 0 4]
which translates to actual label 3 predicted label 0
0 1 2 3 4
[ 0.19884577 0.1999416 0.19998889 0.20092702 0.20029672]
look at label number 0, and the winner is **0.19884577**
These are my 0.02.