predict_proba or decision_function as estimator "confidence"

Posted: 2021-07-04 23:56:14

I'm using LogisticRegression as a model to train an estimator in scikit-learn. The features I use are (mostly) categorical, and so are the labels. Therefore, I use a DictVectorizer and a LabelEncoder, respectively, to encode the values properly.


The training part is fairly straightforward, but I'm having problems with the test part. The simple thing to do is to use the "predict" method of the trained model and get the predicted label. However, for the processing I need to do afterwards, I need the probability for each possible label (class) for each particular instance. I decided to use the "predict_proba" method. However, I get different results for the same test instance, whether I use this method when the instance is by itself or accompanied by others.


Below is code that reproduces the problem.


from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder


X_real = [{'head': u'n\xe3o', 'dep_rel': u'ADVL'}, 
          {'head': u'v\xe3o', 'dep_rel': u'ACC'}, 
          {'head': u'empresa', 'dep_rel': u'SUBJ'}, 
          {'head': u'era', 'dep_rel': u'ACC'}, 
          {'head': u't\xeam', 'dep_rel': u'ACC'}, 
          {'head': u'import\xe2ncia', 'dep_rel': u'PIV'}, 
          {'head': u'balan\xe7o', 'dep_rel': u'SUBJ'}, 
          {'head': u'ocupam', 'dep_rel': u'ACC'}, 
          {'head': u'acesso', 'dep_rel': u'PRED'}, 
          {'head': u'elas', 'dep_rel': u'SUBJ'}, 
          {'head': u'assinaram', 'dep_rel': u'ACC'}, 
          {'head': u'agredido', 'dep_rel': u'SUBJ'}, 
          {'head': u'pol\xedcia', 'dep_rel': u'ADVL'}, 
          {'head': u'se', 'dep_rel': u'ACC'}] 
y_real = [u'AM-NEG', u'A1', u'A0', u'A1', u'A1', u'A1', u'A0', u'A1', u'AM-ADV', u'A0', u'A1', u'A0', u'A2', u'A1']

feat_encoder = DictVectorizer()
feat_encoder.fit(X_real)

label_encoder = LabelEncoder()
label_encoder.fit(y_real)

model = LogisticRegression()
model.fit(feat_encoder.transform(X_real), label_encoder.transform(y_real))

print("Test 1...")
X_test1 = [{'head': u'governo', 'dep_rel': u'SUBJ'}]
X_test1_encoded = feat_encoder.transform(X_test1)
print("Features Encoded")
print(X_test1_encoded)
print("Shape")
print(X_test1_encoded.shape)
print("decision_function:")
print(model.decision_function(X_test1_encoded))
print("predict_proba:")
print(model.predict_proba(X_test1_encoded))

print("Test 2...")
X_test2 = [{'head': u'governo', 'dep_rel': u'SUBJ'}, 
           {'head': u'atrav\xe9s', 'dep_rel': u'ADVL'}, 
           {'head': u'configuram', 'dep_rel': u'ACC'}]

X_test2_encoded = feat_encoder.transform(X_test2)
print("Features Encoded")
print(X_test2_encoded)
print("Shape")
print(X_test2_encoded.shape)
print("decision_function:")
print(model.decision_function(X_test2_encoded))
print("predict_proba:")
print(model.predict_proba(X_test2_encoded))


print("Test 3...")
X_test3 = [{'head': u'governo', 'dep_rel': u'SUBJ'}, 
           {'head': u'atrav\xe9s', 'dep_rel': u'ADVL'}, 
           {'head': u'configuram', 'dep_rel': u'ACC'},
           {'head': u'configuram', 'dep_rel': u'ACC'}]

X_test3_encoded = feat_encoder.transform(X_test3)
print("Features Encoded")
print(X_test3_encoded)
print("Shape")
print(X_test3_encoded.shape)
print("decision_function:")
print(model.decision_function(X_test3_encoded))
print("predict_proba:")
print(model.predict_proba(X_test3_encoded))

Following is the output obtained:


Test 1...
Features Encoded
  (0, 4)    1.0
Shape
(1, 19)
decision_function:
[[ 0.55372615 -1.02949707 -1.75474347 -1.73324726 -1.75474347]]
predict_proba:
[[ 1.  1.  1.  1.  1.]]
Test 2...
Features Encoded
  (0, 4)    1.0
  (1, 1)    1.0
  (2, 0)    1.0
Shape
(3, 19)
decision_function:
[[ 0.55372615 -1.02949707 -1.75474347 -1.73324726 -1.75474347]
 [-1.07370197 -0.69103629 -0.89306092 -1.51402163 -0.89306092]
 [-1.55921001  1.11775556 -1.92080112 -1.90133404 -1.92080112]]
predict_proba:
[[ 0.59710757  0.19486904  0.26065002  0.32612646  0.26065002]
 [ 0.23950111  0.24715931  0.51348452  0.3916478   0.51348452]
 [ 0.16339132  0.55797165  0.22586546  0.28222574  0.22586546]]
Test 3...
Features Encoded
  (0, 4)    1.0
  (1, 1)    1.0
  (2, 0)    1.0
  (3, 0)    1.0
Shape
(4, 19)
decision_function:
[[ 0.55372615 -1.02949707 -1.75474347 -1.73324726 -1.75474347]
 [-1.07370197 -0.69103629 -0.89306092 -1.51402163 -0.89306092]
 [-1.55921001  1.11775556 -1.92080112 -1.90133404 -1.92080112]
 [-1.55921001  1.11775556 -1.92080112 -1.90133404 -1.92080112]]
predict_proba:
[[ 0.5132474   0.12507868  0.21262531  0.25434403  0.21262531]
 [ 0.20586462  0.15864173  0.4188751   0.30544372  0.4188751 ]
 [ 0.14044399  0.3581398   0.1842498   0.22010613  0.1842498 ]
 [ 0.14044399  0.3581398   0.1842498   0.22010613  0.1842498 ]]

As can be seen, the values obtained with "predict_proba" for the instance in "X_test1" change when that same instance appears among others in "X_test2". Also, "X_test3" just reproduces "X_test2" and adds one more instance (equal to the last one in "X_test2"), yet the probability values for all of them change. Why does this happen? Also, I find it really strange that ALL the probabilities for "X_test1" are 1; shouldn't they sum to 1?


Now, if instead of using "predict_proba" I use "decision_function", I get the consistency I need in the values obtained. The problem is that I get negative values, and even some of the positive ones are greater than 1.

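For what it's worth, in current scikit-learn the two outputs are directly related: for multinomial logistic regression (the default formulation for multiclass problems in recent versions), predict_proba is just the softmax of decision_function, so each row of predict_proba necessarily sums to 1 while the raw decision scores are unbounded. A sketch on a toy dataset:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

scores = model.decision_function(X)   # raw scores, shape (n_samples, n_classes)
# Numerically stable softmax over the class axis
e = np.exp(scores - scores.max(axis=1, keepdims=True))
probs = e / e.sum(axis=1, keepdims=True)

assert np.allclose(probs, model.predict_proba(X))
assert np.allclose(probs.sum(axis=1), 1.0)
```

So decision_function values can be negative or greater than 1 by design; they only become probabilities after this normalization.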

So, which should I use? Why do the values of "predict_proba" change that way? Am I misunderstanding what those values mean?


Thanks in advance for any help you could give me.


UPDATE


As suggested, I changed the code so as to also print the encoded "X_test1", "X_test2" and "X_test3", as well as their shapes. This doesn't appear to be the problem, as the encoding is consistent for the same instances across the test sets.


1 Answer

#1


7  

As indicated in the question's comments, the error was caused by a bug in the scikit-learn version I was using. The problem was solved by updating to the most recent stable version, 0.12.1.

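On any non-buggy scikit-learn release, the behavior the question expected does hold: each row of predict_proba sums to 1 and depends only on that row's features, never on which other instances happen to be in the same batch. A quick sanity check on data shaped like the question's (note that modern LogisticRegression accepts string labels directly, so the LabelEncoder step is optional):

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

X_dicts = [{'dep_rel': 'SUBJ'}, {'dep_rel': 'ACC'},
           {'dep_rel': 'ADVL'}, {'dep_rel': 'ACC'}]
y = ['A0', 'A1', 'A2', 'A1']

vec = DictVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(X_dicts), y)

# Same instance scored alone and inside a batch
single = clf.predict_proba(vec.transform([{'dep_rel': 'SUBJ'}]))
batch = clf.predict_proba(vec.transform([{'dep_rel': 'SUBJ'},
                                         {'dep_rel': 'ACC'}]))

assert np.allclose(single.sum(axis=1), 1.0)   # rows are proper distributions
assert np.allclose(single[0], batch[0])       # independent of batch contents
```

If either assertion ever fails, that points to an installation problem rather than a modeling mistake.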
