I have a dataframe object in R/Python that looks like:
df columns:
fraud = [1,1,0,0,0,0,0,0,0,1]
score = [0.84, 1, 1.1, 0.4, 0.6, 0.13, 0.32, 1.4, 0.9, 0.45]
When I use roc_curve in Python I get fpr, tpr and thresholds.
I have two questions, maybe a bit theoretical, but please explain them to me:
- Are these thresholds actually computed by the function? I have calculated fpr and tpr manually, but are these thresholds simply equal to the score values above?
- How can I generate the same fpr, tpr and thresholds in R?
1 Solution
#1
The optimal threshold usually corresponds to the value that maximizes tpr + tnr (sensitivity + specificity); this quantity is called the Youden J index (tpr + tnr - 1), though it also goes by several other names.
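To the first question: as far as I know, sklearn's roc_curve thresholds are essentially the distinct score values themselves (in decreasing order, plus one extra entry at the top), while pROC in R places its thresholds halfway between consecutive scores, so the fpr/tpr pairs agree but the threshold values are not numerically identical. A minimal sketch of both the full curve and the Youden-optimal cutoff on the toy fraud/score data from the question, using pROC:
library(pROC)
fraud <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 1)
score <- c(0.84, 1, 1.1, 0.4, 0.6, 0.13, 0.32, 1.4, 0.9, 0.45)
r <- roc(response = fraud, predictor = score, levels = c(0, 1), direction = "<")
# every operating point of the curve: threshold with its fpr and tpr
data.frame(threshold = r$thresholds, fpr = 1 - r$specificities, tpr = r$sensitivities)
# threshold that maximizes sensitivity + specificity (Youden J)
coords(r, "best", best.method = "youden")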
Take the following example with the Sonar dataset:
library(mlbench)
library(xgboost)
library(caret)
library(pROC)
data(Sonar)
Let's fit a model on part of the Sonar data and predict on the remaining part:
ind <- createDataPartition(Sonar$Class, p = 0.7, list = FALSE)
train <- Sonar[ind, ]
test <- Sonar[-ind, ]
X <- as.matrix(train[, -61])   # column 61 is the Class label
dtrain <- xgb.DMatrix(data = X, label = as.numeric(train$Class) - 1)
dtest <- xgb.DMatrix(data = as.matrix(test[, -61]))
Fit the model on the training data:
model <- xgb.train(data = dtrain,
                   params = list(objective = "binary:logistic",
                                 eval_metric = "auc",
                                 eta = 0.1,
                                 max_depth = 6,
                                 subsample = 0.8,
                                 lambda = 0.1),
                   verbose = 0, maximize = TRUE,
                   nrounds = 10)
preds <- predict(model, dtest)
true <- as.numeric(test$Class)-1
plot(roc(response = true,
         predictor = preds,
         levels = c(0, 1)),
     lwd = 1.5, print.thres = TRUE, print.auc = TRUE, print.auc.y = 0.5)
So if you set the threshold at 0.578 you maximize tpr + tnr, and the values in the parentheses on the plot are the corresponding tpr and tnr. Verify:
sensitivity(as.factor(ifelse(preds > 0.578, "1", "0")), as.factor(true))
#output
[1] 0.9090909
specificity(as.factor(ifelse(preds > 0.578, "1", "0")), as.factor(true))
#output
[1] 0.7586207
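To reproduce in R the kind of output roc_curve gives in Python, the pROC roc object itself stores every operating point; a short sketch, rebuilding the same roc object that is plotted above:
roc_obj <- roc(response = true, predictor = preds, levels = c(0, 1))
# every threshold with its fpr and tpr, analogous to sklearn's roc_curve output
roc_df <- data.frame(threshold = roc_obj$thresholds,
                     fpr = 1 - roc_obj$specificities,
                     tpr = roc_obj$sensitivities)
head(roc_df)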
You could also create predictions over many possible thresholds:
thresh <- do.call(rbind, lapply((1:1000) / 1000, function(x) {
  sens <- sensitivity(as.factor(ifelse(preds > x, "1", "0")), as.factor(true))
  spec <- specificity(as.factor(ifelse(preds > x, "1", "0")), as.factor(true))
  data.frame(sens, spec)
}))
and now:
thresh[which.max(rowSums(thresh)),]
#output
sens spec
560 0.9090909 0.7586207
You can also check this out:
thresh[555:600,]
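One small addition of my own (not in the original answer): storing the cutoff alongside each row makes the winning threshold explicit instead of having to read it off the row name:
thresh$threshold <- (1:1000) / 1000
# same row as before, but now the cutoff itself is visible
thresh[which.max(thresh$sens + thresh$spec), ]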
That being said, with financial data it is usually not just the predicted class that is of interest but also the cost associated with wrong predictions, which is typically not the same for false negatives and false positives, so these models are often fit using cost-sensitive classification; there is much more to read on the matter. On another note, when deciding on the threshold you should do it either on cross-validated data or on a validation set specifically designated for the task; if you tune it on the test set, that inevitably leads to over-optimistic estimates.
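As a rough sketch of what cost-sensitive threshold selection can look like (the unit costs below are hypothetical placeholders, and in practice this search should be run on cross-validated or validation-set predictions rather than on the test set used here):
cost_fp <- 1     # assumed cost of a false positive
cost_fn <- 10    # assumed cost of a false negative
cutoffs <- (1:999) / 1000
total_cost <- sapply(cutoffs, function(x) {
  pred <- as.numeric(preds > x)
  cost_fp * sum(pred == 1 & true == 0) + cost_fn * sum(pred == 0 & true == 1)
})
cutoffs[which.min(total_cost)]   # cutoff that minimizes the assumed total cost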