Python: fit multiple machine learning models, plot each model's calibration curve, and use histograms to visualize the distribution of each model's mean predicted probabilities

Date: 2025-01-17 12:00:01


Contents

Import packages and libraries

Simulate data and split it into training/test sets

Fit the models, plot calibration curves, and plot histograms of the mean predicted probabilities


A calibration curve measures the discrepancy between a model's predicted probabilities and the actual outcomes. In machine learning, calibration curves are commonly used to assess binary classifiers, in particular the accuracy of their predicted class probabilities.

The curve plots the mean predicted probability on the x-axis against the observed fraction of positives on the y-axis. A well-calibrated classifier produces predicted probabilities that closely match the observed frequencies, so its calibration curve lies close to the perfect 45-degree diagonal.

The shape of the curve shows how trustworthy a model's probability estimates are and exposes over- or under-confident classifiers. In practice, calibration curves can also help choose the decision threshold on the predicted probability that yields the best classifier performance.
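
As a minimal sketch of what such a curve plots (assuming some already fitted binary classifier `clf` with a `predict_proba` method and a held-out set `X_test`, `y_test`), scikit-learn's `calibration_curve` returns, for each probability bin, the observed fraction of positives and the mean predicted probability:

from sklearn.calibration import calibration_curve

# Observed fraction of positives (y-axis) and mean predicted probability (x-axis)
# per bin; a well-calibrated model yields points close to the diagonal y = x.
prob_true, prob_pred = calibration_curve(
    y_test, clf.predict_proba(X_test)[:, 1], n_bins=10
)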

Import packages and libraries

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.calibration import CalibrationDisplay
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
import numpy as np
from sklearn.svm import LinearSVC

Simulate data and split it into training/test sets

X, y = make_classification(
    n_samples=10000, n_features=20, n_informative=2, n_redundant=2, random_state=42
)
train_samples = 100  # Samples used for training the models
# shuffle=False keeps the first `train_samples` rows for training;
# the remaining 9900 samples are held out for evaluating calibration.
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    shuffle=False,
    test_size=10000 - train_samples,
)

Fit the models, plot calibration curves, and plot histograms of the mean predicted probabilities

class NaivelyCalibratedLinearSVC(LinearSVC):
    """LinearSVC with `predict_proba` method that naively scales
    `decision_function` output."""

    def fit(self, X, y):
        super().fit(X, y)
        df = self.decision_function(X)
        self.df_min_ = df.min()
        self.df_max_ = df.max()

    def predict_proba(self, X):
        """Min-max scale output of `decision_function` to [0, 1]."""
        df = self.decision_function(X)
        calibrated_df = (df - self.df_min_) / (self.df_max_ - self.df_min_)
        proba_pos_class = np.clip(calibrated_df, 0, 1)
        proba_neg_class = 1 - proba_pos_class
        proba = np.c_[proba_neg_class, proba_pos_class]
        return proba


# Create classifiers
lr = LogisticRegression()
gnb = GaussianNB()
svc = NaivelyCalibratedLinearSVC(C=1.0)
rfc = RandomForestClassifier()

clf_list = [
    (lr, "Logistic"),
    (gnb, "Naive Bayes"),
    (svc, "SVC"),
    (rfc, "Random forest"),
]

import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec

fig = plt.figure(figsize=(10, 10))
gs = GridSpec(4, 2)
colors = plt.get_cmap("Dark2")

# The calibration curves of all models share the top half of the figure
ax_calibration_curve = fig.add_subplot(gs[:2, :2])
calibration_displays = {}
markers = ["^", "v", "s", "o"]
for i, (clf, name) in enumerate(clf_list):
    clf.fit(X_train, y_train)
    display = CalibrationDisplay.from_estimator(
        clf,
        X_test,
        y_test,
        n_bins=10,
        name=name,
        ax=ax_calibration_curve,
        color=colors(i),
        marker=markers[i],
    )
    calibration_displays[name] = display

ax_calibration_curve.grid()
ax_calibration_curve.set_title("Calibration plots")

# Add a histogram of the predicted probabilities for each model
grid_positions = [(2, 0), (2, 1), (3, 0), (3, 1)]
for i, (_, name) in enumerate(clf_list):
    row, col = grid_positions[i]
    ax = fig.add_subplot(gs[row, col])

    ax.hist(
        calibration_displays[name].y_prob,
        range=(0, 1),
        bins=10,
        label=name,
        color=colors(i),
    )
    ax.set(title=name, xlabel="Mean predicted probability", ylabel="Count")

plt.tight_layout()
plt.show()
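
As a numerical complement to the plots, one could also compare the models with a proper scoring rule such as the Brier score; a minimal sketch, reusing the `clf_list` and the test split fitted above:

from sklearn.metrics import brier_score_loss

# Lower Brier scores indicate better-calibrated (and sharper) probability estimates.
for clf, name in clf_list:
    y_prob = clf.predict_proba(X_test)[:, 1]
    print(f"{name}: Brier score = {brier_score_loss(y_test, y_prob):.4f}")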

 

A calibration curve, in the context of machine learning, is a tool used to evaluate the performance of a binary classifier. It measures the accuracy of the predicted probabilities produced by a model against the true probabilities of the events it's trying to predict.

More specifically, a calibration curve is a graph that plots the actual probabilities of a given event against the predicted probabilities assigned by a classifier. The ideal curve would be a 45-degree line, which would indicate that the predicted probabilities are perfectly calibrated with the true probabilities of the event.

In practice, however, most classifiers tend to be overconfident or underconfident in their predictions. For example, a classifier might assign a 90% probability to an event that only has a 60% chance of occurring. A calibration curve can help identify this kind of mis-calibration by measuring the difference between the predicted and actual probabilities at different thresholds.

A good calibration curve will show a high level of agreement between the predicted and actual probabilities, indicating that the classifier is well-calibrated. A poorly calibrated classifier, on the other hand, produces a curve with a significant discrepancy between the predicted and actual probabilities, indicating that it is over- or under-confident.
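
When a classifier turns out to be over- or under-confident, scikit-learn's `CalibratedClassifierCV` can wrap it and learn a post-hoc correction of its probabilities (Platt/sigmoid scaling or isotonic regression). A minimal sketch, assuming the `X_train`/`y_train` split defined earlier:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

# Wrap the raw LinearSVC and learn a sigmoid (Platt) mapping with 3-fold CV;
# predict_proba then returns the calibrated probabilities.
calibrated_svc = CalibratedClassifierCV(LinearSVC(C=1.0), method="sigmoid", cv=3)
calibrated_svc.fit(X_train, y_train)
calibrated_probs = calibrated_svc.predict_proba(X_test)[:, 1]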

Reference: sklearn