Python: fit multiple machine learning models, plot each model's calibration curve, and use histograms to visualize the distribution of each model's mean predicted probabilities

Date: 2025-01-17 12:00:01


Contents

Import packages and libraries

Simulate data and split it into training/test sets

Fit the models, plot calibration curves, and plot histograms of the mean predicted probabilities


A calibration curve measures the discrepancy between a model's predicted probabilities and the actual outcomes. In machine learning, calibration curves are commonly used to assess binary classifiers, in particular the accuracy of their predicted class probabilities.

The curve plots the mean predicted probability on the x-axis against the observed fraction of positives on the y-axis. A well-calibrated classifier produces predicted probabilities that closely match the observed frequencies, so its calibration curve lies close to the perfect 45-degree diagonal.

The shape of the curve shows how trustworthy a model's probability estimates are and exposes over- or under-confident classifiers. In practice, calibration curves can also help choose the decision threshold on the predicted probability that yields the best classifier performance.
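
As a minimal sketch of what such a curve plots (assuming some already fitted binary classifier `clf` with a `predict_proba` method and a held-out set `X_test`, `y_test`), scikit-learn's `calibration_curve` returns, for each probability bin, the observed fraction of positives and the mean predicted probability:

from sklearn.calibration import calibration_curve

# Observed fraction of positives (y-axis) and mean predicted probability (x-axis)
# per bin; a well-calibrated model yields points close to the diagonal y = x.
prob_true, prob_pred = calibration_curve(
    y_test, clf.predict_proba(X_test)[:, 1], n_bins=10
)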

Import packages and libraries

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.calibration import CalibrationDisplay
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
import numpy as np
from sklearn.svm import LinearSVC

Simulate data and split it into training/test sets

X, y = make_classification(
    n_samples=10000, n_features=20, n_informative=2, n_redundant=2, random_state=42
)
train_samples = 100  # Samples used for training the models
# shuffle=False keeps the first `train_samples` rows for training;
# the remaining 9900 samples are held out for evaluating calibration.
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    shuffle=False,
    test_size=10000 - train_samples,
)

Fit the models, plot calibration curves, and plot histograms of the mean predicted probabilities

class NaivelyCalibratedLinearSVC(LinearSVC):
    """LinearSVC with `predict_proba` method that naively scales
    `decision_function` output."""

    def fit(self, X, y):
        super().fit(X, y)
        df = self.decision_function(X)
        self.df_min_ = df.min()
        self.df_max_ = df.max()

    def predict_proba(self, X):
        """Min-max scale output of `decision_function` to [0, 1]."""
        df = self.decision_function(X)
        calibrated_df = (df - self.df_min_) / (self.df_max_ - self.df_min_)
        proba_pos_class = np.clip(calibrated_df, 0, 1)
        proba_neg_class = 1 - proba_pos_class
        proba = np.c_[proba_neg_class, proba_pos_class]
        return proba


# Create classifiers
lr = LogisticRegression()
gnb = GaussianNB()
svc = NaivelyCalibratedLinearSVC(C=1.0)
rfc = RandomForestClassifier()

clf_list = [
    (lr, "Logistic"),
    (gnb, "Naive Bayes"),
    (svc, "SVC"),
    (rfc, "Random forest"),
]

import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec

fig = plt.figure(figsize=(10, 10))
gs = GridSpec(4, 2)
colors = plt.get_cmap("Dark2")

# The calibration curves of all models share the top half of the figure
ax_calibration_curve = fig.add_subplot(gs[:2, :2])
calibration_displays = {}
markers = ["^", "v", "s", "o"]
for i, (clf, name) in enumerate(clf_list):
    clf.fit(X_train, y_train)
    display = CalibrationDisplay.from_estimator(
        clf,
        X_test,
        y_test,
        n_bins=10,
        name=name,
        ax=ax_calibration_curve,
        color=colors(i),
        marker=markers[i],
    )
    calibration_displays[name] = display

ax_calibration_curve.grid()
ax_calibration_curve.set_title("Calibration plots")

# Add a histogram of the predicted probabilities for each model
grid_positions = [(2, 0), (2, 1), (3, 0), (3, 1)]
for i, (_, name) in enumerate(clf_list):
    row, col = grid_positions[i]
    ax = fig.add_subplot(gs[row, col])

    ax.hist(
        calibration_displays[name].y_prob,
        range=(0, 1),
        bins=10,
        label=name,
        color=colors(i),
    )
    ax.set(title=name, xlabel="Mean predicted probability", ylabel="Count")

plt.tight_layout()
plt.show()
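
As a numerical complement to the plots, one could also compare the models with a proper scoring rule such as the Brier score; a minimal sketch, reusing the `clf_list` and the test split fitted above:

from sklearn.metrics import brier_score_loss

# Lower Brier scores indicate better-calibrated (and sharper) probability estimates.
for clf, name in clf_list:
    y_prob = clf.predict_proba(X_test)[:, 1]
    print(f"{name}: Brier score = {brier_score_loss(y_test, y_prob):.4f}")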

 

A calibration curve, in the context of machine learning, is a tool used to evaluate the performance of a binary classifier. It measures the accuracy of the predicted probabilities produced by a model against the true probabilities of the events it's trying to predict.

More specifically, a calibration curve is a graph that plots the actual probabilities of a given event against the predicted probabilities assigned by a classifier. The ideal curve would be a 45-degree line, which would indicate that the predicted probabilities are perfectly calibrated with the true probabilities of the event.

In practice, however, most classifiers tend to be overconfident or underconfident in their predictions. For example, a classifier might assign a 90% probability to an event that only has a 60% chance of occurring. A calibration curve can help identify this kind of mis-calibration by measuring the difference between the predicted and actual probabilities at different thresholds.

A good calibration curve will show a high level of agreement between the predicted and actual probabilities, indicating that the classifier is well-calibrated. A poorly calibrated classifier, on the other hand, produces a curve with a significant discrepancy between the predicted and actual probabilities, indicating that it is over- or under-confident.
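
When a classifier turns out to be over- or under-confident, scikit-learn's `CalibratedClassifierCV` can wrap it and learn a post-hoc correction of its probabilities (Platt/sigmoid scaling or isotonic regression). A minimal sketch, assuming the `X_train`/`y_train` split defined earlier:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

# Wrap the raw LinearSVC and learn a sigmoid (Platt) mapping with 3-fold CV;
# predict_proba then returns the calibrated probabilities.
calibrated_svc = CalibratedClassifierCV(LinearSVC(C=1.0), method="sigmoid", cv=3)
calibrated_svc.fit(X_train, y_train)
calibrated_probs = calibrated_svc.predict_proba(X_test)[:, 1]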

Reference: sklearn