How do I use shuffle with KFold in scikit-learn?

Time: 2022-06-23 07:00:44

I am running 10-fold CV using the KFold class provided by scikit-learn in order to select some kernel parameters. I am implementing this grid-search procedure:

1-pick a set of parameters
2-create an SVM
3-create a KFold
4-get the data that corresponds to training/cv_test
5-train the model (clf.fit)
6-classify the cv_test data
7-calculate the cv-error
8-repeat 1-7
9-when done, pick the parameters that give the lowest average(cv-error)
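
The steps above can be sketched roughly as follows. This is only an illustration, not the asker's actual code: the synthetic dataset, the RBF kernel, the values in `param_grid`, and the helper name `mean_cv_error` are all assumptions. A single KFold object with a fixed `random_state` is reused so that every parameter candidate is scored on the same splits.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.svm import SVC

# Synthetic data standing in for the real dataset (assumption).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Step 1: hypothetical candidate parameters for an RBF kernel.
param_grid = [
    {"C": C, "gamma": gamma}
    for C in (0.1, 1.0, 10.0)
    for gamma in (0.01, 0.1)
]

# Step 3: one KFold for all candidates; the integer seed makes the
# shuffled splits identical on every call to split().
kfold = KFold(n_splits=10, shuffle=True, random_state=0)


def mean_cv_error(params):
    """Average misclassification rate over the 10 folds (steps 2-7)."""
    errors = []
    for train_idx, test_idx in kfold.split(X):
        clf = SVC(kernel="rbf", **params)                   # step 2
        clf.fit(X[train_idx], y[train_idx])                 # steps 4-5
        accuracy = clf.score(X[test_idx], y[test_idx])      # step 6
        errors.append(1.0 - accuracy)                       # step 7
    return float(np.mean(errors))


# Steps 8-9: score every candidate and keep the lowest average error.
results = [(mean_cv_error(p), p) for p in param_grid]
best_error, best_params = min(results, key=lambda r: r[0])
print(best_params, best_error)
```

Because the splits are fixed by the seed, rerunning this script should select the same parameters each time.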

If I do not use shuffle when creating the KFold, I get essentially the same average(cv-error) values every time I repeat a run, and the "best results" are repeatable. If I use shuffle, I get different average(cv-error) values when I repeat the same run several times, and the "best values" are not repeatable. I can understand that I should get different cv-errors for each KFold pass, but I expected the final average to be the same. How does KFold with shuffle really work? Each time the KFold is called, it shuffles my indices and generates training/test data. How does it pick the different folds for training/testing? Does it pick them randomly? In which situations is shuffle advantageous, and in which is it not?

1 answer

#1


5  

If shuffle is True, the whole data set is first shuffled and then split into the K folds. For repeatable behavior, you can set random_state, for example to an integer seed (random_state=0). If the parameters you select depend on the shuffling, your parameter selection is very unstable. You probably have very little training data, or you are using too few folds (like 2 or 3).
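
A small check of the repeatability point, as a sketch: the toy array and the helper name `fold_test_indices` are made up for illustration. With the same integer seed, two separately constructed KFold objects produce identical test folds, and the folds still partition all sample indices.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 samples, 2 features (illustrative only).
X = np.arange(20).reshape(10, 2)


def fold_test_indices(random_state):
    """Return the test indices of each fold for a shuffled 5-fold split."""
    kf = KFold(n_splits=5, shuffle=True, random_state=random_state)
    return [test_idx.tolist() for _, test_idx in kf.split(X)]


same_seed_a = fold_test_indices(0)
same_seed_b = fold_test_indices(0)

# Same seed, same folds.
print(same_seed_a == same_seed_b)  # True

# Every sample lands in exactly one test fold.
all_test = sorted(i for fold in same_seed_a for i in fold)
print(all_test == list(range(10)))  # True
```

Leaving `random_state` unset draws a fresh shuffle on each run, which is the source of the run-to-run variation described in the question.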

The "shuffle" option is mainly useful if your data is somehow sorted by class, because then each fold might contain only samples from one class (sorted classes are particularly dangerous for stochastic gradient descent classifiers). For other classifiers, it should make no difference. If the results vary strongly with the shuffling, your parameter selection is likely to be uninformative (aka garbage).
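
To illustrate the sorted-by-class case, here is a minimal sketch with made-up labels (30 samples, three classes of 10, stored in class order). The helper name `classes_per_test_fold` is hypothetical. Without shuffling, each of the 3 test folds holds exactly one class, so every model is tested on a class it never saw in training.

```python
import numpy as np
from sklearn.model_selection import KFold

# Labels sorted by class: ten 0s, then ten 1s, then ten 2s (illustrative).
y = np.repeat([0, 1, 2], 10)
X = np.zeros((30, 1))  # features are irrelevant for this demonstration


def classes_per_test_fold(kf):
    """List the distinct class labels present in each test fold."""
    return [sorted(set(y[test_idx].tolist())) for _, test_idx in kf.split(X)]


unshuffled = classes_per_test_fold(KFold(n_splits=3, shuffle=False))
shuffled = classes_per_test_fold(
    KFold(n_splits=3, shuffle=True, random_state=0)
)

print(unshuffled)  # [[0], [1], [2]]
print(shuffled)    # typically a mix of classes in each fold
```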
