I'm following one of the kernels on Kaggle, specifically a kernel for Credit Card Fraud Detection.
I reached the step where I need to perform KFold in order to find the best parameters for Logistic Regression.
The following code is shown in the kernel itself, but for some reason (probably an older version of scikit-learn) it gives me some errors.
def printing_Kfold_scores(x_train_data,y_train_data):
    fold = KFold(len(y_train_data),5,shuffle=False)

    # Different C parameters
    c_param_range = [0.01,0.1,1,10,100]

    results_table = pd.DataFrame(index = range(len(c_param_range),2), columns = ['C_parameter','Mean recall score'])
    results_table['C_parameter'] = c_param_range

    # the k-fold will give 2 lists: train_indices = indices[0], test_indices = indices[1]
    j = 0
    for c_param in c_param_range:
        print('-------------------------------------------')
        print('C parameter: ', c_param)
        print('-------------------------------------------')
        print('')

        recall_accs = []
        for iteration, indices in enumerate(fold,start=1):
            # Call the logistic regression model with a certain C parameter
            lr = LogisticRegression(C = c_param, penalty = 'l1')

            # Use the training data to fit the model. In this case, we use the portion of the fold to train the model
            # with indices[0]. We then predict on the portion assigned as the 'test cross validation' with indices[1]
            lr.fit(x_train_data.iloc[indices[0],:],y_train_data.iloc[indices[0],:].values.ravel())

            # Predict values using the test indices in the training data
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1],:].values)

            # Calculate the recall score and append it to a list for recall scores representing the current c_parameter
            recall_acc = recall_score(y_train_data.iloc[indices[1],:].values,y_pred_undersample)
            recall_accs.append(recall_acc)
            print('Iteration ', iteration,': recall score = ', recall_acc)

        # The mean value of those recall scores is the metric we want to save and get hold of.
        results_table.ix[j,'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall score ', np.mean(recall_accs))
        print('')

    best_c = results_table.loc[results_table['Mean recall score'].idxmax()]['C_parameter']

    # Finally, we can check which C parameter is the best amongst the chosen.
    print('*********************************************************************************')
    print('Best model to choose from cross validation is with C parameter = ', best_c)
    print('*********************************************************************************')

    return best_c
The errors I'm getting are as follows. For this line:
fold = KFold(len(y_train_data),5,shuffle=False)
the error is:
TypeError: __init__() got multiple values for argument 'shuffle'
If I remove the shuffle=False from this line, I'm getting the following error:
TypeError: shuffle must be True or False; got 5
If I remove the 5 and keep the shuffle=False, I'm getting the following error:
TypeError: 'KFold' object is not iterable
which is from this line:
for iteration, indices in enumerate(fold,start=1):
If someone can help me solve this issue and suggest how it can be done with the latest version of scikit-learn, it would be much appreciated.
Thanks.
2 Answers
#1
0
KFold is a splitter, so you have to give something to split.
Example code:
import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1,1,1,1], [2,2,2,2], [3,3,3,3], [4,4,4,4]])
y = np.array([1, 2, 3, 4])

# Create your KFold: you only pass the number of splits and whether you want to shuffle.
fold = KFold(2, shuffle=False)

# To iterate over the folds, just use split()
for train_index, test_index in fold.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Then fit your classifier on X_train / y_train
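For instance, here is a minimal, self-contained sketch of fitting a classifier inside that loop. The toy data and the LogisticRegression/recall_score choices are just placeholders mirroring the kernel's setup, not part of the original answer:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import KFold

# Toy binary data, purely for illustration
X = np.array([[1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6]])
y = np.array([0, 1, 0, 1, 0, 1])

fold = KFold(2, shuffle=False)
recall_accs = []
for train_index, test_index in fold.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Fit on the training part of the fold
    lr = LogisticRegression(C=1.0, solver='liblinear')
    lr.fit(X_train, y_train)

    # Evaluate on the held-out part of the fold
    y_pred = lr.predict(X_test)
    recall_accs.append(recall_score(y_test, y_pred))

print('Mean recall:', np.mean(recall_accs))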
If you want an index for the train/test loop, just add enumerate:
for i, (train_index, test_index) in enumerate(fold.split(X)):
    print('Iteration:', i)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
I hope this works.
#2
1
That depends on how you have imported KFold.
If you did this:
from sklearn.cross_validation import KFold
then your code should work, because it requires 3 params: the length of the array, the number of splits, and shuffle.
But if you are doing this:
from sklearn.model_selection import KFold
then this will not work, because you only need to pass the number of splits and shuffle. There is no need to pass the length of the array, and you also have to change the enumerate() loop.
By the way, model_selection is the new module and is the recommended one to use. Try using it like this:
fold = KFold(5, shuffle=False)

for train_index, test_index in fold.split(x_train_data):
    # Call the logistic regression model with a certain C parameter
    lr = LogisticRegression(C = c_param, penalty = 'l1')

    # Use the training data to fit the model. In this case, we use the portion of the fold to train the model
    lr.fit(x_train_data.iloc[train_index,:], y_train_data.iloc[train_index,:].values.ravel())

    # Predict values using the test indices in the training data
    y_pred_undersample = lr.predict(x_train_data.iloc[test_index,:].values)

    # Calculate the recall score and append it to a list for recall scores representing the current c_parameter
    recall_acc = recall_score(y_train_data.iloc[test_index,:].values, y_pred_undersample)
    recall_accs.append(recall_acc)
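For completeness, here is a sketch of how the whole printing_Kfold_scores function from the question might look with sklearn.model_selection.KFold. The solver='liblinear' argument and the .astype('float64') conversion are my additions (recent scikit-learn versions need an explicit solver for penalty='l1', and the recall column must be numeric for idxmax()); treat this as a sketch rather than the kernel's exact code:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import KFold

def printing_Kfold_scores(x_train_data, y_train_data):
    # KFold now takes only the number of splits; the data goes into split()
    fold = KFold(5, shuffle=False)

    # Different C parameters
    c_param_range = [0.01, 0.1, 1, 10, 100]

    results_table = pd.DataFrame(index=range(len(c_param_range)),
                                 columns=['C_parameter', 'Mean recall score'])
    results_table['C_parameter'] = c_param_range

    for j, c_param in enumerate(c_param_range):
        print('C parameter: ', c_param)

        recall_accs = []
        # split() yields (train_indices, test_indices) pairs
        for iteration, (train_index, test_index) in enumerate(fold.split(x_train_data), start=1):
            # Fit the logistic regression model on the training part of the fold
            # (solver='liblinear' is an assumption; newer scikit-learn needs it for penalty='l1')
            lr = LogisticRegression(C=c_param, penalty='l1', solver='liblinear')
            lr.fit(x_train_data.iloc[train_index, :],
                   y_train_data.iloc[train_index, :].values.ravel())

            # Predict on the held-out part of the fold and record the recall
            y_pred_undersample = lr.predict(x_train_data.iloc[test_index, :].values)
            recall_acc = recall_score(y_train_data.iloc[test_index, :].values, y_pred_undersample)
            recall_accs.append(recall_acc)
            print('Iteration ', iteration, ': recall score = ', recall_acc)

        # .loc instead of the deprecated .ix
        results_table.loc[j, 'Mean recall score'] = np.mean(recall_accs)
        print('Mean recall score ', np.mean(recall_accs))

    # The recall column is stored as objects; cast to float so idxmax() works
    results_table['Mean recall score'] = results_table['Mean recall score'].astype('float64')
    best_c = results_table.loc[results_table['Mean recall score'].idxmax()]['C_parameter']
    print('Best model to choose from cross validation is with C parameter = ', best_c)
    return best_c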