I have a fairly large dataset in the form of a dataframe and I was wondering how I would be able to split the dataframe into two random samples (80% and 20%) for training and testing.
Thanks!
15 Answers
#1
183
I would just use numpy's rand:
In [11]: df = pd.DataFrame(np.random.randn(100, 2))
In [12]: msk = np.random.rand(len(df)) < 0.8
In [13]: train = df[msk]
In [14]: test = df[~msk]
And just to see that this has worked:
In [15]: len(test)
Out[15]: 21
In [16]: len(train)
Out[16]: 79
#2
327
scikit-learn's train_test_split is a good one.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
#3
146
Pandas' random sample will also work:
train = df.sample(frac=0.8, random_state=200)
test = df.drop(train.index)
#4
23
I would use scikit-learn's own train_test_split, and generate it from the index:
from sklearn.model_selection import train_test_split
y = df.pop('output')  # 'output' is the label column
X = df
X_train, X_test, y_train, y_test = train_test_split(X.index, y, test_size=0.2)
X.iloc[X_train]  # returns the training dataframe
X.iloc[X_test]   # returns the test dataframe
#5
6
There are many valid answers. Adding one more to the bunch.
# gets a random 80% of the entire set
X_train = X.sample(frac=0.8, random_state=1)
# gets the left out portion of the dataset
X_test = X.loc[~X.index.isin(X_train.index)]
#6
5
You can use the code below to create test and train samples:
from sklearn.model_selection import train_test_split
trainingSet, testSet = train_test_split(df, test_size=0.2)
The test size can be varied depending on the percentage of data you want to put in your test and train datasets.
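For instance, a minimal sketch of a reproducible 70/30 split (random_state is a standard train_test_split parameter; the data here is made up):
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
df = pd.DataFrame(np.random.randn(100, 2))  # example data
# fixed seed, so the split is the same on every run
trainingSet, testSet = train_test_split(df, test_size=0.3, random_state=42)
print(len(trainingSet), len(testSet))  # 70 30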
#7
4
You may also consider a stratified division into training and testing sets. A stratified division also generates training and testing sets randomly, but in such a way that the original class proportions are preserved. This makes the training and testing sets better reflect the properties of the original dataset.
import numpy as np

def get_train_test_inds(y, train_proportion=0.7):
    '''Generates indices, making a random stratified split into training and testing sets
    with proportions train_proportion and (1 - train_proportion) of the initial sample.
    y is any iterable indicating the class of each observation in the sample.
    The initial proportions of classes inside the training and
    testing sets are preserved (stratified sampling).
    '''
    y = np.array(y)
    train_inds = np.zeros(len(y), dtype=bool)
    test_inds = np.zeros(len(y), dtype=bool)
    values = np.unique(y)
    for value in values:
        value_inds = np.nonzero(y == value)[0]
        np.random.shuffle(value_inds)
        n = int(train_proportion * len(value_inds))
        train_inds[value_inds[:n]] = True
        test_inds[value_inds[n:]] = True
    return train_inds, test_inds
df[train_inds] and df[test_inds] give you the training and testing sets of your original DataFrame df.
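A quick usage sketch (assuming df has a class column named 'label'; the column name is illustrative):
train_inds, test_inds = get_train_test_inds(df['label'], train_proportion=0.7)
train = df[train_inds]
test = df[test_inds]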
#8
2
This is what I wrote when I needed to split a DataFrame. I considered using Andy's approach above, but didn't like that I could not control the size of the data sets exactly (i.e., it would sometimes be 79, sometimes 81, etc.).
def make_sets(data_df, test_portion):
    import random as rnd
    tot_ix = range(len(data_df))
    # sorted() (not the undefined sort()) keeps the sampled test indices in order
    test_ix = sorted(rnd.sample(tot_ix, int(test_portion * len(data_df))))
    train_ix = list(set(tot_ix) ^ set(test_ix))
    test_df = data_df.iloc[test_ix]   # .ix is removed; .iloc selects by position
    train_df = data_df.iloc[train_ix]
    return train_df, test_df
train_df, test_df = make_sets(data_df, 0.2)
test_df.head()
#9
1
Just select a range of rows from df like this (note that this split is positional, not random):
row_count = df.shape[0]
split_point = int(row_count * 1 / 5)
test_data, train_data = df[:split_point], df[split_point:]
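If you need the split to be random, one sketch is to shuffle the rows first (sample(frac=1) is a common pandas idiom for shuffling):
df = df.sample(frac=1, random_state=42).reset_index(drop=True)  # shuffle rows
row_count = df.shape[0]
split_point = int(row_count * 1 / 5)
test_data, train_data = df[:split_point], df[split_point:]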
#10
0
If your wish is to have one dataframe in and two dataframes out (not numpy arrays), this should do the trick:
def split_data(df, train_perc=0.8):
    df['train'] = np.random.rand(len(df)) < train_perc
    train = df[df.train == 1]
    test = df[df.train == 0]
    return {'train': train, 'test': test}
#11
0
I think you also need to get a copy, not a slice, of the dataframe if you want to add columns later.
msk = np.random.rand(len(df)) < 0.8
train, test = df[msk].copy(deep = True), df[~msk].copy(deep = True)
#12
0
You can use df.values (df.as_matrix() is deprecated) to create a NumPy array and pass it:
Y = df.pop('output')  # pop the label column; pop() needs a column name
X = df.values         # df.as_matrix() was removed in newer pandas; .values is equivalent
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
model.fit(x_train, y_train)  # 'model' stands in for any estimator
model.predict(x_test)
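For example, with a scikit-learn estimator standing in for model (LinearRegression is purely illustrative; any estimator works):
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_train, y_train)
print(model.score(x_test, y_test))  # R^2 on the held-out set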
#13
0
How about this? df is my dataframe:
import math

total_size = len(df)
train_size = math.floor(0.66 * total_size)  # 2/3 of the dataset

# training dataset
train = df.head(train_size)
# test dataset
test = df.tail(len(df) - train_size)
#14
0
If you need to split your data with respect to the labels column in your data set, you can use this:
def split_to_train_test(df, label_column, train_frac=0.8):
    train_df, test_df = pd.DataFrame(), pd.DataFrame()
    labels = df[label_column].unique()
    for lbl in labels:
        lbl_df = df[df[label_column] == lbl]
        lbl_train_df = lbl_df.sample(frac=train_frac)
        lbl_test_df = lbl_df.drop(lbl_train_df.index)
        print('\n%s:\n---------\ntotal:%d\ntrain_df:%d\ntest_df:%d'
              % (lbl, len(lbl_df), len(lbl_train_df), len(lbl_test_df)))
        # DataFrame.append is deprecated; pd.concat does the same job
        train_df = pd.concat([train_df, lbl_train_df])
        test_df = pd.concat([test_df, lbl_test_df])
    return train_df, test_df
and use it:
train, test = split_to_train_test(data, 'class', 0.7)
You can also pass random_state if you want to control the split randomness, or use some global random seed.
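A minimal sketch of the global-seed route (assuming the split_to_train_test defined above; pandas' sample() draws from numpy's global state when no random_state is given):
import numpy as np
np.random.seed(42)  # every later .sample() call without random_state is now reproducible
train, test = split_to_train_test(data, 'class', 0.7)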
#15
0
To split into more than two sets, such as train, test, and validation, one can do:
probs = np.random.rand(len(df))
training_mask = probs < 0.7
test_mask = (probs >= 0.7) & (probs < 0.85)
validation_mask = probs >= 0.85

df_training = df[training_mask]
df_test = df[test_mask]
df_validation = df[validation_mask]
This will put roughly 70% of the data in training, 15% in test, and 15% in validation (the proportions are approximate, since each row is assigned independently).
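A quick sanity check that the three masks partition df:
assert len(df_training) + len(df_test) + len(df_validation) == len(df)
print(len(df_training) / len(df), len(df_test) / len(df), len(df_validation) / len(df))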