I am working with the sklearn digits dataset.
Each data point is an 8x8 image of a digit.
[[0,1,2,3, .... 62,63], # This row is one image
[0,1,2,3, .... 62,63], # 0-7 make up the first row of the image
... 1794 more times
[0,1,2,3, .... 62,63]]
I set up my dataframe as follows:
import random
import pandas as pd
from sklearn import datasets
digits = datasets.load_digits()
df = pd.DataFrame(data=digits.data)
df['target'] = digits.target
I am trying to iterate over each image and calculate averages over subsets of rows and columns.
To get the 64 pixel columns of every image I just do the following: df[[i for i in range(64)]]
Or if I want a random subset of 8 pixels I do the following: df[random.sample(range(0, 64), 8)]
Those I can wrap my head around. I am struggling with iterating over subsets of each image. How would I iterate over every row of each image individually?
I can select the first row of the first image like this: df.iloc[:1,0:8]
While this will select the first column of the first image (every eighth pixel): df.iloc[:1,0:64:8]
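For reference, a minimal sketch of how one flattened 64-value row maps back to image rows and columns, assuming the row-major layout described above. The names `flat`, `row_pixels`, and `col_pixels` are illustrative and not part of the original post:

```python
import numpy as np

width = 8                 # the digits images are 8x8
flat = np.arange(64)      # stand-in for one flattened image: pixel i holds the value i

def row_pixels(r, width=8):
    """Flat indices of image row r: one contiguous block of `width` values."""
    return list(range(r * width, (r + 1) * width))

def col_pixels(c, width=8):
    """Flat indices of image column c: every `width`-th value starting at c."""
    return list(range(c, width * width, width))

image = flat.reshape(width, width)   # the same pixels viewed as a 2D grid
first_row = flat[row_pixels(0)]
first_col = flat[col_pixels(0)]
```

The same index lists can be passed to `df.iloc[:, ...]` to pull one image row or column out of every image at once.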
Ideally, I would like to output this structure:
[[image_1_col_1_avg..... col8_avg, row1_avg ..... row8_avg],
[image_2_col_1_avg..... col8_avg, row1_avg ..... row8_avg],
....
[image_1797_col_1_avg..... col8_avg, row1_avg ..... row8_avg]]
Here I shrink the 8x8 grid of pixels 0-63 into the averages of each row and column, so instead of 64 data points per image I would have only 16.
I have searched for a while but I can't find much documentation or guidance on how to iterate through subsets of a dataframe, and I can't really understand what I have found. Any insight, guidance, or explanation of how to iterate over subsets of a dataframe would be much appreciated.
3 Answers
#1
1
1st APPROACH
My approach uses numpy arrays and functions:
Reshape the data to a 3D array:
import numpy as np
import pandas as pd
# one 8x8 matrix per image
data = digits.data.reshape(1797, 8, 8)
Apply this function to each matrix in the 3D array; it returns the column averages followed by the row averages:
def a_function(x):
    row_average = np.apply_along_axis(np.average, 1, x)
    columns_average = np.apply_along_axis(np.average, 0, x)
    return np.append(columns_average, row_average)
Apply that function to every image in the 3D array (there may be a faster way to do it using only numpy):
# list() materializes the map iterator before building the dataframe
mapped = list(map(a_function, data))
And create the final dataframe:
pd.DataFrame(mapped)
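As a rough, self-contained sketch of this first approach, the following runs the same reshape-then-map idea on a small synthetic array instead of the real digits data. The synthetic data and the names `flat`, `images`, `row_col_averages`, and `features` are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for digits.data: 5 flattened 8x8 "images" with values 0..16.
rng = np.random.default_rng(0)
flat = rng.integers(0, 17, size=(5, 64)).astype(float)

# Reshape to (n_images, 8, 8); -1 lets numpy infer the number of images.
images = flat.reshape(-1, 8, 8)

def row_col_averages(img):
    """Return the 8 column averages followed by the 8 row averages of one image."""
    return np.concatenate([img.mean(axis=0), img.mean(axis=1)])

# Iterating over a 3D array yields one 8x8 matrix per step.
features = pd.DataFrame([row_col_averages(img) for img in images])
```

Using `-1` in the reshape avoids hard-coding 1797, so the same code works for any number of images.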
2nd APPROACH
This is better than the first: you only need numpy and its apply_along_axis function on your data:
import numpy as np
from sklearn import datasets
digits = datasets.load_digits()
data = digits.data
def a_function(x):
    # x is one flattened image of 64 values
    x = x.reshape(8, 8)
    row_average = np.apply_along_axis(np.average, 1, x)
    columns_average = np.apply_along_axis(np.average, 0, x)
    return np.append(columns_average, row_average)
The function above is applied to each row of your dataset like this:
final_data = np.apply_along_axis(a_function, 1, data)
final_data is a 1797 x 16 array that you can use in any classifier. This is what you need; it is not necessary to use a dataframe. The array looks like this:
array([[ 0. , 2.25 , 10.5 , ..., 4.375, 5.375, 3.625],
[ 0. , 0.875, 2.625, ..., 4.875, 4.875, 4.625],
[ 0. , 1.625, 6.125, ..., 5.75 , 8. , 4.875],
...,
[ 0. , 0. , 10. , ..., 7.625, 7.625, 3.75 ],
[ 0. , 1.125, 7.75 , ..., 2.25 , 4.5 , 5.625],
[ 0. , 1.875, 12.25 , ..., 6.5 , 8.25 , 6. ]])
PS: Using numpy functions to compute the average is better than using built-in Python functions, because numpy's routines are implemented in C. You get better speed when you use numpy functions on numpy arrays instead of mixing Python built-in functions with numpy arrays.
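A small sketch of that comparison, assuming an array about the size of the whole digits dataset (1797 x 64 values). The function names and the array contents are illustrative, and the timings depend on the machine, so they are printed rather than stated here:

```python
import timeit
import numpy as np

# Stand-in data with as many values as the digits dataset (1797 * 64).
values = np.arange(1797 * 64, dtype=float)

def python_average(x):
    # The built-in sum walks the array element by element in the interpreter.
    return sum(x) / len(x)

def numpy_average(x):
    # np.mean performs the reduction in compiled code.
    return np.mean(x)

builtin_result = python_average(values)
numpy_result = numpy_average(values)

# Time a few repetitions of each version.
t_builtin = timeit.timeit(lambda: python_average(values), number=5)
t_numpy = timeit.timeit(lambda: numpy_average(values), number=5)
print(f"built-in: {t_builtin:.4f}s  numpy: {t_numpy:.4f}s")
```

Both functions return the same value; only the time taken differs.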
#2
2
You can use numpy: reshape to a 3D array, take the means along axes 1 and 2, join both arrays with numpy.hstack, and call the DataFrame constructor:
import numpy as np
import pandas as pd
from sklearn import datasets
digits = datasets.load_digits()
df = pd.DataFrame(data=digits.data)
col_ind = ['col_av_{}'.format(i) for i in range(1, 9)]
row_ind = ['row_av_{}'.format(i) for i in range(1, 9)]
a = df.values
# reshape to (n_images, 8, 8)
b = a.reshape((a.shape[0], -1, 8))
# axis=1 averages over image rows (column means), axis=2 over image columns (row means)
c = np.hstack((b.mean(axis=1), b.mean(axis=2)))
df = pd.DataFrame(c, columns=col_ind + row_ind)
print(df.head())
col_av_1 col_av_2 col_av_3 col_av_4 col_av_5 col_av_6 col_av_7 \
0 0.0 2.250 10.500 6.000 5.000 8.500 4.500
1 0.0 0.875 2.625 14.125 15.625 5.875 0.000
2 0.0 1.625 6.125 10.875 12.500 10.125 1.750
3 0.0 1.250 4.750 8.375 10.375 6.375 2.250
4 0.0 1.125 4.875 8.375 8.625 7.125 2.125
col_av_8 row_av_1 row_av_2 row_av_3 row_av_4 row_av_5 row_av_6 \
0 0.0 3.500 7.250 4.875 4.000 3.750 4.375
1 0.0 3.750 4.500 5.000 7.000 4.500 4.875
2 0.0 3.875 6.000 5.625 4.125 4.750 5.750
3 0.0 4.500 5.750 3.625 3.625 3.250 2.375
4 0.0 1.500 1.875 3.000 4.875 6.625 8.125
row_av_7 row_av_8
0 5.375 3.625
1 4.875 4.625
2 8.000 4.875
3 5.000 5.250
4 3.500 2.750
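To see why `axis=1` yields the column averages and `axis=2` the row averages, here is a small worked sketch on two tiny 2x3 "images". The data is made up for illustration and is not taken from the digits set:

```python
import numpy as np

# Two 2x3 images stacked into an array of shape (2, 2, 3):
# axis 0 = image, axis 1 = row within the image, axis 2 = column within the image.
stack = np.array([
    [[1.0, 2.0, 3.0],
     [5.0, 6.0, 7.0]],
    [[0.0, 0.0, 0.0],
     [2.0, 4.0, 6.0]],
])

col_means = stack.mean(axis=1)   # collapses the row axis: one mean per column, shape (2, 3)
row_means = stack.mean(axis=2)   # collapses the column axis: one mean per row, shape (2, 2)

# One feature row per image: column means first, then row means.
features = np.hstack((col_means, row_means))
print(features)
```

Averaging along an axis removes that axis, so averaging over the row axis leaves one value per column, and vice versa.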
#3
1
In pandas you very rarely need to use loops. You can usually reduce the problem to a function applied to every row, i.e. each image. The following line does just that: it iterates through the rows of the dataframe df and applies the function func to each flattened image.
# select the image part of df and apply the function (func is defined below)
df_res = df[list(range(64))].apply(func, axis=1)
Now the problem becomes smaller: given a 1D image, return the required averages.
def func(img):
    # the input img is a Series of length 64
    # convert to a numpy array and reshape the image
    img = img.values.reshape(8, 8)
    # create the list of col_avg, row_avg labels to use in the result
    col_ind = ['col_av_{}'.format(i) for i in range(1, 9)]
    row_ind = ['row_av_{}'.format(i) for i in range(1, 9)]
    res = pd.Series(index=col_ind + row_ind, dtype=float)
    # calculate the column averages and assign them to the col_ind labels in res
    res[col_ind] = img.mean(axis=0)
    # calculate the row averages and assign them to the row_ind labels in res
    res[row_ind] = img.mean(axis=1)
    return res
Running the line above after defining the function produces the desired result. A sample of the output is shown below:
In [44]: df_r = df[range(64)].apply(func,axis=1)
In [45]: df_r.head()
Out[45]:
col_av_1 col_av_2 col_av_3 col_av_4 col_av_5 col_av_6 col_av_7 \
0 0.0 2.250 10.500 6.000 5.000 8.500 4.500
1 0.0 0.875 2.625 14.125 15.625 5.875 0.000
2 0.0 1.625 6.125 10.875 12.500 10.125 1.750
3 0.0 1.250 4.750 8.375 10.375 6.375 2.250
4 0.0 1.125 4.875 8.375 8.625 7.125 2.125
col_av_8 row_av_1 row_av_2 row_av_3 row_av_4 row_av_5 row_av_6 \
0 0.0 3.500 7.250 4.875 4.000 3.750 4.375
1 0.0 3.750 4.500 5.000 7.000 4.500 4.875
2 0.0 3.875 6.000 5.625 4.125 4.750 5.750
3 0.0 4.500 5.750 3.625 3.625 3.250 2.375
4 0.0 1.500 1.875 3.000 4.875 6.625 8.125
row_av_7 row_av_8
0 5.375 3.625
1 4.875 4.625
2 8.000 4.875
3 5.000 5.250
4 3.500 2.750
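A compact sketch of the same apply-based idea on a small synthetic dataframe, so it runs without the digits data. The frame `pixels`, the function name `averages`, and the values are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Three synthetic flattened 8x8 images; the integer column labels 0..63
# mirror the dataframe built in the question.
pixels = pd.DataFrame(np.arange(3 * 64, dtype=float).reshape(3, 64))

col_ind = ['col_av_{}'.format(i) for i in range(1, 9)]
row_ind = ['row_av_{}'.format(i) for i in range(1, 9)]

def averages(img):
    """Map one 64-value Series to a 16-value Series of column and row means."""
    grid = img.to_numpy().reshape(8, 8)
    values = np.concatenate([grid.mean(axis=0), grid.mean(axis=1)])
    return pd.Series(values, index=col_ind + row_ind)

# When the applied function returns a Series, apply(axis=1) builds a DataFrame
# whose columns are the index of that Series.
result = pixels.apply(averages, axis=1)
```

Building the result Series in one step avoids allocating an empty Series and filling it label by label.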
Edit: Alternatively, use pandas groupby on the column labels: label % 8 groups the pixels by image column, and label // 8 (integer division) groups them by image row.
# create column and row index names
col_ind = ['col_av_{}'.format(i) for i in range(1, 9)]
row_ind = ['row_av_{}'.format(i) for i in range(1, 9)]
pixels = df[list(range(64))]
# group on the transposed frame so the pixel labels become the index
# (groupby(..., axis=1) is deprecated in recent pandas versions)
col_means = pixels.T.groupby(lambda x: x % 8).mean().T
row_means = pixels.T.groupby(lambda x: x // 8).mean().T
col_means.columns = col_ind
row_means.columns = row_ind
df_re = pd.concat([col_means, row_means], axis=1)
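To make the grouping keys concrete, here is a small sketch using 4x4 images (16 pixels) instead of 8x8. The width of 4, the frame `pixels`, and the other names are assumptions chosen to keep the example short:

```python
import numpy as np
import pandas as pd

width = 4   # a 4x4 image keeps the example small; the digits images use 8
pixels = pd.DataFrame(np.arange(2 * width * width, dtype=float).reshape(2, -1))

# The key each pixel label maps to under the two groupings.
col_keys = [label % width for label in pixels.columns]    # 0, 1, 2, 3, 0, 1, 2, 3, ...
row_keys = [label // width for label in pixels.columns]   # 0, 0, 0, 0, 1, 1, 1, 1, ...

# Group on the transposed frame so the pixel labels form the index,
# which avoids passing axis=1 to groupby.
col_means = pixels.T.groupby(lambda label: label % width).mean().T
row_means = pixels.T.groupby(lambda label: label // width).mean().T
```

Pixels that share a remainder sit in the same image column, and pixels that share a quotient sit in the same image row, which is why the two lambdas produce column and row averages respectively.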