Python: averages over subsets of a DataFrame

Time: 2021-05-19 15:49:12

I am working with the sklearn digits dataset.

Each data point is an 8x8 image of a digit.

[[0,1,2,3, .... 62,63], # This row is one image
 [0,1,2,3, .... 62,63], # pixels 0-7 make up the first row of the image
 ... 1794 more times
[0,1,2,3, .... 62,63]]

I set up my dataframe as follows:

import pandas as pd
from sklearn import datasets
digits = datasets.load_digits()
df = pd.DataFrame(data = digits.data)
df['target'] = digits.target

I am trying to iterate over each image and calculate averages over subsets of rows and columns.

To iterate over each image I just do the following: df[[i for i in range(64)]]

Or, if I want a random subset of 8 pixels, I do the following: df[random.sample(range(0, 64), 8)]

Those I can wrap my head around. I am struggling with trying to iterate over subsets of each image. How would I iterate over every row of each image individually?

I can select the first row of the first image like this: df.iloc[:1,0:8]

While this will select the first column of the first image: df.iloc[:8,:1]
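For orientation, here is a minimal sketch of how pixel positions map to DataFrame columns. It assumes the row-by-row flattening that digits.data uses, so pixel k belongs to image row k // 8 and image column k % 8. Under that layout, df.iloc[:8, :1] returns pixel 0 of the first eight images rather than one image's first column. The data below is a synthetic stand-in (an assumption, so the snippet runs without the dataset).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for digits.data: pixel k of image n holds 100*n + k.
data = np.arange(64)[None, :] + 100 * np.arange(3)[:, None]
df = pd.DataFrame(data)

# One DataFrame row is one flattened image.
first_row_of_first_image = df.iloc[0, 0:8]     # pixels 0, 1, ..., 7
first_col_of_first_image = df.iloc[0, 0:64:8]  # pixels 0, 8, ..., 56

# Same column, obtained by reshaping the flattened image back to 8x8.
image = df.iloc[0].to_numpy().reshape(8, 8)
first_col_via_reshape = image[:, 0]

print(first_row_of_first_image.tolist())
print(first_col_of_first_image.tolist())
```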

Ideally, I would like to output this structure:

[[image_1_col_1_avg..... col8_avg, row1_avg ..... row8_avg],
 [image_2_col_1_avg..... col8_avg, row1_avg ..... row8_avg],
   ....
 [image_1797_col_1_avg..... col8_avg, row1_avg ..... row8_avg]]

Here the 8x8 grid (pixels 0-63) is reduced to the averages of each column and each row, so instead of 64 data points per image I would have only 16.

I have searched for a while but can't find much documentation or guidance on how to iterate through subsets of a dataframe, and I can't really understand what I have found. Any insight, guidance, or explanation of how to iterate over subsets of a dataframe would be much appreciated.

3 Answers

#1


1  

1st APPROACH

My approach uses NumPy arrays and functions:

First, reshape the data into a 3D array:

data = digits.data.reshape(1797, 8, 8) 

Then define a function that takes one 8x8 matrix from the 3D array and returns its column averages followed by its row averages:

def a_function(x):
    row_average = np.apply_along_axis(np.average, 1, x)
    columns_average = np.apply_along_axis(np.average, 0, x)
    return np.append(columns_average, row_average)
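To see what the two np.apply_along_axis calls compute, here is a small worked example on a 2x3 matrix (the matrix is made up for illustration). Axis 1 gives one average per row, axis 0 gives one average per column, and np.append puts the column averages first.

```python
import numpy as np

# Tiny illustrative matrix: 2 rows, 3 columns.
x = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# Axis 1: np.average is called on each row.
row_average = np.apply_along_axis(np.average, 1, x)      # values 2.0 and 5.0
# Axis 0: np.average is called on each column.
columns_average = np.apply_along_axis(np.average, 0, x)  # values 2.5, 3.5, 4.5
# Column averages first, then row averages, as in a_function.
combined = np.append(columns_average, row_average)

print(row_average, columns_average, combined)
```

For an 8x8 image the same calls return 8 + 8 = 16 values.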

Apply that function to every matrix in the 3D array (there may be a faster way to do this using only NumPy):

# iterating over the 3D array yields one 8x8 matrix at a time; list() materializes the map on Python 3
maped = list(map(a_function, data))

Finally, create the DataFrame:

pd.DataFrame(maped)

2nd APPROACH

This is better than the first approach: you need only NumPy, applying np.apply_along_axis directly to your data:

import numpy as np
from sklearn import datasets
digits = datasets.load_digits()
data = digits.data
def a_function(x):
    # x is one flattened image (64 values); restore its 8x8 shape
    x = x.reshape(8, 8)
    row_average = np.apply_along_axis(np.average, 1, x)
    columns_average = np.apply_along_axis(np.average, 0, x)
    return np.append(columns_average, row_average)

The function above is applied to each row of your dataset like this:

final_data = np.apply_along_axis(a_function, 1, data)

final_data is a 1797 x 16 array that you can pass to any classifier. This is all you need; a DataFrame is not necessary. The array looks like this:

array([[  0.   ,   2.25 ,  10.5  , ...,   4.375,   5.375,   3.625],
       [  0.   ,   0.875,   2.625, ...,   4.875,   4.875,   4.625],
       [  0.   ,   1.625,   6.125, ...,   5.75 ,   8.   ,   4.875],
       ..., 
       [  0.   ,   0.   ,  10.   , ...,   7.625,   7.625,   3.75 ],
       [  0.   ,   1.125,   7.75 , ...,   2.25 ,   4.5  ,   5.625],
       [  0.   ,   1.875,  12.25 , ...,   6.5  ,   8.25 ,   6.   ]])
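As a sanity check, the sketch below runs the same a_function on a small synthetic array (random integers standing in for digits.data, which is an assumption to keep it self-contained) and compares the result with a loop-free version that reshapes once and averages along each axis.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for digits.data: 10 flattened 8x8 "images".
data = rng.integers(0, 17, size=(10, 64)).astype(float)

def a_function(x):
    x = x.reshape(8, 8)
    row_average = np.apply_along_axis(np.average, 1, x)
    columns_average = np.apply_along_axis(np.average, 0, x)
    return np.append(columns_average, row_average)

# One call of a_function per flattened image.
final_data = np.apply_along_axis(a_function, 1, data)

# Loop-free equivalent: reshape to (n, 8, 8), then average over rows and over columns.
images = data.reshape(-1, 8, 8)
vectorized = np.hstack((images.mean(axis=1), images.mean(axis=2)))

print(final_data.shape)
print(np.allclose(final_data, vectorized))
```

Both arrays have one row per image and 16 columns, with the column averages first.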

PS: Using NumPy functions to compute averages is better than using Python built-in functions, because NumPy's routines are implemented in C. Code runs faster when you apply NumPy functions to NumPy arrays instead of mixing Python built-in functions with NumPy arrays.
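A rough way to check this on your own machine is to time both variants with timeit. The array size and repeat count below are arbitrary choices, and the timings depend on hardware and library versions, so no particular numbers are claimed here.

```python
import timeit
import numpy as np

values = np.arange(100_000, dtype=float)

def builtin_mean(arr):
    # Python's sum() walks the array element by element in the interpreter.
    return sum(arr) / len(arr)

def numpy_mean(arr):
    # ndarray.mean() runs the reduction in compiled code.
    return arr.mean()

t_builtin = timeit.timeit(lambda: builtin_mean(values), number=20)
t_numpy = timeit.timeit(lambda: numpy_mean(values), number=20)

print(f"built-in sum/len: {t_builtin:.4f} s")
print(f"ndarray.mean:     {t_numpy:.4f} s")
```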

#2


2  

You can use NumPy: reshape to a 3D array, take the means along axes 1 and 2, then join both arrays with numpy.hstack and call the DataFrame constructor:

import numpy as np
import pandas as pd
from sklearn import datasets
digits = datasets.load_digits()
df = pd.DataFrame(data = digits.data)

col_ind = ['col_av_{}'.format(i) for i in range(1, 9)]
row_ind = ['row_av_{}'.format(i) for i in range(1, 9)]

a = df.values
b = a.reshape((a.shape[0], -1, 8))
c = np.hstack((b.mean(axis=1), b.mean(axis=2)))

df = pd.DataFrame(c, columns = col_ind + row_ind)
print (df.head())
   col_av_1  col_av_2  col_av_3  col_av_4  col_av_5  col_av_6  col_av_7  \
0       0.0     2.250    10.500     6.000     5.000     8.500     4.500   
1       0.0     0.875     2.625    14.125    15.625     5.875     0.000   
2       0.0     1.625     6.125    10.875    12.500    10.125     1.750   
3       0.0     1.250     4.750     8.375    10.375     6.375     2.250   
4       0.0     1.125     4.875     8.375     8.625     7.125     2.125   

   col_av_8  row_av_1  row_av_2  row_av_3  row_av_4  row_av_5  row_av_6  \
0       0.0     3.500     7.250     4.875     4.000     3.750     4.375   
1       0.0     3.750     4.500     5.000     7.000     4.500     4.875   
2       0.0     3.875     6.000     5.625     4.125     4.750     5.750   
3       0.0     4.500     5.750     3.625     3.625     3.250     2.375   
4       0.0     1.500     1.875     3.000     4.875     6.625     8.125   

   row_av_7  row_av_8  
0     5.375     3.625  
1     4.875     4.625  
2     8.000     4.875  
3     5.000     5.250  
4     3.500     2.750  
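To see why axis=1 produces the column averages, it may help to run the same reshape on an image whose pixel values equal their positions. After the reshape, axis 0 indexes the image, axis 1 the image row and axis 2 the image column, so averaging over axis 1 collapses the rows and leaves one value per column. The two-image array below is made up for illustration.

```python
import numpy as np

# Synthetic data: image 0 holds pixel values 0..63, image 1 the same plus 100.
image = np.arange(64, dtype=float)
a = np.vstack([image, image + 100])

b = a.reshape((a.shape[0], -1, 8))  # shape (2, 8, 8): image, row, column
col_means = b.mean(axis=1)          # average over rows -> one value per column
row_means = b.mean(axis=2)          # average over columns -> one value per row
c = np.hstack((col_means, row_means))

print(b.shape, c.shape)
print(col_means[0])  # column j averages j, j+8, ..., j+56, i.e. j + 28
print(row_means[0])  # row i averages 8i, ..., 8i+7, i.e. 8i + 3.5
```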

#3


1  

In pandas you very rarely need explicit loops. You can usually reduce the problem to a function applied to every row, i.e. to each image. The following line does just that: it iterates through the rows of the DataFrame df and applies the function func to each one, which reshapes the image and computes the averages.

#select the image part of df and apply function    
df_res = df[range(64)].apply(func,axis=1)

Now the problem is smaller: given a 1D image, return the required averages.

def func(img):
    # the input img is a series with length 64
    # convert to numpy array and reshape the image
    img = img.values.reshape(8, 8)
    # create the list of col_avg, row_avg to use in the result
    col_ind = ['col_av_{}'.format(i) for i in range(1, 9)]
    row_ind = ['row_av_{}'.format(i) for i in range(1, 9)]

    # dtype=float avoids the object-dtype default for an empty Series in newer pandas
    res = pd.Series(index=col_ind + row_ind, dtype=float)
    # calculate the col average and assign it to the col_index in res
    res[col_ind] = img.mean(axis=0)
    # calculate the row average and assign it to the row_index in res
    res[row_ind] = img.mean(axis=1)
    return res

Running the apply line above after defining the function produces the desired result. A sample of the output is shown below:

In [44]: df_r = df[range(64)].apply(func,axis=1)

In [45]: df_r.head()
Out[45]: 
   col_av_1  col_av_2  col_av_3  col_av_4  col_av_5  col_av_6  col_av_7  \
0       0.0     2.250    10.500     6.000     5.000     8.500     4.500   
1       0.0     0.875     2.625    14.125    15.625     5.875     0.000   
2       0.0     1.625     6.125    10.875    12.500    10.125     1.750   
3       0.0     1.250     4.750     8.375    10.375     6.375     2.250   
4       0.0     1.125     4.875     8.375     8.625     7.125     2.125   

   col_av_8  row_av_1  row_av_2  row_av_3  row_av_4  row_av_5  row_av_6  \
0       0.0     3.500     7.250     4.875     4.000     3.750     4.375   
1       0.0     3.750     4.500     5.000     7.000     4.500     4.875   
2       0.0     3.875     6.000     5.625     4.125     4.750     5.750   
3       0.0     4.500     5.750     3.625     3.625     3.250     2.375   
4       0.0     1.500     1.875     3.000     4.875     6.625     8.125   

   row_av_7  row_av_8  
0     5.375     3.625  
1     4.875     4.625  
2     8.000     4.875  
3     5.000     5.250  
4     3.500     2.750  

Edit: Alternatively, use pandas groupby, grouping the pixel labels modulo 8 to collect the image's columns and by integer division by 8 to collect its rows:

# create an empty dataframe
df_re = pd.DataFrame()
# create col and row index names
col_ind = ['col_av_{}'.format(i) for i in range(1, 9)]
row_ind = ['row_av_{}'.format(i) for i in range(1, 9)]
df_re[col_ind] = df[range(64)].groupby(lambda x: x % 8, axis=1).mean()
df_re[row_ind] = df[range(64)].groupby(lambda x: x // 8, axis=1).mean()
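One caveat: to my knowledge, recent pandas releases (2.1 and later) deprecate the axis=1 argument of groupby. A possible workaround, sketched below on synthetic data, is to transpose so the pixel labels become the index, group on that index, and transpose back. The names pixels, col_means and row_means are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the digits DataFrame: pixel k of image n holds 100*n + k.
values = np.arange(64, dtype=float)[None, :] + 100.0 * np.arange(3)[:, None]
df = pd.DataFrame(values)

col_ind = ['col_av_{}'.format(i) for i in range(1, 9)]
row_ind = ['row_av_{}'.format(i) for i in range(1, 9)]

pixels = df[list(range(64))]
# After transposing, the pixel labels 0..63 form the index, so a plain groupby works.
col_means = pixels.T.groupby(lambda k: k % 8).mean().T   # same image column
row_means = pixels.T.groupby(lambda k: k // 8).mean().T  # same image row
col_means.columns = col_ind
row_means.columns = row_ind

df_re = pd.concat([col_means, row_means], axis=1)
print(df_re.shape)
```

The result has the same 16-column layout as above, with one row per image.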
