I'm generating a number of dataframes with the same shape, and I want to compare them to one another. I want to be able to get the mean and median across the dataframes.
我生成了一些具有相同形状的dataframes,我想将它们彼此进行比较。我希望能得到数据的均值和中位数。
Source.0 Source.1 Source.2 Source.3
cluster
0 0.001182 0.184535 0.814230 0.000054
1 0.000001 0.160490 0.839508 0.000001
2 0.000001 0.173829 0.826114 0.000055
3 0.000432 0.180065 0.819502 0.000001
4 0.000152 0.157041 0.842694 0.000113
5 0.000183 0.174142 0.825674 0.000001
6 0.000001 0.151556 0.848405 0.000038
7 0.000771 0.177583 0.821645 0.000001
8 0.000001 0.202059 0.797939 0.000001
9 0.000025 0.189537 0.810410 0.000028
10 0.006142 0.003041 0.493912 0.496905
11 0.003739 0.002367 0.514216 0.479678
12 0.002334 0.001517 0.529041 0.467108
13 0.003458 0.000001 0.532265 0.464276
14 0.000405 0.005655 0.527576 0.466364
15 0.002557 0.003233 0.507954 0.486256
16 0.004161 0.000001 0.491271 0.504568
17 0.001364 0.001330 0.528311 0.468996
18 0.002886 0.000001 0.506392 0.490721
19 0.001823 0.002498 0.509620 0.486059
Source.0 Source.1 Source.2 Source.3
cluster
0 0.000001 0.197108 0.802495 0.000396
1 0.000001 0.157860 0.842076 0.000063
2 0.094956 0.203057 0.701662 0.000325
3 0.000001 0.181948 0.817841 0.000210
4 0.000003 0.169680 0.830316 0.000001
5 0.000362 0.177194 0.822443 0.000001
6 0.000001 0.146807 0.852924 0.000268
7 0.001087 0.178994 0.819564 0.000354
8 0.000001 0.202182 0.797333 0.000485
9 0.000348 0.181399 0.818252 0.000001
10 0.003050 0.000247 0.506777 0.489926
11 0.004420 0.000001 0.513927 0.481652
12 0.006488 0.001396 0.527197 0.464919
13 0.001510 0.000001 0.525987 0.472502
14 0.000001 0.000001 0.520737 0.479261
15 0.000001 0.001765 0.515658 0.482575
16 0.000001 0.000001 0.492550 0.507448
17 0.002855 0.000199 0.526535 0.470411
18 0.000001 0.001952 0.498303 0.499744
19 0.001232 0.000001 0.506612 0.492155
Then I want to get the mean of these two dataframes.
然后我要得到这两个数据的均值。
What is the easiest way to do this?
最简单的方法是什么?
Just to clarify I want to get the mean for each particular cell when the indexes and columns of all the dataframes are exactly the same.
澄清一下,当所有的dataframes的索引和列完全相同时,我想要得到每个特定单元格的平均值。
So in the example I gave, the average for [0,Source.0]
would be (0.001182 + 0.000001) / 2 = 0.0005915.
在我给出的例子中,(0,源)的平均值。0]为(0.001182 + 0.000001)/ 2 = 0.0005915。
5 个解决方案
#1
21
Assuming the two dataframes have the same columns, you could just concatenate them and compute your summary stats on the concatenated frames:
假设这两个dataframes具有相同的列,那么您只需将它们连接起来,然后在连接的框架上计算您的汇总统计信息:
import numpy as np
import pandas as pd
# some random data frames
df1 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
df2 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
# concatenate them
df_concat = pd.concat((df1, df2))
print df_concat.mean()
# x -0.163044
# y 2.120000
# dtype: float64
print df_concat.median()
# x -0.192037
# y 2.000000
# dtype: float64
Update
If you want to compute stats across each set of rows with the same index in the two datasets, you can use .groupby()
to group the data by row index, then apply the mean, median etc.:
如果您想在两个数据集中使用相同索引的每组行中计算统计数据,可以使用.groupby()按行索引对数据进行分组,然后应用平均值、中位数等:
by_row_index = df_concat.groupby(df_concat.index)
df_means = by_row_index.mean()
print df_means.head()
# x y
# 0 -0.850794 1.5
# 1 0.159038 1.5
# 2 0.083278 1.0
# 3 -0.540336 0.5
# 4 0.390954 3.5
This method will work even when your dataframes have unequal numbers of rows - if a particular row index is missing in one of the two dataframes, the mean/median will be computed on the single existing row.
即使您的dataframes有不相等的行数,这个方法也可以工作——如果两个dataframes中的一个丢失了特定的行索引,那么平均值/中位数将在现有的单个行上计算。
#2
11
I go similar as @ali_m, but since you want one mean per row-column combination, I conclude differently:
我类似于@ali_m,但由于您希望每行-列组合有一个平均值,所以我的结论不同:
df1 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
df2 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
df = pd.concat([df1, df2])
foo = df.groupby(level=1).mean()
foo.head()
x y
0 0.841282 2.5
1 0.716749 1.0
2 -0.551903 2.5
3 1.240736 1.5
4 1.227109 2.0
#3
6
You can simply assign a label to each frame, call it group
and then concat
and groupby
to do what you want:
你可以简单地给每个帧分配一个标签,叫它组,然后concat和groupby做你想做的:
In [57]: df = DataFrame(np.random.randn(10, 4), columns=list('abcd'))
In [58]: df2 = df.copy()
In [59]: dfs = [df, df2]
In [60]: df
Out[60]:
a b c d
0 0.1959 0.1260 0.1464 0.1631
1 0.9344 -1.8154 1.4529 -0.6334
2 0.0390 0.4810 1.1779 -1.1799
3 0.3542 0.3819 -2.0895 0.8877
4 -2.2898 -1.0585 0.8083 -0.2126
5 0.3727 -0.6867 -1.3440 -1.4849
6 -1.1785 0.0885 1.0945 -1.6271
7 -1.7169 0.3760 -1.4078 0.8994
8 0.0508 0.4891 0.0274 -0.6369
9 -0.7019 1.0425 -0.5476 -0.5143
In [61]: for i, d in enumerate(dfs):
....: d['group'] = i
....:
In [62]: dfs[0]
Out[62]:
a b c d group
0 0.1959 0.1260 0.1464 0.1631 0
1 0.9344 -1.8154 1.4529 -0.6334 0
2 0.0390 0.4810 1.1779 -1.1799 0
3 0.3542 0.3819 -2.0895 0.8877 0
4 -2.2898 -1.0585 0.8083 -0.2126 0
5 0.3727 -0.6867 -1.3440 -1.4849 0
6 -1.1785 0.0885 1.0945 -1.6271 0
7 -1.7169 0.3760 -1.4078 0.8994 0
8 0.0508 0.4891 0.0274 -0.6369 0
9 -0.7019 1.0425 -0.5476 -0.5143 0
In [63]: final = pd.concat(dfs, ignore_index=True)
In [64]: final
Out[64]:
a b c d group
0 0.1959 0.1260 0.1464 0.1631 0
1 0.9344 -1.8154 1.4529 -0.6334 0
2 0.0390 0.4810 1.1779 -1.1799 0
3 0.3542 0.3819 -2.0895 0.8877 0
4 -2.2898 -1.0585 0.8083 -0.2126 0
5 0.3727 -0.6867 -1.3440 -1.4849 0
6 -1.1785 0.0885 1.0945 -1.6271 0
.. ... ... ... ... ...
13 0.3542 0.3819 -2.0895 0.8877 1
14 -2.2898 -1.0585 0.8083 -0.2126 1
15 0.3727 -0.6867 -1.3440 -1.4849 1
16 -1.1785 0.0885 1.0945 -1.6271 1
17 -1.7169 0.3760 -1.4078 0.8994 1
18 0.0508 0.4891 0.0274 -0.6369 1
19 -0.7019 1.0425 -0.5476 -0.5143 1
[20 rows x 5 columns]
In [65]: final.groupby('group').mean()
Out[65]:
a b c d
group
0 -0.394 -0.0576 -0.0682 -0.4339
1 -0.394 -0.0576 -0.0682 -0.4339
Here, each group
is the same, but that's only because df == df2
.
这里,每个组都是相同的,但这只是因为df = df2。
Alternatively, you can throw the frames into a Panel
:
或者,你也可以把框架扔进一个面板:
In [69]: df = DataFrame(np.random.randn(10, 4), columns=list('abcd'))
In [70]: df2 = DataFrame(np.random.randn(10, 4), columns=list('abcd'))
In [71]: panel = pd.Panel({0: df, 1: df2})
In [72]: panel
Out[72]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 10 (major_axis) x 4 (minor_axis)
Items axis: 0 to 1
Major_axis axis: 0 to 9
Minor_axis axis: a to d
In [73]: panel.mean()
Out[73]:
0 1
a 0.3839 0.2956
b 0.1855 -0.3164
c -0.1167 -0.0627
d -0.2338 -0.0450
#4
4
As per Niklas' comment, the solution to the question is panel.mean(axis=0)
.
根据Niklas的评论,问题的解决方案是panel.mean(axis=0)。
As a more complete example:
作为一个更完整的例子:
import pandas as pd
import numpy as np
dfs = {}
nrows = 4
ncols = 3
for i in range(4):
dfs[i] = pd.DataFrame(np.arange(i, nrows*ncols+i).reshape(nrows, ncols),
columns=list('abc'))
print('DF{i}:\n{df}\n'.format(i=i, df=dfs[i]))
panel = pd.Panel(dfs)
print('Mean of stacked DFs:\n{df}'.format(df=panel.mean(axis=0)))
Will give the following output:
输出如下:
DF0:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
DF1:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
DF2:
a b c
0 2 3 4
1 5 6 7
2 8 9 10
3 11 12 13
DF3:
a b c
0 3 4 5
1 6 7 8
2 9 10 11
3 12 13 14
Mean of stacked DFs:
a b c
0 1.5 2.5 3.5
1 4.5 5.5 6.5
2 7.5 8.5 9.5
3 10.5 11.5 12.5
#5
2
Here is a solution first unstack both dataframes so they are series with multiindexes(cluster, colnames)... then you can use Series addition and division, which automattically do the operation on the indexes, finally unstack them... here it is in code...
这里有一个解决方案,首先将两个dataframes解叠,这样它们就具有多索引(集群、colname)……然后您可以使用级数的加法和除法,它自动地对索引执行操作,最终将它们解叠……这是密码……
averages = (df1.stack()+df2.stack())/2
averages = averages.unstack()
And your done...
和你做……
Or for more general purposes...
或者为了更一般的目的……
dfs = [df1,df2]
averages = pd.concat([each.stack() for each in dfs],axis=1)\
.apply(lambda x:x.mean(),axis=1)\
.unstack()
#1
21
Assuming the two dataframes have the same columns, you could just concatenate them and compute your summary stats on the concatenated frames:
假设这两个dataframes具有相同的列,那么您只需将它们连接起来,然后在连接的框架上计算您的汇总统计信息:
import numpy as np
import pandas as pd
# some random data frames
df1 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
df2 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
# concatenate them
df_concat = pd.concat((df1, df2))
print df_concat.mean()
# x -0.163044
# y 2.120000
# dtype: float64
print df_concat.median()
# x -0.192037
# y 2.000000
# dtype: float64
Update
If you want to compute stats across each set of rows with the same index in the two datasets, you can use .groupby()
to group the data by row index, then apply the mean, median etc.:
如果您想在两个数据集中使用相同索引的每组行中计算统计数据,可以使用.groupby()按行索引对数据进行分组,然后应用平均值、中位数等:
by_row_index = df_concat.groupby(df_concat.index)
df_means = by_row_index.mean()
print df_means.head()
# x y
# 0 -0.850794 1.5
# 1 0.159038 1.5
# 2 0.083278 1.0
# 3 -0.540336 0.5
# 4 0.390954 3.5
This method will work even when your dataframes have unequal numbers of rows - if a particular row index is missing in one of the two dataframes, the mean/median will be computed on the single existing row.
即使您的dataframes有不相等的行数,这个方法也可以工作——如果两个dataframes中的一个丢失了特定的行索引,那么平均值/中位数将在现有的单个行上计算。
#2
11
I go similar as @ali_m, but since you want one mean per row-column combination, I conclude differently:
我类似于@ali_m,但由于您希望每行-列组合有一个平均值,所以我的结论不同:
df1 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
df2 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
df = pd.concat([df1, df2])
foo = df.groupby(level=1).mean()
foo.head()
x y
0 0.841282 2.5
1 0.716749 1.0
2 -0.551903 2.5
3 1.240736 1.5
4 1.227109 2.0
#3
6
You can simply assign a label to each frame, call it group
and then concat
and groupby
to do what you want:
你可以简单地给每个帧分配一个标签,叫它组,然后concat和groupby做你想做的:
In [57]: df = DataFrame(np.random.randn(10, 4), columns=list('abcd'))
In [58]: df2 = df.copy()
In [59]: dfs = [df, df2]
In [60]: df
Out[60]:
a b c d
0 0.1959 0.1260 0.1464 0.1631
1 0.9344 -1.8154 1.4529 -0.6334
2 0.0390 0.4810 1.1779 -1.1799
3 0.3542 0.3819 -2.0895 0.8877
4 -2.2898 -1.0585 0.8083 -0.2126
5 0.3727 -0.6867 -1.3440 -1.4849
6 -1.1785 0.0885 1.0945 -1.6271
7 -1.7169 0.3760 -1.4078 0.8994
8 0.0508 0.4891 0.0274 -0.6369
9 -0.7019 1.0425 -0.5476 -0.5143
In [61]: for i, d in enumerate(dfs):
....: d['group'] = i
....:
In [62]: dfs[0]
Out[62]:
a b c d group
0 0.1959 0.1260 0.1464 0.1631 0
1 0.9344 -1.8154 1.4529 -0.6334 0
2 0.0390 0.4810 1.1779 -1.1799 0
3 0.3542 0.3819 -2.0895 0.8877 0
4 -2.2898 -1.0585 0.8083 -0.2126 0
5 0.3727 -0.6867 -1.3440 -1.4849 0
6 -1.1785 0.0885 1.0945 -1.6271 0
7 -1.7169 0.3760 -1.4078 0.8994 0
8 0.0508 0.4891 0.0274 -0.6369 0
9 -0.7019 1.0425 -0.5476 -0.5143 0
In [63]: final = pd.concat(dfs, ignore_index=True)
In [64]: final
Out[64]:
a b c d group
0 0.1959 0.1260 0.1464 0.1631 0
1 0.9344 -1.8154 1.4529 -0.6334 0
2 0.0390 0.4810 1.1779 -1.1799 0
3 0.3542 0.3819 -2.0895 0.8877 0
4 -2.2898 -1.0585 0.8083 -0.2126 0
5 0.3727 -0.6867 -1.3440 -1.4849 0
6 -1.1785 0.0885 1.0945 -1.6271 0
.. ... ... ... ... ...
13 0.3542 0.3819 -2.0895 0.8877 1
14 -2.2898 -1.0585 0.8083 -0.2126 1
15 0.3727 -0.6867 -1.3440 -1.4849 1
16 -1.1785 0.0885 1.0945 -1.6271 1
17 -1.7169 0.3760 -1.4078 0.8994 1
18 0.0508 0.4891 0.0274 -0.6369 1
19 -0.7019 1.0425 -0.5476 -0.5143 1
[20 rows x 5 columns]
In [65]: final.groupby('group').mean()
Out[65]:
a b c d
group
0 -0.394 -0.0576 -0.0682 -0.4339
1 -0.394 -0.0576 -0.0682 -0.4339
Here, each group
is the same, but that's only because df == df2
.
这里,每个组都是相同的,但这只是因为df = df2。
Alternatively, you can throw the frames into a Panel
:
或者,你也可以把框架扔进一个面板:
In [69]: df = DataFrame(np.random.randn(10, 4), columns=list('abcd'))
In [70]: df2 = DataFrame(np.random.randn(10, 4), columns=list('abcd'))
In [71]: panel = pd.Panel({0: df, 1: df2})
In [72]: panel
Out[72]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 10 (major_axis) x 4 (minor_axis)
Items axis: 0 to 1
Major_axis axis: 0 to 9
Minor_axis axis: a to d
In [73]: panel.mean()
Out[73]:
0 1
a 0.3839 0.2956
b 0.1855 -0.3164
c -0.1167 -0.0627
d -0.2338 -0.0450
#4
4
As per Niklas' comment, the solution to the question is panel.mean(axis=0)
.
根据Niklas的评论,问题的解决方案是panel.mean(axis=0)。
As a more complete example:
作为一个更完整的例子:
import pandas as pd
import numpy as np
dfs = {}
nrows = 4
ncols = 3
for i in range(4):
dfs[i] = pd.DataFrame(np.arange(i, nrows*ncols+i).reshape(nrows, ncols),
columns=list('abc'))
print('DF{i}:\n{df}\n'.format(i=i, df=dfs[i]))
panel = pd.Panel(dfs)
print('Mean of stacked DFs:\n{df}'.format(df=panel.mean(axis=0)))
Will give the following output:
输出如下:
DF0:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
DF1:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
DF2:
a b c
0 2 3 4
1 5 6 7
2 8 9 10
3 11 12 13
DF3:
a b c
0 3 4 5
1 6 7 8
2 9 10 11
3 12 13 14
Mean of stacked DFs:
a b c
0 1.5 2.5 3.5
1 4.5 5.5 6.5
2 7.5 8.5 9.5
3 10.5 11.5 12.5
#5
2
Here is a solution first unstack both dataframes so they are series with multiindexes(cluster, colnames)... then you can use Series addition and division, which automattically do the operation on the indexes, finally unstack them... here it is in code...
这里有一个解决方案,首先将两个dataframes解叠,这样它们就具有多索引(集群、colname)……然后您可以使用级数的加法和除法,它自动地对索引执行操作,最终将它们解叠……这是密码……
averages = (df1.stack()+df2.stack())/2
averages = averages.unstack()
And your done...
和你做……
Or for more general purposes...
或者为了更一般的目的……
dfs = [df1,df2]
averages = pd.concat([each.stack() for each in dfs],axis=1)\
.apply(lambda x:x.mean(),axis=1)\
.unstack()