My df:
我的df:
{'city1': {0: 'Chicago',
1: 'Chicago',
2: 'Chicago',
3: 'Chicago',
4: 'Miami',
5: 'Houston',
6: 'Austin'},
'city2': {0: 'Toronto',
1: 'Detroit',
2: 'St.Louis',
3: 'Miami',
4: 'Dallas',
5: 'Dallas',
6: 'Dallas'},
'p234_r_c': {0: 5.0, 1: 4.0, 2: 2.0, 3: 0.5, 4: 1.0, 5: 4.0, 6: 3.0},
'plant1_type': {0: 'COMBCYCL',
1: 'COMBCYCL',
2: 'NUKE',
3: 'COAL',
4: 'NUKE',
5: 'COMBCYCL',
6: 'COAL'},
'plant2_type': {0: 'COAL',
1: 'COAL',
2: 'COMBCYCL',
3: 'COMBCYCL',
4: 'COAL',
5: 'NUKE',
6: 'NUKE'}}
I want to do 2 groupby operations and take the largest 1 of each group using column p234_r_c
.
我想要做两组操作,使用列p234_r_c来取每个组中最大的一个。
1st groupby = ['plant1_type', 'plant2_type', 'city1']
first groupby = ['plant1_type', 'plant2_type', 'city1']
2nd groupby = ['plant1_type', 'plant2_type', 'city2']
第二个groupby = ['plant1_type', 'plant2_type', 'city2']
As such I do the following:
我这样做如下:
df.groupby(['plant1_type','plant2_type','city1'])['p234_r_c'].\
nlargest(1).reset_index()
plant1_type plant2_type city1 level_3 p234_r_c
0 COAL COMBCYCL Chicago 3 0.5
1 COAL NUKE Austin 6 3.0
2 COMBCYCL COAL Chicago 0 5.0
3 COMBCYCL NUKE Houston 5 4.0
4 NUKE COAL Miami 4 1.0
5 NUKE COMBCYCL Chicago 2 2.0
The result of the 1st groupby makes sense. However, I am confused by the result of the 2nd groupby:
第一组的结果是有意义的。然而,我对第二组的结果感到困惑:
df.groupby(['plant1_type','plant2_type','city2'])['p234_r_c'].\
nlargest(1).reset_index()
index p234_r_c
0 0 5.0
1 1 4.0
2 2 2.0
3 3 0.5
4 4 1.0
5 5 4.0
6 6 3.0
What happened to columns plant1_type
, plant2_type
and city2
in the result? Shouldnt they appear in the result just like how plant1_type
, plant2_type
and city1
appeared in the result of the 1st groupby?
结果中的列plant1_type、plant2_type和city2发生了什么?它们是否应该像plant1_type、plant2_type和city1那样出现在第一个groupby的结果中?
1 个解决方案
#1
4
我在这里添加了一个问题。
Theory:
理论:
When the results of a
groupby
on apd.Series
returns the samepd.Series
values, then the original index is returned.当一组的结果在pd上。序列返回相同的pd。序列值,然后返回原始索引。
Boiled down example
归结的例子
df = pd.DataFrame(dict(A=[0, 1, 2, 3]))
# returns results identical to df.A
print(df.groupby(df.A // 2).A.nsmallest(2))
# returns results out of order
print(df.groupby(df.A // 2).A.nlargest(2))
0 0
1 1
2 2
3 3
Name: A, dtype: int64
A
0 1 1
0 0
1 3 3
2 2
Name: A, dtype: int64
I'd argue that you want these to return the same consistent index.
我认为你希望它们返回相同的一致的索引。
This is the most egregious consequence of this:
这是最恶劣的后果:
# most egregious
# this will be randomly different
print(df.groupby(df.A // 2).A.apply(pd.Series.sample, n=2))
returns this on one execution
在一个执行中返回这个。
A
0 1 1
0 0
1 2 2
3 3
Name: A, dtype: int64
And this on another
这在另一个
0 0
1 1
2 2
3 3
Name: A, dtype: int64
Of course this never has an issue because it's impossible to return the same values as the original
当然,这永远不会有问题,因为它不可能返回与原始值相同的值。
print(df.groupby(df.A // 2).A.apply(pd.Series.sample, n=1))
A
0 0 0
1 2 2
Name: A, dtype: int64
Work aroundset_index
解决set_index
cols = ['plant1_type','plant2_type','city2']
df.set_index(cols).groupby(level=cols)['p234_r_c'].\
nlargest(1).reset_index()
plant1_type plant2_type city2 p234_r_c
0 COMBCYCL COAL Toronto 5.0
1 COMBCYCL COAL Detroit 4.0
2 NUKE COMBCYCL St.Louis 2.0
3 COAL COMBCYCL Miami 0.5
4 NUKE COAL Dallas 1.0
5 COMBCYCL NUKE Dallas 4.0
6 COAL NUKE Dallas 3.0
#1
4
我在这里添加了一个问题。
Theory:
理论:
When the results of a
groupby
on apd.Series
returns the samepd.Series
values, then the original index is returned.当一组的结果在pd上。序列返回相同的pd。序列值,然后返回原始索引。
Boiled down example
归结的例子
df = pd.DataFrame(dict(A=[0, 1, 2, 3]))
# returns results identical to df.A
print(df.groupby(df.A // 2).A.nsmallest(2))
# returns results out of order
print(df.groupby(df.A // 2).A.nlargest(2))
0 0
1 1
2 2
3 3
Name: A, dtype: int64
A
0 1 1
0 0
1 3 3
2 2
Name: A, dtype: int64
I'd argue that you want these to return the same consistent index.
我认为你希望它们返回相同的一致的索引。
This is the most egregious consequence of this:
这是最恶劣的后果:
# most egregious
# this will be randomly different
print(df.groupby(df.A // 2).A.apply(pd.Series.sample, n=2))
returns this on one execution
在一个执行中返回这个。
A
0 1 1
0 0
1 2 2
3 3
Name: A, dtype: int64
And this on another
这在另一个
0 0
1 1
2 2
3 3
Name: A, dtype: int64
Of course this never has an issue because it's impossible to return the same values as the original
当然,这永远不会有问题,因为它不可能返回与原始值相同的值。
print(df.groupby(df.A // 2).A.apply(pd.Series.sample, n=1))
A
0 0 0
1 2 2
Name: A, dtype: int64
Work aroundset_index
解决set_index
cols = ['plant1_type','plant2_type','city2']
df.set_index(cols).groupby(level=cols)['p234_r_c'].\
nlargest(1).reset_index()
plant1_type plant2_type city2 p234_r_c
0 COMBCYCL COAL Toronto 5.0
1 COMBCYCL COAL Detroit 4.0
2 NUKE COMBCYCL St.Louis 2.0
3 COAL COMBCYCL Miami 0.5
4 NUKE COAL Dallas 1.0
5 COMBCYCL NUKE Dallas 4.0
6 COAL NUKE Dallas 3.0