I have a pandas dataframe in the following format:
import pandas as pd
df = pd.DataFrame([[1.1, 1.1, 1.1, 2.6, 2.5, 3.4,2.6,2.6,3.4,3.4,2.6,1.1,1.1,3.3], list('AAABBBBABCBDDD'), [1.1, 1.7, 2.5, 2.6, 3.3, 3.8,4.0,4.2,4.3,4.5,4.6,4.7,4.7,4.8], ['x/y/z','x/y','x/y/z/n','x/u','x','x/u/v','x/y/z','x','x/u/v/b','-','x/y','x/y/z','x','x/u/v/w'],['1','3','3','2','4','2','5','3','6','3','5','1','1','1']]).T
df.columns = ['col1','col2','col3','col4','col5']
df:
    col1 col2 col3     col4 col5
0    1.1    A  1.1    x/y/z    1
1    1.1    A  1.7      x/y    3
2    1.1    A  2.5  x/y/z/n    3
3    2.6    B  2.6      x/u    2
4    2.5    B  3.3        x    4
5    3.4    B  3.8    x/u/v    2
6    2.6    B    4    x/y/z    5
7    2.6    A  4.2        x    3
8    3.4    B  4.3  x/u/v/b    6
9    3.4    C  4.5        -    3
10   2.6    B  4.6      x/y    5
11   1.1    D  4.7    x/y/z    1
12   1.1    D  4.7        x    1
13   3.3    D  4.8  x/u/v/w    1
Now I want to group this by two columns, as follows:
# groupby objects have no reset_index method in current pandas, so apply it to each group
df.groupby(['col5','col2']).apply(lambda g: g.reset_index())
Output:
             index col1 col2 col3     col4 col5
col5 col2
1    A    0      0  1.1    A  1.1    x/y/z    1
     D    0     11  1.1    D  4.7    x/y/z    1
          1     12  1.1    D  4.7        x    1
          2     13  3.3    D  4.8  x/u/v/w    1
2    B    0      3  2.6    B  2.6      x/u    2
          1      5  3.4    B  3.8    x/u/v    2
3    A    0      1  1.1    A  1.7      x/y    3
          1      2  1.1    A  2.5  x/y/z/n    3
          2      7  2.6    A  4.2        x    3
     C    0      9  3.4    C  4.5        -    3
4    B    0      4  2.5    B  3.3        x    4
5    B    0      6  2.6    B    4    x/y/z    5
          1     10  2.6    B  4.6      x/y    5
6    B    0      8  3.4    B  4.3  x/u/v/b    6
I want the number of rows in each group, like this. Expected output:
col5 col2 count
1    A    1
     D    3
2    B    2
etc...
How can I get this expected output? I also want to find the largest count for each 'col2' value.
6 solutions
#1
36
Following on from @Andy's answer, you can do the following to answer your second question:
In [56]: df.groupby(['col5','col2']).size().reset_index().groupby('col2')[[0]].max()
Out[56]:
0
col2
A 3
B 2
C 1
D 3
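The same two-step idea can be written with a named count column instead of the default column label 0. This is only a sketch: the dataframe below keeps just col2 and col5 from the question, and the names counts and max_per_col2 are illustrative, not from the original answer.

```python
import pandas as pd

# Only the two grouping columns from the question are needed here.
df = pd.DataFrame({
    "col2": list("AAABBBBABCBDDD"),
    "col5": ["1", "3", "3", "2", "4", "2", "5", "3", "6", "3", "5", "1", "1", "1"],
})

# Step 1: one row per (col5, col2) pair, with the group size in a column named "count".
counts = df.groupby(["col5", "col2"]).size().reset_index(name="count")

# Step 2: the largest group size seen for each col2 value.
max_per_col2 = counts.groupby("col2")["count"].max()
print(max_per_col2)
```

Naming the column avoids having to select it with [[0]] afterwards.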
#2
59
You are looking for size:
In [11]: df.groupby(['col5', 'col2']).size()
Out[11]:
col5  col2
1     A       1
      D       3
2     B       2
3     A       3
      C       1
4     B       1
5     B       2
6     B       1
dtype: int64
To get the same answer as waitingkuo (the "second question"), but slightly more cleanly, group by the index level:
In [12]: df.groupby(['col5', 'col2']).size().groupby(level=1).max()
Out[12]:
col2
A 3
B 2
C 1
D 3
dtype: int64
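For readers who want to run this end to end, here is a minimal self-contained sketch. It assumes only the col2 and col5 values from the question, and it refers to the index level by name (level="col2") rather than by position, which is a stylistic choice made here, not part of the original answer.

```python
import pandas as pd

# Only the two grouping columns from the question are needed here.
df = pd.DataFrame({
    "col2": list("AAABBBBABCBDDD"),
    "col5": ["1", "3", "3", "2", "4", "2", "5", "3", "6", "3", "5", "1", "1", "1"],
})

# size() returns a Series indexed by (col5, col2) holding the row count of each group.
sizes = df.groupby(["col5", "col2"]).size()

# Grouping that Series by its "col2" index level gives the largest count per col2 value.
largest = sizes.groupby(level="col2").max()
print(largest)
```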
#3
11
First, create a pandas dataframe and give its columns names.
import pandas as pd
df = pd.DataFrame([['A','C','A','B','C','A','B','B','A','A'], ['ONE','TWO','ONE','ONE','ONE','TWO','ONE','TWO','ONE','THREE']]).T
df.columns = ['Alphabet','Words']  # a flat list; a nested list would create a MultiIndex
print(df)  # print the dataframe
This prints the ten-row dataframe with the columns Alphabet and Words.
One way to group the dataframe and count the rows in each group is to add a helper column that holds the count. Call that column "COUNTER", set it to 1 on every row, and sum it per group:
df['COUNTER'] =1 #initially, set that counter to 1.
group_data = df.groupby(['Alphabet','Words'])['COUNTER'].sum() #sum function
print(group_data)
OUTPUT:
Alphabet  Words
A         ONE      3
          THREE    1
          TWO      1
B         ONE      2
          TWO      1
C         ONE      1
          TWO      1
Name: COUNTER, dtype: int64
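As a cross-check, the helper-column approach should give the same numbers as calling size() directly. The sketch below builds the same data from a dict (a choice made here for brevity) and uses assign so the original dataframe is not modified; via_counter and via_size are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "Alphabet": ["A", "C", "A", "B", "C", "A", "B", "B", "A", "A"],
    "Words": ["ONE", "TWO", "ONE", "ONE", "ONE", "TWO", "ONE", "TWO", "ONE", "THREE"],
})

# Helper-column approach: add a column of ones and sum it per group.
via_counter = df.assign(COUNTER=1).groupby(["Alphabet", "Words"])["COUNTER"].sum()

# Direct approach: size() counts the rows in each group without a helper column.
via_size = df.groupby(["Alphabet", "Words"]).size()

print(via_size)
```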
#4
2
An idiomatic solution that uses only a single groupby:
df.groupby(['col5', 'col2']).size() \
.sort_values(ascending=False) \
.reset_index(name='count') \
.drop_duplicates(subset='col2')
  col5 col2  count
0    3    A      3
1    1    D      3
2    5    B      2
6    3    C      1
Explanation
The result of the groupby size method is a Series with col5 and col2 in the index. From here, you could use another groupby to find the maximum count for each value in col2, but that is not necessary. You can simply sort all the values in descending order and then use the drop_duplicates method to keep only the first row for each col2 value.
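The sort-then-deduplicate idea can be sketched as a runnable snippet. The kind="mergesort" argument is an assumption added here: it requests a stable sort so that ties are handled the same way on every run. Which col5 value survives for a tied col2 value (such as B) still depends on the sort order, so only the counts should be relied on.

```python
import pandas as pd

# Only the two grouping columns from the question are needed here.
df = pd.DataFrame({
    "col2": list("AAABBBBABCBDDD"),
    "col5": ["1", "3", "3", "2", "4", "2", "5", "3", "6", "3", "5", "1", "1", "1"],
})

top = (
    df.groupby(["col5", "col2"]).size()
    .sort_values(ascending=False, kind="mergesort")  # largest groups first, stable sort
    .reset_index(name="count")                       # back to a flat dataframe
    .drop_duplicates(subset="col2")                  # keep the first (largest) row per col2
)
print(top)
```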
#5
0
Should you want to add a new column (say 'count_column') containing each group's count to the dataframe:
df['count_column'] = df.groupby(['col5','col2'])['col5'].transform('count')  # attribute assignment (df.count_column = ...) would not create a column
(I picked 'col5' because it contains no NaN values.)
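If the chosen column might contain NaN, one alternative is transform('size'), which counts rows rather than non-null values. The sketch below is an illustration on the question's two grouping columns, not the answerer's code.

```python
import pandas as pd

# Only the two grouping columns from the question are needed here.
df = pd.DataFrame({
    "col2": list("AAABBBBABCBDDD"),
    "col5": ["1", "3", "3", "2", "4", "2", "5", "3", "6", "3", "5", "1", "1", "1"],
})

# transform broadcasts the per-group result back to every row of the original dataframe.
df["count_column"] = df.groupby(["col5", "col2"])["col5"].transform("size")
print(df)
```

Each row now carries the size of the (col5, col2) group it belongs to, so the dataframe keeps its original 14 rows.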
#6
-2
You can just call the built-in count method after groupby. Note that count returns the number of non-null values in each remaining column, not a single count per group:
df.groupby(['col5','col2']).count()
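The difference between count and size shows up as soon as a column has missing values. The small dataframe below is made up for illustration and is not the question's data.

```python
import pandas as pd

# Hypothetical three-row example with one missing value in col1.
df = pd.DataFrame({
    "col5": ["1", "1", "2"],
    "col2": ["A", "A", "B"],
    "col1": [1.1, None, 2.6],
})

# count(): one column per non-grouping column, counting non-null values only.
per_column = df.groupby(["col5", "col2"]).count()

# size(): a single number per group, counting rows regardless of missing values.
per_group = df.groupby(["col5", "col2"]).size()

print(per_column)
print(per_group)
```

For the group ('1', 'A'), count reports 1 for col1 because one value is missing, while size reports 2 rows.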