This is my dataframe:
import pandas as pd

df = pd.DataFrame({'name': ['name1', 'name2', 'name1', 'name3'],
                   'rate': [1, 2, 2, 3],
                   'id': range(4)})
   id   name  rate
0   0  name1     1
1   1  name2     2
2   2  name1     2
3   3  name3     3
I want to group the rows of a pandas dataframe if they have the same value in column name OR in column rate. The expected result:
          id            name       rate
0  [0, 1, 2]  [name1, name2]  [1, 2, 2]
1        [3]           name3        [3]
I have a huge dataframe, so I don't want to iterate over each row (unless that's the only solution). What should I do?
(I can use NumPy arrays instead of a pandas dataframe.)
1 solution
Your condition is transitive, so a group can keep growing through chains of matches. Say rows 2i and 2i + 1 share a name, and rows 2i + 1 and 2i + 2 share a rate: all three rows then belong to the same group, and you need to keep linking rows this way.
One way to solve this is with the connected components algorithm from graph theory.
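To make concrete what connected components computes on the example data, here is a minimal library-free sketch. It uses union-find, which is a swapped-in technique for illustration and not what this answer relies on; the function name connected_groups is hypothetical. It still loops over rows in Python, but only once per column.

```python
def connected_groups(names, rates):
    """Label rows so that rows sharing a name or a rate, directly or
    through a chain of other rows, receive the same label."""
    parent = list(range(len(names)))

    def find(i):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smaller index as the root.
            parent[max(ri, rj)] = min(ri, rj)

    # Link every row to the first row seen with the same value,
    # once for the names and once for the rates.
    for values in (names, rates):
        first_seen = {}
        for i, v in enumerate(values):
            union(i, first_seen.setdefault(v, i))

    # Renumber the roots as 0, 1, 2, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(names))]


groups = connected_groups(['name1', 'name2', 'name1', 'name3'], [1, 2, 2, 3])
print(groups)  # [0, 0, 0, 1]
```

Rows 0 and 2 are linked by name, rows 1 and 2 by rate, so all three end up in group 0 while row 3 stays alone.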
For this you can use networkx. In code, it could be as follows:
import itertools
import networkx as nx

# One node per row; an edge links two rows that share a name or a rate.
G = nx.Graph()
G.add_nodes_from(df['id'])
G.add_edges_from(
    (r1['id'], r2['id'])
    for (_, r1), (_, r2) in itertools.combinations(df.iterrows(), 2)
    if r1['rate'] == r2['rate'] or r1['name'] == r2['name']
)
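Comparing every pair of rows is quadratic in the number of rows, which may be too slow for a huge dataframe. Because connectivity is transitive, it should be enough to link each row to one representative row per shared value. The sketch below is an assumption-laden variant, not part of the original answer: it uses groupby(...).transform('first') to pick the representative and rebuilds the example frame so that it runs on its own.

```python
import networkx as nx
import pandas as pd

df = pd.DataFrame({'name': ['name1', 'name2', 'name1', 'name3'],
                   'rate': [1, 2, 2, 3],
                   'id': range(4)})

G = nx.Graph()
G.add_nodes_from(df['id'])
for col in ('name', 'rate'):
    # For each row, the id of the first row holding the same value in this column.
    first_id = df.groupby(col)['id'].transform('first')
    # Skip rows that are their own representative to avoid self-loops.
    linked = df['id'] != first_id
    G.add_edges_from(zip(df.loc[linked, 'id'], first_id[linked]))

components = [sorted(c) for c in nx.connected_components(G)]
print(components)
```

This adds at most one edge per row and per column, so the graph stays small even when many rows share the same value.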
Let's create a group column indicating, for each row, its group:
# Map each node id to the index of its connected component.
group_of = {node: i for i, comp in enumerate(nx.connected_components(G)) for node in comp}
df['group'] = df['id'].map(group_of)
>>> df.group
0    0
1    0
2    0
3    1
Now you just need to groupby the group column and apply list.
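That last step could look like the sketch below. It assumes the group labels shown above (0, 0, 0, 1) and hard-codes them so that the snippet runs on its own. Deduplicating the names with unique() is my reading of the expected output in the question, not something the answer states.

```python
import pandas as pd

# The example frame with the group labels from the previous step.
df = pd.DataFrame({'name': ['name1', 'name2', 'name1', 'name3'],
                   'rate': [1, 2, 2, 3],
                   'id': range(4),
                   'group': [0, 0, 0, 1]})

result = df.groupby('group').agg({
    'id': list,                             # all row ids in the group
    'name': lambda s: list(s.unique()),     # distinct names, in order of appearance
    'rate': list,                           # all rates, duplicates kept
})
print(result)
```

If duplicates in the name column should be kept, plain list works there as well.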