How to get all the existing duplicated sets of records (based on a column) from a dataframe?
I have a dataframe as follows:
flight_id | from_location | to_location | schedule |
1 | Vancouver | Toronto | 3-Jan |
2 | Amsterdam | Tokyo | 15-Feb |
4 | Fairbanks | Glasgow | 12-Jan |
9 | Halmstad | Athens | 21-Jan |
3 | Brisbane | Lisbon | 4-Feb |
4 | Johannesburg | Venice | 12-Jan |
9 | LosAngeles | Perth | 3-Mar |
Here flight_id is the column on which I need to check for duplicates. There are 2 sets of duplicates.
The output for this specific example should look like [(2,5),(3,6)], i.e. a list of tuples of record index values.
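To make the answers easier to try, here is a minimal sketch that rebuilds the table from the question. It assumes flight_id is stored as integers and that the dataframe uses the default 0-based index, which is what the expected output (2,5) and (3,6) implies.

```python
import pandas as pd

# Sample data copied from the table in the question; the default
# RangeIndex (0..6) supplies the record index values.
df = pd.DataFrame({
    "flight_id": [1, 2, 4, 9, 3, 4, 9],
    "from_location": ["Vancouver", "Amsterdam", "Fairbanks", "Halmstad",
                      "Brisbane", "Johannesburg", "LosAngeles"],
    "to_location": ["Toronto", "Tokyo", "Glasgow", "Athens",
                    "Lisbon", "Venice", "Perth"],
    "schedule": ["3-Jan", "15-Feb", "12-Jan", "21-Jan",
                 "4-Feb", "12-Jan", "3-Mar"],
})

# Index labels of every row whose flight_id occurs more than once
dupe_labels = df.index[df["flight_id"].duplicated(keep=False)].tolist()
print(dupe_labels)
```

The printed labels are the rows that belong to some duplicate set; the answers differ only in how they split those labels into one tuple per flight_id.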
3 Answers
#1
7
Is this what you need? duplicated + groupby:
(df.loc[df['flight_id'].duplicated(keep=False)].reset_index()).groupby('flight_id')['index'].apply(tuple)
Out[510]:
flight_id
4 (2, 5)
9 (3, 6)
Name: index, dtype: object
Adding tolist at the end:
(df.loc[df['flight_id'].duplicated(keep=False)].reset_index()).groupby('flight_id')['index'].apply(tuple).tolist()
Out[511]: [(2, 5), (3, 6)]
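The one-liner packs three steps together. As a rough sketch, here it is split up, using a cut-down dataframe that holds only the flight_id column from the question (the names mask, dupes and pairs are just for illustration):

```python
import pandas as pd

df = pd.DataFrame({"flight_id": [1, 2, 4, 9, 3, 4, 9]})

# Step 1: keep=False flags every member of a duplicate set,
# not only the second and later occurrences.
mask = df["flight_id"].duplicated(keep=False)

# Step 2: reset_index() moves the original row labels into a
# regular column, named "index" because the index has no name.
dupes = df.loc[mask].reset_index()

# Step 3: gather the original labels of each flight_id into a tuple.
pairs = dupes.groupby("flight_id")["index"].apply(tuple).tolist()
print(pairs)
```

Note that if the dataframe's index has a name, reset_index() uses that name for the new column instead of "index", so the column selection in step 3 would need adjusting.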
And another solution, just for fun:
s=df['flight_id'].value_counts()
list(map(lambda x : tuple(df[df['flight_id']==x].index.tolist()), s[s.gt(1)].index))
Out[519]: [(2, 5), (3, 6)]
#2
8
Using apply and a lambda:
df.groupby('flight_id').apply(
    lambda d: tuple(d.index) if len(d.index) > 1 else None
).dropna()
flight_id
4 (2, 5)
9 (3, 6)
dtype: object
Or, better, iterate through the groupby object:
{k: tuple(d.index) for k, d in df.groupby('flight_id') if len(d) > 1}
{4: (2, 5), 9: (3, 6)}
Just the tuples:
[tuple(d.index) for k, d in df.groupby('flight_id') if len(d) > 1]
[(2, 5), (3, 6)]
Leaving this for posterity, but I now highly dislike this approach. It's just too gross. I was messing around with itertools.groupby; others may find this fun.
from itertools import groupby
key = df.flight_id.get
s = sorted(df.index, key=key)
dict(filter(
    lambda t: len(t[1]) > 1,
    ((k, tuple(g)) for k, g in groupby(s, key))
))
{4: (2, 5), 9: (3, 6)}
#3
6
Performing a groupby on df.index can take you places.
v = df.index.to_series().groupby(df.flight_id).apply(pd.Series.tolist)
v[v.str.len().gt(1)]
flight_id
4 [2, 5]
9 [3, 6]
dtype: object
You can also get cute with just groupby on df.index directly.
v = pd.Series(df.index.groupby(df.flight_id))
v[v.str.len().gt(1)].to_dict()
{
"4": [
2,
5
],
"9": [
3,
6
]
}
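If you want the list-of-tuples shape the question asks for rather than a dict, something like the sketch below should work. It relies on Index.groupby returning a dict-like mapping from each flight_id to an Index of row labels; the cut-down dataframe and the name result are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"flight_id": [1, 2, 4, 9, 3, 4, 9]})

# Mapping of {flight_id: Index of the row labels that carry it}
groups = df.index.groupby(df["flight_id"])

# Keep only ids that occur more than once, ordered by flight_id;
# int() turns the NumPy integers into plain Python ints.
result = [
    tuple(int(label) for label in labels)
    for _, labels in sorted(groups.items())
    if len(labels) > 1
]
print(result)
```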