Multiple sets of duplicate records from a pandas DataFrame

Date: 2021-12-12 22:55:23

How to get all the existing duplicated sets of records (based on a column) from a DataFrame?

I have a DataFrame as follows:

flight_id | from_location | to_location | schedule
1         | Vancouver     | Toronto     | 3-Jan
2         | Amsterdam     | Tokyo       | 15-Feb
4         | Fairbanks     | Glasgow     | 12-Jan
9         | Halmstad      | Athens      | 21-Jan
3         | Brisbane      | Lisbon      | 4-Feb
4         | Johannesburg  | Venice      | 12-Jan
9         | LosAngeles    | Perth       | 3-Mar

Here flight_id is the column on which I need to check for duplicates, and there are 2 sets of duplicates.

The output for this specific example should look like [(2, 5), (3, 6)]: a list of tuples of record index values.
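
For reference, the sample DataFrame above can be built with something like the following sketch (values taken from the table; the default RangeIndex 0-6 is what the expected index values 2, 5, 3 and 6 refer to):

import pandas as pd

df = pd.DataFrame({
    'flight_id':     [1, 2, 4, 9, 3, 4, 9],
    'from_location': ['Vancouver', 'Amsterdam', 'Fairbanks', 'Halmstad',
                      'Brisbane', 'Johannesburg', 'LosAngeles'],
    'to_location':   ['Toronto', 'Tokyo', 'Glasgow', 'Athens',
                      'Lisbon', 'Venice', 'Perth'],
    'schedule':      ['3-Jan', '15-Feb', '12-Jan', '21-Jan',
                      '4-Feb', '12-Jan', '3-Mar'],
})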

3 solutions

#1



Is this what you need? duplicated + groupby:

(df.loc[df['flight_id'].duplicated(keep=False)].reset_index()).groupby('flight_id')['index'].apply(tuple)
Out[510]: 
flight_id
4    (2, 5)
9    (3, 6)
Name: index, dtype: object

Adding tolist at the end:

(df.loc[df['flight_id'].duplicated(keep=False)].reset_index()).groupby('flight_id')['index'].apply(tuple).tolist()
Out[511]: [(2, 5), (3, 6)]
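
To unpack that chain, here is a step-by-step sketch using the same df as above (the intermediate variable names are just for illustration): duplicated(keep=False) marks every row whose flight_id appears more than once, reset_index() moves the original row labels into an 'index' column, and the groupby/apply collects those labels per flight_id.

mask = df['flight_id'].duplicated(keep=False)   # True for rows 2, 3, 5 and 6
dups = df.loc[mask].reset_index()               # original labels now live in the 'index' column
out = dups.groupby('flight_id')['index'].apply(tuple)
out.tolist()                                    # [(2, 5), (3, 6)]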

And another solution, just for fun:

s = df['flight_id'].value_counts()
list(map(lambda x: tuple(df[df['flight_id'] == x].index.tolist()), s[s.gt(1)].index))
Out[519]: [(2, 5), (3, 6)]

#2



Using apply and a lambda:

df.groupby('flight_id').apply(
    lambda d: tuple(d.index) if len(d.index) > 1 else None
).dropna()

flight_id
4    (2, 5)
9    (3, 6)
dtype: object

Or, better, by iterating over the groupby object (each iteration yields the group key and its sub-DataFrame):

{k: tuple(d.index) for k, d in df.groupby('flight_id') if len(d) > 1}

{4: (2, 5), 9: (3, 6)}

Just the tuples:

[tuple(d.index) for k, d in df.groupby('flight_id') if len(d) > 1]

[(2, 5), (3, 6)]

Leaving this here for posterity, but I now highly dislike this approach; it's just too gross. I was messing around with itertools.groupby, and others may find it fun.

from itertools import groupby

key = df.flight_id.get                # maps a row label to its flight_id
s = sorted(df.index, key=key)         # groupby only groups consecutive keys, so sort first
dict(filter(
    lambda t: len(t[1]) > 1,          # keep only groups with more than one row
    ((k, tuple(g)) for k, g in groupby(s, key))
))

{4: (2, 5), 9: (3, 6)}

#3



Performing a groupby on df.index can take you places.

# collect the index labels for each flight_id, then keep only groups with more than one label
v = df.index.to_series().groupby(df.flight_id).apply(pd.Series.tolist)
v[v.str.len().gt(1)]

flight_id
4    [2, 5]
9    [3, 6]
dtype: object

You can also get cute by calling groupby on df.index directly.

# Index.groupby returns a dict mapping each flight_id to the index labels in that group
v = pd.Series(df.index.groupby(df.flight_id))
v[v.str.len().gt(1)].to_dict()

{
    "4": [
        2,
        5
    ],
    "9": [
        3,
        6
    ]
}
