I have a DataFrame and want to eliminate duplicate rows that have the same values, but in different columns:
df = pd.DataFrame(columns=['a','b','c','d'], index=['1','2','3'])
df.loc['1'] = pd.Series({'a':'x','b':'y','c':'e','d':'f'})
df.loc['2'] = pd.Series({'a':'e','b':'f','c':'x','d':'y'})
df.loc['3'] = pd.Series({'a':'w','b':'v','c':'s','d':'t'})
df
Out[8]:
   a  b  c  d
1  x  y  e  f
2  e  f  x  y
3  w  v  s  t
Rows [1] and [2] both contain the values {x, y, e, f}, but arranged crosswise: if you exchanged columns c, d with a, b in row [2], you would get a duplicate of row [1]. I want to drop such rows and keep only one of them, to have the final output:
df_new
Out[20]:
   a  b  c  d
1  x  y  e  f
3  w  v  s  t
How can I efficiently achieve that?
3 solutions
#1
4
I think you need to filter by boolean indexing, with a mask created by numpy.sort and duplicated. To invert the mask, use ~:
df = df[~pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated()]
print(df)
   a  b  c  d
1  x  y  e  f
3  w  v  s  t
Detail:
print(np.sort(df, axis=1))
[['e' 'f' 'x' 'y']
 ['e' 'f' 'x' 'y']
 ['s' 't' 'v' 'w']]

print(pd.DataFrame(np.sort(df, axis=1), index=df.index))
   0  1  2  3
1  e  f  x  y
2  e  f  x  y
3  s  t  v  w

print(pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated())
1    False
2     True
3    False
dtype: bool

print(~pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated())
1     True
2    False
3     True
dtype: bool
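For reuse, the same sort-then-duplicated idea can be wrapped in a small function. This is only a sketch: the name drop_permuted_duplicates and its keep parameter are made up for illustration (keep is simply passed through to DataFrame.duplicated), and it assumes the values within a row can be compared with each other, as the strings here can.

```python
import numpy as np
import pandas as pd


def drop_permuted_duplicates(df, keep="first"):
    """Return df without rows whose values are a reordering of another row's values.

    Hypothetical helper name; `keep` is forwarded to DataFrame.duplicated.
    """
    # Sort the values inside each row so that column order no longer matters.
    sorted_rows = np.sort(df.to_numpy(), axis=1)
    # Flag rows whose sorted values already appeared; reuse the original index
    # so the mask lines up with df.
    mask = pd.DataFrame(sorted_rows, index=df.index).duplicated(keep=keep)
    return df[~mask]


df = pd.DataFrame(
    [["x", "y", "e", "f"], ["e", "f", "x", "y"], ["w", "v", "s", "t"]],
    columns=["a", "b", "c", "d"],
    index=["1", "2", "3"],
)
df_new = drop_permuted_duplicates(df)
print(df_new)
```

With keep="last" the later of two matching rows would be retained instead of the earlier one.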
#2
1
Here's another solution, with a for loop:
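data = df.to_numpy()  # as_matrix() was removed in pandas 1.0
new = []
seen = set()
for row in data:
    # Order-independent signature of the row
    key = tuple(sorted(row))
    # Keep the row only if no earlier row had exactly the same set of values
    if key not in seen:
        seen.add(key)
        new.append(row)
new_df = pd.DataFrame(new, columns=df.columns)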
#3
1
Use sorting (np.sort) and then get the duplicates (.duplicated()) out of it. Then use those duplicates to drop (df.drop) the required index.
import pandas as pd
import numpy as np

df = pd.DataFrame(columns=['a','b','c','d'], index=['1','2','3'])
df.loc['1'] = pd.Series({'a':'x','b':'y','c':'e','d':'f'})
df.loc['2'] = pd.Series({'a':'e','b':'f','c':'x','d':'y'})
df.loc['3'] = pd.Series({'a':'w','b':'v','c':'s','d':'t'})

# Sort the values within each row, then flag rows already seen in sorted form
df_duplicated = pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated()
# Index labels of the flagged rows
index_to_drop = df.index[df_duplicated]
# drop returns a new DataFrame; df itself is left unchanged
df.drop(index_to_drop)
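One caveat with the sorting-based approach: sorting a whole row treats any reordering of its values as a duplicate, not only the a,b / c,d exchange described in the question. If only that exchange should count, a stricter key can be built from the two column pairs. The following is a sketch under that assumption; the function name drop_swapped_pairs and the left/right parameters are hypothetical, and the extra row '4' is invented to show the difference.

```python
import numpy as np
import pandas as pd


def drop_swapped_pairs(df, left=("a", "b"), right=("c", "d")):
    """Drop rows that equal an earlier row after exchanging the left and right column groups.

    Hypothetical helper; `left` and `right` name the two column groups.
    """
    left_vals = df[list(left)].itertuples(index=False, name=None)
    right_vals = df[list(right)].itertuples(index=False, name=None)
    seen = set()
    keep = []
    for l, r in zip(left_vals, right_vals):
        # Order the two groups, not the individual values, so that
        # (l, r) and (r, l) produce the same key.
        key = tuple(sorted((l, r)))
        keep.append(key not in seen)
        seen.add(key)
    return df.loc[np.array(keep, dtype=bool)]


df = pd.DataFrame(
    [
        ["x", "y", "e", "f"],
        ["e", "f", "x", "y"],  # row '1' with a,b and c,d exchanged
        ["w", "v", "s", "t"],
        ["y", "x", "f", "e"],  # same values as row '1', but not a pair exchange
    ],
    columns=["a", "b", "c", "d"],
    index=["1", "2", "3", "4"],
)
result = drop_swapped_pairs(df)
print(result)
```

Here row '2' is dropped, while row '4' is kept because it is not a pair exchange of row '1'; sorting the whole row would have dropped it as well.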