Removing duplicates in a Python pandas DataFrame without the duplicates being removed

Time: 2022-10-15 04:25:32

I have a problem with removing duplicates. My program is based around a loop which generates tuples (x, y), which are then used as nodes in a graph. The final array/matrix of nodes is:

[[ 1.          1.        ]
[ 1.12273268  1.15322175]
[..........etc..........]
[ 0.94120695  0.77802849]
**[ 0.84301344  0.91660517]**
[ 0.93096269  1.21383287]
**[ 0.84301344  0.91660517]**
[ 0.75506418  1.0798641 ]]

The length of the array is 22. Now, I need to remove the duplicate entries (see **). So I used:

import pandas

def urows(array):
    df = pandas.DataFrame(array)
    # drop_duplicates returns a copy rather than modifying df in place,
    # so only the returned value matters
    return df.drop_duplicates(take_last=True).values
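
(Note: take_last=True only exists in old pandas releases; it was later deprecated in favor of keep='last' and then removed. A minimal equivalent on a current pandas install, as a sketch:)

import pandas as pd

def urows(array):
    df = pd.DataFrame(array)
    # keep='last' is the modern spelling of take_last=True
    return df.drop_duplicates(keep='last').values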

Fantastic, but I still get:

           0         1
0   1.000000  1.000000
....... etc...........
17  1.039400  1.030320
18  0.941207  0.778028
**19  0.843013  0.916605**
20  0.930963  1.213833
**21  0.843013  0.916605**

So drop_duplicates is not removing anything. I tested to see if the nodes were actually the same, and I get:

print urows(total_nodes)[19,:]
---> [ 0.84301344  0.91660517]
print urows(total_nodes)[21,:]
---> [ 0.84301344  0.91660517]
print urows(total_nodes)[12,:] - urows(total_nodes)[13,:]
---> [ 0.  0.]

Why is it not working? How can I remove those duplicate values?

One more question...

Say two values are "nearly" equal (say x1 and x2); is there any way to replace them so that they end up exactly equal? What I want is to replace x2 with x1 if they are "nearly" equal.

2 Answers

#1 (5 votes)

If I copy-paste in your data, I get:

>>> df
          0         1
0  1.000000  1.000000
1  1.122733  1.153222
2  0.941207  0.778028
3  0.843013  0.916605
4  0.930963  1.213833
5  0.843013  0.916605
6  0.755064  1.079864

>>> df.drop_duplicates() 
          0         1
0  1.000000  1.000000
1  1.122733  1.153222
2  0.941207  0.778028
3  0.843013  0.916605
4  0.930963  1.213833
6  0.755064  1.079864

so it is actually removed, and your problem is that the arrays aren't exactly equal (though their difference rounds to 0 for display).

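A quick way to see this is to compare the rows at full precision instead of trusting the rounded repr; a sketch, assuming total_nodes is the 22-row array from the question:

import numpy as np

a, b = total_nodes[19], total_nodes[21]
print(np.array_equal(a, b))    # False when the rows are not bit-identical
print(np.abs(a - b))           # differences may be tiny but nonzero
print(repr(a[0]), repr(b[0]))  # full-precision view of the raw floats
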
One workaround would be to round the data to however many decimal places are applicable with something like df.apply(np.round, args=[4]), then drop the duplicates. If you want to keep the original data but remove rows that are duplicate up to rounding, you can use something like

df = df.ix[~df.apply(np.round, args=[4]).duplicated()]
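
On current pandas (where .ix has been removed), the same keep-the-originals filter can be written with DataFrame.round and plain boolean indexing; a minimal sketch:

# keep rows that are unique after rounding to 4 decimals,
# while leaving the stored values untouched
df = df[~df.round(4).duplicated()]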

Here's one really clumsy way to do what you're asking for, setting nearly-equal values to be actually equal:

grouped = df.groupby([df[i].round(4) for i in df.columns])
subbed = grouped.apply(lambda g: g.apply(lambda row: g.irow(0), axis=1))
subbed.reset_index(level=list(df.columns), drop=True, inplace=True)

This reorders the dataframe, but you can then call .sort() to get them back in the original order if you need that.

Explanation: the first line uses groupby to group the data frame by the rounded values. Unfortunately, if you give a function to groupby it applies it to the labels rather than the rows (so you could maybe do df.groupby(lambda k: np.round(df.ix[k], 4)), but that sucks too).

The second line uses the apply method on groupby to replace the dataframe of near-duplicate rows, g, with a new dataframe g.apply(lambda row: g.irow(0), axis=1). That uses the apply method on dataframes to replace each row with the first row of the group.

The result then looks like

                        0         1
0      1                           
0.7551 1.0799 6  0.755064  1.079864
0.8430 0.9166 3  0.843013  0.916605
              5  0.843013  0.916605
0.9310 1.2138 4  0.930963  1.213833
0.9412 0.7780 2  0.941207  0.778028
1.0000 1.0000 0  1.000000  1.000000
1.1227 1.1532 1  1.122733  1.153222

where groupby has inserted the rounded values as an index. The reset_index line then drops those columns.

Hopefully someone who knows pandas better than I do will drop by and show how to do this better.

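On newer pandas, one less clumsy possibility is groupby(...).transform('first'), which overwrites every row with the first row of its rounded group while keeping the original index and row order; a sketch under the same round-to-4-decimals assumption:

# group rows by their rounded values, then snap each row to the
# first row of its group so near-equal values become exactly equal
keys = [df[c].round(4) for c in df.columns]
snapped = df.groupby(keys).transform('first')
unique_rows = snapped.drop_duplicates()  # duplicates are now exact, so this works
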
#2 (1 vote)

Similar to @Dougal's answer, but in a slightly different way:

In [20]: df.ix[~(df*1e6).astype('int64').duplicated(cols=[0])]
Out[20]: 
          0         1
0  1.000000  1.000000
1  1.122733  1.153222
2  0.941207  0.778028
3  0.843013  0.916605
4  0.930963  1.213833
6  0.755064  1.079864
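
For reference, the cols keyword was later renamed to subset and .ix was removed, so on current pandas the same idea would read:

# scale to integers so tiny float noise disappears, then drop rows
# whose (integerized) column 0 has been seen before
df.loc[~(df * 1e6).astype('int64').duplicated(subset=[0])]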
