How to delete duplicate rows by group?

Date: 2021-07-31 09:11:22

How can I delete duplicate rows by group, with the option to choose how many duplicate rows to keep?

For example (see Example Picture): for every run of consecutive 1s in V1, delete the rows whose Volume is duplicated within that run. For df[2:5,], row 5 will be deleted; for df[9:10,], row 9 will be deleted; for df[15:17,], rows 15 and 16 will be deleted; for df[19:20,], row 19 will be deleted.

Also, is it possible to choose how many duplicate rows to keep? For example, if I want to keep 2 duplicate rows, the result for df[15:17,] will be df[15:16,], where only row 17 is deleted.

How can I achieve this without loops, in a vectorized way, so that the calculation is faster when dealing with millions of rows?

Example Picture

    Volume Weight V1 V2 
 1: 0.5367 0.5367  0  1
 2: 0.8645 0.8508  1  0
 3: 0.8573 0.8585  1  0
 4: 1.1457 1.1413  1  0
 5: 0.8573 0.8568  1  0
 6: 0.5694 0.5633  0  1
 7: 1.2368 1.2343  1  0
 8: 0.9662 0.9593  0  1
 9: 1.4850 1.3412  1  0
10: 1.4850 1.3995  1  0
11: 1.1132 1.1069  0  1
12: 1.4535 1.3923  1  0
13: 1.0437 1.0344  0  1
14: 1.1475 1.1447  0  1
15: 1.1859 1.1748  1  0
16: 1.1859 1.1735  1  0
17: 1.1859 1.1731  1  0
18: 1.1557 1.1552  0  1
19: 1.1749 1.1731  1  0
20: 1.1749 1.1552  1  0

Expected Outcome

    Volume Weight V1 V2 
 1: 0.5367 0.5367  0  1
 2: 0.8645 0.8508  1  0
 3: 0.8573 0.8585  1  0
 4: 1.1457 1.1413  1  0
 6: 0.5694 0.5633  0  1
 7: 1.2368 1.2343  1  0
 8: 0.9662 0.9593  0  1
10: 1.4850 1.3995  1  0
11: 1.1132 1.1069  0  1
12: 1.4535 1.3923  1  0
13: 1.0437 1.0344  0  1
14: 1.1475 1.1447  0  1
17: 1.1859 1.1731  1  0
18: 1.1557 1.1552  0  1
20: 1.1749 1.1552  1  0

1 Answer

#1

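We can use duplicated() on Volume within each run of V1, with the runs identified by rleid() from data.table. This keeps all rows where V1 is 0 and, in every run of 1s, the first occurrence of each Volume: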

# Keep every V1 == 0 row, plus the first row per Volume within each run of V1 == 1
setDT(df1)[df1[, (!duplicated(Volume) & V1 == 1) | V1 == 0, by = rleid(V1)]$V1]
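
To make the grouping step easier to follow, here is a minimal sketch that rebuilds the question's table and splits the one-liner into steps. The names `run`, `keep`, and `kept_rows` are introduced only for illustration and are not part of the original answer.

```r
library(data.table)

# Example data copied from the question's table (row order preserved)
df1 <- data.table(
  Volume = c(0.5367, 0.8645, 0.8573, 1.1457, 0.8573, 0.5694, 1.2368,
             0.9662, 1.4850, 1.4850, 1.1132, 1.4535, 1.0437, 1.1475,
             1.1859, 1.1859, 1.1859, 1.1557, 1.1749, 1.1749),
  Weight = c(0.5367, 0.8508, 0.8585, 1.1413, 0.8568, 0.5633, 1.2343,
             0.9593, 1.3412, 1.3995, 1.1069, 1.3923, 1.0344, 1.1447,
             1.1748, 1.1735, 1.1731, 1.1552, 1.1731, 1.1552),
  V1 = c(0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1)
)
df1[, V2 := 1 - V1]

# rleid() gives each run of identical consecutive V1 values its own id
# (`run` is an illustrative column name)
df1[, run := rleid(V1)]

# One logical flag per row: TRUE for every V1 == 0 row, and for the first
# row of each Volume inside a run of V1 == 1
keep <- df1[, (!duplicated(Volume) & V1 == 1) | V1 == 0, by = run]$V1

# Positions of the rows that survive the filter
kept_rows <- which(keep)
result <- df1[keep]
```

On this data, `kept_rows` is rows 1 to 4, 6 to 9, 11 to 15, 18, and 19. Note that the question's expected outcome keeps row 3 in df[2:5,] but rows 10, 17, and 20 in the later runs, so it does not correspond exactly to a single direction of `duplicated()`.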

If we need to remove duplicates from the reverse direction, that is, keep the last occurrence of each Volume in a run instead of the first, add fromLast = TRUE:

# Same filter, but keep the last row per Volume within each run of V1 == 1
setDT(df1)[df1[, (!duplicated(Volume, fromLast = TRUE) & V1 == 1) | V1 == 0, by = rleid(V1)]$V1]
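
The question also asks how to keep a chosen number of duplicates, which `duplicated()` alone does not cover. One possible approach, sketched here under assumptions, is to number the occurrences of each Volume within a run using data.table's `rowid()` and keep those numbered up to `n`. The helper `keep_n_dups` and its arguments `n` and `from_last` are hypothetical names and are not part of the original answer.

```r
library(data.table)

# Hypothetical helper: keep at most `n` rows per Volume within each run of
# V1 == 1; rows with V1 == 0 are always kept. The input is not modified.
keep_n_dups <- function(dt, n = 1L, from_last = FALSE) {
  dt <- as.data.table(dt)          # accepts a data.frame or a data.table
  run <- rleid(dt$V1)              # id for each run of consecutive V1 values
  occ <- if (from_last) {
    # Count occurrences starting from the end of the table
    rev(rowid(rev(run), rev(dt$Volume)))
  } else {
    # Count occurrences starting from the top: 1, 2, 3, ... per (run, Volume)
    rowid(run, dt$Volume)
  }
  keep <- dt$V1 == 0 | occ <= n
  dt[keep]
}

# Rows 14 to 18 of the question's table
ex <- data.table(
  Volume = c(1.1475, 1.1859, 1.1859, 1.1859, 1.1557),
  Weight = c(1.1447, 1.1748, 1.1735, 1.1731, 1.1552),
  V1     = c(0, 1, 1, 1, 0)
)

first_two <- keep_n_dups(ex, n = 2L)                    # drops original row 17
last_two  <- keep_n_dups(ex, n = 2L, from_last = TRUE)  # drops original row 15
```

Both `rleid()` and `rowid()` operate on whole vectors, so this avoids an explicit R-level loop over groups. How it performs on millions of rows has not been measured here.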
