ID Cat1 Cat2 Cat3 Cat4
A0001 358 11.25 37428 0
A0001 279 14.6875 38605 0
A0013 367 5.125 40152 1
A0014 337 16.3125 38624 0
A0020 367 8.875 37797 0
A0020 339 9.625 39324 0
I need help learning how to remove the unique rows in my file while keeping the duplicates or triplicates. For example, the output should look like this:
ID Cat1 Cat2 Cat3 Cat4
A0001 358 11.25 37428 0
A0001 279 14.6875 38605 0
A0020 367 8.875 37797 0
A0020 339 9.625 39324 0
If you can give me advice on how to approach this problem, it would be much appreciated.
Thanks for everyone's suggestions. I want to calculate the difference in value in the different categories (i.e. Cat2, Cat3) between the repeated measures (by unique ID). I would appreciate any suggestions.
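For the follow-up about differences between repeated measures, one possible approach is sketched below in base R. It is only an illustration: it assumes the data frame is called dx, that every retained ID has exactly two rows, and that "difference" means second row minus first. The names dups and diffs are made up for the example.

```r
# Example data from the question
dx <- data.frame(
  ID   = c("A0001", "A0001", "A0013", "A0014", "A0020", "A0020"),
  Cat1 = c(358, 279, 367, 337, 367, 339),
  Cat2 = c(11.25, 14.6875, 5.125, 16.3125, 8.875, 9.625),
  Cat3 = c(37428, 38605, 40152, 38624, 37797, 39324),
  Cat4 = c(0, 0, 1, 0, 0, 0)
)

# Keep only the IDs that occur more than once
dups <- dx[dx$ID %in% dx$ID[duplicated(dx$ID)], ]

# Second measurement minus first, per ID
# (assumption: exactly two rows per ID, so diff() returns one number)
diffs <- aggregate(cbind(Cat2, Cat3) ~ ID, data = dups, FUN = diff)
print(diffs)
```

With three or more rows per ID, diff() returns several values per group, so the result would need a different shape (for example, one row per consecutive pair).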
2 answers
#1
6
Another option in base R, using duplicated:
dx[dx$ID %in% dx$ID[duplicated(dx$ID)],]
# ID Cat1 Cat2 Cat3 Cat4
# 1 A0001 358 11.2500 37428 0
# 2 A0001 279 14.6875 38605 0
# 5 A0020 367 8.8750 37797 0
# 6 A0020 339 9.6250 39324 0
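To see why the %in% step is needed, it may help to break the one-liner above into pieces. This is just an unpacked sketch of the same expression; the intermediate names dup_later, dup_ids and keep are invented for the illustration.

```r
# Example data from the question
dx <- data.frame(
  ID   = c("A0001", "A0001", "A0013", "A0014", "A0020", "A0020"),
  Cat1 = c(358, 279, 367, 337, 367, 339),
  Cat2 = c(11.25, 14.6875, 5.125, 16.3125, 8.875, 9.625),
  Cat3 = c(37428, 38605, 40152, 38624, 37797, 39324),
  Cat4 = c(0, 0, 1, 0, 0, 0)
)

# duplicated() flags only the second and later occurrences of a value,
# so on its own it would miss the first row of each repeated ID
dup_later <- duplicated(dx$ID)

# IDs that appear more than once
dup_ids <- dx$ID[dup_later]

# Flag every row whose ID is among the repeated ones
keep <- dx$ID %in% dup_ids
result <- dx[keep, ]
print(result)
```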
With data.table, using duplicated together with its fromLast version, you get:
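library(data.table)
setkey(setDT(dx), ID) # or with data.table 1.9.5+: setDT(dx, key="ID")
# by="ID" is passed explicitly: since data.table 1.9.8, duplicated() compares all columns by default instead of the key
dx[duplicated(dx, by="ID") | duplicated(dx, by="ID", fromLast=TRUE)]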
# ID Cat1 Cat2 Cat3 Cat4
# 1: A0001 358 11.2500 37428 0
# 2: A0001 279 14.6875 38605 0
# 3: A0020 367 8.8750 37797 0
# 4: A0020 339 9.6250 39324 0
This can also be done in base R, but I prefer data.table here for its syntactic sugar.
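For reference, a base R version of the same duplicated / fromLast idea might look like the sketch below. It assumes a plain data frame named dx and compares on the ID column only; keep is a name chosen for the example.

```r
# Example data from the question
dx <- data.frame(
  ID   = c("A0001", "A0001", "A0013", "A0014", "A0020", "A0020"),
  Cat1 = c(358, 279, 367, 337, 367, 339),
  Cat2 = c(11.25, 14.6875, 5.125, 16.3125, 8.875, 9.625),
  Cat3 = c(37428, 38605, 40152, 38624, 37797, 39324),
  Cat4 = c(0, 0, 1, 0, 0, 0)
)

# Scanning forward flags later occurrences; scanning backward (fromLast)
# flags earlier ones. Their union covers every row of a repeated ID.
keep <- duplicated(dx$ID) | duplicated(dx$ID, fromLast = TRUE)
result <- dx[keep, ]
print(result)
```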
#2
6
General comments.
- The ave approach is the only one here that preserves the data's initial row ordering.
- The by approach should be very slow. I suspect that data.table and dplyr are not much faster than ave and tapply (yet) at selecting groups. Benchmarks to prove me wrong are welcome!
base R (Thanks to @thelatemail for both of the first two approaches.)
1) Each row is assigned the length of its df$ID group, and we filter based on the vector of lengths.
df[ ave(1:nrow(df), df$ID, FUN=length) > 1 , ]
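To make the intermediate vector visible, here is the same ave call run on the question's data. This is an illustrative sketch; group_size is a name introduced for the example.

```r
# Example data from the question
df <- data.frame(
  ID   = c("A0001", "A0001", "A0013", "A0014", "A0020", "A0020"),
  Cat1 = c(358, 279, 367, 337, 367, 339),
  Cat2 = c(11.25, 14.6875, 5.125, 16.3125, 8.875, 9.625),
  Cat3 = c(37428, 38605, 40152, 38624, 37797, 39324),
  Cat4 = c(0, 0, 1, 0, 0, 0)
)

# ave() applies length() within each ID group and writes the result back
# in the original row positions, giving one group size per row
group_size <- ave(seq_len(nrow(df)), df$ID, FUN = length)
print(group_size)

# Rows whose group has more than one member, in their original order
result <- df[group_size > 1, ]
print(result)
```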
2) Alternatively, we split row names or numbers by df$ID, selecting which groups' rows to keep. tapply returns a list of groups of rows, so we must unlist them into a single vector of rows.
df[ unlist(tapply(1:nrow(df), df$ID, function(x) if (length(x) > 1) x)) , ]
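The two stages of the tapply approach can be inspected separately, as in this sketch on the question's data. The names rows_by_id and rows are made up for the illustration.

```r
# Example data from the question
df <- data.frame(
  ID   = c("A0001", "A0001", "A0013", "A0014", "A0020", "A0020"),
  Cat1 = c(358, 279, 367, 337, 367, 339),
  Cat2 = c(11.25, 14.6875, 5.125, 16.3125, 8.875, 9.625),
  Cat3 = c(37428, 38605, 40152, 38624, 37797, 39324),
  Cat4 = c(0, 0, 1, 0, 0, 0)
)

# One element per ID: the row numbers if the group has more than one row,
# NULL otherwise (an if without else returns NULL when the test fails)
rows_by_id <- tapply(seq_len(nrow(df)), df$ID,
                     function(x) if (length(x) > 1) x)

# unlist() drops the NULL entries and flattens the rest into one vector
rows <- unlist(rows_by_id)
result <- df[rows, ]
print(result)
```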
What follows is a worse approach, but it better parallels what you see with data.table and dplyr:
3) The data is split by df$ID, keeping each subset of data, SD, if it has more than one row. by returns a list, so we must rbind them back together.
do.call( rbind, c(list(make.row.names = FALSE),
by(df, df$ID, FUN=function(SD) if (nrow(SD) > 1) SD )))
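The same split, filter and recombine steps can also be written out one at a time with split and Filter. This is a swapped-in variant for illustration rather than the by call itself; pieces and kept are names invented for the sketch.

```r
# Example data from the question
df <- data.frame(
  ID   = c("A0001", "A0001", "A0013", "A0014", "A0020", "A0020"),
  Cat1 = c(358, 279, 367, 337, 367, 339),
  Cat2 = c(11.25, 14.6875, 5.125, 16.3125, 8.875, 9.625),
  Cat3 = c(37428, 38605, 40152, 38624, 37797, 39324),
  Cat4 = c(0, 0, 1, 0, 0, 0)
)

# Split: one data frame per ID
pieces <- split(df, df$ID)

# Filter: keep only the subsets with more than one row
kept <- Filter(function(SD) nrow(SD) > 1, pieces)

# Combine: bind the surviving subsets back into a single data frame
result <- do.call(rbind, c(kept, list(make.row.names = FALSE)))
print(result)
```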
data.table: .N corresponds to nrow within a by=ID group, and .SD is the subset of data.
library(data.table)
setDT(df)[, if (.N>1) .SD, by=ID]
# ID Cat1 Cat2 Cat3 Cat4
# 1: A0001 358 11.2500 37428 0
# 2: A0001 279 14.6875 38605 0
# 3: A0020 367 8.8750 37797 0
# 4: A0020 339 9.6250 39324 0
dplyr: n() corresponds to nrow within a group_by(ID) group.
library(dplyr)
df %>% group_by(ID) %>% filter( n() > 1 )
# Source: local data frame [4 x 5]
# Groups: ID
#
# ID Cat1 Cat2 Cat3 Cat4
# 1 A0001 358 11.2500 37428 0
# 2 A0001 279 14.6875 38605 0
# 3 A0020 367 8.8750 37797 0
# 4 A0020 339 9.6250 39324 0
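On the benchmark question raised in the general comments, a rough starting point for the two base R approaches could look like the sketch below. The synthetic data (size n, random integer IDs, a dummy column x) is entirely hypothetical, system.time gives only a coarse single-run timing, and no particular result is claimed here.

```r
set.seed(1)
n <- 20000
# Hypothetical synthetic data: random IDs, so some occur once and some repeat
df <- data.frame(ID = sample(n, n, replace = TRUE), x = rnorm(n))

# Approach 1: ave
t_ave <- system.time(
  r_ave <- df[ave(seq_len(nrow(df)), df$ID, FUN = length) > 1, ]
)

# Approach 2: tapply
t_tapply <- system.time(
  r_tapply <- df[unlist(tapply(seq_len(nrow(df)), df$ID,
                               function(x) if (length(x) > 1) x)), ]
)

# Elapsed seconds for each approach (varies by machine and run)
print(c(ave = t_ave[["elapsed"]], tapply = t_tapply[["elapsed"]]))
```

Both calls should select the same rows, only in a different order, which is worth checking before comparing timings. For a serious comparison, repeated runs and the data.table and dplyr versions would need to be added.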