I wish to count the number of unique values of one variable within each group of a second variable, and then add the count to the existing data.frame as a new column. For example, if the existing data frame looks like this:
       color  type
    1  black chair
    2  black chair
    3  black  sofa
    4  green  sofa
    5  green  sofa
    6    red  sofa
    7    red plate
    8   blue  sofa
    9   blue plate
    10  blue chair
I want to add, for each color, the count of unique types that are present in the data:
       color  type unique_types
    1  black chair            2
    2  black chair            2
    3  black  sofa            2
    4  green  sofa            1
    5  green  sofa            1
    6    red  sofa            2
    7    red plate            2
    8   blue  sofa            3
    9   blue plate            3
    10  blue chair            3
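To make the example reproducible, the data above could be built roughly as follows. This is only a sketch: the name df and the choice of character (rather than factor) columns are assumptions, not something stated in the question.

```r
# Example data from the question; `df` is an assumed name.
df <- data.frame(
  color = c("black", "black", "black", "green", "green",
            "red", "red", "blue", "blue", "blue"),
  type  = c("chair", "chair", "sofa", "sofa", "sofa",
            "sofa", "plate", "sofa", "plate", "chair"),
  stringsAsFactors = FALSE  # keep both columns as character vectors
)

# Desired extra column, copied from the table above:
# the number of distinct types within each color.
expected <- c(2L, 2L, 2L, 1L, 1L, 2L, 2L, 3L, 3L, 3L)
```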
I was hoping to use ave, but can't seem to find a straightforward method that doesn't require many lines. I have >100,000 rows, so I am also not sure how important efficiency is.
It's somewhat similar to this issue: Count number of observations/rows per group and add result to data frame
3 solutions
#1
40
Using ave (since you ask for it specifically):
    within(df, { count <- ave(type, color, FUN = function(x) length(unique(x))) })
Make sure that type is a character vector and not a factor.
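As a self-contained sketch of the ave approach (the example data and the as.integer() step are additions, not part of the original answer): ave() returns a vector of the same type as its first argument, so with a character type column the counts come back as character strings and may need converting.

```r
# Example data; `df` and its contents are assumed for illustration.
df <- data.frame(
  color = c("black", "black", "black", "green", "green",
            "red", "red", "blue", "blue", "blue"),
  type  = c("chair", "chair", "sofa", "sofa", "sofa",
            "sofa", "plate", "sofa", "plate", "chair"),
  stringsAsFactors = FALSE
)

# ave() applies FUN to `type` within each level of `color` and
# returns a vector aligned with the original rows.
counts_chr <- ave(df$type, df$color, FUN = function(x) length(unique(x)))

# The result has the type of the first argument (character here),
# so convert it before storing it as a count.
df$count <- as.integer(counts_chr)
```

If type were a factor, the same call would try to write numbers into a factor, which is why the answer asks for a character vector.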
Since you also say your data is huge and that speed/performance may therefore be a factor, I'd suggest a data.table solution as well.
    require(data.table)
    setDT(df)[, count := uniqueN(type), by = color]  # v1.9.6+
    # if you don't want df to be modified by reference
    ans = as.data.table(df)[, count := uniqueN(type), by = color]
uniqueN was implemented in v1.9.6 and is a faster equivalent of length(unique(.)). In addition, it also works with data.frames/data.tables.
Other solutions:
Using plyr:
    require(plyr)
    ddply(df, .(color), mutate, count = length(unique(type)))
Using aggregate:
    agg <- aggregate(data = df, type ~ color, function(x) length(unique(x)))
    merge(df, agg, by = "color", all = TRUE)
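A possible variant of the aggregate approach, sketched on the example data: because the aggregated column is also called type, the merge above ends up with type.x and type.y columns, and merge() may reorder the rows. Renaming the column and looking the counts up with match() avoids both. The column name unique_types and the data are assumptions for illustration.

```r
# Example data; `df` and its contents are assumed for illustration.
df <- data.frame(
  color = c("black", "black", "black", "green", "green",
            "red", "red", "blue", "blue", "blue"),
  type  = c("chair", "chair", "sofa", "sofa", "sofa",
            "sofa", "plate", "sofa", "plate", "chair"),
  stringsAsFactors = FALSE
)

# One row per color, holding the number of distinct types.
agg <- aggregate(type ~ color, data = df,
                 FUN = function(x) length(unique(x)))

# Rename the aggregated column so it does not clash with df$type.
names(agg)[names(agg) == "type"] <- "unique_types"

# Look up each row's color in agg; this keeps the original row order.
df$unique_types <- agg$unique_types[match(df$color, agg$color)]
```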
#2
32
Here's a solution with the dplyr package - it has n_distinct() as a wrapper for length(unique()).
    library(dplyr)
    df %>% group_by(color) %>% mutate(unique_types = n_distinct(type))
#3
4
This can also be achieved in a vectorized way, without by-group operations, by combining unique with table or tabulate.
If df$color is a factor, then
Either
    table(unique(df)$color)[as.character(df$color)]
    # black black black green green   red   red  blue  blue  blue
    #     2     2     2     1     1     2     2     3     3     3
Or
    tabulate(unique(df)$color)[as.integer(df$color)]
    # [1] 2 2 2 1 1 2 2 3 3 3
If df$color is character, then just
    table(unique(df)$color)[df$color]
If df$color is an integer, then just
    tabulate(unique(df)$color)[df$color]
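The unique(df) calls above rely on df containing only the color and type columns. Below is a hedged sketch that selects those two columns explicitly and indexes the table by name, so the same line should work whether color is a factor, character, or integer. The data, the helper name pairs, and the column name unique_types are assumptions for illustration.

```r
# Example data; `df` and its contents are assumed for illustration.
df <- data.frame(
  color = c("black", "black", "black", "green", "green",
            "red", "red", "blue", "blue", "blue"),
  type  = c("chair", "chair", "sofa", "sofa", "sofa",
            "sofa", "plate", "sofa", "plate", "chair"),
  stringsAsFactors = FALSE
)

# Distinct (color, type) combinations; other columns, if any, are ignored.
pairs <- unique(df[c("color", "type")])

# Number of distinct types per color, as a named table.
counts <- table(pairs$color)

# Index the table by name to expand the counts back to one value per row.
df$unique_types <- as.integer(counts[as.character(df$color)])
```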