R: Aggregate by all factor levels (present and not present)

Time: 2020-12-14 07:35:39

I can aggregate a data.frame trivially with dplyr with the following:

z <- data.frame(a = rnorm(20), b = rep(letters[1:4], each = 5))

library(dplyr)

z %>%
  group_by(b) %>%
  summarise(out = n())

Source: local data frame [4 x 2]

       b   out
  (fctr) (int)
1      a     5
2      b     5
3      c     5
4      d     5

However, sometimes a dataset may be missing a factor level, in which case I would like the count for that level to be 0.

For example, let's say the typical dataset should have 5 groups.

z$b <- factor(z$b, levels = letters[1:5])

Level e clearly does not occur in this particular dataset, but it could in another. How can I aggregate this data so that the count for missing factor levels is 0?

Desired output:

Source: local data frame [5 x 2]

       b   out
  (fctr) (int)
1      a     5
2      b     5
3      c     5
4      d     5
5      e     0

3 solutions

#1


2  

One way to approach this is to use complete from "tidyr". You first have to use mutate to convert column "b" to a factor that includes all five levels:

library(dplyr)
library(tidyr)

z %>%
  mutate(b = factor(b, letters[1:5])) %>%
  group_by(b) %>%
  summarise(out = n()) %>%
  complete(b, fill = list(out = 0))
# Source: local data frame [5 x 2]
# 
#        b   out
#   (fctr) (dbl)
# 1      a     5
# 2      b     5
# 3      c     5
# 4      d     5
# 5      e     0
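
As an alternative that is not part of the original answer: newer versions of dplyr (0.8.0 and later) accept a `.drop` argument in `group_by`, which keeps empty factor levels as groups. The following is a minimal sketch under that assumption, using the same `z` as in the question; the name `counts` is only for illustration.

```r
library(dplyr)

# Same data as in the question; level "e" is declared but never occurs.
z <- data.frame(a = rnorm(20), b = rep(letters[1:4], each = 5))
z$b <- factor(z$b, levels = letters[1:5])

# .drop = FALSE keeps groups for factor levels that have no rows,
# so n() evaluates to 0 for the empty level.
counts <- z %>%
  group_by(b, .drop = FALSE) %>%
  summarise(out = n())

counts
```

Here `out` stays an integer column, whereas `complete` with `fill = list(out = 0)` turns it into a double.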

#2


1  

A workaround is to join with a table containing all levels:

z <- full_join(z, data.frame(b = levels(z$b)), by = "b")

This will set all the missing rows for your analysis variables to NA, which in the general case would make more sense than setting them to zero. You can change them to zero if necessary with z[is.na(z)] <- 0.
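
One caveat when counting after such a join: the added row for the missing level is still a row, so `n()` would report 1 for it rather than 0. A hedged sketch of one way around this is to count the non-NA values of the analysis variable instead. The names `all_levels`, `joined` and `counts` are illustrative, and `b` is kept as a character column here so that both sides of the join have the same type.

```r
library(dplyr)

# Question data, with b kept as character for a clean join.
z <- data.frame(a = rnorm(20), b = rep(letters[1:4], each = 5),
                stringsAsFactors = FALSE)

# Lookup table listing every level that should appear in the result.
all_levels <- data.frame(b = letters[1:5], stringsAsFactors = FALSE)

# The join adds one row for "e" with a = NA.
joined <- full_join(z, all_levels, by = "b")

# Count only rows that carry real data, so the padded row contributes 0.
counts <- joined %>%
  group_by(b) %>%
  summarise(out = sum(!is.na(a)))

counts
```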

#3


0  

You could use xtabs:

xtabs(a ~ b, z)

This sums z$a within each level of z$b rather than counting the rows in each level as in your example, but a count is easily obtained with table:

table(z$b)
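
If a data frame shaped like the desired output is needed, the table can be converted with `as.data.frame`. This is a small sketch in base R, assuming `z$b` is already a factor with all five levels as in the question; the column name `out` is chosen to match the question's output.

```r
# Question data with the fifth, unused level declared on the factor.
z <- data.frame(a = rnorm(20), b = rep(letters[1:4], each = 5))
z$b <- factor(z$b, levels = letters[1:5])

# table() reports every factor level, including those with zero rows;
# responseName sets the name of the count column.
counts <- as.data.frame(table(b = z$b), responseName = "out")

counts
```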
