dplyr - 多个变量的汇总表

时间:2022-07-22 14:55:35

How to create simple summary statistics using dplyr from multiple variables? Using the summarise_each function seems to be the way to go, however, when applying multiple functions to multiple columns, the result is a wide, hard-to-read data frame.

如何使用多个变量的dplyr创建简单的汇总统计信息?使用summarise_each函数似乎是要走的路,但是,当将多个函数应用于多个列时,结果是一个宽而难以读取的数据帧。

4 个解决方案

#1


20  

Use dplyr in combination with tidyr to reshape the end result.

将dplyr与tidyr结合使用可以重塑最终结果。

library(dplyr)
library(tidyr)

df <- tbl_df(mtcars)

df.sum <- df %>%
  select(mpg, cyl, vs, am, gear, carb) %>% # select variables to summarise
  summarise_each(funs(min = min, 
                      q25 = quantile(., 0.25), 
                      median = median, 
                      q75 = quantile(., 0.75), 
                      max = max,
                      mean = mean, 
                      sd = sd))

# the result is a wide data frame
> dim(df.sum)
[1]  1 42

# reshape it using tidyr functions

df.stats.tidy <- df.sum %>% gather(stat, val) %>%
  separate(stat, into = c("var", "stat"), sep = "_") %>%
  spread(stat, val) %>%
  select(var, min, q25, median, q75, max, mean, sd) # reorder columns

> print(df.stats.tidy)

   var  min    q25 median  q75  max     mean        sd
1   am  0.0  0.000    0.0  1.0  1.0  0.40625 0.4989909
2 carb  1.0  2.000    2.0  4.0  8.0  2.81250 1.6152000
3  cyl  4.0  4.000    6.0  8.0  8.0  6.18750 1.7859216
4 gear  3.0  3.000    4.0  4.0  5.0  3.68750 0.7378041
5  mpg 10.4 15.425   19.2 22.8 33.9 20.09062 6.0269481
6   vs  0.0  0.000    0.0  1.0  1.0  0.43750 0.5040161

#2


6  

If you want to create a summary table for publication (not for further calculations) you may want to look at the excellent stargazer package.

如果您想创建一个用于发布的汇总表(不用于进一步的计算),您可能需要查看优秀的观星包。

df <- data.frame(mtcars)
cols <- c('mpg', 'cyl', 'vs', 'am', 'gear', 'carb')
stargazer(
    df[, cols], type = "text", 
    summary.stat = c("min", "p25", "median", "p75", "max", "median", "sd")
)

================================================================
Statistic  Min   Pctl(25) Median Pctl(75)  Max   Median St. Dev.
----------------------------------------------------------------
mpg       10.400  15.430  19.200  22.800  33.900 19.200  6.027
cyl         4       4       6       8       8      6     1.786
vs          0       0       0       1       1      0     0.504
am          0       0       0       1       1      0     0.499
gear        3       3       4       4       5      4     0.738
carb        1       2       2       4       8      2     1.615
----------------------------------------------------------------

You can change type to 'latex' and 'html' as well and save it to file with specifying the file giving 'out' argument.

您也可以将类型更改为'latex'和'html',并将其保存到文件中,并指定给出'out'参数的文件。

#3


5  

I liked paljenczy's idea of just using dplyr/tidy and getting the table in a data.frame/tibble before formatting it. But I ran into robustness issues: Because it relies on parsing variable names it choked on columns with underscores in the names. After trying to fix this within the dplyr framework it seemed like it would always be somewhat fragile because it relied on string parsing.

我喜欢paljenczy的想法,即只使用dplyr / tidy并在格式化之前将表格放在data.frame / tibble中。但是我遇到了健壮性问题:因为它依赖于解析变量名称,所以它会在名称中带有下划线的列上窒息。在尝试在dplyr框架中修复它之后,它似乎总是有些脆弱,因为它依赖于字符串解析。

So in the end I decided on using psych::describe() which is a function designed for exactly this thing. It doesn't do completely arbitrary functions, but pretty much anything one would realistically want to do. Full example duplicating the previous solutions below (combining describe with some tidyverse stuff to get the exact tibble I'm looking for):

所以最后我决定使用psych :: describe()这是一个专门为这个东西设计的函数。它并不完全是任意函数,而是几乎任何人都想要做的事情。完整的示例复制了以下的解决方案(将描述与一些tidyverse的东西结合起来得到我正在寻找的确切的tibble):

library(psych)
library(tidyverse)

# Create an extended version with a bunch of stats 
d.summary.extended <- mtcars %>%
    select(mpg, cyl, vs, am, gear, carb) %>%
    psych::describe(quant=c(.25,.75)) %>%
    as_tibble() %>%
    rownames_to_column() %>%
    print()

<OUTPUT>
# A tibble: 6 x 16
  rowname  vars     n     mean        sd median    trimmed     mad   min   max range       skew  kurtosis         se  Q0.25 Q0.75
    <chr> <int> <dbl>    <dbl>     <dbl>  <dbl>      <dbl>   <dbl> <dbl> <dbl> <dbl>      <dbl>     <dbl>      <dbl>  <dbl> <dbl>
1     mpg     1    32 20.09062 6.0269481   19.2 19.6961538 5.41149  10.4  33.9  23.5  0.6106550 -0.372766 1.06542396 15.425  22.8
2     cyl     2    32  6.18750 1.7859216    6.0  6.2307692 2.96520   4.0   8.0   4.0 -0.1746119 -1.762120 0.31570933  4.000   8.0
3      vs     3    32  0.43750 0.5040161    0.0  0.4230769 0.00000   0.0   1.0   1.0  0.2402577 -2.001938 0.08909831  0.000   1.0
4      am     4    32  0.40625 0.4989909    0.0  0.3846154 0.00000   0.0   1.0   1.0  0.3640159 -1.924741 0.08820997  0.000   1.0
5    gear     5    32  3.68750 0.7378041    4.0  3.6153846 1.48260   3.0   5.0   2.0  0.5288545 -1.069751 0.13042656  3.000   4.0
6    carb     6    32  2.81250 1.6152000    2.0  2.6538462 1.48260   1.0   8.0   7.0  1.0508738  1.257043 0.28552971  2.000   4.0
</OUTPUT>

# Select stats for comparison with other solutions
d.summary <- d.summary.extended %>%
    select(var=rowname, min, q25=Q0.25, median, q75=Q0.75, max, mean, sd) %>%
    print()

<OUTPUT>
# A tibble: 6 x 8
    var   min    q25 median   q75   max     mean        sd
  <chr> <dbl>  <dbl>  <dbl> <dbl> <dbl>    <dbl>     <dbl>
1   mpg  10.4 15.425   19.2  22.8  33.9 20.09062 6.0269481
2   cyl   4.0  4.000    6.0   8.0   8.0  6.18750 1.7859216
3    vs   0.0  0.000    0.0   1.0   1.0  0.43750 0.5040161
4    am   0.0  0.000    0.0   1.0   1.0  0.40625 0.4989909
5  gear   3.0  3.000    4.0   4.0   5.0  3.68750 0.7378041
6  carb   1.0  2.000    2.0   4.0   8.0  2.81250 1.6152000    
</OUTPUT>

#4


1  

You can achieve the same result using data.table as well. You might consider using it if your table is big.

您也可以使用data.table获得相同的结果。如果你的桌子很大,你可以考虑使用它。

dt <- data.table(mtcars)

cols <- c('mpg', 'cyl', 'vs', 'am', 'gear', 'carb')
functions <- c('min', 'q25', 'median', 'q75', 'max', 'mean', 'sd')

dt.sum <- dt[
    , 
    lapply(
        .SD, 
        function(x) list(
                min(x), quantile(x, 0.25), median(x), 
                quantile(x, 0.75), max(x), mean(x), sd(x)
        )
    ),
    .SDcols = cols
]

dt.sum
     mpg   cyl     vs     am   gear  carb
1:  10.4     4      0      0      3     1
2: 15.43     4      0      0      3     2
3:  19.2     6      0      0      4     2
4:  22.8     8      1      1      4     4
5:  33.9     8      1      1      5     8
6: 20.09 6.188 0.4375 0.4062  3.688 2.812
7: 6.027 1.786  0.504  0.499 0.7378 1.615

# transpose and provide meaningful names
dt.sum.t <- as.data.table(t(sum))[]
setnames(dt.sum.t, names(dt.sum.t), functions)
dt.sum.t[, var := cols]
setcolorder(dt.sum.t, c("var", functions))

dt.sum.t
    var  min   q25 median  q75  max   mean     sd
1:  mpg 10.4 15.43   19.2 22.8 33.9  20.09  6.027
2:  cyl    4     4      6    8    8  6.188  1.786
3:   vs    0     0      0    1    1 0.4375  0.504
4:   am    0     0      0    1    1 0.4062  0.499
5: gear    3     3      4    4    5  3.688 0.7378
6: carb    1     2      2    4    8  2.812  1.615

#1


20  

Use dplyr in combination with tidyr to reshape the end result.

将dplyr与tidyr结合使用可以重塑最终结果。

library(dplyr)
library(tidyr)

df <- tbl_df(mtcars)

df.sum <- df %>%
  select(mpg, cyl, vs, am, gear, carb) %>% # select variables to summarise
  summarise_each(funs(min = min, 
                      q25 = quantile(., 0.25), 
                      median = median, 
                      q75 = quantile(., 0.75), 
                      max = max,
                      mean = mean, 
                      sd = sd))

# the result is a wide data frame
> dim(df.sum)
[1]  1 42

# reshape it using tidyr functions

df.stats.tidy <- df.sum %>% gather(stat, val) %>%
  separate(stat, into = c("var", "stat"), sep = "_") %>%
  spread(stat, val) %>%
  select(var, min, q25, median, q75, max, mean, sd) # reorder columns

> print(df.stats.tidy)

   var  min    q25 median  q75  max     mean        sd
1   am  0.0  0.000    0.0  1.0  1.0  0.40625 0.4989909
2 carb  1.0  2.000    2.0  4.0  8.0  2.81250 1.6152000
3  cyl  4.0  4.000    6.0  8.0  8.0  6.18750 1.7859216
4 gear  3.0  3.000    4.0  4.0  5.0  3.68750 0.7378041
5  mpg 10.4 15.425   19.2 22.8 33.9 20.09062 6.0269481
6   vs  0.0  0.000    0.0  1.0  1.0  0.43750 0.5040161

#2


6  

If you want to create a summary table for publication (not for further calculations) you may want to look at the excellent stargazer package.

如果您想创建一个用于发布的汇总表(不用于进一步的计算),您可能需要查看优秀的观星包。

df <- data.frame(mtcars)
cols <- c('mpg', 'cyl', 'vs', 'am', 'gear', 'carb')
stargazer(
    df[, cols], type = "text", 
    summary.stat = c("min", "p25", "median", "p75", "max", "median", "sd")
)

================================================================
Statistic  Min   Pctl(25) Median Pctl(75)  Max   Median St. Dev.
----------------------------------------------------------------
mpg       10.400  15.430  19.200  22.800  33.900 19.200  6.027
cyl         4       4       6       8       8      6     1.786
vs          0       0       0       1       1      0     0.504
am          0       0       0       1       1      0     0.499
gear        3       3       4       4       5      4     0.738
carb        1       2       2       4       8      2     1.615
----------------------------------------------------------------

You can change type to 'latex' and 'html' as well and save it to file with specifying the file giving 'out' argument.

您也可以将类型更改为'latex'和'html',并将其保存到文件中,并指定给出'out'参数的文件。

#3


5  

I liked paljenczy's idea of just using dplyr/tidy and getting the table in a data.frame/tibble before formatting it. But I ran into robustness issues: Because it relies on parsing variable names it choked on columns with underscores in the names. After trying to fix this within the dplyr framework it seemed like it would always be somewhat fragile because it relied on string parsing.

我喜欢paljenczy的想法,即只使用dplyr / tidy并在格式化之前将表格放在data.frame / tibble中。但是我遇到了健壮性问题:因为它依赖于解析变量名称,所以它会在名称中带有下划线的列上窒息。在尝试在dplyr框架中修复它之后,它似乎总是有些脆弱,因为它依赖于字符串解析。

So in the end I decided on using psych::describe() which is a function designed for exactly this thing. It doesn't do completely arbitrary functions, but pretty much anything one would realistically want to do. Full example duplicating the previous solutions below (combining describe with some tidyverse stuff to get the exact tibble I'm looking for):

所以最后我决定使用psych :: describe()这是一个专门为这个东西设计的函数。它并不完全是任意函数,而是几乎任何人都想要做的事情。完整的示例复制了以下的解决方案(将描述与一些tidyverse的东西结合起来得到我正在寻找的确切的tibble):

library(psych)
library(tidyverse)

# Create an extended version with a bunch of stats 
d.summary.extended <- mtcars %>%
    select(mpg, cyl, vs, am, gear, carb) %>%
    psych::describe(quant=c(.25,.75)) %>%
    as_tibble() %>%
    rownames_to_column() %>%
    print()

<OUTPUT>
# A tibble: 6 x 16
  rowname  vars     n     mean        sd median    trimmed     mad   min   max range       skew  kurtosis         se  Q0.25 Q0.75
    <chr> <int> <dbl>    <dbl>     <dbl>  <dbl>      <dbl>   <dbl> <dbl> <dbl> <dbl>      <dbl>     <dbl>      <dbl>  <dbl> <dbl>
1     mpg     1    32 20.09062 6.0269481   19.2 19.6961538 5.41149  10.4  33.9  23.5  0.6106550 -0.372766 1.06542396 15.425  22.8
2     cyl     2    32  6.18750 1.7859216    6.0  6.2307692 2.96520   4.0   8.0   4.0 -0.1746119 -1.762120 0.31570933  4.000   8.0
3      vs     3    32  0.43750 0.5040161    0.0  0.4230769 0.00000   0.0   1.0   1.0  0.2402577 -2.001938 0.08909831  0.000   1.0
4      am     4    32  0.40625 0.4989909    0.0  0.3846154 0.00000   0.0   1.0   1.0  0.3640159 -1.924741 0.08820997  0.000   1.0
5    gear     5    32  3.68750 0.7378041    4.0  3.6153846 1.48260   3.0   5.0   2.0  0.5288545 -1.069751 0.13042656  3.000   4.0
6    carb     6    32  2.81250 1.6152000    2.0  2.6538462 1.48260   1.0   8.0   7.0  1.0508738  1.257043 0.28552971  2.000   4.0
</OUTPUT>

# Select stats for comparison with other solutions
d.summary <- d.summary.extended %>%
    select(var=rowname, min, q25=Q0.25, median, q75=Q0.75, max, mean, sd) %>%
    print()

<OUTPUT>
# A tibble: 6 x 8
    var   min    q25 median   q75   max     mean        sd
  <chr> <dbl>  <dbl>  <dbl> <dbl> <dbl>    <dbl>     <dbl>
1   mpg  10.4 15.425   19.2  22.8  33.9 20.09062 6.0269481
2   cyl   4.0  4.000    6.0   8.0   8.0  6.18750 1.7859216
3    vs   0.0  0.000    0.0   1.0   1.0  0.43750 0.5040161
4    am   0.0  0.000    0.0   1.0   1.0  0.40625 0.4989909
5  gear   3.0  3.000    4.0   4.0   5.0  3.68750 0.7378041
6  carb   1.0  2.000    2.0   4.0   8.0  2.81250 1.6152000    
</OUTPUT>

#4


1  

You can achieve the same result using data.table as well. You might consider using it if your table is big.

您也可以使用data.table获得相同的结果。如果你的桌子很大,你可以考虑使用它。

dt <- data.table(mtcars)

cols <- c('mpg', 'cyl', 'vs', 'am', 'gear', 'carb')
functions <- c('min', 'q25', 'median', 'q75', 'max', 'mean', 'sd')

dt.sum <- dt[
    , 
    lapply(
        .SD, 
        function(x) list(
                min(x), quantile(x, 0.25), median(x), 
                quantile(x, 0.75), max(x), mean(x), sd(x)
        )
    ),
    .SDcols = cols
]

dt.sum
     mpg   cyl     vs     am   gear  carb
1:  10.4     4      0      0      3     1
2: 15.43     4      0      0      3     2
3:  19.2     6      0      0      4     2
4:  22.8     8      1      1      4     4
5:  33.9     8      1      1      5     8
6: 20.09 6.188 0.4375 0.4062  3.688 2.812
7: 6.027 1.786  0.504  0.499 0.7378 1.615

# transpose and provide meaningful names
dt.sum.t <- as.data.table(t(sum))[]
setnames(dt.sum.t, names(dt.sum.t), functions)
dt.sum.t[, var := cols]
setcolorder(dt.sum.t, c("var", functions))

dt.sum.t
    var  min   q25 median  q75  max   mean     sd
1:  mpg 10.4 15.43   19.2 22.8 33.9  20.09  6.027
2:  cyl    4     4      6    8    8  6.188  1.786
3:   vs    0     0      0    1    1 0.4375  0.504
4:   am    0     0      0    1    1 0.4062  0.499
5: gear    3     3      4    4    5  3.688 0.7378
6: carb    1     2      2    4    8  2.812  1.615