How can I create simple summary statistics for multiple variables using dplyr? The summarise_each function seems to be the way to go; however, when applying multiple functions to multiple columns, the result is a wide, hard-to-read data frame.
4 answers
#1
20
Use dplyr in combination with tidyr to reshape the end result.
library(dplyr)
library(tidyr)
# note: tbl_df(), summarise_each() and funs() are deprecated in current dplyr releases
df <- tbl_df(mtcars)
df.sum <- df %>%
  select(mpg, cyl, vs, am, gear, carb) %>% # select variables to summarise
  summarise_each(funs(min = min,
                      q25 = quantile(., 0.25),
                      median = median,
                      q75 = quantile(., 0.75),
                      max = max,
                      mean = mean,
                      sd = sd))
# the result is a wide data frame
> dim(df.sum)
[1] 1 42
# reshape it using tidyr functions
df.stats.tidy <- df.sum %>%
  gather(stat, val) %>%
  separate(stat, into = c("var", "stat"), sep = "_") %>%
  spread(stat, val) %>%
  select(var, min, q25, median, q75, max, mean, sd) # reorder columns
> print(df.stats.tidy)
var min q25 median q75 max mean sd
1 am 0.0 0.000 0.0 1.0 1.0 0.40625 0.4989909
2 carb 1.0 2.000 2.0 4.0 8.0 2.81250 1.6152000
3 cyl 4.0 4.000 6.0 8.0 8.0 6.18750 1.7859216
4 gear 3.0 3.000 4.0 4.0 5.0 3.68750 0.7378041
5 mpg 10.4 15.425 19.2 22.8 33.9 20.09062 6.0269481
6 vs 0.0 0.000 0.0 1.0 1.0 0.43750 0.5040161
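For readers on newer package versions, the same table can probably be built without the deprecated summarise_each() and funs(). The following is only a sketch, assuming dplyr >= 1.0.0 (for across()) and tidyr >= 1.0.0 (for pivot_longer()); the double-underscore separator and the object name stats are choices made here for illustration, not part of the original answer.

```r
library(dplyr)   # assumes dplyr >= 1.0.0 for across()
library(tidyr)   # assumes tidyr >= 1.0.0 for pivot_longer()

stats <- mtcars %>%
  summarise(across(
    c(mpg, cyl, vs, am, gear, carb),          # variables to summarise
    list(
      min    = min,
      q25    = ~ unname(quantile(.x, 0.25)),  # drop the "25%" name
      median = median,
      q75    = ~ unname(quantile(.x, 0.75)),
      max    = max,
      mean   = mean,
      sd     = sd
    ),
    .names = "{.col}__{.fn}"                  # "__" is an arbitrary separator chosen here
  )) %>%
  pivot_longer(
    everything(),
    names_to  = c("var", ".value"),           # one row per variable, one column per statistic
    names_sep = "__"
  )

print(stats)
```

Because the separator is picked explicitly, this variant should also tolerate column names containing a single underscore, as long as none of them contains the chosen separator itself.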
#2
6
If you want to create a summary table for publication (not for further calculations) you may want to look at the excellent stargazer package.
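library(stargazer)
df <- data.frame(mtcars)
cols <- c('mpg', 'cyl', 'vs', 'am', 'gear', 'carb')
# note: "median" is listed twice here, which is why it appears twice in the output;
# replace the second one with "mean" to match the statistics in the other answers
stargazer(
  df[, cols], type = "text",
  summary.stat = c("min", "p25", "median", "p75", "max", "median", "sd")
)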
================================================================
Statistic Min Pctl(25) Median Pctl(75) Max Median St. Dev.
----------------------------------------------------------------
mpg 10.400 15.430 19.200 22.800 33.900 19.200 6.027
cyl 4 4 6 8 8 6 1.786
vs 0 0 0 1 1 0 0.504
am 0 0 0 1 1 0 0.499
gear 3 3 4 4 5 4 0.738
carb 1 2 2 4 8 2 1.615
----------------------------------------------------------------
You can also set type to 'latex' or 'html', and save the table to a file by passing the file name in the 'out' argument.
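As a rough illustration of the type and out arguments, the sketch below writes an HTML version of the table to a temporary file. The file name mtcars_summary.html and the variable out_file are made up for this example, and "mean" is used in place of the repeated "median".

```r
library(stargazer)

cols <- c("mpg", "cyl", "vs", "am", "gear", "carb")

# hypothetical output location, chosen only for this example
out_file <- file.path(tempdir(), "mtcars_summary.html")

stargazer(
  mtcars[, cols],
  type = "html",                      # "latex" and "text" are the other options
  summary.stat = c("min", "p25", "median", "p75", "max", "mean", "sd"),
  out = out_file                      # the table is also written to this file
)
```

The table is still printed to the console; the out argument only adds the file copy.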
#3
5
I liked paljenczy's idea of using just dplyr/tidyr and getting the table into a data.frame/tibble before formatting it. But I ran into robustness issues: because it relies on parsing variable names, it choked on columns with underscores in their names. After trying to fix this within the dplyr framework, it seemed it would always be somewhat fragile because it relies on string parsing.
So in the end I decided to use psych::describe(), a function designed for exactly this purpose. It doesn't accept completely arbitrary functions, but it covers pretty much anything one would realistically want. Full example duplicating the previous solutions below (combining describe with some tidyverse functions to get the exact tibble I'm looking for):
library(psych)
library(tidyverse)
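# Create an extended version with a bunch of stats
d.summary.extended <- mtcars %>%
  select(mpg, cyl, vs, am, gear, carb) %>%
  psych::describe(quant=c(.25,.75)) %>%
  as.data.frame() %>%        # drop the psych classes, keep the row names
  rownames_to_column() %>%   # move variable names into a column first, because
  as_tibble() %>%            # recent tibble versions drop row names in as_tibble()
  print()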
<OUTPUT>
# A tibble: 6 x 16
rowname vars n mean sd median trimmed mad min max range skew kurtosis se Q0.25 Q0.75
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 mpg 1 32 20.09062 6.0269481 19.2 19.6961538 5.41149 10.4 33.9 23.5 0.6106550 -0.372766 1.06542396 15.425 22.8
2 cyl 2 32 6.18750 1.7859216 6.0 6.2307692 2.96520 4.0 8.0 4.0 -0.1746119 -1.762120 0.31570933 4.000 8.0
3 vs 3 32 0.43750 0.5040161 0.0 0.4230769 0.00000 0.0 1.0 1.0 0.2402577 -2.001938 0.08909831 0.000 1.0
4 am 4 32 0.40625 0.4989909 0.0 0.3846154 0.00000 0.0 1.0 1.0 0.3640159 -1.924741 0.08820997 0.000 1.0
5 gear 5 32 3.68750 0.7378041 4.0 3.6153846 1.48260 3.0 5.0 2.0 0.5288545 -1.069751 0.13042656 3.000 4.0
6 carb 6 32 2.81250 1.6152000 2.0 2.6538462 1.48260 1.0 8.0 7.0 1.0508738 1.257043 0.28552971 2.000 4.0
</OUTPUT>
# Select stats for comparison with other solutions
d.summary <- d.summary.extended %>%
  select(var=rowname, min, q25=Q0.25, median, q75=Q0.75, max, mean, sd) %>%
  print()
<OUTPUT>
# A tibble: 6 x 8
var min q25 median q75 max mean sd
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 mpg 10.4 15.425 19.2 22.8 33.9 20.09062 6.0269481
2 cyl 4.0 4.000 6.0 8.0 8.0 6.18750 1.7859216
3 vs 0.0 0.000 0.0 1.0 1.0 0.43750 0.5040161
4 am 0.0 0.000 0.0 1.0 1.0 0.40625 0.4989909
5 gear 3.0 3.000 4.0 4.0 5.0 3.68750 0.7378041
6 carb 1.0 2.000 2.0 4.0 8.0 2.81250 1.6152000
</OUTPUT>
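If one still wants to stay with the name-parsing approach, the underscore problem described above can be reduced (though not eliminated) by splitting each combined name at its last underscore only. This is a small base R sketch; the column names in wide_names are hypothetical, and it assumes the statistic labels themselves contain no underscore.

```r
# hypothetical "variable_statistic" names, some with underscores in the variable part
wide_names <- c("mpg_min", "fuel_type_min", "fuel_type_q25")

# everything before the last underscore is the variable name
var <- sub("_[^_]*$", "", wide_names)

# everything after the last underscore is the statistic label
stat <- sub("^.*_", "", wide_names)

print(data.frame(var = var, stat = stat))
```

This remains string parsing, so it breaks as soon as a statistic label such as "q_25" contains an underscore, which is the kind of fragility the answer describes.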
#4
1
You can achieve the same result using data.table as well. You might consider using it if your table is big.
library(data.table)
dt <- data.table(mtcars)
cols <- c('mpg', 'cyl', 'vs', 'am', 'gear', 'carb')
functions <- c('min', 'q25', 'median', 'q75', 'max', 'mean', 'sd')
# apply all seven statistics to each selected column
dt.sum <- dt[
  ,
  lapply(
    .SD,
    function(x) list(
      min(x), quantile(x, 0.25), median(x),
      quantile(x, 0.75), max(x), mean(x), sd(x)
    )
  ),
  .SDcols = cols
]
dt.sum
mpg cyl vs am gear carb
1: 10.4 4 0 0 3 1
2: 15.43 4 0 0 3 2
3: 19.2 6 0 0 4 2
4: 22.8 8 1 1 4 4
5: 33.9 8 1 1 5 8
6: 20.09 6.188 0.4375 0.4062 3.688 2.812
7: 6.027 1.786 0.504 0.499 0.7378 1.615
# transpose and provide meaningful names
dt.sum.t <- as.data.table(t(dt.sum))
setnames(dt.sum.t, names(dt.sum.t), functions)
dt.sum.t[, var := cols]
setcolorder(dt.sum.t, c("var", functions))
dt.sum.t
var min q25 median q75 max mean sd
1: mpg 10.4 15.43 19.2 22.8 33.9 20.09 6.027
2: cyl 4 4 6 8 8 6.188 1.786
3: vs 0 0 0 1 1 0.4375 0.504
4: am 0 0 0 1 1 0.4062 0.499
5: gear 3 3 4 4 5 3.688 0.7378
6: carb 1 2 2 4 8 2.812 1.615
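An alternative that avoids the transpose step is to melt the selected columns into long form and aggregate by variable. This is only a sketch of that idea, not the answer's method; the names long, val and dt.stats are introduced here for illustration.

```r
library(data.table)

cols <- c("mpg", "cyl", "vs", "am", "gear", "carb")
dt <- as.data.table(mtcars)

# long format: one row per (variable, value) pair
long <- melt(dt[, ..cols], measure.vars = cols,
             variable.name = "var", value.name = "val")

# one row per variable, one named column per statistic
dt.stats <- long[, .(
  min    = min(val),
  q25    = unname(quantile(val, 0.25)),
  median = median(val),
  q75    = unname(quantile(val, 0.75)),
  max    = max(val),
  mean   = mean(val),
  sd     = sd(val)
), by = "var"]

print(dt.stats)
```

Here the statistic names are attached where the statistics are computed, so no separate setnames() or setcolorder() call is needed.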