I am trying to write a function in R that summarizes a table. Below is an example function, which I am testing on the iris data.
library(dplyr)

test_func <- function(data, by_var_nm) {
  by_var_nm <- deparse(substitute(by_var_nm))
  tbl_test_sum <- data %>%
    group_by(data[[by_var_nm]]) %>%
    summarise(
      count = n()
    )
  tbl_test_sum
}

test_func(iris, Species)
As you can see in the output below, there is a problem: the first column of the table is named "data[[by_var_nm]]" instead of "Species". Is there a way to keep the original variable name when summarizing?
# A tibble: 3 x 2
  `data[[by_var_nm]]` count
  <fct>               <int>
1 setosa                 50
2 versicolor             50
3 virginica              50
Thank you.
Thank you all for the very helpful answers. I tried the solutions, and snoram's answer solved my initial problem quite well. However, after I combined everything, I couldn't get the last part, the plot, to work properly. The idea is to plot the percentage distribution of "var_nm", grouped by "by_var_nm". The problem is that the bars and the percentage data labels do not line up properly.
library(dplyr)
library(ggplot2)

test_func <- function(data, var_nm, by_var_nm) {
  var_nm <- deparse(substitute(var_nm))
  by_var_nm <- deparse(substitute(by_var_nm))

  # counts per (by_var_nm, var_nm) combination
  tbl_test_sum <- as.data.frame(table(data[[by_var_nm]], data[[var_nm]]))
  names(tbl_test_sum) <- c(by_var_nm, var_nm, "count")

  # totals per by_var_nm group
  tbl_test_total <- as.data.frame(table(data[[by_var_nm]]))
  names(tbl_test_total) <- c(by_var_nm, "total")

  tbl_test_pctg <- full_join(tbl_test_sum, tbl_test_total, by = by_var_nm) %>%
    mutate(
      percentage = count / total
    )

  ggplot(data = tbl_test_pctg, aes(x = tbl_test_pctg[[var_nm]], y = percentage, fill = tbl_test_pctg[[var_nm]])) +
    geom_bar(stat = "identity") +
    geom_text(aes(label = scales::percent(percentage))) +
    facet_grid(tbl_test_pctg[[by_var_nm]] ~ .) +
    coord_flip()
}

test_func(mtcars, cyl, am)
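One possible way to approach the misalignment described above, offered as a sketch rather than a confirmed diagnosis: indexing the data frame directly inside aes() and facet_grid() is generally discouraged in ggplot2, because such external vectors are not guaranteed to stay in step with the per-panel data. The version below refers to columns by name instead, using the .data pronoun in aes() and a formula built from the column name for the facets. The helper names pctg_table and plot_pctg are hypothetical, the percentage table is computed with prop.table() instead of the join, and the formula construction assumes syntactic column names.

```r
library(ggplot2)

# Share of each var_nm level within each by_var_nm group (base R only).
# Both column names are passed as strings here.
pctg_table <- function(data, var_nm, by_var_nm) {
  counts <- table(data[[by_var_nm]], data[[var_nm]])
  # margin = 1 scales each row (one by_var_nm group) so that it sums to 1
  out <- as.data.frame(prop.table(counts, margin = 1))
  names(out) <- c(by_var_nm, var_nm, "percentage")
  out
}

plot_pctg <- function(data, var_nm, by_var_nm) {
  var_nm <- deparse(substitute(var_nm))
  by_var_nm <- deparse(substitute(by_var_nm))
  tbl <- pctg_table(data, var_nm, by_var_nm)

  # .data[[name]] looks the column up inside the plot data,
  # so bars, labels, and facets are all drawn from the same rows.
  ggplot(tbl, aes(x = .data[[var_nm]], y = percentage, fill = .data[[var_nm]])) +
    geom_col() +
    geom_text(aes(label = scales::percent(percentage))) +
    # builds e.g. am ~ . from the string "am"
    facet_grid(as.formula(paste(by_var_nm, "~ ."))) +
    coord_flip() +
    labs(x = var_nm, fill = var_nm)
}

p <- plot_pctg(mtcars, cyl, am)
```

Printing p draws the plot. geom_col() is equivalent to geom_bar(stat = "identity"), and the labs() call only sets readable axis and legend titles.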
3 Answers
#1
1
I suggest a solution similar to Alexandre's that also drops the dplyr dependency. If you plan on keeping this function, I think unnecessary dependencies are not a good idea.
test_func <- function(data, by_var_nm) {
  by_var_nm <- deparse(substitute(by_var_nm))
  tbl_test_sum <- as.data.frame(table(data[[by_var_nm]]))
  names(tbl_test_sum) <- c(by_var_nm, "count")
  tbl_test_sum
}
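For illustration, here is a minimal usage sketch of the base R function above. The name test_func_snoram is borrowed from the benchmark call that follows so the two versions can be told apart; the remark about NA handling in the comment is an addition, not part of the original answer.

```r
# Base R version of the summary function (no dplyr needed).
test_func_snoram <- function(data, by_var_nm) {
  # Capture the unquoted column name as a string, e.g. "Species".
  by_var_nm <- deparse(substitute(by_var_nm))
  # table() counts each level; by default it drops NA values
  # (table(..., useNA = "ifany") would count them as well).
  tbl_test_sum <- as.data.frame(table(data[[by_var_nm]]))
  names(tbl_test_sum) <- c(by_var_nm, "count")
  tbl_test_sum
}

res <- test_func_snoram(iris, Species)
print(res)
#      Species count
# 1     setosa    50
# 2 versicolor    50
# 3  virginica    50
```

Note that the result is a plain data.frame rather than a tibble, which may matter if later steps rely on tibble printing or behavior.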
Speed:
> microbenchmark::microbenchmark(test_func_Alex(iris, Species), test_func_snoram(iris, Species), unit = "relative")
Unit: relative
                            expr      min       lq     mean   median       uq      max neval cld
   test_func_Alex(iris, Species) 6.910679 6.834064 5.827796 5.622154 5.480321 4.009469   100   b
 test_func_snoram(iris, Species) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000   100  a
#2
1
You can use rlang's quotation syntax, which is designed for this use case; also read the examples here:
library(rlang); library(dplyr)

test_func <- function(data, by_var_nm) {
  by_var_nm <- enquo(by_var_nm)
  tbl_test_sum <- data %>%
    group_by(!!by_var_nm) %>%
    summarise(
      count = n()
    )
  tbl_test_sum
}

test_func(iris, Species)
# A tibble: 3 x 2
#   Species    count
#   <fct>      <int>
# 1 setosa        50
# 2 versicolor    50
# 3 virginica     50
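As a side note, the enquo() plus !! pair above can also be written with the embrace operator {{ }}. The sketch below assumes rlang 0.4.0 or later together with a dplyr version that supports the operator; the function name test_func_embrace is hypothetical.

```r
library(dplyr)

# Same idea as enquo() + !!, written with the embrace operator:
# {{ by_var_nm }} forwards the caller's unquoted column to group_by().
test_func_embrace <- function(data, by_var_nm) {
  data %>%
    group_by({{ by_var_nm }}) %>%
    summarise(count = n())
}

res <- test_func_embrace(iris, Species)
names(res)
```

The grouping column should again be called Species, because group_by() receives the original column reference rather than a computed expression.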
#3
0
I don't know why this is happening, but you can use this trick to get the name back:
library(dplyr)

test_func <- function(data, by_var_nm) {
  by_var_nm <- deparse(substitute(by_var_nm))
  tbl_test_sum <- data %>%
    group_by(data[[by_var_nm]]) %>%
    summarise(
      count = n()
    )
  # rename the column whose name contains "by_var_nm"
  names(tbl_test_sum)[grep("by_var_nm", names(tbl_test_sum))] <- by_var_nm
  tbl_test_sum
}

test_func(iris, Species)
You can also use the index, names(tbl_test_sum)[1], assuming group_by() creates the first column from this variable.
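As for why the name changes: group_by() treats data[[by_var_nm]] as a computed expression rather than a reference to an existing column, so it labels the new grouping column with the text of that expression. A minimal sketch of one way to avoid the rename step, assuming the rlang package is available (it is installed alongside dplyr): convert the captured string back into a symbol with rlang::sym() and unquote it with !!. The function name test_func_sym is hypothetical.

```r
library(dplyr)

test_func_sym <- function(data, by_var_nm) {
  # Capture the unquoted argument as a string, e.g. "Species".
  by_var_nm <- deparse(substitute(by_var_nm))
  data %>%
    # sym() turns the string into a column symbol; !! splices it in,
    # so group_by() sees the existing column instead of an expression.
    group_by(!!rlang::sym(by_var_nm)) %>%
    summarise(count = n())
}

res <- test_func_sym(iris, Species)
names(res)

# For comparison: grouping by an expression adds a column named after
# the expression text rather than after the underlying variable.
computed_name <- tail(names(group_by(iris, iris[["Species"]])), 1)
computed_name
```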
Hope this helps.