I've written code that takes an input vector, creates a data frame from it, optimises some values and returns some of them. I'm now turning this into a function that applies the calculations row-wise to an input data frame. Below is a minimal working example of what I would like to achieve (my actual function would be too long to share here!):
# Randomly generated dataframe
df <- data.frame(a = rnorm(10, 0, 1), x = rnorm(10, 1, 3), y = rnorm(10, 2, 3))
# Function that takes multiple arguments and returns multiple values in a list
zsummary <- function(x, y) {
  if (y < 0) return(list(NA, NA))
  z = rnorm(10, x, abs(y))
  return(list(mean(z), sd(z)))
}
# Example of something that works using dplyr
# However, this results in a lot of function calls...
# especially if there were a lot of columns in the list...
library(dplyr)
df %>% rowwise() %>%
  mutate(mean = zsummary(x,y)[[1]], sd = zsummary(x,y)[[2]])
As you can see, I can't apply individual functions to each of the new df$mean and df$sd columns, as they depend on a z vector that can only be generated once. I've looked around on SO already, but I haven't been able to find an answer yet. I think a solution would use one of the apply functions rather than something from dplyr, but I've honestly never fully understood the apply functions. I would also prefer to avoid solutions that use for loops with rbind, as I've tried this in previous projects and it becomes very slow for large data frames!
1 Solution
We can use mapply for this. Because zsummary takes two arguments, mapply is a natural option: it passes the corresponding elements of 'x' and 'y' to zsummary.
t(mapply(zsummary, df$x, df$y))
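Because zsummary returns a list, the result of the call above is a matrix whose cells are list elements rather than plain numbers. A minimal sketch of one way around that, assuming a hypothetical variant named zsummary_vec (not part of the original question) that returns a named numeric vector, so that mapply can simplify the results into a numeric matrix:

```r
# Hypothetical variant of zsummary that returns a named numeric vector
zsummary_vec <- function(x, y) {
  if (y < 0) return(c(mean = NA_real_, sd = NA_real_))
  z <- rnorm(10, x, abs(y))  # z is generated only once per row
  c(mean = mean(z), sd = sd(z))
}

set.seed(1)  # arbitrary seed, only to make the sketch reproducible
df <- data.frame(a = rnorm(10, 0, 1), x = rnorm(10, 1, 3), y = rnorm(10, 2, 3))

# mapply gives a 2 x 10 numeric matrix; t() turns it into one row per input row
res <- t(mapply(zsummary_vec, df$x, df$y))
is.numeric(res)
dim(res)
```

The resulting matrix has the columns mean and sd, one row per row of df.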
We can also change the function slightly so that it returns a data.frame, and get the output with dplyr:
zsummary <- function(x, y) {
  if (y < 0) return(data.frame(mean = NA, sd = NA))
  z = rnorm(10, x, abs(y))
  data.frame(mean = mean(z), sd = sd(z))
}
df %>%
  rowwise() %>%
  do(data.frame(., zsummary(.$x, .$y)))
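In newer dplyr releases do() is superseded. A possible alternative sketch, assuming dplyr 1.0.0 or later, where mutate() spreads an unnamed data frame result into separate columns. The name zsummary_df is hypothetical; it is the same data-frame-returning function as above, but with NA_real_ so the new columns stay numeric:

```r
library(dplyr)

# Hypothetical name; same logic as the data.frame-returning zsummary above
zsummary_df <- function(x, y) {
  if (y < 0) return(data.frame(mean = NA_real_, sd = NA_real_))
  z <- rnorm(10, x, abs(y))
  data.frame(mean = mean(z), sd = sd(z))
}

df <- data.frame(a = rnorm(10, 0, 1), x = rnorm(10, 1, 3), y = rnorm(10, 2, 3))

out <- df %>%
  rowwise() %>%
  mutate(zsummary_df(x, y)) %>%  # one call per row; both columns come from the same z
  ungroup()
```

Here the function is called once per row, so mean and sd are computed from the same z vector.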
Or, as we discussed in the comments, instead of having the function take multiple arguments, give it a single argument and use apply with MARGIN=1 to apply it to each row.
zsummary2 <- function(v1){
  # v1 is one row of df[-1]: v1[1] is x, v1[2] is y
  if(v1[2] < 0) return(c(mean = NA, sd = NA))
  z <- rnorm(10, v1[1], abs(v1[2]))
  c(mean = mean(z), sd = sd(z))
}
t(apply(df[-1], 1, zsummary2))
#            mean        sd
# [1,]   1.403066 0.8757504
# [2,]   5.058188 5.1401507
# [3,]   4.288365 1.4194393
# [4,]   1.932829 6.7587054
# [5,]  -1.864236 3.7587462
# [6,]         NA        NA
# [7,]   3.328629 1.3711950
# [8,]  -2.347699 5.0449958
# [9,]   2.936615 1.7332283
#[10,]         NA        NA
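To keep these values alongside the original columns, the matrix can be bound back onto the data frame. A minimal sketch, assuming a hypothetical helper named row_summary (the same logic as zsummary2, but indexing by column name instead of position) and an arbitrary seed:

```r
# Hypothetical helper: like zsummary2, but looks up x and y by name
row_summary <- function(v) {
  if (v[["y"]] < 0) return(c(mean = NA_real_, sd = NA_real_))
  z <- rnorm(10, v[["x"]], abs(v[["y"]]))
  c(mean = mean(z), sd = sd(z))
}

set.seed(42)  # arbitrary seed, only to make the sketch reproducible
df <- data.frame(a = rnorm(10, 0, 1), x = rnorm(10, 1, 3), y = rnorm(10, 2, 3))

# apply() passes each row as a named numeric vector
res <- t(apply(df[c("x", "y")], 1, row_summary))

# Attach the two new columns to the original data frame
df_out <- cbind(df, res)
names(df_out)
```

Selecting the columns by name avoids depending on their position in df.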
NOTE: The values will differ on each run because we didn't set a seed for rnorm.
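If reproducible output is wanted, a seed can be set before the random draws. A minimal sketch; the seed value 123 is arbitrary and not taken from the original answer:

```r
set.seed(123)
first <- rnorm(3)

set.seed(123)  # resetting the same seed replays the same random draws
second <- rnorm(3)

identical(first, second)  # TRUE
```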