How to add multiple columns to a dataframe from a custom function in R

Date: 2023-01-05 22:57:13

I've created code that will take an input vector, create a dataframe based on the input, optimise some values and return some of these values. I'm now turning this into a function that will apply the calculations rowwise on an input dataframe. Below is a minimum working example of what I would like to achieve (my actual function would be too long to share here!):

# Randomly generated dataframe
df <-  data.frame(a = rnorm(10, 0, 1), x = rnorm(10, 1, 3), y = rnorm(10, 2, 3))

# Function that takes multiple arguments and returns multiple values in a list
zsummary <- function(x, y) { 
  if (y < 0) return(list(NA, NA))
  z = rnorm(10, x, abs(y))
  return(list(mean(z), sd(z)))
}

# Example of something that works using dplyr
#    However, this results in a lot of function calls...
#    especially if there were a lot of columns in the list...
library(dplyr)
df %>% rowwise() %>%
  mutate(mean = zsummary(x,y)[[1]], sd = zsummary(x,y)[[2]])

As you can see, I can't apply separate functions to the new df$mean and df$sd columns, because both depend on a z vector that should only be generated once. I've looked around on SO already, but I haven't been able to find an answer yet. I think a solution would use one of the apply functions rather than something from dplyr, but I've honestly never fully understood the apply functions. I would also prefer to avoid solutions that use for loops with rbind, as I've tried this in previous projects and it becomes very slow for large dataframes!

1 Solution

#1


We can use mapply for this. Because zsummary takes two arguments, mapply is a natural option: it passes the corresponding elements of 'x' and 'y' to zsummary, one pair at a time.

t(mapply(zsummary, df$x, df$y))
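
To attach the result to the original dataframe as new columns, something like the sketch below may work. It assumes a variant of the function, here called zsummary_vec (a hypothetical name, not from the question), that returns a named numeric vector instead of a list, so that mapply simplifies to a plain numeric matrix. The seed is set only to make the example repeatable.

```r
set.seed(42)  # arbitrary seed, only so the sketch is repeatable
df <- data.frame(a = rnorm(10, 0, 1), x = rnorm(10, 1, 3), y = rnorm(10, 2, 3))

# Hypothetical variant of zsummary that returns a named numeric vector
zsummary_vec <- function(x, y) {
  if (y < 0) return(c(mean = NA_real_, sd = NA_real_))
  z <- rnorm(10, x, abs(y))   # z is generated once per row
  c(mean = mean(z), sd = sd(z))
}

# mapply returns a 2 x n matrix (rows "mean" and "sd"); t() makes it n x 2
res <- t(mapply(zsummary_vec, df$x, df$y))

# Bind the two new columns onto the original dataframe
df_out <- cbind(df, res)
head(df_out)
```

Each row triggers a single call to the function, so both new columns come from the same z vector.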

We can also change the function slightly so that it returns a data.frame, and get the output with dplyr:

zsummary <- function(x, y) { 
   if (y < 0) return(data.frame(mean = NA, sd = NA))
   z = rnorm(10, x, abs(y))
   data.frame(mean = mean(z), sd = sd(z))
}

 df %>%
     rowwise() %>% 
     do(data.frame(., zsummary(.$x, .$y)))

Or, as we discussed in the comments, instead of having the function take multiple arguments, give it a single argument and use apply with MARGIN = 1 to apply it to each row.

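zsummary2 <- function(v1) {
  # v1 is one row as a vector: v1[1] is x, v1[2] is y
  if (v1[2] < 0) return(c(mean = NA, sd = NA))
  z <- rnorm(10, v1[1], abs(v1[2]))
  # Summarise the simulated z, not the input row
  c(mean = mean(z), sd = sd(z))
}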

t(apply(df[-1], 1, zsummary2))
#         mean        sd
# [1,]  1.403066 0.8757504
# [2,]  5.058188 5.1401507
# [3,]  4.288365 1.4194393
# [4,]  1.932829 6.7587054
# [5,] -1.864236 3.7587462
# [6,]        NA        NA
# [7,]  3.328629 1.3711950
# [8,] -2.347699 5.0449958
# [9,]  2.936615 1.7332283
#[10,]        NA        NA

NOTE: The values will differ on each run because no seed was set for rnorm.
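
If repeatable output is wanted, a seed can be set before the random draws. The following is a minimal sketch, assuming a hypothetical helper run_once (not part of the original answer) that rebuilds the dataframe and applies a version of zsummary2 that summarises z. The seed value 123 is arbitrary.

```r
# Row-wise summary: v1[1] is x, v1[2] is y
zsummary2 <- function(v1) {
  if (v1[2] < 0) return(c(mean = NA, sd = NA))
  z <- rnorm(10, v1[1], abs(v1[2]))
  c(mean = mean(z), sd = sd(z))
}

# Hypothetical helper: fix the seed, build the data, apply the function
run_once <- function(seed) {
  set.seed(seed)
  df <- data.frame(a = rnorm(10, 0, 1), x = rnorm(10, 1, 3), y = rnorm(10, 2, 3))
  t(apply(df[-1], 1, zsummary2))
}

first  <- run_once(123)
second <- run_once(123)

# Compare the two runs; with the same seed they should match exactly
identical(first, second)
```

Setting the seed inside the helper keeps both the generated dataframe and the draws inside zsummary2 on the same random stream for each run.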

