Why is "vapply" safer than "sapply"?

Time: 2022-04-08 20:13:43

The documentation says

vapply is similar to sapply, but has a pre-specified type of return value, so it can be safer [...] to use.

Could you please elaborate as to why it is generally safer, maybe providing examples?


P.S.: I know the answer and I already tend to avoid sapply. I just wish there was a nice answer here on SO so I can point my coworkers to it. Please, no "read the manual" answer.

3 Answers

#1


58  

As has already been noted, vapply does two things:

  • Offers a slight speed improvement.
  • Improves consistency by providing limited return type checks.

The second point is the greater advantage, as it helps catch errors early and leads to more robust code. The same check could be done separately by calling sapply and then stopifnot to confirm that the return values match what you expected, but vapply is a little easier (if more limited, since custom error-checking code could also check that values fall within bounds, etc.).
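
As a rough sketch of that trade-off, the snippet below runs sapply followed by stopifnot and then the equivalent vapply call. The names find_d and entities are made up for illustration, and the extra value check is just one example of what custom checking could add.

```r
# Hypothetical toy data and extractor, used only for this illustration.
find_d <- function(x) x[x == "d"]
entities <- list(letters[1:5], letters[3:12])

# sapply() followed by manual checks on the simplified result.
res <- sapply(entities, find_d)
stopifnot(is.character(res), length(res) == length(entities))

# Custom checks can go further than vapply(), e.g. restricting the values.
stopifnot(all(res %in% letters))

# vapply() folds the type and length check into the call itself.
res2 <- vapply(entities, find_d, character(1))
identical(res, res2)
```

The manual version is more flexible, but it only works if nobody forgets to write the checks.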

Here's an example of vapply ensuring your result is as expected. This parallels something I was just working on while PDF scraping, where findD would use a regex to match a pattern in raw text data (e.g. I'd have a list that was split by entity, and a regex to match addresses within each entity). Occasionally the PDF had been converted out of order and there would be two addresses for an entity, which caused problems.

> input1 <- list( letters[1:5], letters[3:12], letters[c(5,2,4,7,1)] )
> input2 <- list( letters[1:5], letters[3:12], letters[c(2,5,4,7,15,4)] )
> findD <- function(x) x[x=="d"]
> sapply(input1, findD )
[1] "d" "d" "d"
> sapply(input2, findD )
[[1]]
[1] "d"

[[2]]
[1] "d"

[[3]]
[1] "d" "d"

> vapply(input1, findD, "" )
[1] "d" "d" "d"
> vapply(input2, findD, "" )
Error in vapply(input2, findD, "") : values must be length 1,
 but FUN(X[[3]]) result is length 2
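
Once vapply has flagged the problem, one possible way to find the offending elements is to fall back to lapply and inspect the result lengths. This is only a sketch of a follow-up step, reusing input2 and findD from the transcript above; the names hits and bad are made up here.

```r
# Same data and function as in the transcript above.
input2 <- list(letters[1:5], letters[3:12], letters[c(2, 5, 4, 7, 15, 4)])
findD <- function(x) x[x == "d"]

# lapply() never simplifies, so every result can be inspected safely.
hits <- lapply(input2, findD)

# Positions whose result is not exactly one match.
bad <- which(lengths(hits) != 1L)
bad
## [1] 3
```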

As I tell my students, part of becoming a programmer is changing your mindset from "errors are annoying" to "errors are my friend."

Zero length inputs
One related point is that if the input length is zero, sapply will always return an empty list, regardless of the input type. Compare:

sapply(1:5, identity)
## [1] 1 2 3 4 5
sapply(integer(), identity)
## list()
vapply(1:5, identity, integer(1))
## [1] 1 2 3 4 5
vapply(integer(), identity, integer(1))
## integer(0)

With vapply, you are guaranteed to have a particular type of output, so you don't need to write extra checks for zero length inputs.
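
To see why this matters downstream, here is a small sketch in which the result is passed on to sum. The helper names total_sapply and total_vapply are hypothetical.

```r
# Hypothetical helpers that total the lengths of the elements of a list.
total_sapply <- function(xs) sum(sapply(xs, length))
total_vapply <- function(xs) sum(vapply(xs, length, integer(1)))

# Both agree on non-empty input.
total_sapply(list(1:3, 1:2))
## [1] 5
total_vapply(list(1:3, 1:2))
## [1] 5

# On empty input, vapply() returns integer(0), so sum() gives 0.
total_vapply(list())
## [1] 0

# sapply() returns list() instead, and sum() of a list is an error.
res <- tryCatch(total_sapply(list()), error = function(e) "error")
res
## [1] "error"
```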

Benchmarks

vapply can be a bit faster because it already knows what format it should be expecting the results in.

input1.long <- rep(input1,10000)

library(microbenchmark)
m <- microbenchmark(
  sapply(input1.long, findD ),
  vapply(input1.long, findD, "" )
)
library(ggplot2)
# library(taRifx) was needed for autoplot.microbenchmark when this was written; the method has since moved into microbenchmark itself
autoplot(m)

#2


13  

The extra keystrokes involved with vapply could save you time debugging confusing results later. If the function you're calling can return different data types, vapply should certainly be used.

One example that comes to mind would be sqlQuery in the RODBC package. If there's an error executing a query, this function returns a character vector with the message. So, for example, say you're trying to iterate over a vector of table names tnames and select the max value from the numeric column 'NumCol' in each table with:

sapply(tnames, 
   function(tname) sqlQuery(cnxn, paste("SELECT MAX(NumCol) FROM", tname))[[1]])

If all the table names are valid, this would result in a numeric vector. But if one of the table names happens to change in the database and the query fails, the results are going to be coerced into mode character. Using vapply with FUN.VALUE=numeric(1), however, will raise an error right here instead of letting the problem pop up somewhere down the line or, worse, not at all.
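
The behaviour can be tried without a database. In the sketch below, fake_max is a made-up stand-in for the query call (it is not part of RODBC) that returns a number for known tables and a message otherwise; the table names are invented as well.

```r
# Stand-in for the database call; NOT the RODBC API. It returns a number
# for known tables and an error message (character) for anything else.
fake_max <- function(tname) {
  known <- c(sales = 10, stock = 25)
  if (tname %in% names(known)) known[[tname]] else paste("no such table:", tname)
}
tnames <- c("sales", "stock", "renamed")

# sapply() silently coerces the whole result to character.
loose <- sapply(tnames, fake_max)
is.numeric(loose)
## [1] FALSE

# vapply() stops as soon as one result is not a single number.
failed <- tryCatch({
  vapply(tnames, fake_max, FUN.VALUE = numeric(1))
  FALSE
}, error = function(e) TRUE)
failed
## [1] TRUE
```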

#3


12  

If you always want your result to be something in particular, e.g. a logical vector, vapply makes sure this happens, but sapply does not necessarily do so.

a <- vapply(NULL, is.factor, FUN.VALUE=logical(1))
b <- sapply(NULL, is.factor)

is.logical(a)
## [1] TRUE
is.logical(b)
## [1] FALSE
