Whenever I want to do something "map"py in R, I usually try to use a function in the apply family.
However, I've never quite understood the differences between them -- how {sapply, lapply, etc.} apply the function to the input/grouped input, what the output will look like, or even what the input can be -- so I often just go through them all until I get what I want.
Can someone explain how to use which one when?
My current (probably incorrect/incomplete) understanding is...
- sapply(vec, f): input is a vector. Output is a vector/matrix, where element i is f(vec[i]), giving you a matrix if f has a multi-element output.
- lapply(vec, f): same as sapply, but output is a list?
- apply(matrix, 1/2, f): input is a matrix. Output is a vector, where element i is f(row/col i of the matrix).
- tapply(vector, grouping, f): output is a matrix/array, where an element in the matrix/array is the value of f at a grouping g of the vector, and g gets pushed to the row/col names.
- by(dataframe, grouping, f): let g be a grouping. Apply f to each column of the group/dataframe. Pretty print the grouping and the value of f at each column.
- aggregate(matrix, grouping, f): similar to by, but instead of pretty printing the output, aggregate sticks everything into a dataframe.
Side question: I still haven't learned plyr or reshape -- would plyr or reshape replace all of these entirely?
9 answers
#1
1172
R has many *apply functions which are ably described in the help files (e.g. ?apply). There are enough of them, though, that beginning useRs may have difficulty deciding which one is appropriate for their situation or even remembering them all. They may have a general sense that "I should be using an *apply function here", but it can be tough to keep them all straight at first.
Despite the fact (noted in other answers) that much of the functionality of the *apply family is covered by the extremely popular plyr package, the base functions remain useful and worth knowing.
This answer is intended to act as a sort of signpost for new useRs to help direct them to the correct *apply function for their particular problem. Note, this is not intended to simply regurgitate or replace the R documentation! The hope is that this answer helps you to decide which *apply function suits your situation and then it is up to you to research it further. With one exception, performance differences will not be addressed.
- apply - When you want to apply a function to the rows or columns of a matrix (and higher-dimensional analogues); not generally advisable for data frames as it will coerce to a matrix first.
# Two dimensional matrix
M <- matrix(seq(1,16), 4, 4)

# apply min to rows
apply(M, 1, min)
[1] 1 2 3 4

# apply max to columns
apply(M, 2, max)
[1]  4  8 12 16

# 3 dimensional array
M <- array( seq(32), dim = c(4,4,2))

# Apply sum across each M[*, , ] - i.e. sum across 2nd and 3rd dimension
apply(M, 1, sum)
# Result is one-dimensional
[1] 120 128 136 144

# Apply sum across each M[*, *, ] - i.e. sum across 3rd dimension
apply(M, c(1,2), sum)
# Result is two-dimensional
     [,1] [,2] [,3] [,4]
[1,]   18   26   34   42
[2,]   20   28   36   44
[3,]   22   30   38   46
[4,]   24   32   40   48
If you want row/column means or sums for a 2D matrix, be sure to investigate the highly optimized, lightning-quick colMeans, rowMeans, colSums, rowSums.
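As a quick check of that advice, the sketch below compares apply with the dedicated helpers on the same 4 x 4 matrix used in the apply example. Nothing is timed here; it only shows that the two routes agree on the values.

```r
# Same 4 x 4 matrix as in the apply example
M <- matrix(seq(1, 16), 4, 4)

# Row sums and column means computed two ways
row_apply <- apply(M, 1, sum)
row_fast  <- rowSums(M)
col_apply <- apply(M, 2, mean)
col_fast  <- colMeans(M)

# all.equal() compares values and tolerates integer vs. double storage
all.equal(row_apply, row_fast)
all.equal(col_apply, col_fast)
```

Note that apply(M, 1, sum) keeps the integer type of M, while rowSums always returns doubles, which is why all.equal is used rather than identical.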
- lapply - When you want to apply a function to each element of a list in turn and get a list back.
This is the workhorse of many of the other *apply functions. Peel back their code and you will often find lapply underneath.
x <- list(a = 1, b = 1:3, c = 10:100)
lapply(x, FUN = length)
$a
[1] 1
$b
[1] 3
$c
[1] 91

lapply(x, FUN = sum)
$a
[1] 1
$b
[1] 6
$c
[1] 5005
- sapply - When you want to apply a function to each element of a list in turn, but you want a vector back, rather than a list.
If you find yourself typing unlist(lapply(...)), stop and consider sapply.
x <- list(a = 1, b = 1:3, c = 10:100)
# Compare with above; a named vector, not a list
sapply(x, FUN = length)
 a  b  c
 1  3 91

sapply(x, FUN = sum)
   a    b    c
   1    6 5005
In more advanced uses of sapply it will attempt to coerce the result to a multi-dimensional array, if appropriate. For example, if our function returns vectors of the same length, sapply will use them as columns of a matrix:
sapply(1:5,function(x) rnorm(3,x))
If our function returns a 2 dimensional matrix, sapply will do essentially the same thing, treating each returned matrix as a single long vector:
sapply(1:5,function(x) matrix(x,2,2))
Unless we specify simplify = "array", in which case it will use the individual matrices to build a multi-dimensional array:
sapply(1:5,function(x) matrix(x,2,2), simplify = "array")
Each of these behaviors is of course contingent on our function returning vectors or matrices of the same length or dimension.
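The shapes described above can be checked directly. This is a small sketch using a deterministic function in place of rnorm, so the dimensions are easy to verify; the variable names are only for illustration.

```r
# Each call returns a 2 x 2 matrix filled with x
flat <- sapply(1:5, function(x) matrix(x, 2, 2))
arr  <- sapply(1:5, function(x) matrix(x, 2, 2), simplify = "array")

dim(flat)  # each matrix flattened into one column: 4 rows, 5 columns
dim(arr)   # matrices stacked along a third dimension: 2 x 2 x 5

# When the returned lengths differ, sapply cannot simplify and returns a list
ragged <- sapply(1:3, function(x) seq_len(x))
is.list(ragged)
```

The last call is the case to watch for: the return type of sapply silently changes when the function's output lengths are not all equal.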
- vapply - When you want to use sapply but perhaps need to squeeze some more speed out of your code.
For vapply, you basically give R an example of what sort of thing your function will return, which can save some time coercing returned values to fit in a single atomic vector.
x <- list(a = 1, b = 1:3, c = 10:100)
# Note that since the advantage here is mainly speed, this
# example is only for illustration. We're telling R that
# everything returned by length() should be an integer of
# length 1.
vapply(x, FUN = length, FUN.VALUE = 0L)
 a  b  c
 1  3 91
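Besides speed, the template passed as FUN.VALUE is also checked against every result, so a function that returns something of the wrong length or type fails loudly instead of changing the output shape. A minimal sketch, reusing the list x from above:

```r
x <- list(a = 1, b = 1:3, c = 10:100)

# Template says: one integer per element
lens <- vapply(x, length, FUN.VALUE = integer(1))

# Template says: two numbers per element, so the result is a 2 x 3 matrix
rng <- vapply(x, range, FUN.VALUE = numeric(2))

# The identity function returns 3 values for x$b, which violates the
# length-1 template, so vapply raises an error that we catch here
bad <- tryCatch(
  vapply(x, function(v) v, FUN.VALUE = numeric(1)),
  error = function(e) "error"
)
```

With sapply the last call would quietly return a list; with vapply the mismatch is reported at the point where it happens.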
- mapply - For when you have several data structures (e.g. vectors, lists) and you want to apply a function to the 1st elements of each, and then the 2nd elements of each, etc., coercing the result to a vector/array as in sapply.
This is multivariate in the sense that your function must accept multiple arguments.
# Sums the 1st elements, the 2nd elements, etc.
mapply(sum, 1:5, 1:5, 1:5)
[1]  3  6  9 12 15
# To do rep(1,4), rep(2,3), etc.
mapply(rep, 1:4, 4:1)
[[1]]
[1] 1 1 1 1

[[2]]
[1] 2 2 2

[[3]]
[1] 3 3

[[4]]
[1] 4
- Map - A wrapper to mapply with SIMPLIFY = FALSE, so it is guaranteed to return a list.
Map(sum, 1:5, 1:5, 1:5)
[[1]]
[1] 3

[[2]]
[1] 6

[[3]]
[1] 9

[[4]]
[1] 12

[[5]]
[1] 15
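To make the "multiple arguments" point concrete, here is a small sketch with a two-argument function. The helper make_label is hypothetical and exists only for this illustration; MoreArgs is the standard mapply argument for values that should stay fixed across calls.

```r
# Hypothetical helper: builds a label from a name and a count
make_label <- function(name, n) paste0(name, "_", n)

# mapply walks the two vectors in parallel and simplifies to a character vector
labels <- mapply(make_label, c("a", "b", "c"), 1:3, USE.NAMES = FALSE)

# MoreArgs supplies arguments that are the same for every call
reps <- mapply(rep, times = 1:3, MoreArgs = list(x = "z"))

# Map gives the unsimplified list version of the first call
as_list <- Map(make_label, c("a", "b", "c"), 1:3)
```

Here labels is a plain character vector, reps is a list because its elements have different lengths, and as_list is always a list regardless of what the function returns.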
- rapply - For when you want to apply a function to each element of a nested list structure, recursively.
To give you some idea of how uncommon rapply is, I forgot about it when first posting this answer! Obviously, I'm sure many people use it, but YMMV. rapply is best illustrated with a user-defined function to apply:
# Append ! to string, otherwise increment
myFun <- function(x){
    if(is.character(x)){
      return(paste(x,"!",sep=""))
    }
    else{
      return(x + 1)
    }
}

# A nested list structure
l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"),
          b = 3, c = "Yikes",
          d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))

# Result is named vector, coerced to character
rapply(l, myFun)

# Result is a nested list like l, with values altered
rapply(l, myFun, how="replace")
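An alternative to branching on the type inside the function is rapply's classes argument, which restricts the function to leaves of a given class. The following is a sketch on a smaller, hypothetical nested list:

```r
# Hypothetical nested list mixing character and numeric leaves
l <- list(a = list(a1 = "Boo", b1 = 2), b = 3)

# Only numeric leaves are passed to the function; other leaves are kept as is
scaled <- rapply(l, function(x) x * 10, classes = "numeric", how = "replace")

# With how = "unlist", leaves of other classes are dropped from the result
nums <- rapply(l, function(x) x * 10, classes = "numeric", how = "unlist")
```

In scaled the string "Boo" survives untouched, while nums contains only the two transformed numbers.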
- tapply - For when you want to apply a function to subsets of a vector and the subsets are defined by some other vector, usually a factor.
The black sheep of the *apply family, of sorts. The help file's use of the phrase "ragged array" can be a bit confusing, but it is actually quite simple.
A vector:
x <- 1:20
A factor (of the same length!) defining groups:
y <- factor(rep(letters[1:5], each = 4))
Add up the values in x within each subgroup defined by y:
tapply(x, y, sum)
 a  b  c  d  e
10 26 42 58 74
More complex examples can be handled where the subgroups are defined by the unique combinations of a list of several factors. tapply is similar in spirit to the split-apply-combine functions that are common in R (aggregate, by, ave, ddply, etc.). Hence its black sheep status.
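The "list of several factors" case mentioned above can be sketched as follows. The grouping vectors g1 and g2 are made up for the illustration; the result has one cell per combination of their levels.

```r
x  <- 1:8
g1 <- rep(c("a", "b"), each = 4)    # a a a a b b b b
g2 <- rep(c("u", "v"), times = 4)   # u v u v u v u v

# One cell per combination of g1 and g2; the result is a 2 x 2 matrix
res <- tapply(x, list(g1, g2), sum)

res["a", "u"]   # 1 + 3
res["b", "v"]   # 6 + 8
```

The levels of the first factor become the row names and those of the second the column names, which is what the question means by the grouping being "pushed to the row/col names".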
#2
167
On the side note, here is how the various plyr functions correspond to the base *apply functions (from the intro to plyr document from the plyr webpage http://had.co.nz/plyr/)
Base function   Input   Output   plyr function
----------------------------------------------
aggregate       d       d        ddply + colwise
apply           a       a/l      aaply / alply
by              d       l        dlply
lapply          l       l        llply
mapply          a       a/l      maply / mlply
replicate       r       a/l      raply / rlply
sapply          l       a        laply
One of the goals of plyr is to provide consistent naming conventions for each of the functions, encoding the input and output data types in the function name. It also provides consistency in output, in that output from dlply() is easily passable to ldply() to produce useful output, etc.
Conceptually, learning plyr is no more difficult than understanding the base *apply functions.
plyr and reshape functions have replaced almost all of these functions in my every day use. But, also from the Intro to Plyr document:
Related functions tapply and sweep have no corresponding function in plyr, and remain useful. merge is useful for combining summaries with the original data.
#3
116
From slide 21 of http://www.slideshare.net/hadley/plyr-one-data-analytic-strategy:
(Hopefully it's clear that apply corresponds to @Hadley's aaply and aggregate corresponds to @Hadley's ddply etc. Slide 20 of the same slideshare will clarify if you don't get it from this image.)
(on the left is input, on the top is output)
#4
84
First start with Joran's excellent answer -- doubtful anything can better that.
Then the following mnemonics may help to remember the distinctions between each. Whilst some are obvious, others may be less so -- for these you'll find justification in Joran's discussions.
Mnemonics
- lapply is a list apply which acts on a list or vector and returns a list.
- sapply is a simple lapply (function defaults to returning a vector or matrix when possible)
- vapply is a verified apply (allows the return object type to be prespecified)
- rapply is a recursive apply for nested lists, i.e. lists within lists
- tapply is a tagged apply where the tags identify the subsets
- apply is generic: applies a function to a matrix's rows or columns (or, more generally, to dimensions of an array)
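The first three mnemonics can be seen side by side on one input. This is only a sketch with a made-up list; the point is that the computation is identical and only the container of the result changes.

```r
x <- list(a = 1:3, b = 4:6)

as_list   <- lapply(x, mean)               # list apply: always a list
as_vector <- sapply(x, mean)               # simple apply: simplified to a named vector
checked   <- vapply(x, mean, numeric(1))   # verified apply: return type declared up front
```

Here as_vector and checked hold the same named numeric vector, while as_list holds the same two values as list elements.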
Building the Right Background
If using the apply family still feels a bit alien to you, then it might be that you're missing a key point of view.
These two articles can help. They provide the necessary background to motivate the functional programming techniques that are being provided by the apply family of functions.
Users of Lisp will recognise the paradigm immediately. If you're not familiar with Lisp, once you get your head around FP, you'll have gained a powerful point of view for use in R -- and apply will make a lot more sense.
- Advanced R: Functional Programming, by Hadley Wickham
- Simple Functional Programming in R, by Michael Barton
#5
34
I noticed that the (very excellent) answers in this post lack explanations of by and aggregate, so here is my contribution.
BY
The by function, as stated in the documentation, can be thought of as a "wrapper" for tapply. The power of by arises when we want to compute a task that tapply can't handle. One example is this code:
ct <- tapply(iris$Sepal.Width , iris$Species , summary )
cb <- by(iris$Sepal.Width , iris$Species , summary )
cb
iris$Species: setosa
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.300 3.200 3.400 3.428 3.675 4.400
--------------------------------------------------------------
iris$Species: versicolor
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.525 2.800 2.770 3.000 3.400
--------------------------------------------------------------
iris$Species: virginica
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.200 2.800 3.000 2.974 3.175 3.800
ct
$setosa
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.300 3.200 3.400 3.428 3.675 4.400
$versicolor
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.525 2.800 2.770 3.000 3.400
$virginica
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.200 2.800 3.000 2.974 3.175 3.800
If we print these two objects, ct and cb, we "essentially" have the same results and the only differences are in how they are shown and the different class attributes, respectively by for cb and array for ct.
As I've said, the power of by arises when we can't use tapply; the following code is one example:
tapply(iris, iris$Species, summary )
Error in tapply(iris, iris$Species, summary) :
arguments must have same length
R says that the arguments must have the same length. What we want is the summary of every variable in iris along the factor Species, but tapply can't do that because it does not know how to handle a data frame as its first argument.
With the by function, R dispatches a specific method for the data frame class and then lets the summary function work even if the length of the first argument (and the type too) is different.
bywork <- by(iris, iris$Species, summary )
bywork
iris$Species: setosa
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.300 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:4.800 1st Qu.:3.200 1st Qu.:1.400 1st Qu.:0.200 versicolor: 0
Median :5.000 Median :3.400 Median :1.500 Median :0.200 virginica : 0
Mean :5.006 Mean :3.428 Mean :1.462 Mean :0.246
3rd Qu.:5.200 3rd Qu.:3.675 3rd Qu.:1.575 3rd Qu.:0.300
Max. :5.800 Max. :4.400 Max. :1.900 Max. :0.600
--------------------------------------------------------------
iris$Species: versicolor
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.900 Min. :2.000 Min. :3.00 Min. :1.000 setosa : 0
1st Qu.:5.600 1st Qu.:2.525 1st Qu.:4.00 1st Qu.:1.200 versicolor:50
Median :5.900 Median :2.800 Median :4.35 Median :1.300 virginica : 0
Mean :5.936 Mean :2.770 Mean :4.26 Mean :1.326
3rd Qu.:6.300 3rd Qu.:3.000 3rd Qu.:4.60 3rd Qu.:1.500
Max. :7.000 Max. :3.400 Max. :5.10 Max. :1.800
--------------------------------------------------------------
iris$Species: virginica
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.900 Min. :2.200 Min. :4.500 Min. :1.400 setosa : 0
1st Qu.:6.225 1st Qu.:2.800 1st Qu.:5.100 1st Qu.:1.800 versicolor: 0
Median :6.500 Median :3.000 Median :5.550 Median :2.000 virginica :50
Mean :6.588 Mean :2.974 Mean :5.552 Mean :2.026
3rd Qu.:6.900 3rd Qu.:3.175 3rd Qu.:5.875 3rd Qu.:2.300
Max. :7.900 Max. :3.800 Max. :6.900 Max. :2.500
It works indeed and the result is very surprising. It is an object of class by that, along Species (say, for each of them), computes the summary of each variable.
Note that if the first argument is a data frame, the dispatched function must have a method for that class of objects. For example, if we use this code with the mean function, we get a result that makes no sense at all:
by(iris, iris$Species, mean)
iris$Species: setosa
[1] NA
-------------------------------------------
iris$Species: versicolor
[1] NA
-------------------------------------------
iris$Species: virginica
[1] NA
Warning messages:
1: In mean.default(data[x, , drop = FALSE], ...) :
argument is not numeric or logical: returning NA
2: In mean.default(data[x, , drop = FALSE], ...) :
argument is not numeric or logical: returning NA
3: In mean.default(data[x, , drop = FALSE], ...) :
argument is not numeric or logical: returning NA
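One way around this, sketched here as an assumption about what the reader wants (per-species means of the numeric columns), is to pass by a function that does accept a data frame, such as colMeans, and to drop the non-numeric Species column first:

```r
# Numeric columns only, so colMeans() receives a valid data frame per group
means_by <- by(iris[, 1:4], iris$Species, colMeans)

# Each element is a named numeric vector for one species
means_by[["setosa"]]
```

The result is still an object of class by, so it prints in the same grouped layout as the summary example above.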
AGGREGATE
aggregate can be seen as another way of using tapply if we use it in such a way.
at <- tapply(iris$Sepal.Length , iris$Species , mean)
ag <- aggregate(iris$Sepal.Length , list(iris$Species), mean)
at
setosa versicolor virginica
5.006 5.936 6.588
ag
Group.1 x
1 setosa 5.006
2 versicolor 5.936
3 virginica 6.588
The two immediate differences are that the second argument of aggregate must be a list while for tapply it can (but need not) be a list, and that the output of aggregate is a data frame while that of tapply is an array.
The power of aggregate is that it can easily handle subsets of the data with the subset argument and that it has methods for ts objects and formula as well.
These elements make aggregate easier to work with than tapply in some situations. Here are some examples (available in documentation):
ag <- aggregate(len ~ ., data = ToothGrowth, mean)
ag
supp dose len
1 OJ 0.5 13.23
2 VC 0.5 7.98
3 OJ 1.0 22.70
4 VC 1.0 16.77
5 OJ 2.0 26.06
6 VC 2.0 26.14
We can achieve the same with tapply but the syntax is slightly harder and the output (in some circumstances) less readable:
att <- tapply(ToothGrowth$len, list(ToothGrowth$dose, ToothGrowth$supp), mean)
att
OJ VC
0.5 13.23 7.98
1 22.70 16.77
2 26.06 26.14
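The subset argument mentioned earlier is available in the formula method of aggregate. As a small sketch, restricting the ToothGrowth example to the highest dose reproduces the last row of the tables above:

```r
# Mean tooth length per supplement, restricted to dose == 2
ag_sub <- aggregate(len ~ supp, data = ToothGrowth, FUN = mean,
                    subset = dose == 2)
ag_sub
```

The subset expression is evaluated inside the data frame, so dose can be referred to without the ToothGrowth$ prefix.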
There are other times when we can't use by or tapply and we have to use aggregate.
ag1 <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)
ag1
Month Ozone Temp
1 5 23.61538 66.73077
2 6 29.44444 78.22222
3 7 59.11538 83.88462
4 8 59.96154 83.96154
5 9 31.44828 76.89655
We cannot obtain the previous result with tapply in one call; we have to calculate the mean along Month for each element and then combine them. Also note that we have to pass na.rm = TRUE, because the formula method of the aggregate function has by default na.action = na.omit. The Temp means below still differ from the aggregate result, because na.omit drops every row in which any variable is NA, whereas na.rm = TRUE removes missing values column by column:
ta1 <- tapply(airquality$Ozone, airquality$Month, mean, na.rm = TRUE)
ta2 <- tapply(airquality$Temp, airquality$Month, mean, na.rm = TRUE)
cbind(ta1, ta2)
ta1 ta2
5 23.61538 65.54839
6 29.44444 79.10000
7 59.11538 83.90323
8 59.96154 83.96774
9 31.44828 76.90000
while with by we just can't achieve that; in fact, the following function call returns an error (but most likely it is related to the supplied function, mean):
by(airquality[c("Ozone", "Temp")], airquality$Month, mean, na.rm = TRUE)
Other times the results are the same and the differences are just in the class of the object (and therefore in how it is shown/printed and, for example, how to subset it):
byagg <- by(airquality[c("Ozone", "Temp")], airquality$Month, summary)
aggagg <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, summary)
The previous code achieves the same goal and results; at some point, which tool to use is just a matter of personal taste and needs. The previous two objects have very different needs in terms of subsetting.
#6
27
There are lots of great answers which discuss differences in the use cases for each function. None of the answers discusses the differences in performance. That is reasonable, because the various functions expect various inputs and produce various outputs, yet most of them share the general objective of evaluating by series/groups. My answer is going to focus on performance. Because of the above, the creation of the input from the vectors is included in the timing, and the apply function is not measured.
I have tested two different functions, sum and length, at once. Volume tested is 50M on input and 50K on output. I have also included two currently popular packages which were not widely used at the time the question was asked, data.table and dplyr. Both are definitely worth a look if you are aiming for good performance.
library(dplyr)
library(data.table)
set.seed(123)
n = 5e7
k = 5e5
x = runif(n)
grp = sample(k, n, TRUE)
timing = list()
# sapply
timing[["sapply"]] = system.time({
lt = split(x, grp)
r.sapply = sapply(lt, function(x) list(sum(x), length(x)), simplify = FALSE)
})
# lapply
timing[["lapply"]] = system.time({
lt = split(x, grp)
r.lapply = lapply(lt, function(x) list(sum(x), length(x)))
})
# tapply
timing[["tapply"]] = system.time(
r.tapply <- tapply(x, list(grp), function(x) list(sum(x), length(x)))
)
# by
timing[["by"]] = system.time(
r.by <- by(x, list(grp), function(x) list(sum(x), length(x)), simplify = FALSE)
)
# aggregate
timing[["aggregate"]] = system.time(
r.aggregate <- aggregate(x, list(grp), function(x) list(sum(x), length(x)), simplify = FALSE)
)
# dplyr
timing[["dplyr"]] = system.time({
df = tibble(x, grp)  # data_frame() is deprecated in current tibble/dplyr
r.dplyr = summarise(group_by(df, grp), sum(x), n())
})
# data.table
timing[["data.table"]] = system.time({
dt = setnames(setDT(list(x, grp)), c("x","grp"))
r.data.table = dt[, .(sum(x), .N), grp]
})
# all output size match to group count
sapply(list(sapply=r.sapply, lapply=r.lapply, tapply=r.tapply, by=r.by, aggregate=r.aggregate, dplyr=r.dplyr, data.table=r.data.table),
function(x) (if(is.data.frame(x)) nrow else length)(x)==k)
# sapply lapply tapply by aggregate dplyr data.table
# TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# print timings
as.data.table(sapply(timing, `[[`, "elapsed"), keep.rownames = TRUE
)[,.(fun = V1, elapsed = V2)
][order(-elapsed)]
# fun elapsed
#1: aggregate 109.139
#2: by 25.738
#3: dplyr 18.978
#4: tapply 17.006
#5: lapply 11.524
#6: sapply 11.326
#7: data.table 2.686
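Before comparing timings it is worth confirming that the base approaches really compute the same thing. The sketch below does that on a tiny, made-up input using only base R; it is not part of the benchmark above and says nothing about speed.

```r
x   <- c(1, 2, 3, 4, 5, 6)
grp <- c(1, 1, 2, 2, 3, 3)

# Three routes to the same per-group sums
via_split  <- sapply(split(x, grp), sum)       # named numeric vector
via_tapply <- tapply(x, grp, sum)              # 1-d array
via_agg    <- aggregate(x, list(grp = grp), sum)  # data frame with columns grp and x
```

The values agree across the three results; what differs is the container, which is also part of why the timings above are not directly comparable function by function.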
#7
19
It is maybe worth mentioning ave. ave is tapply's friendly cousin. It returns results in a form that you can plug straight back into your data frame.
dfr <- data.frame(a=1:20, f=rep(LETTERS[1:5], each=4))
means <- tapply(dfr$a, dfr$f, mean)
## A B C D E
## 2.5 6.5 10.5 14.5 18.5
## great, but putting it back in the data frame is another line:
dfr$m <- means[dfr$f]
dfr$m2 <- ave(dfr$a, dfr$f, FUN=mean) # NB argument name FUN is needed!
dfr
## a f m m2
## 1 A 2.5 2.5
## 2 A 2.5 2.5
## 3 A 2.5 2.5
## 4 A 2.5 2.5
## 5 B 6.5 6.5
## 6 B 6.5 6.5
## 7 B 6.5 6.5
## ...
There is nothing in the base package that works like ave for whole data frames (as by is like tapply for data frames). But you can fudge it:
dfr$foo <- ave(1:nrow(dfr), dfr$f, FUN=function(x) {
x <- dfr[x,]
sum(x$m*x$m2)
})
dfr
## a f m m2 foo
## 1 1 A 2.5 2.5 25
## 2 2 A 2.5 2.5 25
## 3 3 A 2.5 2.5 25
## ...
#8
19
Despite all the great answers here, there are 2 more base functions that deserve to be mentioned, the useful outer function and the obscure eapply function.
outer
outer is a very useful function hidden as a more mundane one. If you read the help for outer its description says:
The outer product of the arrays X and Y is the array A with dimension
c(dim(X), dim(Y)) where element A[c(arrayindex.x, arrayindex.y)] =
FUN(X[arrayindex.x], Y[arrayindex.y], ...).
which makes it seem like this is only useful for linear algebra type things. However, it can be used much like mapply to apply a function to two vectors of inputs. The difference is that mapply will apply the function to the first two elements and then the second two etc, whereas outer will apply the function to every combination of one element from the first vector and one from the second. For example:
A<-c(1,3,5,7,9)
B<-c(0,3,6,9,12)
mapply(FUN=pmax, A, B)
[1]  1  3  6  9 12
outer(A,B, pmax)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    6    9   12
[2,]    3    3    6    9   12
[3,]    5    5    6    9   12
[4,]    7    7    7    9   12
[5,]    9    9    9    9   12
I have personally used this when I have a vector of values and a vector of conditions and wish to see which values meet which conditions.
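A sketch of that values-against-conditions use, with made-up numbers: each row is a value, each column a threshold, and each cell says whether the value exceeds the threshold. Note that outer calls FUN once on the fully expanded vectors, so FUN must be vectorised, as ">" is.

```r
# Hypothetical data: three values and two thresholds
values     <- c(2, 5, 8)
thresholds <- c(3, 6)

# Logical matrix: does value i exceed threshold j?
meets <- outer(values, thresholds, ">")
dimnames(meets) <- list(values, thresholds)
meets
```

Labelling the dimensions with the inputs makes the resulting table readable at a glance.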
eapply
eapply is like lapply except that rather than applying a function to every element in a list, it applies a function to every element in an environment. For example if you want to find a list of user defined functions in the global environment:
A<-c(1,3,5,7,9)
B<-c(0,3,6,9,12)
C<-list(x=1, y=2)
D<-function(x){x+1}
> eapply(.GlobalEnv, is.function)
$A
[1] FALSE
$B
[1] FALSE
$C
[1] FALSE
$D
[1] TRUE
Frankly I don't use this very much but if you are building a lot of packages or create a lot of environments it may come in handy.
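The same idea can be tried without touching the global environment. This sketch builds a small, hypothetical environment and filters the eapply result down to the names of the functions it contains:

```r
# Hypothetical environment, so the result does not depend on the workspace
e <- new.env()
assign("A", c(1, 3, 5), envir = e)
assign("D", function(x) x + 1, envir = e)

# eapply returns a named list; the order of its elements is not guaranteed
is_fun <- eapply(e, is.function)

# Keep only the entries that are TRUE and read off their names
fun_names <- names(Filter(isTRUE, is_fun))
fun_names
```

Because the element order is arbitrary, it is safer to look results up by name than by position.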
#9
4
I recently discovered the rather useful sweep function and add it here for the sake of completeness:
sweep
The basic idea is to sweep through an array row- or column-wise and return a modified array. An example will make this clear (source: datacamp):
Let's say you have a matrix and want to standardize it column-wise:
dataPoints <- matrix(4:15, nrow = 4)
# Find means per column with `apply()`
dataPoints_means <- apply(dataPoints, 2, mean)
# Find standard deviation with `apply()`
dataPoints_sdev <- apply(dataPoints, 2, sd)
# Center the points
dataPoints_Trans1 <- sweep(dataPoints, 2, dataPoints_means,"-")
print(dataPoints_Trans1)
## [,1] [,2] [,3]
## [1,] -1.5 -1.5 -1.5
## [2,] -0.5 -0.5 -0.5
## [3,] 0.5 0.5 0.5
## [4,] 1.5 1.5 1.5
# Return the result
dataPoints_Trans1
## [,1] [,2] [,3]
## [1,] -1.5 -1.5 -1.5
## [2,] -0.5 -0.5 -0.5
## [3,] 0.5 0.5 0.5
## [4,] 1.5 1.5 1.5
# Normalize
dataPoints_Trans2 <- sweep(dataPoints_Trans1, 2, dataPoints_sdev, "/")
# Return the result
dataPoints_Trans2
## [,1] [,2] [,3]
## [1,] -1.1618950 -1.1618950 -1.1618950
## [2,] -0.3872983 -0.3872983 -0.3872983
## [3,] 0.3872983 0.3872983 0.3872983
## [4,] 1.1618950 1.1618950 1.1618950
NB: for this simple example the same result can of course be achieved more easily by apply(dataPoints, 2, scale)
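The claim in the NB can be checked directly. This sketch repeats the two sweep steps in compact form and compares the result with the scale shortcut; the variable names are chosen for this illustration only.

```r
dataPoints <- matrix(4:15, nrow = 4)

# Centre the columns, then divide by the column standard deviations
centred  <- sweep(dataPoints, 2, colMeans(dataPoints), "-")
by_sweep <- sweep(centred, 2, apply(dataPoints, 2, sd), "/")

# The shortcut mentioned in the NB
by_scale <- apply(dataPoints, 2, scale)

all.equal(by_sweep, by_scale)
```

sweep remains the more general tool, since the statistic being swept out and the operation applied can be anything, not just the mean and standard deviation.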
#1
1172
R has many *apply functions which are ably described in the help files (e.g. ?apply
). There are enough of them, though, that beginning useRs may have difficulty deciding which one is appropriate for their situation or even remembering them all. They may have a general sense that "I should be using an *apply function here", but it can be tough to keep them all straight at first.
R有许多*应用功能,这些功能在帮助文件中可以很好地描述(例如,应用)。但是,有足够多的人,开始的用户可能很难决定哪一个适合他们的情况,甚至是记住他们。他们可能有一种普遍的感觉,即“我应该在这里使用*apply函数”,但要在一开始就把它们都弄清楚是很困难的。
Despite the fact (noted in other answers) that much of the functionality of the *apply family is covered by the extremely popular plyr
package, the base functions remain useful and worth knowing.
尽管事实(在其他答案中指出),*应用家庭的大部分功能都被非常流行的plyr包所覆盖,但是基本功能仍然有用并且值得了解。
This answer is intended to act as a sort of signpost for new useRs to help direct them to the correct *apply function for their particular problem. Note, this is not intended to simply regurgitate or replace the R documentation! The hope is that this answer helps you to decide which *apply function suits your situation and then it is up to you to research it further. With one exception, performance differences will not be addressed.
这个答案的目的是作为新用户的一种路标,帮助他们指导他们正确的应用功能。注意,这不是简单地反刍或取代R文档!希望这个答案可以帮助你决定哪个*应用功能适合你的情况,然后你可以进一步研究它。只有一个例外,性能差异不会得到解决。
-
apply - When you want to apply a function to the rows or columns of a matrix (and higher-dimensional analogues); not generally advisable for data frames as it will coerce to a matrix first.
应用—当您想要将一个函数应用到矩阵的行或列(以及高维的类似物)时;一般来说,对于数据帧来说,这是不可取的,因为它会首先强制一个矩阵。
# Two dimensional matrix M <- matrix(seq(1,16), 4, 4) # apply min to rows apply(M, 1, min) [1] 1 2 3 4 # apply max to columns apply(M, 2, max) [1] 4 8 12 16 # 3 dimensional array M <- array( seq(32), dim = c(4,4,2)) # Apply sum across each M[*, , ] - i.e Sum across 2nd and 3rd dimension apply(M, 1, sum) # Result is one-dimensional [1] 120 128 136 144 # Apply sum across each M[*, *, ] - i.e Sum across 3rd dimension apply(M, c(1,2), sum) # Result is two-dimensional [,1] [,2] [,3] [,4] [1,] 18 26 34 42 [2,] 20 28 36 44 [3,] 22 30 38 46 [4,] 24 32 40 48
If you want row/column means or sums for a 2D matrix, be sure to investigate the highly optimized, lightning-quick
colMeans
,rowMeans
,colSums
,rowSums
.如果您想要一个2D矩阵的行/列方法或和,一定要研究高度优化的、闪电快速的colMeans、rowMeans、colsum、rowsum。
-
lapply - When you want to apply a function to each element of a list in turn and get a list back.
lapply——当您想要将一个函数应用到列表中的每个元素时,然后返回一个列表。
This is the workhorse of many of the other *apply functions. Peel back their code and you will often find
lapply
underneath.这是许多其他*应用函数的工作马。剥去他们的代码,你会发现下面是lapply。
x <- list(a = 1, b = 1:3, c = 10:100) lapply(x, FUN = length) $a [1] 1 $b [1] 3 $c [1] 91 lapply(x, FUN = sum) $a [1] 1 $b [1] 6 $c [1] 5005
-
sapply - When you want to apply a function to each element of a list in turn, but you want a vector back, rather than a list.
当你想要将一个函数应用到列表中的每个元素时,你需要的是一个向量,而不是一个列表。
If you find yourself typing
unlist(lapply(...))
, stop and considersapply
.如果你发现自己在键入unlist(lapply(…)),停止并考虑sapply。
x <- list(a = 1, b = 1:3, c = 10:100) # Compare with above; a named vector, not a list sapply(x, FUN = length) a b c 1 3 91 sapply(x, FUN = sum) a b c 1 6 5005
In more advanced uses of sapply it will attempt to coerce the result to a multi-dimensional array, if appropriate. For example, if our function returns vectors of the same length, sapply will use them as columns of a matrix:
sapply(1:5,function(x) rnorm(3,x))
If our function returns a 2 dimensional matrix, sapply will do essentially the same thing, treating each returned matrix as a single long vector:
sapply(1:5,function(x) matrix(x,2,2))
Unless we specify simplify = "array", in which case it will use the individual matrices to build a multi-dimensional array:
sapply(1:5,function(x) matrix(x,2,2), simplify = "array")
Each of these behaviors is of course contingent on our function returning vectors or matrices of the same length or dimension.
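To make that last caveat concrete, the following sketch (with made-up inputs) contrasts a function whose results all have the same length with one whose results do not. In the second case sapply cannot simplify and hands back a plain list.

```r
# Results of equal length (2 each): simplified to a 2 x 3 matrix
same_len <- sapply(1:3, function(i) c(i, i^2))

# Results of differing lengths: no simplification, a list is returned
diff_len <- sapply(1:3, function(i) seq_len(i))

is.matrix(same_len)
dim(same_len)
is.list(diff_len)
```

This silent change of return type is one reason to check the shape of what the function returns before relying on the simplified result.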
-
vapply - When you want to use sapply but perhaps need to squeeze some more speed out of your code.
For vapply, you basically give R an example of what sort of thing your function will return, which can save some time coercing returned values to fit in a single atomic vector.
x <- list(a = 1, b = 1:3, c = 10:100)
#Note that since the advantage here is mainly speed, this
# example is only for illustration. We're telling R that
# everything returned by length() should be an integer of
# length 1.
vapply(x, FUN = length, FUN.VALUE = 0L)
a  b  c
1  3 91
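Besides speed, the template passed as FUN.VALUE also acts as a check on what the function returns. The sketch below is an illustration under that reading, not part of the original answer: it shows a template for two values per element, and what happens when the template does not match.

```r
x <- list(a = 1, b = 1:3, c = 10:100)

# Matches the template: one integer per element
lens <- vapply(x, FUN = length, FUN.VALUE = integer(1))

# range() returns two values per element, so the template has length 2
# and the result is a 2 x 3 matrix
rngs <- vapply(x, FUN = range, FUN.VALUE = numeric(2))

# A mismatched template raises an error instead of silently changing shape
bad <- tryCatch(
  vapply(x, FUN = range, FUN.VALUE = numeric(1)),
  error = function(e) "error"
)
bad
```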
-
mapply - For when you have several data structures (e.g. vectors, lists) and you want to apply a function to the 1st elements of each, and then the 2nd elements of each, etc., coercing the result to a vector/array as in sapply.
This is multivariate in the sense that your function must accept multiple arguments.
#Sums the 1st elements, the 2nd elements, etc.
mapply(sum, 1:5, 1:5, 1:5)
[1]  3  6  9 12 15

#To do rep(1,4), rep(2,3), etc.
mapply(rep, 1:4, 4:1)
[[1]]
[1] 1 1 1 1

[[2]]
[1] 2 2 2

[[3]]
[1] 3 3

[[4]]
[1] 4
-
Map - A wrapper to mapply with SIMPLIFY = FALSE, so it is guaranteed to return a list.
Map(sum, 1:5, 1:5, 1:5)
[[1]]
[1] 3

[[2]]
[1] 6

[[3]]
[1] 9

[[4]]
[1] 12

[[5]]
[1] 15
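The relationship between the two can be checked directly. This sketch uses made-up inputs. It also shows the MoreArgs argument of mapply, which the answer above does not cover: it supplies arguments that are held fixed rather than iterated over.

```r
# Map() gives the same result as mapply() with simplification switched off;
# element names are taken from the first iterated argument
m1 <- Map(function(x, y) x + y, c(a = 1, b = 2), c(10, 20))
m2 <- mapply(function(x, y) x + y, c(a = 1, b = 2), c(10, 20), SIMPLIFY = FALSE)
identical(m1, m2)

# MoreArgs holds x fixed while times varies: rep("z", 1), rep("z", 2), rep("z", 3)
reps <- mapply(rep, times = 1:3, MoreArgs = list(x = "z"))
reps
```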
-
rapply - For when you want to apply a function to each element of a nested list structure, recursively.
To give you some idea of how uncommon rapply is, I forgot about it when first posting this answer! Obviously, I'm sure many people use it, but YMMV. rapply is best illustrated with a user-defined function to apply:
# Append ! to string, otherwise increment
myFun <- function(x){
    if(is.character(x)){
        return(paste(x,"!",sep=""))
    }
    else{
        return(x + 1)
    }
}

#A nested list structure
l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"),
          b = 3, c = "Yikes",
          d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))

# Result is named vector, coerced to character
rapply(l, myFun)

# Result is a nested list like l, with values altered
rapply(l, myFun, how="replace")
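An alternative to branching on type inside the function is the classes argument of rapply, which restricts the function to leaves of a given class. The sketch below uses a smaller, made-up nested list and is only meant to illustrate that argument.

```r
nested <- list(a = list(a1 = "Boo", b1 = 2), b = 3, c = "Yikes")

# how = "unlist": only numeric leaves are processed; the others are dropped
doubled <- rapply(nested, function(v) v * 2, classes = "numeric", how = "unlist")

# how = "replace": non-matching leaves are kept unchanged in place
replaced <- rapply(nested, function(v) v * 2, classes = "numeric", how = "replace")

doubled
str(replaced)
```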
-
tapply - For when you want to apply a function to subsets of a vector and the subsets are defined by some other vector, usually a factor.
The black sheep of the *apply family, of sorts. The help file's use of the phrase "ragged array" can be a bit confusing, but it is actually quite simple.
A vector:
x <- 1:20
A factor (of the same length!) defining groups:
y <- factor(rep(letters[1:5], each = 4))
Add up the values in x within each subgroup defined by y:
tapply(x, y, sum)
 a  b  c  d  e
10 26 42 58 74
More complex examples can be handled where the subgroups are defined by the unique combinations of a list of several factors. tapply is similar in spirit to the split-apply-combine functions that are common in R (aggregate, by, ave, ddply, etc.) Hence its black sheep status.
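A small sketch of the multi-factor case mentioned above, using made-up data: passing a list of two factors yields one cell per combination of levels, arranged as a matrix.

```r
x  <- 1:8
f1 <- factor(rep(c("a", "b"), each = 4))   # a a a a b b b b
f2 <- factor(rep(c("u", "v"), times = 4))  # u v u v u v u v

# Rows follow the levels of f1, columns follow the levels of f2
sums <- tapply(x, list(f1, f2), sum)
sums
```

Here the cell for ("a", "u") adds up x[1] and x[3], the cell for ("b", "v") adds up x[6] and x[8], and so on.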
#2
167
On the side note, here is how the various plyr functions correspond to the base *apply functions (from the intro to plyr document from the plyr webpage http://had.co.nz/plyr/)
Base function   Input   Output   plyr function
-----------------------------------------------
aggregate       d       d        ddply + colwise
apply           a       a/l      aaply / alply
by              d       l        dlply
lapply          l       l        llply
mapply          a       a/l      maply / mlply
replicate       r       a/l      raply / rlply
sapply          l       a        laply
One of the goals of plyr is to provide consistent naming conventions for each of the functions, encoding the input and output data types in the function name. It also provides consistency in output, in that output from dlply() is easily passable to ldply() to produce useful output, etc.
Conceptually, learning plyr is no more difficult than understanding the base *apply functions.
plyr and reshape functions have replaced almost all of these functions in my every day use. But, also from the Intro to Plyr document:
Related functions tapply and sweep have no corresponding function in plyr, and remain useful. merge is useful for combining summaries with the original data.
#3
116
From slide 21 of http://www.slideshare.net/hadley/plyr-one-data-analytic-strategy:
(Hopefully it's clear that apply corresponds to @Hadley's aaply and aggregate corresponds to @Hadley's ddply etc. Slide 20 of the same slideshare will clarify if you don't get it from this image.)
(on the left is input, on the top is output)
#4
84
First start with Joran's excellent answer -- doubtful anything can better that.
Then the following mnemonics may help to remember the distinctions between each. Whilst some are obvious, others may be less so; for these you'll find justification in Joran's discussions.
Mnemonics
- lapply is a list apply which acts on a list or vector and returns a list.
- sapply is a simple lapply (function defaults to returning a vector or matrix when possible)
- vapply is a verified apply (allows the return object type to be prespecified)
- rapply is a recursive apply for nested lists, i.e. lists within lists
- tapply is a tagged apply where the tags identify the subsets
- apply is generic: applies a function to a matrix's rows or columns (or, more generally, to dimensions of an array)
Building the Right Background
If using the apply family still feels a bit alien to you, then it might be that you're missing a key point of view.
These two articles can help. They provide the necessary background to motivate the functional programming techniques that are being provided by the apply family of functions.
Users of Lisp will recognise the paradigm immediately. If you're not familiar with Lisp, once you get your head around FP, you'll have gained a powerful point of view for use in R -- and apply will make a lot more sense.
- Advanced R: Functional Programming, by Hadley Wickham
- Simple Functional Programming in R, by Michael Barton
#5
34
I realized that the (very excellent) answers in this post lack explanations of by and aggregate, so here is my contribution.
BY
The by function, as stated in the documentation, can be thought of as a "wrapper" for tapply. The power of by arises when we want to compute a task that tapply can't handle. One example is this code:
ct <- tapply(iris$Sepal.Width , iris$Species , summary )
cb <- by(iris$Sepal.Width , iris$Species , summary )
cb
iris$Species: setosa
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.300 3.200 3.400 3.428 3.675 4.400
--------------------------------------------------------------
iris$Species: versicolor
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.525 2.800 2.770 3.000 3.400
--------------------------------------------------------------
iris$Species: virginica
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.200 2.800 3.000 2.974 3.175 3.800
ct
$setosa
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.300 3.200 3.400 3.428 3.675 4.400
$versicolor
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.525 2.800 2.770 3.000 3.400
$virginica
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.200 2.800 3.000 2.974 3.175 3.800
If we print these two objects, ct and cb, we "essentially" have the same results and the only differences are in how they are shown and the different class attributes, respectively by for cb and array for ct.
As I've said, the power of by arises when we can't use tapply; the following code is one example:
tapply(iris, iris$Species, summary )
Error in tapply(iris, iris$Species, summary) :
arguments must have same length
R says that the arguments must have the same length. What we are asking for is "calculate the summary of every variable in iris along the factor Species", but tapply can't do that because it does not know how to handle a data frame as its first argument.
With the by function, R dispatches a specific method for the data frame class and then lets the summary function work even if the length of the first argument (and the type too) is different.
bywork <- by(iris, iris$Species, summary )
bywork
iris$Species: setosa
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.300 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:4.800 1st Qu.:3.200 1st Qu.:1.400 1st Qu.:0.200 versicolor: 0
Median :5.000 Median :3.400 Median :1.500 Median :0.200 virginica : 0
Mean :5.006 Mean :3.428 Mean :1.462 Mean :0.246
3rd Qu.:5.200 3rd Qu.:3.675 3rd Qu.:1.575 3rd Qu.:0.300
Max. :5.800 Max. :4.400 Max. :1.900 Max. :0.600
--------------------------------------------------------------
iris$Species: versicolor
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.900 Min. :2.000 Min. :3.00 Min. :1.000 setosa : 0
1st Qu.:5.600 1st Qu.:2.525 1st Qu.:4.00 1st Qu.:1.200 versicolor:50
Median :5.900 Median :2.800 Median :4.35 Median :1.300 virginica : 0
Mean :5.936 Mean :2.770 Mean :4.26 Mean :1.326
3rd Qu.:6.300 3rd Qu.:3.000 3rd Qu.:4.60 3rd Qu.:1.500
Max. :7.000 Max. :3.400 Max. :5.10 Max. :1.800
--------------------------------------------------------------
iris$Species: virginica
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.900 Min. :2.200 Min. :4.500 Min. :1.400 setosa : 0
1st Qu.:6.225 1st Qu.:2.800 1st Qu.:5.100 1st Qu.:1.800 versicolor: 0
Median :6.500 Median :3.000 Median :5.550 Median :2.000 virginica :50
Mean :6.588 Mean :2.974 Mean :5.552 Mean :2.026
3rd Qu.:6.900 3rd Qu.:3.175 3rd Qu.:5.875 3rd Qu.:2.300
Max. :7.900 Max. :3.800 Max. :6.900 Max. :2.500
It works indeed, and the result is very surprising. It is an object of class by that, along Species (say, for each of them), computes the summary of each variable.
Note that if the first argument is a data frame, the dispatched function must have a method for that class of objects. For example, if we use this code with the mean function, we get a result that makes no sense at all:
by(iris, iris$Species, mean)
iris$Species: setosa
[1] NA
-------------------------------------------
iris$Species: versicolor
[1] NA
-------------------------------------------
iris$Species: virginica
[1] NA
Warning messages:
1: In mean.default(data[x, , drop = FALSE], ...) :
argument is not numeric or logical: returning NA
2: In mean.default(data[x, , drop = FALSE], ...) :
argument is not numeric or logical: returning NA
3: In mean.default(data[x, , drop = FALSE], ...) :
argument is not numeric or logical: returning NA
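One way around the NA results above, offered as a sketch rather than as the author's approach, is to pass a function that does accept a data frame. colMeans is such a function, provided the columns handed to it are all numeric, so the Species column is left out here.

```r
# Restrict to the four numeric columns; colMeans() accepts a data frame
means_by_species <- by(iris[, 1:4], iris$Species, colMeans)

# Stack the per-group vectors into a species x variable matrix
means_matrix <- do.call(rbind, means_by_species)
means_matrix
```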
AGGREGATE
aggregate can be seen as another way of using tapply, if we use it in the following way.
at <- tapply(iris$Sepal.Length , iris$Species , mean)
ag <- aggregate(iris$Sepal.Length , list(iris$Species), mean)
at
setosa versicolor virginica
5.006 5.936 6.588
ag
Group.1 x
1 setosa 5.006
2 versicolor 5.936
3 virginica 6.588
The two immediate differences are that the second argument of aggregate must be a list while for tapply it can (but need not) be a list, and that the output of aggregate is a data frame while that of tapply is an array.
The power of aggregate is that it can easily handle subsets of the data with the subset argument, and that it has methods for ts objects and formula as well.
These elements make aggregate easier to work with than tapply in some situations. Here are some examples (available in the documentation):
ag <- aggregate(len ~ ., data = ToothGrowth, mean)
ag
supp dose len
1 OJ 0.5 13.23
2 VC 0.5 7.98
3 OJ 1.0 22.70
4 VC 1.0 16.77
5 OJ 2.0 26.06
6 VC 2.0 26.14
We can achieve the same with tapply but the syntax is slightly harder and the output (in some circumstances) less readable:
att <- tapply(ToothGrowth$len, list(ToothGrowth$dose, ToothGrowth$supp), mean)
att
OJ VC
0.5 13.23 7.98
1 22.70 16.77
2 26.06 26.14
There are other times when we can't use by or tapply and we have to use aggregate.
ag1 <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)
ag1
Month Ozone Temp
1 5 23.61538 66.73077
2 6 29.44444 78.22222
3 7 59.11538 83.88462
4 8 59.96154 83.96154
5 9 31.44828 76.89655
We cannot obtain the previous result with tapply in one call; we have to calculate the mean along Month for each variable and then combine them (also note that we have to pass na.rm = TRUE, because the formula method of the aggregate function has na.action = na.omit by default):
ta1 <- tapply(airquality$Ozone, airquality$Month, mean, na.rm = TRUE)
ta2 <- tapply(airquality$Temp, airquality$Month, mean, na.rm = TRUE)
cbind(ta1, ta2)
ta1 ta2
5 23.61538 65.54839
6 29.44444 79.10000
7 59.11538 83.90323
8 59.96154 83.96774
9 31.44828 76.90000
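The Temp columns of the two results differ (for example 66.73 against 65.55 for month 5). A plausible reading, given the na.action = na.omit default mentioned above, is that the formula method drops every row in which Ozone is NA before computing either mean, whereas the separate tapply calls use all Temp values. The sketch below checks that reading; the variable names complete, temp_complete, ag_all and temp_all are introduced here for illustration.

```r
# Formula method: rows with NA in Ozone are dropped for BOTH columns
ag1 <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, FUN = mean)

# Reproduce that Temp column with tapply by dropping the same rows first
complete <- airquality[!is.na(airquality$Ozone), ]
temp_complete <- tapply(complete$Temp, complete$Month, mean)

# Alternatively keep all rows and let mean() skip NAs column by column
ag_all <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality,
                    FUN = mean, na.rm = TRUE, na.action = na.pass)
temp_all <- tapply(airquality$Temp, airquality$Month, mean)

all.equal(ag1$Temp, as.vector(temp_complete))
all.equal(ag_all$Temp, as.vector(temp_all))
```

Which of the two treatments of missing values is appropriate depends on the analysis; the point is only that the two calls above do not answer exactly the same question.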
With by, on the other hand, we just can't achieve that; in fact the following function call returns an error (but most likely this is related to the supplied function, mean):
by(airquality[c("Ozone", "Temp")], airquality$Month, mean, na.rm = TRUE)
Other times the results are the same and the differences are just in the class of the object (and hence in how it is shown/printed and, for example, how to subset it):
byagg <- by(airquality[c("Ozone", "Temp")], airquality$Month, summary)
aggagg <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, summary)
The previous code achieves the same goal and results; at some point, which tool to use is just a matter of personal taste and needs. The two objects above have very different needs in terms of subsetting.
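To illustrate what "different needs in terms of subsetting" can look like in practice, here is a sketch that recreates the two objects and pulls one piece out of each. The names may_summary and ozone_means are introduced here only for the example.

```r
byagg  <- by(airquality[c("Ozone", "Temp")], airquality$Month, summary)
aggagg <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, FUN = summary)

# by object: behaves like a list with one element per month;
# the first element is the summary table for month 5
may_summary <- byagg[[1]]

# aggregate object: a data frame whose Ozone and Temp columns are matrices,
# with one matrix column per summary statistic
ozone_means <- aggagg$Ozone[, "Mean"]

class(byagg)
class(may_summary)
colnames(aggagg$Ozone)
```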
#6
27
There are lots of great answers which discuss differences in the use cases for each function. None of the answers discuss the differences in performance. That is reasonable, because the various functions expect different input and produce different output, yet most of them share the general objective of evaluating a function by series/groups. My answer is going to focus on performance. Because of the above, the creation of the input from the vectors is included in the timing; also, the apply function is not measured.
I have tested two different functions, sum and length, at once. Volume tested is 50M on input and 50K on output. I have also included two currently popular packages which were not widely used at the time the question was asked, data.table and dplyr. Both are definitely worth a look if you are aiming for good performance.
library(dplyr)
library(data.table)
set.seed(123)
n = 5e7
k = 5e5
x = runif(n)
grp = sample(k, n, TRUE)
timing = list()
# sapply
timing[["sapply"]] = system.time({
    lt = split(x, grp)
    r.sapply = sapply(lt, function(x) list(sum(x), length(x)), simplify = FALSE)
})
# lapply
timing[["lapply"]] = system.time({
    lt = split(x, grp)
    r.lapply = lapply(lt, function(x) list(sum(x), length(x)))
})
# tapply
timing[["tapply"]] = system.time(
    r.tapply <- tapply(x, list(grp), function(x) list(sum(x), length(x)))
)
# by
timing[["by"]] = system.time(
    r.by <- by(x, list(grp), function(x) list(sum(x), length(x)), simplify = FALSE)
)
# aggregate
timing[["aggregate"]] = system.time(
    r.aggregate <- aggregate(x, list(grp), function(x) list(sum(x), length(x)), simplify = FALSE)
)
# dplyr
timing[["dplyr"]] = system.time({
    df = data_frame(x, grp)
    r.dplyr = summarise(group_by(df, grp), sum(x), n())
})
# data.table
timing[["data.table"]] = system.time({
    dt = setnames(setDT(list(x, grp)), c("x","grp"))
    r.data.table = dt[, .(sum(x), .N), grp]
})
# all output size match to group count
sapply(list(sapply=r.sapply, lapply=r.lapply, tapply=r.tapply, by=r.by, aggregate=r.aggregate, dplyr=r.dplyr, data.table=r.data.table),
function(x) (if(is.data.frame(x)) nrow else length)(x)==k)
# sapply lapply tapply by aggregate dplyr data.table
# TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# print timings
as.data.table(sapply(timing, `[[`, "elapsed"), keep.rownames = TRUE
)[,.(fun = V1, elapsed = V2)
][order(-elapsed)]
# fun elapsed
#1: aggregate 109.139
#2: by 25.738
#3: dplyr 18.978
#4: tapply 17.006
#5: lapply 11.524
#6: sapply 11.326
#7: data.table 2.686
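Before timing alternatives against each other it is worth confirming that they compute the same thing. The sketch below does that on a much smaller input (the sizes are arbitrary choices for a quick check, not the benchmark sizes) and uses base R only. rowsum is a base function specialised for grouped sums that was not part of the benchmark above; no timing claim is made for it here.

```r
set.seed(123)
n <- 1e4   # far smaller than the benchmark above, for a quick check only
k <- 50
x <- runif(n)
grp <- sample(k, n, TRUE)

# Three base R routes to the same per-group sums
sum_tapply <- tapply(x, grp, sum)
sum_vapply <- vapply(split(x, grp), sum, numeric(1))
sum_rowsum <- rowsum(x, grp)

all.equal(as.vector(sum_tapply), unname(sum_vapply))
all.equal(as.vector(sum_tapply), as.vector(sum_rowsum))
```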
#7
19
It is maybe worth mentioning ave. ave is tapply's friendly cousin. It returns results in a form that you can plug straight back into your data frame.
dfr <- data.frame(a=1:20, f=rep(LETTERS[1:5], each=4))
means <- tapply(dfr$a, dfr$f, mean)
## A B C D E
## 2.5 6.5 10.5 14.5 18.5
## great, but putting it back in the data frame is another line:
dfr$m <- means[dfr$f]
dfr$m2 <- ave(dfr$a, dfr$f, FUN=mean) # NB argument name FUN is needed!
dfr
##    a f   m  m2
## 1  1 A 2.5 2.5
## 2  2 A 2.5 2.5
## 3  3 A 2.5 2.5
## 4  4 A 2.5 2.5
## 5  5 B 6.5 6.5
## 6  6 B 6.5 6.5
## 7  7 B 6.5 6.5
## ...
There is nothing in the base package that works like ave for whole data frames (as by is like tapply for data frames). But you can fudge it:
dfr$foo <- ave(1:nrow(dfr), dfr$f, FUN=function(x) {
    x <- dfr[x,]
    sum(x$m*x$m2)
})
dfr
## a f m m2 foo
## 1 1 A 2.5 2.5 25
## 2 2 A 2.5 2.5 25
## 3 3 A 2.5 2.5 25
## ...
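The reason FUN has to be named is that every unnamed argument after the first is treated as a grouping variable. The sketch below, on a made-up data frame d, uses that to group by two variables at once, and also shows that FUN defaults to mean when omitted.

```r
d <- data.frame(
  v  = c(1, 2, 3, 4, 5, 6, 7, 8),
  g1 = rep(c("a", "b"), each = 4),   # a a a a b b b b
  g2 = rep(c("x", "y"), times = 4)   # x y x y x y x y
)

# Mean of v within each g1/g2 combination, aligned with the original rows
d$cell_mean <- ave(d$v, d$g1, d$g2, FUN = mean)

# Without FUN, ave() uses mean; here grouped by g1 only
d$g1_mean <- ave(d$v, d$g1)

d
```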
#8
19
Despite all the great answers here, there are 2 more base functions that deserve to be mentioned: the useful outer function and the obscure eapply function.
outer
outer is a very useful function hidden as a more mundane one. If you read the help for outer its description says:
The outer product of the arrays X and Y is the array A with dimension
c(dim(X), dim(Y)) where element A[c(arrayindex.x, arrayindex.y)] =
FUN(X[arrayindex.x], Y[arrayindex.y], ...).
which makes it seem like this is only useful for linear algebra type things. However, it can be used much like mapply to apply a function to two vectors of inputs. The difference is that mapply will apply the function to the first two elements and then the second two etc, whereas outer will apply the function to every combination of one element from the first vector and one from the second. For example:
A<-c(1,3,5,7,9)
B<-c(0,3,6,9,12)
> mapply(FUN=pmax, A, B)
[1]  1  3  6  9 12
> outer(A,B, pmax)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 3 6 9 12
[2,] 3 3 6 9 12
[3,] 5 5 6 9 12
[4,] 7 7 7 9 12
[5,] 9 9 9 9 12
I have personally used this when I have a vector of values and a vector of conditions and wish to see which values meet which conditions.
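A sketch of that values-against-conditions use, with made-up numbers. One thing to be aware of is that outer calls the function once on two long, recycled vectors, so the function has to be vectorised. A function that only handles single values can be wrapped in Vectorize first; scalar_check below is a hypothetical helper written for this example.

```r
values     <- c(2, 5, 8)
thresholds <- c(low = 3, high = 6)

# ">" is vectorised, so it can be passed to outer() directly:
# one row per value, one column per threshold
exceeds <- outer(values, thresholds, ">")
exceeds

# A scalar-only function needs wrapping with Vectorize()
scalar_check <- function(v, t) if (v > t) "above" else "not above"
labels <- outer(values, thresholds, Vectorize(scalar_check))
labels
```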
eapply
eapply is like lapply except that rather than applying a function to every element in a list, it applies a function to every element in an environment. For example if you want to find a list of user defined functions in the global environment:
A<-c(1,3,5,7,9)
B<-c(0,3,6,9,12)
C<-list(x=1, y=2)
D<-function(x){x+1}
> eapply(.GlobalEnv, is.function)
$A
[1] FALSE
$B
[1] FALSE
$C
[1] FALSE
$D
[1] TRUE
Frankly I don't use this very much but if you are building a lot of packages or create a lot of environments it may come in handy.
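The same idea works on any environment, not just the global one. The sketch below builds a small environment of its own (the object names n, f and s are made up) so that the result does not depend on what happens to be in your workspace.

```r
# A small, self-contained environment for the example
e <- new.env()
assign("n", 42, envir = e)
assign("f", function(x) x + 1, envir = e)
assign("s", "text", envir = e)

# One TRUE/FALSE per object in the environment
is_fun <- eapply(e, is.function)

# The order of the returned list is not guaranteed, so select by value
fun_names <- names(Filter(isTRUE, is_fun))
fun_names
```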
#9
4
I recently discovered the rather useful sweep function and add it here for the sake of completeness:
sweep
The basic idea is to sweep through an array row- or column-wise and return a modified array. An example will make this clear (source: datacamp):
Let's say you have a matrix and want to standardize it column-wise:
dataPoints <- matrix(4:15, nrow = 4)
# Find means per column with `apply()`
dataPoints_means <- apply(dataPoints, 2, mean)
# Find standard deviation with `apply()`
dataPoints_sdev <- apply(dataPoints, 2, sd)
# Center the points
dataPoints_Trans1 <- sweep(dataPoints, 2, dataPoints_means,"-")
print(dataPoints_Trans1)
## [,1] [,2] [,3]
## [1,] -1.5 -1.5 -1.5
## [2,] -0.5 -0.5 -0.5
## [3,] 0.5 0.5 0.5
## [4,] 1.5 1.5 1.5
# Return the result
dataPoints_Trans1
## [,1] [,2] [,3]
## [1,] -1.5 -1.5 -1.5
## [2,] -0.5 -0.5 -0.5
## [3,] 0.5 0.5 0.5
## [4,] 1.5 1.5 1.5
# Normalize
dataPoints_Trans2 <- sweep(dataPoints_Trans1, 2, dataPoints_sdev, "/")
# Return the result
dataPoints_Trans2
## [,1] [,2] [,3]
## [1,] -1.1618950 -1.1618950 -1.1618950
## [2,] -0.3872983 -0.3872983 -0.3872983
## [3,] 0.3872983 0.3872983 0.3872983
## [4,] 1.1618950 1.1618950 1.1618950
NB: for this simple example the same result can of course be achieved more easily by apply(dataPoints, 2, scale)
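As a cross-check of the two sweep steps, the sketch below repeats them compactly and compares the result with calling scale on the whole matrix, which is another route to the same numbers (an addition here, not part of the answer above).

```r
dataPoints <- matrix(4:15, nrow = 4)

# Two sweeps: subtract column means, then divide by column standard deviations
centered <- sweep(dataPoints, 2, colMeans(dataPoints), "-")
scaled   <- sweep(centered, 2, apply(dataPoints, 2, sd), "/")

# scale() does both steps at once; it also attaches "scaled:center" and
# "scaled:scale" attributes, hence check.attributes = FALSE
all.equal(scaled, scale(dataPoints), check.attributes = FALSE)
```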