I'm trying to learn how to write functions in R/plyr. I am aware that there are easier ways to do what I show below, but that's not the point.
In the example that follows, plyr does not add a new variable to my new data frame:
library(plyr)
highab <- subset(baseball, ab >= 600)
testfunc1 <- function(x) {
  print(x)  # just to show me that the vector does get into the function. Works fine.
  medianAB <- median(x)
  print(medianAB)  # just to prove that medianAB was calculated correctly. Works fine.
}
baseball3 <- ddply(highab, .(id), transform, testfunc1(ab))
str(baseball3$medianAB)  # No medianAB
What obvious thing am I missing?
R version 2.12.2 (2011-02-25)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=C LC_MESSAGES=en_CA.UTF-8 LC_PAPER=en_CA.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] grid splines stats graphics grDevices utils datasets methods base
other attached packages:
[1] foreign_0.8-42 ggplot2_0.8.9 proto_0.3-9.1 reshape_0.8.4 plyr_1.4.1 rms_3.3-0 Hmisc_3.8-3
[8] survival_2.36-5 stringr_0.4
loaded via a namespace (and not attached):
[1] cluster_1.13.3 lattice_0.19-23 tools_2.12.2
3 solutions
#1
3
Just make two changes:
- Remove the print command inside the function, so that the median is returned
- Add medianAB = testfunc1(ab), as suggested by Joshua
You are done!
Here is the simplified code with the output:
library(plyr)
highab <-subset(baseball, ab >= 600)
baseball3 <-ddply(highab, .(id), transform, medianAB = median(ab))
summary(baseball3$medianAB)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  600.0   612.0   621.5   623.1   631.5   677.0
#2
0
Sorry, I misunderstood the question.
See ?transform. You need to specify the new variables you want as tag=value pairs, so you need something like:
baseball3 <- ddply(highab, .(id), transform, medianAB=testfunc1(ab))
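To see why the tag matters, here is a minimal sketch using base R's transform() on a small made-up data frame (the `toy` data and the print-free version of `testfunc1` are assumptions for illustration, not part of the original question). An unnamed expression is evaluated but never attached as a column, while a tag=value pair is:

```r
# Hypothetical toy data standing in for the baseball subset.
toy <- data.frame(id = c("a", "a", "b"), ab = c(600, 620, 650))

# A print-free variant of the question's function that simply returns the median.
testfunc1 <- function(x) {
  median(x)
}

# Unnamed argument: the result is computed, but no column is added.
unnamed <- transform(toy, testfunc1(ab))

# tag = value pair: the result is attached under the name medianAB.
named <- transform(toy, medianAB = testfunc1(ab))

names(unnamed)
names(named)
```

The same rule applies when transform is passed to ddply(), except that the expression is then evaluated once per group rather than on the whole data frame.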
#3
0
At first I liked the idiom for adding derived columns to a data.frame, but I find transform() unacceptably slow for large data sets.
Would it be better to use a lambda form in ddply() and a subsequent call to merge()? Timing it, it looks like it's worth it:
> library(plyr)
> highab <-subset(baseball, ab >= 600)
>
> system.time(
+ baseball3.lambda <-merge(highab,
+ ddply(highab, .(id),
+ function(u) data.frame(medianAB = median(u$ab)))), FALSE)
user system elapsed
0.336 0.000 0.336
>
> system.time(
+ baseball3.orig <- ddply(highab, .(id),
+ transform, medianAB = median(ab)), FALSE)
user system elapsed
0.640 0.000 0.641
>
> summary(baseball3.lambda$medianAB)
Min. 1st Qu. Median Mean 3rd Qu. Max.
600.0 612.0 621.5 623.1 631.5 677.0
> summary(baseball3.orig$medianAB)
Min. 1st Qu. Median Mean 3rd Qu. Max.
600.0 612.0 621.5 623.1 631.5 677.0
Three tenths of a second may not seem like much, but it halves the execution time. The improvement is even bigger when using the whole baseball dataset.
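The two-step approach can be easier to follow on a tiny example. The sketch below assumes plyr is installed and uses a made-up `toy` data frame (hypothetical, not from the original post): ddply() first produces one row per id, and merge() then copies each group's median back onto every row of that group by joining on the shared id column.

```r
library(plyr)

# Hypothetical toy data standing in for the baseball subset.
toy <- data.frame(id = c("a", "a", "b"), ab = c(600, 620, 650))

# Step 1: one summary row per id (columns: id, medianAB).
per_id <- ddply(toy, .(id), function(u) data.frame(medianAB = median(u$ab)))

# Step 2: join the summary back; merge() uses the common column "id" by default.
merged <- merge(toy, per_id)

per_id
merged
```

Whether this is faster than transform on a given data set is something to measure with system.time(), as above, rather than assume.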