I'm having trouble understanding how to use the plyr package. I am trying to use it to split up data frames that are stored in a list, apply a function, store the results as data frames, and combine those data frames into a list again. So, given the following data:
# create test data frames
df1 <- data.frame(a = sample(1:50, 10), b = sample(1:50, 10), c = sample(1:50, 10), d = c("a", "b", "c", "a", "a", "b", "b", "a", "c", "d"))
df2 <- data.frame(a = sample(1:50, 9), b = sample(1:50, 9), c = sample(1:50, 9), d = c("e", "f", "g", "e", "e", "f", "f", "e", "g"))
df3 <- data.frame(a = sample(1:50, 8), b = sample(1:50, 8), c = sample(1:50, 8), d = c("h", "i", "j", "h", "h", "i", "i", "h"))
# combine them into a named list
list.1 <- list(df1 = df1, df2 = df2, df3 = df3)
I would like to calculate the mean of each group defined by column d in each data frame. On a single data frame, one way to calculate the grouped mean of a specific column with plyr would be:
ddply(df1,.(d),summarise, mean=mean(a))
But how do I apply it to every column within the data frame and to every data frame within the list? And how can I reassemble all the data so that in the end I get a list of matrices containing the results? Sorry for this very basic question, but I'm new to R and have been trying to solve this for quite some time. Thanks.
3 solutions
#1
1
Here is a solution combining llply() and ddply(). First, llply() applies a function to each element of the list and returns a list. Inside it, ddply() is applied to each data frame and splits it into groups by column d. colMeans() then calculates the mean of each numeric column (selected here by position, 1:3) within each group.
llply(list.1,function(x) ddply(x,.(d),function(x) colMeans(x[,1:3])))
$df1
d a b c
1 a 22.25000 26.25 34.25000
2 b 19.66667 22.00 28.66667
3 c 37.00000 44.50 18.00000
4 d 17.00000 3.00 4.00000
$df2
d a b c
1 e 20.50000 32.25000 18.5
2 f 25.33333 34.33333 21.0
3 g 20.50000 26.50000 16.5
$df3
d a b c
1 h 17.5 26.50000 37.25000
2 i 45.0 22.33333 26.33333
3 j 25.0 33.00000 42.00000
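Since the question asks for a list of matrices, the list of data frames returned above could be converted in one more step. The following is a minimal sketch, assuming small made-up test data so the result is easy to check; the values and the helper name to_matrix are illustrative and not part of the original answer.

```r
library(plyr)

# Small fixed test data (hypothetical values, not the random data from the question)
df1 <- data.frame(a = c(1, 3, 5, 7), b = c(2, 4, 6, 8), c = c(10, 20, 30, 40),
                  d = c("x", "x", "y", "y"))
df2 <- data.frame(a = c(2, 4, 6), b = c(1, 1, 1), c = c(0, 10, 20),
                  d = c("p", "q", "q"))
list.1 <- list(df1 = df1, df2 = df2)

# Group means per data frame, as in the answer above,
# but selecting the columns by name instead of by position
means.list <- llply(list.1, function(x) {
  ddply(x, .(d), function(g) colMeans(g[, c("a", "b", "c")]))
})

# Hypothetical helper: turn one result into a numeric matrix,
# using the group labels in d as row names
to_matrix <- function(res) {
  m <- as.matrix(res[, c("a", "b", "c")])
  rownames(m) <- as.character(res$d)
  m
}

mat.list <- lapply(means.list, to_matrix)
```

With this layout, a single group mean can be looked up by name, for example mat.list$df1["x", "a"].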
#2
3
You can put all the data into one big data.frame:
library(reshape)
big_dataframe = ldply(list.1, function(x) melt(x, id.vars = "d"))
> head(big_dataframe)
.id d variable value
1 df1 a a 44
2 df1 b a 17
3 df1 c a 15
4 df1 a a 30
5 df1 a a 49
6 df1 b a 33
...and then use ddply on it.
res = ddply(big_dataframe, .(.id, d, variable), summarise, mn = mean(value))
> res
.id d variable mn
1 df1 a a 40.00000
2 df1 a b 25.25000
3 df1 a c 31.25000
4 df1 b a 22.66667
5 df1 b b 16.00000
6 df1 b c 26.00000
7 df1 c a 9.00000
8 df1 c b 16.50000
9 df1 c c 15.00000
10 df1 d a 28.00000
11 df1 d b 24.00000
12 df1 d c 39.00000
13 df2 e a 18.50000
14 df2 e b 15.50000
15 df2 e c 16.50000
16 df2 f a 26.33333
17 df2 f b 42.00000
18 df2 f c 37.00000
19 df2 g a 26.50000
20 df2 g b 22.00000
21 df2 g c 31.00000
22 df3 h a 29.25000
23 df3 h b 34.25000
24 df3 h c 32.00000
25 df3 i a 30.33333
26 df3 i b 40.00000
27 df3 i c 24.33333
28 df3 j a 21.00000
29 df3 j b 5.00000
30 df3 j c 46.00000
which gives the mean of each variable (a-c), per level of factor d, and per sub-dataframe (df1-df3).
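If a list with one matrix per original data frame is wanted, a long result like res can be split again by .id. The sketch below uses base R only and a hand-written data frame with the same columns as res; the values are made up for illustration and do not come from the output above.

```r
# Hypothetical long-format result with the same columns as `res`
res <- data.frame(
  .id      = c("df1", "df1", "df1", "df1", "df2", "df2"),
  d        = c("a", "a", "b", "b", "e", "e"),
  variable = c("a", "b", "a", "b", "a", "b"),
  mn       = c(40, 25.25, 22.5, 16, 18.5, 15.5)
)

# Split by source data frame, then spread each piece into a matrix:
# rows are the levels of d, columns are the variables.
# as.character() avoids empty rows from unused factor levels.
res.list <- lapply(split(res, res$.id), function(p) {
  tapply(p$mn,
         list(d = as.character(p$d), variable = as.character(p$variable)),
         mean)
})
```

Each cell holds exactly one value here, so mean() inside tapply() simply returns that value.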
#3
3
You can always just lapply your ddply:
lapply(list.1, function(x) ddply(x, .(d), function(x)
data.frame(a=mean(x$a),b=mean(x$b),c= mean(x$c))) )
Or, using your code exactly (this computes the mean of column a only):
lapply(list.1, function(x) ddply(x,.(d),summarise, mean=mean(a)) )
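To avoid naming every column by hand, plyr's numcolwise() may be an option: it wraps a function so that it is applied to each numeric column of a data frame. This is a sketch under that assumption, again with small made-up data rather than the random data from the question.

```r
library(plyr)

# Small fixed test data (hypothetical values)
df1 <- data.frame(a = c(1, 3, 5), b = c(2, 4, 6), d = c("x", "x", "y"))
df2 <- data.frame(a = c(10, 20), b = c(0, 1), d = c("p", "p"))
list.1 <- list(df1 = df1, df2 = df2)

# numcolwise(mean) applies mean() to every numeric column of each group;
# the grouping column d is not numeric, so it is left out of the averaging
res.list <- lapply(list.1, function(x) ddply(x, .(d), numcolwise(mean)))
```

The result is again a list of data frames, one per element of list.1, each with the column d followed by the averaged columns.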