Trouble using the plyr package with lists

Time: 2021-12-22 09:17:02

I'm having trouble understanding how to use the plyr package. I am trying to use it to split up data frames that are stored in a list, apply a function, store the results as data frames, and combine those data frames into a list again. Given the following data:

    #create test dfs
    df1<-data.frame(a=sample(1:50,10),b=sample(1:50,10),c=sample(1:50,10),d=(c("a","b","c","a","a","b","b","a","c","d")))
    df2<-data.frame(a=sample(1:50,9),b=sample(1:50,9),c=sample(1:50,9),d=(c("e","f","g","e","e","f","f","e","g")))
    df3<-data.frame(a=sample(1:50,8),b=sample(1:50,8),c=sample(1:50,8),d=(c("h","i","j","h","h","i","i","h")))

    #make them a list
    list.1<-list(df1=df1,df2=df2,df3=df3)

I would like to calculate the mean of each group defined in column d of each data frame. On a single data frame, calculating the group means of one specific column with plyr could look like this:

    library(plyr)
    ddply(df1, .(d), summarise, mean = mean(a))

But how do I apply it to every column within the data frame and to every data frame within the list? And how can I reassemble all the data so that in the end I get a list of matrices containing the results? Sorry for this very basic question, but I'm new to R and have been trying to solve this for quite some time. Thanks.

3 Solutions

#1

Here is a solution combining llply() and ddply(). First, llply() applies a function to each element of the list and returns a list. Within that function, ddply() splits each data frame by column d, and colMeans() calculates the mean of each of the three numeric columns (selected by position with x[,1:3]).

    llply(list.1, function(x) ddply(x, .(d), function(x) colMeans(x[, 1:3])))
    $df1
      d        a     b        c
    1 a 22.25000 26.25 34.25000
    2 b 19.66667 22.00 28.66667
    3 c 37.00000 44.50 18.00000
    4 d 17.00000  3.00  4.00000

    $df2
      d        a        b    c
    1 e 20.50000 32.25000 18.5
    2 f 25.33333 34.33333 21.0
    3 g 20.50000 26.50000 16.5

    $df3
      d    a        b        c
    1 h 17.5 26.50000 37.25000
    2 i 45.0 22.33333 26.33333
    3 j 25.0 33.00000 42.00000
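
The call above selects the numeric columns by position. As a hedged variation, plyr's numcolwise() can pick the numeric columns automatically, which may be handier when the column layout differs between data frames. The small data set and the names demo_list and group_means are invented for this sketch and are not part of the question.

```r
library(plyr)

# Small illustrative data (values made up for this sketch, not from the question)
df1 <- data.frame(a = c(1, 3, 5, 7), b = c(2, 4, 6, 8), c = c(10, 20, 30, 40),
                  d = c("x", "y", "x", "y"))
df2 <- data.frame(a = c(2, 4, 6), b = c(1, 1, 4), c = c(3, 6, 9),
                  d = c("p", "p", "q"))
demo_list <- list(df1 = df1, df2 = df2)

# numcolwise(mean) builds a function that takes the mean of every numeric
# column, so the grouping column d is skipped without hard-coding positions
group_means <- llply(demo_list, function(x) ddply(x, .(d), numcolwise(mean)))

print(group_means)
```

The result is still a named list with one data frame of group means per input data frame.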

#2

You need to put all the data into one big data.frame:

    library(plyr)     # ldply(), ddply()
    library(reshape)  # melt()

    big_dataframe = ldply(list.1, function(x) melt(x, id.vars = "d"))
    head(big_dataframe)
      .id d variable value
    1 df1 a        a    44
    2 df1 b        a    17
    3 df1 c        a    15
    4 df1 a        a    30
    5 df1 a        a    49
    6 df1 b        a    33

...and then use ddply on it.

    res = ddply(big_dataframe, .(.id, d, variable), summarise, mn = mean(value))
    res
       .id d variable       mn
    1  df1 a        a 40.00000
    2  df1 a        b 25.25000
    3  df1 a        c 31.25000
    4  df1 b        a 22.66667
    5  df1 b        b 16.00000
    6  df1 b        c 26.00000
    7  df1 c        a  9.00000
    8  df1 c        b 16.50000
    9  df1 c        c 15.00000
    10 df1 d        a 28.00000
    11 df1 d        b 24.00000
    12 df1 d        c 39.00000
    13 df2 e        a 18.50000
    14 df2 e        b 15.50000
    15 df2 e        c 16.50000
    16 df2 f        a 26.33333
    17 df2 f        b 42.00000
    18 df2 f        c 37.00000
    19 df2 g        a 26.50000
    20 df2 g        b 22.00000
    21 df2 g        c 31.00000
    22 df3 h        a 29.25000
    23 df3 h        b 34.25000
    24 df3 h        c 32.00000
    25 df3 i        a 30.33333
    26 df3 i        b 40.00000
    27 df3 i        c 24.33333
    28 df3 j        a 21.00000
    29 df3 j        b  5.00000
    30 df3 j        c 46.00000

which gives the mean of each variable (a-c), per level of factor d, and per sub-dataframe (df1-df3).
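
If a list is wanted at the end, as the question asks, the stacked result can be split again. The following is only a sketch of that idea under some assumptions: it skips the melt() step and stacks the data frames directly with ldply(), and the data set and the names demo_list, stacked, means_wide and means_list are invented for illustration.

```r
library(plyr)

# Small illustrative data (values made up for this sketch, not from the question)
df1 <- data.frame(a = c(1, 3, 5, 7), b = c(2, 4, 6, 8), c = c(10, 20, 30, 40),
                  d = c("x", "y", "x", "y"))
df2 <- data.frame(a = c(2, 4, 6), b = c(1, 1, 4), c = c(3, 6, 9),
                  d = c("p", "p", "q"))
demo_list <- list(df1 = df1, df2 = df2)

# Row-bind the data frames; ldply() records each list name in a ".id" column
stacked <- ldply(demo_list, identity)

# One row of column means per (data frame, group) combination
means_wide <- ddply(stacked, .(.id, d),
                    function(x) colMeans(x[, c("a", "b", "c")]))

# Split back into a list with one data frame per original list element
means_list <- split(means_wide[, c("d", "a", "b", "c")], means_wide$.id)

print(means_list)
```

Stacking first keeps everything in one table for further processing, and split() restores the per-data-frame structure only at the very end.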

#3

You can always just lapply your ddply:

    lapply(list.1, function(x) ddply(x, .(d), function(x)
      data.frame(a = mean(x$a), b = mean(x$b), c = mean(x$c))))

Or, using your code exactly (which averages column a only):

    lapply(list.1, function(x) ddply(x, .(d), summarise, mean = mean(a)))
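
The question asks for a list of matrices rather than data frames. One possible way to get there, sketched under the assumption that the group labels should become row names, is to convert each ddply() result inside the lapply() call. The data set and the names demo_list and means_as_matrix are invented for this example.

```r
library(plyr)

# Small illustrative data (values made up for this sketch, not from the question)
df1 <- data.frame(a = c(1, 3, 5, 7), b = c(2, 4, 6, 8), c = c(10, 20, 30, 40),
                  d = c("x", "y", "x", "y"))
df2 <- data.frame(a = c(2, 4, 6), b = c(1, 1, 4), c = c(3, 6, 9),
                  d = c("p", "p", "q"))
demo_list <- list(df1 = df1, df2 = df2)

means_as_matrix <- lapply(demo_list, function(x) {
  # Group means per level of d, as in the lapply/ddply call above
  means <- ddply(x, .(d), function(g)
    data.frame(a = mean(g$a), b = mean(g$b), c = mean(g$c)))
  # Keep only the numeric columns and move the group labels into row names
  m <- as.matrix(means[, c("a", "b", "c")])
  rownames(m) <- as.character(means$d)
  m
})

print(means_as_matrix)
```

Individual values can then be looked up by group and column name, for example means_as_matrix$df1["x", "a"].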
