How to split a data frame by rows into n chunks, apply a function, and combine?

I have a data.frame of 130,209 rows.

> head(dt)

              mLow1 mHigh1 mLow2 mHigh2 meanLow meanHigh        fc     mean
     A_00001  37.00  12.75 99.25  78.50  68.125   45.625 1.4931507  56.8750
     A_00002  31.00  21.50 84.75  53.00  57.875   37.250 1.5536913  47.5625
     A_00003  72.50  26.50 81.75  74.75  77.125   50.625 1.5234568  63.8750

I want to split the data.frame into 12 pieces, apply the scale function to the column fc within each piece, and then combine the pieces again. There is no grouping variable here, else I'd have used ddply. Also, because 130,209 is not evenly divisible by 12, the resulting data.frames will be unbalanced, i.e., 11 data.frames will have 10,851 rows and the last one will have 10,848 rows, but that's fine.

So how do I split a data.frame by row into n chunks (in this case 12), apply a function to each and then combine them back together? Any help'd be much appreciated.

Update: The two top solutions give me different results. Using @Ben Bolker's solution:

mLow1 mHigh1 mLow2 mHigh2          UID       gene_id meanLow meanHigh mean         fc
  1.5   3.25     1   1.25 MGLibB_00021 0610010K14Rik    1.25     2.25 1.75 -0.5231249
  1.5   3.25     1   1.25 MGLibA_00034 0610037L13Rik    1.25     2.25 1.75 -0.5231249
  1.5   3.25     1   1.25 MGLibB_00058 1100001G20Rik    1.25     2.25 1.75 -0.5231249
  1.5   3.25     1   1.25 MGLibA_00061 1110001A16Rik    1.25     2.25 1.75 -0.5231249
  1.5   3.25     1   1.25 MGLibA_00104 1110034G24Rik    1.25     2.25 1.75 -0.5231249
  1.5   3.25     1   1.25 MGLibA_00110 1110038F14Rik    1.25     2.25 1.75 -0.5231249

Using @MichaelChirico's answer:

mLow1 mHigh1 mLow2 mHigh2          UID       gene_id meanLow meanHigh mean        fc  fc_scaled
  1.5   3.25     1   1.25 MGLibB_00021 0610010K14Rik    1.25     2.25 1.75 0.5555556 -0.5089608
  1.5   3.25     1   1.25 MGLibA_00034 0610037L13Rik    1.25     2.25 1.75 0.5555556 -0.5089608
  1.5   3.25     1   1.25 MGLibB_00058 1100001G20Rik    1.25     2.25 1.75 0.5555556 -0.5089608
  1.5   3.25     1   1.25 MGLibA_00061 1110001A16Rik    1.25     2.25 1.75 0.5555556 -0.5089608
  1.5   3.25     1   1.25 MGLibA_00104 1110034G24Rik    1.25     2.25 1.75 0.5555556 -0.5089608
  1.5   3.25     1   1.25 MGLibA_00110 1110038F14Rik    1.25     2.25 1.75 0.5555556 -0.5089608

3 solutions

#1

ggplot2 has a cut_number() convenience function that will do this for you. If you don't want the overhead of loading that package, you can look at ggplot2:::breaks for the necessary logic.

Reproducible example stolen from @MichaelChirico:

set.seed(100)
KK<-130209L; nn<-12L
library("dplyr")
dt <- data.frame(mLow1=rnorm(KK),mHigh1=rnorm(KK),
               mLow2=rnorm(KK),mHigh2=rnorm(KK),
               meanLow=rnorm(KK),meanHigh=rnorm(KK),
               fc=rnorm(KK),mean=rnorm(KK)) %>% arrange(mean)

With apologies to those who don't like pipes:

library("ggplot2")  ## for cut_number()
dt %>% mutate(grp=cut_number(mean,12)) %>%
       group_by(grp) %>%
       mutate(fc=c(scale(fc))) %>%
       ungroup() %>%        
       select(-grp) %>%     ## drop grouping variable
       as.data.frame -> dt2 ## convert back to data frame, assign result

It turns out that the c() around scale() is necessary -- otherwise the fc variable ends up with some attributes that confuse tail() ...
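
To see why, here is a minimal sketch in base R (the vector x is made up for illustration): scale() returns a one-column matrix that carries centering and scaling attributes, and c() strips it back to a plain numeric vector.

```r
x <- c(1, 2, 3)            # hypothetical toy input

s <- scale(x)              # centered and scaled copy of x
is.matrix(s)               # TRUE: scale() always returns a matrix
names(attributes(s))       # includes "dim", "scaled:center", "scaled:scale"

v <- c(s)                  # c() drops every attribute
is.null(attributes(v))     # TRUE: v is a plain numeric vector
```

Inside mutate(), the plain vector is what you want stored in the fc column.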

The same logic should apply to using plyr, or base R split-apply-combine, as well (the key is using cut_number() to define the grouping variable).

#2

I'm not sure the structure of dt matters that much (if you are not using any of its internal values to do the splitting). Does this help?

 spl.dt <- split( dt , cut(1:nrow(dt), 12) )

 lapply( spl.dt, my_fun)
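
The snippet above stops at the apply step. One possible way to finish the combine step in base R is sketched below. It assumes my_fun is replaced by a function that scales fc and returns the whole chunk; the names n_chunks, grp, pieces and dt_scaled are mine, and the 50-row dt is only a small stand-in for the real data.

```r
# Small stand-in data so the sketch runs quickly (the real dt has 130,209 rows)
set.seed(100)
dt <- data.frame(fc = rnorm(50), mean = rnorm(50))
n_chunks <- 12  # hypothetical name for the number of chunks

# Integer code assigning consecutive rows to 12 position-based chunks
grp <- cut(seq_len(nrow(dt)), n_chunks, labels = FALSE)

# Split by chunk and scale fc within each chunk
pieces <- lapply(split(dt, grp), function(d) {
  d$fc <- c(scale(d$fc))  # c() drops the matrix attributes scale() adds
  d
})

# Recombine, putting every row back in its original position
dt_scaled <- unsplit(pieces, grp)
```

If only one column changes, ave(dt$fc, grp, FUN = function(x) c(scale(x))) should give the same fc values without an explicit split.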

#3

With data.table, you can do:

library(data.table)
setDT(dt)[,scale(fc),by=rep(1:nn,each=ceiling(KK/nn),length.out=KK)]

Here, KK is 130,209 and nn is 12. Reproducible data:

set.seed(100)
KK<-130209L; nn<-12L
dt<-data.frame(mLow1=rnorm(KK),mHigh1=rnorm(KK),
               mLow2=rnorm(KK),mHigh2=rnorm(KK),
               meanLow=rnorm(KK),meanHigh=rnorm(KK),
               fc=rnorm(KK),mean=rnorm(KK))

So no need to split the data and recombine.

If you'd like to add this to the data frame instead of just extracting it, you can use the := operator to assign by reference:

setDT(dt)[,fc_scaled:=scale(fc)...]
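
The ... above is left as in the original. One way it might be filled in is sketched here, assuming the same position-based grouping as the by= expression earlier, stored in a helper column first (chunk is a name made up for this sketch, and only two of the columns are generated to keep it short):

```r
library(data.table)

set.seed(100)
KK <- 130209L; nn <- 12L
dt <- data.frame(fc = rnorm(KK), mean = rnorm(KK))

setDT(dt)  # convert to data.table by reference

# Helper column: rows 1..10851 get 1, the next 10851 get 2, and so on;
# length.out trims the last chunk to 10,848 rows
dt[, chunk := rep(1:nn, each = ceiling(KK / nn), length.out = KK)]

# Scale fc within each chunk and store the result as a new column
dt[, fc_scaled := c(scale(fc)), by = chunk]

dt[, .N, by = chunk]  # chunk sizes, for checking
```

The c() around scale() is used for the same reason as in the dplyr answer: it keeps fc_scaled a plain numeric column.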
