I have a data.frame of 130,209 rows.
> head(dt)
mLow1 mHigh1 mLow2 mHigh2 meanLow meanHigh fc mean
A_00001 37.00 12.75 99.25 78.50 68.125 45.625 1.4931507 56.8750
A_00002 31.00 21.50 84.75 53.00 57.875 37.250 1.5536913 47.5625
A_00003 72.50 26.50 81.75 74.75 77.125 50.625 1.5234568 63.8750
I want to split the data.frame into 12 parts, apply the scale function to the column fc, and then combine the results. There is no grouping variable here, or else I'd have used ddply. Also, because 130,209 is not evenly divisible by 12, the resulting data.frames will be unbalanced, i.e., 11 data.frames will have 10,851 rows and the last one will have 10,848 rows, but that's fine.
So how do I split a data.frame by row into n chunks (in this case 12), apply a function to each, and then combine them back together? Any help would be much appreciated.
Update: Using the two top solutions, I get different results. Using @Ben Bolker's solution:
mLow1 mHigh1 mLow2 mHigh2 UID gene_id meanLow meanHigh mean fc
1.5 3.25 1 1.25 MGLibB_00021 0610010K14Rik 1.25 2.25 1.75 -0.5231249
1.5 3.25 1 1.25 MGLibA_00034 0610037L13Rik 1.25 2.25 1.75 -0.5231249
1.5 3.25 1 1.25 MGLibB_00058 1100001G20Rik 1.25 2.25 1.75 -0.5231249
1.5 3.25 1 1.25 MGLibA_00061 1110001A16Rik 1.25 2.25 1.75 -0.5231249
1.5 3.25 1 1.25 MGLibA_00104 1110034G24Rik 1.25 2.25 1.75 -0.5231249
1.5 3.25 1 1.25 MGLibA_00110 1110038F14Rik 1.25 2.25 1.75 -0.5231249
Using @MichaelChirico's answer:
mLow1 mHigh1 mLow2 mHigh2 UID gene_id meanLow meanHigh mean fc fc_scaled
1.5 3.25 1 1.25 MGLibB_00021 0610010K14Rik 1.25 2.25 1.75 0.5555556 -0.5089608
1.5 3.25 1 1.25 MGLibA_00034 0610037L13Rik 1.25 2.25 1.75 0.5555556 -0.5089608
1.5 3.25 1 1.25 MGLibB_00058 1100001G20Rik 1.25 2.25 1.75 0.5555556 -0.5089608
1.5 3.25 1 1.25 MGLibA_00061 1110001A16Rik 1.25 2.25 1.75 0.5555556 -0.5089608
1.5 3.25 1 1.25 MGLibA_00104 1110034G24Rik 1.25 2.25 1.75 0.5555556 -0.5089608
1.5 3.25 1 1.25 MGLibA_00110 1110038F14Rik 1.25 2.25 1.75 0.5555556 -0.5089608
3 Answers
#1
4
ggplot2 has a cut_number() convenience function that will do this for you. If you don't want the overhead of loading that package, you can look at ggplot2:::breaks for the necessary logic.
Reproducible example stolen from @MichaelChirico:
set.seed(100)
KK<-130209L; nn<-12L
library("dplyr")
dt <- data.frame(mLow1=rnorm(KK),mHigh1=rnorm(KK),
mLow2=rnorm(KK),mHigh2=rnorm(KK),
meanLow=rnorm(KK),meanHigh=rnorm(KK),
fc=rnorm(KK),mean=rnorm(KK)) %>% arrange(mean)
With apologies to those who don't like pipes:
library("ggplot2") ## for cut_number()
dt %>% mutate(grp=cut_number(mean,12)) %>%
group_by(grp) %>%
mutate(fc=c(scale(fc))) %>%
ungroup() %>%
select(-grp) %>% ## drop grouping variable
as.data.frame -> dt2 ## convert back to data frame, assign result
It turns out that the c() around scale() is necessary; otherwise the fc variable ends up with some attributes that confuse tail().
The same logic should apply to using plyr, or base R split-apply-combine, as well (the key is using cut_number() to define the grouping variable).
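As a rough illustration of the base R route mentioned above, here is a minimal sketch. It assumes that equal-count bins built with quantile() and cut() are an acceptable stand-in for cut_number(); that is a swapped-in technique, not the ggplot2 implementation. The column name fc_scaled and the helper variables brks and grp are made up for this example. Note that the groups are defined by the values of mean, not by row position.

```r
set.seed(100)
KK <- 130209L; nn <- 12L
dt <- data.frame(fc = rnorm(KK), mean = rnorm(KK))

# Equal-count bins of `mean` from its quantiles
# (assumed base R stand-in for cut_number(), not ggplot2's own code).
brks <- quantile(dt$mean, probs = seq(0, 1, length.out = nn + 1))
grp  <- cut(dt$mean, breaks = brks, include.lowest = TRUE)

# ave() applies FUN within each group and keeps the original row order;
# c() strips the matrix attributes that scale() returns.
dt$fc_scaled <- ave(dt$fc, grp, FUN = function(x) c(scale(x)))

table(grp)  # group sizes are roughly, not necessarily exactly, equal
```

Because ave() returns the values in the original row order, no separate recombine step is needed here.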
#2
5
I'm not sure the structure of dt matters that much (if you are not using any of its internal values to do the splitting). Does this help?
spl.dt <- split( dt , cut(1:nrow(dt), 12) )
lapply( spl.dt, my_fun)
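The two lines above stop at the apply step. A possible way to finish the job is sketched below, on the assumption that my_fun takes one chunk and returns it with an extra column; the body of my_fun and the column name fc_scaled are hypothetical, since the answer does not define them.

```r
set.seed(100)
KK <- 130209L; nn <- 12L
dt <- data.frame(fc = rnorm(KK), mean = rnorm(KK))

# Hypothetical my_fun: scale fc within one chunk and return the chunk.
my_fun <- function(d) {
  d$fc_scaled <- c(scale(d$fc))  # c() drops the matrix attributes from scale()
  d
}

# 12 near-equal ranges of the row index; labels = FALSE gives integer codes 1..12.
grp    <- cut(seq_len(nrow(dt)), nn, labels = FALSE)
spl.dt <- split(dt, grp)

# Chunks are consecutive row blocks, so rbind() restores the original order.
res <- do.call(rbind, lapply(spl.dt, my_fun))
rownames(res) <- NULL  # discard the "chunk.row" names that rbind() creates
```

Chunk sizes from cut() on the row index are near-equal but need not match the 10,851/10,848 layout described in the question.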
#3
2
With data.table, you can do:
library(data.table)
setDT(dt)[,scale(fc),by=rep(1:nn,each=ceiling(KK/nn),length.out=KK)]
Here, KK is 130,209 and nn is 12. Reproducible data:
set.seed(100)
KK<-130209L; nn<-12L
dt<-data.frame(mLow1=rnorm(KK),mHigh1=rnorm(KK),
mLow2=rnorm(KK),mHigh2=rnorm(KK),
meanLow=rnorm(KK),meanHigh=rnorm(KK),
fc=rnorm(KK),mean=rnorm(KK))
So no need to split the data and recombine.
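To see what the by= expression in the one-liner produces, the chunk labels can be built on their own in base R. This is only a check of the grouping vector, not part of the answer; the name chunk_id is made up for the example.

```r
KK <- 130209L; nn <- 12L

# Same expression as in by=: each label is repeated ceiling(KK/nn) = 10851
# times, and the result is truncated to KK elements.
chunk_id <- rep(1:nn, each = ceiling(KK / nn), length.out = KK)

# Chunks 1 to 11 have 10851 rows each; chunk 12 has 10848.
table(chunk_id)
```

These are the chunk sizes described in the question.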
If you'd like to add this to the data frame instead of just extracting it, you can use the := operator to assign by reference:
setDT(dt)[,fc_scaled:=scale(fc)...]