Repeat rows of a data.frame

Time: 2022-11-09 22:50:58

I want to repeat the rows of a data.frame, each N times. The result should be a new data.frame (with nrow(new.df) == nrow(old.df) * N) keeping the data types of the columns.

Example for N = 2:

                        A B   C
  A B   C             1 j i 100
1 j i 100     -->     2 j i 100
2 K P 101             3 K P 101
                      4 K P 101

So, each row is repeated 2 times and characters remain characters, factors remain factors, numerics remain numerics, ...

My first attempt used apply: apply(old.df, 2, function(co) rep(co, each = N)), but this one transforms my values to characters and I get:

     A   B   C    
[1,] "j" "i" "100"
[2,] "j" "i" "100"
[3,] "K" "P" "101"
[4,] "K" "P" "101"
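The character result comes from apply() converting the data.frame to a matrix before it calls the function, and a matrix can hold only one type. A minimal sketch of that coercion, assuming a small stand-in `old.df` built to mirror the example above (the data is hypothetical):

```r
# Hypothetical stand-in for the question's data (columns A, B, C).
old.df <- data.frame(A = c("j", "K"), B = c("i", "P"), C = c(100, 101),
                     stringsAsFactors = FALSE)

# apply() works on as.matrix(old.df); mixing character and numeric
# columns forces the whole matrix to character.
m <- as.matrix(old.df)
typeof(m)

# The attempt from the question therefore returns a character matrix,
# not a data.frame.
res <- apply(old.df, 2, function(co) rep(co, each = 2))
is.matrix(res)
typeof(res)
```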

9 answers

#1


85  

df <- data.frame(a=1:2, b=letters[1:2]) 
df[rep(seq_len(nrow(df)), each=2),]
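As a quick check that row indexing keeps column types, here is a sketch using a hypothetical `old.df` with one character, one factor and one numeric column. Indexing with repeated row numbers also produces row names such as `1.1`; resetting them is optional.

```r
N <- 2

# Hypothetical data: one character, one factor, one numeric column.
old.df <- data.frame(A = c("j", "K"),
                     B = factor(c("i", "P")),
                     C = c(100, 101),
                     stringsAsFactors = FALSE)

# Repeat each row index N times and subset by row.
new.df <- old.df[rep(seq_len(nrow(old.df)), each = N), ]

# Column classes are unchanged by row subsetting.
sapply(new.df, class)

# Duplicated rows get made-unique row names: "1" "1.1" "2" "2.1".
rn_before <- rownames(new.df)

# Optional: reset to the default 1..n row names.
rownames(new.df) <- NULL
```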

#2


6  

A clean dplyr solution, taken from here

library(dplyr)
df <- tibble(x = 1:2, y = c("a", "b"))  # data_frame() is deprecated in favor of tibble()
df %>% slice(rep(1:n(), each = 2))

#3


4  

If you can repeat the whole thing, or subset it first then repeat that, then this similar question may be helpful. Once again:

library(mefa)
rep(mtcars,10) 

or simply

mefa:::rep.data.frame(mtcars, 10)

#4


4  

The rep.row function seems to sometimes make lists for columns, which leads to bad memory hijinks. I have written the following which seems to work well:

library(plyr)
rep.row <- function(r, n){
  colwise(function(x) rep(x, n))(r)
}

#5


3  

Adding to what @dardisco mentioned about mefa::rep.data.frame(), it's very flexible.

You can either repeat each row N times:

rep(df, each=N)

or repeat the entire dataframe N times (think: like when you recycle a vectorized argument)

rep(df, times=N)
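The same `each` versus `times` distinction can be sketched in base R without mefa by repeating the row index instead of the data.frame itself. The data below is hypothetical and chosen only to make the ordering visible:

```r
# Hypothetical two-row data.frame.
df <- data.frame(x = 1:2, y = c("a", "b"), stringsAsFactors = FALSE)
N <- 3
idx <- seq_len(nrow(df))

# each = N: every row is repeated N times in place (rows 1,1,1,2,2,2).
by_row <- df[rep(idx, each = N), ]

# times = N: the whole data.frame is stacked N times (rows 1,2,1,2,1,2).
whole <- df[rep(idx, times = N), ]

by_row$x
whole$x
```

For a single-column data.frame, adding `drop = FALSE` to the subsetting call keeps the result a data.frame.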

Two thumbs up for mefa! I had never heard of it until now and I had to write manual code to do this.

#6


3  

For reference, and adding to the answers citing mefa, it might be worth taking a look at the implementation of mefa::rep.data.frame() in case you don't want to include the whole package:

> data <- data.frame(a=letters[1:3], b=letters[4:6])
> data
  a b
1 a d
2 b e
3 c f
> as.data.frame(lapply(data, rep, 2))
  a b
1 a d
2 b e
3 c f
4 a d
5 b e
6 c f
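Note that the call above repeats the whole data.frame. To repeat each row instead, as the question asks, the same one-liner can pass `each =` through to rep(). A small sketch, assuming a hypothetical data.frame with a factor and an integer column to show that types survive:

```r
# Hypothetical data: 'a' is a factor, 'b' is an integer column.
data <- data.frame(a = letters[1:3], b = 4:6, stringsAsFactors = TRUE)

# lapply() forwards each = 2 to rep() for every column, so each
# value (and therefore each row) is repeated twice in place.
each2 <- as.data.frame(lapply(data, rep, each = 2))

# Column types are kept: factor stays factor, integer stays integer.
sapply(each2, class)
```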

#7


1  

Try using, for example,

N=2
rep(1:4, each = N) 

as a row index.

#8


1  

My solution is similar to mefa:::rep.data.frame, but a little faster, and it takes care of row names:

rep.data.frame <- function(x, times) {
    rnames <- attr(x, "row.names")
    x <- lapply(x, rep.int, times = times)
    class(x) <- "data.frame"
    if (!is.numeric(rnames))
        attr(x, "row.names") <- make.unique(rep.int(rnames, times))
    else
        attr(x, "row.names") <- .set_row_names(length(rnames) * times)
    x
}

Compare solutions:

library(Lahman)
library(microbenchmark)
microbenchmark(
    mefa:::rep.data.frame(Batting, 10),
    rep.data.frame(Batting, 10),
    Batting[rep.int(seq_len(nrow(Batting)), 10), ],
    times = 10
)
#> Unit: milliseconds
#>                                            expr       min       lq     mean   median        uq       max neval cld
#>              mefa:::rep.data.frame(Batting, 10) 127.77786 135.3480 198.0240 148.1749  278.1066  356.3210    10  a 
#>                     rep.data.frame(Batting, 10)  79.70335  82.8165 134.0974  87.2587  191.1713  307.4567    10  a 
#>  Batting[rep.int(seq_len(nrow(Batting)), 10), ] 895.73750 922.7059 981.8891 956.3463 1018.2411 1127.3927    10   b

#9


0  

Another way to do this would be to first get the row indices, append extra copies of the df, and then order by the indices:

df$index = 1:nrow(df)
df = rbind(df,df)
df = df[order(df$index),][,-ncol(df)]

Although the other solutions may be shorter, this method may be more advantageous in certain situations.
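The code above doubles the rows. A possible generalization of the same index-append-order idea to any N is sketched below; the data and the helper column name `index` are assumptions for illustration, and the column is dropped by name at the end:

```r
# Hypothetical two-row data.frame.
df <- data.frame(x = 1:2, y = c("a", "b"), stringsAsFactors = FALSE)
N <- 3

# Tag each row with its original position.
df$index <- seq_len(nrow(df))

# Stack N copies of the data.frame on top of each other.
stacked <- do.call(rbind, replicate(N, df, simplify = FALSE))

# Order by the original position, then drop the helper column.
out <- stacked[order(stacked$index), names(stacked) != "index"]
rownames(out) <- NULL
out
```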
