I have several data frames that I want to combine by row. In the resulting single data frame, I want to create a new variable identifying which data set the observation came from.
# original data frames
df1 <- data.frame(x = c(1, 3), y = c(2, 4))
df2 <- data.frame(x = c(5, 7), y = c(6, 8))
# desired, combined data frame
df3 <- data.frame(x = c(1, 3, 5, 7), y = c(2, 4, 6, 8),
                  source = c("df1", "df1", "df2", "df2"))
# x y source
# 1 2 df1
# 3 4 df1
# 5 6 df2
# 7 8 df2
How can I achieve this? Thanks in advance!
6 Answers
#1
16
It's not exactly what you asked for, but it's pretty close. Put your objects in a named list and use do.call(rbind, ...):
> do.call(rbind, list(df1 = df1, df2 = df2))
x y
df1.1 1 2
df1.2 3 4
df2.1 5 6
df2.2 7 8
Notice that the row names now reflect the source data.frames.
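If an explicit column is preferred over row names, the prefix can be pulled back out of the row names afterwards. This is only a minimal sketch: it assumes none of the data frame names themselves end in a dot followed by digits (the regular expression would strip that too), and the name combined is purely illustrative.

```r
df1 <- data.frame(x = c(1, 3), y = c(2, 4))
df2 <- data.frame(x = c(5, 7), y = c(6, 8))

combined <- do.call(rbind, list(df1 = df1, df2 = df2))

# Row names look like "df1.1", "df1.2", ...; drop the trailing ".<row number>"
combined$source <- sub("\\.[0-9]+$", "", rownames(combined))

# Reset the row names to the default 1, 2, 3, ...
rownames(combined) <- NULL
combined
```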
Update: Use cbind and rbind
Another option is to make a basic function like the following:
AppendMe <- function(dfNames) {
  do.call(rbind, lapply(dfNames, function(x) {
    cbind(get(x), source = x)
  }))
}
This function then takes a character vector of the data.frame names that you want to "stack", as follows:
> AppendMe(c("df1", "df2"))
x y source
1 1 2 df1
2 3 4 df1
3 5 6 df2
4 7 8 df2
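One thing to keep in mind is that get(x) looks names up through the function's enclosing environments, which in practice means the global workspace when AppendMe is defined there. The sketch below is a hypothetical variant (append_me and its envir argument are not part of the original answer) that lets the caller say where the data frames live, so it can also be used from inside another function.

```r
df1 <- data.frame(x = c(1, 3), y = c(2, 4))
df2 <- data.frame(x = c(5, 7), y = c(6, 8))

# Hypothetical variant of AppendMe: `envir` is where the names are looked up.
# By default that is the environment the function was called from.
append_me <- function(df_names, envir = parent.frame()) {
  dfs <- mget(df_names, envir = envir, inherits = TRUE)
  # Add the source column to each data frame, then stack them
  tagged <- Map(function(d, nm) cbind(d, source = nm), dfs, df_names)
  do.call(rbind, unname(tagged))
}

append_me(c("df1", "df2"))
```

Dropping the list names with unname() before rbind keeps the row names as plain 1, 2, 3, ... instead of df1.1, df1.2, and so on.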
Update 2: Use combine from the "gdata" package
> library(gdata)
> combine(df1, df2)
x y source
1 1 2 df1
2 3 4 df1
3 5 6 df2
4 7 8 df2
Update 3: Use rbindlist from "data.table"
Another approach available now is rbindlist from "data.table". With that, the approach could be:
> library(data.table)
> rbindlist(mget(ls(pattern = "df\\d+")), idcol = TRUE)
.id x y
1: df1 1 2
2: df1 3 4
3: df2 5 6
4: df2 7 8
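Note that ls(pattern = "df\\d+") collects every matching object in the workspace, so it would also pick up the df3 from the question if that had been created. A hedged alternative is to pass an explicit named list and give idcol the column name you want. The sketch below assumes "data.table" is installed and simply skips the call otherwise.

```r
df1 <- data.frame(x = c(1, 3), y = c(2, 4))
df2 <- data.frame(x = c(5, 7), y = c(6, 8))

if (requireNamespace("data.table", quietly = TRUE)) {
  # The list names become the values of the id column, here called "source"
  combined <- data.table::rbindlist(list(df1 = df1, df2 = df2), idcol = "source")
  print(combined)
}
```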
Update 4: Use map_df from "purrr"
Similar to rbindlist, you can also use map_df from "purrr" with I or c as the function to apply to each list element.
> library(purrr)
> mget(ls(pattern = "df\\d+")) %>% map_df(I, .id = "src")
Source: local data frame [4 x 3]
src x y
(chr) (int) (int)
1 df1 1 2
2 df1 3 4
3 df2 5 6
4 df2 7 8
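In more recent purrr releases (1.0.0 onward, as far as I know) map_df is marked as superseded, and list_rbind with a names_to argument covers the same ground. The following is a sketch under that assumption; it is guarded so that it only runs when a new enough purrr is installed.

```r
df1 <- data.frame(x = c(1, 3), y = c(2, 4))
df2 <- data.frame(x = c(5, 7), y = c(6, 8))

if (requireNamespace("purrr", quietly = TRUE) &&
    utils::packageVersion("purrr") >= "1.0.0") {
  # names_to = "src" stores each list name in a new column called src
  combined <- purrr::list_rbind(list(df1 = df1, df2 = df2), names_to = "src")
  print(combined)
}
```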
#2
7
I'm not sure if such a function already exists, but this seems to do the trick:
bindAndSource <- function(df1, df2) {
  df1$source <- as.character(match.call())[[2]]
  df2$source <- as.character(match.call())[[3]]
  rbind(df1, df2)
}
Results:
> bindAndSource(df1, df2)
  x y source
1 1 2 df1
2 3 4 df1
3 5 6 df2
4 7 8 df2
Caveat: This will not work in *apply-like calls.
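The reason, as far as I can tell, is that inside lapply, Map and friends the call that match.call sees contains internal expressions rather than the original object names. One possible workaround is to accept the labels explicitly and only fall back to the captured expressions when none are given. The function bind_and_source and its labels argument below are hypothetical, not part of the answer above.

```r
df1 <- data.frame(x = c(1, 3), y = c(2, 4))
df2 <- data.frame(x = c(5, 7), y = c(6, 8))

# Hypothetical variant: `labels` overrides the names taken from the call
bind_and_source <- function(a, b, labels = NULL) {
  if (is.null(labels)) {
    # Fall back to the expressions the caller typed, e.g. "df1" and "df2"
    labels <- c(deparse(substitute(a)), deparse(substitute(b)))
  }
  a$source <- labels[1]
  b$source <- labels[2]
  rbind(a, b)
}

# Direct call: labels are taken from the argument expressions
bind_and_source(df1, df2)

# Inside Map: pass the labels explicitly
Map(function(p, q) bind_and_source(p, q, labels = c("first", "second")),
    list(df1), list(df2))
```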
#3
6
A blend of the other two answers:
df1 <- data.frame(x = 1:3, y = 1:3)
df2 <- data.frame(x = 4:6, y = 4:6)
foo <- function(...) {
  args <- list(...)
  result <- do.call(rbind, args)
  result$source <- rep(as.character(match.call()[-1]), times = sapply(args, nrow))
  result
}
> foo(df1,df2,df1)
x y source
1 1 1 df1
2 2 2 df1
3 3 3 df1
4 4 4 df2
5 5 5 df2
6 6 6 df2
7 1 1 df1
8 2 2 df1
9 3 3 df1
If you want to avoid the match.call business, you can always limit yourself to naming the function arguments (i.e. df1 = df1, df2 = df2) and using names(args) to access the names.
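The named-argument idea could look roughly like this. It is a sketch rather than a drop-in replacement: bind_named is a made-up name, and the function insists that every argument is named.

```r
df1 <- data.frame(x = 1:3, y = 1:3)
df2 <- data.frame(x = 4:6, y = 4:6)

bind_named <- function(...) {
  args <- list(...)
  nms <- names(args)
  if (is.null(nms) || any(nms == "")) {
    stop("all arguments must be named, e.g. bind_named(df1 = df1, df2 = df2)")
  }
  # unname() keeps the row names as 1, 2, 3, ... instead of df1.1, df1.2, ...
  result <- do.call(rbind, unname(args))
  # Repeat each argument name once per row of the corresponding data frame
  result$source <- rep(nms, times = vapply(args, nrow, integer(1)))
  result
}

bind_named(df1 = df1, df2 = df2, again = df1)
```

Because the labels come from the argument names, the same data frame can be passed twice under different labels.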
#4
6
Another approach using dplyr:
df1 <- data.frame(x = c(1,3), y = c(2,4))
df2 <- data.frame(x = c(5,7), y = c(6,8))
df3 <- dplyr::bind_rows(list(df1=df1, df2=df2), .id = 'source')
df3
Source: local data frame [4 x 3]
source x y
(chr) (dbl) (dbl)
1 df1 1 2
2 df1 3 4
3 df2 5 6
4 df2 7 8
#5
2
Another workaround for this one is using ldply in the plyr package...
library(plyr)
df1 <- data.frame(x = c(1,3), y = c(2,4))
df2 <- data.frame(x = c(5,7), y = c(6,8))
# named dfs rather than list, to avoid masking base::list
dfs <- list(df1 = df1, df2 = df2)
df3 <- ldply(dfs)
df3
.id x y
df1 1 2
df1 3 4
df2 5 6
df2 7 8
#6
0
Even though there are already some great answers here, I just wanted to add the one I have been using. It is base R, so it might be less limiting if you want to use it in a package, and it is a little faster than some of the other base R solutions.
library(microbenchmark)
dfs <- list(df1 = data.frame("x"=c(1,2), "y"=2),
            df2 = data.frame("x"=c(2,4), "y"=4),
            df3 = data.frame("x"=2, "y"=c(4,5,7)))
> microbenchmark(cbind(do.call(rbind,dfs),
                       rep(names(dfs), vapply(dfs, nrow, numeric(1)))), times = 1001)
Unit: microseconds
min lq mean median uq max neval
393.541 409.083 454.9913 433.422 453.657 6157.649 1001
The first part, do.call(rbind, dfs), binds the rows of the data frames into a single data frame. vapply(dfs, nrow, numeric(1)) finds how many rows each data frame has, and that is passed to rep in rep(names(dfs), vapply(dfs, nrow, numeric(1))) to repeat the name of each data frame once for each of its rows. cbind puts them all together.
This is similar to a previously posted solution, but about 2x faster.
> microbenchmark(do.call(rbind,
lapply(names(dfs), function(x) cbind(dfs[[x]], source = x))),
times = 1001)
Unit: microseconds
min lq mean median uq max neval
844.558 870.071 1034.182 896.464 1210.533 8867.858 1001
I am not 100% certain, but I believe the speedup is due to making a single call to cbind rather than one per data frame.
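For everyday use outside of a benchmark, the single-cbind approach from this answer can be written so that the new column gets a proper name. In the benchmarked expression the column is unnamed, so it ends up with a name derived from the rep(...) expression. The sketch below is one way to do it; the names n_rows and combined are illustrative, and unname() is an addition that keeps the row names as plain numbers.

```r
dfs <- list(df1 = data.frame(x = c(1, 2), y = 2),
            df2 = data.frame(x = c(2, 4), y = 4),
            df3 = data.frame(x = 2, y = c(4, 5, 7)))

# Number of rows in each data frame
n_rows <- vapply(dfs, nrow, integer(1))

# Stack all rows once, then attach the repeated names in a single cbind call
combined <- cbind(do.call(rbind, unname(dfs)),
                  source = rep(names(dfs), times = n_rows))
combined
```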