I have a list of dataframes that I eventually want to merge while keeping a record of each row's original dataframe name or list index. This will let me subset by source across all the rows. To do this, I would like to add a new variable 'id' to every dataframe, containing the name or index of the dataframe it belongs to.
Edit: In my real code the dataframes are created by reading multiple files with the following code, so I don't have actual names, only those in the 'files.to.read' list, and I'm unsure whether they will align with the dataframe order:
mylist <- llply(files.to.read, read.csv)
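On the alignment worry: lapply (and plyr's llply) return their results in the same order as the input, so element i of mylist comes from files.to.read[i]. A minimal sketch, assuming base R only and two throwaway CSV files written to a temporary directory (the file names a_D.csv and b_D.csv are made up for illustration), that attaches the file names to the list so names() has something to return:

```r
# Hypothetical setup: write two small CSV files to a temporary directory.
datafolder <- tempfile("demo_")
dir.create(datafolder)
write.csv(data.frame(x = 1:2, y = 11:12),
          file.path(datafolder, "a_D.csv"), row.names = FALSE)
write.csv(data.frame(x = 3:5, y = 13:15),
          file.path(datafolder, "b_D.csv"), row.names = FALSE)

files.to.read <- list.files(datafolder, pattern = "_D\\.csv$", full.names = TRUE)

# lapply keeps the input order, so mylist[[i]] was read from files.to.read[i].
mylist <- lapply(files.to.read, read.csv)

# Name each element after its source file.
names(mylist) <- basename(files.to.read)

names(mylist)
```

With the list named this way, any of the name-based approaches can use the file name as the id.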
A few methods have been highlighted in several posts: Working-with-dataframes-in-a-list-drop-variables-add-new-ones and Using-lapply-with-changing-arguments
I have tried two similar methods, the first using the index list:
df1 <- data.frame(x=c(1:5),y=c(11:15))
df2 <- data.frame(x=c(1:5),y=c(11:15))
mylist <- list(df1,df2)
# Adds a new column 'id' with a value of 5 to every row in every dataframe.
# I WANT to change the value based on the list index.
mylist1 <- lapply(mylist,
  function(x){
    x$id <- 5
    return (x)
  }
)
#Example of what I WANT, instead of '5'.
#> mylist1
#[[1]]
#x y id
#1 1 11 1
#2 2 12 1
#3 3 13 1
#4 4 14 1
#5 5 15 1
#
#[[2]]
#x y id
#1 1 11 2
#2 2 12 2
#3 3 13 2
#4 4 14 2
#5 5 15 2
The second attempts to pass the names() of the list.
# I WANT it to add a new column 'id' with the name of the respective dataframe
# to every row in every dataframe.
mylist2 <- lapply(names(mylist),
  function(x){
    portfolio.results[[x]]$id <- "dataframe name here"
    return (portfolio.results[[x]])
  }
)
#Example of what I WANT, instead of 'dataframe name here'.
# mylist2
#[[1]]
#x y id
#1 1 11 df1
#2 2 12 df1
#3 3 13 df1
#4 4 14 df1
#5 5 15 df1
#
#[[2]]
#x y id
#1 1 11 df2
#2 2 12 df2
#3 3 13 df2
#4 4 14 df2
#5 5 15 df2
But names() returns NULL for my list of dataframes. Could I use seq_along(mylist) in the first example?
Any ideas, or a better way to handle the whole "merge with source id" problem?
Edit - Added solution below: I've implemented a solution using Hadley's suggestion and Tommy's nudge, which looks something like this:
files.to.read <- list.files(datafolder, pattern="_D\\.csv$", full.names=FALSE)
mylist <- llply(files.to.read, read.csv)
all <- do.call("rbind", mylist)
all$id <- rep(files.to.read, sapply(mylist, nrow))
I used the files.to.read vector as the id for each dataframe.
I also stopped using merge_recurse(), shown below, because it was very slow for some reason.
all <- merge_recurse(mylist)
Thanks everyone.
4 Solutions
#1
17
Personally, I think it's easier to add the names after collapse:
df1 <- data.frame(x=c(1:5),y=c(11:15))
df2 <- data.frame(x=c(1:5),y=c(11:15))
mylist <- list(df1 = df1, df2 = df2)
all <- do.call("rbind", mylist)
all$id <- rep(names(mylist), sapply(mylist, nrow))
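This works because rep() accepts a vector as its second argument (times): each name is repeated once per row of its data frame, which matches the row order that rbind produces. A small sketch with data frames of different sizes (the names a, b and combined are made up for illustration):

```r
a <- data.frame(x = 1:2, y = 11:12)
b <- data.frame(x = 1:3, y = 21:23)
mylist <- list(a = a, b = b)

# Number of rows contributed by each data frame.
n_rows <- sapply(mylist, nrow)

# Repeat each name once per row of its data frame.
id <- rep(names(mylist), n_rows)

# Stack the data frames, then attach the id column.
combined <- do.call("rbind", mylist)
combined$id <- id
combined
```

Here id is "a" for the first two rows and "b" for the remaining three, so the approach does not depend on the data frames having equal lengths.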
#2
8
Your first attempt was very close. By using indices instead of values it will work. Your second attempt failed because you didn't name the elements in your list.
Both solutions below use the fact that lapply can pass extra parameters (mylist) to the function.
df1 <- data.frame(x=c(1:5),y=c(11:15))
df2 <- data.frame(x=c(1:5),y=c(11:15))
mylist <- list(df1=df1,df2=df2) # Name each data.frame!
# names(mylist) <- c("df1", "df2") # Alternative way of naming...
# Use indices - and pass in mylist
mylist1 <- lapply(seq_along(mylist),
  function(i, x){
    x[[i]]$id <- i
    return (x[[i]])
  }, mylist
)
# Now the names work - but I pass in mylist instead of using portfolio.results.
mylist2 <- lapply(names(mylist),
  function(n, x){
    x[[n]]$id <- n
    return (x[[n]])
  }, mylist
)
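As an alternative sketch (not part of the answer above), base R's Map() can walk the data frames and their names in parallel, so the function receives each data frame directly instead of indexing into the list. The helper add_id is a hypothetical name introduced for this example:

```r
df1 <- data.frame(x = 1:5, y = 11:15)
df2 <- data.frame(x = 1:5, y = 11:15)
mylist <- list(df1 = df1, df2 = df2)

# add_id is a hypothetical helper: attach an id column and return the data frame.
add_id <- function(df, id) {
  df$id <- id
  df
}

# Pair each data frame with its name, or with its position in the list.
by_name  <- Map(add_id, mylist, names(mylist))
by_index <- Map(add_id, mylist, seq_along(mylist))
```

Unlike lapply(names(mylist), ...), which returns an unnamed list, Map() keeps the names of mylist on the result.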
#3
2
names() could work if the list had names, but you didn't give it any. It's an unnamed list. You will need to use numeric indices:
> for(i in 1:length(mylist) ){ mylist[[i]] <- cbind(mylist[[i]], id=rep(i, nrow(mylist[[i]]) ) ) }
> mylist
[[1]]
x y id
1 1 11 1
2 2 12 1
3 3 13 1
4 4 14 1
5 5 15 1
[[2]]
x y id
1 1 11 2
2 2 12 2
3 3 13 2
4 4 14 2
5 5 15 2
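To make the difference concrete, here is a short sketch: list(df1, df2) creates an unnamed list, so names() returns NULL, while list(df1 = df1, df2 = df2) gives names() something to return. The sketch also uses seq_along() in place of 1:length(), which behaves better if the list is ever empty. The variable names unnamed and named are made up for illustration:

```r
df1 <- data.frame(x = 1:5, y = 11:15)
df2 <- data.frame(x = 1:5, y = 11:15)

unnamed <- list(df1, df2)
named   <- list(df1 = df1, df2 = df2)

is.null(names(unnamed))  # TRUE: no names were supplied
names(named)             # "df1" "df2"

# Index-based id column. seq_along() yields an empty sequence for an empty list,
# whereas 1:length() would yield c(1, 0).
for (i in seq_along(unnamed)) {
  unnamed[[i]]$id <- i
}
```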
#4
1
The ldply function from the plyr package could be an answer:
library('plyr')
df1 <- data.frame(x=c(1:5),y=c(11:15))
df2 <- data.frame(x=c(1:5),y=c(11:15))
mylist <- list(df1 = df1, df2 = df2)
all <- ldply(mylist)
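When the list is named, ldply() records each row's source in a column called .id. For readers without plyr installed, a rough base R sketch of the same idea (an approximation of the result, not plyr's implementation; with_id and combined are made-up names) might look like this:

```r
df1 <- data.frame(x = 1:5, y = 11:15)
df2 <- data.frame(x = 1:5, y = 11:15)
mylist <- list(df1 = df1, df2 = df2)

# Prepend a .id column holding the list name to each data frame.
with_id <- Map(function(df, id) cbind(.id = id, df), mylist, names(mylist))

# Stack the data frames into one.
combined <- do.call("rbind", with_id)
rownames(combined) <- NULL  # drop the "df1.1"-style row names that rbind creates
combined
```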