Very often I want to convert a list in which every element has the same named fields, with identical types, into a data frame. For example, I may have a list:
> my.list
[[1]]
[[1]]$global_stdev_ppb
[1] 24267673
[[1]]$range
[1] 0.03114799
[[1]]$tok
[1] "hello"
[[1]]$global_freq_ppb
[1] 211592.6
[[2]]
[[2]]$global_stdev_ppb
[1] 11561448
[[2]]$range
[1] 0.08870838
[[2]]$tok
[1] "world"
[[2]]$global_freq_ppb
[1] 1002043
I want to convert this list to a data frame where each named element is a column. The natural (to me) thing to do is to use do.call:
> my.matrix<-do.call("rbind", my.list)
> my.matrix
global_stdev_ppb range tok global_freq_ppb
[1,] 24267673 0.03114799 "hello" 211592.6
[2,] 11561448 0.08870838 "world" 1002043
Straightforward enough, but when I attempt to cast this matrix as a data frame, the columns remain list elements, rather than vectors:
> my.df<-as.data.frame(my.matrix, stringsAsFactors=FALSE)
> my.df[,1]
[[1]]
[1] 24267673
[[2]]
[1] 11561448
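The reason for the list columns can be checked directly. The following is a minimal sketch, with the example list rebuilt by hand from the printed output above: rbind() applied to lists returns a matrix whose storage mode is "list", and as.data.frame() keeps each column as a list.

```r
# Rebuild the example list from the printed output in the question.
my.list <- list(
  list(global_stdev_ppb = 24267673, range = 0.03114799,
       tok = "hello", global_freq_ppb = 211592.6),
  list(global_stdev_ppb = 11561448, range = 0.08870838,
       tok = "world", global_freq_ppb = 1002043)
)

my.matrix <- do.call("rbind", my.list)

# rbind() on lists yields a list-matrix: every cell is a list element,
# not an atomic value.
typeof(my.matrix)   # "list"
dim(my.matrix)

my.df <- as.data.frame(my.matrix, stringsAsFactors = FALSE)

# Each column of the data frame is therefore still a list.
sapply(my.df, is.list)
```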
Currently, to get the data frame cast properly, I iterate over each column using unlist and as.vector, then recast the data frame as such:
new.list<-lapply(1:ncol(my.matrix), function(x) as.vector(unlist(my.matrix[,x])))
my.df<-as.data.frame(do.call(cbind, new.list), stringsAsFactors=FALSE)
This, however, seems very inefficient. Is there a better way to do this?
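A side note on the workaround above: cbind() builds a single atomic matrix, so the numeric columns are coerced to character along with tok. The sketch below (example data rebuilt by hand; via_cbind and via_list are names introduced only for illustration) contrasts that with handing the list of columns straight to as.data.frame(), which keeps each column's type.

```r
my.list <- list(
  list(global_stdev_ppb = 24267673, range = 0.03114799,
       tok = "hello", global_freq_ppb = 211592.6),
  list(global_stdev_ppb = 11561448, range = 0.08870838,
       tok = "world", global_freq_ppb = 1002043)
)
my.matrix <- do.call("rbind", my.list)

# One atomic vector per column of the list-matrix.
new.list <- lapply(seq_len(ncol(my.matrix)), function(x) unlist(my.matrix[, x]))

# cbind() mixes numeric and character vectors in one matrix,
# so every column ends up as character.
via_cbind <- as.data.frame(do.call(cbind, new.list), stringsAsFactors = FALSE)
sapply(via_cbind, class)

# Skipping cbind() and converting the named list directly keeps the types.
names(new.list) <- colnames(my.matrix)
via_list <- as.data.frame(new.list, stringsAsFactors = FALSE)
sapply(via_list, class)
```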
6 Answers
#1
47
I think you want:
> do.call(rbind, lapply(my.list, data.frame, stringsAsFactors=FALSE))
global_stdev_ppb range tok global_freq_ppb
1 24267673 0.03114799 hello 211592.6
2 11561448 0.08870838 world 1002043.0
> str(do.call(rbind, lapply(my.list, data.frame, stringsAsFactors=FALSE)))
'data.frame': 2 obs. of 4 variables:
$ global_stdev_ppb: num 24267673 11561448
$ range : num 0.0311 0.0887
$ tok : chr "hello" "world"
$ global_freq_ppb : num 211593 1002043
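If this conversion is needed often, the one-liner can be wrapped in a small function. The sketch below is only an illustration: list_to_df is a hypothetical helper name, and the check that all elements share identical names is an added assumption that is stricter than what rbind() itself requires.

```r
# Hypothetical helper (not part of the answer): bind a list of named lists
# into a data frame, one row per list element.
list_to_df <- function(x) {
  # Assumed guard: require identical names in identical order, so a
  # malformed element fails early with a clear message.
  nms <- names(x[[1]])
  same_names <- vapply(x, function(el) identical(names(el), nms), logical(1))
  if (!all(same_names)) stop("all list elements must have identical names")

  # One single-row data frame per element, then stack the rows.
  rows <- lapply(x, data.frame, stringsAsFactors = FALSE)
  do.call(rbind, rows)
}

example <- list(list(a = 1, b = "x"), list(a = 2, b = "y"))
str(list_to_df(example))
```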
#2
31
Another option is:
data.frame(t(sapply(mylist, `[`)))
but this simple manipulation results in a data frame of lists:
> str(data.frame(t(sapply(mylist, `[`))))
'data.frame': 2 obs. of 3 variables:
$ a:List of 2
..$ : num 1
..$ : num 2
$ b:List of 2
..$ : num 2
..$ : num 3
$ c:List of 2
..$ : chr "a"
..$ : chr "b"
A variation along the same lines, which now returns the same result as the other solutions, is:
data.frame(lapply(data.frame(t(sapply(mylist, `[`))), unlist))
[Edit: included timings of @Martin Morgan's two solutions, which have the edge over the other solutions that return a data frame of vectors.] Some representative timings on a very simple problem:
mylist <- list(list(a = 1, b = 2, c = "a"), list(a = 2, b = 3, c = "b"))
> ## @Joshua Ulrich's solution:
> system.time(replicate(1000, do.call(rbind, lapply(mylist, data.frame,
+ stringsAsFactors=FALSE))))
user system elapsed
1.740 0.001 1.750
> ## @JD Long's solution:
> system.time(replicate(1000, do.call(rbind, lapply(mylist, data.frame))))
user system elapsed
2.308 0.002 2.339
> ## my sapply solution No.1:
> system.time(replicate(1000, data.frame(t(sapply(mylist, `[`)))))
user system elapsed
0.296 0.000 0.301
> ## my sapply solution No.2:
> system.time(replicate(1000, data.frame(lapply(data.frame(t(sapply(mylist, `[`))),
+ unlist))))
user system elapsed
1.067 0.001 1.091
> ## @Martin Morgan's Map() sapply() solution:
> f = function(x) function(i) sapply(x, `[[`, i)
> system.time(replicate(1000, as.data.frame(Map(f(mylist), names(mylist[[1]])))))
user system elapsed
0.775 0.000 0.778
> ## @Martin Morgan's Map() lapply() unlist() solution:
> f = function(x) function(i) unlist(lapply(x, `[[`, i), use.names=FALSE)
> system.time(replicate(1000, as.data.frame(Map(f(mylist), names(mylist[[1]])))))
user system elapsed
0.653 0.000 0.658
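Timings aside, it may be worth confirming that the approaches returning vectors really agree. The following sketch runs three of them on the same mylist and compares the resulting columns; the names df_rbind, df_sapply and df_map are introduced here only for illustration.

```r
mylist <- list(list(a = 1, b = 2, c = "a"), list(a = 2, b = 3, c = "b"))

# Approach 1: one single-row data frame per element, then rbind().
df_rbind <- do.call(rbind, lapply(mylist, data.frame, stringsAsFactors = FALSE))

# Approach 2: list-matrix via sapply(), then unlist() each column.
df_sapply <- data.frame(lapply(data.frame(t(sapply(mylist, `[`))), unlist),
                        stringsAsFactors = FALSE)

# Approach 3: pull each field out of every element with Map().
f <- function(x) function(i) unlist(lapply(x, `[[`, i), use.names = FALSE)
df_map <- as.data.frame(Map(f(mylist), names(mylist[[1]])),
                        stringsAsFactors = FALSE)

# Compare the columns only (as.list() drops row names and class).
all.equal(as.list(df_rbind), as.list(df_sapply))
all.equal(as.list(df_rbind), as.list(df_map))
```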
#3
17
I can't tell you this is the "most efficient" in terms of memory or speed, but it's pretty efficient in terms of coding:
my.df <- do.call("rbind", lapply(my.list, data.frame))
The lapply() step with data.frame() turns each list item into a single-row data frame, which then plays nicely with rbind().
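One caveat with this shorter call: without an explicit stringsAsFactors argument, the type of the string column depends on the R version, because the default changed from TRUE to FALSE in R 4.0.0. A minimal sketch that sets the argument both ways to show the difference (as_chr and as_fct are illustrative names):

```r
mylist <- list(list(a = 1, c = "a"), list(a = 2, c = "b"))

# Strings kept as character vectors.
as_chr <- do.call(rbind, lapply(mylist, data.frame, stringsAsFactors = FALSE))

# Strings converted to factors; rbind() merges the levels across rows.
as_fct <- do.call(rbind, lapply(mylist, data.frame, stringsAsFactors = TRUE))

class(as_chr$c)   # "character"
class(as_fct$c)   # "factor"
levels(as_fct$c)
```

Passing the argument explicitly makes the result independent of the R version in use.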
#4
14
Although this question has long since been answered, it's worth pointing out that the data.table package has rbindlist, which accomplishes this task very quickly:
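library(microbenchmark)
library(data.table)
# f() is the Map() helper from @Martin Morgan's answer (#5); it must be
# defined before the benchmark runs. The sapply() variant is shown here.
f <- function(x) function(i) sapply(x, `[[`, i)
l <- replicate(1E4, list(a=runif(1), b=runif(1), c=runif(1)), simplify=FALSE)
microbenchmark( times=5,
R=as.data.frame(Map(f(l), names(l[[1]]))),
dt=data.frame(rbindlist(l))
)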
gives me
Unit: milliseconds
expr min lq median uq max neval
R 31.060119 31.403943 32.278537 32.370004 33.932700 5
dt 2.271059 2.273157 2.600976 2.635001 2.729421 5
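Applied to the small list-of-lists used in the other answers, the call looks like the sketch below. It assumes the data.table package is installed, so the code is guarded with requireNamespace() and does nothing otherwise.

```r
mylist <- list(list(a = 1, b = 2, c = "a"), list(a = 2, b = 3, c = "b"))

if (requireNamespace("data.table", quietly = TRUE)) {
  # rbindlist() treats each inner list as one row and returns a data.table.
  dt <- data.table::rbindlist(mylist)

  # Convert to a plain data frame if downstream code expects one.
  df <- as.data.frame(dt)
  str(df)
}
```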
#5
13
This
f = function(x) function(i) sapply(x, `[[`, i)
is a function that returns a function that extracts the i'th element from each element of x. So
Map(f(mylist), names(mylist[[1]]))
gets a named (thanks, Map!) list of vectors that can be made into a data frame:
as.data.frame(Map(f(mylist), names(mylist[[1]])))
For speed it's usually faster to use unlist(lapply(...), use.names=FALSE), as in:
f = function(x) function(i) unlist(lapply(x, `[[`, i), use.names=FALSE)
A more general variant is
f = function(X, FUN) function(...) sapply(X, FUN, ...)
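To make the general variant concrete, here is a short usage sketch: f() fixes the list and the function up front, and the returned closure receives the remaining arguments later. The name extract is introduced only for this illustration.

```r
mylist <- list(list(a = 1, b = 2, c = "a"), list(a = 2, b = 3, c = "b"))

# General variant from the answer: fix X and FUN now,
# supply FUN's remaining arguments later.
f <- function(X, FUN) function(...) sapply(X, FUN, ...)

# extract(i) is equivalent to sapply(mylist, `[[`, i).
extract <- f(mylist, `[[`)
extract("a")   # the "a" field of every element
extract("c")   # the "c" field of every element

# The same closure plugs into Map() to build the data frame.
my.df <- as.data.frame(Map(extract, names(mylist[[1]])),
                       stringsAsFactors = FALSE)
str(my.df)
```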
When do the list-of-lists structures come up? Maybe there's an earlier step where an iteration could be replaced by something more vectorized?
#6
2
The dplyr package's bind_rows is also efficient:
one <- mtcars[1:4, ]
two <- mtcars[11:14, ]
system.time(dplyr::bind_rows(one, two))
user system elapsed
0.001 0.000 0.001
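The timing above binds two data frames rather than a list of lists. As a hedged sketch of how bind_rows could be applied to the structure from the question, assuming dplyr is installed and that each inner list is a named list of length-one values:

```r
mylist <- list(list(a = 1, b = 2, c = "a"), list(a = 2, b = 3, c = "b"))

if (requireNamespace("dplyr", quietly = TRUE)) {
  # bind_rows() accepts a list whose elements could each be a data frame;
  # every named inner list becomes one row.
  res <- dplyr::bind_rows(mylist)

  # The result is a tibble; convert it if a plain data frame is needed.
  df <- as.data.frame(res)
  str(df)
}
```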