I have a data.frame which I would like to convert to a list by rows, meaning each row would correspond to its own list elements. In other words, I would like a list that is as long as the data.frame has rows.
So far, I've tackled this problem in the following manner, but I was wondering if there's a better way to approach this.
xy.df <- data.frame(x = runif(10), y = runif(10))
# pre-allocate a list and fill it with a loop
xy.list <- vector("list", nrow(xy.df))
for (i in 1:nrow(xy.df)) {
  xy.list[[i]] <- xy.df[i,]
}
11 answers
#1
94
Like this:
xy.list <- split(xy.df, seq(nrow(xy.df)))
And if you want the rownames of xy.df to be the names of the output list, you can do:
xy.list <- setNames(split(xy.df, seq(nrow(xy.df))), rownames(xy.df))
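As a quick check of what split returns, here is a minimal sketch on a small made-up data frame (the name small.df and its values are assumptions for illustration, not part of the original question). It uses seq_len() instead of seq(), since seq(0) yields c(1, 0) for a zero-row data frame.

```r
# Hypothetical example data, not from the original question
small.df <- data.frame(x = c(0.1, 0.2, 0.3), y = c("a", "b", "c"),
                       row.names = c("r1", "r2", "r3"))

# One-row data.frames, one per row; seq_len() is safe for zero-row input
rows <- split(small.df, seq_len(nrow(small.df)))
rows <- setNames(rows, rownames(small.df))

length(rows)          # same as nrow(small.df)
class(rows[["r2"]])   # each element is still a data.frame
rows[["r2"]]$x        # the numeric column is still numeric
```

Each element keeps its columns and their types, so rows[["r2"]]$x works as it would on the full data frame.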
#2
38
Eureka!
xy.list <- as.list(as.data.frame(t(xy.df)))
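One caveat worth illustrating: t() first coerces the data frame to a matrix, so this approach is only type-preserving when all columns share a type. A minimal sketch, with made-up data frames mixed.df and num.df (both are assumptions, not from the original post):

```r
# Hypothetical data frame with one numeric and one character column
mixed.df <- data.frame(x = c(1.5, 2.5), y = c("a", "b"))

# t() goes through as.matrix(), which yields a character matrix here
by.row <- as.list(as.data.frame(t(mixed.df)))

length(by.row)            # one element per row of mixed.df
is.numeric(by.row[[1]])   # FALSE: the numeric values did not stay numeric

# With all-numeric columns the values keep their type
num.df <- data.frame(x = c(1.5, 2.5), y = c(10, 20))
num.row <- as.list(as.data.frame(t(num.df)))
is.numeric(num.row[[1]])  # TRUE
```

Note also that each element is a plain vector rather than a one-row data.frame, so $ access by column name is not available.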
#3
11
If you want to completely abuse the data.frame (as I do) and like to keep the $ functionality, one way is to split your data.frame into one-row data.frames gathered in a list:
> df = data.frame(x=c('a','b','c'), y=3:1)
> df
  x y
1 a 3
2 b 2
3 c 1
# 'convert' into a list of data.frames
> ldf = lapply(as.list(1:dim(df)[1]), function(x) df[x[1],])
> ldf
[[1]]
  x y
1 a 3

[[2]]
  x y
2 b 2

[[3]]
  x y
3 c 1

# and the 'coolest'
> ldf[[2]]$y
[1] 2
This is not just an intellectual exercise: it lets you 'transform' the data.frame into a list of its rows while keeping $ indexing, which can be useful for further work with lapply (assuming the function you pass to lapply uses $ indexing).
#4
5
It seems that the current version of the purrr package (0.2.2) is the fastest solution:
by_row(x, function(v) list(v)[[1L]], .collate = "list")$.out
Let's compare the most interesting solutions:
data("Batting", package = "Lahman")
x <- Batting[1:10000, 1:10]
library(benchr)
library(purrr)
benchmark(
  split = split(x, seq_len(.row_names_info(x, 2L))),
  mapply = .mapply(function(...) structure(list(...), class = "data.frame", row.names = 1L), x, NULL),
  purrr = by_row(x, function(v) list(v)[[1L]], .collate = "list")$.out
)
Results:
Benchmark summary:
Time units : milliseconds
  expr n.eval   min  lw.qu median   mean  up.qu  max  total relative
 split    100 983.0 1060.0 1130.0 1130.0 1180.0 1450 113000     34.3
mapply    100 826.0  894.0  963.0  972.0 1030.0 1320  97200     29.3
 purrr    100  24.1   28.6   32.9   44.9   40.5  183   4490      1.0
We can also get the same result with Rcpp:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
List df2list(const DataFrame& x) {
    std::size_t nrows = x.rows();
    std::size_t ncols = x.cols();
    CharacterVector nms = x.names();
    List res(no_init(nrows));
    for (std::size_t i = 0; i < nrows; ++i) {
        List tmp(no_init(ncols));
        for (std::size_t j = 0; j < ncols; ++j) {
            switch(TYPEOF(x[j])) {
                case INTSXP: {
                    if (Rf_isFactor(x[j])) {
                        IntegerVector t = as<IntegerVector>(x[j]);
                        RObject t2 = wrap(t[i]);
                        t2.attr("class") = "factor";
                        t2.attr("levels") = t.attr("levels");
                        tmp[j] = t2;
                    } else {
                        tmp[j] = as<IntegerVector>(x[j])[i];
                    }
                    break;
                }
                case LGLSXP: {
                    tmp[j] = as<LogicalVector>(x[j])[i];
                    break;
                }
                case CPLXSXP: {
                    tmp[j] = as<ComplexVector>(x[j])[i];
                    break;
                }
                case REALSXP: {
                    tmp[j] = as<NumericVector>(x[j])[i];
                    break;
                }
                case STRSXP: {
                    tmp[j] = as<std::string>(as<CharacterVector>(x[j])[i]);
                    break;
                }
                default: stop("Unsupported type '%s'.", type2name(x));
            }
        }
        tmp.attr("class") = "data.frame";
        tmp.attr("row.names") = 1;
        tmp.attr("names") = nms;
        res[i] = tmp;
    }
    res.attr("names") = x.attr("row.names");
    return res;
}
Now compare with purrr:
benchmark(
  purrr = by_row(x, function(v) list(v)[[1L]], .collate = "list")$.out,
  rcpp = df2list(x)
)
Results:
Benchmark summary:
Time units : milliseconds
 expr n.eval  min lw.qu median mean up.qu   max total relative
purrr    100 25.2  29.8   37.5 43.4  44.2 159.0  4340      1.1
 rcpp    100 19.0  27.9   34.3 35.8  37.2  93.8  3580      1.0
#5
2
Another alternative using library(purrr), which seems to be a bit quicker on large data.frames:
flatten(by_row(xy.df, ..f = function(x) flatten_chr(x), .labels = FALSE))
#6
2
I was working on this today for a data.frame (really a data.table) with 250 million observations and 35 columns. My goal was to return a list of data.frames (data.tables) each with a single row. That is, I wanted to split each row into a separate data.frame and store these in a list.
Here are two methods I came up with that were roughly 3 times faster than split(dat, seq_len(nrow(dat))) for that data set. Below, I benchmark the three methods on a 7500 row, 5 column data set (iris repeated 50 times).
library(data.table)
library(microbenchmark)
microbenchmark(
  split={dat1 <- split(dat, seq_len(nrow(dat)))},
  setDF={dat2 <- lapply(seq_len(nrow(dat)),
                        function(i) setDF(lapply(dat, "[", i)))},
  attrDT={dat3 <- lapply(seq_len(nrow(dat)),
                         function(i) {
                           tmp <- lapply(dat, "[", i)
                           attr(tmp, "class") <- c("data.table", "data.frame")
                           setDF(tmp)
                         })},
  datList = {datL <- lapply(seq_len(nrow(dat)),
                            function(i) lapply(dat, "[", i))},
  times=20
)
This returns
Unit: milliseconds
    expr      min       lq     mean   median        uq       max neval
   split 861.8126 889.1849 973.5294 943.2288 1041.7206 1250.6150    20
   setDF 459.0577 466.3432 511.2656 482.1943  500.6958  750.6635    20
  attrDT 399.1999 409.6316 461.6454 422.5436  490.5620  717.6355    20
 datList 192.1175 201.9896 241.4726 208.4535  246.4299  411.2097    20
While the differences are not as striking as in my previous test, the straight setDF method is significantly faster than split at all levels of the distribution of runs, and the attr method is typically more than twice as fast.
A fourth method is the extreme champion: a simple nested lapply, returning a nested list. This method exemplifies the cost of constructing a data.frame from a list. Moreover, all methods I tried with the data.frame function were roughly an order of magnitude slower than the data.table techniques.
Data:
dat <- vector("list", 50)
for(i in 1:50) dat[[i]] <- iris
dat <- setDF(rbindlist(dat))
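The nested lapply variant (datList in the benchmark) needs no extra packages. A minimal sketch on the first three rows of iris, where the name small is an assumption for illustration:

```r
# Small stand-in for `dat`; `small` is a hypothetical name
small <- iris[1:3, ]

# Outer lapply walks over rows, inner lapply picks element i of every column
row.lists <- lapply(seq_len(nrow(small)),
                    function(i) lapply(small, "[", i))

length(row.lists)                 # one plain list per row
names(row.lists[[1]])             # column names are kept
row.lists[[1]]$Sepal.Length       # first row's Sepal.Length
class(row.lists[[1]]$Species)     # factor columns stay factors
```

The elements are plain lists rather than data.frames, which is presumably where the speed-up comes from: no data.frame has to be constructed per row.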
#7
1
An alternative is to convert the df to a matrix and then apply the lapply function over it: ldf <- lapply(as.matrix(myDF), function(x)x)
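Note that lapply over a matrix visits every cell, not every row, so the result has nrow * ncol elements. If one element per row is wanted, indexing the rows explicitly is one option. A minimal sketch, with myDF filled with made-up values (an assumption, since the answer does not define it):

```r
# Hypothetical stand-in for myDF
myDF <- data.frame(a = 1:3, b = 4:6)
m <- as.matrix(myDF)

# lapply over the matrix itself iterates over individual cells
cells <- lapply(m, function(x) x)
length(cells)    # nrow(m) * ncol(m)

# Indexing each row gives one named vector per row instead
rows <- lapply(seq_len(nrow(m)), function(i) m[i, ])
length(rows)     # nrow(m)
rows[[2]]        # second row, with names "a" and "b"
```

As with any matrix-based approach, mixed column types would be coerced to a single type by as.matrix().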
#8
1
The best way for me was:
Example data:
Var1 <- c("X1", "X4", "X7")
Var2 <- c("X2", "X5", "X8")
Var3 <- c("X3", "X6", "X9")
data <- data.frame(ID = 1:3, Var1, Var2, Var3, stringsAsFactors = FALSE)
ID Var1 Var2 Var3
 1   X1   X2   X3
 2   X4   X5   X6
 3   X7   X8   X9
We load the BBmisc library:
library(BBmisc)
data$lists<-convertRowsToList(data[,2:4])
And the result will be:
ID Var1 Var2 Var3 lists
 1   X1   X2   X3 list("X1", "X2", "X3")
 2   X4   X5   X6 list("X4", "X5", "X6")
 3   X7   X8   X9 list("X7", "X8", "X9")
#9
0
The by_row function from the purrrlyr package will do this for you.
This example demonstrates:
myfn <- function(row) {
  # row is a tibble with one row, and the same number of columns as the original df
  l <- as.list(row)
  return(l)
}
list_of_lists <- purrrlyr::by_row(df, myfn, .labels=FALSE)$.out
By default, the returned value from myfn is put into a new list column in the df called .out. The $.out at the end of the above statement immediately selects this column, returning a list of lists.
#10
0
As @flodel wrote, this converts your data frame into a list with as many elements as the data frame has rows:
NewList <- split(df, f = seq(nrow(df)))
You can additionally apply a function that keeps only the columns that are not NA in each element of the list:
NewList2 <- lapply(NewList, function(x) x[,!is.na(x)])
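When only one column survives the filter, x[, keep] collapses to a bare vector unless drop = FALSE is given. A minimal sketch of a variant that guards against this, using a made-up data frame na.df (an assumption for illustration):

```r
# Hypothetical data with NAs in different columns per row
na.df <- data.frame(a = c(1, NA), b = c(NA, NA), c = c(3, 4))

row.list <- split(na.df, seq_len(nrow(na.df)))

# Keep only the non-NA columns of each one-row data.frame
no.na <- lapply(row.list, function(x) {
  keep <- !vapply(x, is.na, logical(1))   # one TRUE/FALSE per column
  x[, keep, drop = FALSE]                 # stay a data.frame even with one column
})

names(no.na[[1]])   # "a" "c"
names(no.na[[2]])   # "c"
```

The vapply call relies on each element having exactly one row, which holds for the output of split by row number.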
#11
-1
A more modern solution uses only purrr::transpose:
library(purrr)
iris[1:2,] %>% purrr::transpose()
#> [[1]]
#> [[1]]$Sepal.Length
#> [1] 5.1
#>
#> [[1]]$Sepal.Width
#> [1] 3.5
#>
#> [[1]]$Petal.Length
#> [1] 1.4
#>
#> [[1]]$Petal.Width
#> [1] 0.2
#>
#> [[1]]$Species
#> [1] 1
#>
#>
#> [[2]]
#> [[2]]$Sepal.Length
#> [1] 4.9
#>
#> [[2]]$Sepal.Width
#> [1] 3
#>
#> [[2]]$Petal.Length
#> [1] 1.4
#>
#> [[2]]$Petal.Width
#> [1] 0.2
#>
#> [[2]]$Species
#> [1] 1