I'm having trouble rearranging the following data frame:
set.seed(45)
dat1 <- data.frame(
    name = rep(c("firstName", "secondName"), each=4),
    numbers = rep(1:4, 2),
    value = rnorm(8)
    )

dat1
        name numbers      value
1  firstName       1  0.3407997
2  firstName       2 -0.7033403
3  firstName       3 -0.3795377
4  firstName       4 -0.7460474
5 secondName       1 -0.8981073
6 secondName       2 -0.3347941
7 secondName       3 -0.5013782
8 secondName       4 -0.1745357
I want to reshape it so that each unique "name" variable is a rowname, with the "values" as observations along that row and the "numbers" as colnames. Sort of like this:
        name          1          2          3          4
1  firstName  0.3407997 -0.7033403 -0.3795377 -0.7460474
5 secondName -0.8981073 -0.3347941 -0.5013782 -0.1745357
I've looked at melt and cast and a few other things, but none seem to do the job.
9 solutions
#1
180
Using the reshape function:
reshape(dat1, idvar = "name", timevar = "numbers", direction = "wide")
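The call above names the new columns value.1 to value.4 and keeps the original row names (1 and 5). A minimal sketch of tidying that up to match the layout asked for in the question; the name wide and the clean-up steps are illustrative additions, not part of the original answer:

```r
# Rebuild the example data from the question.
set.seed(45)
dat1 <- data.frame(
  name = rep(c("firstName", "secondName"), each = 4),
  numbers = rep(1:4, 2),
  value = rnorm(8)
)

wide <- reshape(dat1, idvar = "name", timevar = "numbers", direction = "wide")

# reshape() prefixes each new column with the value column's name ("value.1", ...).
# Strip that prefix to get plain 1..4 headers.
names(wide) <- sub("^value\\.", "", names(wide))

# Row names are inherited from the first row of each group; reset them.
rownames(wide) <- NULL
wide
```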
#2
95
The new (in 2014) tidyr package also does this simply, with gather()/spread() being the terms for melt/cast.
library(tidyr)
spread(dat1, key = numbers, value = value)
From github,

tidyr is a reframing of reshape2 designed to accompany the tidy data framework, and to work hand-in-hand with magrittr and dplyr to build a solid pipeline for data analysis.

Just as reshape2 did less than reshape, tidyr does less than reshape2. It's designed specifically for tidying data, not the general reshaping that reshape2 does, or the general aggregation that reshape did. In particular, built-in methods only work for data frames, and tidyr provides no margins or aggregation.
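Since tidyr 1.0.0, spread() is superseded by pivot_wider(), which does the same job with more explicit argument names. A short sketch, assuming a tidyr version of 1.0.0 or later is installed; the name wide is only illustrative:

```r
library(tidyr)

# Rebuild the example data from the question.
set.seed(45)
dat1 <- data.frame(
  name = rep(c("firstName", "secondName"), each = 4),
  numbers = rep(1:4, 2),
  value = rnorm(8)
)

# names_from supplies the new column headers, values_from the cell contents.
wide <- pivot_wider(dat1, names_from = numbers, values_from = value)
wide
```

spread() still works, so the call above is an alternative rather than a required change.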
#3
62
You can do this with the reshape() function, or with the melt() / cast() functions in the reshape package. For the second option, example code is
library(reshape)
cast(dat1, name ~ numbers)
Or using reshape2:
library(reshape2)
dcast(dat1, name ~ numbers)
#4
26
Another option if performance is a concern is to use data.table's extension of reshape2's melt & dcast functions

(Reference: Efficient reshaping using data.tables)
library(data.table)

setDT(dat1)
dcast(dat1, name ~ numbers, value.var = "value")

#          name          1          2         3         4
# 1:  firstName  0.1836433 -0.8356286 1.5952808 0.3295078
# 2: secondName -0.8204684  0.4874291 0.7383247 0.5757814
And, as of data.table v1.9.6, we can cast on multiple columns:
## add an extra column
dat1[, value2 := value * 2]

## cast multiple value columns
dcast(dat1, name ~ numbers, value.var = c("value", "value2"))

#          name    value_1    value_2   value_3   value_4   value2_1   value2_2 value2_3  value2_4
# 1:  firstName  0.1836433 -0.8356286 1.5952808 0.3295078  0.3672866 -1.6712572 3.190562 0.6590155
# 2: secondName -0.8204684  0.4874291 0.7383247 0.5757814 -1.6409368  0.9748581 1.476649 1.1515627
#5
22
Using your example dataframe, we could:
xtabs(value ~ name + numbers, data = dat1)
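Note that xtabs() returns a contingency table rather than a data frame, sums the values when a name/numbers pair occurs more than once, and fills missing pairs with 0 instead of NA. A small sketch of converting the result to a data frame, assuming each pair occurs exactly once as in the example data; the names tab and wide are illustrative:

```r
# Rebuild the example data from the question.
set.seed(45)
dat1 <- data.frame(
  name = rep(c("firstName", "secondName"), each = 4),
  numbers = rep(1:4, 2),
  value = rnorm(8)
)

tab <- xtabs(value ~ name + numbers, data = dat1)

# as.data.frame(tab) would return the long format again;
# as.data.frame.matrix() keeps the wide layout, with names as row names.
wide <- as.data.frame.matrix(tab)
wide
```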
#6
14
Two other options:
Base package:
df <- unstack(dat1, form = value ~ numbers)
rownames(df) <- unique(dat1$name)
df
sqldf package:
library(sqldf)
sqldf('SELECT name,
      MAX(CASE WHEN numbers = 1 THEN value ELSE NULL END) x1,
      MAX(CASE WHEN numbers = 2 THEN value ELSE NULL END) x2,
      MAX(CASE WHEN numbers = 3 THEN value ELSE NULL END) x3,
      MAX(CASE WHEN numbers = 4 THEN value ELSE NULL END) x4
      FROM dat1
      GROUP BY name')
#7
7
Using the base R aggregate function:
aggregate(value ~ name, dat1, I)

#         name value.1 value.2 value.3 value.4
#1  firstName  0.4145 -0.4747  0.0659 -0.5024
#2 secondName -0.8259  0.1669 -0.8962  0.1681
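One caveat with this approach: the result has only two columns, name and a single matrix column called value, even though it prints as four. It also collects values in row order, so it assumes every group lists its numbers in the same order. A hedged sketch of expanding the matrix into ordinary columns; the names agg and wide are illustrative:

```r
# Rebuild the example data from the question.
set.seed(45)
dat1 <- data.frame(
  name = rep(c("firstName", "secondName"), each = 4),
  numbers = rep(1:4, 2),
  value = rnorm(8)
)

agg <- aggregate(value ~ name, dat1, I)
ncol(agg)  # counts the matrix as one column

# Passing the columns to data.frame() splits the matrix into value.1 ... value.4.
wide <- do.call(data.frame, agg)
ncol(wide)
wide
```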
#8
4
There's a very powerful new package from genius data scientists at Win-Vector (the folks that made vtreat, seplyr and replyr) called cdata. It implements "coordinated data" principles described in this document and also in this blog post. The idea is that regardless of how you organize your data, it should be possible to identify individual data points using a system of "data coordinates". Here's an excerpt from the recent blog post by John Mount:
The whole system is based on two primitives or operators cdata::moveValuesToRowsD() and cdata::moveValuesToColumnsD(). These operators have pivot, un-pivot, one-hot encode, transpose, moving multiple rows and columns, and many other transforms as simple special cases.
It is easy to write many different operations in terms of the cdata primitives. These operators can work-in memory or at big data scale (with databases and Apache Spark; for big data use the cdata::moveValuesToRowsN() and cdata::moveValuesToColumnsN() variants). The transforms are controlled by a control table that itself is a diagram of (or picture of) the transform.
We will first build the control table (see blog post for details) and then perform the move of data from rows to columns.
library(cdata)

# first build the control table
pivotControlTable <- buildPivotControlTableD(table = dat1, # reference to dataset
                        columnToTakeKeysFrom = 'numbers', # this will become column headers
                        columnToTakeValuesFrom = 'value', # this contains data
                        sep="_") # optional for making column names

# perform the move of data to columns
dat_wide <- moveValuesToColumnsD(tallTable = dat1, # reference to dataset
                    keyColumns = c('name'), # this(these) column(s) should stay untouched
                    controlTable = pivotControlTable # control table above
                    )
dat_wide

#>         name  numbers_1  numbers_2  numbers_3  numbers_4
#> 1  firstName  0.3407997 -0.7033403 -0.3795377 -0.7460474
#> 2 secondName -0.8981073 -0.3347941 -0.5013782 -0.1745357
#9
2
The base reshape function works perfectly fine:
df <- data.frame(
  year   = c(rep(2000, 12), rep(2001, 12)),
  month  = rep(1:12, 2),
  values = rnorm(24)
)
df_wide <- reshape(df, idvar="year", timevar="month", v.names="values", direction="wide", sep="_")
df_wide
Where
- idvar is the column of classes that separates rows
- timevar is the column of classes to cast wide
- v.names is the column containing numeric values
- direction specifies wide or long format
- the optional sep argument is the separator used in between timevar class names and v.names in the output data.frame.
If no idvar exists, create one before using the reshape() function:
df$id <- c(rep("year1", 12), rep("year2", 12))
df_wide <- reshape(df, idvar="id", timevar="month", v.names="values", direction="wide", sep="_")
df_wide
Just remember that idvar is required! The timevar and v.names part is easy. The output of this function is more predictable than some of the others, as everything is explicitly defined.
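To illustrate the "explicitly defined" point, here is a minimal sketch that checks the column names produced by sep and then converts the wide result back to long format. The set.seed() call and the name df_long are assumptions added only to make the sketch reproducible; the round trip relies on the attributes that reshape() stores on its own output.

```r
set.seed(1)  # assumption: seed added only for reproducibility
df <- data.frame(
  year   = c(rep(2000, 12), rep(2001, 12)),
  month  = rep(1:12, 2),
  values = rnorm(24)
)

df_wide <- reshape(df, idvar = "year", timevar = "month",
                   v.names = "values", direction = "wide", sep = "_")

# Wide columns are named <v.names><sep><timevar value>.
names(df_wide)[1:4]

# A data frame produced by reshape() remembers how it was built,
# so it can be turned back into long format without extra arguments.
df_long <- reshape(df_wide, direction = "long")
dim(df_long)
```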