I'm having trouble rearranging the following data frame:
set.seed(45)
dat1 <- data.frame(
name = rep(c("firstName", "secondName"), each=4),
numbers = rep(1:4, 2),
value = rnorm(8)
)
dat1
        name numbers      value
1  firstName       1  0.3407997
2  firstName       2 -0.7033403
3  firstName       3 -0.3795377
4  firstName       4 -0.7460474
5 secondName       1 -0.8981073
6 secondName       2 -0.3347941
7 secondName       3 -0.5013782
8 secondName       4 -0.1745357
I want to reshape it so that each unique "name" variable is a rowname, with the "values" as observations along that row and the "numbers" as colnames. Sort of like this:
        name          1          2          3          4
1  firstName  0.3407997 -0.7033403 -0.3795377 -0.7460474
5 secondName -0.8981073 -0.3347941 -0.5013782 -0.1745357
I've looked at melt and cast and a few other things, but none seem to do the job.
8 answers
#1
169
Using the reshape function:
reshape(dat1, idvar = "name", timevar = "numbers", direction = "wide")
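Note that reshape() prefixes the new columns with the name of the value column (value.1, value.2, ...) and keeps the original row names (1 and 5). If you want the bare numbers as column names, as in the question, a small cleanup step along these lines should work. This is only a sketch; the variable name wide is my own choice:

```r
# Recreate the example data from the question
set.seed(45)
dat1 <- data.frame(
  name = rep(c("firstName", "secondName"), each = 4),
  numbers = rep(1:4, 2),
  value = rnorm(8)
)

wide <- reshape(dat1, idvar = "name", timevar = "numbers", direction = "wide")

# reshape() names the new columns "value.1" ... "value.4"; drop that prefix
names(wide) <- sub("^value\\.", "", names(wide))

# the result keeps the row names of the first row of each group (1 and 5); reset them
rownames(wide) <- NULL

wide
```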
#2
85
The new (in 2014) tidyr package also does this simply, with gather()/spread() being the terms for melt/cast.
library(tidyr)
spread(dat1, key = numbers, value = value)
From GitHub:
tidyr is a reframing of reshape2 designed to accompany the tidy data framework, and to work hand-in-hand with magrittr and dplyr to build a solid pipeline for data analysis.
Just as reshape2 did less than reshape, tidyr does less than reshape2. It's designed specifically for tidying data, not the general reshaping that reshape2 does, or the general aggregation that reshape did. In particular, built-in methods only work for data frames, and tidyr provides no margins or aggregation.
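For readers on a newer tidyr (1.0.0 or later): spread() still works there but has been superseded by pivot_wider(). Assuming such a version is installed, the equivalent call would look roughly like this (the name wide is just for illustration):

```r
library(tidyr)

# Recreate the example data from the question
set.seed(45)
dat1 <- data.frame(
  name = rep(c("firstName", "secondName"), each = 4),
  numbers = rep(1:4, 2),
  value = rnorm(8)
)

# names_from supplies the new column names, values_from fills the cells
wide <- pivot_wider(dat1, names_from = numbers, values_from = value)
wide
```

The result is a tibble with one row per name and one column per value of numbers.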
#3
61
You can do this with the reshape() function, or with the melt() / cast() functions in the reshape package. For the second option, example code is:
library(reshape)
cast(dat1, name ~ numbers)
Or using reshape2:
library(reshape2)
dcast(dat1, name ~ numbers)
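When value.var is not given, reshape2's dcast() guesses the value column and prints a message saying which one it picked. With more than one candidate column it is probably safer to name it explicitly, something like:

```r
library(reshape2)

# Recreate the example data from the question
set.seed(45)
dat1 <- data.frame(
  name = rep(c("firstName", "secondName"), each = 4),
  numbers = rep(1:4, 2),
  value = rnorm(8)
)

# name the value column explicitly instead of letting dcast() guess it
wide <- dcast(dat1, name ~ numbers, value.var = "value")
wide
```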
#4
22
Another option, if performance is a concern, is to use data.table's extension of reshape2's melt & dcast functions
(Reference: Efficient reshaping using data.tables)
library(data.table)
setDT(dat1)
dcast(dat1, name ~ numbers, value.var = "value")
# (the values below come from a different random draw than the question's dat1)
#          name          1          2         3         4
# 1:  firstName  0.1836433 -0.8356286 1.5952808 0.3295078
# 2: secondName -0.8204684  0.4874291 0.7383247 0.5757814
And, as of data.table v1.9.6, we can cast on multiple columns:
## add an extra column
dat1[, value2 := value * 2]
## cast multiple value columns
dcast(dat1, name ~ numbers, value.var = c("value", "value2"))
#          name    value_1    value_2   value_3   value_4   value2_1   value2_2 value2_3  value2_4
# 1:  firstName  0.1836433 -0.8356286 1.5952808 0.3295078  0.3672866 -1.6712572 3.190562 0.6590155
# 2: secondName -0.8204684  0.4874291 0.7383247 0.5757814 -1.6409368  0.9748581 1.476649 1.1515627
#5
20
Using your example data frame, we could use xtabs():
xtabs(value ~ name + numbers, data = dat1)
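Keep in mind that xtabs() returns a contingency table, not a data frame. It also sums the values of duplicated name/numbers combinations and puts 0 where a combination is missing. If you need a data frame like the one in the question, one way to convert it is sketched below (tab and wide are my own variable names):

```r
# Recreate the example data from the question
set.seed(45)
dat1 <- data.frame(
  name = rep(c("firstName", "secondName"), each = 4),
  numbers = rep(1:4, 2),
  value = rnorm(8)
)

tab <- xtabs(value ~ name + numbers, data = dat1)

# convert the 2-d table to a data frame, keeping "1".."4" as column names
wide <- as.data.frame.matrix(tab)

# move the row names into a regular "name" column
wide <- data.frame(name = rownames(wide), wide,
                   check.names = FALSE, row.names = NULL)
wide
```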
#6
13
Two other options:
Base package:
df <- unstack(dat1, form = value ~ numbers)
rownames(df) <- unique(dat1$name)
df
sqldf package:
library(sqldf)
sqldf('SELECT name,
MAX(CASE WHEN numbers = 1 THEN value ELSE NULL END) x1,
MAX(CASE WHEN numbers = 2 THEN value ELSE NULL END) x2,
MAX(CASE WHEN numbers = 3 THEN value ELSE NULL END) x3,
MAX(CASE WHEN numbers = 4 THEN value ELSE NULL END) x4
FROM dat1
GROUP BY name')
#7
6
Using the base R aggregate function:
aggregate(value ~ name, dat1, I)
#         name value.1 value.2 value.3 value.4
#1  firstName  0.4145 -0.4747  0.0659 -0.5024
#2 secondName -0.8259  0.1669 -0.8962  0.1681
#8
2
There's a very powerful new package from the genius data scientists at Win-Vector (the folks who made vtreat, seplyr and replyr) called cdata. It implements the "coordinated data" principles described in this document and also in this blog post. The idea is that regardless of how you organize your data, it should be possible to identify individual data points using a system of "data coordinates". Here's an excerpt from the recent blog post by John Mount:
The whole system is based on two primitives or operators cdata::moveValuesToRowsD() and cdata::moveValuesToColumnsD(). These operators have pivot, un-pivot, one-hot encode, transpose, moving multiple rows and columns, and many other transforms as simple special cases.
It is easy to write many different operations in terms of the cdata primitives. These operators can work in-memory or at big data scale (with databases and Apache Spark; for big data use the cdata::moveValuesToRowsN() and cdata::moveValuesToColumnsN() variants). The transforms are controlled by a control table that itself is a diagram of (or picture of) the transform.
We will first build the control table (see blog post for details) and then perform the move of data from rows to columns.
library(cdata)
# first build the control table
pivotControlTable <- buildPivotControlTableD(
  table = dat1,                       # reference to dataset
  columnToTakeKeysFrom = 'numbers',   # this will become column headers
  columnToTakeValuesFrom = 'value',   # this contains data
  sep = "_"                           # optional for making column names
)
# perform the move of data to columns
dat_wide <- moveValuesToColumnsD(
  tallTable = dat1,                   # reference to dataset
  keyColumns = c('name'),             # this (these) column(s) should stay untouched
  controlTable = pivotControlTable    # control table above
)
dat_wide
#>         name  numbers_1  numbers_2  numbers_3  numbers_4
#> 1  firstName  0.3407997 -0.7033403 -0.3795377 -0.7460474
#> 2 secondName -0.8981073 -0.3347941 -0.5013782 -0.1745357