How do I reshape data from long to wide format?

Time: 2022-09-16 11:05:02

I'm having trouble rearranging the following data frame:

set.seed(45)
dat1 <- data.frame(
    name = rep(c("firstName", "secondName"), each=4),
    numbers = rep(1:4, 2),
    value = rnorm(8)
    )

dat1
       name  numbers      value
1  firstName       1  0.3407997
2  firstName       2 -0.7033403
3  firstName       3 -0.3795377
4  firstName       4 -0.7460474
5 secondName       1 -0.8981073
6 secondName       2 -0.3347941
7 secondName       3 -0.5013782
8 secondName       4 -0.1745357

I want to reshape it so that each unique value of "name" becomes a row, with the "value" entries as observations along that row and the "numbers" as column names. Sort of like this:

     name          1          2          3         4
1  firstName  0.3407997 -0.7033403 -0.3795377 -0.7460474
5 secondName -0.8981073 -0.3347941 -0.5013782 -0.1745357

I've looked at melt and cast and a few other things, but none seem to do the job.

8 answers

#1


169  

Using the base R reshape() function:

reshape(dat1, idvar = "name", timevar = "numbers", direction = "wide")
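
The result of reshape() keeps the original row names and prefixes the new columns with "value." (value.1 through value.4). If the exact layout from the question is wanted, a minimal sketch of tidying that up follows; the variable name wide is only illustrative:

```r
set.seed(45)
dat1 <- data.frame(
    name = rep(c("firstName", "secondName"), each = 4),
    numbers = rep(1:4, 2),
    value = rnorm(8)
)

wide <- reshape(dat1, idvar = "name", timevar = "numbers", direction = "wide")

# strip the "value." prefix that reshape() adds to the spread columns
names(wide) <- sub("^value\\.", "", names(wide))
# renumber the rows 1..n instead of keeping the original row names
rownames(wide) <- NULL
wide
```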

#2


85  

The new (in 2014) tidyr package also does this simply, with gather()/spread() being the terms for melt/cast.

library(tidyr)
spread(dat1, key = numbers, value = value)

From GitHub:

tidyr is a reframing of reshape2 designed to accompany the tidy data framework, and to work hand-in-hand with magrittr and dplyr to build a solid pipeline for data analysis.

Just as reshape2 did less than reshape, tidyr does less than reshape2. It's designed specifically for tidying data, not the general reshaping that reshape2 does, or the general aggregation that reshape did. In particular, built-in methods only work for data frames, and tidyr provides no margins or aggregation.
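
In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). A minimal sketch of the same reshape with the newer function, assuming a recent tidyr is installed:

```r
library(tidyr)

set.seed(45)
dat1 <- data.frame(
    name = rep(c("firstName", "secondName"), each = 4),
    numbers = rep(1:4, 2),
    value = rnorm(8)
)

# names_from supplies the new column names, values_from the cell contents
wide <- pivot_wider(dat1, names_from = numbers, values_from = value)
wide
```

The result is a tibble with one row per name and one column per distinct value of numbers.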

#3


61  

You can do this with the reshape() function, or with the melt() / cast() functions in the reshape package. For the second option, example code is

library(reshape)
cast(dat1, name ~ numbers)

Or using reshape2

library(reshape2)
# name value.var explicitly rather than letting dcast() guess it
dcast(dat1, name ~ numbers, value.var = "value")

#4


22  

If performance is a concern, another option is data.table's extension of reshape2's melt and dcast functions.

(Reference: Efficient reshaping using data.tables)

library(data.table)

setDT(dat1)
dcast(dat1, name ~ numbers, value.var = "value")

# (the values below came from a different random draw than the dat1 shown in the question)
#          name          1          2         3         4
# 1:  firstName  0.1836433 -0.8356286 1.5952808 0.3295078
# 2: secondName -0.8204684  0.4874291 0.7383247 0.5757814

As of data.table v1.9.6, we can also cast multiple value columns at once:

## add an extra column
dat1[, value2 := value * 2]

## cast multiple value columns
dcast(dat1, name ~ numbers, value.var = c("value", "value2"))

# (the values below came from a different random draw than the dat1 shown in the question)
#          name    value_1    value_2   value_3   value_4   value2_1   value2_2 value2_3  value2_4
# 1:  firstName  0.1836433 -0.8356286 1.5952808 0.3295078  0.3672866 -1.6712572 3.190562 0.6590155
# 2: secondName -0.8204684  0.4874291 0.7383247 0.5757814 -1.6409368  0.9748581 1.476649 1.1515627

#5


20  

Using your example dataframe, we could:

xtabs(value ~ name + numbers, data = dat1)
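
Note that xtabs() returns a contingency table rather than a data frame; it sums value over any repeated name/numbers pairs and fills absent combinations with 0. If a data frame is needed, one possible conversion is as.data.frame.matrix(). A sketch, with tab and wide as illustrative names:

```r
set.seed(45)
dat1 <- data.frame(
    name = rep(c("firstName", "secondName"), each = 4),
    numbers = rep(1:4, 2),
    value = rnorm(8)
)

tab <- xtabs(value ~ name + numbers, data = dat1)

# convert the 2-d table to a data frame; the names end up as row names
wide <- as.data.frame.matrix(tab)
wide
```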

#6


13  

Two other options:

Base package:

df <- unstack(dat1, form = value ~ numbers)
rownames(df) <- unique(dat1$name)
df

sqldf package:

library(sqldf)
sqldf('SELECT name,
      MAX(CASE WHEN numbers = 1 THEN value ELSE NULL END) x1, 
      MAX(CASE WHEN numbers = 2 THEN value ELSE NULL END) x2,
      MAX(CASE WHEN numbers = 3 THEN value ELSE NULL END) x3,
      MAX(CASE WHEN numbers = 4 THEN value ELSE NULL END) x4
      FROM dat1
      GROUP BY name')

#7


6  

Using the base R aggregate() function:

aggregate(value ~ name, dat1, I)

# (the values below came from a different random draw than the dat1 shown in the question)
# name           value.1  value.2  value.3  value.4
#1 firstName      0.4145  -0.4747   0.0659   -0.5024
#2 secondName    -0.8259   0.1669  -0.8962    0.1681
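
The result here has only two columns: name and a single matrix column value, whose four columns print as value.1 to value.4. It works because every name has the same number of rows, and it takes the values in row order rather than matching on numbers. If separate columns are needed, one way to flatten it is the do.call(data.frame, ...) idiom; a sketch, with agg and wide as illustrative names:

```r
set.seed(45)
dat1 <- data.frame(
    name = rep(c("firstName", "secondName"), each = 4),
    numbers = rep(1:4, 2),
    value = rnorm(8)
)

agg <- aggregate(value ~ name, dat1, I)
ncol(agg)          # name plus one matrix column

# data.frame() expands the matrix column into value.1 ... value.4
wide <- do.call(data.frame, agg)
wide
```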

#8


2  

There's a very powerful new package called cdata from genius data scientists at Win-Vector (the folks who made vtreat, seplyr and replyr). It implements the "coordinated data" principles described in this document and also in this blog post. The idea is that regardless of how you organize your data, it should be possible to identify individual data points using a system of "data coordinates". Here's an excerpt from the recent blog post by John Mount:

The whole system is based on two primitives or operators cdata::moveValuesToRowsD() and cdata::moveValuesToColumnsD(). These operators have pivot, un-pivot, one-hot encode, transpose, moving multiple rows and columns, and many other transforms as simple special cases.

It is easy to write many different operations in terms of the cdata primitives. These operators can work in memory or at big data scale (with databases and Apache Spark; for big data use the cdata::moveValuesToRowsN() and cdata::moveValuesToColumnsN() variants). The transforms are controlled by a control table that itself is a diagram of (or picture of) the transform.

We will first build the control table (see blog post for details) and then perform the move of data from rows to columns.

library(cdata)
# first build the control table
pivotControlTable <- buildPivotControlTableD(table = dat1, # reference to dataset
                        columnToTakeKeysFrom = 'numbers', # this will become column headers
                        columnToTakeValuesFrom = 'value', # this contains data
                        sep="_")                          # optional for making column names

# perform the move of data to columns
dat_wide <- moveValuesToColumnsD(tallTable =  dat1, # reference to dataset
                    keyColumns = c('name'),         # this(these) column(s) should stay untouched 
                    controlTable = pivotControlTable # control table above
                    ) 
dat_wide

#>         name  numbers_1  numbers_2  numbers_3  numbers_4
#> 1  firstName  0.3407997 -0.7033403 -0.3795377 -0.7460474
#> 2 secondName -0.8981073 -0.3347941 -0.5013782 -0.1745357
