I want to reshape a wide format dataset that has multiple tests which are measured at 3 time points:
ID Test Year Fall Spring Winter
1 1 2008 15 16 19
1 1 2009 12 13 27
1 2 2008 22 22 24
1 2 2009 10 14 20
2 1 2008 12 13 25
2 1 2009 16 14 21
2 2 2008 13 11 29
2 2 2009 23 20 26
3 1 2008 11 12 22
3 1 2009 13 11 27
3 2 2008 17 12 23
3 2 2009 14 9 31
into a dataset that gives each test its own column but puts the measurement times in long format, like this:
ID Year Time Test1 Test2
1 2008 Fall 15 22
1 2008 Spring 16 22
1 2008 Winter 19 24
1 2009 Fall 12 10
1 2009 Spring 13 14
1 2009 Winter 27 20
2 2008 Fall 12 13
2 2008 Spring 13 11
2 2008 Winter 25 29
2 2009 Fall 16 23
2 2009 Spring 14 20
2 2009 Winter 21 26
3 2008 Fall 11 17
3 2008 Spring 12 12
3 2008 Winter 22 23
3 2009 Fall 13 14
3 2009 Spring 11 9
3 2009 Winter 27 31
I have tried reshape and melt without success. Existing posts only cover reshaping to a single outcome column.
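For anyone who wants to run the answers on the sample above, the wide data can be rebuilt roughly like this. It is only a sketch: the object name df is an assumption (the answers variously call the data df, df1, or mydf), and the values are copied from the first table.

```r
# Rebuild the wide sample data from the question.
# The name `df` is an assumption; rename it to match the answer you try.
df <- data.frame(
  ID     = rep(1:3, each = 4),
  Test   = rep(rep(1:2, each = 2), times = 3),
  Year   = rep(2008:2009, times = 6),
  Fall   = c(15, 12, 22, 10, 12, 16, 13, 23, 11, 13, 17, 14),
  Spring = c(16, 13, 22, 14, 13, 14, 11, 20, 12, 11, 12, 9),
  Winter = c(19, 27, 24, 20, 25, 21, 29, 26, 22, 27, 23, 31)
)

# One row per ID/Test/Year combination, one column per season.
str(df)
```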
4 solutions
#1
17
Using reshape2:
library(reshape2)
# Thanks to Ista for helping with direct naming using "variable.name"
df.m <- melt(df, id.vars = c("ID", "Test", "Year"), variable.name = "Time")
df.m <- transform(df.m, Test = paste0("Test", Test))
dcast(df.m, ID + Year + Time ~ Test, value.var = "value")
Update: Using data.table melt/cast from versions >= 1.9.0:
From version 1.9.0, data.table imports the reshape2 package and implements fast melt and dcast methods in C for data.tables. A comparison of speed on bigger data is shown below.
For more information, see the data.table NEWS.
require(data.table) ## ver. >=1.9.0
require(reshape2)
dt <- as.data.table(df, key=c("ID", "Test", "Year"))
dt.m <- melt(dt, id.var = c("ID", "Test", "Year"), variable.name = "Time")
dt.m[, Test := paste0("Test", Test)]
dcast.data.table(dt.m, ID + Year + Time ~ Test, value.var = "value")
At the moment, you'll have to write dcast.data.table explicitly, as it's not an S3 generic in reshape2 yet.
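As a hedged aside: the following sketch assumes a more recent data.table release than the one discussed above, one that exports its own melt() and dcast() for data.tables, so reshape2 does not need to be loaded. The four-row sample (ID 1 only) is a subset of the question's data, used to keep the example short.

```r
library(data.table)  # assumption: a release that exports its own melt() and dcast()

# Subset of the question's data (ID 1 only).
dt <- data.table(
  ID     = 1L,
  Test   = c(1L, 1L, 2L, 2L),
  Year   = c(2008L, 2009L, 2008L, 2009L),
  Fall   = c(15, 12, 22, 10),
  Spring = c(16, 13, 22, 14),
  Winter = c(19, 27, 24, 20)
)

# Long format: one row per ID/Test/Year/season.
dt.m <- melt(dt, id.vars = c("ID", "Test", "Year"), variable.name = "Time")

# Prefix the test number so the new columns are named Test1, Test2.
dt.m[, Test := paste0("Test", Test)]

# Spread the tests back out into columns.
res <- dcast(dt.m, ID + Year + Time ~ Test, value.var = "value")
print(res)
```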
Benchmarking on bigger data:
# generate data:
set.seed(45L)
DT <- data.table(ID     = sample(1e2, 1e7, TRUE),
                 Test   = sample(1e3, 1e7, TRUE),
                 Year   = sample(2008:2014, 1e7, TRUE),
                 Fall   = sample(50, 1e7, TRUE),
                 Spring = sample(50, 1e7, TRUE),
                 Winter = sample(50, 1e7, TRUE))
DF <- as.data.frame(DT)
reshape2 timings:
reshape2_melt <- function(df) {
  df.m <- melt(df, id.var = c("ID", "Test", "Year"), variable.name = "Time")
}
# min. of three consecutive runs
system.time(df.m <- reshape2_melt(DF))
# user system elapsed
# 43.319 4.909 48.932
df.m <- transform(df.m, Test = paste0("Test", Test))
reshape2_cast <- function(df) {
  # use the function argument rather than the global df.m
  dcast(df, ID + Year + Time ~ Test, value.var = "value")
}
# min. of three consecutive runs
system.time(reshape2_cast(df.m))
# user system elapsed
# 57.728 9.712 69.573
data.table timings:
DT_melt <- function(dt) {
  dt.m <- melt(dt, id.var = c("ID", "Test", "Year"), variable.name = "Time")
}
# min. of three consecutive runs
system.time(dt.m <- DT_melt(DT))
# user system elapsed
# 0.276 0.001 0.279
dt.m[, Test := paste0("Test", Test)]
DT_cast <- function(dt) {
  # use the function argument rather than the global dt.m
  dcast.data.table(dt, ID + Year + Time ~ Test, value.var = "value")
}
# min. of three consecutive runs
system.time(DT_cast(dt.m))
# user system elapsed
# 12.732 0.825 14.006
melt.data.table is ~175x faster than reshape2::melt, and dcast.data.table is ~5x faster than reshape2::dcast.
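The speedup figures follow directly from the elapsed times reported above. A small sketch of the arithmetic, using only those four numbers (the vector names are made up for illustration):

```r
# Elapsed seconds copied from the system.time() output above.
reshape2_elapsed   <- c(melt = 48.932, dcast = 69.573)
data_table_elapsed <- c(melt = 0.279,  dcast = 14.006)

# Ratio of reshape2 time to data.table time for each step.
speedup <- reshape2_elapsed / data_table_elapsed
print(round(speedup, 1))
```

These are single-machine timings from one set of runs, so the ratios are best read as orders of magnitude rather than exact constants.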
#2
2
An alternative using the base reshape function is below. This required calling reshape twice, though there might be a simpler way.
Assuming your dataset is called df1:
tmp <- reshape(df1, idvar = c("ID", "Year"), timevar = "Test", direction = "wide")
result <- reshape(
  tmp,
  idvar = c("ID", "Year"),
  varying = list(3:5, 6:8),
  v.names = c("Test1", "Test2"),
  times = c("Fall", "Spring", "Winter"),
  direction = "long"
)
Which gives:
> result
ID Year time Test1 Test2
1.2008.Fall 1 2008 Fall 15 22
1.2009.Fall 1 2009 Fall 12 10
2.2008.Fall 2 2008 Fall 12 13
2.2009.Fall 2 2009 Fall 16 23
3.2008.Fall 3 2008 Fall 11 17
3.2009.Fall 3 2009 Fall 13 14
1.2008.Spring 1 2008 Spring 16 22
1.2009.Spring 1 2009 Spring 13 14
2.2008.Spring 2 2008 Spring 13 11
2.2009.Spring 2 2009 Spring 14 20
3.2008.Spring 3 2008 Spring 12 12
3.2009.Spring 3 2009 Spring 11 9
1.2008.Winter 1 2008 Winter 19 24
1.2009.Winter 1 2009 Winter 27 20
2.2008.Winter 2 2008 Winter 25 29
2.2009.Winter 2 2009 Winter 21 26
3.2008.Winter 3 2008 Winter 22 23
3.2009.Winter 3 2009 Winter 27 31
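The varying = list(3:5, 6:8) argument depends on the column layout of the intermediate wide table. The sketch below, run on a four-row subset of the question's data (ID 1 only), inspects that layout; it is an illustration rather than part of the original answer.

```r
# Subset of the question's data (ID 1 only), named df1 as in the answer.
df1 <- data.frame(
  ID     = 1,
  Test   = c(1, 1, 2, 2),
  Year   = c(2008, 2009, 2008, 2009),
  Fall   = c(15, 12, 22, 10),
  Spring = c(16, 13, 22, 14),
  Winter = c(19, 27, 24, 20)
)

# First step: one row per ID/Year, with the seasons repeated for each test.
tmp <- reshape(df1, idvar = c("ID", "Year"), timevar = "Test", direction = "wide")

# Columns 3:5 hold the Test 1 seasons and 6:8 the Test 2 seasons,
# which is what varying = list(3:5, 6:8) relies on.
names(tmp)
# "ID" "Year" "Fall.1" "Spring.1" "Winter.1" "Fall.2" "Spring.2" "Winter.2"
```

Because the second call refers to columns by position, it would need adjusting if the input had extra columns or a different number of tests.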
#3
2
Sticking with base R, this is another good candidate for the "stack + reshape" routine. Assuming our dataset is called "mydf":
mydf.temp <- data.frame(mydf[1:3], stack(mydf[4:6]))
mydf2 <- reshape(mydf.temp, direction = "wide",
                 idvar = c("ID", "Year", "ind"),
                 timevar = "Test")
names(mydf2) <- c("ID", "Year", "Time", "Test1", "Test2")
mydf2
# ID Year Time Test1 Test2
# 1 1 2008 Fall 15 22
# 2 1 2009 Fall 12 10
# 5 2 2008 Fall 12 13
# 6 2 2009 Fall 16 23
# 9 3 2008 Fall 11 17
# 10 3 2009 Fall 13 14
# 13 1 2008 Spring 16 22
# 14 1 2009 Spring 13 14
# 17 2 2008 Spring 13 11
# 18 2 2009 Spring 14 20
# 21 3 2008 Spring 12 12
# 22 3 2009 Spring 11 9
# 25 1 2008 Winter 19 24
# 26 1 2009 Winter 27 20
# 29 2 2008 Winter 25 29
# 30 2 2009 Winter 21 26
# 33 3 2008 Winter 22 23
# 34 3 2009 Winter 27 31
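To see why "ind" appears in idvar, it may help to look at what stack() returns. The sketch below uses a four-row subset of the question's data (ID 1 only) and is meant as an illustration of the intermediate step, not as a replacement for the code above.

```r
# Subset of the question's data (ID 1 only), named mydf as in the answer.
mydf <- data.frame(
  ID     = 1,
  Test   = c(1, 1, 2, 2),
  Year   = c(2008, 2009, 2008, 2009),
  Fall   = c(15, 12, 22, 10),
  Spring = c(16, 13, 22, 14),
  Winter = c(19, 27, 24, 20)
)

# stack() concatenates the season columns into a single "values" column and
# records the source column name in a factor called "ind".
stacked <- stack(mydf[4:6])
names(stacked)
# "values" "ind"

# data.frame() recycles the three ID columns to match the stacked rows.
mydf.temp <- data.frame(mydf[1:3], stacked)
head(mydf.temp)
```

The "ind" column is what later becomes "Time" when the columns are renamed.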
#4
1
tidyverse/tidyr solution:
library(dplyr)
library(tidyr)
df %>%
  gather("Time", "Value", Fall, Spring, Winter) %>%
  spread(Test, Value, sep = "")
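In newer tidyr releases, gather() and spread() are superseded by pivot_longer() and pivot_wider(). A roughly equivalent sketch, assuming a tidyr version that provides those functions and using a four-row subset of the question's data (ID 1 only):

```r
library(tidyr)  # assumption: a release that provides pivot_longer()/pivot_wider()

# Subset of the question's data (ID 1 only).
df <- data.frame(
  ID     = 1,
  Test   = c(1, 1, 2, 2),
  Year   = c(2008, 2009, 2008, 2009),
  Fall   = c(15, 12, 22, 10),
  Spring = c(16, 13, 22, 14),
  Winter = c(19, 27, 24, 20)
)

# Seasons to long format: one row per ID/Test/Year/season.
long <- pivot_longer(df, cols = c(Fall, Spring, Winter),
                     names_to = "Time", values_to = "Value")

# Tests to wide format, with new columns named Test1 and Test2.
res <- pivot_wider(long, names_from = Test, values_from = Value,
                   names_prefix = "Test")
print(res)
```

Here names_prefix = "Test" plays the role of sep = "" in the spread() call.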