This question already has an answer here:
这个问题在这里已有答案:
- Reshaping multiple sets of measurement columns (wide format) into single columns (long format) 6 answers
- 将多组测量列(宽格式)整形为单列(长格式)6个答案
I have a data frame like this:
我有这样的数据框:
structure(list(one = structure(1:4, .Label = c("a", "b", "c",
"d"), class = "factor"), two = c(2, 4, 7, 3), x.1 = c("x1a",
"x1b", "x1c", "x1d"), x.2 = c("x2a", "x2b", "x2c", "x2d"), x.3 = c("x3a",
"x3b", "x3c", "x3d"), y.1 = c(NA, "y1b", "y1c", NA), y.2 = c(NA,
"y2b", "y2c", NA), y.3 = c(NA, "y3b", "y3c", NA)), .Names = c("one",
"two", "x.1", "x.2", "x.3", "y.1", "y.2", "y.3"), row.names = c(NA,
-4L), class = "data.frame")
As you can see, the observations per event a, b, c, and d (variable "one") are stored as columns, where x and y define separate observations and 1, 2 and 3 define the variables. Variable "two" does not have a meaning here.
如您所见,每个事件a,b,c和d(变量“one”)的观察结果存储为列,其中x和y定义单独的观察值,1,2和3定义变量。变量“two”在这里没有意义。
I like to reshape this data frame to have it tidy in the form that each observation has it's own row and each variable it's own column.
我喜欢重塑这个数据框,让它整洁,每个观察都有自己的行,每个变量都有自己的列。
The final data frame should look like this:
最终数据框应如下所示:
structure(list(one = structure(c(1L, 2L, 2L, 3L, 3L, 4L), .Label = c("a",
"b", "c", "d"), class = "factor"), two = c(2, 4, 2, 7, 5, 3),
var1 = c("x1a", "x1b", "y1b", "x1c", "y1c", "x1d"), var2 = c("x2a",
"x2b", "y2b", "x2c", "y2c", "x2d"), var3 = c("x3a", "x3b",
"y3b", "x3c", "y3c", "x3d")), .Names = c("one", "two", "var1",
"var2", "var3"), row.names = c(1L, 2L, 5L, 3L, 6L, 4L), class = "data.frame")
I am slightly familiar with what the cast and melt function from the reshape packages do, but was not able yet to figure out a way to reshape the DF in a smart way. For now the following provides the sate that I have gotten to:
我稍微熟悉了重塑包的铸造和熔化功能,但还没有找到一种以智能方式重塑DF的方法。现在,以下内容提供了我所得到的状态:
df.between <- melt(df.in, id.vars=c("one", "two"))
df.between$variable <- gsub("x.|y.", "", df.between$variable)
Now the "variable" column does correctly identify the variable (1, 2 or 3). However, I was not able to cast this into the required form and this solution does not seem to be useful for larger sets of data due to the use of grepl
.
现在,“变量”列可以正确识别变量(1,2或3)。但是,我无法将其转换为所需的格式,并且由于使用了grepl,此解决方案似乎对较大的数据集似乎没有用处。
Happy to get a nudge into the right direction here.
很高兴在这里轻轻推进正确的方向。
3 个解决方案
#1
5
We can use melt
from the devel version of data.table
i.e. v1.9.5
, which can handle multiple patterns
for the measure
variables.
我们可以使用devel版本的data.table,即v1.9.5,它可以处理测量变量的多个模式。
library(data.table)
melt(setDT(df1), measure=patterns('.1', '.2', '.3'),
na.rm=TRUE, value.name=paste0('var', 1:3))[, variable:=NULL][order(one)]
# one two var1 var2 var3
#1: a 2 x1a x2a x3a
#2: b 4 x1b x2b x3b
#3: b 4 y1b y2b y3b
#4: c 7 x1c x2c x3c
#5: c 7 y1c y2c y3c
#6: d 3 x1d x2d x3d
EDIT: We don't need c
inside the patterns
and it will also give exact matches (from @Jaap's comments).
编辑:我们在模式中不需要c,它也会给出完全匹配(来自@ Jaap的评论)。
#2
3
melt
from "data.table" will be much faster than the following, but you could also consider merged.stack
from my "splitstackshape" package:
来自“data.table”的融合将比以下快得多,但您也可以考虑从我的“splitstackshape”包中的merged.stack:
library(splitstackshape)
na.omit(merged.stack(mydf, var.stubs = c(".1", ".2", ".3"),
sep = "var.stubs", atStart = FALSE))
# one two .time_1 .1 .2 .3
# 1: a 2 x x1a x2a x3a
# 2: b 4 x x1b x2b x3b
# 3: b 4 y y1b y2b y3b
# 4: c 7 x x1c x2c x3c
# 5: c 7 y y1c y2c y3c
# 6: d 3 x x1d x2d x3d
#3
2
You were almost there with the reshape-route, so I finished it for you. All you needed was a differentiation between x and y variables. (which are later easy to remove if you don't want or need them). I've left the missings in because they are easy to remove and to prevent silent removal of missing data.
你几乎在那里重塑路线,所以我为你完成了它。您所需要的只是区分x和y变量。 (如果您不想要或不需要它们,以后很容易删除)。我留下了遗漏,因为它们很容易删除,并防止无声删除丢失的数据。
df.between <- melt(df.in, id.vars=c("one", "two"))
#replace with 'var' so no numeric column names.
df.between$variable_n <- gsub("x.|y.", "var", df.between$variable)
df.between$variable_xy <- gsub(".[0-9]","",df.between$variable)
res <- dcast(one+two+variable_xy~variable_n,value.var="value",data=df.between)
> res
one two variable_xy var1 var2 var3
1 a 2 x x1a x2a x3a
2 a 2 y <NA> <NA> <NA>
3 b 4 x x1b x2b x3b
4 b 4 y y1b y2b y3b
5 c 7 x x1c x2c x3c
6 c 7 y y1c y2c y3c
7 d 3 x x1d x2d x3d
8 d 3 y <NA> <NA> <NA>
#1
5
We can use melt
from the devel version of data.table
i.e. v1.9.5
, which can handle multiple patterns
for the measure
variables.
我们可以使用devel版本的data.table,即v1.9.5,它可以处理测量变量的多个模式。
library(data.table)
melt(setDT(df1), measure=patterns('.1', '.2', '.3'),
na.rm=TRUE, value.name=paste0('var', 1:3))[, variable:=NULL][order(one)]
# one two var1 var2 var3
#1: a 2 x1a x2a x3a
#2: b 4 x1b x2b x3b
#3: b 4 y1b y2b y3b
#4: c 7 x1c x2c x3c
#5: c 7 y1c y2c y3c
#6: d 3 x1d x2d x3d
EDIT: We don't need c
inside the patterns
and it will also give exact matches (from @Jaap's comments).
编辑:我们在模式中不需要c,它也会给出完全匹配(来自@ Jaap的评论)。
#2
3
melt
from "data.table" will be much faster than the following, but you could also consider merged.stack
from my "splitstackshape" package:
来自“data.table”的融合将比以下快得多,但您也可以考虑从我的“splitstackshape”包中的merged.stack:
library(splitstackshape)
na.omit(merged.stack(mydf, var.stubs = c(".1", ".2", ".3"),
sep = "var.stubs", atStart = FALSE))
# one two .time_1 .1 .2 .3
# 1: a 2 x x1a x2a x3a
# 2: b 4 x x1b x2b x3b
# 3: b 4 y y1b y2b y3b
# 4: c 7 x x1c x2c x3c
# 5: c 7 y y1c y2c y3c
# 6: d 3 x x1d x2d x3d
#3
2
You were almost there with the reshape-route, so I finished it for you. All you needed was a differentiation between x and y variables. (which are later easy to remove if you don't want or need them). I've left the missings in because they are easy to remove and to prevent silent removal of missing data.
你几乎在那里重塑路线,所以我为你完成了它。您所需要的只是区分x和y变量。 (如果您不想要或不需要它们,以后很容易删除)。我留下了遗漏,因为它们很容易删除,并防止无声删除丢失的数据。
df.between <- melt(df.in, id.vars=c("one", "two"))
#replace with 'var' so no numeric column names.
df.between$variable_n <- gsub("x.|y.", "var", df.between$variable)
df.between$variable_xy <- gsub(".[0-9]","",df.between$variable)
res <- dcast(one+two+variable_xy~variable_n,value.var="value",data=df.between)
> res
one two variable_xy var1 var2 var3
1 a 2 x x1a x2a x3a
2 a 2 y <NA> <NA> <NA>
3 b 4 x x1b x2b x3b
4 b 4 y y1b y2b y3b
5 c 7 x x1c x2c x3c
6 c 7 y y1c y2c y3c
7 d 3 x x1d x2d x3d
8 d 3 y <NA> <NA> <NA>