This question already has an answer here:
- Reshaping multiple sets of measurement columns (wide format) into single columns (long format) 6 answers
I have data from an online survey where respondents go through a loop of questions 1-3 times. The survey software (Qualtrics) records this data in multiple columns; that is, Q3.2 in the survey will have columns Q3.2.1., Q3.2.2., and Q3.2.3.:
df <- data.frame(
  id = 1:10,
  time = as.Date('2009-01-01') + 0:9,
  Q3.2.1. = rnorm(10, 0, 1),
  Q3.2.2. = rnorm(10, 0, 1),
  Q3.2.3. = rnorm(10, 0, 1),
  Q3.3.1. = rnorm(10, 0, 1),
  Q3.3.2. = rnorm(10, 0, 1),
  Q3.3.3. = rnorm(10, 0, 1)
)
# Sample data
id time Q3.2.1. Q3.2.2. Q3.2.3. Q3.3.1. Q3.3.2. Q3.3.3.
1 1 2009-01-01 -0.2059165 -0.29177677 -0.7107192 1.52718069 -0.4484351 -1.21550600
2 2 2009-01-02 -0.1981136 -1.19813815 1.1750200 -0.40380049 -1.8376094 1.03588482
3 3 2009-01-03 0.3514795 -0.27425539 1.1171712 -1.02641801 -2.0646661 -0.35353058
...
I want to combine all the QN.N* columns into tidy individual QN.N columns, ultimately ending up with something like this:
id time loop_number Q3.2 Q3.3
1 1 2009-01-01 1 -0.20591649 1.52718069
2 2 2009-01-02 1 -0.19811357 -0.40380049
3 3 2009-01-03 1 0.35147949 -1.02641801
...
11 1 2009-01-01 2 -0.29177677 -0.4484351
12 2 2009-01-02 2 -1.19813815 -1.8376094
13 3 2009-01-03 2 -0.27425539 -2.0646661
...
21 1 2009-01-01 3 -0.71071921 -1.21550600
22 2 2009-01-02 3 1.17501999 1.03588482
23 3 2009-01-03 3 1.11717121 -0.35353058
...
The tidyr library has the gather() function, which works great for combining one set of columns:
library(dplyr)
library(tidyr)
library(stringr)
df %>%
  gather(loop_number, Q3.2, starts_with("Q3.2")) %>%
  mutate(loop_number = str_sub(loop_number, -2, -2)) %>%
  select(id, time, loop_number, Q3.2)
id time loop_number Q3.2
1 1 2009-01-01 1 -0.20591649
2 2 2009-01-02 1 -0.19811357
3 3 2009-01-03 1 0.35147949
...
29 9 2009-01-09 3 -0.58581232
30 10 2009-01-10 3 -2.33393981
The resultant data frame has 30 rows, as expected (10 individuals, 3 loops each). However, gathering a second set of columns does not work correctly: it successfully makes the two combined columns Q3.2 and Q3.3, but ends up with 90 rows instead of 30 (all combinations of 10 individuals, 3 loops of Q3.2, and 3 loops of Q3.3; the combinations will increase substantially for each group of columns in the actual data):
df %>%
  gather(loop_number, Q3.2, starts_with("Q3.2")) %>%
  gather(loop_number, Q3.3, starts_with("Q3.3")) %>%
  mutate(loop_number = str_sub(loop_number, -2, -2))
id time loop_number Q3.2 Q3.3
1 1 2009-01-01 1 -0.20591649 1.52718069
2 2 2009-01-02 1 -0.19811357 -0.40380049
3 3 2009-01-03 1 0.35147949 -1.02641801
...
89 9 2009-01-09 3 -0.58581232 -0.13187024
90 10 2009-01-10 3 -2.33393981 -0.48502131
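The row counts can be checked directly. This is only a minimal sketch, assuming the same column layout as the sample data; the key names loop_a and loop_b and the seed are placeholders introduced for illustration and are not part of the original code.

```r
library(tidyr)

set.seed(42)  # arbitrary seed, only so the sketch is reproducible
df <- data.frame(
  id = 1:10,
  time = as.Date("2009-01-01") + 0:9,
  Q3.2.1. = rnorm(10), Q3.2.2. = rnorm(10), Q3.2.3. = rnorm(10),
  Q3.3.1. = rnorm(10), Q3.3.2. = rnorm(10), Q3.3.3. = rnorm(10)
)

# Gathering one set stacks its 3 columns: 10 ids x 3 loops
one_set <- gather(df, loop_a, Q3.2, starts_with("Q3.2"))

# Gathering the second set repeats every one of those rows 3 more times,
# because each Q3.2 loop is paired with each Q3.3 loop
two_sets <- gather(one_set, loop_b, Q3.3, starts_with("Q3.3"))

nrow(one_set)   # 30
nrow(two_sets)  # 90
```

Only the rows where loop_a and loop_b refer to the same loop are wanted, which is why the two key columns need to be handled together rather than in separate gather() calls.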
Is there a way to use multiple calls to gather() like this, combining small subsets of columns while maintaining the correct number of rows?
5 solutions
#1
97
This approach seems pretty natural to me:
df %>%
  gather(key, value, -id, -time) %>%
  extract(key, c("question", "loop_number"), "(Q.\\..)\\.(.)") %>%
  spread(question, value)
First gather all question columns, use extract() to separate the key into question and loop_number, then spread() question back into the columns.
#> id time loop_number Q3.2 Q3.3
#> 1 1 2009-01-01 1 0.142259203 -0.35842736
#> 2 1 2009-01-01 2 0.061034802 0.79354061
#> 3 1 2009-01-01 3 -0.525686204 -0.67456611
#> 4 2 2009-01-02 1 -1.044461185 -1.19662936
#> 5 2 2009-01-02 2 0.393808163 0.42384717
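For readers on tidyr 1.0.0 or later, the same gather/extract/spread sequence can probably be written as a single pivot_longer() call. This is a sketch rather than part of the original answer: it assumes the sample column names (Q3.2.1. and so on), reuses the same regular expression, and relies on the special ".value" entry in names_to, which tells tidyr to turn the first capture group into output column names. The seed is arbitrary.

```r
library(tidyr)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
df <- data.frame(
  id = 1:10,
  time = as.Date("2009-01-01") + 0:9,
  Q3.2.1. = rnorm(10), Q3.2.2. = rnorm(10), Q3.2.3. = rnorm(10),
  Q3.3.1. = rnorm(10), Q3.3.2. = rnorm(10), Q3.3.3. = rnorm(10)
)

long <- pivot_longer(
  df,
  cols = starts_with("Q3"),
  # first capture group -> column names (Q3.2, Q3.3); second -> loop_number
  names_to = c(".value", "loop_number"),
  names_pattern = "(Q.\\..)\\.(.)"
)

head(long)
```

The result should have one row per id and loop (30 rows for the sample data), with loop_number stored as character.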
#2
23
This could be done using reshape(). It is also possible with dplyr, though.
colnames(df) <- gsub("\\.(.{2})$", "_\\1", colnames(df))
colnames(df)[2] <- "Date"
res <- reshape(df, idvar=c("id", "Date"), varying=3:8, direction="long", sep="_")
row.names(res) <- 1:nrow(res)
head(res)
# id Date time Q3.2 Q3.3
#1 1 2009-01-01 1 1.3709584 0.4554501
#2 2 2009-01-02 1 -0.5646982 0.7048373
#3 3 2009-01-03 1 0.3631284 1.0351035
#4 4 2009-01-04 1 0.6328626 -0.6089264
#5 5 2009-01-05 1 0.4042683 0.5049551
#6 6 2009-01-06 1 -0.1061245 -1.7170087
Or using dplyr
library(tidyr)
library(dplyr)
colnames(df) <- gsub("\\.(.{2})$", "_\\1", colnames(df))
df %>%
  gather(loop_number, "Q3", starts_with("Q3")) %>%
  separate(loop_number, c("L1", "L2"), sep = "_") %>%
  spread(L1, Q3) %>%
  select(-L2) %>%
  head()
# id time Q3.2 Q3.3
#1 1 2009-01-01 1.3709584 0.4554501
#2 1 2009-01-01 1.3048697 0.2059986
#3 1 2009-01-01 -0.3066386 0.3219253
#4 2 2009-01-02 -0.5646982 0.7048373
#5 2 2009-01-02 2.2866454 -0.3610573
#6 2 2009-01-02 -1.7813084 -0.7838389
#3
16
With the recent update to melt.data.table, we can now melt multiple columns. With that, we can do:
require(data.table) ## 1.9.5
melt(setDT(df), id=1:2, measure=patterns("^Q3.2", "^Q3.3"),
value.name=c("Q3.2", "Q3.3"), variable.name="loop_number")
# id time loop_number Q3.2 Q3.3
# 1: 1 2009-01-01 1 -0.433978480 0.41227209
# 2: 2 2009-01-02 1 -0.567995351 0.30701144
# 3: 3 2009-01-03 1 -0.092041353 -0.96024077
# 4: 4 2009-01-04 1 1.137433487 0.60603396
# 5: 5 2009-01-05 1 -1.071498263 -0.01655584
# 6: 6 2009-01-06 1 -0.048376809 0.55889996
# 7: 7 2009-01-07 1 -0.007312176 0.69872938
You can get the development version from here.
#4
11
It's not at all related to "tidyr" and "dplyr", but here's another option to consider: merged.stack from my "splitstackshape" package, V1.4.0 and above.
library(splitstackshape)
merged.stack(df, id.vars = c("id", "time"),
var.stubs = c("Q3.2.", "Q3.3."),
sep = "var.stubs")
# id time .time_1 Q3.2. Q3.3.
# 1: 1 2009-01-01 1. -0.62645381 1.35867955
# 2: 1 2009-01-01 2. 1.51178117 -0.16452360
# 3: 1 2009-01-01 3. 0.91897737 0.39810588
# 4: 2 2009-01-02 1. 0.18364332 -0.10278773
# 5: 2 2009-01-02 2. 0.38984324 -0.25336168
# 6: 2 2009-01-02 3. 0.78213630 -0.61202639
# 7: 3 2009-01-03 1. -0.83562861 0.38767161
# <<:::SNIP:::>>
# 24: 8 2009-01-08 3. -1.47075238 -1.04413463
# 25: 9 2009-01-09 1. 0.57578135 1.10002537
# 26: 9 2009-01-09 2. 0.82122120 -0.11234621
# 27: 9 2009-01-09 3. -0.47815006 0.56971963
# 28: 10 2009-01-10 1. -0.30538839 0.76317575
# 29: 10 2009-01-10 2. 0.59390132 0.88110773
# 30: 10 2009-01-10 3. 0.41794156 -0.13505460
# id time .time_1 Q3.2. Q3.3.
#5
6
In case you are like me, and cannot work out how to use a "regular expression with capturing groups" for extract, the following code replicates the extract(...) line in Hadley's answer:
library(dplyr)
library(tidyr)
library(stringr)  # provides str_sub()
df %>%
  gather(question_number, value, starts_with("Q3.")) %>%
  mutate(loop_number = str_sub(question_number, -2, -2),
         question_number = str_sub(question_number, 1, 4)) %>%
  select(id, time, loop_number, question_number, value) %>%
  spread(key = question_number, value = value)
The problem here is that the initial gather forms a key column that is actually a combination of two keys. I chose to use mutate in my original solution in the comments to split this column into two columns with equivalent info, a loop_number column and a question_number column. spread can then be used to transform the long-form data, which are key-value pairs (question_number, value), to wide-form data.
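To see what the two str_sub() calls extract, here is a small sketch on a few sample key values. The vector name key is a placeholder introduced for illustration; the positions assume names shaped like Q3.2.1. (a four-character question code, then a single-digit loop number followed by a trailing period).

```r
library(stringr)

# A few key values as produced by the initial gather()
key <- c("Q3.2.1.", "Q3.2.2.", "Q3.3.3.")

# Second-to-last character: the loop number
loop_number <- str_sub(key, -2, -2)

# First four characters: the question code
question_number <- str_sub(key, 1, 4)

loop_number      # "1" "2" "3"
question_number  # "Q3.2" "Q3.2" "Q3.3"
```

Fixed positions like these would stop working if a question or loop number had more than one digit, which is where the regular expression in extract() is more flexible.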