I have a data frame which looks like this (simplified):
data1.time1 data1.time2 data2.time1 data2.time2 data3.time1 group
1 1.53 2.01 6.49 5.22 3.46 A
...
24 2.12 3.14 4.96 4.89 3.81 C
where there are actually columns dataK.timeT for K in 1..27 and T in some (but maybe not all) of 1..8.
I would like to rearrange the data into K data frames so that I can plot, for each K, the summary data (for now let's say mean and mean ± standard deviation) for each of the three groups A, B, and C. That is, I want 27 graphs with three lines per graph, and also marks for the deviations.
Once I rearrange the data it should be easy enough to collapse by group, compute summary statistics, etc. But I'm not really sure how to get the data into this form. I looked at the reshape package, which suggests melting it into a key-value store format and rearranging from there, but it doesn't seem to support column names that encode the T values, as mine do here.
Is there a good way to do this? I'm quite willing to use something other than R to do this, since I can just import the results into R after transforming.
2 Answers
#1
5
After creating fake data with a structure similar to yours, we convert from wide to long format, making a "tidy" data frame that is ready for plotting with ggplot2.
library(reshape2)
library(ggplot2)
library(dplyr)
Create fake data
set.seed(194)
dat = data.frame(replicate(27*8, cumsum(rnorm(24*3))))
names(dat) = paste0(rep(paste0("data",1:27), each=8), ".", rep(paste0("time",1:8), 27))
dat$group = rep(LETTERS[1:3], each=24)
Remove some columns so that number of time points will be different for different data sources:
dat = dat[ , -c(2,4,9,43,56,78,100:103,115:116,134:136,202,205)]
Reshape from wide to long format
datl = melt(dat, id.var="group")
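To see what melt does with this kind of layout, here is a minimal sketch on a toy data frame. The name toy and its values are made up for illustration and are not part of the fake data above.

```r
library(reshape2)

# Toy wide data frame: two measurement columns plus the group id (made-up values)
toy <- data.frame(data1.time1 = c(1.5, 2.1),
                  data1.time2 = c(2.0, 3.1),
                  group = c("A", "B"))

# Every non-id column is stacked into (variable, value) rows;
# the group column is repeated alongside each value
toy_long <- melt(toy, id.vars = "group")
print(toy_long)
```

The variable column still holds the full original column name (for example data1.time2), which is why it has to be split in the next step.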
Split data source and time point into separate columns:
datl$source = gsub("(.*)\\..*","\\1", datl$variable)
datl$time = as.numeric(gsub(".*time(.*)","\\1", datl$variable))
# Order data frame names by number (rather than alphabetically)
datl$source = factor(datl$source, levels=paste0("data",1:length(unique(datl$source))))
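As a quick check of what the two regular expressions and the factor call do, here is a small sketch on a few hypothetical column names in the same dataK.timeT pattern (the vector vars is an assumption for illustration):

```r
# Hypothetical column names in the same dataK.timeT pattern
vars <- c("data1.time1", "data10.time3", "data2.time8")

# Keep everything before the last dot: the data source
src <- gsub("(.*)\\..*", "\\1", vars)

# Keep whatever follows "time" and convert it to a number: the time point
tm <- as.numeric(gsub(".*time(.*)", "\\1", vars))

# Alphabetical ordering would put data10 before data2;
# explicit levels give the numeric order instead
src_f <- factor(src, levels = paste0("data", c(1, 2, 10)))

print(src)
print(tm)
print(levels(src_f))
```

Without the explicit levels, the facets in the plot would appear in alphabetical rather than numeric order.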
Plot the data using ggplot2
# Helper function for plotting standard deviation
sdFnc = function(x) {
  vals = c(mean(x) - sd(x), mean(x) + sd(x))
  names(vals) = c("ymin", "ymax")
  vals
}
pd = position_dodge(0.7)
# Note: fun.y was deprecated in ggplot2 3.3.0; fun is its replacement
ggplot(datl, aes(time, value, group=group, color=group)) +
  stat_summary(fun=mean, geom="line", position=pd) +
  stat_summary(fun.data=sdFnc, geom="errorbar", width=0.4, position=pd) +
  stat_summary(fun=mean, geom="point", position=pd) +
  facet_wrap(~source, ncol=3) +
  theme_bw()
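If you also want the summary numbers themselves, or literally one data frame per data source as the question asks, something along these lines should work. This is a sketch on a small made-up stand-in for datl (called long here), and it assumes dplyr 1.0.0 or later for the .groups argument.

```r
library(dplyr)

# Small stand-in for datl with the same columns (values are made up)
set.seed(1)
long <- expand.grid(source = c("data1", "data2"), time = 1:3,
                    group = c("A", "B", "C"), rep = 1:4,
                    stringsAsFactors = FALSE)
long$value <- rnorm(nrow(long))

# Mean, sd, and mean +/- sd for each source, group and time point
summ <- long %>%
  group_by(source, group, time) %>%
  summarise(mean = mean(value), sd = sd(value), .groups = "drop") %>%
  mutate(lower = mean - sd, upper = mean + sd)

# One summary data frame per data source (K of them in the real data)
by_source <- split(summ, summ$source)
```

Each element of by_source then has one row per group and time point, which can be plotted on its own if separate graphs are preferred over facets.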
Original (unnecessarily complicated) reshaping code. (Note, this code will no longer work with the updated (fake) data set, because the number of time columns is no longer uniform):
# Convert data source from wide to long
datl = data.frame()
for (i in seq(1,27*8,8)) {
  tmp.dat = dat[, c(i:(i+7),grep("group",names(dat)))]
  tmp.dat$source = gsub("(.*)\\..*", "\\1", names(tmp.dat)[1])
  names(tmp.dat)[1:8] = 1:8
  #datl = rbind(datl, tmp.dat)
  datl = bind_rows(datl, tmp.dat) # Updated based on comment
}
datl$source = factor(datl$source, levels=paste0("data",1:27))
# Convert time from wide to long
datl = melt(datl, id.var = c("source","group"), variable.name="time")
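As an alternative to melt followed by gsub, tidyr's pivot_longer can reshape and split the column names in one call. This is a different technique from the code above, shown on a toy frame with made-up values, and it assumes tidyr 1.1.0 or later for names_transform.

```r
library(tidyr)

# Toy wide frame with an uneven set of time points (made-up values)
wide <- data.frame(data1.time1 = c(1.5, 2.1),
                   data1.time2 = c(2.0, 3.1),
                   data2.time1 = c(6.5, 5.0),
                   group = c("A", "B"))

# The two capture groups in names_pattern become the source and time columns;
# names_transform converts the captured time from character to integer
long <- pivot_longer(wide, cols = -group,
                     names_to = c("source", "time"),
                     names_pattern = "(data\\d+)\\.time(\\d+)",
                     names_transform = list(time = as.integer))
print(long)
```

Because each column name is parsed on its own, this does not care whether every data source has the same number of time points.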
#2
1
Could do something like this with dplyr:
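K <- 27                                              ## number of data sources
summaries <- vector("list", K)                       ## one summary per data source
for(i in 1:K){
  my.data.ind <- paste0("^data", i, "\\.|^group$")   ## regex, e.g. "^data3\\.|^group$"
  summaries[[i]] <- select(data, matches(my.data.ind)) %>%  ## matches() takes a regex; contains() does not
    group_by(group) %>%                              ## group by your group
    summarise(across(everything(), list(mean = mean, sd = sd)))  ## mean and sd per column (dplyr >= 1.0.0)
}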
For each K, that should leave you with a data frame with one row per group (three rows) holding the mean and standard deviation of every time column for that data source.