Calculating pairwise correlations in R using dplyr::mutate

Time: 2021-03-25 07:36:50

I have a large data frame in which every row holds enough data to calculate a correlation between two specific sets of columns. I would like to add a new column containing the calculated correlations.

Here is a summary of what I would like to do (this one using dplyr):

example_data %>%
mutate(pearsoncor = cor(x = X001_F5_000_A:X030_F5_480_C, y = X031_H5_000_A:X060_H5_480_C))

Obviously it does not work this way, as I get only NAs in the pearsoncor column. Does anyone have a suggestion? Is there an easy way to do this?

Best,

Example data frame

3 solutions

#1


1  

With tidyr, you can gather all the x-variables and all the y-variables you'd like to compare into two separate key/value pairs. You get a tibble containing the correlation coefficient and its p-value for every combination of x- and y-variable you provided.

library(dplyr)
library(tidyr)

example_data %>%
  gather(x_var, x_val, X001_F5_000_A:X030_F5_480_C) %>% 
  gather(y_var, y_val, X031_H5_000_A:X060_H5_480_C) %>% 
  group_by(x_var, y_var) %>% 
  summarise(cor_coef = cor.test(x_val, y_val)$estimate,
            p_val = cor.test(x_val, y_val)$p.value)
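
To make the shape of that result concrete, here is a minimal sketch on made-up data. The data frame `toy` and its column names `x1`, `x2`, `y1`, `y2` are assumptions for illustration only, and `cor()` is used in place of `cor.test()` to keep it short. Note that this approach yields one row per pair of columns (correlated across the rows of the data), which may not be the same thing as one correlation per row.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: 5 rows, two "x" columns and two "y" columns
toy <- data.frame(
  x1 = c(1, 2, 3, 4, 5),
  x2 = c(5, 3, 4, 1, 2),
  y1 = c(2, 4, 6, 8, 10),
  y2 = c(1, 3, 2, 5, 4)
)

pairwise <- toy %>%
  gather(x_var, x_val, x1:x2) %>%   # stack the x columns: 10 rows
  gather(y_var, y_val, y1:y2) %>%   # stack the y columns: 20 rows
  group_by(x_var, y_var) %>%        # 4 groups, one per column pair
  summarise(cor_coef = cor(x_val, y_val), .groups = "drop")

print(pairwise)
```

With 2 x-columns and 2 y-columns the result has 4 rows; with 30 of each, as in the question, it would have 900.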

#2


1  

Here is a solution using the reshape2 package to melt() the data frame into long form so that each value has its own row. The original wide-form data has 60 values per row for each of the 6 genes, while the melted long-form data frame has 360 rows, one for each value. Then we can easily use summarize() from dplyr to calculate the correlations without loops.

library(reshape2)
library(dplyr)

names1 <- names(example_data)[4:33]
names2 <- names(example_data)[34:63]

example_data_longform <- melt(example_data, id.vars = c('Gene','clusterFR','clusterHR'))

example_data_longform %>%
  group_by(Gene, clusterFR, clusterHR) %>%
  summarize(pearsoncor = cor(x = value[variable %in% names1],
                             y = value[variable %in% names2]))
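
For comparison, the same per-row correlation can probably be computed without reshaping, using `rowwise()` and `c_across()` (available in dplyr 1.0.0 and later). This is only a sketch on made-up data: `toy` and the columns `a1`..`a4`, `b1`..`b4` are hypothetical stand-ins for the two column ranges in the question. It also hints at why the original attempt returns NA: inside a plain `mutate()`, `a1:a4` is the ordinary sequence operator applied to column values, not a column selection.

```r
library(dplyr)

# Hypothetical toy data: one row per gene, two sets of four measurements
toy <- data.frame(
  Gene = c("g1", "g2"),
  a1 = c(1, 4), a2 = c(2, 3), a3 = c(3, 2), a4 = c(4, 1),
  b1 = c(2, 1), b2 = c(1, 2), b3 = c(4, 4), b4 = c(3, 3)
)

result <- toy %>%
  rowwise() %>%                                   # treat each row as its own group
  mutate(pearsoncor = cor(c_across(a1:a4),        # values of a1..a4 in this row
                          c_across(b1:b4))) %>%   # values of b1..b4 in this row
  ungroup()

print(result$pearsoncor)
```

`rowwise()` can be slow on very large data frames, so the long-form approach above may still be preferable there.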

You could also generate more detailed results, as in Eudald's answer, using do():

detailed_r <- example_data_longform %>%
  group_by(Gene, clusterFR, clusterHR) %>%
  do(cor = cor.test(x = .$value[.$variable %in% names1],
                    y = .$value[.$variable %in% names2]))

This outputs a tibble with the cor column being a list with the results of cor.test() for each gene. We can use lapply() to extract output from the list.

lapply(detailed_r$cor, function(x) c(x$estimate, x$p.value))
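
If a data frame is more convenient than a list, the extraction step could be sketched as below. The list `tests` is a hypothetical stand-in for `detailed_r$cor`, built from made-up vectors so the example runs on its own.

```r
# Hypothetical stand-in for detailed_r$cor: a list of cor.test() results
tests <- list(
  cor.test(c(1, 2, 3, 4, 5), c(2, 1, 4, 3, 5)),
  cor.test(c(1, 2, 3, 4, 5), c(5, 3, 4, 2, 1))
)

# vapply() checks that each test contributes exactly one numeric value
summary_df <- data.frame(
  r       = vapply(tests, function(x) unname(x$estimate), numeric(1)),
  p_value = vapply(tests, function(x) x$p.value, numeric(1))
)

print(summary_df)
```

The resulting columns could then be bound back onto the grouping columns of `detailed_r`, since the rows are in the same order.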

#3


0  

I had the same problem a few days back, and I know loops are not optimal in R but that's the only thing I could think of:

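# cols_A and cols_B are assumed to hold the names (or indices) of the two column sets
df$r <- rep(NA_real_, nrow(df))      # NA rather than 0, so unfilled rows are obvious
df$cor_p <- rep(NA_real_, nrow(df))

for (i in seq_len(nrow(df))) {
  ct <- cor.test(as.numeric(df[i, cols_A]), as.numeric(df[i, cols_B]))
  df$r[i] <- ct$estimate
  df$cor_p[i] <- ct$p.value
}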
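
The same loop can be written with `lapply()` and `vapply()`, which avoids pre-allocating the result columns. This is a sketch under assumptions: the data frame `df` and the vectors `cols_A` and `cols_B` below are made up for illustration, since the answer does not define them.

```r
# Hypothetical toy data; cols_A and cols_B name the two column sets to correlate
df <- data.frame(
  a1 = c(1, 4), a2 = c(2, 3), a3 = c(3, 2), a4 = c(4, 1),
  b1 = c(2, 1), b2 = c(1, 2), b3 = c(4, 4), b4 = c(3, 3)
)
cols_A <- c("a1", "a2", "a3", "a4")
cols_B <- c("b1", "b2", "b3", "b4")

# Run one cor.test() per row and keep the full test objects
row_tests <- lapply(seq_len(nrow(df)), function(i) {
  cor.test(unlist(df[i, cols_A], use.names = FALSE),
           unlist(df[i, cols_B], use.names = FALSE))
})

# Pull the estimate and p-value out of each test
df$r     <- vapply(row_tests, function(ct) unname(ct$estimate), numeric(1))
df$cor_p <- vapply(row_tests, function(ct) ct$p.value, numeric(1))

print(df[, c("r", "cor_p")])
```

This is not necessarily faster than the `for` loop; the main difference is style.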
