Trying to merge data.frames after converting class numeric to a data.frame

Time: 2021-07-08 22:47:05

I'm having some trouble merging two data.frames in R, and I believe this is caused by converting a numeric vector to a data.frame.

Background: I want to see what proportion of the total protein expression comes from different subcellular locations, across a number of cell lines. I got the following datasets with colSums():

dput(actin_expression)

structure(c(0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 
1, 3, 24.00000001, 27.00000001, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0), .Names = c("HAP1.wt_P8255.1", "HAP1.wt_P8255.2", 
"HAP1.wt_P8254.1", "HAP1.wt_P8254.2", "HAP1.kd_P8253.1", "HAP1.kd_P8253.2", 
"HAP1.kd_P8252.1", "HAP1.kd_P8252.2", "HAP1.kd_P8249.1", "HAP1.kd_P8249.2", 
"HAP1.kd_P8248.1", "HAP1.kd_P8248.2", "HAP1.wt_P8247.1", "HAP1.wt_P8247.2", 
"HAP1.wt_P8246.1", "HAP1.wt_P8246.2", "HAP1_P7964.1", "MDS_P7246.1", 
"A673_P6591.1", "K562__P5494.1", "K562_P5464.1", "K562_P5359.1", 
"K562_P5359.2", "K562_P5358.1", "K562_P5358.2", "K562_P5357.1", 
"K562_P5357.2", "K562_P5356.1", "K562_P5356.2", "K562_P5355.1", 
"K562_P5355.2", "K562_P5269.1", "K562_P5269.2", "K562_P5268.1", 
"K562_P5268.2"))

dput(aggresome_expression)

structure(c(0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 
1, 3, 24.00000001, 27.00000001, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0), .Names = c("HAP1.wt_P8255.1", "HAP1.wt_P8255.2", 
"HAP1.wt_P8254.1", "HAP1.wt_P8254.2", "HAP1.kd_P8253.1", "HAP1.kd_P8253.2", 
"HAP1.kd_P8252.1", "HAP1.kd_P8252.2", "HAP1.kd_P8249.1", "HAP1.kd_P8249.2", 
"HAP1.kd_P8248.1", "HAP1.kd_P8248.2", "HAP1.wt_P8247.1", "HAP1.wt_P8247.2", 
"HAP1.wt_P8246.1", "HAP1.wt_P8246.2", "HAP1_P7964.1", "MDS_P7246.1", 
"A673_P6591.1", "K562__P5494.1", "K562_P5464.1", "K562_P5359.1", 
"K562_P5359.2", "K562_P5358.1", "K562_P5358.2", "K562_P5357.1", 
"K562_P5357.2", "K562_P5356.1", "K562_P5356.2", "K562_P5355.1", 
"K562_P5355.2", "K562_P5269.1", "K562_P5269.2", "K562_P5268.1", 
"K562_P5268.2"))

dput(whole_protein_expression)

structure(c(5792.666666662, 5696.833333328, 5926.333333331, 5698.499999993, 
91.5, 5491.999999989, 5905.99999999, 5875.166666664, 6283.666666659, 
6221.333333328, 6461.833333324, 6551.999999995, 6162.499999993, 
6291.333333332, 6092.333333334, 5860.666666665, 66602.24999992, 
102735.516666836, 128849.166666626, 161552.66666675, 162444.416666818, 
22056.083333343, 21648.08333335, 21857.000000007, 21648.500000005, 
20084.166666684, 20250.333333338, 19233.750000023, 19152.416666677, 
18134.916666664, 18319.833333336, 21743.00000001, 21708.41666667, 
21191.500000012, 20974.833333327), .Names = c("HAP1.wt_P8255.1", 
"HAP1.wt_P8255.2", "HAP1.wt_P8254.1", "HAP1.wt_P8254.2", "HAP1.kd_P8253.1", 
"HAP1.kd_P8253.2", "HAP1.kd_P8252.1", "HAP1.kd_P8252.2", "HAP1.kd_P8249.1", 
"HAP1.kd_P8249.2", "HAP1.kd_P8248.1", "HAP1.kd_P8248.2", "HAP1.wt_P8247.1", 
"HAP1.wt_P8247.2", "HAP1.wt_P8246.1", "HAP1.wt_P8246.2", "HAP1_P7964.1", 
"MDS_P7246.1", "A673_P6591.1", "K562__P5494.1", "K562_P5464.1", 
"K562_P5359.1", "K562_P5359.2", "K562_P5358.1", "K562_P5358.2", 
"K562_P5357.1", "K562_P5357.2", "K562_P5356.1", "K562_P5356.2", 
"K562_P5355.1", "K562_P5355.2", "K562_P5269.1", "K562_P5269.2", 
"K562_P5268.1", "K562_P5268.2"))

Divide the colSums of actin_expression by whole_protein_expression, and do the same for the colSums of aggresome_expression.

actin <- actin_expression/whole_protein_expression*100
aggresome <- aggresome_expression/whole_protein_expression*100
class(actin) # Class numeric
actin <- as.data.frame(actin) # Change to a data.frame 
aggresome <- as.data.frame(aggresome)
head(actin)

I would like to name the columns so I can do new_df <- merge(actin, aggresome, by = "cell_line")

I try to name the first column cell_line as follows:

names(actin) <- c("cell_line", "actinFilaments")
Error in names(actin) <- c("cell_line", "actinFilaments") : 
'names' attribute [2] must be the same length as the vector [1]

Something is odd here - I believe it's telling me I only have one column?

Usually when you do write.csv() the first column is like an index 1:nrow (not sure if that is the right term) but when I write.csv(actin, "actin.csv") this is not the case.

Why does writing a csv file usually result in the first column being an index (and how can one prevent this)? Why are my cell lines (possibly) being treated as an index (and how can I prevent this)?

Many thanks to any R-wizards who can share some knowledge on class conversions :)

2 answers

#1


You're working with a lot of vectors instead of putting everything in a dataframe:

Create a df:

df <- data.frame(actin_expression, aggresome_expression, whole_protein_expression)

put the names in a column:

df <- data.frame(Names = rownames(df), df, row.names = NULL)

create new columns:

library(dplyr)

df2 <- df %>%
       mutate(actin = actin_expression/whole_protein_expression*100,
              aggresome = aggresome_expression/whole_protein_expression*100)  
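
If dplyr isn't available, the same steps can be done in base R. Below is a minimal sketch on toy input: the sample names (s1, s2, s3) and all values are made up for illustration and are not your data.

```r
# Toy named vectors standing in for the colSums() results (values are made up)
actin_expression <- c(s1 = 1, s2 = 3, s3 = 0)
aggresome_expression <- c(s1 = 2, s2 = 0, s3 = 5)
whole_protein_expression <- c(s1 = 100, s2 = 200, s3 = 50)

# The vector names become the row names of the data frame
df <- data.frame(actin_expression, aggresome_expression, whole_protein_expression)

# Move the row names into a real column
df <- data.frame(Names = rownames(df), df, row.names = NULL)

# Base-R equivalent of the mutate() call: percentages of total expression
df2 <- df
df2$actin <- df2$actin_expression / df2$whole_protein_expression * 100
df2$aggresome <- df2$aggresome_expression / df2$whole_protein_expression * 100

print(df2)
```

Because everything lives in one data frame, the percentages stay aligned with their sample names and no merge is needed afterwards.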

Let me know if that is along the lines of what you're looking for.

Also: you were still working with vectors when you tried write.csv. If the names are in a column of a data frame, they will show up as the first column of the file rather than as an index like the one you're referring to.

# Name the columns explicitly; otherwise they become df2.Names and df2.actin
actin <- data.frame(cell_line = df2$Names, actin = df2$actin)
write.csv(actin, "actin.csv")
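
On the index question: as far as I know, write.csv() writes the row names as an unnamed first column by default. For a data frame with automatic row names that column is just 1, 2, 3, and so on; for one built from a named vector it holds the names instead. Passing row.names = FALSE drops it. A small sketch with made-up values, writing to a temporary file:

```r
# Made-up percentages with the cell line stored in a real column
actin <- data.frame(cell_line = c("s1", "s2"), actin = c(1, 1.5))

out <- tempfile(fileext = ".csv")

# Default: the automatic row names are written as an unnamed first column
write.csv(actin, out)
default_lines <- readLines(out)

# row.names = FALSE leaves only the real columns
write.csv(actin, out, row.names = FALSE)
clean_lines <- readLines(out)

print(default_lines[1])  # header starts with an empty "" field for the row names
print(clean_lines[1])    # header has only cell_line and actin
```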

#2


I think your issue is just that your cell_line values are stored as row names, rather than an actual column. I didn't meticulously review your analysis to make sure it didn't cause issues anywhere else, but to fix the issue with the final data frame:

library(tibble)

df <- rownames_to_column(actin, "cell_line")
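
To see why names(actin) <- c("cell_line", "actinFilaments") fails, and how the merge works once the names are a real column, here is a base-R sketch that does by hand roughly what rownames_to_column() does. The sample names and values are made up for illustration, and the variable names ending in _df are mine, not from the question.

```r
# Made-up named vectors standing in for the percentage results
actin <- c(s1 = 1, s2 = 1.5, s3 = 0)
aggresome <- c(s1 = 2, s2 = 0, s3 = 10)

# as.data.frame() on a named vector gives ONE column; the names become row names
actin_one_col <- as.data.frame(actin)
ncol(actin_one_col)      # 1, which is why assigning two names fails
rownames(actin_one_col)  # the sample names live here, not in a column

# Build data frames with the names in an explicit cell_line column
actin_df <- data.frame(cell_line = names(actin), actinFilaments = unname(actin))
aggresome_df <- data.frame(cell_line = names(aggresome), aggresome = unname(aggresome))

# Now both data frames share a key column and can be merged
new_df <- merge(actin_df, aggresome_df, by = "cell_line")
print(new_df)
```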
