In an attempt to extract the mismatches between the two data frames below, I have already managed to create a new data frame in which the mismatches are replaced.
What I need now is a list of the mismatches:
dfA <- structure(list(animal1 = c("AA", "TT", "AG", "CA"), animal2 = c("AA", "TB", "AG", "CA"), animal3 = c("AA", "TT", "AG", "CA")), .Names = c("animal1", "animal2", "animal3"), row.names = c("snp1", "snp2", "snp3", "snp4"), class = "data.frame")
# > dfA
# animal1 animal2 animal3
# snp1 AA AA AA
# snp2 TT TB TT
# snp3 AG AG AG
# snp4 CA CA CA
dfB <- structure(list(animal1 = c("AA", "TT", "AG", "CA"), animal2 = c("AA", "TB", "AG", "DF"), animal3 = c("AA", "TB", "AG", "DF")), .Names = c("animal1", "animal2", "animal3"), row.names = c("snp1", "snp2", "snp3", "snp4"), class = "data.frame")
# > dfB
# animal1 animal2 animal3
# snp1 AA AA AA
# snp2 TT TB TB
# snp3 AG AG AG
# snp4 CA DF DF
To clarify the mismatches, here they are marked as 00's:
# animal1 animal2 animal3
# snp1 AA AA AA
# snp2 TT TB 00
# snp3 AG AG AG
# snp4 CA 00 00
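For reference, the "00" marking above can be reproduced without lapply by building a logical mismatch matrix and assigning through it. This is only a minimal base R sketch, not the asker's original code; the names `mismatch` and `masked` are illustrative assumptions, and the data frames are rebuilt here so the block runs on its own.

```r
# Rebuild the example data (same values as dfA / dfB above).
snps <- c("snp1", "snp2", "snp3", "snp4")
dfA <- data.frame(animal1 = c("AA", "TT", "AG", "CA"),
                  animal2 = c("AA", "TB", "AG", "CA"),
                  animal3 = c("AA", "TT", "AG", "CA"),
                  row.names = snps, stringsAsFactors = FALSE)
dfB <- data.frame(animal1 = c("AA", "TT", "AG", "CA"),
                  animal2 = c("AA", "TB", "AG", "DF"),
                  animal3 = c("AA", "TB", "AG", "DF"),
                  row.names = snps, stringsAsFactors = FALSE)

# TRUE wherever the two data frames disagree (assumes identical shape and ordering).
mismatch <- as.matrix(dfA) != as.matrix(dfB)

# Copy dfA and overwrite the mismatching cells with "00".
masked <- dfA
masked[mismatch] <- "00"
masked
```

The same `mismatch` matrix also identifies which cells to report, so the replacement and the list of mismatches can share one comparison.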
I need the following output:
structure(list(snpname = structure(c(1L, 2L, 2L), .Label = c("snp2", "snp4"), class = "factor"), animalname = structure(c(2L, 1L, 2L), .Label = c("animal2", "animal3"), class = "factor"), alleledfA = structure(c(2L, 1L, 1L), .Label = c("CA", "TT"), class = "factor"), alleledfB = structure(c(2L, 1L, 1L), .Label = c("DF", "TB"), class = "factor")), .Names = c("snpname", "animalname", "alleledfA", "alleledfB"), class = "data.frame", row.names = c(NA, -3L))
# snpname animalname alleledfA alleledfB
#1 snp2 animal3 TT TB
#2 snp4 animal2 CA DF
#3 snp4 animal3 CA DF
So far I've been trying to extract additional data from the lapply function that I use to replace the mismatches with zeros, but without success. I also tried to write an ifelse function, again without success. I hope you can help me out here!
Eventually this will be run on data sets with dimensions of 100K by 1000, so an efficient solution would be a plus.
3 Answers
#1
6
This question has the data.table tag, so here's my attempt using that package. The first step is to convert the row names to a column, since data.table doesn't like row names. Then rbind the two data sets while setting an id per data set, convert to long format, keep the groups that have more than one unique value, and convert back to wide format.
library(data.table)
setDT(dfA, keep.rownames = TRUE)
setDT(dfB, keep.rownames = TRUE)
dcast(melt(rbind(dfA,
dfB,
idcol = TRUE),
id = 1:2
)[,
if(uniqueN(value) > 1L) .SD,
by = .(rn, variable)],
rn + variable ~ .id)
# rn variable 1 2
# 1: snp2 animal3 TT TB
# 2: snp4 animal2 CA DF
# 3: snp4 animal3 CA DF
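The nested call above can be easier to follow when split into named steps. The following is a sketch of the same idea, not the answerer's exact code: it uses `as.data.table()` on copies instead of `setDT()` so the original data frames keep their row names, and the names `source`, `snpname`, `animalname`, and `allele` are assumptions chosen to match the output requested in the question.

```r
library(data.table)

# Rebuild the example data so the block runs on its own.
snps <- c("snp1", "snp2", "snp3", "snp4")
dfA <- data.frame(animal1 = c("AA", "TT", "AG", "CA"),
                  animal2 = c("AA", "TB", "AG", "CA"),
                  animal3 = c("AA", "TT", "AG", "CA"),
                  row.names = snps, stringsAsFactors = FALSE)
dfB <- data.frame(animal1 = c("AA", "TT", "AG", "CA"),
                  animal2 = c("AA", "TB", "AG", "DF"),
                  animal3 = c("AA", "TB", "AG", "DF"),
                  row.names = snps, stringsAsFactors = FALSE)

# Step 1: row names become a regular column; dfA and dfB are left unchanged.
dtA <- as.data.table(dfA, keep.rownames = "snpname")
dtB <- as.data.table(dfB, keep.rownames = "snpname")

# Step 2: stack both tables and record which data set each row came from.
stacked <- rbindlist(list(dfA = dtA, dfB = dtB), idcol = "source")

# Step 3: long format, one row per (source, snp, animal).
long <- melt(stacked, id.vars = c("source", "snpname"),
             variable.name = "animalname", value.name = "allele")

# Step 4: keep only the (snp, animal) groups whose alleles differ.
mismatched <- long[, if (uniqueN(allele) > 1L) .SD, by = .(snpname, animalname)]

# Step 5: back to wide format, one column per data set.
result <- dcast(mismatched, snpname + animalname ~ source, value.var = "allele")
result
```

Working on copies costs extra memory, so for the full 100K by 1000 data the in-place `setDT()` used above may be preferable.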
#2
4
Here is a solution using the array indices of the mismatching cells, obtained from which() with arr.ind=TRUE:
i.arr <- which(dfA != dfB, arr.ind=TRUE)
data.frame(snp=rownames(dfA)[i.arr[,1]], animal=colnames(dfA)[i.arr[,2]],
A=dfA[i.arr], B=dfB[i.arr])
# snp animal A B
#1 snp4 animal2 CA DF
#2 snp2 animal3 TT TB
#3 snp4 animal3 CA DF
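Since the question mentions data of roughly 100K by 1000, it may help to wrap this approach in a function that converts both inputs to character matrices once and checks that they line up. This is a sketch under assumptions: `find_mismatches` is a hypothetical helper name, the column names follow the output requested in the question, and cells that are NA in either input are silently skipped because `which()` ignores NA.

```r
# Hypothetical helper (not from the original answer): list the cells where a and b differ.
find_mismatches <- function(a, b) {
  ma <- as.matrix(a)
  mb <- as.matrix(b)
  # Assumes both inputs share the same shape, row names and column names.
  stopifnot(identical(dim(ma), dim(mb)),
            identical(rownames(ma), rownames(mb)),
            identical(colnames(ma), colnames(mb)))
  # Two-column matrix of (row, col) positions of the mismatches.
  idx <- which(ma != mb, arr.ind = TRUE)
  data.frame(snpname    = rownames(ma)[idx[, "row"]],
             animalname = colnames(ma)[idx[, "col"]],
             alleledfA  = ma[idx],
             alleledfB  = mb[idx],
             stringsAsFactors = FALSE)
}

# Usage with the example data, rebuilt here so the block runs on its own.
snps <- c("snp1", "snp2", "snp3", "snp4")
dfA <- data.frame(animal1 = c("AA", "TT", "AG", "CA"),
                  animal2 = c("AA", "TB", "AG", "CA"),
                  animal3 = c("AA", "TT", "AG", "CA"),
                  row.names = snps, stringsAsFactors = FALSE)
dfB <- data.frame(animal1 = c("AA", "TT", "AG", "CA"),
                  animal2 = c("AA", "TB", "AG", "DF"),
                  animal3 = c("AA", "TB", "AG", "DF"),
                  row.names = snps, stringsAsFactors = FALSE)

res <- find_mismatches(dfA, dfB)
res
```

The rows come back in column-major order (animal by animal); `res[order(res$snpname, res$animalname), ]` would sort them by SNP instead.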
#3
3
This can be done with dplyr/tidyr, using an approach similar to the one in @David Arenburg's post.
library(dplyr)
library(tidyr)
bind_rows(add_rownames(dfA), add_rownames(dfB)) %>%
gather(Var, Val, -rowname) %>%
group_by(rowname, Var) %>%
filter(n_distinct(Val)>1) %>%
mutate(id = 1:2) %>%
spread(id, Val)
# rowname Var 1 2
# (chr) (chr) (chr) (chr)
#1 snp2 animal3 TT TB
#2 snp4 animal2 CA DF
#3 snp4 animal3 CA DF
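In more recent releases of these packages, `add_rownames()` is deprecated and `gather()`/`spread()` are superseded, so the same idea can also be sketched with `tibble::rownames_to_column()`, `pivot_longer()`, and `pivot_wider()`. This is an alternative formulation rather than the answerer's code; the column names `source`, `snpname`, `animalname`, and `allele` are assumptions chosen to match the output requested in the question.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Rebuild the example data so the block runs on its own.
snps <- c("snp1", "snp2", "snp3", "snp4")
dfA <- data.frame(animal1 = c("AA", "TT", "AG", "CA"),
                  animal2 = c("AA", "TB", "AG", "CA"),
                  animal3 = c("AA", "TT", "AG", "CA"),
                  row.names = snps, stringsAsFactors = FALSE)
dfB <- data.frame(animal1 = c("AA", "TT", "AG", "CA"),
                  animal2 = c("AA", "TB", "AG", "DF"),
                  animal3 = c("AA", "TB", "AG", "DF"),
                  row.names = snps, stringsAsFactors = FALSE)

# Stack both data sets, tagging each row with the data set it came from,
# then reshape to one row per (source, snp, animal).
long <- bind_rows(
  dfA = rownames_to_column(dfA, "snpname"),
  dfB = rownames_to_column(dfB, "snpname"),
  .id = "source"
) %>%
  pivot_longer(-c(source, snpname),
               names_to = "animalname", values_to = "allele")

# One column per data set, then keep only the cells that differ.
result <- long %>%
  pivot_wider(names_from = source, values_from = allele,
              names_prefix = "allele") %>%
  filter(alleledfA != alleledfB)
result
```

Comparing the two allele columns directly after widening avoids the grouped `n_distinct()` step and the manual `id = 1:2` column.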