I would like to extract the rows of a tab-separated data frame (df1) whose entries in columns 1, 2 and 3 are not found in a second data frame (df2), while keeping all of the columns of df1.
Here is a minimal example
df1
chr start end pvalue S1 S2
chr10 100028205 100028508 8.97E-01 3.0373832 3.6170213
chr10 100227439 100227832 5.04E-14 10.6730769 2.7279813
chr10 100992157 100992687 6.66E-03 12.6997477 17.3807599
chr10 100993821 100994188 9.94E-01 2.4369017 2.2819886
chr10 101089011 101090655 1.48E-07 6.6696846 9.3321407
chr10 101190452 101190925 5.37E-01 0.9708738 0.5974608
chr10 101279942 101280382 4.72E-03 7.2614108 11.8119266
chr10 101281182 101282116 1.34E-01 20.0733945 22.3736969
chr10 101282726 101282934 3.02E-01 15.7142857 19.6261682
chr10 101287163 101287920 6.95E-01 24.543379 25.7190265
My actual data set will have a variety of chromosome values in "chr", thousands more rows, and a few more columns of data.
df2
chr start end
chr10 100227439 100227832
chr10 100992157 100992687
chr10 101089011 101090655
chr10 101287163 101287920
Desired output:
df3
chr start end pvalue S1 S2
chr10 100028205 100028508 8.97E-01 3.0373832 3.6170213
chr10 100993821 100994188 9.94E-01 2.4369017 2.2819886
chr10 101190452 101190925 5.37E-01 0.9708738 0.5974608
chr10 101279942 101280382 4.72E-03 7.2614108 11.8119266
chr10 101281182 101282116 1.34E-01 20.0733945 22.3736969
chr10 101282726 101282934 3.02E-01 15.7142857 19.6261682
I have tried a variety of commands including:
df3 <- df1[!(df1[,1:3] %in% df2[,1:3])]
which returns all of df1
df3 <- df1[!(df1$chr & df1$start & df1$end) %in% df2$chr & df2$start & df2$end]
which throws an error.
1 solution
#1
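Assuming both df1 and df2 are data frames, anti_join() from dplyr keeps the rows of df1 that have no match in df2 on their shared columns, with all of the columns of df1 intact: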
library(dplyr)
anti_join(df1, df2)
# Joining by: c("chr", "start", "end")
# chr start end pvalue S1 S2
# 1 chr10 101282726 101282934 0.30200 15.7142857 19.6261682
# 2 chr10 101281182 101282116 0.13400 20.0733945 22.3736969
# 3 chr10 101279942 101280382 0.00472 7.2614108 11.8119266
# 4 chr10 101190452 101190925 0.53700 0.9708738 0.5974608
# 5 chr10 100993821 100994188 0.99400 2.4369017 2.2819886
# 6 chr10 100028205 100028508 0.89700 3.0373832 3.6170213
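If you would rather not depend on dplyr, the same idea can be sketched in base R. As far as I can tell, the first attempt in the question returns all of df1 because `%in%` treats a data frame as a list of columns, so it compares whole columns rather than rows. The sketch below instead builds one string key per row from the three columns and filters on that. It rebuilds only the first four rows of df1 and the first two rows of df2 from the question; `key_cols`, `key1` and `key2` are names I made up for illustration.

```r
# Subset of the example data from the question (first 4 rows of df1,
# first 2 rows of df2); only pvalue is kept as an extra column.
df1 <- data.frame(
  chr    = rep("chr10", 4),
  start  = c(100028205L, 100227439L, 100992157L, 100993821L),
  end    = c(100028508L, 100227832L, 100992687L, 100994188L),
  pvalue = c(8.97e-01, 5.04e-14, 6.66e-03, 9.94e-01),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  chr   = c("chr10", "chr10"),
  start = c(100227439L, 100992157L),
  end   = c(100227832L, 100992687L),
  stringsAsFactors = FALSE
)

# Columns that identify a row (hypothetical helper name)
key_cols <- c("chr", "start", "end")

# Paste the key columns together row by row, giving one string per row
key1 <- do.call(paste, c(df1[key_cols], sep = "\t"))
key2 <- do.call(paste, c(df2[key_cols], sep = "\t"))

# Keep the rows of df1 whose key does not occur in df2; note the comma,
# which selects rows rather than columns
df3 <- df1[!(key1 %in% key2), ]
```

This keeps df1's original row order. With dplyr you can also name the key columns explicitly, anti_join(df1, df2, by = c("chr", "start", "end")), which avoids the "Joining by" message and does not rely on the two data frames sharing only those column names.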