This question already has an answer here:
- How to remove the lines which appear on file B from another file A? 8 answers
I have a very large text file, myReads.sam, that looks like this:
J00118:315:HMJWTBBXX:4:1118:21684:2246 4 * 0 0 * * 0 0 CR:Z:TTTGTCATCTGTTTGT
J00118:315:HMJWTBBXX:4:2211:19532:14449 4 * 0 0 * * 0 0 CR:Z:TATGTCATCTTTCCTC
I have another 500 line text file, myIDs.txt, that looks like this:
CR:Z:TTTGTCATCTGTTTGT
CB:Z:CTACCCAGTCGACTGC
QT:Z:AAFFFJJJ
I want to create a third text file, myFilteredReads.sam, that keeps only the lines containing at least one of the strings in myIDs.txt. For example, applying this filter to the snippets of myReads.sam and myIDs.txt above would give:
J00118:315:HMJWTBBXX:4:1118:21684:2246 4 * 0 0 * * 0 0 CR:Z:TTTGTCATCTGTTTGT
I know that if I were filtering on only a single string (e.g. 'CR:Z:TTTGTCATCTGTTTGT'), I could use awk like this:
cat myReads.sam | awk '!/CR:Z:TTTGTCATCTGTTTGT/' > myPartiallyFilteredReads.sam
I'm not sure how to tell awk to substitute each line of a file for the pattern in quotes, though. I thought I might try looping through the file:
cat myIDs.txt | awk 'BEGIN {i = 1; do { !/i/; ++i } while (i < 500) }' myReads.sam > myFilteredReads.sam
...but that hasn't worked for me.
Any suggestions? Thanks in advance.
2 Answers
#1
2
There is a very simple way to accomplish what you are attempting. grep can read its patterns from a file with the -f option, and the -v option inverts the match. So you can find all lines in myReads.sam that do not contain any of the patterns in myIDs.txt with
grep -v -f myIDs.txt myReads.sam
Example Use/Output
Using your data in data.txt and your IDs in filter.txt, you get your desired results, e.g.
$ grep -v -f filter.txt data.txt
J00118:315:HMJWTBBXX:4:2211:19532:14449 4 * 0 0 * * 0 0 CR:Z:TATGTCATCTTTCCTC
Edit -- If You Want Only the Lines That DO Contain a Pattern from myIDs.txt
Then remove the -v, e.g.
$ grep -f filter.txt data.txt
J00118:315:HMJWTBBXX:4:1118:21684:2246 4 * 0 0 * * 0 0 CR:Z:TTTGTCATCTGTTTGT
Sorry I misunderstood what you intended to include/exclude.
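Since the entries in myIDs.txt look like literal tags rather than regular expressions, it may be worth adding -F so that grep treats every pattern as a fixed string. The following is only a minimal sketch under that assumption: it uses the file names from the question, writes the question's sample snippets into the current directory (so run it in an empty scratch directory), and redirects the result to myFilteredReads.sam.

```shell
#!/bin/sh
# Demo only: recreate the snippets from the question in the current directory.
cat > myReads.sam <<'EOF'
J00118:315:HMJWTBBXX:4:1118:21684:2246 4 * 0 0 * * 0 0 CR:Z:TTTGTCATCTGTTTGT
J00118:315:HMJWTBBXX:4:2211:19532:14449 4 * 0 0 * * 0 0 CR:Z:TATGTCATCTTTCCTC
EOF
cat > myIDs.txt <<'EOF'
CR:Z:TTTGTCATCTGTTTGT
CB:Z:CTACCCAGTCGACTGC
QT:Z:AAFFFJJJ
EOF

# -F: treat each pattern as a fixed string, not a regular expression.
# -f: read one pattern per line from myIDs.txt.
grep -F -f myIDs.txt myReads.sam > myFilteredReads.sam

# Show what was kept.
cat myFilteredReads.sam
```

With this sample data only the first read should remain. Note that an empty line in the pattern file matches every input line, so it is worth checking myIDs.txt for blank lines before running this on the real data.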
#2
0
main is the file with the content
str is the file with the 'interesting strings'
out is the output file
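#!/bin/bash
# For each string in str, append the lines of main that contain it to out.
# IFS= and -r keep each line exactly as written; -F matches it literally.
while IFS= read -r line; do
    grep -F -- "$line" main >> out
done < str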
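One thing to keep in mind with the loop: it scans main once per string, so a line that contains two of the strings is written twice, and the output is grouped by string rather than kept in the original order. Since the question asked about awk, here is a rough single-pass sketch of the same filter. It reuses the file names main, str and out from this answer; the sample contents are made up purely for illustration, and the script writes them into the current directory.

```shell
#!/bin/sh
# Hypothetical sample data, for illustration only.
cat > main <<'EOF'
read1 CR:Z:AAA QT:Z:FFF
read2 CR:Z:CCC
read3 QT:Z:FFF
EOF
cat > str <<'EOF'
QT:Z:FFF
CR:Z:AAA
EOF

# First file (str): remember every non-empty line as a key of ids.
# Second file (main): print a line once as soon as any id occurs in it.
awk 'NR == FNR { if ($0 != "") ids[$0]; next }
     { for (id in ids) if (index($0, id)) { print; next } }' str main > out

cat out
```

Here read1 contains both strings but appears only once, and the lines come out in the order they have in main. The NR == FNR idiom assumes str is not empty.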