Using bash and awk to remove lines that do not contain one of a list of strings [duplicate]

This question already has an answer here:

How to remove the lines which appear on file B from another file A? 8 answers

I have a very large text file, myReads.sam, that looks like this:

J00118:315:HMJWTBBXX:4:1118:21684:2246  4   *   0   0   *   *   0   0   CR:Z:TTTGTCATCTGTTTGT   
J00118:315:HMJWTBBXX:4:2211:19532:14449 4   *   0   0   *   *   0   0   CR:Z:TATGTCATCTTTCCTC

I have another 500 line text file, myIDs.txt, that looks like this:

CR:Z:TTTGTCATCTGTTTGT
CB:Z:CTACCCAGTCGACTGC
QT:Z:AAFFFJJJ

I want to create a third text document, myFilteredReads.sam, that excludes any line that does not contain one of the character strings in myIDs.txt. In other words, only lines containing at least one of those strings should be kept. So, for example, if I applied this filter using the snippets of myReads.sam and myIDs.txt above, the new file would look like:

J00118:315:HMJWTBBXX:4:1118:21684:2246  4   *   0   0   *   *   0   0   CR:Z:TTTGTCATCTGTTTGT

I know if I was only filtering on a single string (e.g. 'CR:Z:TTTGTCATCTGTTTGT'), I could use awk like this:

cat myReads.sam | awk '!/CR:Z:TTTGTCATCTGTTTGT/' > myPartiallyFilteredReads.sam

I'm not sure how to tell awk to replace the part in quotes with each line of a file, though. I thought I might try looping through the files:

cat myIDs.txt | awk 'BEGIN {i = 1; do { !/i/; ++i } while (i < 500) }' myReads.sam > myFilteredReads.sam

...but that hasn't worked for me.

Any suggestions? Thanks in advance.

2 answers

#1

There is a very simple way to accomplish what you are attempting. grep can read patterns from a file with -f, and the -v option inverts the match. So you can find all lines in myReads.sam that do not contain any of the patterns in myIDs.txt with

grep -v -f myIDs.txt myReads.sam

Example Use/Output

Using your data in data.txt and your IDs in filter.txt, you get your desired results, e.g.

$ grep -v -f filter.txt data.txt
J00118:315:HMJWTBBXX:4:2211:19532:14449 4   *   0   0   *   *   0   0   CR:Z:TATGTCATCTTTCCTC

Edit -- If You Want Only Lines that Contain a String from myIDs.txt

Then remove the -v, e.g.

$ grep -f filter.txt data.txt
J00118:315:HMJWTBBXX:4:1118:21684:2246  4   *   0   0   *   *   0   0   CR:Z:TTTGTCATCTGTTTGT
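
Since the IDs are literal strings rather than regular expressions, it may be worth adding -F so grep treats each line of the ID file as a fixed string. Here is a minimal sketch that applies this to the file names from the question and writes the result to myFilteredReads.sam. The scratch directory and the sample files it creates are only there to make the example self-contained; they are an assumption, not part of the original setup.

```shell
# Hypothetical scratch directory holding the sample data from the question.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' \
  'J00118:315:HMJWTBBXX:4:1118:21684:2246  4   *   0   0   *   *   0   0   CR:Z:TTTGTCATCTGTTTGT' \
  'J00118:315:HMJWTBBXX:4:2211:19532:14449 4   *   0   0   *   *   0   0   CR:Z:TATGTCATCTTTCCTC' \
  > myReads.sam
printf '%s\n' 'CR:Z:TTTGTCATCTGTTTGT' 'CB:Z:CTACCCAGTCGACTGC' 'QT:Z:AAFFFJJJ' > myIDs.txt

# -F: treat every line of myIDs.txt as a fixed string, not a regex.
# -f: read the list of strings from the file.
# Without -v, only lines containing at least one of the strings are kept.
grep -F -f myIDs.txt myReads.sam > myFilteredReads.sam
```

One thing to check before running this on real data: an empty line in myIDs.txt acts as an empty pattern, which matches every line, so the filter would let everything through.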

Sorry I misunderstood what you intended to include/exclude.

#2

main is the file with the content

str is the file with the 'interesting strings'

out is the output file

#!/bin/bash

# For each string in str, append the matching lines of main to out.
while IFS= read -r line; do
  grep -F -- "$line" main >> out
done < str
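
Note that the loop above reads main once per string, and a line that contains two of the strings ends up in out twice. Since the question asked about awk, here is a rough single-pass sketch of the same idea, using the main, str and out names from this answer. The scratch directory and sample data are assumptions added only so the example runs on its own.

```shell
# Hypothetical scratch directory holding the sample data from the question.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' \
  'J00118:315:HMJWTBBXX:4:1118:21684:2246  4   *   0   0   *   *   0   0   CR:Z:TTTGTCATCTGTTTGT' \
  'J00118:315:HMJWTBBXX:4:2211:19532:14449 4   *   0   0   *   *   0   0   CR:Z:TATGTCATCTTTCCTC' \
  > main
printf '%s\n' 'CR:Z:TTTGTCATCTGTTTGT' 'CB:Z:CTACCCAGTCGACTGC' 'QT:Z:AAFFFJJJ' > str

# First file (str): store every non-empty string as an array key.
# Second file (main): print a line once if it contains any stored string.
# index() does a plain substring search, so no regex escaping is needed.
awk '
  NR == FNR { if ($0 != "") ids[$0]; next }
  {
    for (id in ids)
      if (index($0, id)) { print; next }
  }
' str main > out
```

This keeps the lines in their original order and never prints a line more than once. It assumes str is not empty, because the NR == FNR test cannot tell the two files apart when the first one has no lines.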
