For example, I have a FASTA file with the following sequences:
>human1
AGGGCGSTGC
>human2
GCTTGCGCTAG
>human3
TTCGCTAG
How can I use Python to read a text file with the following content and use it to extract sequences? Each line corresponds to one sequence, in order: 1 means true and 0 means false. Only sequences with the value 1 should be extracted.
Example text file:
0
1
1
Expected output:
>human2
GCTTGCGCTAG
>human3
TTCGCTAG
4 solutions
#1
5
For this it is better to use Biopython:
from Bio import SeqIO

# One boolean per record: True where the mask line is "1"
with open("mask.txt") as mask_file:
    mask = [line.strip() == "1" for line in mask_file]
# SeqIO.parse accepts a filename and handles multi-line sequences
seqs = list(SeqIO.parse("input.fasta", "fasta"))
seqs_filter = [seq for flag, seq in zip(mask, seqs) if flag]
for seq in seqs_filter:
    print(seq.format("fasta"), end="")
You get:
>human2
GCTTGCGCTAG
>human3
TTCGCTAG
Explanation:
Parse FASTA: a FASTA record may spread its sequence over several lines (check the FASTA format), so it is better to use a specialized library to parse the input and write the output.
Mask: I read the mask file and convert each line to a boolean, giving [False, True, True].
Filter: zip pairs each sequence with its mask value, and a list comprehension keeps only the sequences whose value is True.
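If Biopython is not available, the same parse, mask, and filter steps can be sketched with the standard library alone. This is only an illustration under stated assumptions: the helper names `read_fasta` and `filter_fasta` are made up for this example, the demo splits human1 over two lines to show multi-line handling, and in-memory `io.StringIO` objects stand in for the real files.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs, joining the lines of each record."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def filter_fasta(fasta_handle, mask_handle):
    """Keep the records whose mask line is '1' (one mask line per record)."""
    mask = [line.strip() == "1" for line in mask_handle if line.strip()]
    return [rec for flag, rec in zip(mask, read_fasta(fasta_handle)) if flag]

# Demo data: human1 is deliberately split over two sequence lines.
fasta = io.StringIO(
    ">human1\nAGGGCG\nSTGC\n>human2\nGCTTGCGCTAG\n>human3\nTTCGCTAG\n"
)
mask = io.StringIO("0\n1\n1\n")

kept = filter_fasta(fasta, mask)
for header, seq in kept:
    print(">" + header)
    print(seq)
```

With real files, the two `io.StringIO` objects would be replaced by `open(...)` handles. Note that `zip` stops silently at the shorter input, so a mask with too few lines drops the trailing records.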
#2
3
I think this may help you, and I really think you should take some time to learn Python. Python is a good language for bioinformatics.
# Read the mask: one 0/1 flag per sequence
display = []
with open('test.txt') as f:
    for line in f:
        display.append(int(line.strip()))

output_DNA = []
with open('XX.fasta') as f:
    index = -1
    for line in f:
        if line[0] == '>':
            index = index + 1  # a header line starts the next record
        if display[index]:
            output_DNA.append(line)  # keep the header and its sequence lines
print(''.join(output_DNA), end='')
#3
1
You can create a list to act as a mask while you read your FASTA file:
with open('mask.txt') as mf:
    mask = [s.strip() == '1' for s in mf]
Then:
with open('seq.fasta') as f:
    record = -1
    for line in f:
        if line.startswith('>'):
            record += 1  # mask entries correspond to records, not lines
        if mask[record]:
            print(line, end='')
or, if every record is exactly one header line followed by one sequence line:
with open(mask_file) as mf, open(seq_file) as sf:
    for b, header in zip(mf, sf):
        seq = next(sf)  # the sequence line that follows the header
        if b.strip() == '1':
            print(header + seq, end='')
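Once the records and the mask are both in memory, the standard library's `itertools.compress` applies a boolean mask directly. The snippet below is a small sketch with the question's data hard-coded; the `records` list of (header, sequence) tuples is an assumption made for illustration, not something the answer itself builds.

```python
from itertools import compress

# Records and mask taken from the question, hard-coded for the demo.
records = [
    ("human1", "AGGGCGSTGC"),
    ("human2", "GCTTGCGCTAG"),
    ("human3", "TTCGCTAG"),
]
mask = [False, True, True]

# compress() yields the items whose matching selector is true.
selected = list(compress(records, mask))
for header, seq in selected:
    print(">" + header)
    print(seq)
```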
#4
0
I am not familiar with the FASTA file format specifically, but hopefully this helps. You can open your file in Python the following way and collect the indices of the valid lines in a list.
valid = []
with open('test.txt') as f:
    all_lines = f.readlines()  # get all the lines
all_lines = [x.strip() for x in all_lines]  # strip away newline chars
for i in range(len(all_lines)):
    if all_lines[i] == '1':  # if it matches our condition
        valid.append(i)  # add the index to our list
print(valid)  # or get only the fasta file contents for these indices
I ran it with the following text file test.txt:
0
1
1
1
0
0
1
1
And got this output when printing valid:
[1, 2, 3, 6, 7]
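To go from such an index list to the sequences themselves, one possible sketch counts header lines and keeps every line belonging to a wanted record. The function name `select_by_index` and the hard-coded `fasta_lines` are hypothetical, and the indices [1, 2] are what the code above would produce for the question's three-line mask.

```python
def select_by_index(lines, valid):
    """Return the lines of the records whose 0-based index is in valid."""
    wanted = set(valid)
    picked = []
    index = -1
    for line in lines:
        if line.startswith(">"):
            index += 1  # each header line starts a new record
        if index in wanted:
            picked.append(line)
    return picked

# The question's FASTA content as a list of lines.
fasta_lines = [
    ">human1\n", "AGGGCGSTGC\n",
    ">human2\n", "GCTTGCGCTAG\n",
    ">human3\n", "TTCGCTAG\n",
]
picked = select_by_index(fasta_lines, [1, 2])
print("".join(picked), end="")
```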
I think this will help you move along, but please let me know in the comments if you need an expanded answer.