I have around 500 protein sequences in FASTA format that I got from a blastp search. For each sequence, I need the protein name, organism, UniProt ID and, if possible, the protein family, so that I can build a table with that information.
Is there any way to do this using Python? Is there a function that communicates with UniProt? And how can I parse the information from the FASTA header?
1 solution
#1
4
You should take a look at Biopython, which has a FASTA parser. After parsing, you can use a pandas DataFrame to build the table. Without a snippet of example data it is difficult to give a more thorough answer, but it should be doable in about 5 lines of code :)
from Bio import SeqIO

# Parse every record in the FASTA file and print the resulting list (Python 3 syntax).
with open("example.fasta") as handle:
    print(list(SeqIO.parse(handle, "fasta")))
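For the header-parsing part of the question, here is a minimal sketch using only the standard library. It assumes the headers follow the UniProt style (`>sp|ACCESSION|ENTRY_NAME Protein name OS=Organism OX=... GN=...`); the question does not show any sample data, so that layout is an assumption, and headers from an NCBI blastp run would look different. The function name `parse_uniprot_header` and the regular expression are hypothetical helpers written for this illustration, not part of Biopython.

```python
import re

# Assumed UniProt-style header layout (not confirmed by the question):
#   >db|accession|entry_name protein name OS=organism OX=taxid GN=gene ...
HEADER_RE = re.compile(
    r"^>?(?P<db>sp|tr)\|(?P<accession>[^|]+)\|(?P<entry_name>\S+)\s+"
    r"(?P<protein_name>.*?)\s+OS=(?P<organism>.*?)(?:\s+[A-Z]{2}=|$)"
)


def parse_uniprot_header(header):
    """Return a dict of header fields, or None if the header does not match."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        return None
    return match.groupdict()


# Illustrative header written for this example.
example = (
    ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha "
    "OS=Homo sapiens OX=9606 GN=HBA1 PE=1 SV=2"
)
row = parse_uniprot_header(example)
print(row["accession"], row["protein_name"], row["organism"])
```

With Biopython, the string to pass in would be `record.description` for each record returned by `SeqIO.parse`, and the resulting list of dicts (after skipping any `None` results) can be handed to `pandas.DataFrame` to get the table. The protein family is not part of this header layout, so it would probably need a separate lookup by accession.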