I need to create a program that takes a text file in FASTA format and reports, for each sequence, the sequence name, the domain names, and the start and end of each domain.
A FASTA file is just a text file that looks like this:
>MICE_8
ATTCGATCGATCGATTTCGATCGATCGATCGATCGGGATCGATCGATCGATCGATC
>MICE_59
ATTTTTCGGCATCGATAGCTAGCTAGCTAG
My program needs to take one command-line argument, the name of the FASTA file, and produce output like this:
MICE_8 gnl|CDD|256537 819 923 gnl|CDD|260076 111 189 gnl|CDD|260056 4 93
MICE_59
Here is a description of the output for more information:
- MICE_8 is the name of the first sequence in the FASTA file
- gnl|CDD|256537 is the name of the first protein domain
- 819 is where the domain starts
- 923 is where it ends
- gnl|CDD|260076 is the name of the second protein domain for the first sequence; it starts at position 111 and ends at position 189, and so on.
Also, since the last sequence did not get a hit, the program still needs to display the name of that sequence.
OK, so here is my code so far, followed by what it currently outputs:
import sys
import os

fastaname = sys.argv[1]
rpsblastname = "rpsblast.out"
cmd = "rpsblast+ -db /home/bryan/data/cdd/cdd -query %s -outfmt 6 -evalue 0.05 > %s" % (fastaname, rpsblastname)
os.system(cmd)
handle = open(rpsblastname, "r")
seqname = ""
for line in handle:
    linearr = line.split()
    # seqname = linearr[0]
    domain = linearr[1]
    start = linearr[6]
    end = linearr[7]
    # If sequence name is the same as last time, don't print it
    if seqname == linearr[0]:
        sys.stdout.write("%s %s %s" % (domain, start, end))
    # Otherwise do print the sequence name, and update seqname
    else:
        seqname = linearr[0]
        print
        sys.stdout.write("%s %s %s %s" % (seqname, domain, start, end))
Here is what my output looks like so far:
mel@roswald:~$ ./Domainfinder.py bioinformation.fasta
MICE_8 gnl|CDD|256537 819 923gnl|CDD|260076 111 189gnl|CDD|260056 4 93
The program I created almost meets the required specification. I only have 3 problems that I need to address:
- there is an extra blank line between where I run the program and the result
- my program does not write out the name of a sequence that has zero hits
- my program does not separate consecutive domains with a space (it prints 923gnl|CDD|260076 instead of 923 gnl|CDD|260076).
The correct output should look like this:
mel@roswald:~$ ./Domainfinder.py bioinformation.fasta
MICE_8 gnl|CDD|256537 819 923 gnl|CDD|260076 111 189 gnl|CDD|260056 4 93
MICE_59
1 solution
#1
Solved the issue. Mainly, what needed to be done was to use a dictionary keyed by sequence name instead of using lists. Then, since dictionaries are unordered, we need a separate list of the sequence names so they can be printed in the order they were read. We also extract the sequence names from the rpsblast output. If anyone has any questions, feel free to PM me.
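Since no code was posted with the fix, here is a minimal sketch of the dictionary approach described above. It is not the original solution. It is written for Python 3, and the function names (read_sequence_names, collect_domains, format_report) are made up for illustration. One assumption differs from the description: the sequence names are read from the '>' header lines of the FASTA file rather than from the rpsblast output, because, as far as I know, the tabular output (-outfmt 6) only contains rows for sequences that got a hit, so a zero-hit sequence such as MICE_59 would otherwise never be seen.

```python
import sys
from collections import OrderedDict


def read_sequence_names(fasta_lines):
    """Return sequence names in file order, taken from the '>' header lines."""
    names = []
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.append(fields[0])
    return names


def collect_domains(names, blast_lines):
    """Map each sequence name to its list of (domain, start, end) hits.

    Every name starts with an empty list, so sequences without hits
    still appear in the report.
    """
    hits = OrderedDict((name, []) for name in names)
    for line in blast_lines:
        fields = line.split()
        if len(fields) < 8:
            continue  # skip blank or malformed lines
        # Same columns as the original script: 0 = query, 1 = domain,
        # 6 = start, 7 = end.
        seqname, domain, start, end = fields[0], fields[1], fields[6], fields[7]
        hits.setdefault(seqname, []).append((domain, start, end))
    return hits


def format_report(hits):
    """Build one space-separated output line per sequence."""
    lines = []
    for name, domains in hits.items():
        parts = [name]
        for hit in domains:
            parts.extend(hit)
        lines.append(" ".join(parts))
    return lines


# Hypothetical usage: script.py sequences.fasta rpsblast.out
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as fasta, open(sys.argv[2]) as blast:
        names = read_sequence_names(fasta)
        for row in format_report(collect_domains(names, blast)):
            print(row)
```

Joining the fields with " ".join takes care of the missing spaces between domains, printing one line per sequence avoids the leading blank line, and pre-filling the dictionary with every name from the FASTA file keeps the zero-hit sequences in the output. The sketch expects rpsblast to have been run already, with its output saved to a file.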