I need help finishing this bioinformatics program

I need to create a program that takes a text file called a fasta file and, for each sequence, reports the sequence name, the domain names, the start of each domain, and the end of each domain.

A fasta file is just a text file that looks like this:

>MICE_8
ATTCGATCGATCGATTTCGATCGATCGATCGATCGGGATCGATCGATCGATCGATC
>MICE_59 
ATTTTTCGGCATCGATAGCTAGCTAGCTAG
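
As a rough illustration of the format above, the sequence names can be collected by looking only at the lines that start with `>`. This is a minimal sketch written for Python 3; the function name `read_fasta_names` is hypothetical and not part of the original program, and it assumes the name is the first whitespace-delimited token on the header line.

```python
def read_fasta_names(lines):
    """Return the sequence names, in file order, from an iterable of FASTA lines."""
    names = []
    for line in lines:
        if line.startswith(">"):
            # Assumption: the name is the first token after '>'.
            # split() also drops trailing spaces and the newline.
            fields = line[1:].split()
            if fields:
                names.append(fields[0])
    return names


# Shortened versions of the two records shown above.
example = [">MICE_8\n", "ATTCGATCGATCG\n", ">MICE_59 \n", "ATTTTTCGGCATCG\n"]
print(read_fasta_names(example))
```

With a real file, the same function could be called as `read_fasta_names(open(fastaname))`, since a file object yields its lines one at a time.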

My program needs to take one command-line argument, which is the file name of the fasta file, and give an output like this:

MICE_8 gnl|CDD|256537 819 923 gnl|CDD|260076 111 189 gnl|CDD|260056 4 93                                          
MICE_59

Here is a description of the output for more information:

MICE_8 is the name of the first sequence in the fasta file
gnl|CDD|256537 is the name of the first protein domain
819 is where the domain starts
923 is where it ends
gnl|CDD|260076 is the name of the second protein domain for the first sequence, and so on; it starts at position 111 and ends at position 189.

Also, even though the last sequence did not get a hit, the program still needs to display the name of that sequence.

OK, so here is my code so far, followed by what it currently outputs:

import sys
import os

fastaname = sys.argv[1]
rpsblastname = "rpsblast.out"

cmd = "rpsblast+ -db /home/bryan/data/cdd/cdd -query %s -outfmt 6 -evalue 0.05 > %s" % (fastaname,rpsblastname)
os.system(cmd)

handle = open(rpsblastname, "r")
seqname = ""
for line in handle:
    linearr = line.split()
    # seqname = linearr [0]
    domain = linearr[1]
    start = linearr[6]
    end = linearr[7]
    # If sequence name is the same as last time, don't print it
    if seqname == linearr[0]:
        sys.stdout.write("%s %s %s" % (domain, start, end))
    # Otherwise do print the sequence name, and update seqname
    else:
        seqname = linearr[0]
        print
        sys.stdout.write("%s %s %s %s" % (seqname,domain,start,end))

Here is what my output looks like so far:

mel@roswald:~$ ./Domainfinder.py bioinformation.fasta 

MICE_8 gnl|CDD|256537 819 923gnl|CDD|260076 111 189gnl|CDD|260056 4 93

The program I created almost meets the required specification. I only have 3 problems that I need to address:

There is an extra blank line between where I run the program and the result.
My program does not write out the name of a sequence that has zero hits.
My program does not separate the domain names with a space.

The correct output should look like this:

mel@roswald:~$ ./Domainfinder.py bioinformation.fasta                                        
MICE_8 gnl|CDD|256537 819 923 gnl|CDD|260076 111 189 gnl|CDD|260056 4 93                                          
MICE_59

1 solution

#1

Solved the issue. Mainly what needed to be done was to use a dictionary that holds the sequence names as keys, instead of using lists. Then, since dictionaries are unordered (in the Python 2 used here), we need to build a list from the dictionary so the sequence names stay in the order they are read. We also extract the sequence names from the rpsblast output. If anyone has any questions, feel free to PM me.
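
The answer does not include the final code, so the following is only a sketch of how the dictionary approach might look, not the poster's actual solution. It is written for Python 3 and makes a few assumptions: the sequence names are supplied up front (for example, read from the fasta file) so that sequences with no hits still get a line, an `OrderedDict` stands in for the separate ordering list, and the names `group_hits` and `format_hits` are hypothetical. The sample rows reuse the domain names and positions from the question; every other column is a placeholder value.

```python
from collections import OrderedDict


def group_hits(names, blast_lines):
    """Map each sequence name to its (domain, start, end) hits, keeping order."""
    # Pre-fill every name so sequences without hits are still reported.
    hits = OrderedDict((name, []) for name in names)
    for line in blast_lines:
        fields = line.split()
        if len(fields) < 8:
            continue  # skip blank or malformed lines
        # Same columns as the original script: 0 = query name,
        # 1 = domain name, 6 = start, 7 = end.
        seqname, domain, start, end = fields[0], fields[1], fields[6], fields[7]
        hits.setdefault(seqname, []).append((domain, start, end))
    return hits


def format_hits(hits):
    """Build one space-separated output line per sequence."""
    lines = []
    for name, domains in hits.items():
        parts = [name]
        for domain in domains:
            parts.extend(domain)
        lines.append(" ".join(parts))
    return lines


# Placeholder tabular rows: only columns 0, 1, 6 and 7 come from the question.
sample = [
    "MICE_8\tgnl|CDD|256537\t30.0\t105\t0\t0\t819\t923\t1\t105\t1e-05\t40.0",
    "MICE_8\tgnl|CDD|260076\t30.0\t79\t0\t0\t111\t189\t1\t79\t1e-05\t40.0",
    "MICE_8\tgnl|CDD|260056\t30.0\t90\t0\t0\t4\t93\t1\t90\t1e-05\t40.0",
]
for out_line in format_hits(group_hits(["MICE_8", "MICE_59"], sample)):
    print(out_line)
```

Joining the parts with `" ".join(...)` puts exactly one space between every field, and printing one finished line per sequence avoids both the leading blank line and the missing separators described in the question.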
