14 Finding a Shared Motif

Problem

A common substring of a collection of strings is a substring of every member of the collection. We say that a common substring is a longest common substring if there does not exist a longer common substring. For example, "CG" is a common substring of "ACGTACGT" and "AACCGTATA", but it is not as long as possible; in this case, "CGTA" is a longest common substring of "ACGTACGT" and "AACCGTATA".

Note that the longest common substring is not necessarily unique; for a simple example, "AA" and "CC" are both longest common substrings of "AACC" and "CCAA".
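The non-uniqueness noted above can be checked with a small brute-force sketch. The function name `longest_common_substrings` is made up for this illustration and is not part of the Rosalind problem. It tries lengths from longest to shortest and returns every common substring of the first length that works.

```python
def longest_common_substrings(strings):
    """Return the set of all longest common substrings (brute force)."""
    shortest = min(strings, key=len)  # a common substring cannot be longer than this
    for length in range(len(shortest), 0, -1):
        # Every substring of the shortest string with the current length.
        candidates = {shortest[i:i + length]
                      for i in range(len(shortest) - length + 1)}
        common = {c for c in candidates if all(c in s for s in strings)}
        if common:
            return common
    return set()  # the strings share no characters at all


print(sorted(longest_common_substrings(["AACC", "CCAA"])))           # ['AA', 'CC']
print(sorted(longest_common_substrings(["ACGTACGT", "AACCGTATA"])))  # ['CGTA']
```

For the sample dataset the same function returns three substrings of length 2 ("AC", "CA" and "TA"), so "AC" is only one of several acceptable answers.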

Given: A collection of k (k ≤ 100) DNA strings of length at most 1 kbp each in FASTA format.

Return: A longest common substring of the collection. (If multiple solutions exist, you may return any single solution.)

Sample Dataset

>Rosalind_1

GATTACA

>Rosalind_2

TAGACCA

>Rosalind_3

ATACA

Sample Output

AC
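Both methods below begin by reading the FASTA input. As a minimal sketch of that step, the hypothetical helper `parse_fasta` (a name invented here) parses FASTA text held in a string. It assumes header lines start with ">" and that a sequence may wrap over several lines.

```python
def parse_fasta(text):
    """Return the sequences in a FASTA-formatted string, in file order."""
    sequences = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            sequences.append("")  # a header starts a new record
        elif sequences:
            sequences[-1] += line.upper()  # append wrapped sequence lines
    return sequences


sample = """>Rosalind_1
GATTACA
>Rosalind_2
TAGACCA
>Rosalind_3
ATACA
"""
print(parse_fasta(sample))  # ['GATTACA', 'TAGACCA', 'ATACA']
```

Collecting the sequences in a list rather than a dict keyed by header means that two records with the same header do not overwrite each other.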

# Method 1

# coding=utf-8

'''

>Rosalind_1

GATTACA

>Rosalind_2

TAGACCA

>Rosalind_3

ATACA

'''

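def readfasta(filename, sample):

    # Parse the FASTA file into {header: sequence}; a dict keeps insertion order (Python 3.7+).

    res = {}

    ID = ''

    with open(filename, 'r') as fa:

        for line in fa:

            line = line.strip()

            if not line:

                continue    # skip blank lines

            if line.startswith('>'):

                ID = line

                res[ID] = ''

            else:

                res[ID] += line

    rres = list(res.values())

    # Also write the sequences, one per line, to the sample file.

    with open(sample, 'w') as fo:

        for seq in rres:

            fo.write(seq + '\n')

    return rres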

def fragement(seq_list):

    # Collect every substring of the first sequence: each suffix seq[i:] and all of its prefixes.

    res = []

    seq = seq_list[0]

    for i in range(len(seq)):

        s_seq = seq[i:]

        for j in range(len(s_seq)):

            res.append(s_seq[:(len(s_seq) - j)])

    return res

def main(infile, sample):

    seq_list = readfasta(infile, sample)   # e.g. ['GATTACA', 'TAGACCA', 'ATACA']

    frags = fragement(seq_list)

    frags.sort(key=len, reverse=True)     # sort from longest to shortest

    for frag in frags:

        # The first fragment found in every sequence is a longest common substring.

        if all(frag in seq for seq in seq_list):

            print(frag)

            break

main('14.txt', 'sample.txt')
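Method 1 stores and sorts every substring of the first sequence, which grows quickly for sequences near 1 kbp. As a hedged alternative (a different technique, not the approach used above), the sketch below binary-searches on the answer's length. It relies on the fact that if a common substring of length L exists, a common substring of every shorter length exists too. The names `common_of_length` and `longest_common_substring` are invented for this example.

```python
def common_of_length(strings, length):
    """Return one substring of the given length found in all strings, or None."""
    first, rest = strings[0], strings[1:]
    seen = set()
    for i in range(len(first) - length + 1):
        cand = first[i:i + length]
        if cand in seen:
            continue  # this candidate was already tested
        seen.add(cand)
        if all(cand in s for s in rest):
            return cand
    return None


def longest_common_substring(strings):
    """Binary search on the length of the longest common substring."""
    lo, hi = 0, min(len(s) for s in strings)
    best = ""
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        found = common_of_length(strings, mid)
        if found is not None:
            best, lo = found, mid  # length mid works, try longer
        else:
            hi = mid - 1           # length mid fails, try shorter
    return best


print(longest_common_substring(["GATTACA", "TAGACCA", "ATACA"]))  # TA
```

On the sample dataset this prints "TA" rather than "AC". Both have length 2 and occur in all three sequences, so either is an acceptable answer.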

Method 2: (I did not fully understand this one)

# coding=utf-8

'''

A solution to a ROSALIND bioinformatics problem.

Problem Title: Finding a Shared Motif

Rosalind ID: LCSM

Rosalind #: 014

URL: http://rosalind.info/problems/lcsm/

'''

def LongestSubstring(string_list):

    '''Extracts all substrings from the first string in a list, and sends longest substring candidates to be checked.'''

    longest = ''

    for start_index in range(len(string_list[0])):

        for end_index in range(len(string_list[0]), start_index, -1):

            # Break if the length becomes too small, as it will only get smaller.

            if end_index - start_index <= len(longest):

                break

            elif CheckSubstring(string_list[0][start_index:end_index], string_list):

                longest = string_list[0][start_index:end_index]

    return longest

def CheckSubstring(find_string, string_list):

    'Checks if a given substring appears in all members of a given collection of strings and returns True/False.'

    for string in string_list:

        if (len(string) < len(find_string)) or (find_string not in string):

            return False

    return True

seq = {}

seq_name = ''

with open('14.txt') as f:

    for line in f:

        if line[0] == '>':

            seq_name = line.rstrip()

            seq[seq_name] = ''

            continue

        else:

            seq[seq_name] += (line.rstrip()).upper()


if __name__ == '__main__':

    dna = []

    for seq_name in seq:

        dna.append(seq[seq_name])

    lcsm = LongestSubstring(dna)

    print(lcsm)

    with open('014_LCSM.txt', 'w') as output_data:

        output_data.write(lcsm)
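To make the pruning in Method 2 easier to follow, here is a small traced sketch of the same loop structure. The function `longest_substring_traced` is a name invented for this illustration. It records every candidate that is actually tested, so the effect of the early `break` is visible.

```python
def longest_substring_traced(strings):
    """Same search as Method 2, but also return the candidates that were tested."""
    first = strings[0]
    longest = ""
    checked = []
    for start in range(len(first)):
        # Try the longest candidate from this start position first.
        for end in range(len(first), start, -1):
            # Candidates only get shorter from here, so once they cannot
            # beat the current best there is nothing left to gain.
            if end - start <= len(longest):
                break
            cand = first[start:end]
            checked.append(cand)
            if all(cand in s for s in strings):
                longest = cand
    return longest, checked


best, checked = longest_substring_traced(["GATTACA", "TAGACCA", "ATACA"])
print(best)          # TA
print(len(checked))  # 21 of the 28 substrings of GATTACA are tested
```

Once a common substring of length 2 has been found, later start positions are only tested at lengths of 3 or more, which is why the last two start positions of "GATTACA" are skipped entirely.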
