How to create a dataset from a sequence file in Python

Time: 2022-08-31 22:52:54

I have a protein sequence file that looks like this:

>102L:A       MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL       -------------------------------------------------------------------------------------------------------------------------------------------------------------------XX

The first field is the name of the sequence, the second is the actual protein sequence, and the third is an indicator that shows whether any coordinates are missing. In this case, notice that there are two "X" characters at the end. That means the last two residues of the sequence, which are "NL" in this case, are missing coordinates.

Using Python, I would like to generate a table with the following columns:

  1. name of the sequence
  2. total number of missing coordinates (which is the number of X)
  3. the range of these missing coordinates (which is the range of the positions of those X)
  4. the length of the sequence
  5. the actual sequence

So the final result should look like this:

>102L:A 2 163-164 164 MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL

And my code looks like this so far:

total_seq = []
with open('sample.txt') as lines:
    for l in lines:
        split_list = l.split()

        # Assign the list number
        header = split_list[0]                                # 1
        seq = split_list[1]                                   # 5
        disorder = split_list[2]

        # count sequence length and total residue of missing coordinates
        sequence_length = len(seq)                            # 4

        for x in disorder:
            counts = 0
            if x == 'X':
                counts = counts + 1

        total_seq.append([header, seq, str(counts)])   # obviously I haven't finish coding 2 & 3

with open('new_sample.txt', 'a') as f:
    for lol in total_seq:
        f.write('\n'.join(lol))

I'm new to Python. Could anyone help, please?

1 Answer

#1


Here's your modified code. It now produces your desired output.

with open("sample.txt") as infile:
    # each row is [header, sequence, disorder]; blank lines are skipped
    matrix = [line.split() for line in infile if line.strip()]

with open('new_sample.txt', 'a') as f:
    for row in matrix:
        header, seq, disorder = row[:3]

        # length of the sequence
        sequence_length = len(seq)

        # get total number of missing coordinates
        num_missing = disorder.count('X')

        # get the range of these missing coordinates as 1-based positions;
        # find()/rfind() return 0-based indices, and -1 when there is no 'X'
        if num_missing:
            first_X_pos = disorder.find('X') + 1
            last_X_pos = disorder.rfind('X') + 1
            range_missing = '-'.join([str(first_X_pos), str(last_X_pos)])
        else:
            range_missing = '-'   # placeholder chosen here for "nothing missing"

        reformat_seq = " ".join([header, str(num_missing), range_missing, str(sequence_length), seq])
        f.write(reformat_seq + '\n')
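
One caveat: taking the first and last X gives a single range, which is only accurate when all the X characters sit next to each other, as in your example. If a file could contain several separate gaps, something like the following sketch would report each one. The helper name missing_ranges and the test strings are made up for illustration; positions are counted from 1 so they match the 163-164 in your expected output.

```python
import re

def missing_ranges(disorder):
    """Return 1-based (start, end) pairs, one per consecutive run of 'X'."""
    # m.start() is 0-based, so add 1; m.end() is exclusive, so it already
    # equals the 1-based position of the last 'X' in the run
    return [(m.start() + 1, m.end()) for m in re.finditer(r"X+", disorder)]

# hypothetical disorder strings, not taken from a real file
print(missing_ranges("XX---X--"))          # [(1, 2), (6, 6)]
print(missing_ranges("-" * 162 + "XX"))    # [(163, 164)]
```

You could then join the pairs into text such as "1-2,6-6" before writing the row, if that format suits your table.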

Some more tips:

Don't forget about Python's string methods. They will solve a lot of your problems with very little code. The documentation is very good.
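
As a quick illustration, here are the three string methods that do most of the work for this problem, run on a made-up disorder string (not from your file):

```python
disorder = "-----XX"   # made-up example: 5 residues with coordinates, 2 without

print(disorder.count("X"))   # 2  -> number of missing coordinates
print(disorder.find("X"))    # 5  -> 0-based index of the first X
print(disorder.rfind("X"))   # 6  -> 0-based index of the last X
print("ABC".find("X"))       # -1 -> find() returns -1 when nothing is found
```

Note that the indices are 0-based, so add 1 if you want residue positions that start counting at 1.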

If you search for how to do just part 2 or just part 3 of your question, you will find existing answers elsewhere.
