寻找优雅的水珠状DNA串扩展

I'm trying to make a glob-like expansion of a set of DNA strings that have multiple possible bases.

我正在尝试对一组具有多个可能碱基的DNA串进行类似全局扩展。

The base of my DNA strings contains the letters A, C, G, and T. However, I can have special characters like M which could be an A or a C.

我的DNA字符串的基数包含字母A,C,G和T.但是,我可以有像M这样的特殊字符,可以是A或C.

For example, say I have the string:

例如,假设我有字符串:

ATMM

I would like to take this string as input and output the four possible matching strings:

我想把这个字符串作为输入并输出四个可能的匹配字符串:

ATAA ATAC ATCA ATCC

ATAA ATAC ATCA ATCC

Rather than brute force a solution, I feel like there must be some elegant Python/Perl/Regular Expression trick to do this.

我觉得必须有一些优雅的Python / Perl / Regular Expression技巧才能做到这一点,而不是强力解决问题。

Thank you for any advice.

谢谢你的任何建议。

Edit, thanks cortex for the product operator. This is my solution:

编辑,感谢cortex为产品运营商。这是我的解决方案:

Still a Python newbie, so I bet there's a better way to handle each dictionary key than another for loop. Any suggestions would be great.

仍然是一个Python新手,所以我敢打赌,有一个更好的方法来处理每个字典键而不是另一个循环。任何建议都会很棒。

import sys
from itertools import product

baseDict = dict(M=['A','C'],R=['A','G'],W=['A','T'],S=['C','G'],
                  Y=['C','T'],K=['G','T'],V=['A','C','G'],
                  H=['A','C','T'],D=['A','G','T'],B=['C','G','T'])
def glob(str):
    strings = [str]

    ## this loop visits very possible base in the dictionary
    ## probably a cleaner way to do it
    for base in baseDict:
        oldstrings = strings
        strings = []
        for string in oldstrings:
            strings += map("".join,product(*[baseDict[base] if x == base 
                                 else [x] for x in string]))
    return strings

for line in sys.stdin.readlines():
    line = line.rstrip('\n')
    permutations = glob(line)
    for x in permutations:
        print x

5 个解决方案

#1

Agree with other posters that it seems like a strange thing to want to do. Of course, if you really want to, there is (as always) an elegant way to do it in Python (2.6+):

同意其他海报,这似乎是一件奇怪的事情。当然,如果你真的想,有(一如既往)优雅的方式在Python(2.6+)中做到这一点:

from itertools import product
map("".join, product(*[['A', 'C'] if x == "M" else [x] for x in "GMTTMCA"]))

Full solution with input handling:

输入处理的完整解决方案:

import sys
from itertools import product

base_globs = {"M":['A','C'], "R":['A','G'], "W":['A','T'],
              "S":['C','G'], "Y":['C','T'], "K":['G','T'],

              "V":['A','C','G'], "H":['A','C','T'],
              "D":['A','G','T'], "B":['C','G','T'],
              }

def base_glob(glob_sequence):
    production_sequence = [base_globs.get(base, [base]) for base in glob_sequence]
    return map("".join, product(*production_sequence))

for line in sys.stdin.readlines():
    productions = base_glob(line.strip())
    print "\n".join(productions)

#2

You probably could do something like this in python using the yield operator

您可能可以使用yield运算符在python中执行类似的操作

def glob(str):
      if str=='':           
          yield ''
          return      

      if str[0]!='M':
          for tail in glob(str[1:]): 
              yield str[0] + tail                  
      else:
         for c in ['A','G','C','T']:
             for tail in glob(str[1:]):
                 yield c + tail                 
      return

EDIT: As correctly pointed out I was making a few mistakes. Here is a version which I tried out and works.

编辑:正确地指出我犯了一些错误。这是我试用并运行的版本。

#3

This isn't really an "expansion" problem and it's almost certainly not doable with any sensible regular expression.

这实际上不是一个“扩展”问题,而且几乎肯定不能用任何明智的正则表达式来表达。

I believe what you're looking for is "how to generate permutations".

我相信你所寻找的是“如何产生排列”。

#4

You could for example do this recursively. Pseudo-code:

例如,您可以递归地执行此操作。伪代码:

printSequences(sequence s)
  switch "first special character in sequence"
    case ...
    case M:
      s1 = s, but first M replaced with A
      printSequences(s1)
      s2 = s, but first M replaced with C
      printSequences(s2)
    case none:
      print s;

#5

Regexps match strings, they're not intended to be turned into every string they might match.

Regexps匹配字符串,它们不打算转换为它们可能匹配的每个字符串。

Also, you're looking at a lot of strings being output from this - for instance:

此外,您正在查看从此输出的大量字符串 - 例如:

MMMMMMMMMMMMMMMM (16 M's)

produces 65,536 16 character strings - and I'm guessing that DNA sequences are usually longer than that.

产生65,536个16字符串 - 我猜测DNA序列通常比这长。

Arguably any solution to this is pretty much 'brute force' from a computer science perspective, because your algorithm is O(2^n) on the original string length. There's actually quite a lot of work to be done.

可以说,从计算机科学的角度来看,任何解决方案都是“暴力”,因为你的算法在原始字符串长度上是O(2 ^ n)。实际上还有很多工作要做。

Why do you want to produce all the combinations? What are you going to do with them? (If you're thinking to produce every string possibility and then look for it in a large DNA sequence, then there are much better ways of doing that.)

为什么要生产所有组合?你打算怎么处理他们? (如果你想要产生每一个字符串的可能性,然后在一个大的DNA序列中寻找它,那么有更好的方法来做到这一点。)

#1