如何匹配dna序列模式?

时间:2022-06-01 20:45:28

I am getting a trouble finding an approach to solve this problem.

我很难找到解决这个问题的方法。

Input-output sequences are as follows :

输入输出序列如下:

 **input1 :** aaagctgctagag 
 **output1 :** a3gct2ag2

 **input2 :** aaaaaaagctaagctaag 
 **output2 :** a6agcta2ag

Input nsequence can be of 10^6 characters and largest continuous patterns will be considered.

输入的n序列可以是10个6个字符,最大的连续模式将被考虑。

For example for input2 "agctaagcta" output will not be "agcta2gcta" but it will be "agcta2".

例如,input2“agctaagcta”输出不会是“agcta2gcta”,但它将是“agcta2”。

Any help appreciated.

任何帮助表示赞赏。

3 个解决方案

#1


10  

Explanation of the algorithm:

算法的解释:

  • Having a sequence S with symbols s(1), s(2),…, s(N).
  • 具有符号S(1)、S(2)、S (N)的序列S。
  • Let B(i) be the best compressed sequence with elements s(1), s(2),…,s(i).
  • 让B(i)是与元素s(1)、s(2)、s(i)的最佳压缩序列。
  • So, for example, B(3) will be the best compressed sequence for s(1), s(2), s(3).
  • 因此,例如,B(3)将是s(1)、s(2)、s(3)的最佳压缩序列。
  • What we want to know is B(N).
  • 我们想知道的是B(N)

To find it, we will proceed by induction. We want to calculate B(i+1), knowing B(i), B(i-1), B(i-2), …, B(1), B(0), where B(0) is empty sequence, and and B(1) = s(1). At the same time, this constitutes a proof that the solution is optimal. ;)

为了找到它,我们将采用归纳法。我们要计算B(i+1),知道B(i), B(i-1), B(i-2), B(1), B(0), B(0)为空序列,B(1) = s(1)。与此同时,这也证明了解决方案是最优的。,)

To calculate B(i+1), we will pick the best sequence among the candidates:

为了计算B(i+1),我们将选出候选的最佳序列:

  1. Candidate sequences where the last block has one element:

    最后一个块有一个元素的候选序列:

    B(i )s(i+1)1 B(i-1)s(i+1)2 ; only if s(i) = s(i+1) B(i-2)s(i+1)3 ; only if s(i-1) = s(i) and s(i) = s(i+1) … B(1)s(i+1)[i-1] ; only if s(2)=s(3) and s(3)=s(4) and … and s(i) = s(i+1) B(0)s(i+1)i = s(i+1)i ; only if s(1)=s(2) and s(2)=s(3) and … and s(i) = s(i+1)

    B(i)s(i + 1)1 B(张)(i + 1)2;只有s(i) = s(i+1) B(i-2)s(i+1)3;只有(张)=年代(我),(我)=年代(i + 1)…B(1)s(i + 1)(张);只有在(2)=年代(3)和s(3)=(4)和……和年代(i)= s(i + 1)B(0)(i + 1)i =年代(i + 1)我;只有(1)=年代(2)和s(2)=(3)和……(我)=年代(i + 1)

  2. Candidate sequences where the last block has 2 elements:

    最后一个块有两个元素的候选序列:

    B(i-1)s(i)s(i+1)1 B(i-3)s(i)s(i+1)2 ; only if s(i-2)s(i-1)=s(i)s(i+1) B(i-5)s(i)s(i+1)3 ; only if s(i-4)s(i-3)=s(i-2)s(i-1) and s(i-2)s(i-1)=s(i)s(i+1) …

    B(张)(i)年代(i + 1)1 B(我)(i)年代(i + 1)2;只有在(我2)年代(张)=(i)年代B(i + 1)(我)(i)年代(i + 1)3;只有在(我)年代(我)=(我2)年代(张)和(我2)年代(张)=(i)年代(i + 1)…

  3. Candidate sequences where the last block has 3 elements:

    最后一个块有3个元素的候选序列:

  4. Candidate sequences where the last block has 4 elements:

    最后一个块有4个元素的候选序列:

  5. Candidate sequences where last block has n+1 elements:

    最后一个区块有n+1个元素的候选序列:

    s(1)s(2)s(3)………s(i+1)

    (1)年代(2)(3).........年代(i + 1)

For each possibility, the algorithm stops when the sequence block is no longer repeated. And that’s it.

对于每一种可能性,当序列块不再重复时,算法会停止。就是这样。

The algorithm will be some thing like this in psude-c code:

在psude-c代码中,算法是这样的:

B(0) = “”
for (i=1; i<=N; i++) {
    // Calculate all the candidates for B(i)
    BestCandidate=null
    for (j=1; j<=i; j++) {
        Calculate all the candidates of length (i)

        r=1;
        do {
             Candidadte = B([i-j]*r-1) s(i-j+1)…s(i-1)s(i) r
             If (   (BestCandidate==null) 
                      || (Candidate is shorter that BestCandidate)) 
                 {
            BestCandidate=Candidate.
                 }
             r++;
        } while (  ([i-j]*r <= i) 
             &&(s(i-j*r+1) s(i-j*r+2)…s(i-j*r+j) == s(i-j+1) s(i-j+2)…s(i-j+j))

    }
    B(i)=BestCandidate
}

Hope that this can help a little more.

希望这能帮助更多的人。

The full C program performing the required task is given below. It runs in O(n^2). The central part is only 30 lines of code.

执行所需任务的完整C程序如下所示。它运行在O(n ^ 2)。中心部分只有30行代码。

EDIT I have restructured a little bit the code, changed the names of the variables and added some comment in order to be more readable.

编辑我已经修改了一点代码,更改了变量的名称,并添加了一些注释以使其可读性更好。

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>


// This struct represents a compressed segment like atg4, g3,  agc1
    struct Segment {
        char *elements;
        int nElements;
        int count;
    };
         // As an example, for the segment agagt3  elements would be:
         // {
         //      elements: "agagt",
         //      nElements: 5,
         //      count: 3
         // }

    struct Sequence {
        struct Segment lastSegment;
        struct Sequence *prev;      // Points to a sequence without the last segment or NULL if it is the first segment
        int totalLen;  // Total length of the compressed sequence.
    };
       // as an example, for the sequence agt32ta5, the representation will be:
       // {
       //     lastSegment:{"ta" , 2 , 5},
       //     prev: @A,
       //     totalLen: 8
       // }
       // and A will be
       // {
       //     lastSegment{ "agt", 3, 32},
       //     prev: NULL,
       //     totalLen: 5
       // }


// This function converts a sequence to a string.
// You have to free the string after using it.
// The strategy is to construct the string from right to left.

char *sequence2string(struct Sequence *S) {
    char *Res=malloc(S->totalLen + 1);
    char *digits="0123456789";

    int p= S->totalLen;
    Res[p]=0;

    while (S!=NULL) {
            // first we insert the count of the last element.
            // We do digit by digit starting with the units.
            int C = S->lastSegment.count;
            while (C) {
                p--;
                Res[p] = digits[ C % 10 ];
                C /= 10;
            }

            p -= S->lastSegment.nElements;
            strncpy(Res + p , S->lastSegment.elements, S->lastSegment.nElements);

            S = S ->prev;
    }


    return Res;
}


// Compresses a dna sequence.
// Returns a string with the in sequence compressed.
// The returned string must be freed after using it.
char *dnaCompress(char *in) {
    int i,j;

    int N = strlen(in);;            // Number of elements of a in sequence.



    // B is an array of N+1 sequences where B(i) is the best compressed sequence sequence of the first i characters.
    // What we want to return is B[N];
    struct Sequence *B;
    B = malloc((N+1) * sizeof (struct Sequence));

    // We first do an initialization for i=0

    B[0].lastSegment.elements="";
    B[0].lastSegment.nElements=0;
    B[0].lastSegment.count=0;
    B[0].prev = NULL;
    B[0].totalLen=0;

    // and set totalLen of all the sequences to a very HIGH VALUE in this case N*2 will be enougth,  We will try different sequences and keep the minimum one.
    for (i=1; i<=N; i++) B[i].totalLen = INT_MAX;   // A very high value

    for (i=1; i<=N; i++) {

        // at this point we want to calculate B[i] and we know B[i-1], B[i-2], .... ,B[0]
        for (j=1; j<=i; j++) {

            // Here we will check all the candidates where the last segment has j elements

            int r=1;                  // number of times the last segment is repeated
            int rNDigits=1;           // Number of digits of r
            int rNDigitsBound=10;     // We will increment r, so this value is when r will have an extra digit.
                                      // when r = 0,1,...,9 => rNDigitsBound = 10
                                      // when r = 10,11,...,99 => rNDigitsBound = 100
                                      // when r = 100,101,.,999 => rNDigitsBound = 1000 and so on.

            do {

                // Here we analitze a candidate B(i).
                // where the las segment has j elements repeated r times.

                int CandidateLen = B[i-j*r].totalLen + j + rNDigits;
                if (CandidateLen < B[i].totalLen) {
                    B[i].lastSegment.elements = in + i - j*r;
                    B[i].lastSegment.nElements = j;
                    B[i].lastSegment.count = r;
                    B[i].prev = &(B[i-j*r]);
                    B[i].totalLen = CandidateLen;
                }

                r++;
                if (r == rNDigitsBound ) {
                    rNDigits++;
                    rNDigitsBound *= 10;
                }
            } while (   (i - j*r >= 0)
                     && (strncmp(in + i -j, in + i - j*r, j)==0));

        }
    }

    char *Res=sequence2string(&(B[N]));
    free(B);

    return Res;
}

int main(int argc, char** argv) {
    char *compressedDNA=dnaCompress(argv[1]);
    puts(compressedDNA);
    free(compressedDNA);
    return 0;
}

#2


2  

Forget Ukonnen. Dynamic programming it is. With 3-dimensional table:

忘记Ukonnen。动态编程。与三维表:

  1. sequence position
  2. 序列的位置
  3. subsequence size
  4. 子序列的大小
  5. number of segments
  6. 段数

TERMINOLOGY: For example, having a = "aaagctgctagag", sequence position coordinate would run from 1 to 13. At sequence position 3 (letter 'g'), having subsequence size 4, the subsequence would be "gctg". Understood? And as for the number of segments, then expressing a as "aaagctgctagag1" consists of 1 segment (the sequence itself). Expressing it as "a3gct2ag2" consists of 3 segments. "aaagctgct1ag2" consists of 2 segments. "a2a1ctg2ag2" would consist of 4 segments. Understood? Now, with this, you start filling a 3-dimensional array 13 x 13 x 13, so your time and memory complexity seems to be around n ** 3 for this. Are you sure you can handle it for million-bp sequences? I think that greedy approach would be better, because large DNA sequences are unlikely to repeat exactly. And, I would suggest that you widen your assignment to approximate matches, and you can publish it straight in a journal.

术语:例如,有一个=“aaagctgctagag”,序列位置坐标从1到13。在序列位置3(字母g),子序列大小为4后,子序列将是“gctg”。理解吗?至于节段的数量,则表示a为“aaagctgctagag1”,由1段(序列本身)组成。将其表示为“a3gct2ag2”由三个部分组成。“aaagctgct1ag2”由两个部分组成。“a2a1ctg2ag2”将由4个部分组成。理解吗?现在,你开始填充一个三维的数组13 x 13 x 13,所以你的时间和内存复杂度似乎在n ** 3附近。你确定你能处理上百万个bp的序列吗?我认为贪婪的方法会更好,因为大的DNA序列不太可能准确地重复。而且,我建议你把你的作业扩大到近似匹配,你可以把它直接发表在期刊上。

Anyway, you will start filling the table of compressing a subsequence starting at some position (dimension 1) with length equal to dimension 2 coordinate, having at most dimension 3 segments. So you first fill the first row, representing compressions of subsequences of length 1 consisting of at most 1 segment:

无论如何,你将开始填充压缩子序列的表,从某个位置(第1维)开始,长度等于二维坐标,在大多数维度为3段。所以你先填第一行,表示长度为1的子序列的压缩数最多为1段:

a        a        a        g        c        t        g        c        t        a        g        a        g
1(a1)    1(a1)    1(a1)    1(g1)    1(c1)    1(t1)    1(g1)    1(c1)    1(t1)    1(a1)    1(g1)    1(a1)    1(g1)

The number is the character cost (always 1 for these trivial 1-char sequences; number 1 does not count into the character cost), and in the parenthesis, you have the compression (also trivial for this simple case). The second row will be still simple:

数字是字符成本(对于这些简单的1字符序列总是1);数字1不计入字符成本),在括号中,您有压缩(对于这个简单的例子来说也很简单)。第二行仍然很简单:

2(a2)    2(a2)    2(ag1)   2(gc1)   2(ct1)   2(tg1)   2(gc1)   2(ct1)   2(ta1)   2(ag1)   2(ga1)    2(ag1)

There is only 1 way to decompose a 2-character sequence into 2 subsequences -- 1 character + 1 character. If they are identical, the result is like a + a = a2. If they are different, such as a + g, then, because only 1-segment sequences are admissible, the result cannot be a1g1, but must be ag1. The third row will be finally more interesting:

只有一种方法可以将2个字符的序列分解成2个子序列——1个字符+ 1个字符。如果它们是相同的,结果就像a + a = a2。如果它们是不同的,比如a + g,那么,因为只有1段序列是可接受的,结果不能是a1g1,但必须是ag1。第三行最后会更有趣:

2(a3)    2(aag1)  3(agc1)  3(gct1)  3(ctg1)  3(tgc1)  3(gct1)  3(cta1)  3(tag1)  3(aga1)  3(gag1)

Here, you can always choose between 2 ways of composing the compressed string. For example, aag can be composed either as aa + g or a + ag. But again, we cannot have 2 segments, as in aa1g1 or a1ag1, so we must be satisfied with aag1, unless both components consist of the same character, as in aa + a => a3, with character cost 2. We can continue onto 4 th line:

在这里,您可以选择两种方法来组合压缩字符串。例如,aag可以写成aa + g或a + ag。但是,同样的,我们不能有两个段,如在aa1g1或a1ag1中,所以我们必须满足aag1,除非两个组件由相同的字符组成,如在aa + a => a3,字符成本为2。我们可以继续到第4行:

4(aaag1) 4(aagc1) 4(agct1) 4(gctg1) 4(ctgc1) 4(tgct1) 4(gcta1) 4(ctag1) 4(taga1) 3(ag2)

Here, on the first position, we cannot use a3g1, because only 1 segment is allowed at this layer. But at the last position, compression to character cost 3 is agchieved by ag1 + ag1 = ag2. This way, one can fill the whole first-level table all the way up to the single subsequence of 13 characters, and each subsequence will have its optimal character cost and its compression under the first-level constraint of at most 1 segment associated with it.

这里,在第一个位置,我们不能使用a3g1,因为这个层只允许1个段。但在最后一个位置,压缩到角色成本3是ag1 + ag1 = ag2。通过这种方式,可以将整个一级表全部填充到13个字符的单个子序列中,每个子序列都有其最优的字符代价和在与它关联的最多1个段的一级约束下的压缩。

Then you go to the 2nd level, where 2 segments are allowed... And again, from the bottom up, you identify the optimum cost and compression of each table coordinate under the given level's segment count constraint, by comparing all the possible ways to compose the subsequence using already computed positions, until you fill the table completely and thus compute the global optimum. There are some details to solve, but sorry, I'm not gonna code this for you.

然后你进入第二等级,允许2个部分…再一次,从下往上,你确定了在给定级别的段数约束下每个表坐标的最优成本和压缩,通过比较所有可能的方法来使用已经计算的位置来组合子序列,直到你完全填满表格,从而计算出全局最优。有一些细节需要解决,但抱歉,我不会为你编码。

#3


2  

After trying my own way for a while, my kudos to jbaylina for his beautiful algorithm and C implementation. Here's my attempted version of jbaylina's algorithm in Haskell, and below it further development of my attempt at a linear-time algorithm that attempts to compress segments that include repeated patterns in a one-by-one fashion:

在尝试了一段时间后,我的荣誉给了jbaylina他的算法和C实现。这是我在Haskell中尝试过的jbaylina算法,下面是我尝试的线性时间算法的进一步发展,它试图压缩包含重复模式的片段:

import Data.Map (fromList, insert, size, (!))

compress s = (foldl f (fromList [(0,([],0)),(1,([s!!0],1))]) [1..n - 1]) ! n  
 where
  n = length s
  f b i = insert (size b) bestCandidate b where
    add (sequence, sLength) (sequence', sLength') = 
      (sequence ++ sequence', sLength + sLength')
    j' = [1..min 100 i]
    bestCandidate = foldr combCandidates (b!i `add` ([s!!i,'1'],2)) j'
    combCandidates j candidate' = 
      let nextCandidate' = comb 2 (b!(i - j + 1) 
                       `add` ((take j . drop (i - j + 1) $ s) ++ "1", j + 1))
      in if snd nextCandidate' <= snd candidate' 
            then nextCandidate' 
            else candidate' where
        comb r candidate
          | r > uBound                         = candidate
          | not (strcmp r True)                = candidate
          | snd nextCandidate <= snd candidate = comb (r + 1) nextCandidate
          | otherwise                          = comb (r + 1) candidate
         where 
           uBound = div (i + 1) j
           prev = b!(i - r * j + 1)
           nextCandidate = prev `add` 
             ((take j . drop (i - j + 1) $ s) ++ show r, j + length (show r))
           strcmp 1   _    = True
           strcmp num bool 
             | (take j . drop (i - num * j + 1) $ s) 
                == (take j . drop (i - (num - 1) * j + 1) $ s) = 
                  strcmp (num - 1) True
             | otherwise = False

Output:

输出:

*Main> compress "aaagctgctagag"
("a3gct2ag2",9)

*Main> compress "aaabbbaaabbbaaabbbaaabbb"
("aaabbb4",7)


Linear-time attempt:

线性时间的尝试:

import Data.List (sortBy)

group' xxs sAccum (chr, count)
  | null xxs = if null chr 
                  then singles
                  else if count <= 2 
                          then reverse sAccum ++ multiples ++ "1"
                  else singles ++ if null chr then [] else chr ++ show count
  | [x] == chr = group' xs sAccum (chr,count + 1)
  | otherwise = if null chr 
                   then group' xs (sAccum) ([x],1) 
                   else if count <= 2 
                           then group' xs (multiples ++ sAccum) ([x],1)
                   else singles 
                        ++ chr ++ show count ++ group' xs [] ([x],1)
 where x:xs = xxs
       singles = reverse sAccum ++ (if null sAccum then [] else "1")
       multiples = concat (replicate count chr)

sequences ws strIndex maxSeqLen = repeated' where
  half = if null . drop (2 * maxSeqLen - 1) $ ws 
            then div (length ws) 2 else maxSeqLen
  repeated' = let (sequence,(sequenceStart, sequenceEnd'),notSinglesFlag) = repeated
              in (sequence,(sequenceStart, sequenceEnd'))
  repeated = foldr divide ([],(strIndex,strIndex),False) [1..half]
  equalChunksOf t a = takeWhile(==t) . map (take a) . iterate (drop a)
  divide chunkSize b@(sequence,(sequenceStart, sequenceEnd'),notSinglesFlag) = 
    let t = take (2*chunkSize) ws
        t' = take chunkSize t
    in if t' == drop chunkSize t
          then let ts = equalChunksOf t' chunkSize ws
                   lenTs = length ts
                   sequenceEnd = strIndex + lenTs * chunkSize
                   newEnd = if sequenceEnd > sequenceEnd' 
                            then sequenceEnd else sequenceEnd'
               in if chunkSize > 1 
                     then if length (group' (concat (replicate lenTs t')) [] ([],0)) > length (t' ++ show lenTs)
                             then (((strIndex,sequenceEnd,chunkSize,lenTs),t'):sequence, (sequenceStart,newEnd),True)
                             else b
                     else if notSinglesFlag
                             then b
                             else (((strIndex,sequenceEnd,chunkSize,lenTs),t'):sequence, (sequenceStart,newEnd),False)
          else b

addOne a b
  | null (fst b) = a
  | null (fst a) = b
  | otherwise = 
      let (((start,end,patLen,lenS),sequence):rest,(sStart,sEnd)) = a 
          (((start',end',patLen',lenS'),sequence'):rest',(sStart',sEnd')) = b
      in if sStart' < sEnd && sEnd < sEnd'
            then let c = ((start,end,patLen,lenS),sequence):rest
                     d = ((start',end',patLen',lenS'),sequence'):rest'
                 in (c ++ d, (sStart, sEnd'))
            else a

segment xs baseIndex maxSeqLen = segment' xs baseIndex baseIndex where
  segment' zzs@(z:zs) strIndex farthest
    | null zs                              = initial
    | strIndex >= farthest && strIndex > 0 = ([],(0,0))
    | otherwise                            = addOne initial next
   where
     next@(s',(start',end')) = segment' zs (strIndex + 1) farthest'
     farthest' | null s = farthest
               | otherwise = if start /= end && end > farthest then end else farthest
     initial@(s,(start,end)) = sequences zzs strIndex maxSeqLen

areExclusive ((a,b,_,_),_) ((a',b',_,_),_) = (a' >= b) || (b' <= a)

combs []     r = [r]
combs (x:xs) r
  | null r    = combs xs (x:r) ++ if null xs then [] else combs xs r
  | otherwise = if areExclusive (head r) x
                   then combs xs (x:r) ++ combs xs r
                        else if l' > lowerBound
                                then combs xs (x: reduced : drop 1 r) ++ combs xs r
                                else combs xs r
 where lowerBound = l + 2 * patLen
       ((l,u,patLen,lenS),s) = head r
       ((l',u',patLen',lenS'),s') = x
       reduce = takeWhile (>=l') . iterate (\x -> x - patLen) $ u
       lenReduced = length reduce
       reduced = ((l,u - lenReduced * patLen,patLen,lenS - lenReduced),s)

buildString origStr sequences = buildString' origStr sequences 0 (0,"",0)
   where
    buildString' origStr sequences index accum@(lenC,cStr,lenOrig)
      | null sequences = accum
      | l /= index     = 
          buildString' (drop l' origStr) sequences l (lenC + l' + 1, cStr ++ take l' origStr ++ "1", lenOrig + l')
      | otherwise      = 
          buildString' (drop u' origStr) rest u (lenC + length s', cStr ++ s', lenOrig + u')
     where
       l' = l - index
       u' = u - l  
       s' = s ++ show lenS       
       (((l,u,patLen,lenS),s):rest) = sequences

compress []         _         accum = reverse accum ++ (if null accum then [] else "1")
compress zzs@(z:zs) maxSeqLen accum
  | null (fst segment')                      = compress zs maxSeqLen (z:accum)
  | (start,end) == (0,2) && not (null accum) = compress zs maxSeqLen (z:accum)
  | otherwise                                =
      reverse accum ++ (if null accum || takeWhile' compressedStr 0 /= 0 then [] else "1")
      ++ compressedStr
      ++ compress (drop lengthOriginal zzs) maxSeqLen []
 where segment'@(s,(start,end)) = segment zzs 0 maxSeqLen
       combinations = combs (fst $ segment') []
       takeWhile' xxs count
         | null xxs                                             = 0
         | x == '1' && null (reads (take 1 xs)::[(Int,String)]) = count 
         | not (null (reads [x]::[(Int,String)]))               = 0
         | otherwise                                            = takeWhile' xs (count + 1) 
        where x:xs = xxs
       f (lenC,cStr,lenOrig) (lenC',cStr',lenOrig') = 
         let g = compare ((fromIntegral lenC + if not (null accum) && takeWhile' cStr 0 == 0 then 1 else 0) / fromIntegral lenOrig) 
                         ((fromIntegral lenC' + if not (null accum) && takeWhile' cStr' 0 == 0 then 1 else 0) / fromIntegral lenOrig')
         in if g == EQ 
               then compare (takeWhile' cStr' 0) (takeWhile' cStr 0)
               else g
       (lenCompressed,compressedStr,lengthOriginal) = 
         head $ sortBy f (map (buildString (take end zzs)) (map reverse combinations))

Output:

输出:

*Main> compress "aaaaaaaaabbbbbbbbbaaaaaaaaabbbbbbbbb" 100 []
"a9b9a9b9"

*Main> compress "aaabbbaaabbbaaabbbaaabbb" 100 []
"aaabbb4"

#1


10  

Explanation of the algorithm:

算法的解释:

  • Having a sequence S with symbols s(1), s(2),…, s(N).
  • 具有符号S(1)、S(2)、S (N)的序列S。
  • Let B(i) be the best compressed sequence with elements s(1), s(2),…,s(i).
  • 让B(i)是与元素s(1)、s(2)、s(i)的最佳压缩序列。
  • So, for example, B(3) will be the best compressed sequence for s(1), s(2), s(3).
  • 因此,例如,B(3)将是s(1)、s(2)、s(3)的最佳压缩序列。
  • What we want to know is B(N).
  • 我们想知道的是B(N)

To find it, we will proceed by induction. We want to calculate B(i+1), knowing B(i), B(i-1), B(i-2), …, B(1), B(0), where B(0) is empty sequence, and and B(1) = s(1). At the same time, this constitutes a proof that the solution is optimal. ;)

为了找到它,我们将采用归纳法。我们要计算B(i+1),知道B(i), B(i-1), B(i-2), B(1), B(0), B(0)为空序列,B(1) = s(1)。与此同时,这也证明了解决方案是最优的。,)

To calculate B(i+1), we will pick the best sequence among the candidates:

为了计算B(i+1),我们将选出候选的最佳序列:

  1. Candidate sequences where the last block has one element:

    最后一个块有一个元素的候选序列:

    B(i )s(i+1)1 B(i-1)s(i+1)2 ; only if s(i) = s(i+1) B(i-2)s(i+1)3 ; only if s(i-1) = s(i) and s(i) = s(i+1) … B(1)s(i+1)[i-1] ; only if s(2)=s(3) and s(3)=s(4) and … and s(i) = s(i+1) B(0)s(i+1)i = s(i+1)i ; only if s(1)=s(2) and s(2)=s(3) and … and s(i) = s(i+1)

    B(i)s(i + 1)1 B(张)(i + 1)2;只有s(i) = s(i+1) B(i-2)s(i+1)3;只有(张)=年代(我),(我)=年代(i + 1)…B(1)s(i + 1)(张);只有在(2)=年代(3)和s(3)=(4)和……和年代(i)= s(i + 1)B(0)(i + 1)i =年代(i + 1)我;只有(1)=年代(2)和s(2)=(3)和……(我)=年代(i + 1)

  2. Candidate sequences where the last block has 2 elements:

    最后一个块有两个元素的候选序列:

    B(i-1)s(i)s(i+1)1 B(i-3)s(i)s(i+1)2 ; only if s(i-2)s(i-1)=s(i)s(i+1) B(i-5)s(i)s(i+1)3 ; only if s(i-4)s(i-3)=s(i-2)s(i-1) and s(i-2)s(i-1)=s(i)s(i+1) …

    B(张)(i)年代(i + 1)1 B(我)(i)年代(i + 1)2;只有在(我2)年代(张)=(i)年代B(i + 1)(我)(i)年代(i + 1)3;只有在(我)年代(我)=(我2)年代(张)和(我2)年代(张)=(i)年代(i + 1)…

  3. Candidate sequences where the last block has 3 elements:

    最后一个块有3个元素的候选序列:

  4. Candidate sequences where the last block has 4 elements:

    最后一个块有4个元素的候选序列:

  5. Candidate sequences where last block has n+1 elements:

    最后一个区块有n+1个元素的候选序列:

    s(1)s(2)s(3)………s(i+1)

    (1)年代(2)(3).........年代(i + 1)

For each possibility, the algorithm stops when the sequence block is no longer repeated. And that’s it.

对于每一种可能性,当序列块不再重复时,算法会停止。就是这样。

The algorithm will be some thing like this in psude-c code:

在psude-c代码中,算法是这样的:

B(0) = “”
for (i=1; i<=N; i++) {
    // Calculate all the candidates for B(i)
    BestCandidate=null
    for (j=1; j<=i; j++) {
        Calculate all the candidates of length (i)

        r=1;
        do {
             Candidadte = B([i-j]*r-1) s(i-j+1)…s(i-1)s(i) r
             If (   (BestCandidate==null) 
                      || (Candidate is shorter that BestCandidate)) 
                 {
            BestCandidate=Candidate.
                 }
             r++;
        } while (  ([i-j]*r <= i) 
             &&(s(i-j*r+1) s(i-j*r+2)…s(i-j*r+j) == s(i-j+1) s(i-j+2)…s(i-j+j))

    }
    B(i)=BestCandidate
}

Hope that this can help a little more.

希望这能帮助更多的人。

The full C program performing the required task is given below. It runs in O(n^2). The central part is only 30 lines of code.

执行所需任务的完整C程序如下所示。它运行在O(n ^ 2)。中心部分只有30行代码。

EDIT I have restructured a little bit the code, changed the names of the variables and added some comment in order to be more readable.

编辑我已经修改了一点代码,更改了变量的名称,并添加了一些注释以使其可读性更好。

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>


// This struct represents a compressed segment like atg4, g3,  agc1
    struct Segment {
        char *elements;
        int nElements;
        int count;
    };
         // As an example, for the segment agagt3  elements would be:
         // {
         //      elements: "agagt",
         //      nElements: 5,
         //      count: 3
         // }

    struct Sequence {
        struct Segment lastSegment;
        struct Sequence *prev;      // Points to a sequence without the last segment or NULL if it is the first segment
        int totalLen;  // Total length of the compressed sequence.
    };
       // as an example, for the sequence agt32ta5, the representation will be:
       // {
       //     lastSegment:{"ta" , 2 , 5},
       //     prev: @A,
       //     totalLen: 8
       // }
       // and A will be
       // {
       //     lastSegment{ "agt", 3, 32},
       //     prev: NULL,
       //     totalLen: 5
       // }


// This function converts a sequence to a string.
// You have to free the string after using it.
// The strategy is to construct the string from right to left.

char *sequence2string(struct Sequence *S) {
    char *Res=malloc(S->totalLen + 1);
    char *digits="0123456789";

    int p= S->totalLen;
    Res[p]=0;

    while (S!=NULL) {
            // first we insert the count of the last element.
            // We do digit by digit starting with the units.
            int C = S->lastSegment.count;
            while (C) {
                p--;
                Res[p] = digits[ C % 10 ];
                C /= 10;
            }

            p -= S->lastSegment.nElements;
            strncpy(Res + p , S->lastSegment.elements, S->lastSegment.nElements);

            S = S ->prev;
    }


    return Res;
}


// Compresses a dna sequence.
// Returns a string with the in sequence compressed.
// The returned string must be freed after using it.
char *dnaCompress(char *in) {
    int i,j;

    int N = strlen(in);;            // Number of elements of a in sequence.



    // B is an array of N+1 sequences where B(i) is the best compressed sequence sequence of the first i characters.
    // What we want to return is B[N];
    struct Sequence *B;
    B = malloc((N+1) * sizeof (struct Sequence));

    // We first do an initialization for i=0

    B[0].lastSegment.elements="";
    B[0].lastSegment.nElements=0;
    B[0].lastSegment.count=0;
    B[0].prev = NULL;
    B[0].totalLen=0;

    // and set totalLen of all the sequences to a very HIGH VALUE in this case N*2 will be enougth,  We will try different sequences and keep the minimum one.
    for (i=1; i<=N; i++) B[i].totalLen = INT_MAX;   // A very high value

    for (i=1; i<=N; i++) {

        // at this point we want to calculate B[i] and we know B[i-1], B[i-2], .... ,B[0]
        for (j=1; j<=i; j++) {

            // Here we will check all the candidates where the last segment has j elements

            int r=1;                  // number of times the last segment is repeated
            int rNDigits=1;           // Number of digits of r
            int rNDigitsBound=10;     // We will increment r, so this value is when r will have an extra digit.
                                      // when r = 0,1,...,9 => rNDigitsBound = 10
                                      // when r = 10,11,...,99 => rNDigitsBound = 100
                                      // when r = 100,101,.,999 => rNDigitsBound = 1000 and so on.

            do {

                // Here we analitze a candidate B(i).
                // where the las segment has j elements repeated r times.

                int CandidateLen = B[i-j*r].totalLen + j + rNDigits;
                if (CandidateLen < B[i].totalLen) {
                    B[i].lastSegment.elements = in + i - j*r;
                    B[i].lastSegment.nElements = j;
                    B[i].lastSegment.count = r;
                    B[i].prev = &(B[i-j*r]);
                    B[i].totalLen = CandidateLen;
                }

                r++;
                if (r == rNDigitsBound ) {
                    rNDigits++;
                    rNDigitsBound *= 10;
                }
            } while (   (i - j*r >= 0)
                     && (strncmp(in + i -j, in + i - j*r, j)==0));

        }
    }

    char *Res=sequence2string(&(B[N]));
    free(B);

    return Res;
}

int main(int argc, char** argv) {
    char *compressedDNA=dnaCompress(argv[1]);
    puts(compressedDNA);
    free(compressedDNA);
    return 0;
}

#2


2  

Forget Ukonnen. Dynamic programming it is. With 3-dimensional table:

忘记Ukonnen。动态编程。与三维表:

  1. sequence position
  2. 序列的位置
  3. subsequence size
  4. 子序列的大小
  5. number of segments
  6. 段数

TERMINOLOGY: For example, having a = "aaagctgctagag", sequence position coordinate would run from 1 to 13. At sequence position 3 (letter 'g'), having subsequence size 4, the subsequence would be "gctg". Understood? And as for the number of segments, then expressing a as "aaagctgctagag1" consists of 1 segment (the sequence itself). Expressing it as "a3gct2ag2" consists of 3 segments. "aaagctgct1ag2" consists of 2 segments. "a2a1ctg2ag2" would consist of 4 segments. Understood? Now, with this, you start filling a 3-dimensional array 13 x 13 x 13, so your time and memory complexity seems to be around n ** 3 for this. Are you sure you can handle it for million-bp sequences? I think that greedy approach would be better, because large DNA sequences are unlikely to repeat exactly. And, I would suggest that you widen your assignment to approximate matches, and you can publish it straight in a journal.

术语:例如,有一个=“aaagctgctagag”,序列位置坐标从1到13。在序列位置3(字母g),子序列大小为4后,子序列将是“gctg”。理解吗?至于节段的数量,则表示a为“aaagctgctagag1”,由1段(序列本身)组成。将其表示为“a3gct2ag2”由三个部分组成。“aaagctgct1ag2”由两个部分组成。“a2a1ctg2ag2”将由4个部分组成。理解吗?现在,你开始填充一个三维的数组13 x 13 x 13,所以你的时间和内存复杂度似乎在n ** 3附近。你确定你能处理上百万个bp的序列吗?我认为贪婪的方法会更好,因为大的DNA序列不太可能准确地重复。而且,我建议你把你的作业扩大到近似匹配,你可以把它直接发表在期刊上。

Anyway, you will start filling the table of compressing a subsequence starting at some position (dimension 1) with length equal to dimension 2 coordinate, having at most dimension 3 segments. So you first fill the first row, representing compressions of subsequences of length 1 consisting of at most 1 segment:

无论如何,你将开始填充压缩子序列的表,从某个位置(第1维)开始,长度等于二维坐标,在大多数维度为3段。所以你先填第一行,表示长度为1的子序列的压缩数最多为1段:

a        a        a        g        c        t        g        c        t        a        g        a        g
1(a1)    1(a1)    1(a1)    1(g1)    1(c1)    1(t1)    1(g1)    1(c1)    1(t1)    1(a1)    1(g1)    1(a1)    1(g1)

The number is the character cost (always 1 for these trivial 1-char sequences; number 1 does not count into the character cost), and in the parenthesis, you have the compression (also trivial for this simple case). The second row will be still simple:

数字是字符成本(对于这些简单的1字符序列总是1);数字1不计入字符成本),在括号中,您有压缩(对于这个简单的例子来说也很简单)。第二行仍然很简单:

2(a2)    2(a2)    2(ag1)   2(gc1)   2(ct1)   2(tg1)   2(gc1)   2(ct1)   2(ta1)   2(ag1)   2(ga1)    2(ag1)

There is only 1 way to decompose a 2-character sequence into 2 subsequences -- 1 character + 1 character. If they are identical, the result is like a + a = a2. If they are different, such as a + g, then, because only 1-segment sequences are admissible, the result cannot be a1g1, but must be ag1. The third row will be finally more interesting:

只有一种方法可以将2个字符的序列分解成2个子序列——1个字符+ 1个字符。如果它们是相同的,结果就像a + a = a2。如果它们是不同的,比如a + g,那么,因为只有1段序列是可接受的,结果不能是a1g1,但必须是ag1。第三行最后会更有趣:

2(a3)    2(aag1)  3(agc1)  3(gct1)  3(ctg1)  3(tgc1)  3(gct1)  3(cta1)  3(tag1)  3(aga1)  3(gag1)

Here, you can always choose between 2 ways of composing the compressed string. For example, aag can be composed either as aa + g or a + ag. But again, we cannot have 2 segments, as in aa1g1 or a1ag1, so we must be satisfied with aag1, unless both components consist of the same character, as in aa + a => a3, with character cost 2. We can continue onto 4 th line:

在这里,您可以选择两种方法来组合压缩字符串。例如,aag可以写成aa + g或a + ag。但是,同样的,我们不能有两个段,如在aa1g1或a1ag1中,所以我们必须满足aag1,除非两个组件由相同的字符组成,如在aa + a => a3,字符成本为2。我们可以继续到第4行:

4(aaag1) 4(aagc1) 4(agct1) 4(gctg1) 4(ctgc1) 4(tgct1) 4(gcta1) 4(ctag1) 4(taga1) 3(ag2)

Here, on the first position, we cannot use a3g1, because only 1 segment is allowed at this layer. But at the last position, compression to character cost 3 is agchieved by ag1 + ag1 = ag2. This way, one can fill the whole first-level table all the way up to the single subsequence of 13 characters, and each subsequence will have its optimal character cost and its compression under the first-level constraint of at most 1 segment associated with it.

这里,在第一个位置,我们不能使用a3g1,因为这个层只允许1个段。但在最后一个位置,压缩到角色成本3是ag1 + ag1 = ag2。通过这种方式,可以将整个一级表全部填充到13个字符的单个子序列中,每个子序列都有其最优的字符代价和在与它关联的最多1个段的一级约束下的压缩。

Then you go to the 2nd level, where 2 segments are allowed... And again, from the bottom up, you identify the optimum cost and compression of each table coordinate under the given level's segment count constraint, by comparing all the possible ways to compose the subsequence using already computed positions, until you fill the table completely and thus compute the global optimum. There are some details to solve, but sorry, I'm not gonna code this for you.

然后你进入第二等级,允许2个部分…再一次,从下往上,你确定了在给定级别的段数约束下每个表坐标的最优成本和压缩,通过比较所有可能的方法来使用已经计算的位置来组合子序列,直到你完全填满表格,从而计算出全局最优。有一些细节需要解决,但抱歉,我不会为你编码。

#3


2  

After trying my own way for a while, my kudos to jbaylina for his beautiful algorithm and C implementation. Here's my attempted version of jbaylina's algorithm in Haskell, and below it further development of my attempt at a linear-time algorithm that attempts to compress segments that include repeated patterns in a one-by-one fashion:

在尝试了一段时间后,我的荣誉给了jbaylina他的算法和C实现。这是我在Haskell中尝试过的jbaylina算法,下面是我尝试的线性时间算法的进一步发展,它试图压缩包含重复模式的片段:

import Data.Map (fromList, insert, size, (!))

compress s = (foldl f (fromList [(0,([],0)),(1,([s!!0],1))]) [1..n - 1]) ! n  
 where
  n = length s
  f b i = insert (size b) bestCandidate b where
    add (sequence, sLength) (sequence', sLength') = 
      (sequence ++ sequence', sLength + sLength')
    j' = [1..min 100 i]
    bestCandidate = foldr combCandidates (b!i `add` ([s!!i,'1'],2)) j'
    combCandidates j candidate' = 
      let nextCandidate' = comb 2 (b!(i - j + 1) 
                       `add` ((take j . drop (i - j + 1) $ s) ++ "1", j + 1))
      in if snd nextCandidate' <= snd candidate' 
            then nextCandidate' 
            else candidate' where
        comb r candidate
          | r > uBound                         = candidate
          | not (strcmp r True)                = candidate
          | snd nextCandidate <= snd candidate = comb (r + 1) nextCandidate
          | otherwise                          = comb (r + 1) candidate
         where 
           uBound = div (i + 1) j
           prev = b!(i - r * j + 1)
           nextCandidate = prev `add` 
             ((take j . drop (i - j + 1) $ s) ++ show r, j + length (show r))
           strcmp 1   _    = True
           strcmp num bool 
             | (take j . drop (i - num * j + 1) $ s) 
                == (take j . drop (i - (num - 1) * j + 1) $ s) = 
                  strcmp (num - 1) True
             | otherwise = False

Output:

输出:

*Main> compress "aaagctgctagag"
("a3gct2ag2",9)

*Main> compress "aaabbbaaabbbaaabbbaaabbb"
("aaabbb4",7)


Linear-time attempt:

线性时间的尝试:

import Data.List (sortBy)

group' xxs sAccum (chr, count)
  | null xxs = if null chr 
                  then singles
                  else if count <= 2 
                          then reverse sAccum ++ multiples ++ "1"
                  else singles ++ if null chr then [] else chr ++ show count
  | [x] == chr = group' xs sAccum (chr,count + 1)
  | otherwise = if null chr 
                   then group' xs (sAccum) ([x],1) 
                   else if count <= 2 
                           then group' xs (multiples ++ sAccum) ([x],1)
                   else singles 
                        ++ chr ++ show count ++ group' xs [] ([x],1)
 where x:xs = xxs
       singles = reverse sAccum ++ (if null sAccum then [] else "1")
       multiples = concat (replicate count chr)

sequences ws strIndex maxSeqLen = repeated' where
  half = if null . drop (2 * maxSeqLen - 1) $ ws 
            then div (length ws) 2 else maxSeqLen
  repeated' = let (sequence,(sequenceStart, sequenceEnd'),notSinglesFlag) = repeated
              in (sequence,(sequenceStart, sequenceEnd'))
  repeated = foldr divide ([],(strIndex,strIndex),False) [1..half]
  equalChunksOf t a = takeWhile(==t) . map (take a) . iterate (drop a)
  divide chunkSize b@(sequence,(sequenceStart, sequenceEnd'),notSinglesFlag) = 
    let t = take (2*chunkSize) ws
        t' = take chunkSize t
    in if t' == drop chunkSize t
          then let ts = equalChunksOf t' chunkSize ws
                   lenTs = length ts
                   sequenceEnd = strIndex + lenTs * chunkSize
                   newEnd = if sequenceEnd > sequenceEnd' 
                            then sequenceEnd else sequenceEnd'
               in if chunkSize > 1 
                     then if length (group' (concat (replicate lenTs t')) [] ([],0)) > length (t' ++ show lenTs)
                             then (((strIndex,sequenceEnd,chunkSize,lenTs),t'):sequence, (sequenceStart,newEnd),True)
                             else b
                     else if notSinglesFlag
                             then b
                             else (((strIndex,sequenceEnd,chunkSize,lenTs),t'):sequence, (sequenceStart,newEnd),False)
          else b

addOne a b
  | null (fst b) = a
  | null (fst a) = b
  | otherwise = 
      let (((start,end,patLen,lenS),sequence):rest,(sStart,sEnd)) = a 
          (((start',end',patLen',lenS'),sequence'):rest',(sStart',sEnd')) = b
      in if sStart' < sEnd && sEnd < sEnd'
            then let c = ((start,end,patLen,lenS),sequence):rest
                     d = ((start',end',patLen',lenS'),sequence'):rest'
                 in (c ++ d, (sStart, sEnd'))
            else a

segment xs baseIndex maxSeqLen = segment' xs baseIndex baseIndex where
  segment' zzs@(z:zs) strIndex farthest
    | null zs                              = initial
    | strIndex >= farthest && strIndex > 0 = ([],(0,0))
    | otherwise                            = addOne initial next
   where
     next@(s',(start',end')) = segment' zs (strIndex + 1) farthest'
     farthest' | null s = farthest
               | otherwise = if start /= end && end > farthest then end else farthest
     initial@(s,(start,end)) = sequences zzs strIndex maxSeqLen

areExclusive ((a,b,_,_),_) ((a',b',_,_),_) = (a' >= b) || (b' <= a)

combs []     r = [r]
combs (x:xs) r
  | null r    = combs xs (x:r) ++ if null xs then [] else combs xs r
  | otherwise = if areExclusive (head r) x
                   then combs xs (x:r) ++ combs xs r
                        else if l' > lowerBound
                                then combs xs (x: reduced : drop 1 r) ++ combs xs r
                                else combs xs r
 where lowerBound = l + 2 * patLen
       ((l,u,patLen,lenS),s) = head r
       ((l',u',patLen',lenS'),s') = x
       reduce = takeWhile (>=l') . iterate (\x -> x - patLen) $ u
       lenReduced = length reduce
       reduced = ((l,u - lenReduced * patLen,patLen,lenS - lenReduced),s)

buildString origStr sequences = buildString' origStr sequences 0 (0,"",0)
   where
    buildString' origStr sequences index accum@(lenC,cStr,lenOrig)
      | null sequences = accum
      | l /= index     = 
          buildString' (drop l' origStr) sequences l (lenC + l' + 1, cStr ++ take l' origStr ++ "1", lenOrig + l')
      | otherwise      = 
          buildString' (drop u' origStr) rest u (lenC + length s', cStr ++ s', lenOrig + u')
     where
       l' = l - index
       u' = u - l  
       s' = s ++ show lenS       
       (((l,u,patLen,lenS),s):rest) = sequences

compress []         _         accum = reverse accum ++ (if null accum then [] else "1")
compress zzs@(z:zs) maxSeqLen accum
  | null (fst segment')                      = compress zs maxSeqLen (z:accum)
  | (start,end) == (0,2) && not (null accum) = compress zs maxSeqLen (z:accum)
  | otherwise                                =
      reverse accum ++ (if null accum || takeWhile' compressedStr 0 /= 0 then [] else "1")
      ++ compressedStr
      ++ compress (drop lengthOriginal zzs) maxSeqLen []
 where segment'@(s,(start,end)) = segment zzs 0 maxSeqLen
       combinations = combs (fst $ segment') []
       takeWhile' xxs count
         | null xxs                                             = 0
         | x == '1' && null (reads (take 1 xs)::[(Int,String)]) = count 
         | not (null (reads [x]::[(Int,String)]))               = 0
         | otherwise                                            = takeWhile' xs (count + 1) 
        where x:xs = xxs
       f (lenC,cStr,lenOrig) (lenC',cStr',lenOrig') = 
         let g = compare ((fromIntegral lenC + if not (null accum) && takeWhile' cStr 0 == 0 then 1 else 0) / fromIntegral lenOrig) 
                         ((fromIntegral lenC' + if not (null accum) && takeWhile' cStr' 0 == 0 then 1 else 0) / fromIntegral lenOrig')
         in if g == EQ 
               then compare (takeWhile' cStr' 0) (takeWhile' cStr 0)
               else g
       (lenCompressed,compressedStr,lengthOriginal) = 
         head $ sortBy f (map (buildString (take end zzs)) (map reverse combinations))

Output:

输出:

*Main> compress "aaaaaaaaabbbbbbbbbaaaaaaaaabbbbbbbbb" 100 []
"a9b9a9b9"

*Main> compress "aaabbbaaabbbaaabbbaaabbb" 100 []
"aaabbb4"