Smith Waterman算法如何将较短的序列与较长的序列匹配

时间:2021-03-18 21:48:42

This is a pseudo code i wrote for the original smith waterman algorithm.

这是我为原始史密斯水手算法编写的伪代码。

Input: U[1, n], V[1, m]
Set W[0, j] = 0 for j = 0 .. m
Set W[i, 0] = 0 for i = 0 .. n
For i = 1 to n
    For j = 1 to m
        W[i, j] = max {
                       0, 
                       W[i, j-1] - d, 
                       W[i-1, j-1] + s(U[i], V[j]), 
                       W[i-1, j] - d
                       }

Now I need to fit a shorter sequence U (1-n) into a longer sequence V (1-m), How can I modify the code?

现在我需要将更短的序列U(1-n)拟合成更长的序列V(1-m),如何修改代码?

1 个解决方案

#1


1  

Eliminating the penalty for gaps in the DNA sequence while keeping it for gaps in the mRNA sequence should be pretty straightforward. Assuming that U is the DNA and V is mRNA (otherwise, this will be the other way around), you just remove the gap penalty term (-d) from the value for the scenario where you are inserting a gap in i (W[i, j-1]), as so:

消除DNA序列中缺口的惩罚,同时保持mRNA序列中的缺口应该非常简单。假设U是DNA而V是mRNA(否则,这将是另一种方式),你只需从你在i中插入间隙的情况的值中删除空位罚分项(-d)(W [ i,j-1]),如下:

W[i, j] = max {
               0, 
               W[i, j-1], 
               W[i-1, j-1] + s(U[i], V[j]), 
               W[i-1, j] - d
               }

However, this will probably not be a robust solution to your problem. You still want to favor alignments without gaps in the DNA over alignments with them. A simple solution would be to use a different, smaller gap penalty for gaps in the DNA. However, in light of the fact that you want to account for one sequence having introns and the other not having them, you would probably be best served by an affine gap penalty.

但是,这可能不是您问题的可靠解决方案。您仍然希望在与DNA对齐的情况下,在DNA中没有间隙的情况下进行比对。一个简单的解决方案是对DNA中的缺口使用不同的,较小的间隙罚分。然而,鉴于您想要考虑具有内含子的一个序列而另一个没有内含子的序列,您可能最好通过仿射空位罚分来服务。

An affine gap penalty is one in which there is a greater penalty for starting a gap than for continuing one. So it will favor creating extended intron-like gaps, making it much more likely to give you a biologically-plausible alignment (even in more standard biological sequence alignment problems, affine gap penalties tend to be more successful). Unfortunately, implementing an affine gap penalty is a bit more complicated. You end up needing to keep track of three score/pointer matrices. See this question or these slides for pseudo-code and more information.

仿射差距罚分是指开始差距而不是继续罚分的惩罚。因此,它有利于创建扩展的内含子样间隙,使其更有可能为您提供生物学上合理的比对(即使在更标准的生物序列比对问题中,仿射间隙惩罚往往更成功)。不幸的是,实现仿射差距惩罚有点复杂。您最终需要跟踪三个得分/指针矩阵。有关伪代码和更多信息,请参阅此问题或这些幻灯片。

#1


1  

Eliminating the penalty for gaps in the DNA sequence while keeping it for gaps in the mRNA sequence should be pretty straightforward. Assuming that U is the DNA and V is mRNA (otherwise, this will be the other way around), you just remove the gap penalty term (-d) from the value for the scenario where you are inserting a gap in i (W[i, j-1]), as so:

消除DNA序列中缺口的惩罚,同时保持mRNA序列中的缺口应该非常简单。假设U是DNA而V是mRNA(否则,这将是另一种方式),你只需从你在i中插入间隙的情况的值中删除空位罚分项(-d)(W [ i,j-1]),如下:

W[i, j] = max {
               0, 
               W[i, j-1], 
               W[i-1, j-1] + s(U[i], V[j]), 
               W[i-1, j] - d
               }

However, this will probably not be a robust solution to your problem. You still want to favor alignments without gaps in the DNA over alignments with them. A simple solution would be to use a different, smaller gap penalty for gaps in the DNA. However, in light of the fact that you want to account for one sequence having introns and the other not having them, you would probably be best served by an affine gap penalty.

但是,这可能不是您问题的可靠解决方案。您仍然希望在与DNA对齐的情况下,在DNA中没有间隙的情况下进行比对。一个简单的解决方案是对DNA中的缺口使用不同的,较小的间隙罚分。然而,鉴于您想要考虑具有内含子的一个序列而另一个没有内含子的序列,您可能最好通过仿射空位罚分来服务。

An affine gap penalty is one in which there is a greater penalty for starting a gap than for continuing one. So it will favor creating extended intron-like gaps, making it much more likely to give you a biologically-plausible alignment (even in more standard biological sequence alignment problems, affine gap penalties tend to be more successful). Unfortunately, implementing an affine gap penalty is a bit more complicated. You end up needing to keep track of three score/pointer matrices. See this question or these slides for pseudo-code and more information.

仿射差距罚分是指开始差距而不是继续罚分的惩罚。因此,它有利于创建扩展的内含子样间隙,使其更有可能为您提供生物学上合理的比对(即使在更标准的生物序列比对问题中,仿射间隙惩罚往往更成功)。不幸的是,实现仿射差距惩罚有点复杂。您最终需要跟踪三个得分/指针矩阵。有关伪代码和更多信息,请参阅此问题或这些幻灯片。