Fastest way to find the minimum Hamming distance to any substring?

Time: 2021-03-23 15:25:27

Given a long string L and a shorter string S (the constraint is that L.length must be >= S.length), I want to find the minimum Hamming distance between S and any substring of L with length equal to S.length. Let's call this function minHamming(). For example,

minHamming(ABCDEFGHIJ, CDEFGG) == 1.

minHamming(ABCDEFGHIJ, BCDGHI) == 3.

Doing this the obvious way (enumerating every substring of L) requires O(S.length * L.length) time. Is there any clever way to do this in sublinear time? I search the same L with several different S strings, so doing some complicated preprocessing of L once is acceptable.

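For concreteness, the obvious way can be written as a short sliding-window scan (a minimal sketch, for reference only):

def min_hamming(L, S):
    # compare S against every length-len(S) window of L and take the minimum
    return min(
        sum(a != b for a, b in zip(L[i:i + len(S)], S))
        for i in range(len(L) - len(S) + 1)
    )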

Edit: The modified Boyer-Moore would be a good idea, except that my alphabet is only 4 letters (DNA).

3 Answers

#1


13  

Perhaps surprisingly, this exact problem can be solved in just O(|A| n log n) time using Fast Fourier Transforms (FFTs), where n is the length of the larger sequence L and |A| is the size of the alphabet.

Here is a freely available PDF of a paper by Donald Benson describing how it works:

  • Fourier methods for biosequence analysis (Donald Benson, Nucleic Acids Research 1990 vol. 18, pp. 3001-3006)

Summary: Convert each of your strings S and L into several indicator vectors (one per character, so 4 in the case of DNA), and then convolve corresponding vectors to determine match counts for each possible alignment. The trick is that convolution in the "time" domain, which ordinarily requires O(n^2) time, can be implemented using multiplication in the "frequency" domain, which requires just O(n) time, plus the time required to convert between domains and back again. Using the FFT, each conversion takes just O(n log n) time, so the overall time complexity is O(|A| n log n). For greatest speed, finite field FFTs are used, which require only integer arithmetic.

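As a concrete illustration, here is a minimal sketch of this scheme in Python, using numpy's floating-point FFT rather than the finite field FFTs mentioned above (min_hamming_fft and the alphabet parameter are illustrative names, not from the paper):

import numpy as np

def min_hamming_fft(L, S, alphabet="ACGT"):
    n, m = len(L), len(S)
    size = 1
    while size < n + m:  # pad to a power of two for the FFT
        size *= 2
    matches = np.zeros(n - m + 1)  # matches[i] = match count of S vs. L[i:i+m]
    for ch in alphabet:
        l_vec = (np.frombuffer(L.encode(), dtype=np.uint8) == ord(ch)).astype(float)
        s_vec = (np.frombuffer(S.encode(), dtype=np.uint8) == ord(ch)).astype(float)
        # cross-correlation = convolution with the reversed pattern; doing the
        # multiplication in the frequency domain costs O(n log n) per character
        fl = np.fft.rfft(l_vec, size)
        fs = np.fft.rfft(s_vec[::-1], size)
        corr = np.fft.irfft(fl * fs, size)
        matches += corr[m - 1:n]  # keep only alignments of S fully inside L
    return int(round(m - matches.max()))

Since the Hamming distance at each alignment is just m minus its match count, the minimum distance falls out of the maximum of the summed correlations.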

Note: For arbitrary S and L, this algorithm is clearly a huge performance win over the straightforward O(mn) algorithm as |S| and |L| become large, but OTOH if S is typically shorter than log |L| (e.g. when querying a large DB with a small sequence), then obviously this approach provides no speedup.

UPDATE 21/7/2009: Updated to mention that the time complexity also depends linearly on the size of the alphabet, since a separate pair of indicator vectors must be used for each character in the alphabet.

#2


2  

Modified Boyer-Moore

I've just dug up an old Python implementation of Boyer-Moore I had lying around and modified the matching loop (where the text is compared to the pattern). Instead of breaking out as soon as the first mismatch is found between the two strings, simply count up the number of mismatches, but remember the first mismatch:

current_dist = 0
while pattern_pos >= 0:
    if pattern[pattern_pos] != text[text_pos]:
        if first_mismatch == -1:
            first_mismatch = pattern_pos
            tp = text_pos
        current_dist += 1
        if current_dist == smallest_dist:
            break

    pattern_pos -= 1
    text_pos -= 1

smallest_dist = min(current_dist, smallest_dist)
# if the distance is 0, we've had a match and can quit
if current_dist == 0:
    return 0
else:  # shift
    pattern_pos = first_mismatch
    text_pos = tp
    ...

If the string did not match completely at this point, go back to the point of the first mismatch by restoring the values. This makes sure that the smallest distance is actually found.

The whole implementation is rather long (~150 LOC), but I can post it on request. The core idea is outlined above; everything else is standard Boyer-Moore.

Preprocessing on the Text

Another way to speed things up is preprocessing the text to have an index on character positions. You only want to start comparing at positions where at least a single match between the two strings occurs, otherwise the Hamming distance is |S| trivially.

import sys
from collections import defaultdict
import bisect

def char_positions(t):
    pos = defaultdict(list)
    for idx, c in enumerate(t):
        pos[c].append(idx)
    return dict(pos)

This method simply creates a dictionary which maps each character in the text to the sorted list of its occurrences.

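For example, on a small DNA string:

>>> char_positions("GATTACA")
{'G': [0], 'A': [1, 4, 6], 'T': [2, 3], 'C': [5]}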

The comparison loop is more or less unchanged from the naive O(mn) approach, except that the position at which comparison starts does not simply advance by 1 each time; instead it jumps ahead based on the character positions:

def min_hamming(text, pattern):
    best = len(pattern)
    pos = char_positions(text)

    i = find_next_pos(pattern, pos, 0)

    # note: <= so the last possible alignment is also checked
    while i <= len(text) - len(pattern):
        dist = 0
        for c in range(len(pattern)):
            if text[i+c] != pattern[c]:
                dist += 1
                if dist == best:  # can't improve at this position
                    break
        else:
            if dist == 0:  # exact match, can't do better
                return 0
        best = min(dist, best)
        i = find_next_pos(pattern, pos, i + 1)

    return best

The actual improvement is in find_next_pos:

def find_next_pos(pattern, pos, i):
    smallest = sys.maxsize  # sys.maxint in Python 2
    for idx, c in enumerate(pattern):
        if c in pos:
            x = bisect.bisect_left(pos[c], i + idx)
            if x < len(pos[c]):
                smallest = min(smallest, pos[c][x] - idx)
    return smallest

For each new position, we find the lowest index at which a character from S occurs in L. If there is no such index any more, the algorithm will terminate.

find_next_pos is certainly complex, and one could try to improve it by only using the first several characters of the pattern S, or use a set to make sure characters from the pattern are not checked twice.

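Putting the pieces together on the strings from the question:

>>> min_hamming("ABCDEFGHIJ", "CDEFGG")
1
>>> min_hamming("ABCDEFGHIJ", "BCDGHI")
3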

Discussion

Which method is faster largely depends on your dataset. The more diverse your alphabet is, the larger the jumps will be. If you have a very long L, the second method with preprocessing might be faster. For very, very short strings (like in your question), the naive approach will certainly be the fastest.

DNA

If you have a very small alphabet, you could try to get the character positions for character bigrams (or larger) rather than unigrams.

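A minimal sketch of that variant, mirroring char_positions above (bigram_positions is a hypothetical name):

from collections import defaultdict

def bigram_positions(t):
    # index every adjacent pair of characters instead of single characters
    pos = defaultdict(list)
    for idx in range(len(t) - 1):
        pos[t[idx:idx + 2]].append(idx)
    return dict(pos)

With 4 letters there are 16 possible bigrams instead of 4 unigrams, so the index discriminates between positions more finely and the jumps computed in find_next_pos grow accordingly.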

#3


-1  

You're stuck as far as big-O is concerned. At a fundamental level, you're going to need to test if every letter in the target matches each eligible letter in the substring.

Luckily, this is easily parallelized.

One optimization you can apply is to keep a running count of mismatches for the current position. If it's greater than the lowest Hamming distance so far, then obviously you can skip to the next possibility.

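A minimal sketch of that optimization on top of the naive scan (min_hamming_pruned is an illustrative name):

def min_hamming_pruned(L, S):
    best = len(S)
    for i in range(len(L) - len(S) + 1):
        dist = 0
        for j in range(len(S)):
            if L[i + j] != S[j]:
                dist += 1
                if dist >= best:  # can't beat the best so far; skip ahead
                    break
        if dist < best:
            best = dist
            if best == 0:  # exact match; can't do better
                return 0
    return best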
