Efficient algorithm for finding matching substrings longer than 14 characters from one text inside another text

Time: 2022-09-13 08:31:45

I've got a long text (about 5 MB filesize) and another text called pattern (around 2000 characters).

The task is to find parts of a genome pattern that are 15 characters or longer and also occur in the long text.

example:

long text: ACGTACGTGTCA AAAACCCCGGGGTTTTA GTACCCGTAGGCGTAT AND MUCH LONGER

pattern: ACGGTATTGAC AAAACCCCGGGGTTTTA TGTTCCCAG

I'm looking for an efficient (and easy to understand and implement) algorithm.

A bonus would be a way to implement this with just char arrays in C++, if that's possible at all.

6 Answers

#1


2  

Stand back, I'm gonna live-code:

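#include <stdio.h>
#include <string.h>

// Brute force: for every pair of start positions (i in a, j in b), count how many characters agree.
void match_substring(const char *a, const char *b, int n) // n=15 in your case
{
    int alen = (int)strlen(a); // I'll leave all the null-checking and buffer-overrun business as an exercise to the reader
    int blen = (int)strlen(b);
    for (int i = 0; i < alen; i++) {
        for (int j = 0; j < blen; j++) {
            int k = 0; // initialized, and declared here so it is still in scope for the check below
            while (i + k < alen && j + k < blen && a[i + k] == b[j + k]) // compare against b[j+k], not b[i+k]
                k++;
            if (k >= n)
                printf("match from (%d:%d) for %d bytes\n", i, j, k);
        }
    }
}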

#2


7  

Here's one algorithm - I'm not sure if it has a name. It requires a "rolling hash" - a (non-cryptographic) hash function that has the property that given the hash of a sequence AB...C, it is efficient to calculate the hash of the sequence B...CD.

  1. Calculate the rolling hashes of the sequences pattern[0..14], pattern[1..15], pattern[2..16]... and store each starting index in a hash table, keyed by its hash.

  2. Calculate the rolling hash of haystack[0..14] and see if it is in the hash table. If it is, compare haystack[0..14] to pattern[pos..pos+14], where pos was retrieved from the hash table.

  3. From the rolling hash of haystack[0..14], efficiently compute the rolling hash of haystack[1..15] and see if it is in the hash table. Repeat until you reach the end of haystack.

Note that your 15-character strings only have 4^15 = 2^30 possible values, so your "hash function" could be a simple mapping to the value of the string treated as a 15-digit base-4 number, which is fast to compute, has the rolling hash property, and is unique.

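The three steps above, with the base-4 value used directly as the hash, might look roughly like this in C++. This is only a sketch under some assumptions: `Match`, `code_of`, `for_each_window` and `find_common_substrings` are names I made up, the input is upper-case A/C/G/T (any other character simply breaks the current window), and n must be at most 31 so the key fits in 64 bits. Because the key is exact, the verification compare from step 2 is not needed; instead each hit is extended to the right to get the full length, and hits that merely continue a match found one position earlier are skipped.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <unordered_map>
#include <vector>

// text[text_pos .. text_pos+len) equals pattern[pat_pos .. pat_pos+len).
struct Match { std::size_t text_pos; std::size_t pat_pos; std::size_t len; };

// Map a nucleotide to 2 bits; -1 for anything else (assumption: upper-case ACGT input).
static int code_of(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return -1;
    }
}

// Calls visit(start, key) for every window of n valid bases in s[0..len).
template <typename Visit>
static void for_each_window(const char *s, std::size_t len, std::size_t n, Visit visit) {
    const std::uint64_t mask = (std::uint64_t{1} << (2 * n)) - 1;  // keeps the newest n digits
    std::uint64_t key = 0;
    std::size_t valid = 0;  // consecutive valid bases ending at position i
    for (std::size_t i = 0; i < len; ++i) {
        const int c = code_of(s[i]);
        if (c < 0) { valid = 0; key = 0; continue; }
        key = ((key << 2) | static_cast<std::uint64_t>(c)) & mask;  // roll: drop oldest, append newest
        if (++valid >= n) visit(i + 1 - n, key);
    }
}

std::vector<Match> find_common_substrings(const char *text, const char *pattern, std::size_t n = 15) {
    assert(n >= 1 && n <= 31);
    const std::size_t tlen = std::strlen(text), plen = std::strlen(pattern);

    // Step 1: index every length-n window of the pattern by its key.
    std::unordered_multimap<std::uint64_t, std::size_t> index;
    for_each_window(pattern, plen, n, [&](std::size_t pos, std::uint64_t key) {
        index.emplace(key, pos);
    });

    // Steps 2 and 3: slide over the text and look each window up.
    std::vector<Match> out;
    for_each_window(text, tlen, n, [&](std::size_t tpos, std::uint64_t key) {
        const auto range = index.equal_range(key);
        for (auto it = range.first; it != range.second; ++it) {
            const std::size_t ppos = it->second;
            // Skip hits that are the continuation of a match starting one position earlier.
            if (tpos > 0 && ppos > 0 && text[tpos - 1] == pattern[ppos - 1] &&
                code_of(text[tpos - 1]) >= 0)
                continue;
            std::size_t k = n;  // equal keys mean the first n characters are equal
            while (tpos + k < tlen && ppos + k < plen &&
                   text[tpos + k] == pattern[ppos + k] && code_of(text[tpos + k]) >= 0)
                ++k;
            out.push_back({tpos, ppos, k});
        }
    });
    return out;
}
```

For the example strings from the question (spaces left in), this should return a single match: text offset 13, pattern offset 12, length 17. The hash table holds about 2000 entries and the text is scanned once, so the 5 MB input is not a problem.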

#3


4  

One way would be to get hold of an implementation of Aho-Corasick and use it to create something that will recognise any of the 15-character chunks in the pattern, and then use this to search the text. With Aho-Corasick the cost to build the matcher and the cost to search are both linear, so this should be practical.

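If you can't find an implementation you like, a stripped-down Aho-Corasick for this particular case is not too long. The sketch below is my own and makes some assumptions: the names (`ChunkMatcher`, `Node`, `Hit`, `code_of`) are invented, the alphabet is upper-case A/C/G/T, and since all chunks have the same length no chunk can be a proper suffix of another, so the usual output links are left out. If the same chunk occurs several times in the pattern, only its first position is remembered.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <queue>
#include <vector>

// Map a nucleotide to 0..3; -1 for anything else (assumption: upper-case ACGT input).
static int code_of(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return -1;
    }
}

struct Node {
    int next[4] = {-1, -1, -1, -1};  // trie edges; build() completes them into a full automaton
    int fail = 0;                    // longest proper suffix of this node that is also in the trie
    int pat_pos = -1;                // pattern offset of the chunk ending here, or -1
};

struct Hit { std::size_t text_pos; std::size_t pat_pos; };

class ChunkMatcher {
public:
    ChunkMatcher(const char *pattern, std::size_t n) : n_(n), nodes_(1) {
        assert(n >= 1);
        const std::size_t plen = std::strlen(pattern);
        for (std::size_t p = 0; p + n <= plen; ++p) add(pattern + p, p);
        build();
    }

    // Reports every text offset where some length-n chunk of the pattern starts.
    std::vector<Hit> search(const char *text) const {
        std::vector<Hit> hits;
        int state = 0;
        for (std::size_t i = 0; text[i] != '\0'; ++i) {
            const int c = code_of(text[i]);
            if (c < 0) { state = 0; continue; }  // invalid character: restart at the root
            state = nodes_[state].next[c];
            if (nodes_[state].pat_pos >= 0)
                hits.push_back({i + 1 - n_, static_cast<std::size_t>(nodes_[state].pat_pos)});
        }
        return hits;
    }

private:
    void add(const char *s, std::size_t pat_pos) {
        for (std::size_t i = 0; i < n_; ++i)
            if (code_of(s[i]) < 0) return;  // chunks containing other characters are ignored
        int v = 0;
        for (std::size_t i = 0; i < n_; ++i) {
            const int c = code_of(s[i]);
            if (nodes_[v].next[c] < 0) {
                nodes_[v].next[c] = static_cast<int>(nodes_.size());
                nodes_.emplace_back();
            }
            v = nodes_[v].next[c];
        }
        if (nodes_[v].pat_pos < 0) nodes_[v].pat_pos = static_cast<int>(pat_pos);
    }

    // Breadth-first pass that sets fail links and fills in the missing transitions.
    void build() {
        std::queue<int> q;
        for (int c = 0; c < 4; ++c) {
            const int u = nodes_[0].next[c];
            if (u < 0) nodes_[0].next[c] = 0;
            else q.push(u);  // depth-1 nodes keep fail = 0
        }
        while (!q.empty()) {
            const int v = q.front();
            q.pop();
            for (int c = 0; c < 4; ++c) {
                const int u = nodes_[v].next[c];
                const int via_fail = nodes_[nodes_[v].fail].next[c];
                if (u < 0) nodes_[v].next[c] = via_fail;
                else { nodes_[u].fail = via_fail; q.push(u); }
            }
        }
    }

    std::size_t n_;
    std::vector<Node> nodes_;
};
```

With a 2000-character pattern the trie has at most about 30,000 nodes, and the search is a single pass over the text. On the example from the question it should report three hits (text offsets 13, 14 and 15), one for each 15-character window inside the shared 17-character run.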

#4


1  

If you're using a good implementation of the C library (or even a mediocre one like glibc that happens to have a good implementation of this function), strstr will do very well. I've heard there's a new algorithm that's especially good for DNA (small alphabet), but I can't find the reference right now. Other than that, 2way (which glibc uses) is optimal.

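If you go the strstr route, one straightforward (and certainly not the fastest) way to apply it here is to copy each 15-character window of the pattern into a small buffer and search for it repeatedly. A rough sketch, where `report_chunks_with_strstr` is a name I made up and the 64-byte buffer is an arbitrary assumption:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Prints every occurrence in text of every length-n window of pattern;
// returns the number of occurrences found.
std::size_t report_chunks_with_strstr(const char *text, const char *pattern, std::size_t n = 15) {
    char chunk[64];  // holds one window plus the terminating '\0'
    assert(n >= 1 && n < sizeof chunk);
    const std::size_t plen = std::strlen(pattern);
    std::size_t found = 0;
    for (std::size_t p = 0; p + n <= plen; ++p) {
        std::memcpy(chunk, pattern + p, n);
        chunk[n] = '\0';
        // Restart one character past each hit so overlapping occurrences are found too.
        for (const char *hit = std::strstr(text, chunk); hit != nullptr;
             hit = std::strstr(hit + 1, chunk)) {
            std::printf("pattern[%zu..%zu] occurs at text offset %zu\n",
                        p, p + n - 1, static_cast<std::size_t>(hit - text));
            ++found;
        }
    }
    return found;
}
```

Keep in mind that this makes one pass over the text per window, so for the sizes in the question that is roughly 2000 passes over 5 MB. It treats every character alike, so spaces or line breaks in the input count as ordinary characters.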

#5


1  

I would highly suggest going to your library and checking out "Algorithms, 4th Edition" by Robert Sedgewick and Kevin Wayne. They have an entire chapter devoted to substring searching. In addition, it is probably worth checking out the book website, algs4.cs.princeton.edu.

TL;DR -- If you're determined, you can whip up a substring search using char arrays that runs in guaranteed time linear in the input length. There are code samples in the book and online. It doesn't get much easier than that.

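To give an idea of what such a char-array search looks like, here is a minimal sketch of Knuth-Morris-Pratt, one of the linear-time algorithms covered in that chapter. This is my own version rather than the book's code, and `kmp_find` is a made-up name. It finds one fixed needle, so for the original problem you would still have to run it for each 15-character window of the pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Returns the offset of the first occurrence of needle in haystack, or -1 if there is none.
long kmp_find(const char *haystack, const char *needle) {
    assert(haystack != nullptr && needle != nullptr);
    const std::size_t n = std::strlen(haystack), m = std::strlen(needle);
    if (m == 0) return 0;

    // fail[i] = length of the longest proper prefix of needle[0..i] that is also a suffix of it.
    std::vector<std::size_t> fail(m, 0);
    for (std::size_t i = 1, k = 0; i < m; ++i) {
        while (k > 0 && needle[i] != needle[k]) k = fail[k - 1];
        if (needle[i] == needle[k]) ++k;
        fail[i] = k;
    }

    // Scan the haystack once; k is the number of needle characters currently matched.
    for (std::size_t i = 0, k = 0; i < n; ++i) {
        while (k > 0 && haystack[i] != needle[k]) k = fail[k - 1];
        if (haystack[i] == needle[k]) ++k;
        if (k == m) return static_cast<long>(i + 1 - m);
    }
    return -1;
}
```

The haystack index never moves backwards, which is where the linear-time guarantee comes from: on a mismatch only the needle position falls back, using the precomputed table.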

#6


-1  

I think a "suffix tree" can solve it better, with a performance of O(log n).
