在C/ c++中使用任何子字符串的顺序查找字符串内部的子字符串

时间:2022-01-14 21:28:40

Suppose I have a string "abcdpqrs", now "dcb" can be counted as a substring of above string as the characters are together. Also "pdq" is a part of above string. But "bcpq" is not. I hope you got what I want. Is there any efficient way to do this. All I can think is taking help of hash to do this. But it is taking long time even in O(n) program as backtracking is required in many cases. Any help will be appreciated.

假设我有一个字符串“abcdpqrs”,现在“dcb”可以作为上面字符串的子字符串,因为字符在一起。“pdq”也是上述字符串的一部分。但“bcpq”不是。我希望你得到我想要的。有什么有效的办法吗?我能想到的就是利用哈希做这个。但是即使在O(n)程序中也需要很长时间,因为在许多情况下都需要回溯。如有任何帮助,我们将不胜感激。

5 个解决方案

#1


3  

Here is an O(n * alphabet size) solution:

这里有一个O(n *字母表大小)解决方案:

Let's maintain an array count[a] = how many times the character a was in the current window [pos; pos + lenght of substring - 1]. It can be recomputed in O(1) time when the window is moved by 1 to the right(count[s[pos]]--, count[s[pos + substring lenght]]++, pos++). Now all we need is to check for each pos that count array is the same as count array for the substring(it can be computed only once).

让我们维护数组计数[a] =当前窗口中字符a的次数[pos;子字符串- 1的pos + lenght]。当窗口向右移动1时(count[s[pos]]]-- -- -- -- -- count[s[pos + substring lenght]]]- ++ +、pos++),可以在O(1)的时间内重新计算。现在我们只需要检查每个pos,它的计数数组与子字符串的计数数组相同(只能计算一次)。

It can actually be improved to O(n + alphabet size):

它实际上可以改进为O(n +字母大小):

Instead of comparing count arrays in a naive way, we can maintain the number diff = number of characters that do not have the same count value as in a substring for the current window. The key observation is that diff changes in obvious way we apply count[c]-- or count[c]++ (it either gets incremented, decremented or stays the same depending on only count[c] value). Two count arrays are the same if and only if diff is zero for current pos.

我们可以保持number diff =不具有与当前窗口子字符串中相同计数值的字符数,而不是简单地比较计数数组。关键的观察是,diff以明显的方式变化,我们应用count[c]——或count[c]++(它要么被递增、递减,要么保持不变,这取决于只使用count[c]值)。当且仅当当前pos的diff为零时,两个计数数组是相同的。

#2


1  

Lets say you have the string "axcdlef" and wants to search "opde":

假设你有一个字符串"axcdlef"并想要搜索"opde"

bool compare (string s1, string s2)
{
  // sort both here
  // return if they are equal when sorted;
}

you would need to call this function for this example with the following substrings of size 4(same as length as "opde"):

您需要为这个示例调用这个函数,其子字符串大小为4(与“opde”长度相同):

"axcd" "xcdl" "cdle" "dlef"

“xcdl”“axcd cdle”“dlef”

  bool exist = false;

  for (/*every split that has the same size as the search */)
      exist = exist || compare(currentsplit, search);

#3


0  

To use an array for this you are going to need some extra code to map where each character goes in there... Unless you know you are only using 'a' - 'z' or something similar that you can simply subtract from 'a' to get the position.

要为此使用数组,您需要一些额外的代码来映射每个字符所在的位置……除非你知道你只使用“a”-“z”或者类似的东西,你可以简单地从“a”中减去“a”来得到这个位置。

bool compare(string s1, string s2)
{
   int v1[SIZE_OF_ALFABECT];
   int v2[SIZE_OF_ALFABECT];
   int count = 0;
   map<char, int> mymap;

  // here is just pseudocode
   foreach letter in s1:
      if map doesnt contain this letter already:
           mymap[letter] = count++;

 // repeat the same foreach in s2

 /* You can break and return false here if you try to add new char into map, 
  that means that the second string has a different character already... */

 // count will now have the number of distinct chars that you have in both strs

 // you will need to check only 'count' positions in the vectors

 for(int i = 0; i < count; i++)
    v1[i] = v2[i] = 0;

 //another pseudocode
   foreach letter in s1:
      v1[mymap[leter]]++;
   foreach letter in s1:
      v2[mymap[leter]]++;

  for(int i = 0; i < count; i++)
      if(v1[i] != v2[i])
          return false;

  return true;
}

#4


0  

You can use a regex (i.e boost or Qt) for this. Alternately you an use this simple approach. You know the length k of the string s to be searched in string str. So take each k consecutive characters from str and check if any of these characters is present in s.

您可以使用regex (i)。e boost或Qt)用于此。交替使用这个简单的方法。你知道要在字符串str中搜索的字符串s的长度k,所以从str中取出每个连续的k个字符,检查这些字符是否存在于s中。

Starting point ( a naive implementation to make further optimizations):

起点(进行进一步优化的幼稚实现):

#include <iostream>

/* pos position where to extract probable string from str
*  s string set with possible repetitions being searched in str
*  str original string
*/
bool find_in_string( int pos, std::string s, std::string str)
{
    std::string str_s = str.substr( pos, s.length());
    int s_pos = 0;

    while( !s.empty())
    {
        std::size_t found = str_s.find( s[0]);
        if ( found!=std::string::npos)
        {
            s.erase( 0, 1);
            str_s.erase( found, 1);
        } else return 0;
    }

    return 1;
}

bool find_in_string( std::string s, std::string str)
{
    bool found = false;
    int pos = 0;    
    while( !found && pos < str.length() - s.length() + 1)
    {
        found = find_in_string( pos++, s, str);
    }

    return found;
}

Usage:

用法:

int main() {

    std::string s1 = "abcdpqrs";
    std::string s2 = "adcbpqrs";
    std::string searched = "dcb";
    std::string searched2 = "pdq";
    std::string searched3 = "bcpq";
    std::cout << find_in_string( searched, s1);
    std::cout << find_in_string( searched, s2);
    std::cout << find_in_string( searched2, s1);
    std::cout << find_in_string( searched3, s1);

    return 0;
}

prints: 1110

打印:1110

http://ideone.com/WrSMeV

http://ideone.com/WrSMeV

#5


-1  

Here is a O(m) best case, O(m!) worst case solution - m being the length of your search string:

这是O(m)最好的情况,O(m!)最坏的情况- m是你搜索字符串的长度:

Use a suffix-trie, e.g. a Ukkonnen Trie (there are some floating around, but I have no link at hand at the moment), and search for any permutation of the substring. Note that any lookup needs just O(1) for each chararacter of the string to search, regardless of the size of n.

使用一个suffix-trie,例如一个Ukkonnen Trie(有一些浮动,但我现在手头没有链接),并搜索子字符串的任何排列。注意,无论n的大小,任何查找都需要对字符串的每个chararacter进行搜索。

However, while the size of n does not matter, this becomes inpractical for large m.

然而,虽然n的大小并不重要,但对于大m来说,这就变得不实际了。

If however n is small enough anf one is willing to sacrifice lookup performance for index size, the suffix trie can store a string that contains all permutations of the original string.

如果n足够小,人们愿意牺牲索引大小的查找性能,那么后缀trie可以存储包含原始字符串的所有排列的字符串。

Then the lookup will always be O(m).

然后查找将始终为O(m)。

I'd suggest to go with the accepted answer for the general case. However, here you have a suggestion that can perform (much) better for small substrings and large string.

我建议对一般情况采用公认的答案。但是,这里您有一个建议,可以对小的子字符串和大字符串执行(很多)。

#1


3  

Here is an O(n * alphabet size) solution:

这里有一个O(n *字母表大小)解决方案:

Let's maintain an array count[a] = how many times the character a was in the current window [pos; pos + lenght of substring - 1]. It can be recomputed in O(1) time when the window is moved by 1 to the right(count[s[pos]]--, count[s[pos + substring lenght]]++, pos++). Now all we need is to check for each pos that count array is the same as count array for the substring(it can be computed only once).

让我们维护数组计数[a] =当前窗口中字符a的次数[pos;子字符串- 1的pos + lenght]。当窗口向右移动1时(count[s[pos]]]-- -- -- -- -- count[s[pos + substring lenght]]]- ++ +、pos++),可以在O(1)的时间内重新计算。现在我们只需要检查每个pos,它的计数数组与子字符串的计数数组相同(只能计算一次)。

It can actually be improved to O(n + alphabet size):

它实际上可以改进为O(n +字母大小):

Instead of comparing count arrays in a naive way, we can maintain the number diff = number of characters that do not have the same count value as in a substring for the current window. The key observation is that diff changes in obvious way we apply count[c]-- or count[c]++ (it either gets incremented, decremented or stays the same depending on only count[c] value). Two count arrays are the same if and only if diff is zero for current pos.

我们可以保持number diff =不具有与当前窗口子字符串中相同计数值的字符数,而不是简单地比较计数数组。关键的观察是,diff以明显的方式变化,我们应用count[c]——或count[c]++(它要么被递增、递减,要么保持不变,这取决于只使用count[c]值)。当且仅当当前pos的diff为零时,两个计数数组是相同的。

#2


1  

Lets say you have the string "axcdlef" and wants to search "opde":

假设你有一个字符串"axcdlef"并想要搜索"opde"

bool compare (string s1, string s2)
{
  // sort both here
  // return if they are equal when sorted;
}

you would need to call this function for this example with the following substrings of size 4(same as length as "opde"):

您需要为这个示例调用这个函数,其子字符串大小为4(与“opde”长度相同):

"axcd" "xcdl" "cdle" "dlef"

“xcdl”“axcd cdle”“dlef”

  bool exist = false;

  for (/*every split that has the same size as the search */)
      exist = exist || compare(currentsplit, search);

#3


0  

To use an array for this you are going to need some extra code to map where each character goes in there... Unless you know you are only using 'a' - 'z' or something similar that you can simply subtract from 'a' to get the position.

要为此使用数组,您需要一些额外的代码来映射每个字符所在的位置……除非你知道你只使用“a”-“z”或者类似的东西,你可以简单地从“a”中减去“a”来得到这个位置。

bool compare(string s1, string s2)
{
   int v1[SIZE_OF_ALFABECT];
   int v2[SIZE_OF_ALFABECT];
   int count = 0;
   map<char, int> mymap;

  // here is just pseudocode
   foreach letter in s1:
      if map doesnt contain this letter already:
           mymap[letter] = count++;

 // repeat the same foreach in s2

 /* You can break and return false here if you try to add new char into map, 
  that means that the second string has a different character already... */

 // count will now have the number of distinct chars that you have in both strs

 // you will need to check only 'count' positions in the vectors

 for(int i = 0; i < count; i++)
    v1[i] = v2[i] = 0;

 //another pseudocode
   foreach letter in s1:
      v1[mymap[leter]]++;
   foreach letter in s1:
      v2[mymap[leter]]++;

  for(int i = 0; i < count; i++)
      if(v1[i] != v2[i])
          return false;

  return true;
}

#4


0  

You can use a regex (i.e boost or Qt) for this. Alternately you an use this simple approach. You know the length k of the string s to be searched in string str. So take each k consecutive characters from str and check if any of these characters is present in s.

您可以使用regex (i)。e boost或Qt)用于此。交替使用这个简单的方法。你知道要在字符串str中搜索的字符串s的长度k,所以从str中取出每个连续的k个字符,检查这些字符是否存在于s中。

Starting point ( a naive implementation to make further optimizations):

起点(进行进一步优化的幼稚实现):

#include <iostream>

/* pos position where to extract probable string from str
*  s string set with possible repetitions being searched in str
*  str original string
*/
bool find_in_string( int pos, std::string s, std::string str)
{
    std::string str_s = str.substr( pos, s.length());
    int s_pos = 0;

    while( !s.empty())
    {
        std::size_t found = str_s.find( s[0]);
        if ( found!=std::string::npos)
        {
            s.erase( 0, 1);
            str_s.erase( found, 1);
        } else return 0;
    }

    return 1;
}

bool find_in_string( std::string s, std::string str)
{
    bool found = false;
    int pos = 0;    
    while( !found && pos < str.length() - s.length() + 1)
    {
        found = find_in_string( pos++, s, str);
    }

    return found;
}

Usage:

用法:

int main() {

    std::string s1 = "abcdpqrs";
    std::string s2 = "adcbpqrs";
    std::string searched = "dcb";
    std::string searched2 = "pdq";
    std::string searched3 = "bcpq";
    std::cout << find_in_string( searched, s1);
    std::cout << find_in_string( searched, s2);
    std::cout << find_in_string( searched2, s1);
    std::cout << find_in_string( searched3, s1);

    return 0;
}

prints: 1110

打印:1110

http://ideone.com/WrSMeV

http://ideone.com/WrSMeV

#5


-1  

Here is a O(m) best case, O(m!) worst case solution - m being the length of your search string:

这是O(m)最好的情况,O(m!)最坏的情况- m是你搜索字符串的长度:

Use a suffix-trie, e.g. a Ukkonnen Trie (there are some floating around, but I have no link at hand at the moment), and search for any permutation of the substring. Note that any lookup needs just O(1) for each chararacter of the string to search, regardless of the size of n.

使用一个suffix-trie,例如一个Ukkonnen Trie(有一些浮动,但我现在手头没有链接),并搜索子字符串的任何排列。注意,无论n的大小,任何查找都需要对字符串的每个chararacter进行搜索。

However, while the size of n does not matter, this becomes inpractical for large m.

然而,虽然n的大小并不重要,但对于大m来说,这就变得不实际了。

If however n is small enough anf one is willing to sacrifice lookup performance for index size, the suffix trie can store a string that contains all permutations of the original string.

如果n足够小,人们愿意牺牲索引大小的查找性能,那么后缀trie可以存储包含原始字符串的所有排列的字符串。

Then the lookup will always be O(m).

然后查找将始终为O(m)。

I'd suggest to go with the accepted answer for the general case. However, here you have a suggestion that can perform (much) better for small substrings and large string.

我建议对一般情况采用公认的答案。但是,这里您有一个建议,可以对小的子字符串和大字符串执行(很多)。