在C/ c++中进行不区分大小写的子字符串搜索的最快方法是什么?

Note

The question below was asked in 2008 about some code from 2003. As the OP's update shows, this entire post has been obsoleted by vintage 2008 algorithms and persists here only as a historical curiosity.

下面的问题在2008年被问及2003年的一些代码。正如OP的更新所显示的那样，这篇文章已经被2008年的老式算法淘汰了，并且只在这里停留作为一种历史的好奇心。

I need to do a fast case-insensitive substring search in C/C++. My requirements are as follows:

我需要在C/ c++中做一个快速不区分大小写的子字符串搜索。我的要求如下:

Should behave like strstr() (i.e. return a pointer to the match point).
应该表现为strstr()(即返回一个指向匹配点的指针)。
Must be case-insensitive (doh).
必须不区分大小写(哎)。
Must support the current locale.
必须支持当前区域设置。
Must be available on Windows (MSVC++ 8.0) or easily portable to Windows (i.e. from an open source library).
必须在Windows上可用(msvc++ 8.0)，或者可以方便地移植到Windows(即从开源库中)。

Here is the current implementation I am using (taken from the GNU C Library):

下面是我正在使用的当前实现(取自GNU C库):

/* Return the offset of one string within another.
   Copyright (C) 1994,1996,1997,1998,1999,2000 Free Software Foundation, Inc.
   This file is part of the GNU C Library.

   The GNU C Library is free software; you can redistribute it and/or
   modify it under the terms of the GNU Lesser General Public
   License as published by the Free Software Foundation; either
   version 2.1 of the License, or (at your option) any later version.

   The GNU C Library is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   Lesser General Public License for more details.

   You should have received a copy of the GNU Lesser General Public
   License along with the GNU C Library; if not, write to the Free
   Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
   02111-1307 USA.  */

/*
 * My personal strstr() implementation that beats most other algorithms.
 * Until someone tells me otherwise, I assume that this is the
 * fastest implementation of strstr() in C.
 * I deliberately chose not to comment it.  You should have at least
 * as much fun trying to understand it, as I had to write it :-).
 *
 * Stephen R. van den Berg, berg@pool.informatik.rwth-aachen.de */

/*
 * Modified to use table lookup instead of tolower(), since tolower() isn't
 * worth s*** on Windows.
 *
 * -- Anders Sandvig (anders@wincue.org)
 */

#if HAVE_CONFIG_H
# include <config.h>
#endif

#include <ctype.h>
#include <string.h>

typedef unsigned chartype;

char char_table[256];

void init_stristr(void)
{
  int i;
  char string[2];

  string[1] = '\0';
  for (i = 0; i < 256; i++)
  {
    string[0] = i;
    _strlwr(string);
    char_table[i] = string[0];
  }
}

#define my_tolower(a) ((chartype) char_table[a])

char *
my_stristr (phaystack, pneedle)
     const char *phaystack;
     const char *pneedle;
{
  register const unsigned char *haystack, *needle;
  register chartype b, c;

  haystack = (const unsigned char *) phaystack;
  needle = (const unsigned char *) pneedle;

  b = my_tolower (*needle); 
  if (b != '\0')
  {
    haystack--;             /* possible ANSI violation */
    do
      {
        c = *++haystack;
        if (c == '\0')
          goto ret0;
      }
    while (my_tolower (c) != (int) b);

    c = my_tolower (*++needle);
    if (c == '\0')
        goto foundneedle;

    ++needle;
    goto jin;

    for (;;)
    {
      register chartype a;
        register const unsigned char *rhaystack, *rneedle;

        do
        {
          a = *++haystack;
          if (a == '\0')
              goto ret0;
          if (my_tolower (a) == (int) b)
              break;
          a = *++haystack;
          if (a == '\0')
              goto ret0;
        shloop:
          ;
        }
      while (my_tolower (a) != (int) b);

jin:      
      a = *++haystack;
      if (a == '\0')
          goto ret0;

        if (my_tolower (a) != (int) c)
          goto shloop;

        rhaystack = haystack-- + 1;
        rneedle = needle;

        a = my_tolower (*rneedle);

        if (my_tolower (*rhaystack) == (int) a)
          do
          {
              if (a == '\0')
                goto foundneedle;

              ++rhaystack;
          a = my_tolower (*++needle);
              if (my_tolower (*rhaystack) != (int) a)
                break;

          if (a == '\0')
                goto foundneedle;

          ++rhaystack;
              a = my_tolower (*++needle);
          }
          while (my_tolower (*rhaystack) == (int) a);

        needle = rneedle;       /* took the register-poor approach */

      if (a == '\0')
          break;
    }
  }
foundneedle:
  return (char*) haystack;
ret0:
  return 0;
}

Can you make this code faster, or do you know of a better implementation?

你能让代码更快吗?或者你知道更好的实现吗?

Note: I noticed that the GNU C Library now has a new implementation of strstr(), but I am not sure how easily it can be modified to be case-insensitive, or if it is in fact faster than the old one (in my case). I also noticed that the old implementation is still used for wide character strings, so if anyone knows why, please share.

注意:我注意到GNU C库现在有了strstr()的新实现，但是我不确定它可以很容易地被修改为不区分大小写，或者如果它实际上比旧的更快(在我的例子中)。我还注意到旧的实现仍然用于宽字符串，所以如果有人知道为什么，请分享。

Update

更新

Just to make things clear—in case it wasn't already—I didn't write this function, it's a part of the GNU C Library. I only modified it to be case-insensitive.

只是为了让事情更清楚——如果不是，我没有写这个函数，它是GNU C库的一部分。我只是修改它不区分大小写。

Also, thanks for the tip about strcasestr() and checking out other implementations from other sources (like OpenBSD, FreeBSD, etc.). It seems to be the way to go. The code above is from 2003, which is why I posted it here in hope for a better version being available, which apparently it is. :)

另外，感谢关于strcasestr()的提示以及从其他来源(如OpenBSD、FreeBSD等)检查其他实现。这似乎是一条出路。上面的代码是2003年的，这就是为什么我把它发布在这里，希望有一个更好的版本可用，显然是这样。:)

10 个解决方案

#1

The code you posted is about half as fast as strcasestr.

您发布的代码大约是strcasestr的一半。

$ gcc -Wall -o my_stristr my_stristr.c
steve@solaris:~/code/tmp
$ gcc -Wall -o strcasestr strcasestr.c 
steve@solaris:~/code/tmp
$ ./bench ./my_stristr > my_stristr.result ; ./bench ./strcasestr > strcasestr.result;
steve@solaris:~/code/tmp
$ cat my_stristr.result 
run 1... time = 6.32
run 2... time = 6.31
run 3... time = 6.31
run 4... time = 6.31
run 5... time = 6.32
run 6... time = 6.31
run 7... time = 6.31
run 8... time = 6.31
run 9... time = 6.31
run 10... time = 6.31
average user time over 10 runs = 6.3120
steve@solaris:~/code/tmp
$ cat strcasestr.result 
run 1... time = 3.82
run 2... time = 3.82
run 3... time = 3.82
run 4... time = 3.82
run 5... time = 3.82
run 6... time = 3.82
run 7... time = 3.82
run 8... time = 3.82
run 9... time = 3.82
run 10... time = 3.82
average user time over 10 runs = 3.8200
steve@solaris:~/code/tmp

The main function was:

主要功能是:

int main(void)
{
        char * needle="hello";
        char haystack[1024];
        int i;

        for(i=0;i<sizeof(haystack)-strlen(needle)-1;++i)
        {
                haystack[i]='A'+i%57;
        }
        memcpy(haystack+i,needle, strlen(needle)+1);
        /*printf("%s\n%d\n", haystack, haystack[strlen(haystack)]);*/
        init_stristr();

        for (i=0;i<1000000;++i)
        {
                /*my_stristr(haystack, needle);*/
                strcasestr(haystack,needle);
        }


        return 0;
}

It was suitably modified to test both implementations. I notice as I am typing this up I left in the init_stristr call, but it shouldn't change things too much. bench is just a simple shell script:

它经过适当的修改以测试这两个实现。我注意到，当我在init_stristr调用中输入这个时，它不应该改变太多。bench只是一个简单的shell脚本:

#!/bin/bash
function bc_calc()
{
        echo $(echo "scale=4;$1" | bc)
}
time="/usr/bin/time -p"
prog="$1"
accum=0
runs=10
for a in $(jot $runs 1 $runs)
do
        echo -n "run $a... "
        t=$($time $prog 2>&1| grep user | awk '{print $2}')
        echo "time = $t"
        accum=$(bc_calc "$accum+$t")
done

echo -n "average user time over $runs runs = "
echo $(bc_calc "$accum/$runs")

#2

You can use StrStrI function which finds the first occurrence of a substring within a string. The comparison is not case-sensitive. Don't forget to include its header - Shlwapi.h. Check this out: http://msdn.microsoft.com/en-us/library/windows/desktop/bb773439(v=vs.85).aspx

您可以使用StrStrI函数，它在字符串中找到子字符串的第一个出现。这种比较是不区分大小写的。别忘了加上它的标题——Shlwapi.h。看看这个:http://msdn.microsoft.com/en-us/library/windows/desktop/bb773439(v = vs.85). aspx

#3

Why do you use _strlwr(string); in init_stristr()? It's not a standard function. Presumably it's for locale support, but as it's not standard, I'd just use:

为什么使用_strlwr(string);在init_stristr()?它不是一个标准函数。可能是为了地区支持，但由于它不是标准，我只使用:

char_table[i] = tolower(i);

#4

use boost string algo. It is available, cross platform, and only a header file (no library to link in). Not to mention that you should be using boost anyway.

使用字符串算法。它是可用的，跨平台，并且只有一个头文件(没有库链接)。更不用说你应该使用boost了。

#include <boost/algorithm/string/find.hpp>

const char* istrstr( const char* haystack, const char* needle )
{
   using namespace boost;
   iterator_range<char*> result = ifind_first( haystack, needle );
   if( result ) return result.begin();

   return NULL;
}

#5

For platform independent use:

平*立的使用:

const wchar_t *szk_wcsstri(const wchar_t *s1, const wchar_t *s2)
{
    if (s1 == NULL || s2 == NULL) return NULL;
    const wchar_t *cpws1 = s1, *cpws1_, *cpws2;
    char ch1, ch2;
    bool bSame;

    while (*cpws1 != L'\0')
    {
        bSame = true;
        if (*cpws1 != *s2)
        {
            ch1 = towlower(*cpws1);
            ch2 = towlower(*s2);

            if (ch1 == ch2)
                bSame = true;
        }

        if (true == bSame)
        {
            cpws1_ = cpws1;
            cpws2 = s2;
            while (*cpws1_ != L'\0')
            {
                ch1 = towlower(*cpws1_);
                ch2 = towlower(*cpws2);

                if (ch1 != ch2)
                    break;

                cpws2++;

                if (*cpws2 == L'\0')
                    return cpws1_-(cpws2 - s2 - 0x01);
                cpws1_++;
            }
        }
        cpws1++;
    }
    return NULL;
}

#6

I'd advice you to take some of the common strcasestr implementation that already exists. For example of glib, glibc, OpenBSD, FreeBSD, etc. You can search for more with google.com/codesearch. You can then make some performance measurements and compare the different implementation.

我建议您采取一些已经存在的公共strcasestr实现。例如glib、glibc、OpenBSD、FreeBSD等，您可以通过google.com/codesearch搜索更多信息。然后您可以进行一些性能度量并比较不同的实现。

#7

Assuming both input strings are already lowercase.

假设两个输入字符串都是小写的。

int StringInStringFindFirst(const char* p_cText, const char* p_cSearchText)
{
    int iTextSize = strlen(p_cText);
    int iSearchTextSize = strlen(p_cSearchText);

    char* p_cFound = NULL;

    if(iTextSize >= iSearchTextSize)
    {
        int iCounter = 0;
        while((iCounter + iSearchTextSize) <= iTextSize)
        {
            if(memcmp( (p_cText + iCounter), p_cSearchText, iSearchTextSize) == 0)
                return  iCounter;
            iCounter ++;
        }
    }

    return -1;
}

You could also, try using masks... if for example most of the strings you are going to compare only contains chars from a to z, maybe it's worth to do something like this.

你也可以试试用口罩……如果你要比较的大多数字符串只包含从a到z的字符，那么做类似的事情是值得的。

long GetStringMask(const char* p_cText)
{
    long lMask=0;

    while(*p_cText != '\0')
    {       
        if (*p_cText>='a' && *p_cText<='z')
            lMask = lMask | (1 << (*p_cText - 'a') );
        else if(*p_cText != ' ')
        {
            lMask = 0;
            break;      
        }

        p_cText ++;
    }
    return lMask;
}

Then...

然后……

int main(int argc, char* argv[])
{

    char* p_cText = "this is a test";   
    char* p_cSearchText = "test";

    long lTextMask = GetStringMask(p_cText);
    long lSearchMask = GetStringMask(p_cSearchText);

    int iFoundAt = -1;
    // If Both masks are Valid
    if(lTextMask != 0 && lSearchMask != 0)
    {
        if((lTextMask & lSearchMask) == lSearchMask)
        {       
             iFoundAt = StringInStringFindFirst(p_cText, p_cSearchText);
        }
    }
    else
    {
        iFoundAt = StringInStringFindFirst(p_cText, p_cSearchText);
    }


    return 0;
}

#8

This will not consider the locale, but If you can change the IS_ALPHA and TO_UPPER you can make it to consider it.

这不会考虑语言环境，但是如果您可以更改IS_ALPHA和TO_UPPER，您可以让它考虑它。

#define IS_ALPHA(c) (((c) >= 'A' && (c) <= 'Z') || ((c) >= 'a' && (c) <= 'z'))
#define TO_UPPER(c) ((c) & 0xDF)

char * __cdecl strstri (const char * str1, const char * str2){
        char *cp = (char *) str1;
        char *s1, *s2;

        if ( !*str2 )
            return((char *)str1);

        while (*cp){
                s1 = cp;
                s2 = (char *) str2;

                while ( *s1 && *s2 && (IS_ALPHA(*s1) && IS_ALPHA(*s2))?!(TO_UPPER(*s1) - TO_UPPER(*s2)):!(*s1-*s2))
                        ++s1, ++s2;

                if (!*s2)
                        return(cp);

                ++cp;
        }
        return(NULL);
}

#9

If you want to shed CPU cycles, you might consider this - let's assume that we're dealing with ASCII and not Unicode.

如果您想摆脱CPU周期，您可能会考虑这一点——假设我们在处理的是ASCII而不是Unicode。

Make a static table with 256 entries. Each entry in the table is 256 bits.

创建一个包含256个条目的静态表。表中的每个条目都是256位。

To test whether or not two characters are equal, you do something like this:

为了测试两个字符是否相等，你可以这样做:

if (BitLookup(table[char1], char2)) { /* match */ }

To build the table, you set a bit everywhere in table[char1] where you consider it a match for char2. So in building the table you would set the bits at the index for 'a' and 'A' in the 'a'th entry (and the 'A'th entry).

为了构建表，您在表[char1]中设置了一个位置，您认为它与char2匹配。所以在构建表格时，你要在“a”和“a”的索引中设置“a”和“a”的位。

Now this is going to be slowish to do the bit lookup (bit look up will be a shift, mask and add most likely), so you could use instead a table of bytes so you use 8 bits to represent 1 bit. This will take 32K - so hooray - you've hit a time/space trade-off! We might want to make the table more flexible, so let's say we do this instead - the table will define congruences instead.

现在这将是slowish来做位查找(位查找将是一个移位，掩码和添加最有可能)，所以您可以使用一个字节表，所以您使用8位表示1位。这将需要32K——那么好——你已经达到了一个时间/空间的平衡!我们可能想要使表更灵活，所以我们假设我们这样做——这个表将定义一致。

Two characters are considered congruent if and only if there is a function that defines them as equivalent. So 'A' and 'a' are congruent for case insensitivity. 'A', 'À', 'Á' and 'Â' are congruent for diacritical insensitivity.

如果只有一个函数定义它们是等价的，那么两个字符被认为是一致的。所以A和A是不敏感的。“A”、“A”、“A”和“A”是不敏感的。

So you define bitfields that correspond to your congruencies

所以你定义了对应于你的一致性的位域。

#define kCongruentCase (1 << 0)
#define kCongruentDiacritical (1 << 1)
#define kCongruentVowel (1 << 2)
#define kCongruentConsonant (1 << 3)

Then your test is something like this:

你的测试是这样的:

inline bool CharsAreCongruent(char c1, char c2, unsigned char congruency)
{
    return (_congruencyTable[c1][c2] & congruency) != 0;
}

#define CaseInsensitiveCharEqual(c1, c2) CharsAreCongruent(c1, c2, kCongruentCase)

This kind of bit fiddling with ginormous tables is the heart of ctype, by the by.

这种摆弄巨大桌子的小玩意儿是ctype的心脏。

#10

If you can control the needle string so that it is always in lower case, then you can write a modified version of stristr() to avoid the lookups for that, and thus speed up the code. It isn't as general, but it can be faster - slightly faster. Similar comments apply to the haystack, but you are more likely to be reading the haystack from sources outside your control for you cannot be certain that the data meets the requirement.

如果您可以控制指针字符串，使其始终处于较低的情况，那么您可以编写一个修改的stristr()版本，以避免查找，从而加快代码的速度。它不是一般的，但它可以更快——稍微快一点。类似的评论也适用于haystack，但是您更有可能从您的控件之外的源读取haystack，因为您不能确定数据是否符合要求。

Whether the gain in performance is worth it is another question altogether. For 99% of applications, the answer is "No, it is not worth it". Your application might be one of the tiny minority where it matters. More likely, it is not.

绩效的增长是否值得，这是另一个问题。对于99%的应用程序来说，答案是“不，不值得”。您的应用程序可能是少数几个重要的应用程序之一。更有可能的是，事实并非如此。

#1