String matching performance: gcc vs. CPython

Date: 2021-06-15 04:13:00

Whilst researching performance trade-offs between Python and C++, I've devised a small example which mostly focuses on dumb substring matching.

Here is the relevant C++:

using std::string;
std::vector<string> matches;
std::copy_if(patterns.cbegin(), patterns.cend(), back_inserter(matches),
   [&fileContents] (const string &pattern) { return fileContents.find(pattern) != string::npos; } );

The above is built with -O3.

And here is Python:

def getMatchingPatterns(patterns, text):
    return filter(text.__contains__, patterns)

Both of them take a largish set of patterns and an input file, and filter the list of patterns down to the ones found in the file using a dumb substring search.

The versions are:

  • gcc - 4.8.2 (Ubuntu) and 4.9.2 (cygwin)
  • python - 2.7.6 (Ubuntu) and 2.7.8 (cygwin)

What surprised me is the performance. I ran both on a low-spec Ubuntu machine, and Python was faster by about 20%. The same on a mid-spec PC under cygwin - Python was twice as fast. A profiler shows that 99+% of the cycles are spent in string matching (string copying and list comprehensions are insignificant).

Obviously, the Python implementation is native C under the hood, so I expected it to be roughly on par with the C++, but I did not expect it to be this fast.

Any insight into relevant CPython optimisations in comparison to gcc would be most welcome.

For reference, here are the full examples. The inputs are just a set of 50K HTML files (all read from disk in each test, no special caching):

Python:

import sys

def getMatchingPatterns(patterns, text):
   # Keep only the patterns that occur somewhere in the text (dumb substring search).
   return filter(text.__contains__, patterns)

def serialScan(filenames, patterns):
   # Pair each filename with the patterns found in that file's contents.
   return zip(filenames, [getMatchingPatterns(patterns, open(filename).read()) for filename in filenames])

if __name__ == "__main__":
   with open(sys.argv[1]) as filenamesListFile:
      filenames = filenamesListFile.read().split()
   with open(sys.argv[2]) as patternsFile:
      patterns = patternsFile.read().split()

   resultTuple = serialScan(filenames, patterns)
   for filename, patterns in resultTuple:
      print ': '.join([filename, ','.join(patterns)])

C++:

#include <iostream>
#include <iterator>
#include <fstream>
#include <string>
#include <vector>
#include <unordered_map>
#include <algorithm>

using namespace std;
using MatchResult = unordered_map<string, vector<string>>;
static const size_t PATTERN_RESERVE_DEFAULT_SIZE = 5000;

MatchResult serialMatch(const vector<string> &filenames, const vector<string> &patterns)
   {
   MatchResult res;
   for (auto &filename : filenames)
      {
      ifstream file(filename);
      // Read the whole file into memory.
      const string fileContents((istreambuf_iterator<char>(file)),
                                         istreambuf_iterator<char>());
      vector<string> matches;
      // Keep the patterns that occur anywhere in the file (dumb substring search via string::find).
      std::copy_if(patterns.cbegin(), patterns.cend(), back_inserter(matches),
                   [&fileContents] (const string &pattern) { return fileContents.find(pattern) != string::npos; } );

      res.insert(make_pair(filename, std::move(matches)));
      }
   return res;
   }

int main(int argc, char **argv)
    {
    vector<string> filenames;
    ifstream filenamesListFile(argv[1]);
    std::copy(istream_iterator<string>(filenamesListFile), istream_iterator<string>(),
             back_inserter(filenames));

    vector<string> patterns;
    patterns.reserve(PATTERN_RESERVE_DEFAULT_SIZE);
    ifstream patternsFile(argv[2]);
    std::copy(istream_iterator<string>(patternsFile), istream_iterator<string>(),
             back_inserter(patterns));

    auto matchResult = serialMatch(filenames, patterns);

    for (const auto &matchItem : matchResult)
      {
      cout << matchItem.first << ": ";
      for (const auto &matchString : matchItem.second)
         cout << matchString << ",";
      cout << endl;
      }
    }

1 Answer

#1

The Python 3.4 code b'abc' in b'abcabc' (or b'abcabc'.__contains__(b'abc') as in your example) executes the bytes_contains method, which in turn calls the inlined function stringlib_find, which delegates the search to FASTSEARCH.

The FASTSEARCH function then uses a modified Boyer-Moore search algorithm:

fast search/count implementation, based on a mix between boyer-moore and horspool, with a few more bells and whistles on the top. for some more background, see: http://effbot.org/zone/stringlib.htm

There are some modifications too, as noted by the comments:

note: fastsearch may access s[n], which isn't a problem when using Python's ordinary string types, but may cause problems if you're using this code in other contexts. also, the count mode returns -1 if there cannot possible be a match in the target string, and 0 if it has actually checked for matches, but didn't find any. callers beware!
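
To make the skip-table idea concrete, here is a minimal Horspool-style search sketched in C++. This is only an illustration of the algorithm family FASTSEARCH belongs to, not CPython's actual code: as far as I can tell, FASTSEARCH builds a compressed, bloom-filter-style mask plus a single skip value rather than the full 256-entry table used below.

#include <cstddef>
#include <string>

// Simplified Boyer-Moore-Horspool substring search over 8-bit characters.
// Returns the index of the first occurrence of `needle` in `haystack`,
// or std::string::npos if there is none.
std::size_t horspoolFind(const std::string &haystack, const std::string &needle)
   {
   const std::size_t n = haystack.size();
   const std::size_t m = needle.size();
   if (m == 0) return 0;
   if (m > n)  return std::string::npos;

   // Skip table: how far the window may shift when a given byte is aligned
   // with the last position of the needle. Bytes not in the needle allow a
   // full shift of m positions.
   std::size_t skip[256];
   for (std::size_t &s : skip) s = m;
   for (std::size_t i = 0; i + 1 < m; ++i)
      skip[static_cast<unsigned char>(needle[i])] = m - 1 - i;

   for (std::size_t pos = 0; pos + m <= n; )
      {
      // Compare right to left; a full match reports the window position.
      std::size_t j = m;
      while (j > 0 && haystack[pos + j - 1] == needle[j - 1]) --j;
      if (j == 0) return pos;
      // On mismatch, shift by the skip distance of the byte under the window's end.
      pos += skip[static_cast<unsigned char>(haystack[pos + m - 1])];
      }
   return std::string::npos;
   }

The point is that on a mismatch the window can jump up to m characters at a time, instead of advancing by one character as a naive scan does.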


The GNU C++ Standard Library basic_string<T>::find() implementation is as generic (and dumb) as possible; it just naively tries to match the pattern at each and every consecutive character position until it finds a match.
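
For what it's worth, C++17 later added pluggable searchers to the standard library, so a hypothetical rewrite of the matching lambda using std::boyer_moore_horspool_searcher might look like the sketch below. The gcc 4.8/4.9 versions in the question predate C++17, so this is illustrative rather than a drop-in fix for that setup.

// Sketch: the same filtering step as in the question, but using C++17's
// std::boyer_moore_horspool_searcher instead of string::find.
// Requires -std=c++17 (not available with the gcc 4.8/4.9 from the question).
#include <algorithm>
#include <functional>
#include <iterator>
#include <string>
#include <vector>

std::vector<std::string> matchPatterns(const std::string &fileContents,
                                       const std::vector<std::string> &patterns)
   {
   std::vector<std::string> matches;
   std::copy_if(patterns.cbegin(), patterns.cend(), std::back_inserter(matches),
                [&fileContents] (const std::string &pattern)
                {
                // Build a Horspool-style searcher for this pattern, then scan the file.
                std::boyer_moore_horspool_searcher searcher(pattern.cbegin(), pattern.cend());
                return std::search(fileContents.cbegin(), fileContents.cend(), searcher)
                       != fileContents.cend();
                });
   return matches;
   }

Note that the searcher is rebuilt for every (file, pattern) pair here, much as CPython recomputes its skip data on each __contains__ call; that setup cost is negligible next to scanning a large file.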


TL;DR: The reason the C++ standard library is so slow compared to Python is that it tries to run a generic algorithm on top of std::basic_string<char> but fails to do so efficiently for the more interesting cases, whereas in Python the programmer gets the most efficient algorithm on a case-by-case basis for free.
