grep为何如此之快

时间:2021-12-22 16:51:06

下面是GNU grep的原作者MikeHaertel 在FreeBSD邮件列表中对 “GNU grep为什么比BSD grep要快” 这个问题所做的回答,解释了grep是如何进行快速搜索的,下面是邮件正文内容:

why GNU grep is fast

Mike Haertel mike at ducky.net

Sat Aug 21 03:00:30 UTC 2010

•     Previous message: Latest intr problems

•     Next message: why GNU grep is fast

•     Messages sorted by: [ date ] [ thread ] [subject ] [ author ]

________________________________________

Hi Gabor,

 

I am the original author of GNU grep.  I am also a FreeBSD user,

although I live on -stable (and older) andrarely pay attention

to -current.

 

However, while searching the -current mailinglist for an unrelated

reason, I stumbled across some flamageregarding BSD grep vs GNU grep

performance. You may have noticed that discussion too...

 

Anyway, just FYI, here's a quick summary ofwhere GNU grep gets

its speed. Hopefully you can carry these ideas over to BSD grep.

 

#1 trick: GNU grep is fast because it AVOIDSLOOKING AT

EVERY INPUT BYTE.

 

#2 trick: GNU grep is fast because itEXECUTES VERY FEW

INSTRUCTIONS FOR EACH BYTE that it *does*look at.

 

GNU grep uses the well-known Boyer-Moorealgorithm, which looks

first for the final letter of the targetstring, and uses a lookup

table to tell it how far ahead it can skip inthe input whenever

it finds a non-matching character.

 

GNU grep also unrolls the inner loop ofBoyer-Moore, and sets up

the Boyer-Moore delta table entries in such away that it doesn't

need to do the loop exit test at every unrolledstep.  The result

of this is that, in the limit, GNU grepaverages fewer than 3 x86

instructions executed for each input byte itactually looks at

(and it skips many bytes entirely).

 

See "Fast String Searching", byAndrew Hume and Daniel Sunday,

in the November 1991 issue of SoftwarePractice & Experience, for

a good discussion of Boyer-Mooreimplementation tricks.  It's

available as a free PDF online.

 

Once you have fast search, you'll find youalso need fast input.

 

GNU grep uses raw Unix input system calls andavoids copying data

after reading it.

 

Moreover, GNU grep AVOIDS BREAKING THE INPUTINTO LINES.  Looking

for newlines would slow grep down by a factorof several times,

because to find the newlines it would have tolook at every byte!

 

So instead of using line-oriented input, GNUgrep reads raw data into

a large buffer, searches the buffer usingBoyer-Moore, and only when

it finds a match does it go and look for thebounding newlines.

(Certain command line options like -n disablethis optimization.)

 

Finally, when I was last the maintainer ofGNU grep (15+ years ago...),

GNU grep also tried very hard to set thingsup so that the *kernel*

could ALSO avoid handling every byte of theinput, by using mmap()

instead of read() for file input.  At the time, using read() caused

most Unix versions to do extra copying.  Since GNU grep passed out

of my hands, it appears that use of mmapbecame non-default, but you

can still get it via --mmap.  And at least in cases where the data

is already file system buffer caches, mmap isstill faster:

 

  $time sh -c 'find . -type f -print | xargs grep -l 123456789abcdef'

  real     0m1.530s

  user     0m0.230s

  sys      0m1.357s

  $time sh -c 'find . -type f -print | xargs grep --mmap -l 123456789abcdef'

  real     0m1.201s

  user     0m0.330s

  sys      0m0.929s

 

[workload was a 648 megabyte MH mail foldercontaining 41000 messages]

So even nowadays, using --mmap can be worth a>20% speedup.

 

Summary:

 

- Use Boyer-Moore (and unroll its inner loopa few times).

 

- Roll your own unbuffered input using rawsystem calls.  Avoid copying

  theinput bytes before searching them.  (Do,however, use buffered

 *output*.  The normal grepscenario is that the amount of output is

  smallcompared to the amount of input, so the overhead of output

 buffer copying is small, while savings due to avoiding many small

 unbuffered writes can be large.)

 

- Don't look for newlines in the input untilafter you've found a match.

 

- Try to set things up (page-aligned buffers,page-sized read chunks,

 optionally use mmap) so the kernel can ALSO avoid copying the bytes.

 

The key to making programs fast is to makethem do practically nothing. ;-)

 

Regards,

 

       Mike

 

下面是翻译:


Gabor 您好,

我是GNU grep的原作者,同时我也是一名FreeBSD用户,不过我一直使用的是-stable版本(一个更老的版本),没怎么关注-current版本。


但是,当我无意间翻阅-current版的邮件列表时,发现了一些关于BSD grep与GNU grep性能的讨论,你可能也注意到了。


不管怎么说,仅供参考,下面是一些关于为什么GNU grep如此之快的简单总结。或许你能借鉴其中的一些思想运用到BSD grep中去。


#技巧1:GNU grep之所以快是因为它并不会去检查输入中的每一个字节


#技巧2:GNU grep之所以快是因为它对那些的确需要检查的每个字节都执行非常少的指令(操作)


GNU grep使用了非常著名的Boyer-Moore算法,该算法首先从目标字符串的最后一个字符开始查找,并且使用一个查找表,它可以在发现一个不匹配字符之后,计算出可以跳过多少个输入字符并继续查找。


GNU grep还展开了Boyer-Moore算法的内部循环,并建立了一个Boyer-Moore的delta表,这样它就不需要在每一个展开的步骤进行循环退出判断了。这样的结果就是,在极限情况下,GNU grep在需要检查的每一个输入字节上所执行的x86指令不会超过3条(并且还跳过了许多字节)。


你可以看看由Andrew Hume和Daniel Sunday 1991年11月在“Software Practice & Experience”上发表的论文“FastString Searching”,该文很好的讨论了Boyer-Moore算法的实现技巧,该文有免费的PDF在线版(这里查看或下载)。


一旦有了快速搜索,这时你会发现也需要同样快速的输入。


GNU grep使用了原生Unix输入系统调用并避免了在读取后对数据进行拷贝。


而且,GNU grep还避免了对输入进行分行,查找换行符会让grep减慢好几倍,因为要找换行符你就必须查看每个字节!


所以GNU grep没有使用基于行的输入,而是将原数据读入到一个大的缓冲区buffer,用Boyer-Moore算法对这个缓冲区进行搜索,只有在发现一个匹配之后才会去查找最近的换行符(某些命令参数,比如-n会禁止这种优化)。


最后,当我还在维护GNU grep的时候(15+年前……),GNU grep也尝试做一些非常困难的事情使内核也能避免处理输入的每个字节,比如使用mmap()而不是read()来进行文件输入。当时,用read()会使大部分Unix版本造成一些额外的拷贝。因为我已经不再GNU grep了,所以似乎mmap已经不再默认使用了,但是你仍然可以通过参数–mmap来启用它,至少在文件系统的buffer已经缓存了你的数据的情况下,mmap仍然要快一些:


1

2

3

4

5

6

7

8

$ time sh -c 'find . -type f -print | xargs grep -l 123456789abcdef'

  real  0m1.530s

  user  0m0.230s

  sys   0m1.357s

$ time sh -c 'find . -type f -print | xargs grep --mmap -l 123456789abcdef'

  real  0m1.201s

  user  0m0.330s

  sys   0m0.929s

[这里使用的输入是一个648M的MH邮件文件夹,包含大约41000条信息]

所以即使在今天,使用–mmap仍然可以提速20%以上。


总结:


- 使用Boyer-Moore算法(并且展开它的内层循环)。

- 使用原生系统调用来建立你的缓冲输入,避免在搜索之前拷贝输入字节。(无论如何,最好使用缓冲输出,因为在grep的常用场景中,输出的要比输入的少,所以输出缓冲拷贝的开销要小,并且可以节省许多这样小的无缓冲写操作。)

- 在找到一个匹配之前,不要查找换行符。

- 尝试做一些设置(比如页面对齐缓冲区,按页大小来读取块,选择性的使用mmap),这样可以使内核避免拷贝字节。

让程序变得更快的关键就是让它们做更少的事情。;-)

致礼

Mike