Searching for a particular string pattern in 10000 files in parallel

Time: 2021-11-15 01:04:19

Problem Statement:

I need to search for a particular string pattern in around 10000 files and find the records in the files which contain that particular pattern. I can use grep here, but it is taking a lot of time.

Below is the command I am using to search for the particular string pattern after unzipping the dat.gz files:

gzcat /data/newfolder/real-time-newdata/*_20120809_0_*.gz | grep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1'

If I simply count how many files there are after unzipping the above dat.gz files:

gzcat /data/newfolder/real-time-newdata/*_20120809_0_*.gz | wc -l

I get around 10000 files. I need to search for the above string pattern in all these 10000 files and find the records which contain it. My command above works fine, but it is very, very slow.

What is the best approach here? Should I take 100 files at a time and search for the particular string pattern in those 100 files in parallel?

Note:

I am running SunOS

bash-3.00$ uname -a
SunOS lvsaishdc3in0001 5.10 Generic_142901-02 i86pc i386 i86pc

4 Solutions

#1


2  

Do NOT run this in parallel!!!! That's going to bounce the disk head all over the place; it will be much slower.

Since you are reading an archive file, there's one way to get a substantial performance boost--don't write the results of the decompression out. The ideal answer would be to decompress to a stream in memory; if that's not viable, then decompress to a ramdisk.

In any case you do want some parallelism here--one thread should be obtaining the data and then handing it off to another that does the search. That way you will either be waiting on the disk or on the core doing the decompressing; you won't waste any of that time doing the search.
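
A plain pipeline already gives roughly that shape: gzcat and grep run as separate processes connected by a pipe, so decompression and searching can overlap and nothing decompressed is written to disk. A minimal sketch, reusing the paths and pattern from the question:

# gzcat decompresses while grep searches the piped output;
# the decompressed data stays in the pipe buffer, never on disk.
gzcat /data/newfolder/real-time-newdata/*_20120809_0_*.gz | grep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1'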

(Note that in the case of the ramdisk you will want to aggressively read the files it wrote and then delete them so the ramdisk doesn't fill up.)
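
On Solaris, /tmp is normally tmpfs (memory and swap backed), so a rough sketch of the ramdisk variant could look like the following; the temporary file name is just an illustration, and each decompressed file is removed as soon as it has been searched so tmpfs does not fill up:

for f in /data/newfolder/real-time-newdata/*_20120809_0_*.gz; do
    gzcat "$f" > /tmp/search_current.dat        # decompress into tmpfs, not onto real disk
    grep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1' /tmp/search_current.dat
    rm -f /tmp/search_current.dat               # delete immediately so /tmp does not fill up
done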

#2


0  

For starters, you will need to uncompress the file to disk.

This does work (in bash), but you probably don't want to try to start 10,000 processes all at once. Run it inside the uncompressed directory:

for i in `find . -type f`; do ((grep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1' $i )&); done

So, we need to have a way to limit the number of spawned processes. This will loop as long as the number of grep processes running on the machine exceeds 10 (including the one doing the counting):

while [ `top -b -n1 | grep -c grep` -gt 10  ]; do echo true; done

I have run this, and it works.... but top takes so long to run that it effectively limits you to one grep per second. Can someone improve upon this, adding one to a count when a new process is started and decrementing by one when a process ends?

for i in `find . -type f`; do ((grep -l 'blah' $i)&); (while [ `top -b -n1 | grep -c grep` -gt 10 ]; do sleep 1; done); done

Any other ideas for how to determine when to sleep and when not to? Sorry for the partial solution, but I hope someone has the other bit you need.
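
One way to avoid polling top, sketched on the assumption that only this script's own background greps need counting, is to let the shell count its running background jobs:

MAX_JOBS=10
for i in `find . -type f`; do
    grep -l 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1' "$i" &
    # Block while this shell still has MAX_JOBS or more greps running.
    while [ `jobs -r | wc -l` -ge $MAX_JOBS ]; do
        sleep 1
    done
done
wait    # let the last batch finish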

#3


0  

If you are not using regular expressions you can use the -F option of grep or use fgrep. This may provide you with additional performance.
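For example, since the pattern in the question is a fixed string rather than a regular expression, the original command could become the following (fgrep is the safer spelling on SunOS, where the default grep may not accept -F):

gzcat /data/newfolder/real-time-newdata/*_20120809_0_*.gz | fgrep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1'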

#4


0  

Your gzcat .... | wc -l does not indicate 10000 files; it indicates 10000 lines in total across however many files there are.

This is the type of problem that xargs exists for. Assuming your version of gzip came with a script called gzgrep (or maybe just zgrep), you can do this:

find /data/newfolder/real-time-newdata -type f -name "*_20120809_0_*.gz" -print | xargs gzgrep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1'

That will run gzgrep with batches of as many individual files as will fit on a command line (there are options to xargs to limit how many, or for a number of other things). Unfortunately, gzgrep still has to uncompress each file and pass it off to grep, but there's not really any good way to avoid having to uncompress the whole corpus in order to search through it. Using xargs this way will, however, cut down on the overall number of new processes that need to be spawned.
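
For instance, the question's idea of taking 100 files at a time maps onto the -n option of xargs, which caps how many file names go into each gzgrep invocation:

find /data/newfolder/real-time-newdata -type f -name "*_20120809_0_*.gz" -print | xargs -n 100 gzgrep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1'

Each batch is still processed sequentially; this just bounds the per-invocation argument list.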
