This question is related to How to use grep efficiently?
I am trying to search for a "string" in a folder containing 8-10 million small (~2-3 KB) plain-text files. I need to know all the files that contain "string".
At first I used this:
grep "string"
That was super slow.
Then I tried
find . -exec grep "string" {} \; -print
Based on the linked question, I used this:
find . | xargs -0 -n1 -P8 grep -H "string"
I get this error:
xargs: argument line too long
Does anyone know a way to accomplish this task relatively quickly?
I run this search on a server with more than 50 GB of available RAM and 14 CPU cores. I wish I could somehow use all that processing power to run this search faster.
4 Answers
#1
11
Without a matching -print0 from find, the -0 flag makes xargs look for NUL separators that never arrive, so the whole file list is treated as one over-long argument; that is where "argument line too long" comes from. You should remove the -0 argument to xargs and increase the -n parameter instead:
... | xargs -n16 ...
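Alternatively, keeping -0 also works if find is told to emit NUL-separated names, which additionally makes the pipeline safe for file names containing spaces or newlines. A minimal sketch in a throwaway directory (the file names and contents are made up for the demo):

```shell
# Sketch: NUL-separated variant of the pipeline. find -print0 emits
# NUL-terminated paths, which xargs -0 splits correctly; -n16 passes
# 16 files per grep invocation, and -P8 runs 8 greps in parallel.
dir=$(mktemp -d)
printf 'needle here\n'  > "$dir/a.txt"
printf 'needle there\n' > "$dir/b with space.txt"
printf 'only hay\n'     > "$dir/c.txt"

# grep -l prints only the names of matching files.
matches=$(find "$dir" -type f -print0 | xargs -0 -n16 -P8 grep -l "needle" | sort)
echo "$matches"
rm -rf "$dir"
```

On a real multi-million-file tree you would use a much larger -n so each grep process amortizes its startup cost over many files.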
#2
10
That's not such a big stack of files (though kudos for the 10⁷ files, a messy dream), but I created 100k files (400 MB overall) with
for i in {1..100000}; do head -c 10 /dev/urandom > dummy_$i; done
and ran some tests out of pure curiosity (the search keyword 10 was chosen randomly):
> time find . | xargs -n1 -P8 grep -H "10"
real 0m22.626s
user 0m0.572s
sys 0m5.800s
> time find . | xargs -n8 -P8 grep -H "10"
real 0m3.195s
user 0m0.180s
sys 0m0.748s
> time grep "10" *
real 0m0.879s
user 0m0.512s
sys 0m0.328s
> time awk '/10/' *
real 0m1.123s
user 0m0.760s
sys 0m0.348s
> time sed -n '/10/p' *
real 0m1.531s
user 0m0.896s
sys 0m0.616s
> time perl -ne 'print if /10/' *
real 0m1.428s
user 0m1.004s
sys 0m0.408s
By the way, there isn't a big difference in running time if I suppress the output by piping stdout to /dev/null. I am using Ubuntu 12.04 on a not-so-powerful laptop ;) My CPU is an Intel(R) Core(TM) i3-3110M @ 2.40GHz.
More out of curiosity:
> time find . | xargs -n1 -P8 grep -H "10" 1>/dev/null
real 0m22.590s
user 0m0.616s
sys 0m5.876s
> time find . | xargs -n4 -P8 grep -H "10" 1>/dev/null
real 0m5.604s
user 0m0.196s
sys 0m1.488s
> time find . | xargs -n8 -P8 grep -H "10" 1>/dev/null
real 0m2.939s
user 0m0.140s
sys 0m0.784s
> time find . | xargs -n16 -P8 grep -H "10" 1>/dev/null
real 0m1.574s
user 0m0.108s
sys 0m0.428s
> time find . | xargs -n32 -P8 grep -H "10" 1>/dev/null
real 0m0.907s
user 0m0.084s
sys 0m0.264s
> time find . | xargs -n1024 -P8 grep -H "10" 1>/dev/null
real 0m0.245s
user 0m0.136s
sys 0m0.404s
> time find . | xargs -n100000 -P8 grep -H "10" 1>/dev/null
real 0m0.224s
user 0m0.100s
sys 0m0.520s
#3
0
8 million files is a lot for one directory! However, 8 million times 2 KB is 16 GB, and you have 50 GB of RAM. I am thinking of a RAM disk...
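A non-root way to try this idea on most Linux systems: /dev/shm is usually a tmpfs mount, so files placed there live in RAM. The directory and file contents below are hypothetical stand-ins for the real corpus; check that /dev/shm has enough free space before copying 16 GB into it.

```shell
# Sketch: search files on a RAM-backed filesystem (/dev/shm is typically
# tmpfs on Linux, so no root is needed). Corpus contents are made up.
work="/dev/shm/grepdemo.$$"
mkdir -p "$work"
printf 'has the string inside\n' > "$work/f1.txt"
printf 'nothing relevant\n'      > "$work/f2.txt"

# Every read now comes from RAM instead of disk.
found=$(grep -rl "string" "$work")
echo "$found"
rm -rf "$work"
```

For the real task you would `cp -r` the corpus into such a directory once, then run the parallel grep pipeline against the RAM copy.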
#4
-2
If you've got that much RAM, why not read it all into memory and use a regular expression library to search? It's a simple C program:
#include <fcntl.h>
#include <regex.h>
...