I'm working in bash and I have a large file from which I want to remove every line that does not match a certain regex, probably using grep -e "<regex>" <file> > output.txt
What I want to keep is any line that contains a specified character exactly x times. For example, in the binary sequence
0000, 0001, 0010, 0011, 0100, 0101, 0110, 0111, 1000, 1001, 1010, 1011, 1100, 1101, 1110, 1111
I would like to keep only the strings that contain exactly two 1s, leaving me with
0011, 0101, 0110, 1001, 1010, 1100
I would then use a bash variable to vary the count I need (always exactly half of the length, and all strings have the same length). I'm literally looking for lines that are half 0s and half 1s.
I have this right now. It's not using regex. It works, but is very slow:
($1 is the length of every string; $d is just a directory.)
sed -e 's/\(.\)/\1 /g' < $d/input.txt > $d/spaces.txt
awk '{c=0;for(i=1;i<=NF;++i){c+=$i};print c}' $d/spaces.txt > $d/sums.txt
grep -n "$(($1/2))" $d/sums.txt | cut -f1 -d: > $d/linenums.txt
for i in $(cat $d/linenums.txt)
do
sed "${i}q;d" $d/input.txt
done > $d/valids.txt
In case you wonder: this puts a space between every digit, turning 1010 into 1 0 1 0, then adds the values together and saves the results in sums.txt. It greps for length/2 and saves only the matching line numbers in linenums.txt, then reads linenums.txt and writes the corresponding lines from input.txt to valids.txt.
I need something quicker; the for loop is what takes far too long.
Thanks for your time and for sharing your knowledge with me.
1 solution
#1
You can definitely make this faster.
Here is a grep regex example that matches any line with exactly two occurrences of 1:
grep '^\([^1]*1[^1]*\)\{2\}$' input.txt
You can generalize this to match exactly n occurrences of a character c (here held in the shell variables $n and $c):
grep "^\([^$c]*$c[^$c]*\)\{$n\}\$" input.txt
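As a quick check, the generalized pattern can be wrapped in a small function and run against the 16 four-bit strings from the question. This is only a sketch: the name keep_exact_count is made up for illustration, and it assumes c is a single ordinary character such as 0 or 1 (not something special inside a bracket expression, like ^ or ]).

```shell
# keep_exact_count is a hypothetical helper name, not part of the original answer.
# Usage: keep_exact_count CHAR COUNT [FILE...]   (reads stdin if no FILE is given)
# Assumes CHAR is one ordinary character (e.g. 0 or 1).
keep_exact_count() {
    c=$1
    n=$2
    shift 2
    # Same pattern as above: n groups, each containing exactly one $c,
    # anchored so the groups must cover the whole line.
    grep "^\([^$c]*$c[^$c]*\)\{$n\}\$" "$@"
}

# The 16 four-bit strings from the question, one per line.
printf '%s\n' 0000 0001 0010 0011 0100 0101 0110 0111 \
              1000 1001 1010 1011 1100 1101 1110 1111 |
    keep_exact_count 1 2
```

This should print the six strings with exactly two 1s: 0011, 0101, 0110, 1001, 1010, 1100. Note that the pattern is meant for n of at least 1; with n=0 the anchored group can only match an empty line.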
You also mentioned wanting to match lines that are half 0s and half 1s. Since you stipulated that all the lines have the same length, you can look at only the first line and use awk (or wc) to get the line length and choose n:
n=$(head -n1 input.txt | awk '{printf "%d\n", length($0)/2}')
c=1
grep "^\([^$c]*$c[^$c]*\)\{$n\}\$" input.txt
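If the lines might not all have the same length, a different technique (not the grep approach above) is to let awk count the character on each line and compare the count with that line's own length, still in a single pass over the file. The sketch below is an assumption-laden illustration: keep_half is a made-up name, and c is passed to gsub as a regular expression, so it should be an ordinary character such as 0 or 1.

```shell
# keep_half is a hypothetical helper name. It keeps every non-empty line in
# which the character c makes up exactly half of the line.
# Usage: keep_half CHAR [FILE...]   (reads stdin if no FILE is given)
keep_half() {
    c=$1
    shift
    awk -v c="$c" '{
        s = $0
        k = gsub(c, "", s)    # gsub returns the number of replacements made
        if (k > 0 && 2 * k == length($0)) print
    }' "$@"
}

# Lines of different lengths; only the balanced ones are kept.
printf '%s\n' 0011 0111 10 111000 1 | keep_half 1
```

For the sample input this should keep 0011, 10, and 111000. Because each line is measured on its own, there is no need to compute n from the first line beforehand.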