I have a text file containing a huge list of line numbers that I need to delete from another, main file. Here's what my data looks like:
lines.txt
1
2
4
5
22
36
400
...
and documents.txt
string1
string2
string3
...
If I had a short list of line numbers, I could easily have used
sed -i '1d;4d;5d' documents.txt
But there are a great many line numbers to delete. I could also write a bash/perl script that stores the line numbers in an array and echoes the lines whose numbers are not in the array, but I was wondering whether there is a built-in command that does just that.
Any help would be highly appreciated.
5 answers
#1
9
An awk one-liner should work for you; see the test below:
kent$ head lines.txt doc.txt
==> lines.txt <==
1
3
5
7
==> doc.txt <==
a
b
c
d
e
f
g
h
kent$ awk 'NR==FNR{l[$0];next;} !(FNR in l)' lines.txt doc.txt
b
d
f
h
As Levon suggested, here is some explanation:
awk                      # the awk command
'NR==FNR{l[$0];next;}    # first file (lines.txt): store each line (a line number to delete) as a key of the array "l"
!(FNR in l)'             # second file (doc.txt): print the line if its line number is not in "l"
lines.txt                # 1st argument: lines.txt
doc.txt                  # 2nd argument: doc.txt
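The one-liner above prints the surviving lines to standard output, whereas the question used sed -i to change the file itself. A minimal sketch of one way to get the same effect follows, assuming it is acceptable to go through a temporary file; the scratch directory and the name doc.txt.tmp are made up for illustration.

```shell
# Work in a scratch directory so no real file is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Same sample data as in the test above.
printf '%s\n' 1 3 5 7 > lines.txt
printf '%s\n' a b c d e f g h > doc.txt

# Same filter as above, written to a temporary file first;
# doc.txt is replaced only if awk exited successfully.
awk 'NR==FNR { l[$0]; next } !(FNR in l)' lines.txt doc.txt > doc.txt.tmp &&
    mv doc.txt.tmp doc.txt

cat doc.txt
```

One thing to watch: if lines.txt is empty, NR==FNR also holds for every line of doc.txt, so nothing is printed and the file would end up empty. Checking that lines.txt is non-empty first (for example with [ -s lines.txt ]) avoids that.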
#2
2
Well, I speak no Perl, and I write bash only by painful trial after trial after trial. However, Rexx would do this easily:
lines_to_delete = ""
do while lines( "lines.txt" )
lines_to_delete = lines_to_delete linein( "lines.txt" )
end
n = 0
do while lines( "documents.txt" )
line = linein( "documents.txt" )
n = n + 1
if ( wordpos( n, lines_to_delete ) == 0 ) then
call lineout "temp_out.txt", line
end
This will leave your output in temp_out.txt which you may rename to documents.txt as desired.
#3
2
Here's a way to do it with sed:
sed ':a;${s/\n//g;s/^/sed \o47/;s/$/d\o47 documents.txt/;b};s/$/d\;/;N;ba' lines.txt | sh
It uses sed to build a sed command and pipes it to the shell to be executed. The resulting sed command simply looks like sed '3d;5d;11d' documents.txt.
To build it, the outer sed command adds a d; after each number, then loops to the next line, branching back to the beginning (N; ba). When the last line is reached ($), all the newlines are removed, sed ' is prepended, and the final d and ' documents.txt are appended. Then b branches out of the :a - ba loop to the end, since no label is specified.
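To see what the generator produces before letting the shell run it, the same command can be run without the final | sh. This is only a sketch: it assumes GNU sed (the \o47 octal escape and the one-line label syntax are GNU extensions), and the three sample numbers are made up to match the example above.

```shell
# Work in a scratch directory so no real file is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical list of line numbers to delete.
printf '%s\n' 3 5 11 > lines.txt

# Same generator as above, minus "| sh": the command is captured and printed, not executed.
generated=$(sed ':a;${s/\n//g;s/^/sed \o47/;s/$/d\o47 documents.txt/;b};s/$/d\;/;N;ba' lines.txt)
printf '%s\n' "$generated"
```

Inspecting the generated command this way is a cheap safety check, since anything in lines.txt that is not a plain number would otherwise be handed to sh as part of a command.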
Here's how you can do it using join and cat -n (assuming that lines.txt is sorted):
join -t $'\v' -v 2 -o 2.2 lines.txt <(cat -n documents.txt | sed 's/^ *//;s/\t/\v/')
If lines.txt isn't sorted:
join -t $'\v' -v 2 -o 2.2 <(sort lines.txt) <(cat -n documents.txt | sed 's/^ *//;s/\t/\v/')
Edit:
Fixed a bug in the join commands in which the original versions only output the first word of each line in documents.txt.
#4
1
This might work for you (GNU sed):
sed 's/.*/&d/' lines.txt | sed -i -f - documents.txt
or:
sed ':a;$!{N;ba};s/\n/d;/g;s/^/sed -i '\''/;s/$/d'\'' documents.txt/' lines.txt | sh
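The first command above turns every line number N into the sed command Nd and feeds the result to a second sed as its script. A small sketch that makes the intermediate script visible is shown below; it writes the script to a file (the name delete.sed is made up) instead of using -f - and leaves out -i, so documents.txt is not modified.

```shell
# Work in a scratch directory so no real file is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data.
printf '%s\n' 1 3 5 7 > lines.txt
printf '%s\n' a b c d e f g h > documents.txt

# Turn every line number N into the sed command "Nd".
sed 's/.*/&d/' lines.txt > delete.sed
cat delete.sed

# Apply the generated script; the result goes to stdout and documents.txt stays as it was.
sed -f delete.sed documents.txt
```

Keeping the script in a file also sidesteps reading it from standard input, which the answer relies on GNU sed for.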
#5
0
I asked a similar question on Unix SE and got wonderful answers, among them the following awk script:
#!/bin/bash
#
# filterline keeps a subset of lines of a file.
#
# cf. https://unix.stackexchange.com/q/209404/376
#
set -eu -o pipefail
if [ "$#" -ne 2 ]; then
echo "Usage: filterline FILE1 FILE2"
echo
echo "FILE1: one integer per line indicating line number, one-based, sorted"
echo "FILE2: input file to filter"
exit 1
fi
LIST="$1" LC_ALL=C awk '
function nextline() {
if ((getline n < list) <=0) exit
}
BEGIN{
list = ENVIRON["LIST"]
nextline()
}
NR == n {
print
nextline()
}' < "$2"
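Note that the script above keeps the listed lines, while the question asks to delete them. An inverted variant might look like the sketch below. It is an adaptation, not part of the linked answer, and it assumes the list is sorted numerically with no duplicate entries; the file name kept.txt and the -1 sentinel are made up for illustration.

```shell
# Work in a scratch directory so no real file is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data; lines.txt must be sorted and free of duplicates.
printf '%s\n' 1 3 5 7 > lines.txt
printf '%s\n' a b c d e f g h > documents.txt

LIST=lines.txt LC_ALL=C awk '
function nextline() {
    # Read the next line number to delete; once the list is exhausted,
    # use -1 so that no later record number can match.
    if ((getline n < list) <= 0) n = -1
}
BEGIN {
    list = ENVIRON["LIST"]
    nextline()
}
NR == n {        # this line is on the delete list: skip it and advance the list
    nextline()
    next
}
{ print }        # every other line is kept
' documents.txt > kept.txt

cat kept.txt
```

As in the original script, only one number from the list is held in memory at a time, which is the point of this approach when the list is very large.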
And another C version, which is a bit more performant: