I need to repeatedly remove the first line from a huge text file using a bash script.
Right now I am using sed -i -e "1d" $FILE, but it takes around a minute to do the deletion.
Is there a more efficient way to accomplish this?
15 solutions
#1
837
Try GNU tail:
tail -n +2 "$FILE"
-n x: Just print the last x lines. tail -n 5 would give you the last 5 lines of the input. The + sign inverts the argument and makes tail print everything but the first x-1 lines. tail -n +1 would print the whole file, tail -n +2 everything but the first line, etc.
GNU tail is much faster than sed. tail is also available on BSD, and the -n +2 flag is consistent across both tools. Check the FreeBSD or OS X man pages for more.
The BSD version can be much slower than sed, though. I wonder how they managed that; tail should just read a file line by line, while sed does pretty complex operations involving interpreting a script, applying regular expressions and the like.
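To make the -n x and -n +x behaviour concrete, here is a small sketch that runs both forms on a generated five-line file. The file contents and the variable names (sample, last_two, all_but_first, whole) are made up for the illustration.

```shell
# Build a five-line sample file (made-up data, for illustration only).
sample=$(mktemp)
printf '%s\n' one two three four five > "$sample"

# -n 2: the last two lines of the input.
last_two=$(tail -n 2 "$sample")

# -n +2: start output at line 2, i.e. everything but the first line.
all_but_first=$(tail -n +2 "$sample")

# -n +1: start output at line 1, i.e. the whole file.
whole=$(tail -n +1 "$sample")

printf '%s\n' "$all_but_first"
rm -f "$sample"
```

Only the + form is useful for dropping leading lines; the plain form counts from the end of the input.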
Note: You may be tempted to use
# THIS WILL GIVE YOU AN EMPTY FILE!
tail -n +2 "$FILE" > "$FILE"
but this will give you an empty file. The reason is that the redirection (>) happens before tail is invoked by the shell:
- Shell truncates file $FILE
- Shell creates a new process for tail
- Shell redirects stdout of the tail process to $FILE
- tail reads from the now empty $FILE
If you want to remove the first line inside the file, you should use:
tail -n +2 "$FILE" > "$FILE.tmp" && mv "$FILE.tmp" "$FILE"
The && will make sure that the file doesn't get overwritten when there is a problem.
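The temp-file approach can be wrapped in a small function. The following is only a sketch: remove_first_line is a hypothetical helper name, and it assumes a mktemp that accepts a template argument (as the GNU and BSD versions do) so that no fixed "$FILE.tmp" name has to be chosen.

```shell
# remove_first_line FILE
# Hypothetical helper (not from the answer above): drops the first line of FILE
# by writing the remainder to a temp file and renaming it over the original.
remove_first_line() {
    local file=$1 tmp
    # Create the temp file next to the original so that mv is a rename on
    # the same filesystem rather than a copy. Note that mktemp creates the
    # file with restrictive permissions, which the result will inherit.
    tmp=$(mktemp "${file}.XXXXXX") || return 1
    if tail -n +2 "$file" > "$tmp"; then
        mv "$tmp" "$file"
    else
        # Something went wrong: leave the original untouched.
        rm -f "$tmp"
        return 1
    fi
}

# Demonstration on a throwaway three-line file.
demo=$(mktemp)
printf '%s\n' first second third > "$demo"
remove_first_line "$demo"
new_first=$(head -n 1 "$demo")
remaining=$(( $(wc -l < "$demo") ))
rm -f "$demo"
```

This still copies the whole file on every call, so it does not change the cost per deletion; it only guards against clobbering the original.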
#2
106
You can use -i to update the file without using the '>' operator. The following command will delete the first line from the file and save the result back to the same file.
sed -i '1d' filename
#3
65
For those who are on SunOS, which is non-GNU, the following code will help:
sed '1d' test.dat > tmp.dat
#4
16
No, that's about as efficient as you're going to get. You could write a C program which could do the job a little faster (less startup time and processing arguments) but it will probably tend towards the same speed as sed as files get large (and I assume they're large if it's taking a minute).
But your question suffers from the same problem as so many others in that it pre-supposes the solution. If you were to tell us in detail what you're trying to do rather than how, we may be able to suggest a better option.
For example, if this is a file A that some other program B processes, one solution would be to not strip off the first line, but modify program B to process it differently.
Let's say all your programs append to this file A and program B currently reads and processes the first line before deleting it.
You could re-engineer program B so that it doesn't try to delete the first line but instead maintains a persistent (probably file-based) offset into the file A so that, next time it runs, it can seek to that offset, process the line there, and update the offset.
Then, at a quiet time (midnight?), it could do special processing of file A to delete all lines currently processed and set the offset back to 0.
It will certainly be faster for a program to open a file and seek within it than to open and rewrite it. This discussion assumes you have control over program B, of course. I don't know if that's the case but there may be other possible solutions if you provide further information.
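The offset idea can be sketched in shell as well. Everything named here (next_line, the data and state files) is hypothetical, and the sketch assumes newline-terminated lines without NUL bytes and a tail that can start output at a byte position with -c +N. Whether tail actually seeks instead of reading up to that position depends on the implementation.

```shell
# next_line DATA_FILE OFFSET_FILE
# Hypothetical helper: prints the next unprocessed line of DATA_FILE and
# advances the byte offset stored in OFFSET_FILE. Returns 1 when no data is left.
next_line() {
    local data=$1 state=$2 offset=0 size line len
    if [ -s "$state" ]; then
        offset=$(cat "$state")
    fi
    size=$(( $(wc -c < "$data") ))
    if [ "$offset" -ge "$size" ]; then
        return 1
    fi
    # tail -c +N starts output at byte N (counting from 1).
    line=$(tail -c "+$((offset + 1))" "$data" | head -n 1)
    # Byte length of the line, plus one for its newline.
    len=$(( $(printf '%s' "$line" | wc -c) ))
    echo $((offset + len + 1)) > "$state"
    printf '%s\n' "$line"
}

# Demonstration with made-up data.
data=$(mktemp)
state=$(mktemp)
printf '%s\n' first second third > "$data"
l1=$(next_line "$data" "$state")
l2=$(next_line "$data" "$state")
l3=$(next_line "$data" "$state")
if next_line "$data" "$state" > /dev/null; then more=yes; else more=no; fi
rm -f "$data" "$state"
```

The data file is never rewritten here; only the small state file changes, and the periodic cleanup described above would reset it to 0 after trimming the processed lines.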
#5
9
You can edit the files in place: just use perl's -i flag, like this:
perl -ni -e 'print unless $. == 1' filename.txt
This makes the first line disappear, as you ask. Perl will need to read and copy the entire file, but it arranges for the output to be saved under the name of the original file.
#6
8
As Pax said, you probably aren't going to get any faster than this. The reason is that almost no filesystems support truncating from the beginning of a file, so this is going to be an O(n) operation, where n is the size of the file. What you can do much faster, though, is overwrite the first line with the same number of bytes (maybe with spaces or a comment), which might work for you depending on exactly what you are trying to do (what is that, by the way?).
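A rough sketch of the overwrite idea, using dd with conv=notrunc so that only the leading bytes are rewritten. blank_first_line is a hypothetical helper name, and the sketch assumes that whatever reads the file afterwards tolerates a line consisting only of spaces.

```shell
# blank_first_line FILE
# Hypothetical helper: overwrites the first line of FILE with spaces, leaving
# its newline and the rest of the file untouched.
blank_first_line() {
    local file=$1 len
    # Byte length of the first line without its trailing newline.
    len=$(( $(head -n 1 "$file" | tr -d '\n' | wc -c) ))
    if [ "$len" -eq 0 ]; then
        return 0
    fi
    # Write exactly $len spaces at the start of the file; conv=notrunc stops
    # dd from truncating everything after them.
    printf '%*s' "$len" '' | dd of="$file" conv=notrunc 2>/dev/null
}

# Demonstration on a throwaway three-line file.
demo=$(mktemp)
printf '%s\n' first second third > "$demo"
size_before=$(( $(wc -c < "$demo") ))
blank_first_line "$demo"
size_after=$(( $(wc -c < "$demo") ))
first_now=$(head -n 1 "$demo")
second_now=$(sed -n '2p' "$demo")
rm -f "$demo"
```

The file keeps its size and line count, so this only helps when a blanked line is as good as a removed one. Repeating it for later lines would also need the byte offset of the line to blank, which this sketch does not track.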
#7
5
The sponge util (from moreutils) avoids the need for juggling a temp file:
tail -n +2 "$FILE" | sponge "$FILE"
#8
3
How about using csplit?
man csplit
csplit -k file 1 '{1}'
#9
3
You could use vim to do this:
vim -u NONE +'1d' +'wq!' /tmp/test.txt
This should be faster, since vim won't read the whole file when processing it.
#10
2
This should show all lines except the first:
cat textfile.txt | tail -n +2
#11
2
If you want to modify the file in place, you could always use the original ed instead of its streaming successor sed:
ed "$FILE" <<<$'1d\nwq\n'
#12
0
Since it sounds like I can't speed up the deletion, I think a good approach might be to process the file in batches like this:
while [ -s file1 ]; do
    head -n 1000 file1 > file2
    process file2    # placeholder for whatever consumes the batch
    sed -i -e "1,1000d" file1
done
The drawback of this is that if the program gets killed in the middle (or if there's some bad SQL in there, causing the "process" part to die or lock up), there will be lines that are either skipped or processed twice.
(file1 contains lines of SQL code)
#13
-1
Would using tail to get the last N-1 lines and directing that into a file, followed by removing the old file and renaming the new file to the old name, do the job?
If I were doing this programmatically, I would read through the file and remember the file offset after reading each line, so I could seek back to that position to read the file with one less line in it.
#14
-1
If what you are looking to do is recover after failure, you could just build up a file that has what you've done so far.
if [[ -f "$tmpf" ]] ; then
    rm -f "$tmpf"
fi
# Read the source line by line without mangling backslashes or whitespace.
while IFS= read -r line ; do
    # process line
    printf '%s\n' "$line" >> "$tmpf"
done < "$srcf"
#15
-1
You can easily do this with:
cat filename | sed 1d > filename_without_first_line
on the command line; or, to remove the first line of a file permanently, use the in-place mode of sed with the -i flag:
sed -i 1d <filename>