Copying part of a large file using the command line

Time: 2022-08-27 00:38:11

I have a text file with 2 million lines. Each line holds some transaction information.

e.g.

23848923748, sample text, field2, 12/12/2008

etc

What I want to do is create a new file containing everything from a certain unique transaction number onwards. In other words, I want to split the file at the line where this number appears.

How can I do this from the command line?

I can find the line by doing this:

cat myfile.txt | grep 23423423423

5 solutions

#1


On a random file in my tmp directory, this is how I print everything from the first line matching popd onwards in a file named tmp.sh:

tail -n +"$(grep -n popd tmp.sh | head -n 1 | cut -f 1 -d:)" tmp.sh

tail -n +X prints from line number X onwards; grep -n prefixes each matching line with its line number (lineno:line), head -n 1 keeps only the first match, and cut extracts just the line number.

So for your case it would be:

tail -n +"$(grep -n 23423423423 myfile.txt | head -n 1 | cut -f 1 -d:)" myfile.txt

Because of head -n 1, this prints from the first occurrence onwards, even if the number appears again later in the file.

#2


Use sed like this:

sed '/23423423423/,$!d' myfile.txt

Just make sure the transaction number cannot also match somewhere else in the file, especially on a line before the intended one, because sed starts printing at the first line that matches the pattern.
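One way to guard against such stray matches, as a minimal sketch: anchor the pattern to the start of the line and include the trailing comma. This assumes the transaction number is always the first comma-separated field, as in the sample line from the question. The file sample.txt and its contents are made up for illustration.

```shell
# Build a small hypothetical sample; the real input would be myfile.txt.
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.txt" <<'EOF'
111, refers to 23423423423 in the text, field2, 11/12/2008
23423423423, sample text, field2, 12/12/2008
333, more text, field2, 13/12/2008
EOF

# ^ anchors the match to the start of the line and the comma ends the field,
# so the mention of the number on the first line is ignored.
# -n suppresses the default output; p prints from the first match to the last line ($).
sed -n '/^23423423423,/,$p' "$tmpdir/sample.txt" > "$tmpdir/tail_part.txt"

cat "$tmpdir/tail_part.txt"
```

The -n ... p form prints the same range that the !d form keeps; it is used here only because it reads a little more directly.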


There is already a 'perl' answer here, so I'll give one more way, using AWK :-)

awk 'BEGIN {skip=1} /23423423423/ {skip=0} {if (skip!=1) print $0}' myfile.txt
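A variant of the same AWK idea, sketched under the assumption that the transaction number is the first comma-separated field: compare that field exactly instead of pattern-matching the whole line. The variable names id and found and the sample data are invented for this example.

```shell
# Hypothetical sample data; the first line only mentions the number in its text.
tmpdir=$(mktemp -d)
printf '%s\n' \
  '111, mentions 23423423423, field2, 11/12/2008' \
  '23423423423, sample text, field2, 12/12/2008' \
  '333, more text, field2, 13/12/2008' > "$tmpdir/sample.txt"

# -F, splits each line on commas; -v passes the id in as an awk variable.
# Once the first field equals the id, found stays set and every line is printed.
awk -F, -v id=23423423423 '$1 == id {found=1} found' "$tmpdir/sample.txt" > "$tmpdir/from_id.txt"

cat "$tmpdir/from_id.txt"
```

Passing the number with -v keeps it out of the program text, so the same one-liner can be reused for a different transaction number.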

#3


It's not a pretty solution, but how about using grep's -A option?

Like this:

mc@zolty:/tmp$ cat a
1
2
3
4
5
6
7
mc@zolty:/tmp$ cat a | grep 3 -A1000000
3
4
5
6
7

The only problem I see with this solution is the magic number 1000000. Someone probably knows a way to do this without such a trick.
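One possible way to drop the magic number, as a sketch: use the file's own line count as the context size, since no match can have more trailing lines than that. The variable name total is made up here, and -A is a GNU/BSD grep extension rather than a POSIX option.

```shell
# Recreate the 7-line file from the transcript above in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' 1 2 3 4 5 6 7 > "$tmpdir/a"

# The arithmetic expansion strips the padding some wc implementations add.
total=$(($(wc -l < "$tmpdir/a")))

# Print the first matching line plus up to $total lines of trailing context.
grep -A "$total" 3 "$tmpdir/a" > "$tmpdir/out"

cat "$tmpdir/out"
```

This reads the file twice (once for wc, once for grep), which is usually acceptable for a file of a couple of million lines.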

#4


You can probably get the line number using grep and then use tail to print the file from that point into your output file.

Sorry, I don't have actual code to show, but hopefully the idea is clear.
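The idea can be sketched roughly as follows. The helper name copy_from_match and the file names are hypothetical, and the no-match check is an addition that the answer does not mention.

```shell
# copy_from_match PATTERN INPUT OUTPUT  (hypothetical helper)
copy_from_match() {
  pattern=$1 input=$2 output=$3
  # grep -n prefixes each matching line with "N:"; keep only the first match's N.
  start=$(grep -n -- "$pattern" "$input" | head -n 1 | cut -d: -f1)
  if [ -z "$start" ]; then
    echo "no line matches $pattern" >&2
    return 1
  fi
  # tail -n +N prints from line N to the end of the file.
  tail -n +"$start" "$input" > "$output"
}

# Try it on a small made-up input file.
tmpdir=$(mktemp -d)
printf '%s\n' 'a, 1' '23423423423, sample text' 'c, 3' > "$tmpdir/in.txt"
copy_from_match 23423423423 "$tmpdir/in.txt" "$tmpdir/out.txt"
cat "$tmpdir/out.txt"
```

Checking for an empty line number first avoids handing tail a malformed -n argument when nothing matches.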

#5


I would write a quick Perl script, frankly. Perl is invaluable for relatively simple jobs like this, and as soon as something more complex rears its head (as it will!), you'll need the extra power.

Something like:

#!/usr/bin/perl
use strict;
use warnings;

my $out = 0;
while (<STDIN>) {
   # Start printing at the first line containing the transaction number.
   $out = 1 if /23423423423/;
   print $_ if $out;
}

And run it using:

$ perl mysplit.pl < input > output

Not tested, I'm afraid.
