Hi guys, I hope the subject is clear enough; I haven't found anything specifically about this among the previously asked questions. I've tried implementing this in Perl or Python, but I think I may be trying too hard.
Is there a simple shell command or pipeline that will split my 4 MB .txt file into separate .txt files, based on a beginning and ending regex?
I provide a short sample of the file below, so you can see that every "story" starts with the phrase "X of XXX DOCUMENTS", which could be used to split the file.
I think this should be easy, and I'd be surprised if bash can't do it faster than Perl/Python.
Here it is:
1 of 999 DOCUMENTS
Copyright 2011 Virginian-Pilot Companies LLC
All Rights Reserved
The Virginian-Pilot(Norfolk, VA.)
...
3 of 999 DOCUMENTS
Copyright 2011 Canwest News Service
All Rights Reserved
Canwest News Service
...
Thanks in advance for all your help.
Ross
5 solutions
#1
22
awk '/[0-9]+ of [0-9]+ DOCUMENTS/{g++} { print $0 > g".txt"}' file
OS X users will need gawk, as the built-in awk will produce an error like: awk: illegal statement at source line 1
Ruby(1.9+)
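#!/usr/bin/env ruby
g = 1
# 1.txt collects anything before the first header (it stays empty if the file starts with one)
f = File.open("#{g}.txt", "w")
File.foreach("file") do |line|
  if line =~ /\d+ of \d+ DOCUMENTS/
    # A header line starts the next numbered output file
    f.close
    g += 1
    f = File.open("#{g}.txt", "w")
  end
  f.print line
end
f.close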
#2
9
As suggested in other solutions, you could use csplit for that:
csplit csplit.test '/^\.\.\./' '{*}' && sed -i '/^\.\.\./d' xx*
I haven't found a better way to get rid of the leftover separator in the split files.
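For comparison, here is a minimal Python sketch of the same idea that drops the separator while splitting, so no second clean-up pass is needed. It assumes the separator is any line starting with "..." (as in the csplit pattern above); the function name `split_on_separator` is made up for illustration, and it only groups lines in memory rather than writing files.

```python
def split_on_separator(lines, is_separator=lambda line: line.startswith("...")):
    """Group lines into chunks, ending a chunk at each separator line.

    Separator lines themselves are discarded, and empty chunks are skipped.
    """
    chunks, current = [], []
    for line in lines:
        if is_separator(line):
            if current:
                chunks.append(current)
            current = []
        else:
            current.append(line)
    if current:  # keep a trailing chunk that has no closing separator
        chunks.append(current)
    return chunks
```

Each returned chunk could then be written to its own file with `writelines`.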
#3
1
How hard did you try in Perl?
Edit: Here is a faster method. It splits the file, then prints the part files.
use strict;
use warnings;

my $count = 1;
open (my $file, '<', 'source.txt') or die "Can't open source.txt: $!";
# Slurp the file, then split it in front of every header line
for (split /(?=^.*\d+[^\S\n]*of[^\S\n]*\d+[^\S\n]*DOCUMENTS)/m, join('',<$file>))
{
    # Strip the header line; the non-greedy .*? lets $1 capture the whole document number
    if ( s/^.*?(\d+)\s*of\s*\d+\s*DOCUMENTS.*(\n|$)//m )
    {
        open (my $part, '>', "Part${1}_$count.txt")
            or die "Can't open Part${1}_$count for output: $!";
        print $part $_;
        close ($part);
        $count++;
    }
}
close ($file);
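Since the question mentions Python, the same slurp-and-split idea can be sketched there too. This is only an illustration under a few assumptions: the header sits on a line of its own, splitting on a zero-width match needs Python 3.7 or later, and the name `split_text` is invented here. Unlike the Perl version, it keeps the header line in each part.

```python
import re

# Zero-width split point: the start of any line that is a header line.
SPLIT_AT = re.compile(r"^(?=[ \t]*\d+ of \d+ DOCUMENTS[ \t]*$)", re.MULTILINE)

def split_text(text):
    """Return the non-blank pieces of text, split in front of every header line.

    Text before the first header, if any, comes back as its own piece.
    """
    return [part for part in SPLIT_AT.split(text) if part.strip()]
```

Each piece can then be written out with a counter-based file name, as in the Perl loop.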
This is the line-by-line method:
use strict;
use warnings;

open (my $masterfile, '<', 'yourfilename.txt') or die "Can't open yourfilename.txt: $!";
my $count = 1;
my $fh;
while (<$masterfile>) {
    # A header line closes the current part and opens the next one
    if ( /(?<!\d)(\d+)\s*of\s*\d+\s*DOCUMENTS/ ) {
        defined $fh and close ($fh);
        open ($fh, '>', "Part${1}_$count.txt") or die "Can't open Part${1}_$count for output: $!";
        $count++;
        next;    # the header line itself is not written
    }
    defined $fh and print $fh $_;
}
defined $fh and close ($fh);
close ($masterfile);
#4
0
The regex to match "X of XXX DOCUMENTS" is
\d{1,3} of \d{1,3} DOCUMENTS
Reading line by line and starting a new output file on each regex match should be fine.
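A rough Python sketch of that line-by-line approach follows. The zero-padded output names, the UTF-8 encoding, and the `split_by_header` name are assumptions for illustration, not anything from the question. The pattern is also loosened to `\d+` so document counts above 999 still match.

```python
import re
from pathlib import Path

# A header such as "1 of 999 DOCUMENTS" on a line of its own (assumed layout).
HEADER = re.compile(r"^\s*\d+ of \d+ DOCUMENTS\s*$")

def split_by_header(source, out_dir):
    """Write each story to out_dir/NNNN.txt and return the list of paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    out = None
    with open(source, encoding="utf-8") as src:
        for line in src:
            if HEADER.match(line):
                # A header closes the current file and starts the next one.
                if out is not None:
                    out.close()
                path = out_dir / f"{len(paths) + 1:04d}.txt"
                out = open(path, "w", encoding="utf-8")
                paths.append(path)
            if out is not None:  # lines before the first header are dropped
                out.write(line)
    if out is not None:
        out.close()
    return paths
```

Calling `split_by_header("file", "parts")` would then produce parts/0001.txt, parts/0002.txt, and so on, each beginning with its header line.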
#5
-1
Untested:
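base=outputfile
start=0
pattern='^[[:blank:]]*[[:digit:]]+ of [[:digit:]]+ DOCUMENTS[[:blank:]]*$'
while IFS= read -r line
do
    if [[ $line =~ $pattern ]]
    then
        start=$((start + 1))
        printf -v filecount '%04d' "$start"
        : >"$base$filecount"    # create an empty file named like outputfile0001
    fi
    # Lines before the first header are skipped (no output file exists yet)
    [[ -n $filecount ]] && printf '%s\n' "$line" >>"$base$filecount"
done < file    # input file name assumed; the original did not specify one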