I am working on a bioinformatics project where I am looking at very large genomes. Seg only reads 135 lines at a time, so when we feed the genomes in it gets overloaded. I am trying to create a Perl command that will split the input into 135-line sections. The character limit would be 10,800, since there are 80 columns. This is what I have so far:
#!/usr/bin/perl
use warnings;
use strict;
my $str =
'>AATTCCGG
TTCCGGAA
CCGGTTAA
AAGGTTCC
>AATTCCGG';
substr($str,17) = "";
print "$str";
It splits at the 17th character but only prints that section; I want it to continue printing the rest of the data. How do I add a command that allows the rest of the data to be shown? It should split at every 17th character and keep going. (Then of course I can go back in and scale it up to the size I actually need.)
2 Answers
#1 (1 vote)
I assume that the "very large genome" is stored in a very large file, and that it is fine to collect data by number of lines (and not by number of characters) since this is the first mentioned criterion.
Then you can read the file line by line and assemble lines until there are 135 of them. Then hand them off to a program or routine that processes them, empty your buffer, and keep going:
use warnings;
use strict;
use feature 'say';

my $file = shift || 'default_filename.txt';
my $num_lines_to_process = 135;

open my $fh, '<', $file or die "Can't open $file: $!";

my @buffer;
my $line_counter = 0;  # initialize to avoid an uninitialized-value warning

while (<$fh>) {
    chomp;
    if ($line_counter == $num_lines_to_process) {
        process_data(\@buffer);
        @buffer = ();
        $line_counter = 0;
    }
    push @buffer, $_;
    ++$line_counter;
}
process_data(\@buffer) if @buffer;  # last batch

sub process_data {
    my ($rdata) = @_;
    say for @$rdata; say '---';  # print data for a test
}
If your processing application/routine wants a string, you can append to a string each time instead of adding to an array, $buffer .= $_; and clear it with $buffer = ''; as needed.
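For instance, the string-buffer variant of the loop above might look like this (a sketch using an in-memory filehandle instead of a real file, with the batch size shrunk to 2 so the result is easy to inspect):

```perl
use warnings;
use strict;

my $num_lines_to_process = 2;   # 135 in the real script

# Stand-in for the genome file: open a filehandle on a scalar reference
my $input = "line1\nline2\nline3\n";
open my $fh, '<', \$input or die "Can't open in-memory file: $!";

my @chunks;        # collect the batches so we can look at them afterwards
my $buffer = '';
my $line_counter = 0;

while (my $line = <$fh>) {
    $buffer .= $line;                     # append, newline and all
    if (++$line_counter == $num_lines_to_process) {
        push @chunks, $buffer;            # hand the full chunk off
        $buffer = '';                     # clear the buffer
        $line_counter = 0;
    }
}
push @chunks, $buffer if length $buffer;  # last, partial chunk

print scalar(@chunks), " chunks\n";       # 2 chunks
```

The first chunk holds two complete lines ("line1\nline2\n") and the second holds the leftover "line3\n", which is the behavior you want for a trailing partial batch.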
If you need to pass a string but there is also some use for an array while collecting data (intermediate checks/pruning/processing?), then collect lines into an array, use it as needed, and join it into a string before handing it off: my $data = join '', @buffer;
You can also make use of the $. variable and the modulo operator (%):
while (<$fh>) {
    chomp;
    push @buffer, $_;
    if ($. % $num_lines_to_process == 0) {  # every $num_lines_to_process lines
        process_data(\@buffer);
        @buffer = ();
    }
}
process_data(\@buffer) if @buffer;  # last batch
In this case we need to first store a line and then check its number, since $. (the line number last read from a filehandle; see the docs linked above) starts from 1, not 0.
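To see that off-by-one point concretely, here is a minimal sketch (again with an in-memory filehandle standing in for the real file, and a batch size of 2):

```perl
use warnings;
use strict;

my $input = "a\nb\nc\nd\ne\n";
open my $fh, '<', \$input or die "Can't open in-memory file: $!";

my @boundaries;
while (my $line = <$fh>) {
    chomp $line;
    # $. is 1 on the first iteration, so the line must already be stored
    # (pushed) by the time this test fires at a batch boundary
    push @boundaries, $. if $. % 2 == 0;
}
print "batch boundaries after lines: @boundaries\n";  # 2 4
```

The boundary fires after lines 2 and 4, i.e. after each full batch has been collected, which is why push comes before the modulo test in the loop above.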
#2 (0 votes)
The four-argument form of substr replaces the selected part of the string and returns what was removed; you can just run it in a loop:
while (length $str) {
    my $substr = substr $str, 0, 17, "";
    print $substr, "\n";
}
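Applied to the string from the question, this loop behaves as follows (a self-contained sketch; note that the embedded newlines count toward the 17 characters, which may or may not be what you want for real FASTA data):

```perl
use warnings;
use strict;

my $str = ">AATTCCGG\nTTCCGGAA\nCCGGTTAA\nAAGGTTCC\n>AATTCCGG";

my @chunks;
while (length $str) {
    # 4-arg substr: extract the first 17 chars and delete them from $str,
    # shrinking $str each time until nothing is left
    push @chunks, substr $str, 0, 17, "";
}
print scalar(@chunks), " chunks\n";  # 3 chunks
print "$_\n---\n" for @chunks;
```

The 46-character string yields two 17-character chunks and a final 12-character remainder, so nothing is lost at the end.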