How do I split my data into chunks small enough to feed to Seg?

Time: 2021-12-16 21:33:00

I am working on a bioinformatics project where I am looking at very large genomes. Seg only reads 135 lines at a time, so when we feed the genomes in it gets overloaded. I am trying to create a Perl command that will split the input into 135-line sections. The character limit would be 10,800 since there are 80 columns. This is what I have so far:


#!/usr/bin/perl
use warnings;
use strict;

my $str = 
'>AATTCCGG
TTCCGGAA
CCGGTTAA
AAGGTTCC
>AATTCCGG';



substr($str,17) = "";

print "$str";

It splits at the 17th character but only prints that section; I want it to continue printing the rest of the data. How do I add a command that allows the rest of the data to be shown? It should split at every 17th character and keep going. (Then of course I can go back in and scale it up to the size I actually need.)


2 Answers

#1



I assume that the "very large genome" is stored in a very large file, and that it is fine to collect data by number of lines (and not by number of characters) since this is the first mentioned criterion.


Then you can read the file line by line and collect lines until there are 135 of them. Then hand them off to a program or routine that processes them, empty your buffer, and keep going.


use warnings;
use strict;
use feature 'say';

my $file = shift || 'default_filename.txt';
my $num_lines_to_process = 135;

open my $fh, '<', $file or die "Can't open $file: $!";

my $line_counter = 0;  # initialize so the first comparison doesn't warn
my @buffer;

while (<$fh>) {
    chomp;
    if ($line_counter == $num_lines_to_process) 
    {
        process_data(\@buffer);
        @buffer = ();
        $line_counter = 0;
    }
    push @buffer, $_;
    ++$line_counter;
}

process_data(\@buffer) if @buffer;  # last batch

sub process_data {
    my ($rdata) = @_;
    say for @$rdata; say '---';  # print data for a test
}

If your processing application/routine wants a string, you can append to a string each time instead of pushing to an array, $buffer .= $_; and clear it with $buffer = ''; as needed.
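For instance, the loop above can be reworked to keep a string buffer instead of an array. A minimal runnable sketch: the batch size is reduced to 4 and the input comes from an in-memory filehandle so the demo is self-contained, and process_data is a stand-in for whatever handles each batch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Batch size reduced to 4 for this demo; the question uses 135
my $num_lines_to_process = 4;

# Sample data in an in-memory filehandle; in practice open a real file
my $genome = ">AATTCCGG\nTTCCGGAA\nCCGGTTAA\nAAGGTTCC\n>AATTCCGG\n";
open my $fh, '<', \$genome or die "Can't open in-memory handle: $!";

my $buffer = '';
my $line_counter = 0;
our @batches;    # batches collected here so they can be inspected afterwards

while (my $line = <$fh>) {
    if ($line_counter == $num_lines_to_process) {
        process_data($buffer);
        $buffer = '';
        $line_counter = 0;
    }
    $buffer .= $line;   # append the raw line, newline included
    ++$line_counter;
}
process_data($buffer) if length $buffer;  # last, possibly short, batch

sub process_data {
    my ($data) = @_;
    push @batches, $data;
    print $data, "---\n";   # print the batch for a test
}
```

With five input lines and a batch size of four, this hands off two batches: the first four lines, then the remaining one.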


If you need to pass a string but there is also some use of an array while collecting data (intermediate checks/pruning/processing?), then collect lines into an array, use it as needed, and join it into a string before handing it off: my $data = join '', @buffer;


You can also make use of the $. variable and the modulo operator (%):


while (<$fh>) {
    chomp;

    push @buffer, $_;

    if ($. % $num_lines_to_process == 0)  # every $num_lines_to_process
    {
         process_data(\@buffer);
         @buffer = ();
    }
}

process_data(\@buffer) if @buffer;  # last batch

In this case we need to first store a line and then check its number, since $. (line number read from a filehandle, see docs linked above) starts from 1 (not 0).


#2



substr returns the removed part of a string; you can just run it in a loop:


while (length $str) {
    my $substr = substr $str, 0, 17, "";
    print $substr, "\n";
}
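Applied to the sample string from the question, this loop might look like the following self-contained sketch. Note that the embedded newlines count toward the 17 characters, so a 46-character string yields chunks of 17, 17, and 12 characters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The sample string from the question (newlines are part of the count)
my $str = ">AATTCCGG\nTTCCGGAA\nCCGGTTAA\nAAGGTTCC\n>AATTCCGG";

our @chunks;    # collected so the chunks can be inspected afterwards
while (length $str) {
    # Four-argument substr replaces the first 17 characters with ""
    # and returns the removed part; $str shrinks on every iteration
    my $chunk = substr $str, 0, 17, "";
    push @chunks, $chunk;
    print $chunk, "\n";
}
```

The last chunk is simply whatever is left, since substr with a length past the end of the string returns up to the end.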
