Perl: regex only matching some of the time

Date: 2021-05-16 08:56:36

My regex is only matching some of the expressions. When I test the expression on regex101.com it works just fine... What could be the issue with my code?

Thanks for your help in advance.

Example file, "surfacecoating":

[
('amino acids', 339, 350), 
('copper', 71, 77), 
('copper', 0, 6), 
('copper', 291, 297), 
('amino acids', 119, 130)]

What Dumper prints out for this file (note the first 3 matches are not returned):

'surfacecoating' => {
        'copper' => '291',
        'amino acids' => '119'
    },

The code:

#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;

determine_cde_instances();

sub determine_cde_instances {
    my %cdeinstances;
    my %cde_instances;

    my $dir = "/results/CDE";
    opendir my $dh, $dir  or die "Can't open $dir: $!";

    while (my $file = readdir($dh)) {
        next if ($file =~ m/^\./);
        next if -d "$dir/$file";   # readdir returns bare names, so test the full path

        open my $fh, '<', "$dir/$file" or die "Can't open $dir/$file: $!";

        while (my $line = <$fh>)
        {
                if (my ($instance) = $line =~ m/'(.*?)', (.*?), /)     
                {
                    my $instance = $1;
                    my $pos = $2;
                    $cde_instances{$file}{$instance} = $pos;
                }
        }
        close $fh;
    }    
    closedir $dh;

    print Dumper(\%cde_instances);
    return %cde_instances;
}

1 Answer

#1



You are storing the information in a hash keyed by $instance, but some of those keys appear on more than one line of your data. So the key 'copper' gets overwritten repeatedly, and you end up with only its last occurrence. The same happens with 'amino acids'.

Since the keywords meant to be hash keys repeat, you can't use a plain hash. You'll need a different data structure, and which one depends on what you need to do with the data.
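
To see the overwriting in isolation, here is a minimal sketch. The @pairs list is a hand-written stand-in for the (name, position) pairs the regex pulls from the example file, and %plain is a hypothetical name used only for this illustration.

```perl
use strict;
use warnings;

# Hand-written (name, position) pairs mirroring the example file.
my @pairs = (
    [ 'amino acids', 339 ],
    [ 'copper',      71  ],
    [ 'copper',      0   ],
    [ 'copper',      291 ],
    [ 'amino acids', 119 ],
);

my %plain;
for my $pair (@pairs) {
    my ($name, $pos) = @$pair;
    $plain{$name} = $pos;    # a repeated name replaces the earlier value
}

# Only the last position seen for each name survives.
printf "%s => %d\n", $_, $plain{$_} for sort keys %plain;
```

Five pairs go in, but only two keys come out, which matches what Dumper showed in the question.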

A reasonable idea is to use an array, perhaps an array of hashrefs, one for each pair:

if ($line =~ m/'(.*?)', (.*?), /)     
{
    my %instance_pos = ($1, $2);

    push @{$cde_instances{$file}}, \%instance_pos;
}

Here each key $file in the hash %cde_instances has an arrayref as its value, holding one hashref for each instance-pos pair. There are other choices, of course; this is more of an example.
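
As a rough sketch of how such a structure can be built and read back: the @lines array below stands in for the contents of one file, and %by_name is a hypothetical second layout (one list of positions per name), shown only as one of the other possible choices.

```perl
use strict;
use warnings;

# Stand-in for the lines of the example file "surfacecoating".
my @lines = (
    "('amino acids', 339, 350), ",
    "('copper', 71, 77), ",
    "('copper', 0, 6), ",
    "('copper', 291, 297), ",
    "('amino acids', 119, 130)]",
);
my $file = 'surfacecoating';

my %cde_instances;
for my $line (@lines) {
    if ($line =~ m/'(.*?)', (.*?), /) {
        push @{ $cde_instances{$file} }, { $1 => $2 };
    }
}

# Read it back: every hashref holds exactly one name => position pair.
for my $pair (@{ $cde_instances{$file} }) {
    my ($name, $pos) = %$pair;
    print "$file: $name at $pos\n";
}

# Alternative layout: group all positions under their name.
my %by_name;
for my $pair (@{ $cde_instances{$file} }) {
    my ($name, $pos) = %$pair;
    push @{ $by_name{$name} }, $pos;
}
```

The array of hashrefs keeps the original line order, while the grouped layout makes it easier to look up every position for a given name. Which one fits better depends on how the data is used afterwards.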

This can also be written as

if (my %instance_pos = $line =~ m/'(.*?)', (.*?), /) {
    push @{$cde_instances{$file}}, \%instance_pos;
}

or just

if ($line =~ m/'(.*?)', (.*?), /) {    
    push @{$cde_instances{$file}}, {$1, $2};
}

If you need to check or validate the captures, assign them to two variables from the regex match instead.
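
A minimal sketch of that variant follows. The two checks (a non-empty name and an all-digit position) are assumptions about what valid data looks like, and the second and third sample lines are invented to trigger them.

```perl
use strict;
use warnings;

my @lines = (
    "('copper', 71, 77), ",
    "('', 12, 20), ",         # invented: empty name, rejected below
    "('zinc', n/a, 20), ",    # invented: non-numeric position, rejected below
);
my $file = 'surfacecoating';
my %cde_instances;

for my $line (@lines) {
    # List assignment in scalar context yields the number of captures,
    # so a failed match gives 0 and the line is skipped.
    my ($instance, $pos) = $line =~ m/'(.*?)', (.*?), /
        or next;

    next unless length $instance;    # assumed rule: name must not be empty
    next unless $pos =~ /^\d+$/;     # assumed rule: position is an integer

    push @{ $cde_instances{$file} }, { $instance => $pos };
}

print scalar @{ $cde_instances{$file} }, " valid pair(s) kept\n";
```

Named variables also make the later code easier to read than $1 and $2, which are reset by the next successful match.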

With the above change, and using Data::Dump (use Data::Dump qw(dd);) to print, I get

{
  "data.txt" => [
    { "amino acids" => 339 },
    { copper => 71 },
    { copper => 0 },
    { copper => 291 },
    { "amino acids" => 119 },
  ],
}

Note that the second number on each line isn't captured by your regex. I take that to be intentional. Please clarify if it isn't so.
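
If the second number turns out to be needed after all, a stricter pattern along these lines could capture both. This is only a sketch: it assumes every line has the shape ('name', number, number), and the field names start and end are guesses at what the two numbers mean, not something stated in the question.

```perl
use strict;
use warnings;

my @lines = (
    "('copper', 71, 77), ",
    "('amino acids', 119, 130)]",
);

my @spans;
for my $line (@lines) {
    # Assumed layout: ('name', integer, integer)
    if (my ($name, $first, $second) =
            $line =~ m/'([^']*)',\s*(\d+),\s*(\d+)\)/) {
        push @spans, { name => $name, start => $first, end => $second };
    }
}

print "$_->{name}: $_->{start}-$_->{end}\n" for @spans;
```

Using [^']* and \d+ instead of .*? ties each capture to the expected kind of content, so a malformed line fails to match rather than yielding odd values.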
