找到一组对的所有子集,其中每个值都与其他值相连接

时间:2022-04-18 21:46:42

This is a logic problem I'm not quite equipped to deal with it seems.

这是一个逻辑问题,我似乎还没有准备好去解决它。

I have a data set of paired samples which are each identified by a unique number. A pair indicates a certain level of relatedness between the samples. I want to group these samples such that every member of a group is supported by a pair to every other member of the group.

我有一组成对的样本,每个样本都有一个唯一的数。一对表示样本之间一定程度的关联性。我想对这些样本进行分组,使组中的每个成员都得到组中的每个成员的一对支持。

For example, in the data set below pairs 6, 7, and 28 constitute a group of 3. Pairs 22 through 27 constitute a group of 4, in this case there 4 groups of 3 inside this group which I don't want in the output. More difficult are pairs 10, 11, and 17 which are another group of 3.

例如,在下面的数据集中,第6、7和28对组成了一个3组。22到27对构成了一个4的组,在这种情况下,这个组中有4组3我不希望在输出中出现。更困难的是对10、11和17,这是另一组3。

Data set:

数据集:

    [,1] [,2] [1,]    6  267 [2,]    9   10 [3,]   11   12 [4,]   79   80 [5,]   96  570 [6,]  314  583 [7,]  314  584 [8,]  425  426 [9,]  427  428[10,]  427  429[11,]  427  430[12,]  427  472[13,]  427  473[14,]  427  474[15,]  428  430[16,]  428  473[17,]  429  430[18,]  430  472[19,]  430  473[20,]  430  474[21,]  472  474[22,]  517  519[23,]  517  520[24,]  517  521[25,]  519  520[26,]  519  521[27,]  520  521[28,]  583  584[29,]  649  650

A couple things about the data set: the value in column 2 of a pair will always be greater than the value in column 1 and the values in column 1 are in ascending order.

关于数据集有两点:一对的第2列的值总是大于第1列的值,第1列的值是按升序排列的。

This data set is a simple version of the problem with a limited number of pairs for any one sample (the most is 427 with 6) but more complicated data sets will have any number of levels so I need a solution that is agnostic to this factor. I think recursion of some sort is the way to go and I have been playing around with such in R but am fairly new to the concept and am definitely not getting the right answer.

这个数据集是问题的简单版本,对于任何一个示例(最多是427对6)都有有限的对,但是更复杂的数据集将有任意多个级别,所以我需要一个与这个因素无关的解决方案。我认为递归是一种方法,我一直在用R表示,但是我对这个概念还不太熟悉,我肯定没有得到正确的答案。

I can't be the first person ever to do this but I haven't found anything yet, probably because describing the problem is a bit awkward in a search engine.

我不可能是第一个这样做的人,但我还没有找到任何东西,可能是因为在搜索引擎中描述这个问题有点尴尬。

Anyway, if anyone knows anything about this I would appreciate the help!

不管怎样,如果有人知道这件事,我将感激你的帮助!

Edit -

编辑-

Expected output (each line is a group):

期望输出(每行为一组):

 [1,]    6  267 [2,]    9   10 [3,]   11   12 [4,]   79   80 [5,]   96  570 [6,]  314  583  584 [7,]  425  426 [8,]  427  428  430  473 [9,]  427  429  430[10,]  427  430  472  474[11,]  517  519  520  521[12,]  649  650

To clarify what I mean by each member of a group being supported by a pair to every other member: Group 1 has 2 members therefore requires one supporting pair which is pair 1. Group 6 has three members and therefore requires support from three pairs 6, 7, and 28 proving that each group member pairs with both of the others (this can be visualized as a triangle with each member a vertex and the connecting lines pairs). Group 8 has four members and therefore requires the support of six pairs 9, 11, 13, 15, 16, 19 (you can picture this as a square with each member a vertex and lines between all vertices giving six pairs). I hope this clarifies! Its hard to explain and the solution seems to be non-trivial.

为了阐明我所说的组中的每个成员由一对成员支持给其他成员的意思:组1有两个成员,因此需要一个支持对,即对1。6组有3个成员,因此需要来自3对6、7和28的支持,以证明每个组成员与其他成员对(这可以可视化为每个成员一个顶点和连接线对的三角形)。第8组有4个成员,因此需要6对9、11、13、15、16、19的支持(你可以把它想象成一个正方形,每个成员都有一个顶点,所有顶点之间的线都有6对)。我希望这澄清!这很难解释,而且解决方案似乎也不平凡。

2 个解决方案

#1


2  

Update 3 add pivoting

This is a suggested optimisation to the module below, which should reduce the number of recursions significantly. Add it to the end of the module code, and replace the loop for my $v ( keys %$p ) in _bron_kerbosh with for my $v ( _choose_pivot($p, $x) )

这是对下面模块的优化建议,它将显著减少递归的数量。将它添加到模块代码的末尾,将_bron_kerbosh中的$v (key %$p)循环替换为$v (_choose_pivot($p, $x))

# Find an element u of P U X such that as many as possible of its# neighbours fall in P#sub _choose_pivot {    my ( $p, $x ) = @_;    my @p = keys %$p;    my @choice = @p;    for my $u ( @p, keys %$x ) {        my $nu = $neighbours{$u};        my %nu = map +( $_ => 1 ), @$nu;        my @subset = grep { not $nu{$_} } @p;        @choice = @subset if @subset < @choice;    }    @choice;}

Update 2 with module

Wikipedia describes the Bron-Kerbosch algorithm for finding maximal cliques in a graph. It also says

*(Wikipedia)描述了Bron-Kerbosch算法,用于在图中找到最大的小团体。它还说

Although other algorithms for solving the clique problem have running times that are, in theory, better on inputs that have few maximal independent sets, the Bron–Kerbosch algorithm and subsequent improvements to it are frequently reported as being more efficient in practice than the alternatives.

尽管解决小团体问题的其他算法在理论上有更好的运行时间,在具有很少的最大独立集的输入上,Bron-Kerbosch算法和对它的后续改进在实践中经常被报告为比其他方法更有效。

So since CPAN appears to have no clique module that I can find I thought it would be useful to implement it. This is the code. You should copy and save it as Graph/Cliques/Bron_Kerbosch.pm. I shall prepare some tests and put it on CPAN shortly

由于CPAN似乎没有我能找到的小团体模块,我认为实现它是有用的。这是代码。您应该将它复制并保存为Graph/Cliques/Bron_Kerbosch.pm。我将准备一些测试,并很快把它放在CPAN上

package Graph::Cliques::Bron_Kerbosch;use strict;use warnings;use v5.8.3;use Exporter qw/ import /;our @EXPORT_OK = qw/ get_cliques /;my ( %neighbours, @cliques );sub get_cliques {    my ( $edges ) = @_;    %neighbours = ();    @cliques = ();    for my $edge ( @$edges ) {        my ( $n1, $n2 ) = @$edge;        $neighbours{$n1}{$n2} = 1;        $neighbours{$n2}{$n1} = 1;    }    $_ = [ keys %$_ ] for values %neighbours;    my ( %r, %p, %x );    $p{$_} = 1 for map @$_, @$edges;    _bron_kerbosch( \( %r, %p, %x ) );    @cliques;}sub _bron_kerbosch {    my ( $r, $p, $x ) = @_;    unless ( %$p or %$x ) {        push @cliques, [ keys %$r ];        return;    }    for my $v ( keys %$p ) {        my $nv = $neighbours{$v};        my %r_ = ( %$r, $v => 1 );        my %p_ = map { $_ => 1 } _intersect( [ keys %$p ], $nv);        my %x_ = map { $_ => 1 } _intersect( [ keys %$x ], $nv);        _bron_kerbosch( \( %r_, %p_, %x_ ) );        delete $p->{$v};        $x->{$v} = 1;    }}sub _intersect {    my ( $aa, $ab ) = @_;    my %ab = map { $_ => 1 } @$ab;    grep $ab{$_}, @$aa;}1;

And this is the program that drives the module using your own data. get_cliques executes in just under a millisecond on my system

这是一个用你自己的数据驱动模块的程序。get_cliques在我的系统上执行的时间不到一毫秒

use strict;use warnings;use Graph::Cliques::Bron_Kerbosch qw/ get_cliques /;# Read the data into an array of arrays, converting from the question's R# output. Each element of @edges contains a pair of nodes of the graph#my @edges;while ( <DATA> ) {    my @pair = split;    next unless @pair > 2 and shift( @pair ) =~ /\[/;    push @edges, \@pair;}# Call the utility function to get a list of cliques#my @groups = get_cliques( \@edges );# Extract the hash keys to change the array of hashes into an array of sorted# arrays, then sort the array first by the size of the clique and then by the# first value in each group#$_ = [ sort { $a <=> $b } @$_ ] for @groups;@groups = sort { @$a <=> @$b or $a->[0] <=> $b->[0] } @groups;print join( ' ', map { sprintf '%3d', $_ } @$_ ), "\n" for @groups;__DATA__    [,1] [,2] [1,]    6  267 [2,]    9   10 [3,]   11   12 [4,]   79   80 [5,]   96  570 [6,]  314  583 [7,]  314  584 [8,]  425  426 [9,]  427  428[10,]  427  429[11,]  427  430[12,]  427  472[13,]  427  473[14,]  427  474[15,]  428  430[16,]  428  473[17,]  429  430[18,]  430  472[19,]  430  473[20,]  430  474[21,]  472  474[22,]  517  519[23,]  517  520[24,]  517  521[25,]  519  520[26,]  519  521[27,]  520  521[28,]  583  584[29,]  649  650

output

  6 267  9  10 11  12 79  80 96 570425 426649 650314 583 584427 429 430427 428 430 473427 430 472 474517 519 520 521



Update 1

Okay what you have here is known mathematically as a graph, and what you are describing, where every value is connected to every other value, is called a complete graph

这里的数学上称为图,你所描述的,每个值都与其他值相连,被称为完整图

Knowing that lets you use Google, and there is a question "Find all complete sub-graphs within a graph" here on Stack Overflow which tells us that a complete subgraph is called a clique, which has its very own set of clique problems, of which yours is "listing all maximal cliques". Wikipedia tells us that "These problems are all hard"!

知道了这个之后,你就可以使用谷歌了,在Stack Overflow上有一个问题"找到一个图中的所有完整子图"它告诉我们一个完整的子图叫做一个小团体,它有自己的小团体问题,你的小团体问题是"列出所有的最大小团体"*告诉我们“这些问题都很棘手”!

On this basis I checked CPAN for a clique module and found Graph::Clique which I assumed I would just have to plug in to your question. However it has problems

在此基础上,我检查了CPAN的clique模块,并发现了图::clique,我认为我只需要代入你的问题。但是它有问题

  • It looks only for cliques of a specific size

    它只适用于特定尺寸的小团体。

  • It's broken, and dies with the message

    它被破坏了,并随消息而死

    Can't use string ("1") as a SCALAR ref while "strict refs" in use
  • Because of a sorting bug, it works only with numeric node names that all have the same number of digits
  • 由于排序错误,它只适用于所有数字相同的数字节点名

It also uses a brute-force technique that employs a regex method, which while quite clever is not that fast

它还使用了一种蛮力技术,使用了regex方法,虽然相当聪明,但速度并不快

As it was a better place to start than nothing I fixed it and added some calling code that checks whether a smaller clique found earlier is a subset of a larger one. The result is this program that seems to do what you want

因为这是一个比什么都不做更好的起点,所以我修改了它,并添加了一些调用代码,以检查前面发现的一个较小的集团是否是一个较大集团的子集。结果就是这个程序似乎做了你想做的事

Note though, that I think your expected data is wrong, as it contains cliques that are subsets of others in your list, as I commented beneath your question. And you can't want to include all subsets, as otherwise your example would list all node pairs instead of just some of them. There are actually seven two-node cliques in your data; [517, 521] isn't one of them because it is a subset of [517, 519, 520, 521]

请注意,我认为您的预期数据是错误的,因为它包含在您的列表中的其他子集,正如我在您的问题下面所评论的那样。你不可能想要包含所有的子集,否则你的例子将列出所有的节点对,而不是其中的一些。在你的数据中实际上有7个双节点组;[517,521]不是其中之一,因为它是[517,519,520,521]的子集

This program runs in just under six seconds on my system. The algorithm works by looking for cliques of successively larger sizes until none are found. By far the biggest delay here is establishing that there are no cliques with five nodes in your data, which takes around five seconds. Finding all of those with four nodes or less takes less than a second

这个程序在我的系统中运行不到6秒。该算法的工作原理是寻找不断增大的小团体,直到找不到为止。到目前为止,这里最大的延迟是确定数据中没有包含5个节点的cliques,这大约需要5秒钟。找到所有四个或少于四个节点的节点所需的时间不到一秒

use strict;use warnings;use List::MoreUtils qw/ uniq any all /;# Read the data into an array of arrays. Each element of @edges contains a# pair of nodes of the graph#my @edges;push @edges, [ split ] while <DATA>;# Keep asking for cliques of a larger size until we find none. Remove from# those already found any that are subsets of new ones#my @groups;for ( my $size = 2; my @cliques = get_cliques( $size, \@edges ); ++$size ) {    @cliques = map +{ map +( $_ => 1 ), split }, @cliques;    for ( my $i = 0; $i < @groups; ) {        my $group = $groups[$i];        my $subset = any {            my $clique = $_;            all { $clique->{$_} } keys %$group;        } @cliques;        if ( $subset ) {            splice @groups, $i, 1;        }        else {            ++$i;        }    }    push @groups, @cliques;}# Extract the hash keys to change the array of hashes into an array of sorted# arrays, then sort the array first by the size of the clique and then by the# first value in each group#$_ = [ sort { $a <=> $b } keys %$_ ] for @groups;@groups = sort { @$a <=> @$b or $a->[0] <=> $b->[0] } @groups;print join( ' ', map { sprintf '%3d', $_ } @$_ ), "\n" for @groups;# This subroutine is based on the non-functional `Graph::Clique` CPAN module# by Edward Wijaya, <ewijaya@singnet.com.sg>#sub get_cliques {    my ( $k, $edges ) = @_;    my $string = do {        my @vertices = sort { $a <=> $b } uniq map @$_, @$edges;        my @edges = map "$_->[0]-$_->[1]", sort { $a->[0] <=> $b->[0] } @{$edges};        local $" = ',';              # Fix SO syntax colouring "        "@vertices;@edges";    };    my $regex = join '[^;]+', ('\b(\d+)\b') x $k;    $regex .= '[^;]*;';    $regex .= "\n";    for my $i ( 1 .. $k-1 ) {        for my $j ( $i+1 .. $k ) {            $regex .= sprintf '(?=.*\b\g%d-\g%d\b)', $i, $j;            $regex .= "\n";        }    }    # Backtrack to regain all the identified k-cliques (Credit Mike Mikero)    my @cliques;    $regex .= '(?{ push (@cliques, join(" ", map ${$_}, 1..$k) ) })(*FAIL)' . "\n";    #print $regex, "\n";    {        no strict 'refs';        use re 'eval';        $string =~ /$regex/x;    }    @cliques;}__DATA__6 2679 1011 1279 8096 570314 583314 584425 426427 428427 429427 430427 472427 473427 474428 430428 473429 430430 472430 473430 474472 474517 519517 520517 521519 520519 521520 521583 584649 650

output

  6 267  9  10 11  12 79  80 96 570425 426649 650314 583 584427 429 430427 430 472 474427 428 430 473517 519 520 521



Original post

This is reasonably straightforward once you have tossed back the red herring that every member of each group must be in a pair with every other member. I believe your data is simply structured in a way such that each group is represented by every possible pair within it, and the problem is simply one of gathering together all values that are paired to any other member of each group

这是相当直接的,一旦你抛弃了“红鲱鱼”,即每个组的每个成员都必须与其他成员成一对。我相信您的数据是简单的结构化的,这样每一组都由其中的每一个可能的组合来表示,而问题仅仅是将所有的值集合在一起,并将它们配对到每个组的任何其他成员。

This code is perhaps a little dense, but all the work is done within the for loop. Two data structures are maintained in parallel. @groups is an array of hashes whose keys are the members of the group. This is just to keep the members unique even if they are added multiple times. And %group_for is a hash relating each member to the element of @groups into which it has been placed

这段代码可能有点复杂,但是所有的工作都是在for循环中完成的。两个数据结构是并行维护的。@groups是一组散列的数组,其键是组的成员。这只是为了保持成员的独特性,即使它们被添加了多次。而%group_for是一个散列,它将每个成员与已放置在其中的@groups的元素关联起来

The for loop processes each pair by looking for a group into which either of the pair has already been placed. If neither have appeared before then a new group (anonymous hash) is pushed onto the array. Finally the %groups_for hash is updated to show where both members have been placed

for循环通过查找一个组来处理每个对,该组中的任何一个都已经被放置到该组中。如果两者之前都没有出现,则会将一个新的组(匿名散列)推到数组中。最后,更新了%groups_for散列,以显示两个成员已经放置在何处

The output section converts the groups from hashes to arrays, sorts each group, and sorts all groups in order of their first member

输出部分将组从散列转换为数组,对每个组排序,并按照第一个成员的顺序对所有组进行排序。

use strict;use warnings;my @data;push @data, [ split ] while <DATA>;my @groups;my %group_for;for my $pair ( @data ) {  my $group = $group_for{$pair->[0]} || $group_for{$pair->[1]};  push @groups, $group = {} if not $group;  $group->{$_} = 1 for @$pair;  $group_for{$_} = $group for @$pair;}# Change array of hashes into array of sorted values, sort array# by first value in each group, and display#$_ = [ sort { $a <=> $b } keys %$_ ] for @groups;@groups = sort { $a->[0] <=> $b->[0] } @groups;print join(' ', map { sprintf '%3d', $_ } @$_), "\n" for @groups;__DATA__  6 267  9  10 11  12 79  80 96 570314 583314 584425 426427 428427 429427 430427 472427 473427 474428 430428 473429 430430 472430 473430 474472 474517 519517 520517 521519 520519 521520 521583 584649 650

output

  6 267  9  10 11  12 79  80 96 570314 583 584425 426427 428 429 430 472 473 474517 519 520 521649 650

#2


1  

Here's how I'd contruct the initial groups. I'm not sure I understand correctly the condition about links "to every other member", so I'll update the code after you show us the expected output for the given sample.

这是我构造初始基团的方法。我不确定我是否正确地理解了“到其他所有成员”的链接的条件,所以我将在您向我们展示给定示例的预期输出之后更新代码。

#!/usr/bin/perluse warnings;use strict;use feature qw{ say };my $group_counter = 1;my %in_group;my %members;while (<>) {    my ($key, $v1, $v2) = split;    my @groups;    for my $value ($v1, $v2) {        if (my $g = $in_group{$value}) {            # Existing groups to merge, no duplicates.            push @groups, $g unless @groups && $g == $groups[0];         }    }    { 0 => sub { # New group.          $in_group{$_} = $group_counter for $v1, $v2;          push @{ $members{$group_counter} }, [ $key, $v1, $v2 ];          $group_counter++;      },      1 => sub { # Add to 1 group.          $in_group{$_} = $groups[0] for $v1, $v2;          push @{ $members{ $groups[0] } }, [ $key, $v1, $v2 ];      },      2 => sub { # Merge 2 groups, add to the result.          $in_group{$v2} = $groups[0];          @in_group{ @$_[1, 2] } = ($groups[0]) x 2 for @{ $members{ $groups[1] } };          push @{ $members { $groups[0] } },               @{ delete $members{ $groups[1] } };      },    }->{@groups}->();}for my $g (keys %members) {    say join ' ', map $_->[0], @{ $members{$g} };}

Output (each line represents a group):

输出(每行表示一组):

[3,][2,][1,][8,][29,][5,][6,] [7,] [28,][4,][22,] [23,] [24,] [25,] [26,] [27,][9,] [10,] [11,] [12,] [13,] [14,] [15,] [16,] [17,] [18,] [19,] [20,] [21,]

#1


2  

Update 3 add pivoting

This is a suggested optimisation to the module below, which should reduce the number of recursions significantly. Add it to the end of the module code, and replace the loop for my $v ( keys %$p ) in _bron_kerbosh with for my $v ( _choose_pivot($p, $x) )

这是对下面模块的优化建议,它将显著减少递归的数量。将它添加到模块代码的末尾,将_bron_kerbosh中的$v (key %$p)循环替换为$v (_choose_pivot($p, $x))

# Find an element u of P U X such that as many as possible of its# neighbours fall in P#sub _choose_pivot {    my ( $p, $x ) = @_;    my @p = keys %$p;    my @choice = @p;    for my $u ( @p, keys %$x ) {        my $nu = $neighbours{$u};        my %nu = map +( $_ => 1 ), @$nu;        my @subset = grep { not $nu{$_} } @p;        @choice = @subset if @subset < @choice;    }    @choice;}

Update 2 with module

Wikipedia describes the Bron-Kerbosch algorithm for finding maximal cliques in a graph. It also says

*(Wikipedia)描述了Bron-Kerbosch算法,用于在图中找到最大的小团体。它还说

Although other algorithms for solving the clique problem have running times that are, in theory, better on inputs that have few maximal independent sets, the Bron–Kerbosch algorithm and subsequent improvements to it are frequently reported as being more efficient in practice than the alternatives.

尽管解决小团体问题的其他算法在理论上有更好的运行时间,在具有很少的最大独立集的输入上,Bron-Kerbosch算法和对它的后续改进在实践中经常被报告为比其他方法更有效。

So since CPAN appears to have no clique module that I can find I thought it would be useful to implement it. This is the code. You should copy and save it as Graph/Cliques/Bron_Kerbosch.pm. I shall prepare some tests and put it on CPAN shortly

由于CPAN似乎没有我能找到的小团体模块,我认为实现它是有用的。这是代码。您应该将它复制并保存为Graph/Cliques/Bron_Kerbosch.pm。我将准备一些测试,并很快把它放在CPAN上

package Graph::Cliques::Bron_Kerbosch;use strict;use warnings;use v5.8.3;use Exporter qw/ import /;our @EXPORT_OK = qw/ get_cliques /;my ( %neighbours, @cliques );sub get_cliques {    my ( $edges ) = @_;    %neighbours = ();    @cliques = ();    for my $edge ( @$edges ) {        my ( $n1, $n2 ) = @$edge;        $neighbours{$n1}{$n2} = 1;        $neighbours{$n2}{$n1} = 1;    }    $_ = [ keys %$_ ] for values %neighbours;    my ( %r, %p, %x );    $p{$_} = 1 for map @$_, @$edges;    _bron_kerbosch( \( %r, %p, %x ) );    @cliques;}sub _bron_kerbosch {    my ( $r, $p, $x ) = @_;    unless ( %$p or %$x ) {        push @cliques, [ keys %$r ];        return;    }    for my $v ( keys %$p ) {        my $nv = $neighbours{$v};        my %r_ = ( %$r, $v => 1 );        my %p_ = map { $_ => 1 } _intersect( [ keys %$p ], $nv);        my %x_ = map { $_ => 1 } _intersect( [ keys %$x ], $nv);        _bron_kerbosch( \( %r_, %p_, %x_ ) );        delete $p->{$v};        $x->{$v} = 1;    }}sub _intersect {    my ( $aa, $ab ) = @_;    my %ab = map { $_ => 1 } @$ab;    grep $ab{$_}, @$aa;}1;

And this is the program that drives the module using your own data. get_cliques executes in just under a millisecond on my system

这是一个用你自己的数据驱动模块的程序。get_cliques在我的系统上执行的时间不到一毫秒

use strict;use warnings;use Graph::Cliques::Bron_Kerbosch qw/ get_cliques /;# Read the data into an array of arrays, converting from the question's R# output. Each element of @edges contains a pair of nodes of the graph#my @edges;while ( <DATA> ) {    my @pair = split;    next unless @pair > 2 and shift( @pair ) =~ /\[/;    push @edges, \@pair;}# Call the utility function to get a list of cliques#my @groups = get_cliques( \@edges );# Extract the hash keys to change the array of hashes into an array of sorted# arrays, then sort the array first by the size of the clique and then by the# first value in each group#$_ = [ sort { $a <=> $b } @$_ ] for @groups;@groups = sort { @$a <=> @$b or $a->[0] <=> $b->[0] } @groups;print join( ' ', map { sprintf '%3d', $_ } @$_ ), "\n" for @groups;__DATA__    [,1] [,2] [1,]    6  267 [2,]    9   10 [3,]   11   12 [4,]   79   80 [5,]   96  570 [6,]  314  583 [7,]  314  584 [8,]  425  426 [9,]  427  428[10,]  427  429[11,]  427  430[12,]  427  472[13,]  427  473[14,]  427  474[15,]  428  430[16,]  428  473[17,]  429  430[18,]  430  472[19,]  430  473[20,]  430  474[21,]  472  474[22,]  517  519[23,]  517  520[24,]  517  521[25,]  519  520[26,]  519  521[27,]  520  521[28,]  583  584[29,]  649  650

output

  6 267  9  10 11  12 79  80 96 570425 426649 650314 583 584427 429 430427 428 430 473427 430 472 474517 519 520 521



Update 1

Okay what you have here is known mathematically as a graph, and what you are describing, where every value is connected to every other value, is called a complete graph

这里的数学上称为图,你所描述的,每个值都与其他值相连,被称为完整图

Knowing that lets you use Google, and there is a question "Find all complete sub-graphs within a graph" here on Stack Overflow which tells us that a complete subgraph is called a clique, which has its very own set of clique problems, of which yours is "listing all maximal cliques". Wikipedia tells us that "These problems are all hard"!

知道了这个之后,你就可以使用谷歌了,在Stack Overflow上有一个问题"找到一个图中的所有完整子图"它告诉我们一个完整的子图叫做一个小团体,它有自己的小团体问题,你的小团体问题是"列出所有的最大小团体"*告诉我们“这些问题都很棘手”!

On this basis I checked CPAN for a clique module and found Graph::Clique which I assumed I would just have to plug in to your question. However it has problems

在此基础上,我检查了CPAN的clique模块,并发现了图::clique,我认为我只需要代入你的问题。但是它有问题

  • It looks only for cliques of a specific size

    它只适用于特定尺寸的小团体。

  • It's broken, and dies with the message

    它被破坏了,并随消息而死

    Can't use string ("1") as a SCALAR ref while "strict refs" in use
  • Because of a sorting bug, it works only with numeric node names that all have the same number of digits
  • 由于排序错误,它只适用于所有数字相同的数字节点名

It also uses a brute-force technique that employs a regex method, which while quite clever is not that fast

它还使用了一种蛮力技术,使用了regex方法,虽然相当聪明,但速度并不快

As it was a better place to start than nothing I fixed it and added some calling code that checks whether a smaller clique found earlier is a subset of a larger one. The result is this program that seems to do what you want

因为这是一个比什么都不做更好的起点,所以我修改了它,并添加了一些调用代码,以检查前面发现的一个较小的集团是否是一个较大集团的子集。结果就是这个程序似乎做了你想做的事

Note though, that I think your expected data is wrong, as it contains cliques that are subsets of others in your list, as I commented beneath your question. And you can't want to include all subsets, as otherwise your example would list all node pairs instead of just some of them. There are actually seven two-node cliques in your data; [517, 521] isn't one of them because it is a subset of [517, 519, 520, 521]

请注意,我认为您的预期数据是错误的,因为它包含在您的列表中的其他子集,正如我在您的问题下面所评论的那样。你不可能想要包含所有的子集,否则你的例子将列出所有的节点对,而不是其中的一些。在你的数据中实际上有7个双节点组;[517,521]不是其中之一,因为它是[517,519,520,521]的子集

This program runs in just under six seconds on my system. The algorithm works by looking for cliques of successively larger sizes until none are found. By far the biggest delay here is establishing that there are no cliques with five nodes in your data, which takes around five seconds. Finding all of those with four nodes or less takes less than a second

这个程序在我的系统中运行不到6秒。该算法的工作原理是寻找不断增大的小团体,直到找不到为止。到目前为止,这里最大的延迟是确定数据中没有包含5个节点的cliques,这大约需要5秒钟。找到所有四个或少于四个节点的节点所需的时间不到一秒

use strict;use warnings;use List::MoreUtils qw/ uniq any all /;# Read the data into an array of arrays. Each element of @edges contains a# pair of nodes of the graph#my @edges;push @edges, [ split ] while <DATA>;# Keep asking for cliques of a larger size until we find none. Remove from# those already found any that are subsets of new ones#my @groups;for ( my $size = 2; my @cliques = get_cliques( $size, \@edges ); ++$size ) {    @cliques = map +{ map +( $_ => 1 ), split }, @cliques;    for ( my $i = 0; $i < @groups; ) {        my $group = $groups[$i];        my $subset = any {            my $clique = $_;            all { $clique->{$_} } keys %$group;        } @cliques;        if ( $subset ) {            splice @groups, $i, 1;        }        else {            ++$i;        }    }    push @groups, @cliques;}# Extract the hash keys to change the array of hashes into an array of sorted# arrays, then sort the array first by the size of the clique and then by the# first value in each group#$_ = [ sort { $a <=> $b } keys %$_ ] for @groups;@groups = sort { @$a <=> @$b or $a->[0] <=> $b->[0] } @groups;print join( ' ', map { sprintf '%3d', $_ } @$_ ), "\n" for @groups;# This subroutine is based on the non-functional `Graph::Clique` CPAN module# by Edward Wijaya, <ewijaya@singnet.com.sg>#sub get_cliques {    my ( $k, $edges ) = @_;    my $string = do {        my @vertices = sort { $a <=> $b } uniq map @$_, @$edges;        my @edges = map "$_->[0]-$_->[1]", sort { $a->[0] <=> $b->[0] } @{$edges};        local $" = ',';              # Fix SO syntax colouring "        "@vertices;@edges";    };    my $regex = join '[^;]+', ('\b(\d+)\b') x $k;    $regex .= '[^;]*;';    $regex .= "\n";    for my $i ( 1 .. $k-1 ) {        for my $j ( $i+1 .. $k ) {            $regex .= sprintf '(?=.*\b\g%d-\g%d\b)', $i, $j;            $regex .= "\n";        }    }    # Backtrack to regain all the identified k-cliques (Credit Mike Mikero)    my @cliques;    $regex .= '(?{ push (@cliques, join(" ", map ${$_}, 1..$k) ) })(*FAIL)' . "\n";    #print $regex, "\n";    {        no strict 'refs';        use re 'eval';        $string =~ /$regex/x;    }    @cliques;}__DATA__6 2679 1011 1279 8096 570314 583314 584425 426427 428427 429427 430427 472427 473427 474428 430428 473429 430430 472430 473430 474472 474517 519517 520517 521519 520519 521520 521583 584649 650

output

  6 267  9  10 11  12 79  80 96 570425 426649 650314 583 584427 429 430427 430 472 474427 428 430 473517 519 520 521



Original post

This is reasonably straightforward once you have tossed back the red herring that every member of each group must be in a pair with every other member. I believe your data is simply structured in a way such that each group is represented by every possible pair within it, and the problem is simply one of gathering together all values that are paired to any other member of each group

这是相当直接的,一旦你抛弃了“红鲱鱼”,即每个组的每个成员都必须与其他成员成一对。我相信您的数据是简单的结构化的,这样每一组都由其中的每一个可能的组合来表示,而问题仅仅是将所有的值集合在一起,并将它们配对到每个组的任何其他成员。

This code is perhaps a little dense, but all the work is done within the for loop. Two data structures are maintained in parallel. @groups is an array of hashes whose keys are the members of the group. This is just to keep the members unique even if they are added multiple times. And %group_for is a hash relating each member to the element of @groups into which it has been placed

这段代码可能有点复杂,但是所有的工作都是在for循环中完成的。两个数据结构是并行维护的。@groups是一组散列的数组,其键是组的成员。这只是为了保持成员的独特性,即使它们被添加了多次。而%group_for是一个散列,它将每个成员与已放置在其中的@groups的元素关联起来

The for loop processes each pair by looking for a group into which either of the pair has already been placed. If neither have appeared before then a new group (anonymous hash) is pushed onto the array. Finally the %groups_for hash is updated to show where both members have been placed

for循环通过查找一个组来处理每个对,该组中的任何一个都已经被放置到该组中。如果两者之前都没有出现,则会将一个新的组(匿名散列)推到数组中。最后,更新了%groups_for散列,以显示两个成员已经放置在何处

The output section converts the groups from hashes to arrays, sorts each group, and sorts all groups in order of their first member

输出部分将组从散列转换为数组,对每个组排序,并按照第一个成员的顺序对所有组进行排序。

use strict;use warnings;my @data;push @data, [ split ] while <DATA>;my @groups;my %group_for;for my $pair ( @data ) {  my $group = $group_for{$pair->[0]} || $group_for{$pair->[1]};  push @groups, $group = {} if not $group;  $group->{$_} = 1 for @$pair;  $group_for{$_} = $group for @$pair;}# Change array of hashes into array of sorted values, sort array# by first value in each group, and display#$_ = [ sort { $a <=> $b } keys %$_ ] for @groups;@groups = sort { $a->[0] <=> $b->[0] } @groups;print join(' ', map { sprintf '%3d', $_ } @$_), "\n" for @groups;__DATA__  6 267  9  10 11  12 79  80 96 570314 583314 584425 426427 428427 429427 430427 472427 473427 474428 430428 473429 430430 472430 473430 474472 474517 519517 520517 521519 520519 521520 521583 584649 650

output

  6 267  9  10 11  12 79  80 96 570314 583 584425 426427 428 429 430 472 473 474517 519 520 521649 650

#2


1  

Here's how I'd contruct the initial groups. I'm not sure I understand correctly the condition about links "to every other member", so I'll update the code after you show us the expected output for the given sample.

这是我构造初始基团的方法。我不确定我是否正确地理解了“到其他所有成员”的链接的条件,所以我将在您向我们展示给定示例的预期输出之后更新代码。

#!/usr/bin/perluse warnings;use strict;use feature qw{ say };my $group_counter = 1;my %in_group;my %members;while (<>) {    my ($key, $v1, $v2) = split;    my @groups;    for my $value ($v1, $v2) {        if (my $g = $in_group{$value}) {            # Existing groups to merge, no duplicates.            push @groups, $g unless @groups && $g == $groups[0];         }    }    { 0 => sub { # New group.          $in_group{$_} = $group_counter for $v1, $v2;          push @{ $members{$group_counter} }, [ $key, $v1, $v2 ];          $group_counter++;      },      1 => sub { # Add to 1 group.          $in_group{$_} = $groups[0] for $v1, $v2;          push @{ $members{ $groups[0] } }, [ $key, $v1, $v2 ];      },      2 => sub { # Merge 2 groups, add to the result.          $in_group{$v2} = $groups[0];          @in_group{ @$_[1, 2] } = ($groups[0]) x 2 for @{ $members{ $groups[1] } };          push @{ $members { $groups[0] } },               @{ delete $members{ $groups[1] } };      },    }->{@groups}->();}for my $g (keys %members) {    say join ' ', map $_->[0], @{ $members{$g} };}

Output (each line represents a group):

输出(每行表示一组):

[3,][2,][1,][8,][29,][5,][6,] [7,] [28,][4,][22,] [23,] [24,] [25,] [26,] [27,][9,] [10,] [11,] [12,] [13,] [14,] [15,] [16,] [17,] [18,] [19,] [20,] [21,]