如何在列表/字符串数组中查找类似的模式

I am looking for ways to find matching patterns in lists or arrays of strings, specifically in .NET, but algorithms or logic from other languages would be helpful.

我正在寻找在列表或字符串数组中找到匹配模式的方法,特别是在.NET中,但是其他语言的算法或逻辑会很有帮助。

Say I have 3 arrays (or in this specific case List(Of String))

假设我有3个数组(或在此特定情况下List(Of String))

Array1
"Do"
"Re"
"Mi"
"Fa"
"So"
"La"
"Ti"

Array2
"Mi"
"Fa"
"Jim"
"Bob"
"So"

Array3
"Jim"
"Bob"
"So"
"La"
"Ti"

I want to report on the occurrences of the matches of

我想报告一下比赛的发生情况

("Mi", "Fa") In Arrays (1,2)
("So") In Arrays (1,2,3)
("Jim", "Bob", "So") in Arrays (2,3)
("So", "La", "Ti") in Arrays (1, 3)

...and any others.

......和其他任何人。

I am using this to troubleshoot an issue, not to make a commercial product of it specifically, and would rather not do it by hand (there are 110 lists of about 100-200 items).

我正在使用它来解决问题,而不是专门制作它的商业产品,而不是手工制作(有110个约100-200项的清单)。

Are there any algorithms, existing code, or ideas that will help me accomplish finding the results described?

是否有任何算法,现有代码或想法可以帮助我找到所描述的结果?

8 个解决方案

#1

As others have mentioned the function you want is Intersect. If you are using .NET 3.0 consider using LINQ's Intersect function.

正如其他人提到的,你想要的功能是相交。如果您使用的是.NET 3.0,请考虑使用LINQ的Intersect功能。

See the following post for more information

有关更多信息,请参阅以下帖子

Consider using LinqPAD to experiment.

考虑使用LinqPAD进行实验。

www.linqpad.net

#2

The simplest way to code would be to build a Dictionary then loop through each item in each array. For each item do this:

最简单的编码方法是构建一个Dictionary然后遍历每个数组中的每个项目。对于每个项目执行此操作:

Check if the item is in the dictonary if so add the list to the array. If the item is not in the dictionary add it and the list.

检查项目是否在dictonary中,如果是这样,将列表添加到数组中。如果该项目不在字典中,请添加它和列表。

Since as you said this is non-production code performance doesn't matter so this approach should work fine.

既然你说这是非生产代码性能并不重要,所以这种方法应该可以正常工作。

#3

Here's a solution using SuffixTree module to locate subsequences:

这是使用SuffixTree模块定位子序列的解决方案:

#!/usr/bin/env python
from SuffixTree  import SubstringDict
from collections import defaultdict
from itertools   import groupby
from operator    import itemgetter
import sys

def main(stdout=sys.stdout):
    """
    >>> import StringIO
    >>> s = StringIO.StringIO()
    >>> main(stdout=s)
    >>> print s.getvalue()
    [['Mi', 'Fa']] In Arrays (1, 2)
    [['So', 'La', 'Ti']] In Arrays (1, 3)
    [['Jim', 'Bob', 'So']] In Arrays (2, 3)
    [['So']] In Arrays (1, 2, 3)
    <BLANKLINE>
    """
    # array of arrays of strings
    arr = [
        ["Do", "Re", "Mi", "Fa", "So", "La", "Ti",],
        ["Mi", "Fa", "Jim", "Bob", "So",],
        ["Jim", "Bob", "So", "La", "Ti",],
    ]

####    # 28 seconds  (27 seconds without lesser substrs inspection (see below))
####    N, M = 100, 100
####    import random
####    arr = [[random.randrange(100) for _ in range(M)] for _ in range(N)]

    # convert to ASCII alphabet (for SubstringDict)
    letter2item = {}
    item2letter = {}
    c = 1
    for item in (i for a in arr for i in a):
        if item not in item2letter:
            c += 1
            if c == 128:
                raise ValueError("too many unique items; "
                                 "use a less restrictive alphabet for SuffixTree")
            letter = chr(c)
            letter2item[letter] = item
            item2letter[item] = letter
    arr_ascii = [''.join(item2letter[item] for item in a) for a in arr]

    # populate substring dict (based on SuffixTree)
    substring_dict = SubstringDict()
    for i, s in enumerate(arr_ascii):
        substring_dict[s] = i+1

    # enumerate all substrings, save those that occur more than once
    substr2indices = {}
    indices2substr = defaultdict(list)
    for str_ in arr_ascii:
        for start in range(len(str_)):
            for size in reversed(range(1, len(str_) - start + 1)):
                substr = str_[start:start + size]
                if substr not in substr2indices:
                    indices = substring_dict[substr] # O(n) SuffixTree
                    if len(indices) > 1:
                        substr2indices[substr] = indices
                        indices2substr[tuple(indices)].append(substr)
####                        # inspect all lesser substrs
####                        # it could diminish size of indices2substr[ind] list
####                        # but it has no effect for input 100x100x100 (see above)
####                        for i in reversed(range(len(substr))):
####                            s = substr[:i]
####                            if s in substr2indices: continue
####                            ind = substring_dict[s]
####                            if len(ind) > len(indices):
####                                substr2indices[s] = ind
####                                indices2substr[tuple(ind)].append(s)
####                                indices = ind
####                            else:
####                                assert set(ind) == set(indices), (ind, indices)
####                                substr2indices[s] = None
####                        break # all sizes inspected, move to next `start`

    for indices, substrs in indices2substr.iteritems():
        # remove substrs that are substrs of other substrs
        substrs = sorted(substrs, key=len) # sort by size
        substrs = [p for i, p in enumerate(substrs)
                   if not any(p in q  for q in substrs[i+1:])]
        # convert letters to items and print
        items = [map(letter2item.get, substr) for substr in substrs]
        print >>stdout, "%s In Arrays %s" % (items, indices)

if __name__=="__main__":
    # test
    import doctest; doctest.testmod()
    # measure performance
    import timeit
    t = timeit.Timer(stmt='main(stdout=s)',
                     setup='from __main__ import main; from cStringIO import StringIO as S; s = S()')
    N = 1000
    milliseconds = min(t.repeat(repeat=3, number=N))
    print("%.3g milliseconds" % (1e3*milliseconds/N))

It takes about 30 seconds to process 100 lists of 100 items each. SubstringDict in the above code might be emulated by grep -F -f.

每个处理100个100个项目的列表大约需要30秒。上面代码中的SubstringDict可能由grep -F -f模拟。

Old solution:

In Python (save it to 'group_patterns.py' file):

在Python中(将其保存为'group_patterns.py'文件):

#!/usr/bin/env python
from collections import defaultdict
from itertools   import groupby

def issubseq(p, q):
    """Return whether `p` is a subsequence of `q`."""
    return any(p == q[i:i + len(p)] for i in range(len(q) - len(p) + 1))

arr = (("Do", "Re", "Mi", "Fa", "So", "La", "Ti",),
       ("Mi", "Fa", "Jim", "Bob", "So",),
       ("Jim", "Bob", "So", "La", "Ti",))

# store all patterns that occure at least twice
d = defaultdict(list) # a map: pattern -> indexes of arrays it's within
for i, a in enumerate(arr[:-1]):
    for j, q in enumerate(arr[i+1:]): 
        for k in range(len(a)):
            for size in range(1, len(a)+1-k):
                p = a[k:k + size] # a pattern
                if issubseq(p, q): # `p` occures at least twice
                    d[p] += [i+1, i+2+j]

# group patterns by arrays they are within
inarrays = lambda pair: sorted(set(pair[1]))
for key, group in groupby(sorted(d.iteritems(), key=inarrays), key=inarrays):
    patterns = sorted((pair[0] for pair in group), key=len) # sort by size
    # remove patterns that are subsequences of other patterns
    patterns = [p for i, p in enumerate(patterns)
                if not any(issubseq(p, q)  for q in patterns[i+1:])]
    print "%s In Arrays %s" % (patterns, key)

The following command:

以下命令:

$ python group_patterns.py

prints:

[('Mi', 'Fa')] In Arrays [1, 2]
[('So',)] In Arrays [1, 2, 3]
[('So', 'La', 'Ti')] In Arrays [1, 3]
[('Jim', 'Bob', 'So')] In Arrays [2, 3]

The solution is terribly inefficient.

解决方案非常低效。

#4

I hacked the program below in about 10 minutes of Perl. It's not perfect, it uses a global variable, and it just prints out the counts of every element seen by the program in each list, but it's a good approximation to what you want to do that's super-easy to code.

我在大约10分钟的Perl中攻击了下面的程序。它并不完美,它使用了一个全局变量,它只打印出每个列表中程序看到的每个元素的计数,但它很好地接近你想要做的那些超级易编码。

Do you actually want all combinations of all subsets of the elements common to each array? You could enumerate all of the elements in a smarter way if you wanted, but if you just wanted all elements that exist at least once in each array you could use the Unix command "grep -v 0" on the output below and that would show you the intersection of all elements common to all arrays. Your question is missing a little bit of detail, so I can't perfectly implement something that solves your problem.

您是否真的想要每个阵列共有的所有元素子集的所有组合?如果你愿意,你可以用更聪明的方式枚举所有元素,但是如果你只想在每个数组中至少存在一次所有元素,你可以在下面的输出上使用Unix命令“grep -v 0”,这将显示你是所有数组共有的所有元素的交集。你的问题缺少一些细节,所以我无法完美地实现解决问题的方法。

If you do more data analysis than programming, scripting can be very useful for asking questions from textual data like this. If you don't know how to code in a scripting language like this, I would spend a month or two reading about how to code in Perl, Python or Ruby. They can be wonderful for one-off hacks such as this, especially in cases when you don't really know what you want. The time and brain cost of writing a program like this is really low, so that (if you're fast) you can write and re-write it several times while still exploring the definition of your question.

如果您进行的数据分析比编程更多,那么脚本编写对于从这样的文本数据中提问是非常有用的。如果您不知道如何使用这样的脚本语言进行编码,我会花一两个月时间阅读有关如何使用Perl,Python或Ruby编写代码的内容。对于像这样的一次性黑客来说,它们非常棒,特别是在你真的不知道自己想要什么的情况下。编写这样的程序的时间和大脑成本非常低,所以(如果你很快)你可以多次编写和重写它,同时仍然在探索你的问题的定义。

#!/usr/bin/perl -w

use strict;

my @Array1 = ( "Do", "Re", "Mi", "Fa", "So", "La", "Ti");
my @Array2 = ( "Mi", "Fa", "Jim", "Bob", "So" );
my @Array3 = ( "Jim", "Bob", "So", "La", "Ti" );

my %counts;
sub count_array {
    my $array = shift;
    my $name  = shift;
    foreach my $e (@$array) {
        $counts{$e}{$name}++;
    }
}

count_array( \@Array1, "Array1" );
count_array( \@Array2, "Array2" );
count_array( \@Array3, "Array3" );

my @names = qw/ Array1 Array2 Array3 /;
print join ' ', ('element',@names);
print "\n";

my @unique_names = keys %counts;
foreach my $unique_name (@unique_names) {
    my @counts = map {
        if ( exists $counts{$unique_name}{$_} ) {
            $counts{$unique_name}{$_};
        } else {
            0;
        }
    }
    @names;

    print join ' ', ($unique_name,@counts);
    print "\n";
}

The program's output is:

该程序的输出是:

element Array1 Array2 Array3
Ti 1 0 1
La 1 0 1
So 1 1 1
Mi 1 1 0
Fa 1 1 0
Do 1 0 0
Bob 0 1 1
Jim 0 1 1
Re 1 0 0

#5

Looks like you want to use an intersection function on sets of data. Intersection picks out elements that are common in both (or more) sets.

看起来您想对数据集使用交集函数。交叉点选择两个(或更多)集合中常见的元素。

The problem with this viewpoint is that sets cannot contain more than one of each element, i.e. no more than one Jim per set, also it cannot recognize several elements in a row counting as a pattern, you can however modify a comparison function to look further to see just that.

这个观点的问题是集合不能包含每个元素中的一个以上,即每组不超过一个Jim,也无法识别作为模式计算的行中的几个元素,但是可以修改比较函数以进一步查看看到那个。

There mey be functions like intersect that works on bags (which is kind of like sets, but tolerate identical elements).

可能有像交叉这样的函数在袋子上工作(有点像套装,但容忍相同的元素)。

These functions should be standard in most languages or pretty easy to write yourself.

这些函数应该是大多数语言的标准函数,或者很容易自己编写。

#6

I'm sure there's a MUCH more elegant way, but...

我确信有更优雅的方式,但......

Since this isn't production code, why not just hack it and convert each array into a delimited string, then search each string for the pattern you want? i.e.

由于这不是生产代码,为什么不只是破解它并将每个数组转换为分隔的字符串,然后在每个字符串中搜索您想要的模式?即


        private void button1_Click(object sender, EventArgs e)
        {

            string[] array1 = { "do", "re", "mi", "fa", "so" };
            string[] array2 = { "mi", "fa", "jim", "bob", "so" };
            string[] pattern1 = { "mi", "fa" };
            MessageBox.Show(FindPatternInArray(array1, pattern1).ToString());
            MessageBox.Show(FindPatternInArray(array2, pattern1).ToString());

        }

        private bool FindPatternInArray(string[] AArray, string[] APattern)
        {
            return string.Join("~", AArray).IndexOf(string.Join("~", APattern)) >= 0;
        }

#7

First, start by counting each item. You make a temp list : "Do" = 1, "Mi" = 2, "So" = 3, etc. you can remove from the temp list all the ones that match = 1 (ex: "Do"). The temp list contains the list of non-unique items (save it somewhere).

首先,从计算每个项目开始。你创建一个临时列表:“Do”= 1,“Mi”= 2,“So”= 3等,你可以从临时列表中删除所有匹配= 1的(例如:“Do”)。临时列表包含非唯一项的列表(将其保存在某处)。

Now, you try to make lists of two from one in the temp list, and a following in the original lists. "So" + "La" = 2, "Bob" + "So" = 2, etc. Remove the ones with = 1. You have the lists of couple that appears at least twice (save it somewhere).

现在,您尝试在临时列表中创建两个列表,在原始列表中创建一个列表。 “所以”+“La”= 2,“Bob”+“So”= 2等。删除= 1的那些。你有至少出现两次的情侣列表(保存在某个地方)。

Now, try to make lists of 3 items, by taking a couple from the temp list, and take the following from the original lists. ("Mi", "Fa") + "So" = 1, ("Mi", "Fa") + "Jim" = 1, ("So", "La") + "Ti" = 2 Remove the ones with = 1. You have the lists of 3 items that appears at least twice (save it).

现在,尝试通过从临时列表中取一对来制作3个项目的列表,并从原始列表中取出以下内容。 (“Mi”,“Fa”)+“So”= 1,(“Mi”,“Fa”)+“Jim”= 1,(“So”,“La”)+“Ti”= 2删除with = 1.您有3个项目的列表,至少出现两次(保存)。

And you continue like that until the temp list is empty.

你继续这样,直到临时列表为空。

At the end, you take all the saved lists and you merge them.

最后,您将获取所有已保存的列表并将它们合并。

This algorithm is not optimal (I think we can do better with suitable data structures), but it is easy to implement :)

这个算法不是最优的(我认为我们可以用合适的数据结构做得更好),但它很容易实现:)

#8

Suppose a password consisted of a string of nine characters from the English alphabet (26 characters). If each possible password could be tested in a millisecond, how long would it take to test all possible passwords?

假设密码由英文字母表中的9个字符组成(26个字符)。如果每个可能的密码都可以在毫秒内测试,那么测试所有可能的密码需要多长时间?

#1