如何确定列表是否是另一个列表的子集?

时间:2021-07-24 23:10:47

What is efficient way to determine if a list is a subset of another list?

确定列表是否是另一个列表的子集的有效方法是什么?

Example:

is_subset(List(1,2,3,4),List(2,3))    //Returns true
is_subset(List(1,2,3,4),List(3,4,5))  //Returns false

I am mostly looking for efficient algorithm and not too concern how the list is stored. It can be stored in array, link list or other data structure.

我主要寻找有效的算法,而不是太关心列表的存储方式。它可以存储在数组,链接列表或其他数据结构中。

Thanks

EDIT: The list is sorted

编辑:列表已排序

13 个解决方案

#1


22  

Here are a few trade offs you can make. Let's assume that you have two sets of elements, S and T, drawn from a universe U. We want to determine if S≥T. In one of the given examples, we have

以下是您可以做出的一些权衡。假设您有两组元素S和T,它们来自宇宙U.我们想确定S≥T。在给出的一个例子中,我们有

S={1,2,3,4}
T={3,4,5}
U={1,2,3,4,5}

S = {1,2,3,4} T = {3,4,5} U = {1,2,3,4,5}

1. Sorted Lists (or balanced search tree)
The method suggested by most posters. If you already have sorted lists, or don't care about the length of time it takes to create them (say, you're not doing that often), then this algorithm is basically linear time and space. This is usually the best option.

1.排序列表(或平衡搜索树)大多数海报建议的方法。如果你已经有了排序列表,或者不关心创建它们所花费的时间长度(比如,你不经常这样做),那么这个算法基本上是线性时间和空间。这通常是最好的选择。

(To be fair to other choices here, the time and space bounds should actually contain factors of "Log |U|" in appropriate places, but this is usually not relivant)

(为了公平对待其他选择,时间和空间界限实际上应该在适当的位置包含“Log | U |”因子,但这通常不是重复的)

Data structures: Sorted list for each of S and T. Or a balanced search tree (e.g. AVL tree, red-black tree, B+-tree) that can be iterated over in constant space.

数据结构:S和T中每一个的排序列表。或者可以在恒定空间中迭代的平衡搜索树(例如AVL树,红黑树,B +树)。

Algorithm: For each element in T, in order, search S linearly for that element. Remember where you left off each search, and start the next search there. If every search succeeds, then S≥T.

算法:对于T中的每个元素,按顺序,线性搜索该元素。记住每次搜索都停止的地方,并在那里开始下一次搜索。如果每次搜索都成功,那么S≥T。

Time complexity: about O( |S| Log|S| + |T| Log|T| ) to create the sorted lists, O( max(|S|, |T|) ) to compare.

时间复杂度:约O(| S | Log | S | + | T | Log | T |)创建排序列表,O(max(| S |,| T |))进行比较。

Space complexity: about O( |S| + |T| )

空间复杂度:约O(| S | + | T |)

Example (C++)

#include <set>
#include <algorithm>

std::set<int> create_S()
{
    std::set<int> S;
    // note: std::set will put these in order internally
    S.insert(3);
    S.insert(2);
    S.insert(4);
    S.insert(1);
    return S;
}

std::set<int> create_T()
{
    std::set<int> T;
    // note std::set will put these in order internally
    T.insert(4);
    T.insert(3);
    T.insert(5);
    return T;
}

int main()
{
    std::set<int> S=create_S();
    std::set<int> T=create_T();
    return std::includes(S.begin(),S.end(), T.begin(), T.end());
}

2. Hash tables
Better average time complexity than with a sorted list can be obtained using hash tables. The improved behavior for large sets comes at the cost of generally poorer performance for small sets.

2.散列表使用散列表可以获得比排序列表更好的平均时间复杂度。大型集合的改进行为的代价是小型集合的性能通常较差。

As with sorted lists, I'm ignoring the complexity contributed by the size of the universe.

与排序列表一样,我忽略了宇宙大小所带来的复杂性。

Data structure: Hash table for S, anything quickly iterable for T.

数据结构:S的哈希表,任何可以快速迭代的东西。

Algorithm: Insert each element of S into its hashtable. Then, for each element in T, check to see if it's in the hash table.

算法:将S的每个元素插入其哈希表中。然后,对于T中的每个元素,检查它是否在哈希表中。

Time complexity: O( |S| + |T| ) to set up, O( |T| ) to compare.

时间复杂度:设置O(| S | + | T |),比较O(| T |)。

Space complexity: O( |S| + |T| )

空间复杂度:O(| S | + | T |)

Example (C++)

#include <tr1/unordered_set>

std::tr1::unordered_set<int> create_S()
{
    std::tr1::unordered_set<int> S;
    S.insert(3);
    S.insert(2);
    S.insert(4);
    S.insert(1);
    return S;
}

std::tr1::unordered_set<int> create_T()
{
    std::tr1::unordered_set<int> T;
    T.insert(4);
    T.insert(3);
    T.insert(5);
    return T;
}

bool includes(const std::tr1::unordered_set<int>& S, 
              const std::tr1::unordered_set<int>& T)
{
    for (std::tr1::unordered_set<int>::const_iterator iter=T.begin();
         iter!=T.end();
         ++iter)
    {
        if (S.find(*iter)==S.end())
        {
            return false;
        }
    }
    return true;
}

int main()
{
    std::tr1::unordered_set<int> S=create_S();
    std::tr1::unordered_set<int> T=create_T();
    return includes(S,T);
}

3. Bit sets
If your universe is particularly small (let's say you can only have elements 0-32), then a bitset is a reasonable solution. The running time (again, assuming you don't care about setup time) is essentially constant. In the case you do care about setup, it's still faster than creating a sorted list.

3.位集如果你的宇宙特别小(假设你只能有元素0-32),那么bitset是一个合理的解决方案。运行时间(再次,假设您不关心设置时间)基本上是不变的。如果您关心设置,它仍然比创建排序列表更快。

Unfortunately, bitsets become unwieldy very quickly for even a moderately sized universe.

不幸的是,即使是中等大小的宇宙,bitsets也很快变得笨拙。

Data structure: bit vector (usually a machine integer) for each of S and T. We might encode S=11110 and T=00111, in the given example.

数据结构:S和T中的每一个的位向量(通常是机器整数)。在给定的示例中,我们可以编码S = 11110和T = 00111。

Algorithm: Calculate the intersection, by computing the bitwise 'and' of each bit in S with the corresponding bit in T. If the result equals T, then S≥T.

算法:通过计算S中每个位的按位'和'与T中的相应位来计算交点。如果结果等于T,则S≥T。

Time complexity: O( |U| + |S| + |T| ) to setup, O( |U| ) to compare.

时间复杂度:设置O(| U | + | S | + | T |),比较O(| U |)。

Space complexity: O( |U| )

空间复杂度:O(| U |)

Example: (C++)

#include <bitset>

// bitset universe always starts at 0, so create size 6 bitsets for demonstration.
// U={0,1,2,3,4,5}

std::bitset<6> create_S()
{
    std::bitset<6> S;
    // Note: bitsets don't care about order
    S.set(3);
    S.set(2);
    S.set(4);
    S.set(1);
    return S;
}

std::bitset<6> create_T()
{
    std::bitset<6> T;
    // Note: bitsets don't care about order
    T.set(4);
    T.set(3);
    T.set(5);
    return T;
}

int main()
{
    std::bitset<6> S=create_S();
    std::bitset<6> T=create_T();

    return S & T == T;
}

4. Bloom filters
All the speed benefits of bitsets, without the pesky limitation on universe size the bitsets have. Only one down side: they sometimes (often, if you're not careful) give the wrong answer: If the algorithm says "no", then you definitely don't have inclusion. If the algorithm says "yes", you might or might not. Better accuracy is attained by choosing a large filter size, and good hash functions.

4. Bloom过滤器bitset的所有速度优势,没有bitset所具有的宇宙大小的令人讨厌的限制。只有一个缺点:他们有时(通常,如果你不小心)给出错误的答案:如果算法说“不”,那么你肯定没有包含。如果算法说“是”,您可能会也可能不会。通过选择较大的滤波器大小和良好的散列函数可以获得更高的精度。

Given that they can and will give wrong answers, Bloom filters might sound like a horrible idea. However, they have definite uses. Generally one would use Bloom filters to do many inclusion checks quickly, and then use a slower deterministic method to guarantee correctness when needed. The linked Wikipedia article mentions some applications using Bloom filters.

鉴于他们可以而且会给出错误的答案,Bloom过滤器可能听起来像一个可怕的想法。但是,它们有明确的用途。通常,人们会使用Bloom过滤器快速执行许多包含检查,然后使用较慢的确定性方法来保证需要时的正确性。链接的*文章提到了一些使用Bloom过滤器的应用程序。

Data structure: A Bloom filter is a fancy bitset. Must choose a filter size, and hash functions beforehand.

数据结构:Bloom过滤器是一个花哨的bitset。必须事先选择过滤器大小和散列函数。

Algorithm (sketch): Initialize the bitset to 0. To add an element to a bloom filter, hash it with each hash function, and set the corresponding bit in the bitset. Determining inclusion works just as for bitsets.

算法(草图):将bitset初始化为0.要将一个元素添加到bloom过滤器,请使用每个哈希函数对其进行哈希处理,并在bitset中设置相应的位。确定包含就像bitsets一样。

Time complexity: O( filter size )

时间复杂度:O(过滤器大小)

Space complexity: O( filter size )

空间复杂度:O(过滤器大小)

Probability of correctness: Always correct if it answers for "S does not include T". Something like 0.6185^(|S|x|T|/(filter size))) if it answers "S includes T". In particular, the filter size must be chosen proportional to the product of |S| and |T| to give reasonable probability of accuracy.

正确性概率:如果答案为“S不包括T”,则始终正确。如果它回答“S包括T”,则类似于0.6185 ^(| S | x | T | /(过滤器大小)))。特别是,必须根据| S |的乘积选择滤波器大小和| T |给出合理的准确概率。

#2


15  

For C++, the best way is to use std::includes algorithm:

对于C ++,最好的方法是使用std :: includes算法:

#include <algorithm>

std::list<int> l1, l2;
...
// Test whether l2 is a subset of l1
bool is_subset = std::includes(l1.begin(), l1.end(), l2.begin(), l2.end());

This requires both lists to be sorted, as specified in your question. Complexity is linear.

这需要按照问题中的规定对两个列表进行排序。复杂性是线性的。

#3


10  

Just wanted to mention that Python has a method for this:

只是想提一下Python有一个方法:

return set(list2).issubset(list1)

Or:

return set(list2) <= set(list1)

#4


7  

If both lists are ordered, one simple solution would be to simultaneously go over both lists (with a two bump pointers in both lists), and verify that all of the elements in the second list appear in the first list (until all elements are found, or until you reach a larger number in the first list).

如果两个列表都是有序的,一个简单的解决方案是同时遍历两个列表(两个列表中都有两个凹凸指针),并验证第二个列表中的所有元素是否出现在第一个列表中(直到找到所有元素) ,或直到你在第一个列表中达到更大的数字)。

A pseudo-code in C++ would look something like this:

C ++中的伪代码看起来像这样:

List l1, l2;
iterator i1 = l1.start();
iterator i2 = l2.start();
while(i1 != l1.end() && i2 != l2.end()) {
  if (*i1 == *i2) {
    i1++;
    i2++;
  } else if (*i1 > *i2) {
    return false;
  } else {
    i1++;
  }
}
return true;

(It obviously won't work as is, but the idea should be clear).

(它显然不会按原样运作,但这个想法应该是明确的)。

If the lists are not ordered, you can use a hashtable - insert all of your elements in the first list, and then check if all of the elements in the second list appear in the hashtable.

如果未对列表进行排序,则可以使用哈希表 - 在第一个列表中插入所有元素,然后检查第二个列表中的所有元素是否都显示在哈希表中。

These are algorithmic answers. In different languages, there are default built-in methods to check this.

这些是算法的答案。在不同的语言中,有默认的内置方法来检查这一点。

#5


3  

If you're concerned about ordering or continuity, you may need to use the Boyer-Moore or the Horspool algorithm.

如果您担心排序或连续性,您可能需要使用Boyer-Moore或Horspool算法。

The question is, do you want to consider [2, 1] to be a subset of [1, 2, 3]? Do you want [1, 3] to be considered a subset of [1, 2, 3]? If the answer is no to both of these, you might consider one of the algorithms linked above. Otherwise, you may want to consider a hash set.

问题是,你想考虑[2,1]是[1,2,3]的子集吗?你想[1,3]被认为是[1,2,3]的一个子集吗?如果答案都不是这两个,您可以考虑上面链接的算法之一。否则,您可能需要考虑哈希集。

#6


3  

Scala, assuming you mean subsequence by subset:

Scala,假设您的子集是子序列:

def is_subset[A,B](l1: List[A], l2: List[B]): Boolean =
  (l1 indexOfSeq l2) > 0

Anyway, a subsequence is just a substring problem. Optimal algorithms include Knuth-Morris-Pratt and Boyer-Moore, and a few more complex ones.

无论如何,子序列只是一个子串问题。最佳算法包括Knuth-Morris-Pratt和Boyer-Moore,以及一些更复杂的算法。

If you truly meant subset, though, and thus you are speaking of Sets and not Lists, you can just use the subsetOf method in Scala. Algorithms will depend on how the set is stored. The following algorithm works for a list storage, which is a very suboptimal one.

但是,如果你真正意味着子集,因此你说的是集合而不是列表,你可以在Scala中使用subsetOf方法。算法将取决于集合的存储方式。以下算法适用于列表存储,这是非常不理想的。

def is_subset[A,B](l1: List[A], l2: List[B]): Boolean = (l1, l2) match {
  case (_, Nil) => true
  case (Nil, _) => false
  case (h1 :: t1, h2 :: t2) if h1 == h2 => is_subset(t1, t2)
  case (_ :: tail, list) => is_subset(tail, list)
}

#7


3  

For indexOfSeq in scala trunk I implemented KMP, which you can examine: SequenceTemplate

对于scala trunk中的indexOfSeq,我实现了KMP,您可以检查它:SequenceTemplate

#8


1  

If you're ok with storing the data in a hashset you can simply check whether list1 contains x for each x in list2. Which will be close to O(n) in the size of list2. (Of course you can also do the same with other datastructures, but that will lead to different runtimes).

如果您可以将数据存储在哈希集中,则只需检查list1是否包含list2中每个x的x。这将与list2的大小接近O(n)。 (当然,您也可以对其他数据结构执行相同操作,但这会导致不同的运行时)。

#9


1  

This depends highly on the language/toolkit, as well as the size and storage of the lists.

这在很大程度上取决于语言/工具包,以及列表的大小和存储。

If the lists are sorted, a single loop can determine this. You can just start walking the larger list trying to find the first element of the smaller list (break if you pass it in value), then move on to the next, and continue from the current location. This is fast, since it's a one loop/one pass algorithm.

如果对列表进行排序,则单个循环可以确定这一点。您可以开始走大型列表,尝试查找较小列表的第一个元素(如果您将值传递给它,则中断),然后继续前进到下一个元素,并从当前位置继续。这很快,因为它是一个循环/一次通过算法。

For unsorted lists, it's often fastest to build some form of hash table from the first list's elements, then search each element in the second list off the hash. This is the approach that many of the .NET LINQ extensions use internally for item searching within a list, and scale quite well (although they have fairly large temporary memory requirements).

对于未排序的列表,从第一个列表的元素构建某种形式的哈希表通常最快,然后从哈希中搜索第二个列表中的每个元素。这是许多.NET LINQ扩展在内部用于列表中项目搜索的方法,并且可以很好地扩展(尽管它们具有相当大的临时内存要求)。

#10


1  

func isSubset ( @list, @possibleSubsetList ) {
    if ( size ( @possibleSubsetList ) > size ( @list ) ) {
        return false;
    }
    for ( @list : $a ) {
        if ( $a != @possibleSubsetList[0] ) {
            next;
        } else {
            pop ( @possibleSubsetList );
        }
    }
    if ( size ( @possibleSubsetList ) == 0 ) {
        return true;
    } else {
        return false;
    }
}

O(n) viola. of course, isSubset( (1,2,3,4,5), (2,4) ) will return true

O(n)中提琴。当然,isSubset((1,2,3,4,5),(2,4))将返回true

#11


0  

You should have a look at the implementation of STL method search. That is the C++ way I think this would be done.

您应该看一下STL方法搜索的实现。这就是我认为可以完成的C ++方式。

http://www.sgi.com/tech/stl/search.html

Description:

Search finds a subsequence within the range [first1, last1) that is identical to [first2, last2) when compared element-by-element.

当逐个元素进行比较时,搜索在[first1,last1]范围内查找与[first2,last2]相同的子序列。

#12


0  

You can see the problem to check if a list is a subset of another list as the same problem to verify if a substring belongs to a string. The best known algorithm for this is the KMP (Knuth-Morris-Pratt). Look at wikipedia for a pseudo-code or just use some String.contains method available in the language of your preference. =)

您可以看到问题,以检查列表是否是另一个列表的子集作为相同的问题来验证子字符串是否属于字符串。最着名的算法是KMP(Knuth-Morris-Pratt)。查看*的伪代码,或者只使用您喜欢的语言中的一些String.contains方法。 =)

#13


-1  

The efficient algorithm uses some kind of state machine, where you keep the accepting states in memory (in python):

高效的算法使用某种状态机,你在内存中保持接受状态(在python中):

def is_subset(l1, l2):
    matches = []
    for e in l1:
        # increment
        to_check = [0] + [i+1 for i in matches]
        matches = [] # nothing matches
        for i in to_check:
            if l2[i] = e:
                if i == len(l2)-1:
                    return True
                matches.append(i)
    return False

EDIT: of course if the list are sorted, you don't need that algorithm, just do:

编辑:当然,如果列表已排序,您不需要该算法,只需:

def is_subset(l1, l2):
    index = 0
    for e in l1:
        if e > l2[index]:
            return False
        elif e == l2[index]:
            index += 1
        else:
            index == 0
        if index == len(l2):
            return True
    return False

#1


22  

Here are a few trade offs you can make. Let's assume that you have two sets of elements, S and T, drawn from a universe U. We want to determine if S≥T. In one of the given examples, we have

以下是您可以做出的一些权衡。假设您有两组元素S和T,它们来自宇宙U.我们想确定S≥T。在给出的一个例子中,我们有

S={1,2,3,4}
T={3,4,5}
U={1,2,3,4,5}

S = {1,2,3,4} T = {3,4,5} U = {1,2,3,4,5}

1. Sorted Lists (or balanced search tree)
The method suggested by most posters. If you already have sorted lists, or don't care about the length of time it takes to create them (say, you're not doing that often), then this algorithm is basically linear time and space. This is usually the best option.

1.排序列表(或平衡搜索树)大多数海报建议的方法。如果你已经有了排序列表,或者不关心创建它们所花费的时间长度(比如,你不经常这样做),那么这个算法基本上是线性时间和空间。这通常是最好的选择。

(To be fair to other choices here, the time and space bounds should actually contain factors of "Log |U|" in appropriate places, but this is usually not relivant)

(为了公平对待其他选择,时间和空间界限实际上应该在适当的位置包含“Log | U |”因子,但这通常不是重复的)

Data structures: Sorted list for each of S and T. Or a balanced search tree (e.g. AVL tree, red-black tree, B+-tree) that can be iterated over in constant space.

数据结构:S和T中每一个的排序列表。或者可以在恒定空间中迭代的平衡搜索树(例如AVL树,红黑树,B +树)。

Algorithm: For each element in T, in order, search S linearly for that element. Remember where you left off each search, and start the next search there. If every search succeeds, then S≥T.

算法:对于T中的每个元素,按顺序,线性搜索该元素。记住每次搜索都停止的地方,并在那里开始下一次搜索。如果每次搜索都成功,那么S≥T。

Time complexity: about O( |S| Log|S| + |T| Log|T| ) to create the sorted lists, O( max(|S|, |T|) ) to compare.

时间复杂度:约O(| S | Log | S | + | T | Log | T |)创建排序列表,O(max(| S |,| T |))进行比较。

Space complexity: about O( |S| + |T| )

空间复杂度:约O(| S | + | T |)

Example (C++)

#include <set>
#include <algorithm>

std::set<int> create_S()
{
    std::set<int> S;
    // note: std::set will put these in order internally
    S.insert(3);
    S.insert(2);
    S.insert(4);
    S.insert(1);
    return S;
}

std::set<int> create_T()
{
    std::set<int> T;
    // note std::set will put these in order internally
    T.insert(4);
    T.insert(3);
    T.insert(5);
    return T;
}

int main()
{
    std::set<int> S=create_S();
    std::set<int> T=create_T();
    return std::includes(S.begin(),S.end(), T.begin(), T.end());
}

2. Hash tables
Better average time complexity than with a sorted list can be obtained using hash tables. The improved behavior for large sets comes at the cost of generally poorer performance for small sets.

2.散列表使用散列表可以获得比排序列表更好的平均时间复杂度。大型集合的改进行为的代价是小型集合的性能通常较差。

As with sorted lists, I'm ignoring the complexity contributed by the size of the universe.

与排序列表一样,我忽略了宇宙大小所带来的复杂性。

Data structure: Hash table for S, anything quickly iterable for T.

数据结构:S的哈希表,任何可以快速迭代的东西。

Algorithm: Insert each element of S into its hashtable. Then, for each element in T, check to see if it's in the hash table.

算法:将S的每个元素插入其哈希表中。然后,对于T中的每个元素,检查它是否在哈希表中。

Time complexity: O( |S| + |T| ) to set up, O( |T| ) to compare.

时间复杂度:设置O(| S | + | T |),比较O(| T |)。

Space complexity: O( |S| + |T| )

空间复杂度:O(| S | + | T |)

Example (C++)

#include <tr1/unordered_set>

std::tr1::unordered_set<int> create_S()
{
    std::tr1::unordered_set<int> S;
    S.insert(3);
    S.insert(2);
    S.insert(4);
    S.insert(1);
    return S;
}

std::tr1::unordered_set<int> create_T()
{
    std::tr1::unordered_set<int> T;
    T.insert(4);
    T.insert(3);
    T.insert(5);
    return T;
}

bool includes(const std::tr1::unordered_set<int>& S, 
              const std::tr1::unordered_set<int>& T)
{
    for (std::tr1::unordered_set<int>::const_iterator iter=T.begin();
         iter!=T.end();
         ++iter)
    {
        if (S.find(*iter)==S.end())
        {
            return false;
        }
    }
    return true;
}

int main()
{
    std::tr1::unordered_set<int> S=create_S();
    std::tr1::unordered_set<int> T=create_T();
    return includes(S,T);
}

3. Bit sets
If your universe is particularly small (let's say you can only have elements 0-32), then a bitset is a reasonable solution. The running time (again, assuming you don't care about setup time) is essentially constant. In the case you do care about setup, it's still faster than creating a sorted list.

3.位集如果你的宇宙特别小(假设你只能有元素0-32),那么bitset是一个合理的解决方案。运行时间(再次,假设您不关心设置时间)基本上是不变的。如果您关心设置,它仍然比创建排序列表更快。

Unfortunately, bitsets become unwieldy very quickly for even a moderately sized universe.

不幸的是,即使是中等大小的宇宙,bitsets也很快变得笨拙。

Data structure: bit vector (usually a machine integer) for each of S and T. We might encode S=11110 and T=00111, in the given example.

数据结构:S和T中的每一个的位向量(通常是机器整数)。在给定的示例中,我们可以编码S = 11110和T = 00111。

Algorithm: Calculate the intersection, by computing the bitwise 'and' of each bit in S with the corresponding bit in T. If the result equals T, then S≥T.

算法:通过计算S中每个位的按位'和'与T中的相应位来计算交点。如果结果等于T,则S≥T。

Time complexity: O( |U| + |S| + |T| ) to setup, O( |U| ) to compare.

时间复杂度:设置O(| U | + | S | + | T |),比较O(| U |)。

Space complexity: O( |U| )

空间复杂度:O(| U |)

Example: (C++)

#include <bitset>

// bitset universe always starts at 0, so create size 6 bitsets for demonstration.
// U={0,1,2,3,4,5}

std::bitset<6> create_S()
{
    std::bitset<6> S;
    // Note: bitsets don't care about order
    S.set(3);
    S.set(2);
    S.set(4);
    S.set(1);
    return S;
}

std::bitset<6> create_T()
{
    std::bitset<6> T;
    // Note: bitsets don't care about order
    T.set(4);
    T.set(3);
    T.set(5);
    return T;
}

int main()
{
    std::bitset<6> S=create_S();
    std::bitset<6> T=create_T();

    return S & T == T;
}

4. Bloom filters
All the speed benefits of bitsets, without the pesky limitation on universe size the bitsets have. Only one down side: they sometimes (often, if you're not careful) give the wrong answer: If the algorithm says "no", then you definitely don't have inclusion. If the algorithm says "yes", you might or might not. Better accuracy is attained by choosing a large filter size, and good hash functions.

4. Bloom过滤器bitset的所有速度优势,没有bitset所具有的宇宙大小的令人讨厌的限制。只有一个缺点:他们有时(通常,如果你不小心)给出错误的答案:如果算法说“不”,那么你肯定没有包含。如果算法说“是”,您可能会也可能不会。通过选择较大的滤波器大小和良好的散列函数可以获得更高的精度。

Given that they can and will give wrong answers, Bloom filters might sound like a horrible idea. However, they have definite uses. Generally one would use Bloom filters to do many inclusion checks quickly, and then use a slower deterministic method to guarantee correctness when needed. The linked Wikipedia article mentions some applications using Bloom filters.

鉴于他们可以而且会给出错误的答案,Bloom过滤器可能听起来像一个可怕的想法。但是,它们有明确的用途。通常,人们会使用Bloom过滤器快速执行许多包含检查,然后使用较慢的确定性方法来保证需要时的正确性。链接的*文章提到了一些使用Bloom过滤器的应用程序。

Data structure: A Bloom filter is a fancy bitset. Must choose a filter size, and hash functions beforehand.

数据结构:Bloom过滤器是一个花哨的bitset。必须事先选择过滤器大小和散列函数。

Algorithm (sketch): Initialize the bitset to 0. To add an element to a bloom filter, hash it with each hash function, and set the corresponding bit in the bitset. Determining inclusion works just as for bitsets.

算法(草图):将bitset初始化为0.要将一个元素添加到bloom过滤器,请使用每个哈希函数对其进行哈希处理,并在bitset中设置相应的位。确定包含就像bitsets一样。

Time complexity: O( filter size )

时间复杂度:O(过滤器大小)

Space complexity: O( filter size )

空间复杂度:O(过滤器大小)

Probability of correctness: Always correct if it answers for "S does not include T". Something like 0.6185^(|S|x|T|/(filter size))) if it answers "S includes T". In particular, the filter size must be chosen proportional to the product of |S| and |T| to give reasonable probability of accuracy.

正确性概率:如果答案为“S不包括T”,则始终正确。如果它回答“S包括T”,则类似于0.6185 ^(| S | x | T | /(过滤器大小)))。特别是,必须根据| S |的乘积选择滤波器大小和| T |给出合理的准确概率。

#2


15  

For C++, the best way is to use std::includes algorithm:

对于C ++,最好的方法是使用std :: includes算法:

#include <algorithm>

std::list<int> l1, l2;
...
// Test whether l2 is a subset of l1
bool is_subset = std::includes(l1.begin(), l1.end(), l2.begin(), l2.end());

This requires both lists to be sorted, as specified in your question. Complexity is linear.

这需要按照问题中的规定对两个列表进行排序。复杂性是线性的。

#3


10  

Just wanted to mention that Python has a method for this:

只是想提一下Python有一个方法:

return set(list2).issubset(list1)

Or:

return set(list2) <= set(list1)

#4


7  

If both lists are ordered, one simple solution would be to simultaneously go over both lists (with a two bump pointers in both lists), and verify that all of the elements in the second list appear in the first list (until all elements are found, or until you reach a larger number in the first list).

如果两个列表都是有序的,一个简单的解决方案是同时遍历两个列表(两个列表中都有两个凹凸指针),并验证第二个列表中的所有元素是否出现在第一个列表中(直到找到所有元素) ,或直到你在第一个列表中达到更大的数字)。

A pseudo-code in C++ would look something like this:

C ++中的伪代码看起来像这样:

List l1, l2;
iterator i1 = l1.start();
iterator i2 = l2.start();
while(i1 != l1.end() && i2 != l2.end()) {
  if (*i1 == *i2) {
    i1++;
    i2++;
  } else if (*i1 > *i2) {
    return false;
  } else {
    i1++;
  }
}
return true;

(It obviously won't work as is, but the idea should be clear).

(它显然不会按原样运作,但这个想法应该是明确的)。

If the lists are not ordered, you can use a hashtable - insert all of your elements in the first list, and then check if all of the elements in the second list appear in the hashtable.

如果未对列表进行排序,则可以使用哈希表 - 在第一个列表中插入所有元素,然后检查第二个列表中的所有元素是否都显示在哈希表中。

These are algorithmic answers. In different languages, there are default built-in methods to check this.

这些是算法的答案。在不同的语言中,有默认的内置方法来检查这一点。

#5


3  

If you're concerned about ordering or continuity, you may need to use the Boyer-Moore or the Horspool algorithm.

如果您担心排序或连续性,您可能需要使用Boyer-Moore或Horspool算法。

The question is, do you want to consider [2, 1] to be a subset of [1, 2, 3]? Do you want [1, 3] to be considered a subset of [1, 2, 3]? If the answer is no to both of these, you might consider one of the algorithms linked above. Otherwise, you may want to consider a hash set.

问题是,你想考虑[2,1]是[1,2,3]的子集吗?你想[1,3]被认为是[1,2,3]的一个子集吗?如果答案都不是这两个,您可以考虑上面链接的算法之一。否则,您可能需要考虑哈希集。

#6


3  

Scala, assuming you mean subsequence by subset:

Scala,假设您的子集是子序列:

def is_subset[A,B](l1: List[A], l2: List[B]): Boolean =
  (l1 indexOfSeq l2) > 0

Anyway, a subsequence is just a substring problem. Optimal algorithms include Knuth-Morris-Pratt and Boyer-Moore, and a few more complex ones.

无论如何,子序列只是一个子串问题。最佳算法包括Knuth-Morris-Pratt和Boyer-Moore,以及一些更复杂的算法。

If you truly meant subset, though, and thus you are speaking of Sets and not Lists, you can just use the subsetOf method in Scala. Algorithms will depend on how the set is stored. The following algorithm works for a list storage, which is a very suboptimal one.

但是,如果你真正意味着子集,因此你说的是集合而不是列表,你可以在Scala中使用subsetOf方法。算法将取决于集合的存储方式。以下算法适用于列表存储,这是非常不理想的。

def is_subset[A,B](l1: List[A], l2: List[B]): Boolean = (l1, l2) match {
  case (_, Nil) => true
  case (Nil, _) => false
  case (h1 :: t1, h2 :: t2) if h1 == h2 => is_subset(t1, t2)
  case (_ :: tail, list) => is_subset(tail, list)
}

#7


3  

For indexOfSeq in scala trunk I implemented KMP, which you can examine: SequenceTemplate

对于scala trunk中的indexOfSeq,我实现了KMP,您可以检查它:SequenceTemplate

#8


1  

If you're ok with storing the data in a hashset you can simply check whether list1 contains x for each x in list2. Which will be close to O(n) in the size of list2. (Of course you can also do the same with other datastructures, but that will lead to different runtimes).

如果您可以将数据存储在哈希集中,则只需检查list1是否包含list2中每个x的x。这将与list2的大小接近O(n)。 (当然,您也可以对其他数据结构执行相同操作,但这会导致不同的运行时)。

#9


1  

This depends highly on the language/toolkit, as well as the size and storage of the lists.

这在很大程度上取决于语言/工具包,以及列表的大小和存储。

If the lists are sorted, a single loop can determine this. You can just start walking the larger list trying to find the first element of the smaller list (break if you pass it in value), then move on to the next, and continue from the current location. This is fast, since it's a one loop/one pass algorithm.

如果对列表进行排序,则单个循环可以确定这一点。您可以开始走大型列表,尝试查找较小列表的第一个元素(如果您将值传递给它,则中断),然后继续前进到下一个元素,并从当前位置继续。这很快,因为它是一个循环/一次通过算法。

For unsorted lists, it's often fastest to build some form of hash table from the first list's elements, then search each element in the second list off the hash. This is the approach that many of the .NET LINQ extensions use internally for item searching within a list, and scale quite well (although they have fairly large temporary memory requirements).

对于未排序的列表,从第一个列表的元素构建某种形式的哈希表通常最快,然后从哈希中搜索第二个列表中的每个元素。这是许多.NET LINQ扩展在内部用于列表中项目搜索的方法,并且可以很好地扩展(尽管它们具有相当大的临时内存要求)。

#10


1  

func isSubset ( @list, @possibleSubsetList ) {
    if ( size ( @possibleSubsetList ) > size ( @list ) ) {
        return false;
    }
    for ( @list : $a ) {
        if ( $a != @possibleSubsetList[0] ) {
            next;
        } else {
            pop ( @possibleSubsetList );
        }
    }
    if ( size ( @possibleSubsetList ) == 0 ) {
        return true;
    } else {
        return false;
    }
}

O(n) viola. of course, isSubset( (1,2,3,4,5), (2,4) ) will return true

O(n)中提琴。当然,isSubset((1,2,3,4,5),(2,4))将返回true

#11


0  

You should have a look at the implementation of STL method search. That is the C++ way I think this would be done.

您应该看一下STL方法搜索的实现。这就是我认为可以完成的C ++方式。

http://www.sgi.com/tech/stl/search.html

Description:

Search finds a subsequence within the range [first1, last1) that is identical to [first2, last2) when compared element-by-element.

当逐个元素进行比较时,搜索在[first1,last1]范围内查找与[first2,last2]相同的子序列。

#12


0  

You can see the problem to check if a list is a subset of another list as the same problem to verify if a substring belongs to a string. The best known algorithm for this is the KMP (Knuth-Morris-Pratt). Look at wikipedia for a pseudo-code or just use some String.contains method available in the language of your preference. =)

您可以看到问题,以检查列表是否是另一个列表的子集作为相同的问题来验证子字符串是否属于字符串。最着名的算法是KMP(Knuth-Morris-Pratt)。查看*的伪代码,或者只使用您喜欢的语言中的一些String.contains方法。 =)

#13


-1  

The efficient algorithm uses some kind of state machine, where you keep the accepting states in memory (in python):

高效的算法使用某种状态机,你在内存中保持接受状态(在python中):

def is_subset(l1, l2):
    matches = []
    for e in l1:
        # increment
        to_check = [0] + [i+1 for i in matches]
        matches = [] # nothing matches
        for i in to_check:
            if l2[i] = e:
                if i == len(l2)-1:
                    return True
                matches.append(i)
    return False

EDIT: of course if the list are sorted, you don't need that algorithm, just do:

编辑:当然,如果列表已排序,您不需要该算法,只需:

def is_subset(l1, l2):
    index = 0
    for e in l1:
        if e > l2[index]:
            return False
        elif e == l2[index]:
            index += 1
        else:
            index == 0
        if index == len(l2):
            return True
    return False