类型检查模式匹配的算法?

How do you determine whether a given pattern is "good", specifically whether it is exhaustive and non-overlapping, for ML-style programming languages?

您如何确定一个给定的模式是否“好”，特别是对于ml风格的编程语言来说，它是彻底的还是不重叠的?

Suppose you have patterns like:

假设你有这样的模式:

match lst with
  x :: y :: [] -> ...
  [] -> ...

or:

或者:

match lst with
  x :: xs -> ...
  x :: [] -> ...
  [] -> ...

A good type checker would warn that the first is not exhaustive and the second is overlapping. How would the type checker make those kinds of decisions in general, for arbitrary data types?

一个好的类型检查器会警告第一个不是详尽的，第二个是重叠的。对于任意的数据类型，类型检查器通常如何做出这些类型的决定?

3 个解决方案

#1

Here's a sketch of an algorithm. It's also the basis of Lennart Augustsson's celebrated technique for compiling pattern matching efficiently. (The paper is in that incredible FPCA proceedings (LNCS 201) with oh so many hits.) The idea is to reconstruct an exhaustive, non-redundant analysis by repeatedly splitting the most general pattern into constructor cases.

这是一个算法的草图。这也是Lennart Augustsson著名的高效编译模式匹配技术的基础。(这篇论文是在令人难以置信的FPCA进程中(lncs201)，有这么多的点击。)这个想法是通过反复将最一般的模式分解成构造函数的例子来重构一个详尽的、非冗余的分析。

In general, the problem is that your program has a possibly empty bunch of ‘actual’ patterns {p₁, .., p_n}, and you want to know if they cover a given ‘ideal’ pattern q. To kick off, take q to be a variable x. The invariant, initially satisfied and subsequently maintained, is that each p_i is σ_iq for some substitution σ_i mapping variables to patterns.

一般来说，问题是您的程序可能有一堆“实际”模式{p1， .. ..pn },你想知道他们是否覆盖给定的‘理想’模式。开始,问一个变量x。不变,随后保持最初满意,是每一个πσiq替换σi变量映射到模式。

How to proceed. If n=0, the bunch is empty, so you have a possible case q that isn't covered by a pattern. Complain that the ps are not exhaustive. If σ₁ is an injective renaming of variables, then p₁ catches every case that matches q, so we're warm: if n=1, we win; if n>1 then oops, there's no way p₂ can ever be needed. Otherwise, we have that for some variable x, σ₁x is a constructor pattern. In that case split the problem into multiple subproblems, one for each constructor c_j of x's type. That is, split the original q into multiple ideal patterns q_j = [x:=c_j y₁ .. y_{arity(c_j)}]q, and refine the patterns accordingly for each q_j to maintain the invariant, dropping those that don't match.

如何进行。如果n=0，那么这一串是空的，所以你有一个可能的情况q没有被一个模式覆盖。抱怨ps不够详尽。如果σ1单射重命名变量,然后p1捕获每一个相匹配的情况下q,所以我们温暖:如果n = 1,我们赢了;如果n>1，那么，就不需要p2了。否则，对于某个变量x, 1x是一个构造函数模式。在这种情况下，将问题分解为多个子问题，每个子问题对应一个构造函数cj (x的类型)。也就是说，将原来的q分解为多个理想模式qj = [x:=cj y1 ..yarity(cj)]q，并对每个qj进行相应的优化，以保持不变，删除不匹配的。

Let's take the example with {[], x :: y :: zs} (using :: for cons). We start with

让我们以{[]，x:: y:: zs}(使用::for cons)为例。我们开始

  xs covering  {[], x :: y :: zs}

and we have [xs := []] making the first pattern an instance of the ideal. So we split xs, getting

我们有[xs:=[]]使第一个模式成为理想的实例。所以我们分开x，得到。

  [] covering {[]}
  x :: ys covering {x :: y :: zs}

The first of these is justified by the empty injective renaming, so is ok. The second takes [x := x, ys := y :: zs], so we're away again, splitting ys, getting.

第一个是由空的注入重命名来证明的，所以是可以的。第二步是(x:= x, y = y:: zs)，所以我们又离开了，分开了，得到了。

  x :: [] covering {}
  x :: y :: zs covering {x :: y :: zs}

and we can see from the first subproblem that we're banjaxed.

我们可以从第一个子问题中看出，我们是banjaxed。

The overlap case is more subtle and allows for variations, depending on whether you want to flag up any overlap, or just patterns which are completely redundant in a top-to-bottom priority order. Your basic rock'n'roll is the same. E.g., start with

重叠的情况更加微妙，并且允许变化，这取决于您是否想要标记任何重叠，或者仅仅是在一个从上到下的优先级顺序中完全冗余的模式。你的基本摇滚乐是一样的。例如,开始

  xs covering {[], ys}

with [xs := []] justifying the first of those, so split. Note that we have to refine ys with constructor cases to maintain the invariant.

用[xs:=[]]来证明第一个，所以分裂。注意，我们必须通过构造函数实例来改进ys，以保持不变。

  [] covering {[], []}
  x :: xs covering {y :: ys}

Clearly, the first case is strictly an overlap. On the other hand, when we notice that refining an actual program pattern is needed to maintain the invariant, we can filter out those strict refinements that become redundant and check that at least one survives (as happens in the :: case here).

显然，第一种情况是完全重叠的。另一方面，当我们注意到需要改进一个实际的程序模式来维护不变量时，我们可以过滤掉那些变得冗余的严格的改进，并检查至少一个存在(就像在这里发生的那样)。

So, the algorithm builds a set of ideal exhaustive overlapping patterns q in a way that's motivated by the actual program patterns p. You split the ideal patterns into constructor cases whenever the actual patterns demand more detail of a particular variable. If you're lucky, each actual pattern is covered by disjoint nonempty sets of ideal patterns and each ideal pattern is covered by just one actual pattern. The tree of case splits which yield the ideal patterns gives you the efficient jump-table driven compilation of the actual patterns.

因此，该算法构建了一组理想的完全重叠模式q，这是由实际的程序模式p所激发的。如果幸运的话，每个实际的模式都由不相交的非空集所覆盖，每个理想模式都只包含一个实际的模式。生成理想模式的case分割树为您提供了实际模式的高效跳表驱动编译。

The algorithm I've presented is clearly terminating, but if there are datatypes with no constructors, it can fail to accept that the empty set of patterns is exhaustive. This is a serious issue in dependently typed languages, where exhaustiveness of conventional patterns is undecidable: the sensible approach is to allow "refutations" as well as equations. In Agda, you can write (), pronounced "my Aunt Fanny", in any place where no constructor refinement is possible, and that absolves you from the requirement to complete the equation with a return value. Every exhaustive set of patterns can be made recognizably exhaustive by adding in enough refutations.

我给出的算法显然是终止的，但是如果有没有构造函数的数据类型，那么它就不能接受空的模式集是详尽的。这是一个严重的问题，在依赖类型的语言中，传统模式的穷尽性是不可决定的:明智的方法是允许“反驳”和方程式。在Agda中，您可以编写()、读“my Aunt Fanny”，在任何不可能进行构造函数优化的地方，这将使您免除完成具有返回值的等式的要求。每一套完整的模式都可以通过添加足够的批驳来实现。

Anyhow, that's the basic picture.

总之，这就是基本情况。

#2

Here is some code from a non-expert. It shows what the problem looks like if you restrict your patterns to list constructors. In other words, the patterns can only be used with lists that contain lists. Here are some lists like that: [], [[]], [[];[]].

下面是一些非专家的代码。它显示了如果将模式限制在列表构造函数中，会出现什么问题。换句话说，模式只能与包含列表的列表一起使用。这里有一些这样的列表:[]、[[]]、[[]]、[[]]。

If you enable -rectypes in your OCaml interpreter, this set of lists has a single type: ('a list) as 'a.

如果您在OCaml解释器中启用-rectypes，那么这组列表有一个类型:('a list)为'a。

type reclist = ('a list) as 'a

Here's a type for representing patterns that match against the reclist type:

这是一种表示与reclist类型匹配的模式的类型:

type p = Nil | Any | Cons of p * p

To translate an OCaml pattern into this form, first rewrite using (::). Then replace [] with Nil, _ with Any, and (::) with Cons. So the pattern [] :: _ translates to Cons (Nil, Any)

要将OCaml模式转换为这种形式，请首先使用(:):然后用Nil替换[]，用Any，和(::)with Cons。所以模式[]:_转换为Cons (Nil, Any)

Here is a function that matches a pattern against a reclist:

这里有一个函数，它与reclist的模式相匹配:

let rec pmatch (p: p) (l: reclist) =
    match p, l with
    | Any, _ -> true
    | Nil, [] -> true
    | Cons (p', q'), h :: t -> pmatch p' h && pmatch q' t
    | _ -> false

Here's how it looks in use. Note the use of -rectypes:

这是它的用途。注意-rectypes的使用:

$ ocaml312 -rectypes
        Objective Caml version 3.12.0

# #use "pat.ml";;
type p = Nil | Any | Cons of p * p
type reclist = 'a list as 'a
val pmatch : p -> reclist -> bool = <fun>
# pmatch (Cons(Any, Nil)) [];;
- : bool = false
# pmatch (Cons(Any, Nil)) [[]];;
- : bool = true
# pmatch (Cons(Any, Nil)) [[]; []];;
- : bool = false
# pmatch (Cons (Any, Nil)) [ [[]; []] ];;
- : bool = true
#

The pattern Cons (Any, Nil) should match any list of length 1, and it definitely seems to be working.

模式错误(Any, Nil)应该匹配任何长度为1的列表，而且它显然是有效的。

So then it seems fairly straightforward to write a function intersect that takes two patterns and returns a pattern that matches the intersection of what is matched by the two patterns. Since the patterns might not intersect at all, it returns None when there's no intersection and Some p otherwise.

因此，写一个函数的交集似乎很简单，它需要两种模式，并返回匹配这两种模式的交集的模式。由于模式可能不相交，所以当没有交集时，它将返回None，而另一些则没有返回。

let rec inter_exc pa pb =
    match pa, pb with
    | Nil, Nil -> Nil
    | Cons (a, b), Cons (c, d) -> Cons (inter_exc a c, inter_exc b d)
    | Any, b -> b
    | a, Any -> a
    | _ -> raise Not_found

let intersect pa pb =
    try Some (inter_exc pa pb) with Not_found -> None

let intersectn ps =
    (* Intersect a list of patterns.
     *)
    match ps with
    | [] -> None
    | head :: tail ->
        List.fold_left
            (fun a b -> match a with None -> None | Some x -> intersect x b)
            (Some head) tail

As a simple test, intersect the pattern [_, []] against the pattern [[], _]. The former is the same as _ :: [] :: [], and so is Cons (Any, Cons (Nil, Nil)). The latter is the same as [] :: _ :: [], and so is Cons (Nil, (Cons (Any, Nil)).

作为一个简单的测试，将模式[_，[]]与模式相关联[[]，_]。前者与_:[]:[]:[]:[]:[][]，所以是反对(任何，反对(Nil, Nil))。后者与[]::[]:[]，以及Cons (Nil， (Any, Nil))相同。

# intersect (Cons (Any, Cons (Nil, Nil))) (Cons (Nil, Cons (Any, Nil)));;
- : p option = Some (Cons (Nil, Cons (Nil, Nil)))

The result looks pretty right: [[], []].

结果看起来相当正确:[[]]。

It seems like this is enough to answer the question about overlapping patterns. Two patterns overlap if their intersection is not None.

这似乎足以回答关于重叠模式的问题。如果它们的交集不是没有重叠的，则有两种模式重叠。

For exhaustiveness you need to work with a list of patterns. Here is a function exhaust that tests whether a given list of patterns is exhaustive:

对于穷尽性，您需要使用一个模式列表。下面是一个函数排气，它测试给定的模式列表是否详尽:

let twoparts l =
    (* All ways of partitioning l into two sets.
     *)
    List.fold_left
        (fun accum x ->
            let absent = List.map (fun (a, b) -> (a, x :: b)) accum
            in
                List.fold_left (fun accum (a, b) -> (x :: a, b) :: accum)
                    absent accum)
        [([], [])] l

let unique l =
   (* Eliminate duplicates from the list.  Makes things
    * faster.
    *)
   let rec u sl=
        match sl with
        | [] -> []
        | [_] -> sl
        | h1 :: ((h2 :: _) as tail) ->
            if h1 = h2 then u tail else h1 :: u tail
    in
        u (List.sort compare l)

let mkpairs ps =
    List.fold_right
        (fun p a -> match p with Cons (x, y) -> (x, y) :: a | _ -> a) ps []

let rec submatches pairs =
    (* For each matchable subset of fsts, return a list of the
     * associated snds.  A matchable subset has a non-empty
     * intersection, and the intersection is not covered by the rest of
     * the patterns.  I.e., there is at least one thing that matches the
     * intersection without matching any of the other patterns.
     *)
    let noncovint (prs, rest) =
        let prs_firsts = List.map fst prs in
        let rest_firsts = unique (List.map fst rest) in
        match intersectn prs_firsts with
        | None -> false
        | Some i -> not (cover i rest_firsts)
    in let pairparts = List.filter noncovint (twoparts pairs)
    in
        unique (List.map (fun (a, b) -> List.map snd a) pairparts)

and cover_pairs basepr pairs =
    cover (fst basepr) (unique (List.map fst pairs)) &&
        List.for_all (cover (snd basepr)) (submatches pairs)

and cover_cons basepr ps =
    let pairs = mkpairs ps
    in let revpair (a, b) = (b, a)
    in
        pairs <> [] &&
        cover_pairs basepr pairs &&
        cover_pairs (revpair basepr) (List.map revpair pairs)

and cover basep ps =
    List.mem Any ps ||
        match basep with
        | Nil -> List.mem Nil ps
        | Any -> List.mem Nil ps && cover_cons (Any, Any) ps
        | Cons (a, b) -> cover_cons (a, b) ps

let exhaust ps =
    cover Any ps

A pattern is like a tree with Cons in the internal nodes and Nil or Any at the leaves. The basic idea is that a set of patterns is exhaustive if you always reach Any in at least one of the patterns (no matter what the input looks like). And along the way, you need to see both Nil and Cons at each point. If you reach Nil at the same spot in all the patterns, it means there's a longer input that won't be matched by any of them. On the other hand, if you see just Cons at the same spot in all the patterns, there's an input that ends at that point that won't be matched.

一个模式就像一个树，内部节点和Nil或任何在叶节点。基本的想法是，如果您总是在至少一个模式中(无论输入是什么样子)，那么一组模式是详尽的。在这个过程中，你需要在每个点看到Nil和Cons。如果你在所有模式中都在同一个点到达Nil，这意味着有一个更长的输入，不会被它们中的任何一个匹配。另一方面，如果你在所有模式中都看到相同的点，就会有一个输入端在那个点上结束，这是不匹配的。

The difficult part is checking for exhaustiveness of the two subpatterns of a Cons. This code works the way I do when I check by hand: it finds all the different subsets that could match at the left, then makes sure that the corresponding right subpatterns are exhaustive in each case. Then the same with left and right reversed. Since I'm a nonexpert (more obvious to me all the time), there are probably better ways to do this.

困难的部分是检查穷尽性的两个子模式缺点。这段代码的工作方式我做手工检查:它找到的所有不同的子集,可以匹配在左边,然后确保相应的子模式在每个案例详尽。左边和右边都是一样的。因为我是一个非专家(对我来说一直都很明显)，可能有更好的方法来做这件事。

Here is a session with this function:

这是一个有这个功能的会议:

# exhaust [Nil];;
- : bool = false
# exhaust [Any];;
- : bool = true
# exhaust [Nil; Cons (Nil, Any); Cons (Any, Nil)];;
- : bool = false
# exhaust [Nil; Cons (Any, Any)];;
- : bool = true
# exhaust [Nil; Cons (Any, Nil); Cons (Any, (Cons (Any, Any)))];;
- : bool = true

I checked this code against 30,000 randomly generated patterns, and so I have some confidence that it's right. I hope these humble observations may prove to be of some use.

我检查了3万个随机生成的模式的代码，所以我有信心它是正确的。我希望这些谦虚的意见能被证明是有用的。

#3

-1

I believe the pattern sub-language is simple enough that it's easy to analyze. This is the reason for requiring patterns to be "linear" (each variable can appear only once), and so on. With these restrictions, every pattern is a projection from a kind of nested tuple space to a restricted set of tuples. I don't think it's too difficult to check for exhaustiveness and overlap in this model.

我认为模式子语言很简单，很容易分析。这就是要求模式为“线性”的原因(每个变量只能出现一次)，等等。有了这些限制，每个模式都是从一种嵌套的tuple空间到一组受限制的元组的投影。我不认为在这个模型中检查穷尽和重叠是很困难的。

#1