Reducing a string using grammar-like rules

I'm trying to find a suitable DP algorithm for simplifying a string. For example, I have the string a b a b and the following list of rules:

a b -> b

a b -> c

b a -> a

c c -> b

The goal is to find every single character that the given string can be reduced to using these rules. For this example the answer is b and c. The given string can be up to 200 symbols long. Could you suggest an efficient algorithm?

Rules are always of the form 2 -> 1 (two symbols are replaced by one). I had the idea of building a tree whose root is the given string and in which each child is the string after one transformation, but I'm not sure that's the best way.

3 Answers

#1

If you read those rules from right to left, they look exactly like the rules of a context-free grammar, and they have basically the same meaning. You could apply a chart-parsing algorithm such as the Earley algorithm to your data, along with a suitable starting rule, something like

start <- start a
       | start b
       | start c

and then just examine the parse forest for the shortest chain of starts. The worst case remains O(n^3), of course, but Earley parsers are fairly efficient in practice these days.
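To make the "read the rules right to left" idea concrete, here is a minimal sketch that flips each reduction into a grammar production. The helper name as_productions and the RULES dictionary layout are my own choices for illustration, not part of any parsing library.

```python
# Reduction rules from the question: a pair of symbols -> the symbols it may become.
RULES = {"ab": {"b", "c"}, "ba": {"a"}, "cc": {"b"}}


def as_productions(rules):
    """List each reduction rule as a grammar production, read right to left.

    Hypothetical helper: the reduction "a b -> c" becomes the production "c -> a b".
    """
    return sorted(
        f"{head} -> {pair[0]} {pair[1]}"
        for pair, heads in rules.items()
        for head in heads
    )


for production in as_productions(RULES):
    print(production)
```

Seen this way, the question becomes: which single symbols can derive the whole input string?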

You can also produce parse forests when parsing with derivatives. You might be able to efficiently check them for short chains of starts.

#2

For a DP problem, you always need to understand how the answer to a big problem can be built from the answers to smaller sub-problems. Assume you have a function simplify that is called with an input of length n. There are n-1 ways to split the input into a first part and a last part. For each of these splits, you recursively call simplify on both parts. The final answer for the input of length n is the set of all characters that the rules produce from a pair (c1, c2), where c1 is an answer for the first part and c2 is an answer for the last part.

In Python, this can be implemented like so:

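from functools import lru_cache

rules = {'ab': set('bc'), 'ba': set('a'), 'cc': set('b')}
all_chars = set(c for cc in rules.values() for c in cc)

@lru_cache(maxsize=None)  # memoize: each distinct substring is solved only once
def simplify(s):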
    if len(s) == 1:  # base case to end recursion
        return set(s)

    possible_chars = set()

    # iterate over all the possible splits of s
    for i in range(1, len(s)):
        head = s[:i]
        tail = s[i:]

        # check all possible combinations of answers of sub-problems
        for c1 in simplify(head):
            for c2 in simplify(tail):
                possible_chars.update(rules.get(c1+c2, set()))

                # speed hack
                if possible_chars == all_chars: #  won't get any bigger
                    return all_chars

    return possible_chars

Quick check:

In [53]: simplify('abab')
Out[53]: {'b', 'c'}

To make this fast enough for large strings (that is, to avoid exponential behavior), you should use a memoizing decorator (for example functools.lru_cache from the standard library). This is a critical step in solving DP problems; without it you are just doing a brute-force calculation. A further tiny speedup can be obtained by returning from the function as soon as possible_chars == all_chars, since at that point you already know that every possible outcome can be generated.

Analysis of running time: an input of length n has 2 substrings of length n-1, 3 substrings of length n-2, ..., n substrings of length 1, for a total of O(n^2) subproblems. Thanks to the memoization, the function body runs at most once per subproblem. A single subproblem takes O(n) steps because of the loop for i in range(1, len(s)) (treating the alphabet size as a constant and ignoring the cost of slicing and hashing the substrings), so the overall running time is at most O(n^3).
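The subproblem count can be checked directly. The sketch below is a variant of the same recursion keyed on index pairs (i, j) instead of substring copies, which also sidesteps the slicing cost. The names simplify_indexed and solve are hypothetical, and the early-exit speed hack is left out so that every subproblem is actually visited.

```python
from functools import lru_cache

RULES = {"ab": {"b", "c"}, "ba": {"a"}, "cc": {"b"}}


def simplify_indexed(s, rules=RULES):
    """Return (reachable single chars, number of subproblems solved)."""
    if not s:
        return frozenset(), 0

    @lru_cache(maxsize=None)
    def solve(i, j):
        # Characters that the slice s[i:j] can be reduced to.
        if j - i == 1:
            return frozenset(s[i])
        found = set()
        for k in range(i + 1, j):  # split into s[i:k] and s[k:j]
            for c1 in solve(i, k):
                for c2 in solve(k, j):
                    found |= rules.get(c1 + c2, set())
        return frozenset(found)

    result = solve(0, len(s))
    # currsize is the number of distinct (i, j) pairs that were computed.
    return result, solve.cache_info().currsize


chars, subproblems = simplify_indexed("abab")
print(sorted(chars), subproblems)
```

For 'abab' this reports 10 subproblems, which is n(n+1)/2 for n = 4, in line with the O(n^2) count above.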

#3

Let N be the length of the given string and R the number of rules.

Expanding the tree top-down takes time that grows factorially with N in the worst case (for example, an input string of the form aaa... with the rule aa -> a).

Proof:

A string of length L has L-1 adjacent pairs, and each pair can be rewritten in at most R ways. So the root has at most (N-1)R children, each of those has at most (N-2)R children, and so on down to the leaves of length 1. Level k of the tree therefore holds at most (N-1)(N-2)...(N-k) R^k nodes, so there are at most (N-1)! R^(N-1) leaves and the total number of nodes is O((N-1)! R^(N-1)). With the input aaa... and the single rule aa -> a, every pair matches, so the tree really does have (N-1)! leaves.
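To get a feel for how quickly the naive tree grows, here is a small sketch that counts how many times the top-down expansion reaches a single character. The function name count_leaves is my own, and it only counts; it does not collect results.

```python
def count_leaves(s, rules):
    """Count how often the naive expansion reaches a length-1 string (with repetition)."""
    if len(s) == 1:
        return 1
    leaves = 0
    for i in range(len(s) - 1):
        # Try every replacement for the pair at position i.
        for c in rules.get(s[i:i + 2], ()):
            leaves += count_leaves(s[:i] + c + s[i + 2:], rules)
    return leaves


for n in range(2, 8):
    print(n, count_leaves("a" * n, {"aa": "a"}))
```

For the input 'a' * n with the single rule aa -> a, the count is (n-1)!, for example 24 for n = 5, so this approach is hopeless for strings of 200 symbols.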

Recursive Java implementation of this naive approach:

public static void main(String[] args) {
    Map<String, Character[]> rules = new HashMap<String, Character[]>() {{
        put("ab", new Character[]{'b', 'c'});
        put("ba", new Character[]{'a'});
        put("cc", new Character[]{'b'});
    }};
    System.out.println(simplify("abab", rules));
}

public static Set<String> simplify(String in, Map<String, Character[]> rules) {
    Set<String> result = new HashSet<String>();
    simplify(in, rules, result);
    return result;
}

private static void simplify(String in, Map<String, Character[]> rules, Set<String> result) {
    if (in.length() == 1) {
        result.add(in);
    }
    for (int i = 0; i < in.length() - 1; i++) {
        String two = in.substring(i, i + 2);
        Character[] rep = rules.get(two);
        if (rep != null) {
            for (Character c : rep) {
                simplify(in.substring(0, i) + c + in.substring(i + 2, in.length()), rules, result);
            }
        }
    }
}

Bas Swinckels's O(RN^3) Java implementation (with HashMap as a memoization cache):

public static Set<String> simplify2(final String in, Map<String, Character[]> rules) {
    Map<String, Set<String>> cache = new HashMap<String, Set<String>>();
    return simplify2(in, rules, cache);
}

private static Set<String> simplify2(final String in, Map<String, Character[]> rules, Map<String, Set<String>> cache) {
    final Set<String> cached = cache.get(in);
    if (cached != null) {
        return cached;
    }
    Set<String> ret = new HashSet<String>();
    if (in.length() == 1) {
        ret.add(in);
        return ret;
    }
    for (int i = 1; i < in.length(); i++) {
        String head = in.substring(0, i);
        String tail = in.substring(i, in.length());
        for (String c1 : simplify2(head, rules, cache)) { // pass the shared cache so results for head are memoized too
            for (String c2 : simplify2(tail, rules, cache)) {
                Character[] rep = rules.get(c1 + c2);
                if (rep != null) {
                    for (Character c : rep) {
                        ret.add(c.toString());
                    }
                }
            }
        }
    }
    cache.put(in, ret);
    return ret;
}

Output in both approaches:

[b, c]
