部分匹配长度的正则表达式-字符串相似度。

Say I have the string "Torcellite" and another string "Tor" - the similarity length of these two strings is 3 since both of them begin with "Tor". Now another string "christmas" and "mas" would have a similarity of 0 since they do not begin with the same set of characters.

假设我有一个字符串“Torcellite”和另一个字符串“Tor”——这两个字符串的相似长度是3，因为它们都是从“Tor”开始的。现在，另一个字符串“christmas”和“mas”的相似度为0，因为它们开头的字符不相同。

In both cases, the second string is a suffix of the first string.

在这两种情况下，第二个字符串都是第一个字符串的后缀。

A clearer example:

一个清晰的例子:

Length of string: 1 to 10^5

字符串的长度:1到10 ^ 5

String: abaabc

字符串:abaabc

Suffixes: abaabc, baabc, aabc, abc, bc, c

后缀:abaabc, baabc, aabc, abc, bc, c。

Similarity: abaabc, none, a, ab, none, none

相似度:abaabc, none, a, ab, none, none

Similarity Length: 6, 0, 1, 2, 0, 0

相似长度:6、0、1、2、0、0

Answer: 6+0+1+2+0+0 = 9

答:6 + 0 + 1 + 2 + 0 + 0 = 9

I have an inefficient logic to find these partial suffixes matches using regex.

使用regex查找这些部分后缀匹配的逻辑效率很低。

Algorithm:

算法:

Find all the substrings of the given string.
找到给定字符串的所有子字符串。

Make a pattern from the substrings of the suffixes.

从后缀的子字符串中创建一个模式。

for(int i=1; i<substrings[i].length; i++) {
    Pattern p = Pattern.compile("^"+substrings[i].substring(0, i));
    Matcher m = p.find(string); //the given string for which similarities need to be  calculated
    if(m.find())
        similaryLengths +=  i;
}

The complexity for this becomes roughly O(n^2) since I need to run through the string for suffixes and then the substrings for patterns.

大致的复杂性成为O(n ^ 2)因为我需要运行为后缀,然后通过字符串的子模式。
I've thought of using grouping in the pattern to find the groups but I'm unsure what the regex would look like. What I have in mind is for the very first substring is: ((((((a)b)a)a)b)c) and then find the longest group match.

我考虑过在模式中使用分组来查找组，但我不确定regex会是什么样子。我想到的第一个子字符串是:((((((a)b)a) b)c)然后找到最长的组匹配。

Is there a more efficient algorithm that can achieve his?

有没有更有效的算法可以实现他的?

7 个解决方案

#1

The best approach, by far, would be to build a suffix tree on the input string. Building suffix trees takes only O(n) time where n is the length of the string. A suffix tree consists logically of a tree in which all the suffixes of the string can be found by walking from the root to each leaf. You can read Wikipedia for more details on how these trees work.

到目前为止，最好的方法是在输入字符串上构建一个后缀树。构建后缀树只需要O(n)个时间，其中n是字符串的长度。后缀树逻辑上由树组成，其中所有字符串的后缀都可以通过从根到每个叶子来找到。你可以阅读*了解这些树是如何工作的。

Essentially, a suffix tree will allow you to simply recast your current problem as one of "finding" the original string in the suffix tree. As you walk down the tree, you count the number of suffixes in each subtree, and multiply by your current match length to determine your score. This "search" takes O(n) time too.

从本质上讲，后缀树将允许您简单地将当前问题重新定义为“查找”后缀树中的原始字符串。当您沿着树向下走时，您将计算每个子树中后缀的数量，并将当前匹配长度乘以以确定您的分数。这种“搜索”也需要花费大量的时间。

So the final result is that you can solve the problem in guaranteed O(n) time and O(n) space, with O(n) preprocessing time. This is pretty efficient! And, there are no "worst cases" that produce quadratic behaviour. With that you can probably handle strings up to 10^7 in length easily.

所以最终的结果是你可以在O(n)时间和O(n)空间中，用O(n)预处理时间来解决问题。这是非常有效的!而且，也没有产生二次行为的“最坏情况”。你可以处理字符串到10 ^ 7长容易。

The only difficulty in implementation will be building the suffix tree, but you can find freely available code for that.

实现的唯一困难是构建后缀树，但是您可以找到免费的代码。

#2

Simliar algorithm as already posted by Valdar Moridin, but without the need to create substrings (every call to substring will create a new String object that contains a copy of the specified range of the char[] of its source). This won't improve the time complexity, but probably reduces the total runtime for a constant factor:

Simliar算法已经由Valdar Moridin发布，但是不需要创建子字符串(每个对子字符串的调用都将创建一个新的String对象，该对象包含其源的char[]的指定范围的副本)。这不会提高时间复杂度，但可能会减少一个常数因子的总运行时:

public static int partialSuffixMatch(CharSequence input) {
    int count = input.length();
    for (int i = 1; i < input.length(); i++) {
        for (int a = 0, b = i; b < input.length(); a++, b++) {
            if (input.charAt(a) != input.charAt(b))
                break;
            count++;
        }
    }
    return count;
}

After a short warmup, this algorithm processes a String with 10,000 equal characters in about 40 ms on my computer, and with 100,000 equal characters in about 4 seconds.

在做了一个简短的热身之后，这个算法在我的计算机上处理了一个包含10000个相等字符的字符串，在大约4秒的时间内处理了100,000个相等的字符。

#3

This is how I would do what you've described above. I don't know what this is supposed to accomplish, but since you've specified that only the start of the strings need to match, even though it's O(n^2), most of the time it will not be running for anywhere near the full length of n. Worst case is obviously a string like "aaaaaaaaaaaaaaaaa". This takes less than 5 seconds to process a string of 60,000 'a' characters on my machine.

这就是我如何做你上面描述的。我不知道这是什么应该完成,但由于你指定只需要匹配字符串的开始,即使它是O(n ^ 2),大部分时间它不会竞选接近的全部长度n。最糟糕的情况就像“aaaaaaaaaaaaaaaaa”显然是一个字符串。在我的机器上，处理60000个“a”字符所需的时间不到5秒。

I don't see any need to involve the overhead of generating and compiling regular expressions for a strict prefix match. Have I missed the point?

我认为不需要为严格的前缀匹配生成和编译正则表达式。我没说重点吗?

int similarity(String input) {
    int count = 0;
    for (int i = 0; i < input.length() ; i++) {
        String sub = input.substring(i);
        for (int j = 0; j < sub.length(); j++) {
            if (input.charAt(j) != sub.charAt(j))
                break;
            count++;
        }
    }
    return count;
}

#4

From the abaabc example, I gather you are trying to find all substrings that match the start of the original string. This can be done with a single regular expression, a bit similar to the pattern you proposed. Of course, that regex will be proportional in length to the original string. The regex itself is very straightforward; it represents the entire string, but the string's tail (of arbitrary length) is optional. Effectively, this regex matches any prefix of the string. For a string abcdef, the regex is:

从abaabc示例中，我猜您正在尝试查找与原始字符串的开始匹配的所有子字符串。这可以通过一个正则表达式完成，有点类似于您提出的模式。当然，regex将与原始字符串成比例。regex本身非常简单;它表示整个字符串，但是字符串的尾部(任意长度)是可选的。实际上，这个regex匹配字符串的任何前缀。对于字符串abcdef, regex是:

(?=(a(?:b(?:c(?:d(?:ef?)?)?)?)?))

Notes:

注:

I used (?: ... ) for every subpattern except the outer one, to avoid a lot of unnecessary captures.
我用(?:……)对于每个子模式(除了外部模式)，避免大量不必要的捕获。
I used the look-ahead pattern (?= ... ) because matches can (and will) be overlapping. Without it, the very first match (being the entire string abcdef) will consume the entire input, causing all other possible matches to be skipped.
我使用了前面的模式(?=…)因为匹配可以(也将)重叠。没有它，第一个匹配(即整个字符串abcdef)将消耗整个输入，导致所有其他可能的匹配被跳过。

Of course, abcdef is not an interesting example; it has no repeating substrings, so the regex has only one match, which is the entire string, abcdef. Your example abaabc is nicer, so I made a fiddle for it. As pointed out by you, it finds 3 matches: abaabc, a, ab.
http://regex101.com/r/vJ8uQ9/1

当然，abcdef不是一个有趣的例子;它没有重复的子字符串，所以regex只有一个匹配，即整个字符串abcdef。你的例子abaabc比较好，所以我为它做了一个小提琴。正如您所指出的，它找到了3个匹配项:abaabc, a, ab. http://regex101.com/r/vJ8uQ9/1。

Feel free to play around with it, but don't forget, for every change in the test string, you need to change the regex accordingly. For long strings, this becomes tedious. Fortunately, a simple recursive program can generate a regex for any given string.

您可以随意使用它，但不要忘记，对于测试字符串中的每一个更改，您都需要相应地更改regex。对于长字符串，这就变得很乏味了。幸运的是，一个简单的递归程序可以为任何给定的字符串生成一个regex。

function generateRegex(string input)
{
    return input.substring(0, 1) +
           (input.length > 2 ? "(?:" + generateRegex(input.substring(1)) + ")" : input.substring(1)) +
           "?";
}

string myRegex = "(?=(" + generateRegex(myInput) + "))";

I had no Java test environment at hand, but I did test it in JavaScript.
Fiddle: http://jsfiddle.net/gqehcjf9/1/

我手头没有Java测试环境，但是我用JavaScript进行了测试。小提琴:http://jsfiddle.net/gqehcjf9/1/

Performance seems OK (less than a second for a string of 9000 characters), but I did get a 'regular expression too complex' exception when testing against a string of more than 9361 characters (Firefox 31.0). I hope Java's regex engine is less restrictive. If not, then there's one possible optimization. If you are pretty sure that repeating substrings are never longer than, say, 1000 characters, then you might consider generating a regex for only the first 1000 characters of the string. You will be missing part of the first match (i.e. the entire string), but correcting that is a no-brainer.

性能看起来还可以(对于9000个字符的字符串，不到一秒钟)，但是在对超过9361个字符的字符串(Firefox 31.0)进行测试时，我得到了一个“正则表达式太复杂”的异常。我希望Java的regex引擎不那么严格。如果没有，那么有一个可能的优化。如果您非常确定重复子字符串的长度永远不会超过1000个字符，那么您可以考虑仅为字符串的前1000个字符生成一个regex。你将会错过第一场比赛的一部分(也就是整个字符串)，但是纠正这一点是很容易的。

#5

How does this perform on your dataset?

这对您的数据集如何执行?

int sum = s.length;
for (int i = 1; i < s.length; i++) {
    for (int j = i; j < s.length; j++) {
        for (int k = 0; k < s.length - j; k++) {
            if (s.charAt(i+k) != s.charAt(j+k)) break;
            sum++;
        }
    }
}

Instead of iterating i you could find the next occurrence of s.charAt(0).

你可以找到下一个出现的s。charat(0)而不是迭代i。

#6

Please kindly try the below methods. I have tested this one.
Any input string (length from 1~10^5), the execution time is less time 20ms in my PC.

请尝试以下方法。我测试过这个。任何输入字符串(长度从1 ~ 10 ^ 5),执行时间更少的时间20女士在我的电脑。

public static int oneTry(CharSequence input) {
    int tail = input.length();
    for (int i = 1; i < input.length(); i++) {
        if (input.charAt(i) == input.charAt(0)) {
            tail = i;
            break;
        }
    }

    int count = 0;

    int head = 0;
    int next = 0;
    int base = 0;
    int two = -1;
    boolean start = false;
    boolean end = false;
    for (int i = tail; i < input.length(); i++) {
        if (input.charAt(i) == input.charAt(next)) {

            count++;

            if (next>0 && !start && input.charAt(i) == input.charAt(0)) {
                base = 1;
                start = true;
            }

            if (start) {
                if (!end && input.charAt(i) == input.charAt(head)) {
                    count = count + base;
                    head++;
                    head = head < tail ? head : 0;
                    if(head == 0) {
                        base++;
                    }
                } else {
                    end = true;
                }

                if(end) {
                    if(two <0 && input.charAt(i) == input.charAt(0)) {
                        two = i;
                    }
                }
            }

            next++;

            if(i==input.length()-1 && two > 0) {
                i = two - 1;

                next = 0;
                base = 0;
                two = -1;
                start = false;
                end = false;
                head = 0;
            }

        } else {
            if(two > 0) {
                i = two - 1;

                next = 0;
                base = 0;
                two = -1;
                start = false;
                end = false;
                head = 0;
            } else {
                if(end || !start) {
                    if(input.charAt(i) == input.charAt(0)) i--;

                    next = 0;
                    base = 0;
                    two = -1;
                    start = false;
                    end = false;
                    head = 0;
                } else {
                    i--;

                    next = next - tail;
                    base = base -1;
                    two = -1;
                    start = base==0 ? false : true;
                    end = false;
                    //head = 0;
                }                   
            }
        }
    }
    count = count + input.length();
    return count;
}

#7

From my opinion, whatever method you choose to implement it, it normally will has its own worst case.
The difference is that the performance for its worst case.
For example, I have tested the isnot2bad's method , my first implementation(oneTry) & my second implemenation(secondTry) in my PC.
The test results for worst case is:
isnot2bad's method: ~330s (2*10^5), ~74s (10^5) , ~0.8 (10^4), ~0.01(10^3)
my first implementation(oneTry):~200s(2*10^5), ~ 45s(10^5) , ~0.5s(10^4), ~0.01(10^3)
my second implemenation(secondTry): ~4s (10^6), ~ 0.4s(10^5),~0.05s(10^4),~0.007(10^3)

在我看来，无论您选择什么方法来实现它，它通常都会有自己最坏的情况。不同之处在于最坏情况下的表现。例如，我测试了isnot2bad方法、我的第一个实现(oneTry)和我的PC中的第二个实现(secondTry)。坏的情况下的测试结果是:isnot2bad方法:~(2 * 10 ^ 5),330年代~ 74年代(10 ^ 5)~ 0.8(10 ^ 4)~ 0.01(10 ^ 3)我的第一个实现(oneTry):~ 200年代(2 * 10 ^ 5),45岁~(10 ^ 5)~ 0.5秒(10 ^ 4)~ 0.01(10 ^ 3)我的第二个implemenation(secondTry):~ 4 s(10 ^ 6)~ 0.4秒(10 ^ 5)~ 0.05秒(10 ^ 4)~ 0.007(10 ^ 3)

From the test results, we can see the worst performance time for "secondTry" is nearly linear with string length, while others are nearly square with string length.

从测试结果可以看出，“secondTry”的最差性能时间与字符串长度接近线性，而其他的与字符串长度接近正方形。

The idea for secondTry implementation is like this:
For any string input T(T0...Tn-1, len=n), the total string's similarity value(St) is the sum of every single char's similarity value(Si) among the string S.
e.g.: St = S0 + ...+Si+...+Sn-1
Obviously, the total number of T0 in substring [T0...Ti]>= Si >=1
The exact value of Si is equal to the total number of T0 in substring [T0...Ti], which continues matching to Ti.
For example: T="aabaab", then T2='b', only T0('a') can continue to T2, while T1('a') can't continue to T2. Therefore, S2=1
Therefore, we need to keep track wich T0 is continued (if yes, keep it in the Array, if not, remove it from the Array). Then, it's easy to calculate every Ti's similarity.
Meanwhile, in order to improve the performance, we don't need to check every conitnuing T0. Acutally, for some T0, they can be combiled together.
Because they're belong to the repeat pattern.(it could be long pattern, or short pattern).
For example:
ababababab... : T0,T2,T4,T6... can be combied together as whole one.
aaaaaaaaaa... : T0,T1,T2,T3... can be combied together as whole one.
aaaabaaaabaaaab...:
T0,T5,T10,T15... can be combied together as whole one.
T1,T2,T3 can be combied together as whole one.
T6,T7,T8 can be combied together as whole one.
...

secondTry实现的想法如下:对于任何字符串输入T(T0…)Tn-1, len=n)，总字符串的相似值(St)是字符串s中每个char的相似值(Si)的总和。显然，子字符串中T0的总数[T0…>= Si >=1 Si的确切值等于子串中T0的总数[T0…，继续与Ti匹配。例如:T="aabaab"，则T2='b'，只有T0('a')能继续到T2，而T1('a')不能继续到T2。因此，S2=1，因此，我们需要继续跟踪T0(如果是，将它保存在数组中，如果不是，将它从数组中删除)。然后，很容易计算出每个Ti的相似度。同时，为了提高性能，我们不需要检查每个coniting T0。对于某些T0，它们可以合并在一起。因为它们属于重复模式。(可以是长模式，也可以是短模式)。例如:ababababab……:T0,T2,T4,T6…可以作为完整的组合在一起。aaaaaaaaaa……:T0,T1、T2、T3……可以作为一个整体来描述。aaaabaaaabaaaab……:T0、T5、T10,T15……可以作为一个整体来描述。T1 T2 T3可以作为一个整体组合在一起。T6 T7 T8可以作为一个整体组合在一起。

The detailed implementation code is shown as below. Hope someone can post their best implementation & test results for this topic. Thanks.

详细的实现代码如下所示。希望有人能发布他们关于这个主题的最佳实现和测试结果。谢谢。

    public static List<ANode> anodes = null;
    public static List<ANode> tnodes = null;
    public static void checkANodes(CharSequence input, int num) {
        tnodes = new Vector<ANode>(); 
        for(int i=anodes.size()-1; i>=0; i--) {
            ANode anode = anodes.get(i);
            if(input.charAt(num) == input.charAt(num-anode.pos)) {
                tnodes.add(anode);
            }else {
                if(tnodes.size() > 0) {
                    // ok to do the changes
                    ANode after = tnodes.get(tnodes.size()-1);
                    tnodes.remove(after);
                    if(after.c > 1) {
                        tnodes.add(new ANode(after.pos + after.shift, after.shift ,after.c-1)); 
                        tnodes.add(new ANode(after.pos, after.pos-anode.pos + anode.shift,1));
                    }else {
                        tnodes.add(new ANode(after.pos, after.pos-anode.pos + anode.shift,1));  
                    }
                }
            }
        }

        anodes.clear();
        for(int i=tnodes.size() - 1; i >= 0; i--) {
            anodes.add(tnodes.get(i));
        }
    }

    public static int secondTry(CharSequence input) {
        anodes = new Vector<ANode>();

        int start = 0;
        for (int i = 1; i < input.length(); i++) {
            if (input.charAt(i) == input.charAt(0)) {
                start = i;
                break;
            }
        }

        int count = 0;
        int base = 0;
        for (int i = start; i < input.length(); i++) {
            checkANodes(input, i);
            if(input.charAt(0) == input.charAt(i)) {
                if(anodes.size() == 0) {
                    anodes.add(new ANode(i,  i, 1));
                }else {
                    ANode last = anodes.get(anodes.size()-1);
                    int shift = i - last.pos;
                    int mod = shift % last.shift;
                    if(mod == 0) {
                        last.c++;
                    }else {
                        anodes.add(new ANode(i, mod, 1));
                    }
                }
            }

            base = 0;
            for(ANode anode : anodes) {
                base = base + anode.c;
            }           
            count = count + base;
        }

        count = count + input.length();
        return count;
    }

public class ANode {
    public int pos = 0;
    public int c = 1;
    public int shift = 0;

    public ANode(int pos, int shift, int c) {
        this.pos = pos;
        this.shift = shift;
        this.c = c;
    }
}

#1