Python string comparison: pointing to where they differ

Time: 2023-02-10 22:59:21

I'm trying to compare two 1000-byte strings and would like to know exactly where the difference starts, i.e., from which byte the strings differ. Is there any function to determine it?

6 solutions

#1


4  

I've tried to test the answers given here, and I've come up with another solution that is faster in the usual cases, even though it is less elegant.

First of all, let's see how fast the proposed solutions are:

In [15]: def check_genexp(a, b):
    ...:     return next(idx for idx, c in enumerate(a) if c != b[idx])

In [16]: %timeit check_genexp("a"*9999 + "b", "a"*9999 + "c")
1000 loops, best of 3: 1.04 ms per loop

In [17]: from difflib import SequenceMatcher

In [18]: def check_matcher(a, b):
    ...:     return next(SequenceMatcher(a=a, b=b).get_matching_blocks())
    ...: 

In [19]: %timeit check_matcher("a"*9999+"b", "a"*9999+"c")
100 loops, best of 3: 11.5 ms per loop

As you can see, the genexp is a lot faster than difflib, but this is probably because SequenceMatcher does a lot more than find the first non-equal character.

Now, how could we speed things up? We can use "binary search"! The idea is that if two strings are not equal, then either their first halves differ or their second halves differ (or both, but in that case we only care about the first half, since we want the first differing index).

So we can do something like this:

def binary_check(a, b):
    len_a, len_b = len(a), len(b)
    if len_a == len_b:
        return binary_check_helper(a, b)
    # Different lengths: compare the common prefix; if it matches,
    # the strings first differ where the shorter one ends.
    min_length = min(len_a, len_b)
    res = binary_check_helper(a[:min_length], b[:min_length])
    return res if res >= 0 else min_length

def binary_check_helper(a, b):
    if a == b:
        return -1
    length = len(a)

    if length == 1:
        # a != b and both are single characters, so they differ at index 0.
        return 0
    else:
        half_length = length // 2
        r = binary_check_helper(a[:half_length], b[:half_length])
        if r >= 0:
            return r
        r = binary_check_helper(a[half_length:], b[half_length:])
        if r >= 0:
            return r + half_length
        return r
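As a quick sanity check, here is the same divide-and-conquer idea in a condensed, self-contained form (restated so the snippet runs on its own), exercising the unequal-length branch as well:

```python
def binary_check_helper(a, b):
    # -1 if equal, otherwise the first index where a and b differ
    # (a and b must have equal length here).
    if a == b:
        return -1
    if len(a) == 1:
        return 0
    half = len(a) // 2
    r = binary_check_helper(a[:half], b[:half])
    if r >= 0:
        return r
    r = binary_check_helper(a[half:], b[half:])
    return r + half if r >= 0 else r

def binary_check(a, b):
    # Compare the common prefix; if it matches, the strings first
    # differ where the shorter one ends (or nowhere if same length).
    n = min(len(a), len(b))
    r = binary_check_helper(a[:n], b[:n])
    return r if r >= 0 else (-1 if len(a) == len(b) else n)

print(binary_check("stop at the bus", "stop at the car"))  # -> 12
print(binary_check("abc", "abcdef"))                       # -> 3
print(binary_check("abc", "abc"))                          # -> -1
```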

And the result:

In [34]: %timeit binary_check("a"*9999 + "b", "a"*9999 + "c")
10000 loops, best of 3: 28.4 µs per loop

That's more than thirty-five times faster than the genexp!

Why does this work? The comparisons obviously take linear time, so it looks like we are doing a lot more work than before... and that's indeed true, but the work is done at the "C level", so this method ends up being faster in practice.

Note that this is somewhat implementation-specific: an implementation such as PyPy could probably optimize the genexp into a single C-level loop, and that would beat anything; on implementations like Jython or IronPython it could also be a lot slower than on CPython.

This method has the same asymptotic complexity as the other methods, i.e. O(n). The strings are split in half at most log_2(n) times, and each split performs an equality test, which takes linear time. At first sight it may seem to be a Θ(n log n) algorithm, but that's not the case. The recurrence is:

T(n) = T(n/2) + Θ(n) = Σ_{i=0}^{log n} Θ(n/2^i)
     = Θ(n(1 + 1/2 + 1/4 + ...)) <= Θ(2n) = Θ(n)
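To see concretely why the geometric sum stays below 2n, we can total the comparison costs along one recursion path (a toy illustration, not part of the original benchmark):

```python
def comparison_cost(n):
    # One level compares n characters, the next n // 2, and so on:
    # n + n/2 + n/4 + ... + 1 <= 2n.
    total, length = 0, n
    while length >= 1:
        total += length
        length //= 2
    return total

print(comparison_cost(1024))  # -> 2047, i.e. less than 2 * 1024
```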

Some more results:

In [37]: %timeit binary_check("a"*10**6 + "b", "a"*10**6 + "c")
100 loops, best of 3: 2.84 ms per loop

In [38]: %timeit check_genexp("a"*10**6 + "b", "a"*10**6 + "c")
10 loops, best of 3: 101 ms per loop

In [39]: %timeit binary_check(15 * "a"*10**6 + "b", 15 * "a"*10**6 + "c")
10 loops, best of 3: 53.3 ms per loop

In [40]: %timeit check_genexp(15 * "a"*10**6 + "b", 15 * "a"*10**6 + "c")
1 loops, best of 3: 1.5 s per loop

As you can see, even with huge strings this method is still about thirty times faster.

Note: the downside of this solution is that it is Θ(n) and not just O(n), i.e. it always reads the whole strings to return the result, even when the very first character already differs. In fact:

In [49]: a = "b" + "a" * 15 * 10**6
    ...: b = "c" + "a" * 15 * 10**6
    ...: 

In [50]: %timeit binary_check(a, b)
100 loops, best of 3: 10.3 ms per loop

In [51]: %timeit check_genexp(a, b)
1000000 loops, best of 3: 1.3 µs per loop

This is to be expected. However, it takes very little for this solution to become more performant than the explicit loop:

In [59]: a = "a" * 2 * 10**5 + "b" + "a" * 15*10**6
    ...: b = "a" * 2 * 10**5 + "c" + "a" * 15*10**6

In [60]: %timeit check_genexp(a, b)
10 loops, best of 3: 20.3 ms per loop

In [61]: %timeit binary_check(a, b)
100 loops, best of 3: 17.3 ms per loop

According to this simple benchmark, with a big string the binary check wins whenever the difference is farther in than about 1.3% of the total length.

It is also possible to introduce some heuristics. For example, if the minimum length of the two strings is greater than a certain cutoff value, you first check only whether the prefixes up to that cutoff differ; if they do, you can disregard everything after the cutoff, thus avoiding comparing the whole strings. This can be trivially implemented:

def binary_check2(a, b, cutoff=1000):
    len_a, len_b = len(a), len(b)
    if min(len_a, len_b) > cutoff:
        small_a, small_b = a[:cutoff], b[:cutoff]
        if small_a != small_b:
            # The first difference lies inside the prefixes.
            return binary_check_helper(small_a, small_b)
    # same as before

Depending on the application, you can choose a cutoff that minimizes the average time. In any case this is an ad hoc heuristic that may or may not work well, so if you are dealing with very long strings that have only short common prefixes you should use a "fail-fast" algorithm like the genexp approach.


Timings performed on Python 3.4. Using bytes instead of unicode strings doesn't change the results significantly.

#2


9  

Maybe use next plus a generator?

next(idx for idx,c in enumerate(your_string1) if c != your_string2[idx])

This will give you the index where the difference starts, and it will raise StopIteration if the strings are equal.
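If you'd rather not handle the exception, next also accepts a default value; for example (using -1 as a "no difference" marker is just a convention here):

```python
s1 = "stop at the bus"
s2 = "stop at the bus"

# The second argument to next() is returned instead of raising
# StopIteration when the generator is exhausted.
idx = next((i for i, c in enumerate(s1) if c != s2[i]), -1)
print(idx)  # -> -1
```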

It might even be slightly more elegant with itertools.izip (Python 2; on Python 3, plain zip is already lazy):

next(idx for idx,(c1,c2) in enumerate(izip(s1,s2)) if c1 != c2)

Example:

>>> from itertools import izip
>>> s1 = 'stop at the bus'
>>> s2 = 'stop at the car'
>>> next(idx for idx,(c1,c2) in enumerate(izip(s1,s2)) if c1 != c2)
12
>>> s1[12]
'b'
>>> s2[12]
'c'

#3


2  

If you want something more complex, you can have a look at SequenceMatcher.

It is a bit hairy, but very powerful. If you simply want to answer your question, then:

from difflib import SequenceMatcher

s1 = 'stop at the bus'
s2 = 'stop at the car'

s = SequenceMatcher(None, s1, s2)

print(s.get_matching_blocks()[0].size)

returns the solution :)

But if you want all the matches:

Small example:

from difflib import SequenceMatcher

s1 = 'stop at the bus'
s2 = 'stop at the car'

s = SequenceMatcher(None, s1, s2)

print(s.get_matching_blocks())

returns

[Match(a=0, b=0, size=12), Match(a=15, b=15, size=0)]

which means that the longest match in your strings is 12 characters long and starts at the beginning (index 0). There is also another match, starting at s1[15], of size 0: that is the dummy terminating block SequenceMatcher always appends.

For big strings like yours, this could be really interesting. :)

#4


2  

for i, (x, y) in enumerate(zip(a, b)):
    if x != y:
        print('First index where strings are different:', i)
        break
else:
    print('Strings are identical.')

In Python 2.x, zip() returns a list of tuples, not an iterator. As gnibbler pointed out, if you're using Python 2.x it might be worth your while to use izip rather than zip (izip returns a nice, memory-efficient iterator that avoids evaluating all of the values at once). As I said in the comments, though, in Python 3 izip has been renamed zip and the old zip is gone.
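One caveat worth noting: zip stops at the shorter input, so the loop above would report a string and its own prefix as identical. A small sketch that also handles that case (the function name is illustrative):

```python
def first_diff(a, b):
    # Compare character by character; zip stops at the shorter string.
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # Equal up to the shorter length: they differ where the shorter
    # string ends, or not at all if the lengths match.
    return -1 if len(a) == len(b) else min(len(a), len(b))

print(first_diff("stop at the bus", "stop at the car"))  # -> 12
print(first_diff("abc", "abcdef"))                       # -> 3
print(first_diff("abc", "abc"))                          # -> -1
```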

#5


1  

>>> s1 = 'stop at the bus'
>>> s2 = 'stop at the car'
>>> import difflib
>>> next(difflib.SequenceMatcher(a=s1, b=s2).get_matching_blocks())
Match(a=0, b=0, size=12)

This means that the first matching block is 12 characters long.

If either a or b isn't 0, the strings differ right from the beginning.

#6


1  

This might be overkill, but since you seem to be concerned about speed, you could consider using numpy. There are probably improvements to be made (for some reason, inlining made a 25 µs difference for me), but this is a first step:

>>> import string
>>> import numpy
>>> def diff_index(s1, s2):
...     s1 = numpy.fromstring(s1, dtype=numpy.uint8)
...     s2 = numpy.fromstring(s2, dtype=numpy.uint8)
...     return (~(s1 == s2)).nonzero()[0][0]
... 
>>> base = string.lowercase * 385  # Python 2; string.ascii_lowercase on Python 3
>>> s1 = base + 'a'
>>> s2 = base + 'z'
>>> diff_index(s1, s2)
10010
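Note that numpy.fromstring on text is a Python 2-era API and has since been deprecated; on Python 3 an equivalent sketch uses frombuffer over the encoded bytes:

```python
import numpy as np

def diff_index(s1, s2):
    # frombuffer gives a uint8 view of the encoded bytes without copying.
    a = np.frombuffer(s1.encode('ascii'), dtype=np.uint8)
    b = np.frombuffer(s2.encode('ascii'), dtype=np.uint8)
    # Index of the first position where the arrays disagree
    # (raises IndexError if the strings are equal).
    return int((a != b).nonzero()[0][0])

print(diff_index("stop at the bus", "stop at the car"))  # -> 12
```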

For differences at the end, this is a lot faster than a genexp:

>>> %timeit next(idx for idx,(c1,c2) in enumerate(izip(s1,s2)) if c1 != c2)
1000 loops, best of 3: 1.46 ms per loop
>>> %timeit diff_index(s1, s2)
10000 loops, best of 3: 87.6 us per loop

It's a lot slower for differences at the very beginning...

>>> s1 = 'a' + base
>>> s2 = 'z' + base
>>> %timeit next(idx for idx,(c1,c2) in enumerate(izip(s1,s2)) if c1 != c2)
100000 loops, best of 3: 2.12 us per loop
>>> %timeit diff_index(s1, s2)
10000 loops, best of 3: 87.5 us per loop

But on average, it wins by an order of magnitude:

>>> s1 = base[:5000] + 'a' + base[5000:]
>>> s2 = base[:5000] + 'z' + base[5000:]
>>> %timeit next(idx for idx,(c1,c2) in enumerate(izip(s1,s2)) if c1 != c2)
1000 loops, best of 3: 724 us per loop
>>> %timeit diff_index(s1, s2)
10000 loops, best of 3: 87.2 us per loop

If speed is not a concern, though, I'd personally go for mgilson's answer.
