Using regular expressions (or another Python module) to compare text/characters?

Let's say my program receives a string that may contain any kind of character, for example 'Bob's Bagel Shop'. Then it receives another string, 'Fred's Bagel Store'. How can I use regular expressions or some other module in Python to compare the two and have my program tell me whether at least 5 (or any number I choose) characters are the same anywhere in both strings, all in the same order, such as the word 'Bagel'?

Thanks.

3 solutions

#1

The Python standard library class difflib.SequenceMatcher can help solve your problem. Here's a code sample:

from difflib import SequenceMatcher

s1 = "Bob's Bagel Shop"
s2 = "Bill's Bagel Shop"

matcher = SequenceMatcher(a=s1, b=s2)
match = matcher.find_longest_match(0, len(s1), 0, len(s2))

Result:

Match(a=3, b=4, size=13)  # value held by the 'match' variable

The result shows that the two strings share a common substring 13 characters long, starting at index 3 in the first string and index 4 in the second (indices are zero-based).

You can read the fields of this match object directly:

match.size  # 13
match.a     # 3
match.b     # 4
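
To tie this back to the original question ("at least 5 characters, all in the same order"), here is a minimal sketch built on the same find_longest_match call. The helper names longest_shared_run and shares_run, and the min_len parameter, are illustrative assumptions and not part of difflib; autojunk=False is passed only to make sure no characters are treated as junk on longer inputs.

```python
from difflib import SequenceMatcher


def longest_shared_run(s1, s2):
    """Return the longest contiguous block of characters found in both strings.

    Hypothetical helper for illustration; it is not part of difflib.
    """
    matcher = SequenceMatcher(None, s1, s2, autojunk=False)
    match = matcher.find_longest_match(0, len(s1), 0, len(s2))
    # match.a is the start index in s1, match.size the length of the run.
    return s1[match.a:match.a + match.size]


def shares_run(s1, s2, min_len=5):
    """True if the strings share at least min_len consecutive characters."""
    return len(longest_shared_run(s1, s2)) >= min_len


if __name__ == "__main__":
    a = "Bob's Bagel Shop"
    b = "Fred's Bagel Store"
    print(repr(longest_shared_run(a, b)))
    print(shares_run(a, b, min_len=5))
```

Note that this compares characters exactly, so case and whitespace count; normalize the inputs first (for example with str.lower()) if that is not what you want.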

#2

You can use itertools.combinations and then intersect the resulting sets to find the longest run of characters shared by both strings (whitespace is removed first):

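from itertools import combinations

str1 = "Bob's Bagel Shop"
str2 = "Fred's Bagel Store"

def combi(strs):
    # Drop all whitespace so matches can span word boundaries.
    chars = ''.join(strs.split())
    lis = []
    for x in range(1, len(chars)):
        for y in combinations(chars, x):
            # Keep only the combinations that occur as contiguous substrings.
            if ''.join(y) in chars:
                lis.append(''.join(y))
    return lis

lis1 = combi(str1)
lis2 = combi(str2)
# Longest string present in both lists (print() call form, as required by Python 3).
print(max(set(lis1).intersection(set(lis2)), key=len))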

Output:

'sBagelS
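
Because the combinations approach enumerates every subset of characters, its cost grows exponentially with the string length. A lighter variant of the same set-intersection idea, sketched below, builds only the contiguous substrings by slicing. This is an alternative illustration, not the answer author's code; the function names substrings and longest_shared are assumptions.

```python
def substrings(text):
    """Return the set of all contiguous substrings of text, ignoring whitespace."""
    chars = "".join(text.split())
    return {
        chars[i:j]
        for i in range(len(chars))
        for j in range(i + 1, len(chars) + 1)
    }


def longest_shared(s1, s2):
    """Longest substring present in both strings, or '' if they share nothing."""
    common = substrings(s1) & substrings(s2)
    return max(common, key=len, default="")


if __name__ == "__main__":
    print(longest_shared("Bob's Bagel Shop", "Fred's Bagel Store"))
```

This still creates a quadratic number of substrings, so it suits short strings such as shop names rather than long documents.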

#3

See

String similarity metrics in Python

or check out the simhash module:

http://bibliographie-trac.ub.rub.de/browser/simhash.py
