I am trying to use difflib.SequenceMatcher to compute the similarity between two files. The two files are almost identical, except that one contains some extra whitespace and empty lines and the other doesn't. I am trying to use
s = difflib.SequenceMatcher(isjunk, text1, text2)
ratio = s.ratio()
for this purpose.
So the question is how to write the lambda expression for the isjunk argument so that SequenceMatcher discounts all whitespace, empty lines, etc. I tried passing lambda x: x == " ", but the result isn't great: for two closely similar texts, the ratio is very low. This is highly counterintuitive.
For testing purposes, here are the two strings you can use:
What Motivates jwovu to do your Job Well? OK, this is an entry trying to win $100 worth of software development books despite the fact that I don‘t read
programming books. In order to win the prize you have to write an entry and
what motivatesfggmum to do your job well. Hence this post. First motivation
money. I know, this doesn‘t sound like a great inspiration to many, and saying that money is one of the motivation factors might just blow my chances away.
As if money is a taboo in programming world. I know there are people who can‘t be motivated by money. Mme, on the other hand, am living in a real world,
with house mortgage to pay, myself to feed and bills to cover. So I can‘t really exclude money from my consideration. If I can get a large sum of money for
doing a good job, then definitely boost my morale. I won‘t care whether I am using an old workstation, or forced to share rooms or cubicle with other
people, or have to put up with an annoying boss, or whatever. The fact that at the end of the day I will walk off with a large pile of money itself is enough
for me to overcome all the obstacles, put up with all the hard feelings and hurt egos, tolerate a slow computer and even endure
And here's the other string:
What Motivates You to do your Job Well? OK, this is an entry trying to win $100 worth of software development books, despite the fact that I don't read programming books. In order to win the prize you have to write an entry and describes what motivates you to do your job well. Hence this post.
First motivation, money. I know, this doesn't sound like a great inspiration to many, and saying that money is one of the motivation factors might just blow my chances away. As if money is a taboo in programming world. I know there are people who can't be motivated by money. Kudos to them. Me, on the other hand, am living in a real world, with house mortgage to pay, myself to feed and bills to cover. So I can't really exclude money from my consideration.
If I can get a large sum of money for doing a good job, then thatwill definitely boost my morale. I won't care whether I am using an old workstation, or forced to share rooms or cubicle with other people, or have to put up with an annoying boss, or whatever. The fact that at the end of the day I will walk off with a large pile of money itself is enough for me to overcome all the obstacles, put up with all the hard feelings and hurt egos, tolerate a slow computer and even endure
I ran the above command with isjunk set to lambda x: x == " ", and the ratio is only 0.36.
4 answers
#1
6
If you treat all whitespace characters as junk, the similarity is better:
difflib.SequenceMatcher(lambda x: x in " \t\n", doc1, doc2).ratio()
However, difflib is not ideal for this kind of problem. The two documents are nearly identical, but typos and similar small changes register as differences for difflib where a human reader would barely notice them.
Try reading up on tf-idf, Bayesian probability, vector space models and w-shingling.
I have written an implementation of tf-idf that applies it to a vector space and uses the dot product as a distance measure to classify documents.
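As a rough illustration of one of the alternatives listed above, here is a minimal w-shingling sketch. It is not the tf-idf implementation mentioned in this answer; the helper names `shingles` and `jaccard`, the window size `w=3`, and the sample strings are all assumptions made for the example. Because the text is split into words first, extra spaces and line breaks do not affect the score.

```python
def shingles(text, w=3):
    """Return the set of w-word shingles (contiguous word windows) in text."""
    words = text.split()  # split() with no argument discards all whitespace
    if len(words) < w:
        # Too short for a full window: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical sample inputs: doc1 and doc2 differ only in whitespace.
doc1 = "money is one of\nthe   motivation factors"
doc2 = "money is one of the motivation factors"
doc3 = "money is a taboo in programming world"

same_words = jaccard(shingles(doc1), shingles(doc2))
different_words = jaccard(shingles(doc2), shingles(doc3))
print(same_words, different_words)
```

A single typo changes at most `w` shingles, so on long documents the score should degrade gradually rather than collapse.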
#2
1
I haven't used difflib.SequenceMatcher, but have you considered pre-processing the files to remove all blank lines and whitespace (perhaps via regular expressions) and then doing the comparison?
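A minimal sketch of that pre-processing idea, assuming the two texts are already loaded as strings. The helper names `strip_whitespace` and `collapse_whitespace` and the sample inputs are made up for illustration; which of the two normalisations fits better depends on whether word boundaries should still count.

```python
import difflib
import re


def strip_whitespace(text):
    """Remove every whitespace character (spaces, tabs, newlines)."""
    return re.sub(r"\s+", "", text)


def collapse_whitespace(text):
    """Replace each whitespace run with a single space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical sample inputs: same words, different spacing and line breaks.
text1 = "What motivates you\n\n  to do your   job well?\n"
text2 = "What motivates you to do your job well?"

# Ratio on the raw strings versus the normalised strings.
raw = difflib.SequenceMatcher(None, text1, text2).ratio()
cleaned = difflib.SequenceMatcher(
    None, collapse_whitespace(text1), collapse_whitespace(text2)
).ratio()
print(raw, cleaned)
```

With this approach no isjunk function is needed at all, since the whitespace differences are gone before the comparison starts.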
#3
1
Using your sample strings:
>>> s=difflib.SequenceMatcher(lambda x: x == '\n', s1, s2)
>>> s.ratio()
0.94669848846459825
Interestingly, if ' ' is also included as junk:
>>> s=difflib.SequenceMatcher(lambda x: x in ' \n', s1, s2)
>>> s.ratio()
0.7653142402545744
It looks like the newlines have a much greater effect than the spaces.
#4
1
Given the texts above, the test is indeed as suggested:
difflib.SequenceMatcher(lambda x: x in " \t\n", doc1, doc2).ratio()
However, to speed things up a little, you can take advantage of CPython's method-wrappers:
difflib.SequenceMatcher(" \t\n".__contains__, doc1, doc2).ratio()
This avoids many Python function calls.
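A small sketch showing that the two predicates are interchangeable, using made-up sample strings. The bound method `" \t\n".__contains__` answers the same question as the lambda, so the ratio should not change; only the call overhead differs. How much time this saves is not measured here, and it may be small if the predicate is only called once per distinct element.

```python
import difflib

# Hypothetical sample inputs that differ only in whitespace.
doc1 = "first motivation,\n\tmoney.  I know"
doc2 = "first motivation, money. I know"

junk_chars = " \t\n"
is_junk_lambda = lambda ch: ch in junk_chars  # Python-level function call
is_junk_bound = junk_chars.__contains__       # bound method implemented in C

ratio_lambda = difflib.SequenceMatcher(is_junk_lambda, doc1, doc2).ratio()
ratio_bound = difflib.SequenceMatcher(is_junk_bound, doc1, doc2).ratio()
print(ratio_lambda == ratio_bound)
```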