How to detect duplicate text with some fuzziness

Some time ago, I wrote a small script using Text::DeDupe to remove duplicate blog posts before I had to lay my eyes on them.

After reading the "Syntactic Clustering of the Web" paper, on which that implementation is based, I would love to be able to find overlapping documents as well (e.g. snippets of blog posts as opposed to the full text, and maybe also quotes).
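To make the distinction concrete, here is a minimal Python sketch of the two measures that paper defines over word shingles: resemblance (how alike two documents are overall) and containment (how much of one document appears inside another). It is an illustration only, not the Text::DeDupe implementation: the function names, the regex tokenizer, and the shingle width `w` are my own assumptions, and it compares exact shingle sets rather than the sampled sketches the paper uses to scale to large collections.

```python
import re


def shingles(text, w=4):
    """Return the set of contiguous w-word shingles in text (hypothetical helper)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < w:
        # Short texts yield a single shingle made of all their words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=4):
    """Jaccard similarity of the two shingle sets: |A & B| / |A | B|."""
    sa, sb = shingles(a, w), shingles(b, w)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def containment(a, b, w=4):
    """Fraction of a's shingles that also occur in b: |A & B| / |A|."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa) if sa else 0.0


full = "the quick brown fox jumps over the lazy dog near the river bank"
snippet = "brown fox jumps over the lazy dog"

print(resemblance(snippet, full, w=3))  # low: the full text has many extra shingles
print(containment(snippet, full, w=3))  # 1.0: every snippet shingle is in the full text
```

A snippet or quote scores low on resemblance but high on containment, so for the overlapping-document case containment is the measure to threshold on.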

Do you know of any other implementations in C, C++ or Perl that I can try out before writing my own?

1 solution

#1

SpotSigs seems to fit the bill just right. Here are some references:

The source code for this module is hosted on GitHub:

http://github.com/jzawodn/perl-text-spotsig
