Some time ago, I wrote a small script using Text::DeDupe to remove duplicate blog posts before I had to lay eyes on them.
After reading the "Syntactic Clustering of the Web" paper, on which the implementation is based, I would love to be able to find overlapping documents (e.g. snippets of blog posts as opposed to the full text, and maybe also quotes).
Do you know of any other implementation in C, C++, or Perl that I can try out before writing my own?
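To make the "overlapping documents" goal concrete, here is a minimal sketch of the resemblance and containment measures over word shingles, as defined in the paper mentioned above. It computes them exactly on full shingle sets rather than on the sampled sketches the paper uses for scale, and it is in Python rather than C, C++, or Perl. The function names and the shingle width `w` are my own choices for illustration, not Text::DeDupe's API.

```python
import re


def shingles(text, w=4):
    """Return the set of w-word shingles (contiguous word windows) in text."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return set()
    if len(words) < w:
        # Text shorter than one window: treat the whole text as one shingle.
        return {tuple(words)}
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=4):
    """Jaccard similarity of the two shingle sets (near-duplicate measure)."""
    sa, sb = shingles(a, w), shingles(b, w)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def containment(a, b, w=4):
    """Fraction of a's shingles that also occur in b (is a roughly inside b?)."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa) if sa else 0.0


full = "the quick brown fox jumps over the lazy dog near the quiet river bank"
snippet = "brown fox jumps over the lazy dog"

print(containment(snippet, full, w=3))  # 1.0: every snippet shingle is in full
print(resemblance(snippet, full, w=3))  # 5/12, about 0.417
```

A snippet scores low on resemblance but high on containment, which is why containment is the measure to look at for excerpts and quotes.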
1 solution
#1
2
SpotSigs seems to fit the bill just right. Here are some references:
- http://dbpubs.stanford.edu/pub/2008-10
- http://infoblog.stanford.edu/2008/08/spotsigs-are-stopwords-finally-good-for.html
- http://ilpubs.stanford.edu:8090/860/
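For a rough feel of the idea behind the references above, here is a simplified sketch of spot signatures: short chains of non-stopword terms that follow a stopword "antecedent", compared with a multiset Jaccard similarity. This is my own illustration in Python, not the SpotSigs code. The antecedent and stopword lists, the fixed spot distance of 1, and the function names are all assumptions made for the example.

```python
import re
from collections import Counter
from itertools import islice

# Assumed, tiny word lists for illustration only.
ANTECEDENTS = {"a", "an", "the", "is"}
STOPWORDS = ANTECEDENTS | {"at", "to", "off", "for", "of", "on", "that", "from"}


def spot_signatures(text, chain_len=2):
    """Return a multiset of (antecedent, term1, ..., termN) signatures."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    sigs = Counter()
    for i, tok in enumerate(tokens):
        if tok not in ANTECEDENTS:
            continue
        # Take the next chain_len non-stopword terms after the antecedent.
        following = (t for t in tokens[i + 1:] if t not in STOPWORDS)
        chain = list(islice(following, chain_len))
        if len(chain) == chain_len:
            sigs[(tok, *chain)] += 1
    return sigs


def spot_similarity(a, b, chain_len=2):
    """Multiset Jaccard similarity of the two signature multisets."""
    sa, sb = spot_signatures(a, chain_len), spot_signatures(b, chain_len)
    union = sum((sa | sb).values())  # per-signature maximum counts
    return sum((sa & sb).values()) / union if union else 0.0


text = "at a rally to kick off a weeklong campaign for the South Carolina primary"
for sig in spot_signatures(text):
    print(":".join(sig))
```

Because the signatures are anchored on stopwords, they tend to come from running prose rather than from navigation bars or ad blocks, which is the appeal for blog posts wrapped in different page templates.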
The source code for this module is hosted on GitHub: