Best library for fuzzy document matching / text fingerprinting

Date: 2021-11-17 18:29:03

I am thinking of building an API that would let a program submit a "fingerprint" of an academic publication, match this against a database of articles from Open Access journals, and if found, send the user the canonical citation information. Initially this would be for a specific small research field, so it wouldn't necessarily need to deal with 20 million papers to be successful (even if the 1000 most commonly cited papers in the field were covered, that would be a huge boon for productivity and collaboration).

I wonder which library (ideally one that can interface with Ruby) would be best for doing this "fingerprinting". I've seen Lucene's fuzzy match, but that seems to work at the word level, whereas in this case we would probably want to submit a much larger subset of the document. The reason for fuzzy matching is that some people might have a Word .doc preprint, some might have the final PDF, etc.

I really appreciate some of the ideas here. Googling for "perceptual hash" led me to a bunch of new material. I tried to summarize many of my findings here.

It seems that SimHash (for example, the C implementation) would be the way to go, but I still need to experiment more.
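
To make the SimHash idea concrete, here is a minimal pure-Ruby sketch, not the C implementation mentioned above. Everything in it is an assumption made for illustration: the helper names (`shingles`, `simhash`, `hamming_distance`), the use of three-word shingles, MD5 as the per-shingle hash, and the 64-bit fingerprint size. Two versions of the same paper (say, a preprint and the final PDF text) should usually yield fingerprints that differ in only a few bits, while unrelated documents should differ in many.

```ruby
require 'digest'

# Split text into overlapping word n-grams ("shingles").
# The shingle size of 3 is an arbitrary choice for this sketch.
def shingles(text, size = 3)
  words = text.downcase.scan(/[a-z0-9]+/)
  return [words.join(' ')] if words.length <= size
  words.each_cons(size).map { |gram| gram.join(' ') }
end

# Compute a SimHash fingerprint: every shingle votes +1 or -1 on each
# bit position, and the sign of the total decides the final bit.
def simhash(text, bits = 64)
  counts = Array.new(bits, 0)
  shingles(text).each do |shingle|
    # Take the first `bits` bits of the MD5 digest as an integer.
    h = Digest::MD5.hexdigest(shingle)[0, bits / 4].to_i(16)
    bits.times { |i| counts[i] += h[i] == 1 ? 1 : -1 }
  end
  counts.each_with_index.reduce(0) do |acc, (count, i)|
    count > 0 ? acc | (1 << i) : acc
  end
end

# Number of bit positions in which two fingerprints differ.
def hamming_distance(a, b)
  (a ^ b).to_s(2).count('1')
end
```

A client would submit `simhash(document_text)`, and the server would return the stored article whose fingerprint has the smallest `hamming_distance`, provided it falls below some threshold. That threshold would have to be tuned experimentally on real preprint/final pairs.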

1 answer

#1


7  

You can use pHash for this kind of job.

And this gem will help you to get started:

require 'phash/text'
# `%` compares the text hashes of the two files
Phash::Text.new('first.txt') % Phash::Text.new('second.txt')
