Summarized here: http://infolab.stanford.edu/~ullman/mmds/ch3.pdf . There is also the book at http://www-nlp.stanford.edu/IR-book/ , which covers the term vector space model, SVMs, and more.
http://pages.cs.wisc.edu/~dbbook/openAccess/thirdEdition/slides/slides3ed-english/Ch27b_ir2-vectorspace-95.pdf is devoted specifically to the vector space model.
https://courses.cs.washington.edu/courses/cse573/12sp/lectures/17-ir.pdf also mentions other approaches, apparently statistical models similar to those used in speech recognition.
Using deep learning for document similarity: https://cs224d.stanford.edu/reports/PoulosJackson.pdf and also http://www.cms.waikato.ac.nz/~ml/publications/2012/JASIST2012.pdf
A page that compares text similarity directly in the browser: http://www.scurtu.it/documentSimilarity.html
http://stackoverflow.com/questions/8897593/similarity-between-two-text-documents collects several answers, including approaches based on NLP libraries such as NLTK, diff, or scikit-learn's term vector space plus cosine similarity.
http://stackoverflow.com/questions/1844194/get-cosine-similarity-between-two-documents-in-lucene also describes a cosine similarity calculation.
Cosine similarity in Lucene 3: https://darakpanand.wordpress.com/2013/06/01/document-comparison-by-cosine-methodology-using-lucene/#more-53 (note: the calculation differs between Lucene 3 and Lucene 4).
Vector space model (http://stackoverflow.com/questions/10649898/better-way-of-calculating-document-similarity-using-lucene):
Once you've got your data components properly standardized, then you can worry about what's better: fuzzy match, Levenshtein distance, or cosine similarity (etc.)
As I told you in my comment, I think you made a mistake somewhere. The vectors actually contain <word, frequency> pairs, not just words. Therefore, when you delete a sentence, only the frequencies of the words in that sentence are reduced (the words after it are not shifted). Consider the following example:
Document a:
A B C A A B C. D D E A B. D A B C B A.
Document b:
A B C A A B C. D A B C B A.
Vector a:
A:6, B:5, C:3, D:3, E:1
Vector b:
A:5, B:4, C:3, D:1, E:0
Which results in the following similarity measure:
(6*5 + 5*4 + 3*3 + 3*1 + 1*0) / (Sqrt(6^2+5^2+3^2+3^2+1^2) * Sqrt(5^2+4^2+3^2+1^2+0^2)) =
62 / (8.94427 * 7.14143) =
0.970648
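As a quick check, here is a minimal sketch of the same computation in plain Java (the term-frequency vectors are hard-coded from the example above):

class CosineExample {
    public static void main(String[] args) {
        // Term-frequency vectors for documents a and b over the terms A..E.
        double[] a = {6, 5, 3, 3, 1};
        double[] b = {5, 4, 3, 1, 0};
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i]; // 62
            na += a[i] * a[i];  // 80
            nb += b[i] * b[i];  // 51
        }
        System.out.println(dot / (Math.sqrt(na) * Math.sqrt(nb))); // prints ~0.970648
    }
}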
MoreLikeThis in Lucene:
You may want to check the MoreLikeThis feature of Lucene.
MoreLikeThis constructs a Lucene query based on terms within a document to find other similar documents in the index.
Sample code (Java):
MoreLikeThis mlt = new MoreLikeThis(reader); // Pass the index reader
mlt.setFieldNames(new String[] {"title", "author"}); // specify the fields for similarity
Query query = mlt.like(docID); // Pass the doc id
TopDocs similarDocs = searcher.search(query, 10); // Use the searcher
if (similarDocs.totalHits == 0) {
    // Do handling
}
http://stackoverflow.com/questions/1844194/get-cosine-similarity-between-two-documents-in-lucene asks:
I have built an index in Lucene. Without specifying a query, I just want to get a score (cosine similarity or another distance?) between two documents in the index.
For example, from a previously opened IndexReader ir I am getting the documents with ids 2 and 4: Document d1 = ir.document(2); Document d2 = ir.document(4);
How can I get the cosine similarity between these two documents?
When indexing, there's an option to store term frequency vectors.
During runtime, look up the term frequency vectors for both documents using IndexReader.getTermFreqVector(), and look up document frequency data for each term using IndexReader.docFreq(). That will give you all the components necessary to calculate the cosine similarity between the two docs.
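A minimal sketch of that recipe, assuming the Lucene 3.x API named in the answer (IndexReader.getTermFreqVector() and IndexReader.docFreq()); the field name "content", the class and method names, and the smoothed TF-IDF weighting are illustrative assumptions, not part of the original answer:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermFreqVector;
import java.util.HashMap;
import java.util.Map;

class CosineFromTermVectors {
    // Build a TF-IDF weight map from a stored term vector.
    // The field must have been indexed with term vectors enabled.
    static Map<String, Double> tfIdfVector(IndexReader reader, int docId, String field) throws Exception {
        TermFreqVector tfv = reader.getTermFreqVector(docId, field);
        String[] terms = tfv.getTerms();
        int[] freqs = tfv.getTermFrequencies();
        Map<String, Double> weights = new HashMap<String, Double>();
        int numDocs = reader.numDocs();
        for (int i = 0; i < terms.length; i++) {
            int df = reader.docFreq(new Term(field, terms[i]));       // document frequency of the term
            double idf = Math.log((double) numDocs / (df + 1)) + 1.0; // one of many idf variants
            weights.put(terms[i], freqs[i] * idf);                    // tf * idf
        }
        return weights;
    }

    // Cosine similarity of two sparse weight vectors.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;
            normA += e.getValue() * e.getValue();
        }
        for (double w : b.values()) normB += w * w;
        return (normA == 0 || normB == 0) ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}

For the question above this would be something like cosine(tfIdfVector(ir, 2, "content"), tfIdfVector(ir, 4, "content")).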
An easier way might be to submit doc A as a query (adding all words to the query as OR terms, boosting each by term frequency) and look for doc B in the result set.
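A sketch of that query-based shortcut under the same Lucene 3.x assumptions (again with a hypothetical "content" field); doc B's rank and score in the results then serve as the similarity signal:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

class DocAsQuery {
    // Turn doc A's term vector into one big OR query, boosting each term by its
    // frequency, then search the index with it.
    static TopDocs searchLikeDoc(IndexReader reader, int docA, String field) throws Exception {
        TermFreqVector tfv = reader.getTermFreqVector(docA, field);
        String[] terms = tfv.getTerms();
        int[] freqs = tfv.getTermFrequencies();
        BooleanQuery query = new BooleanQuery(); // default max clause count is 1024; raise it for large docs
        for (int i = 0; i < terms.length; i++) {
            TermQuery tq = new TermQuery(new Term(field, terms[i]));
            tq.setBoost(freqs[i]);                     // weight by term frequency in doc A
            query.add(tq, BooleanClause.Occur.SHOULD); // OR semantics
        }
        return new IndexSearcher(reader).search(query, 100);
    }
}

Scan the returned hits for doc B's document id; the score Lucene assigns it is not a pure cosine, but it is a usable similarity proxy.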
As Julia points out, Sujit Pal's example is very useful, but the Lucene 4 API has substantial changes. Here is a version rewritten for Lucene 4.