Question about article similarity with Lucene

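I'm trying to build an article-similarity query using Lucene's MoreLikeThis class, but it never returns any results. Is something wrong with my code? Could someone take a look? I'm using Lucene 2.9.3; I assume the version isn't the issue...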

Here is the test code:

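import java.io.File;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.similar.MoreLikeThis;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.util.Version;

public class TestLucene3 {
    private final String indexPath = "e:\\index";
    // Two identical documents to index
    private final String content1 = "五年前，袁鹤轩是武林中的神话。袁鹤轩行走江湖后，哪怕是在最穷困的时候，也没有做过偷窃的事。故而，他的路只有一条——江湖，不在江湖中崛起，就在江湖中沉寂。";
    private final String content2 = "五年前，袁鹤轩是武林中的神话。袁鹤轩行走江湖后，哪怕是在最穷困的时候，也没有做过偷窃的事。故而，他的路只有一条——江湖，不在江湖中崛起，就在江湖中沉寂。";
    // Text to find similar documents for
    private final String content = "不在江湖中崛起，就在江湖中沉寂";

    // Build the two documents, storing term vectors for the content field
    public List<Document> createDoc(){
        List<Document> list = new ArrayList<Document>();
        Document doc = new Document();
        doc.add(new Field("id","1",Field.Store.YES,Field.Index.NOT_ANALYZED));
        doc.add(new Field("content",content1,Field.Store.YES,Field.Index.ANALYZED,Field.TermVector.YES));
        list.add(doc);
        doc = new Document();
        doc.add(new Field("id","2",Field.Store.YES,Field.Index.NOT_ANALYZED));
        doc.add(new Field("content",content2,Field.Store.YES,Field.Index.ANALYZED,Field.TermVector.YES));
        list.add(doc);
        return list;
    }

    // (Re)create the index on disk with StandardAnalyzer
    public void createIndex(){
        try {
            Directory dir = FSDirectory.open(new File(indexPath));
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_29);
            IndexWriter writer = new IndexWriter(dir,analyzer,true,IndexWriter.MaxFieldLength.UNLIMITED);
            List<Document> docs = createDoc();
            for(Document doc : docs){
                writer.addDocument(doc);
            }
            writer.commit();
            writer.optimize();
            writer.close();
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (LockObtainFailedException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public Reader getReader(){
        StringReader sr = new StringReader(content);
        return sr;
    }

    // Build a MoreLikeThis query from the text and run it against the index
    public void similar(){
        try {
            Directory dir = FSDirectory.open(new File(indexPath));
            IndexReader reader = IndexReader.open(dir,true);
            MoreLikeThis mlt = new MoreLikeThis(reader);
            mlt.setFieldNames(new String[]{"id","content"});
            mlt.setMaxQueryTerms(100);
            Query query = mlt.like(getReader());

            reader.close();
            IndexSearcher searcher = new IndexSearcher(dir,true);
            TopDocs topDocs = searcher.search(query,null,100);
            ScoreDoc[] docs = topDocs.scoreDocs;
            System.out.println(docs.length);// always 0 here, frustrating...
            for(ScoreDoc doc : docs){
                Document d = searcher.doc(doc.doc);
                System.out.println(d);
            }
            searcher.close();
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        TestLucene3 tl3 = new TestLucene3();
        tl3.createIndex();
        tl3.similar();
    }
}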

5 answers

#1

Does a normal query return any results?

#2

My guess is that if you take

private final String content = "不在江湖中崛起，就在江湖中沉寂";

and run it directly as the search keywords, you'll get no results either.

#3

It works now. You need to call setMinDocFreq(int minDocFreq) and setMinTermFreq(int minTermFreq) on MoreLikeThis; the defaults are 5 and 2.
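
To see why those two thresholds matter here: MoreLikeThis drops any term from the source text whose frequency in that text is below minTermFreq, or whose document frequency in the index is below minDocFreq. With only two documents indexed, no term can reach a document frequency of 5, so the generated query ends up empty. Calling mlt.setMinDocFreq(1) and mlt.setMinTermFreq(1) before mlt.like(...) lets the terms through. The following is a minimal plain-Java sketch of that filtering, not Lucene's actual code. The class MltThresholdSketch and its methods are hypothetical, the document text is a shortened stand-in for the post's content, and the one-token-per-character tokenization is an assumption about how StandardAnalyzer treats Chinese text.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of how MoreLikeThis filters candidate terms (hypothetical, not Lucene code).
class MltThresholdSketch {
    // Shortened stand-in for the two identical indexed documents
    static final String DOC_TEXT = "袁鹤轩行走江湖后，不在江湖中崛起，就在江湖中沉寂。";
    static final String QUERY_TEXT = "不在江湖中崛起，就在江湖中沉寂";

    // Assumption: one token per character, punctuation dropped
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                tokens.add(String.valueOf(c));
            }
        }
        return tokens;
    }

    // Count how often each term occurs in the source text
    static Map<String, Integer> termFreq(String text) {
        Map<String, Integer> freq = new LinkedHashMap<String, Integer>();
        for (String t : tokenize(text)) {
            Integer old = freq.get(t);
            freq.put(t, old == null ? 1 : old + 1);
        }
        return freq;
    }

    // Number of documents that contain the term at least once
    static int docFreq(String term, List<String> docs) {
        int n = 0;
        for (String d : docs) {
            if (tokenize(d).contains(term)) {
                n++;
            }
        }
        return n;
    }

    // Keep only terms that pass both thresholds
    static List<String> selectTerms(String text, List<String> docs,
                                    int minTermFreq, int minDocFreq) {
        List<String> kept = new ArrayList<String>();
        for (Map.Entry<String, Integer> e : termFreq(text).entrySet()) {
            if (e.getValue() < minTermFreq) continue;             // too rare in the source text
            if (docFreq(e.getKey(), docs) < minDocFreq) continue; // too rare in the index
            kept.add(e.getKey());
        }
        return kept;
    }

    static List<String> twoDocs() {
        List<String> docs = new ArrayList<String>();
        docs.add(DOC_TEXT);
        docs.add(DOC_TEXT);
        return docs;
    }

    public static void main(String[] args) {
        List<String> docs = twoDocs();
        System.out.println(selectTerms(QUERY_TEXT, docs, 2, 5).size()); // defaults: prints 0
        System.out.println(selectTerms(QUERY_TEXT, docs, 1, 1).size()); // prints 10
        System.out.println(selectTerms(QUERY_TEXT, docs, 2, 1).size()); // prints 4
    }
}
```

In this model the minDocFreq default alone is enough to empty the query, since the index holds fewer than 5 documents.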

#4

Rather than adjusting these term thresholds, you'd be better off switching to a different analyzer... that would be more sensible...

#5

The standard analyzer doesn't handle Chinese very well.
Switch to another analyzer. Paoding (庖丁), for example?
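
To illustrate what a different analyzer changes: single-character tokens carry little meaning on their own, whereas an analyzer such as Lucene's contrib CJKAnalyzer indexes overlapping two-character pairs, and dictionary-based analyzers such as Paoding go further and emit whole words. The sketch below only mimics the bigram idea in plain Java. BigramSketch is a hypothetical name, and the handling of punctuation and single-character runs is a simplification, not CJKAnalyzer's exact behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of overlapping-bigram tokenization (not Lucene code).
class BigramSketch {
    // Split text into runs of letters/digits, then emit overlapping pairs per run
    static List<String> bigrams(String text) {
        List<String> out = new ArrayList<String>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i <= text.length(); i++) {
            boolean inRun = i < text.length()
                    && Character.isLetterOrDigit(text.charAt(i));
            if (inRun) {
                run.append(text.charAt(i));
                continue;
            }
            // End of a run (punctuation or end of text): flush it
            if (run.length() == 1) {
                out.add(run.toString()); // simplification: keep a lone character as-is
            }
            for (int j = 0; j + 1 < run.length(); j++) {
                out.add(run.substring(j, j + 2));
            }
            run.setLength(0);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [江湖, 湖中, 中崛, 崛起]
        System.out.println(bigrams("江湖中崛起"));
    }
}
```

Whichever analyzer is chosen, the same one should be used for indexing and passed to MoreLikeThis via setAnalyzer, so that the query terms match the indexed terms.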
