It's been a long while since I last blogged. I've been building a system on Solr recently and have accumulated quite a few notes along the way. Let me start by writing down how a custom QParser in Solr can work with SynonymFilter and RemoveDuplicatesTokenFilter to implement query-time synonym expansion and token deduplication.
Following the instructions on the Solr wiki, I initially configured the following filters in schema.xml:
<analyzer type="query">
  <tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" isMaxWordLength="true"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
In practice, however, RemoveDuplicatesTokenFilterFactory failed to filter out the duplicate tokens. For example, "摩托罗拉 motorola 里程碑2代" was expanded by the synonym step (here the synonyms map brand names between Chinese and English; likewise below) into "摩托罗拉 摩托 motorola moto motorola 摩托罗拉 摩托 moto 里程碑 2代", in which 摩托罗拉, 摩托, motorola, and moto each appear twice. Since my custom QParser is based on DisMaxQParser, those duplicated synonyms throw off the min-should-match parameter and lower the matching precision.
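For reference, the expansion above corresponds to a synonyms.txt entry along these lines (an assumed example, not from the original post; in Solr's synonym file format a comma-separated line declares equivalent terms, and with expand="true" each of them is expanded to all of them):

摩托罗拉,摩托,motorola,moto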
To see why RemoveDuplicatesTokenFilter had no effect, I opened up its source:
public boolean incrementToken() throws IOException {
  while (input.incrementToken()) {
    final char term[] = termAttribute.buffer();
    final int length = termAttribute.length();
    final int posIncrement = posIncAttribute.getPositionIncrement();

    if (posIncrement > 0) {
      previous.clear(); // a new position: forget everything seen at the previous one
    }

    boolean duplicate = (posIncrement == 0 && previous.contains(term, 0, length));

    // clone the term, and add to the set of seen terms.
    char saved[] = new char[length];
    System.arraycopy(term, 0, saved, 0, length);
    previous.add(saved);

    if (!duplicate) {
      return true;
    }
  }
  return false;
}
As the code shows, RemoveDuplicatesTokenFilter only treats a token as a duplicate candidate when its positionIncrement is 0, and it clears the set of previously seen terms whenever positionIncrement > 0. But the synonyms injected by SynonymFilter, although they carry positionIncrement 0, will never collide with the original token at the same position; the real duplicates that show up later necessarily arrive with positionIncrement > 0 and therefore can never be removed.
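Walking the example query through the chain makes this concrete. After SynonymFilter, the token stream looks roughly like this (position increments in parentheses):

摩托罗拉(+1) 摩托(0) motorola(0) moto(0)   motorola(+1) 摩托罗拉(0) 摩托(0) moto(0)   里程碑(+1) 2代(+1)

No term ever repeats at the same position: the second motorola arrives with positionIncrement 1, which clears previous, and the positionIncrement-0 synonyms behind it never match a term already stored at that position, so nothing is flagged as a duplicate.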
Given that, I decided to solve the problem inside the custom QParser, along the following lines:
- Get hold of Solr's TokenizerChain inside the QParser and fetch the SynonymFilterFactory from it
- Obtain the query Analyzer in the QParser and build a SynonymFilter instance on top of the Analyzer's TokenStream
- Walk the tokens through the SynonymFilter (calling incrementToken) and branch on the positionIncrement of each expanded token:
- if positionIncrement > 0, check whether the term has already been seen; if not, let it through and also put it into a Set for the next duplicate check
- if positionIncrement == 0, only put it into the Set for the next check
Besides filtering out duplicate tokens, this logic effectively "normalizes" the query. Because the custom QParser runs earlier in Solr's request lifecycle than the TokenizerChain configured in schema.xml, synonym expansion still happens once more after normalization, and after that second expansion there are no duplicate tokens and the search precision is no longer affected.
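Concretely, with the earlier example: "摩托罗拉 motorola 里程碑2代" is normalized to "摩托罗拉 里程碑 2代" (motorola is dropped because it has already been seen as a synonym of 摩托罗拉), and the schema's query-time chain then re-expands that to 摩托罗拉 摩托 motorola moto 里程碑 2代, with every term occurring exactly once.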
Part of the code follows:
Analyzer analyzer = req.getSchema().getQueryAnalyzer();
final TokenizerChain tokennizerChain = (TokenizerChain) req.getSchema().getField("title").getType().getQueryAnalyzer();
SynonymFilterFactory sff = null;
// locate the SynonymFilterFactory configured for the field in schema.xml
for (TokenFilterFactory tf : tokennizerChain.getTokenFilterFactories()) {
    if (tf instanceof SynonymFilterFactory) {
        sff = (SynonymFilterFactory) tf;
    }
}
if (null == analyzer) {
    // (body elided in the original excerpt; presumably bail out or fall back)
}
StringReader reader = new StringReader(qstr);
StringBuilder buffer = new StringBuilder(128);
Set<String> tokenSet = new LinkedHashSet<String>(); // preserves the order tokens are emitted in
TokenStream tokens = analyzer.reusableTokenStream("title", reader);
SynonymFilter sf = sff.create(tokens);
TermAttribute termAtt = (TermAttribute) sf.getAttribute(TermAttribute.class);
PositionIncrementAttribute positionIncrementAttribute = sf.getAttribute(PositionIncrementAttribute.class);
OffsetAttribute offsetAttribute = sf.getAttribute(OffsetAttribute.class);
Set<String> dumplicatedTokenSet = new HashSet<String>(); // every term seen so far, originals and synonyms alike
while (sf.incrementToken()) {
    final String token = (new String(termAtt.termBuffer(), 0, termAtt.termLength())).toLowerCase();
    final int posIncr = positionIncrementAttribute.getPositionIncrement();
    if (posIncr > 0) {
        // an original token: emit it unless it was already seen (as a term or as a synonym)
        if (!dumplicatedTokenSet.contains(token)) {
            dumplicatedTokenSet.add(token);
            tokenSet.add(token);
        }
    } else {
        // an injected synonym: record it for duplicate checking, but do not emit it
        dumplicatedTokenSet.add(token);
    }
}
// rebuild the normalized query string
for (String tok : tokenSet) {
    buffer.append(tok).append(" ");
}
if (buffer.length() > 0) {
    qstr = buffer.toString();
}
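As a closing illustration, here is a minimal sketch of how such normalization might be wired into a QParser built on DisMaxQParser. The class name and the normalizeSynonyms() helper are illustrative assumptions, not the original code; only DisMaxQParser, its constructor, and parse() come from Solr itself (Solr 3.x-era signatures):

import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.Query;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.DisMaxQParser;

// A sketch, assuming the dedup logic above is wrapped in normalizeSynonyms()
public class SynonymNormalizingQParser extends DisMaxQParser {

    public SynonymNormalizingQParser(String qstr, SolrParams localParams,
                                     SolrParams params, SolrQueryRequest req) {
        super(qstr, localParams, params, req);
    }

    @Override
    public Query parse() throws ParseException {
        // normalize before DisMaxQParser runs the query string through the
        // schema.xml analyzer chain, which then re-expands synonyms exactly once
        qstr = normalizeSynonyms(qstr); // qstr is a protected field of QParser
        return super.parse();
    }

    // assumed wrapper around the dedup/normalization code shown earlier
    private String normalizeSynonyms(String q) {
        // ... (the excerpt above: iterate the SynonymFilter, keep unseen
        // posIncr>0 terms, record posIncr==0 synonyms for duplicate checks)
        return q;
    }
}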
Postscript:
Solr's DisjunctionMaxQuery is a rather interesting beast; I should find some time to read its code carefully and write up a summary.
Source: http://www.jnan.org/archives/2011/10/hacking-solr-synonymfilter-and-removeduplicatestokenfilter-with-custom-qparser.html#more-528