Lucene Source Code Analysis, Part 7

Date: 2021-07-23 03:14:11

Lucene Source Code Analysis: The parse Function of QueryParser

This chapter analyzes the parse function of the QueryParser class, which is defined in its parent class QueryParserBase.
QueryParserBase::parse

  public Query parse(String query) throws ParseException {
    ReInit(new FastCharStream(new StringReader(query)));
    try {
      Query res = TopLevelQuery(field);
      return res!=null ? res : newBooleanQuery().build();
    } catch (ParseException | TokenMgrError tme) {
      ...
    } catch (BooleanQuery.TooManyClauses tmc) {
      ...
    }
  }

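parse first wraps the search string query in a FastCharStream. FastCharStream implements the CharStream interface that JavaCC generates for the parser (it is not part of the Java standard library); it keeps an internal buffer and makes it easy to read characters and move the read position. parse then calls ReInit to reset the parser. ReInit, like the rest of QueryParser, is generated by JavaCC from the grammar file QueryParser.jj in the org.apache.lucene.queryparser.classic package. JavaCC itself is well documented elsewhere, so this post does not cover it in depth.
The most important call inside parse is TopLevelQuery, which returns the top-level Query. TopLevelQuery builds a tree of Query objects from the search string. The field argument is assigned when the parser is constructed (in QueryParserBase's init function) and identifies the default field to search.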
QueryParserBase::parse->QueryParser::TopLevelQuery

  final public Query TopLevelQuery(String field) throws ParseException {
    Query q;
    q = Query(field);
    jj_consume_token(0);
    {
      if (true) return q;
    }
    throw new Error();
  }

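The key call in TopLevelQuery is Query. The following jj_consume_token(0) consumes the end-of-input token, so the whole string has to be parsed. Because QueryParser is generated by JavaCC, we read the grammar in QueryParser.jj here rather than the generated Java.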

QueryParser.jj::Query

  Query Query(String field) :
  {
    List<BooleanClause> clauses = new ArrayList<BooleanClause>();
    Query q, firstQuery=null;
    int conj, mods;
  }
  {
    mods=Modifiers() q=Clause(field)
    {
      addClause(clauses, CONJ_NONE, mods, q);
      if (mods == MOD_NONE)
        firstQuery=q;
    }
    (
      conj=Conjunction() mods=Modifiers() q=Clause(field)
      { addClause(clauses, conj, mods, q); }
    )*
    {
      if (clauses.size() == 1 && firstQuery != null)
        return firstQuery;
      else {
        return getBooleanQuery(clauses);
      }
    }
  }

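Modifiers returns an int code for a leading "+" or "-" in the search string (MOD_NONE when there is none), and Conjunction returns an int code for the connective between two clauses (CONJ_NONE when there is none). Query first obtains a subquery from Clause and then calls addClause to add it to the list.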

QueryParserBase::addClause

  protected void addClause(List<BooleanClause> clauses, int conj, int mods, Query q) {
    boolean required, prohibited;

    ...

    if (required && !prohibited)
      clauses.add(newBooleanClause(q, BooleanClause.Occur.MUST));
    else if (!required && !prohibited)
      clauses.add(newBooleanClause(q, BooleanClause.Occur.SHOULD));
    else if (!required && prohibited)
      clauses.add(newBooleanClause(q, BooleanClause.Occur.MUST_NOT));
    else
      throw new RuntimeException("Clause cannot be both required and prohibited");
  }

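The part of addClause omitted above computes required and prohibited from the conjunction conj and the modifier mods. The Query is then wrapped in a BooleanClause and added to the clauses list.

Back in Query: if the clauses list holds exactly one subquery and that subquery had no modifier (firstQuery is not null), it is returned directly. Otherwise getBooleanQuery wraps all the subqueries and returns a BooleanQuery.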
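The visible tail of addClause is a small truth table from (required, prohibited) to an Occur value. The sketch below restates it in plain Java without depending on Lucene. The class OccurSketch, its nested enum, and the method occurFor are hypothetical names used only for illustration; how required and prohibited are derived from conj and mods is not shown, because the excerpt omits it.

```java
class OccurSketch {
    // Stand-in for BooleanClause.Occur, limited to the three values used in addClause.
    enum Occur { MUST, SHOULD, MUST_NOT }

    // Mirrors the if/else chain at the end of addClause.
    static Occur occurFor(boolean required, boolean prohibited) {
        if (required && !prohibited) return Occur.MUST;
        if (!required && !prohibited) return Occur.SHOULD;
        if (!required) return Occur.MUST_NOT; // prohibited only
        throw new IllegalArgumentException("Clause cannot be both required and prohibited");
    }
}
```

Under this mapping a plain clause ends up as SHOULD, a required clause as MUST, and a prohibited clause as MUST_NOT.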

Next, look at the Clause function, which creates one subquery.
QueryParser.jj::Clause

  Query Clause(String field) : {
    Query q;
    Token fieldToken=null, boost=null;
  }
  {
    [
      LOOKAHEAD(2)
      (
        fieldToken=<TERM> <COLON> {field=discardEscapeChar(fieldToken.image);}
        | <STAR> <COLON> {field="*";}
      )
    ]

    (
      q=Term(field)
      | <LPAREN> q=Query(field) <RPAREN> (<CARAT> boost=<NUMBER>)?
    )
    { return handleBoost(q, boost); }
  }

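LOOKAHEAD(2) tells JavaCC to look two tokens ahead. If those tokens form a field prefix (a TERM or * followed by a colon), the field to search is replaced accordingly. The most important call in Clause is Term, which returns the final Query. Clause can also call Query recursively, for a parenthesized group, to build a nested subquery.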

QueryParser.jj::Term

  Query Term(String field) : {
    Token term, boost=null, fuzzySlop=null, goop1, goop2;
    boolean prefix = false;
    boolean wildcard = false;
    boolean fuzzy = false;
    boolean regexp = false;
    boolean startInc=false;
    boolean endInc=false;
    Query q;
  }
  {
    (
      (
        term=<TERM>
        | term=<STAR> { wildcard=true; }
        | term=<PREFIXTERM> { prefix=true; }
        | term=<WILDTERM> { wildcard=true; }
        | term=<REGEXPTERM> { regexp=true; }
        | term=<NUMBER>
        | term=<BAREOPER> { term.image = term.image.substring(0,1); }
      )
      [ fuzzySlop=<FUZZY_SLOP> { fuzzy=true; } ]
      [ <CARAT> boost=<NUMBER> [ fuzzySlop=<FUZZY_SLOP> { fuzzy=true; } ] ]
      {
        q = handleBareTokenQuery(field, term, fuzzySlop, prefix, wildcard, fuzzy, regexp);
      }
    | ( ( <RANGEIN_START> {startInc=true;} | <RANGEEX_START> )
        ( goop1=<RANGE_GOOP>|goop1=<RANGE_QUOTED> )
        [ <RANGE_TO> ]
        ( goop2=<RANGE_GOOP>|goop2=<RANGE_QUOTED> )
        ( <RANGEIN_END> {endInc=true;} | <RANGEEX_END>))
      [ <CARAT> boost=<NUMBER> ]
      {
        boolean startOpen=false;
        boolean endOpen=false;
        if (goop1.kind == RANGE_QUOTED) {
          goop1.image = goop1.image.substring(1, goop1.image.length()-1);
        } else if ("*".equals(goop1.image)) {
          startOpen=true;
        }
        if (goop2.kind == RANGE_QUOTED) {
          goop2.image = goop2.image.substring(1, goop2.image.length()-1);
        } else if ("*".equals(goop2.image)) {
          endOpen=true;
        }
        q = getRangeQuery(field, startOpen ? null : discardEscapeChar(goop1.image), endOpen ? null : discardEscapeChar(goop2.image), startInc, endInc);
      }
    | term=<QUOTED>
      [ fuzzySlop=<FUZZY_SLOP> ]
      [ <CARAT> boost=<NUMBER> ]
      { q = handleQuotedTerm(field, term, fuzzySlop); }
    )
    { return handleBoost(q, boost); }
  }

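If a query contains neither quotes (QUOTED) nor range delimiters (RANGE, that is, square brackets or curly braces), it will in most cases go through handleBareTokenQuery, which produces the Query for a single term. That Query is wrapped into a Clause, and the clauses are combined into a Query. Clause and Query nest within each other: a Query can contain several Clauses, and a Clause can in turn start a new Query. The leaf nodes are the Queries produced for individual terms.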
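The mutual recursion between Query and Clause can be sketched as a tiny recursive-descent parser in plain Java. This is not the JavaCC-generated code: MiniParser, its whitespace tokenizer, and the string output format are assumptions made for illustration, and the sketch only handles terms, "+" and "-" modifiers, and parentheses (no fields, boosts, ranges, or AND/OR).

```java
import java.util.ArrayList;
import java.util.List;

class MiniParser {
    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    MiniParser(String input) {
        // Hypothetical tokenizer: isolate ( ) + - and split on whitespace.
        String padded = input.replace("(", " ( ").replace(")", " ) ")
                             .replace("+", " + ").replace("-", " - ");
        for (String t : padded.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
    }

    // Plays the role of TopLevelQuery: one Query followed by end of input.
    static String parse(String input) {
        MiniParser p = new MiniParser(input);
        String q = p.query();
        if (p.pos != p.tokens.size()) {
            throw new IllegalArgumentException("unexpected token: " + p.tokens.get(p.pos));
        }
        return q;
    }

    // Query := (Modifier? Clause)+
    private String query() {
        List<String> clauses = new ArrayList<>();
        String firstQuery = null;
        while (pos < tokens.size() && !tokens.get(pos).equals(")")) {
            String mod = "";
            String t = tokens.get(pos);
            if (t.equals("+") || t.equals("-")) { mod = t; pos++; }
            String q = clause();
            // Like the grammar: remember the first clause only if it has no modifier.
            if (clauses.isEmpty() && mod.isEmpty()) firstQuery = q;
            clauses.add(mod + q);
        }
        if (clauses.isEmpty()) throw new IllegalArgumentException("empty query");
        if (clauses.size() == 1 && firstQuery != null) return firstQuery;
        return "Bool(" + String.join(" ", clauses) + ")";
    }

    // Clause := TERM | "(" Query ")"
    private String clause() {
        if (pos >= tokens.size()) throw new IllegalArgumentException("unexpected end of input");
        String t = tokens.get(pos++);
        if (t.equals("(")) {
            String inner = query(); // a Clause starts a nested Query
            if (pos >= tokens.size() || !tokens.get(pos).equals(")")) {
                throw new IllegalArgumentException("missing )");
            }
            pos++;
            return inner;
        }
        if (t.equals(")") || t.equals("+") || t.equals("-")) {
            throw new IllegalArgumentException("unexpected token: " + t);
        }
        return "Term(" + t + ")"; // leaf node
    }
}
```

For example, MiniParser.parse("+a -b (c d)") builds a nested structure in which the parenthesized group becomes an inner boolean node, while MiniParser.parse("hello") returns the single leaf directly, matching the single-clause shortcut in the grammar.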

QueryParserBase::handleBareTokenQuery

  Query handleBareTokenQuery(String qfield, Token term, Token fuzzySlop, boolean prefix, boolean wildcard, boolean fuzzy, boolean regexp) throws ParseException {
    Query q;

    String termImage=discardEscapeChar(term.image);
    if (wildcard) {
      q = getWildcardQuery(qfield, term.image);
    } else if (prefix) {
      q = getPrefixQuery(qfield,
          discardEscapeChar(term.image.substring
              (0, term.image.length()-1)));
    } else if (regexp) {
      q = getRegexpQuery(qfield, term.image.substring(1, term.image.length()-1));
    } else if (fuzzy) {
      q = handleBareFuzzy(qfield, fuzzySlop, termImage);
    } else {
      q = getFieldQuery(qfield, termImage, false);
    }
    return q;
  }

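For example, the query string AAA* is a prefix query, so the prefix argument is true; A*A is a wildcard query, so wildcard is true; AA~ is a fuzzy query, so fuzzy is true. Assume here that none of these flags is set and the input is just ordinary words. In that case the Query is created by getFieldQuery, which is the focus of this post.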
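To make the three examples concrete, here is a rough classifier in plain Java. It only approximates what the token manager decides: the class BareTermKind and the method classify are hypothetical, escape characters are ignored, and the real token definitions in QueryParser.jj are more detailed. The order of the final checks follows the if/else chain in handleBareTokenQuery.

```java
class BareTermKind {
    enum Kind { WILDCARD, PREFIX, REGEXP, FUZZY, PLAIN }

    // Approximate classification of a bare term; not the actual lexer rules.
    static Kind classify(String text) {
        // A '~' after at least one character marks a fuzzy term in this sketch.
        int tilde = text.lastIndexOf('~');
        boolean fuzzy = tilde > 0;
        String term = fuzzy ? text.substring(0, tilde) : text;

        int star = term.indexOf('*');
        // Wildcard: a lone "*", any '?', or a '*' that is not just one trailing star.
        boolean wildcard = term.equals("*") || term.contains("?")
                || (star >= 0 && star < term.length() - 1);
        // Prefix: exactly one '*', at the very end, after at least one character.
        boolean prefix = !wildcard && term.length() > 1 && term.endsWith("*");
        // Regexp: text enclosed in forward slashes.
        boolean regexp = term.length() >= 2 && term.startsWith("/") && term.endsWith("/");

        // Same priority order as handleBareTokenQuery.
        if (wildcard) return Kind.WILDCARD;
        if (prefix) return Kind.PREFIX;
        if (regexp) return Kind.REGEXP;
        if (fuzzy) return Kind.FUZZY;
        return Kind.PLAIN;
    }
}
```

With this sketch, "AAA*" is classified as PREFIX, "A*A" as WILDCARD, "AA~" as FUZZY, and "AAA" as PLAIN, which is the case that leads to getFieldQuery.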

QueryParserBase::handleBareTokenQuery->getFieldQuery

  protected Query getFieldQuery(String field, String queryText, boolean quoted) throws ParseException {
    return newFieldQuery(getAnalyzer(), field, queryText, quoted);
  }

  protected Query newFieldQuery(Analyzer analyzer, String field, String queryText, boolean quoted) throws ParseException {
    BooleanClause.Occur occur = operator == Operator.AND ? BooleanClause.Occur.MUST : BooleanClause.Occur.SHOULD;
    return createFieldQuery(analyzer, occur, field, queryText, quoted || autoGeneratePhraseQueries, phraseSlop);
  }

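getAnalyzer returns the analyzer set in QueryParserBase's init function; to keep the analysis simple, assume it is SimpleAnalyzer. quoted and autoGeneratePhraseQueries control whether a PhraseQuery is created, and phraseSlop is the slop (the allowed position distance), which only PhraseQuery uses, so it is ignored here. Next, look at createFieldQuery.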
QueryParserBase::handleBareTokenQuery->getFieldQuery->newFieldQuery->QueryBuilder::createFieldQuery

  protected final Query createFieldQuery(Analyzer analyzer, BooleanClause.Occur operator, String field, String queryText, boolean quoted, int phraseSlop) {

    try (TokenStream source = analyzer.tokenStream(field, queryText);
         CachingTokenFilter stream = new CachingTokenFilter(source)) {

      TermToBytesRefAttribute termAtt = stream.getAttribute(TermToBytesRefAttribute.class);
      PositionIncrementAttribute posIncAtt = stream.addAttribute(PositionIncrementAttribute.class);

      int numTokens = 0;
      int positionCount = 0;
      boolean hasSynonyms = false;

      stream.reset();
      while (stream.incrementToken()) {
        numTokens++;
        int positionIncrement = posIncAtt.getPositionIncrement();
        if (positionIncrement != 0) {
          positionCount += positionIncrement;
        } else {
          hasSynonyms = true;
        }
      }

      if (numTokens == 0) {
        return null;
      } else if (numTokens == 1) {
        return analyzeTerm(field, stream);
      } else if (quoted && positionCount > 1) {
        ...
      } else {
        if (positionCount == 1) {
          return analyzeBoolean(field, stream);
        } else {
          return analyzeMultiBoolean(field, stream, operator);
        }
      }
    } catch (IOException e) {
      ...
    }
  }

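The analyzer's tokenStream and incrementToken functions were covered in Part 4 of this series. Going straight to the result: if numTokens == 1, the analyzer produced a single token, and analyzeTerm creates the final Query. If positionCount == 1 with more than one token, all tokens sit at the same position, and analyzeBoolean creates the Query. In the remaining non-quoted case there are several tokens and at least two of them sit at different positions, so analyzeMultiBoolean creates the Query. This post only analyzes analyzeMultiBoolean.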
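The counting loop and the branch selection in createFieldQuery can be replayed on a hand-written token list. The sketch below is plain Java with no Lucene dependency; BranchSketch, Tok, and chooseBranch are hypothetical names, and the returned strings are just labels for the branch that would be taken. The quoted branch is labelled "PHRASE" because the excerpt omits its body.

```java
import java.util.List;

class BranchSketch {
    // Hypothetical stand-in for one analyzed token: text plus position increment.
    static final class Tok {
        final String text;
        final int posInc;
        Tok(String text, int posInc) { this.text = text; this.posInc = posInc; }
    }

    static String chooseBranch(List<Tok> tokens, boolean quoted) {
        int numTokens = 0;
        int positionCount = 0;
        for (Tok t : tokens) {
            numTokens++;
            // An increment of 0 means "same position as the previous token".
            if (t.posInc != 0) positionCount += t.posInc;
        }
        if (numTokens == 0) return "null";
        if (numTokens == 1) return "analyzeTerm";
        if (quoted && positionCount > 1) return "PHRASE"; // body omitted in the excerpt
        if (positionCount == 1) return "analyzeBoolean";
        return "analyzeMultiBoolean";
    }
}
```

Two tokens with increments 1 and 0 share one position and select analyzeBoolean; two tokens with increments 1 and 1 occupy two positions and, when not quoted, select analyzeMultiBoolean.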
QueryParserBase::handleBareTokenQuery->getFieldQuery->newFieldQuery->QueryBuilder::createFieldQuery->analyzeMultiBoolean

  private Query analyzeMultiBoolean(String field, TokenStream stream, BooleanClause.Occur operator) throws IOException {
    BooleanQuery.Builder q = newBooleanQuery();
    List<Term> currentQuery = new ArrayList<>();

    TermToBytesRefAttribute termAtt = stream.getAttribute(TermToBytesRefAttribute.class);
    PositionIncrementAttribute posIncrAtt = stream.getAttribute(PositionIncrementAttribute.class);

    stream.reset();
    while (stream.incrementToken()) {
      if (posIncrAtt.getPositionIncrement() != 0) {
        add(q, currentQuery, operator);
        currentQuery.clear();
      }
      currentQuery.add(new Term(field, termAtt.getBytesRef()));
    }
    add(q, currentQuery, operator);

    return q.build();
  }

  private void add(BooleanQuery.Builder q, List<Term> current, BooleanClause.Occur operator) {
    if (current.isEmpty()) {
      return;
    }
    if (current.size() == 1) {
      q.add(newTermQuery(current.get(0)), operator);
    } else {
      q.add(newSynonymQuery(current.toArray(new Term[current.size()])), operator);
    }
  }

  public Builder add(Query query, Occur occur) {
    clauses.add(new BooleanClause(query, occur));
    return this;
  }

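The analyzer's output is exposed through TermToBytesRefAttribute. analyzeMultiBoolean collects the Terms that share the same position into the list currentQuery. If a position has a single Term, it is wrapped in a TermQuery; if it has several, they are wrapped in a SynonymQuery. Each TermQuery or SynonymQuery is then wrapped in a BooleanClause and added to the clause list inside BooleanQuery.Builder. Finally, the build function of BooleanQuery.Builder creates the resulting BooleanQuery from that list.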
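The grouping performed by analyzeMultiBoolean and add can be illustrated on plain strings. The sketch below is a simulation without Lucene: PositionGrouping, Tok, group, and flush are hypothetical names, and the strings "Term(...)" and "Synonym(...)" stand in for TermQuery and SynonymQuery.

```java
import java.util.ArrayList;
import java.util.List;

class PositionGrouping {
    // Hypothetical stand-in for one analyzed token: text plus position increment.
    static final class Tok {
        final String text;
        final int posInc;
        Tok(String text, int posInc) { this.text = text; this.posInc = posInc; }
    }

    // Groups tokens by position, mirroring the loop in analyzeMultiBoolean.
    static List<String> group(List<Tok> tokens) {
        List<String> clauses = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (Tok t : tokens) {
            if (t.posInc != 0) {
                // A new position starts: emit what was collected so far.
                flush(clauses, current);
                current.clear();
            }
            current.add(t.text);
        }
        flush(clauses, current); // emit the last position
        return clauses;
    }

    // Mirrors add(): one term gives a term clause, several give a synonym clause.
    private static void flush(List<String> clauses, List<String> current) {
        if (current.isEmpty()) return;
        if (current.size() == 1) {
            clauses.add("Term(" + current.get(0) + ")");
        } else {
            clauses.add("Synonym(" + String.join(" ", current) + ")");
        }
    }
}
```

Given the tokens fast (increment 1), quick (increment 0), and car (increment 1), the first two share a position and form one synonym clause, and car forms a term clause of its own.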