Solr相似度算法二:Okapi BM25

时间:2023-03-08 19:06:48
Solr相似度算法二:Okapi BM25

In information retrievalOkapi BM25 (BM stands for Best Matching) is a ranking function used by search engines to rank matching documents according to their relevance to a given search query. It is based on the probabilistic retrieval framework developed in the 1970s and 1980s byStephen E. RobertsonKaren Spärck Jones, and others.

The name of the actual ranking function is BM25. To set the right context, however, it usually referred to as "Okapi BM25", since the Okapi information retrieval system, implemented at London's City University in the 1980s and 1990s, was the first system to implement this function.

BM25, and its newer variants, e.g. BM25F (a version of BM25 that can take document structure and anchor text into account), represent state-of-the-art TF-IDF-like retrieval functions used in document retrieval, such as web search.

The ranking function[edit]

BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of the inter-relationship between the query terms within a document (e.g., their relative proximity). It is not a single function, but actually a whole family of scoring functions, with slightly different components and parameters. One of the most prominent instantiations of the function is as follows.

Given a query Solr相似度算法二:Okapi BM25, containing keywords Solr相似度算法二:Okapi BM25, the BM25 score of a document Solr相似度算法二:Okapi BM25 is:

Solr相似度算法二:Okapi BM25

where Solr相似度算法二:Okapi BM25 is Solr相似度算法二:Okapi BM25's term frequency in the document Solr相似度算法二:Okapi BM25Solr相似度算法二:Okapi BM25 is the length of the document Solr相似度算法二:Okapi BM25 in words, and Solr相似度算法二:Okapi BM25 is the average document length in the text collection from which documents are drawn. Solr相似度算法二:Okapi BM25 and Solr相似度算法二:Okapi BM25 are free parameters, usually chosen, in absence of an advanced optimization, as Solr相似度算法二:Okapi BM25 and Solr相似度算法二:Okapi BM25.[1] Solr相似度算法二:Okapi BM25 is the IDF (inverse document frequency) weight of the query term Solr相似度算法二:Okapi BM25. It is usually computed as:

Solr相似度算法二:Okapi BM25

where Solr相似度算法二:Okapi BM25 is the total number of documents in the collection, and Solr相似度算法二:Okapi BM25 is the number of documents containing Solr相似度算法二:Okapi BM25.

There are several interpretations for IDF and slight variations on its formula. In the original BM25 derivation, the IDF component is derived from the Binary Independence Model.

Please note that the above formula for IDF shows potentially major drawbacks when using it for terms appearing in more than half of the corpus documents. These terms' IDF is negative, so for any two almost-identical documents, one which contains the term and one which does not contain it, the latter will possibly get a larger score. This means that terms appearing in more than half of the corpus will provide negative contributions to the final document score. This is often an undesirable behavior, so many real-world applications would deal with this IDF formula in a different way:

  • Each summand can be given a floor of 0, to trim out common terms;
  • The IDF function can be given a floor of a constant Solr相似度算法二:Okapi BM25, to avoid common terms being ignored at all;
  • The IDF function can be replaced with a similarly shaped one which is non-negative, or strictly positive to avoid terms being ignored at all.

IDF information theoretic interpretation[edit]

Here is an interpretation from information theory. Suppose a query term Solr相似度算法二:Okapi BM25 appears in Solr相似度算法二:Okapi BM25 documents. Then a randomly picked document Solr相似度算法二:Okapi BM25 will contain the term with probability Solr相似度算法二:Okapi BM25 (where Solr相似度算法二:Okapi BM25 is again the cardinality of the set of documents in the collection). Therefore, the informationcontent of the message "Solr相似度算法二:Okapi BM25 contains Solr相似度算法二:Okapi BM25" is:

Solr相似度算法二:Okapi BM25

Now suppose we have two query terms Solr相似度算法二:Okapi BM25 and Solr相似度算法二:Okapi BM25. If the two terms occur in documents entirely independently of each other, then the probability of seeing both Solr相似度算法二:Okapi BM25 and Solr相似度算法二:Okapi BM25 in a randomly picked document Solr相似度算法二:Okapi BM25 is:

Solr相似度算法二:Okapi BM25

and the information content of such an event is:

Solr相似度算法二:Okapi BM25

With a small variation, this is exactly what is expressed by the IDF component of BM25.

Modifications[edit]

  • At the extreme values of the coefficient Solr相似度算法二:Okapi BM25 BM25 turns into ranking functions known as BM11 (for Solr相似度算法二:Okapi BM25) and BM15 (for Solr相似度算法二:Okapi BM25).[2]
  • BM25F[3] is a modification of BM25 in which the document is considered to be composed from several fields (such as headlines, main text, anchor text) with possibly different degrees of importance.
  • BM25+[4] is an extension of BM25. BM25+ was developed to address one deficiency of the standard BM25 in which the component of term frequency normalization by document length is not properly lower-bounded; as a result of this deficiency, long documents which do match the query term can often be scored unfairly by BM25 as having a similar relevancy to shorter documents that do not contain the query term at all. The scoring formula of BM25+ only has one additional free parameter Solr相似度算法二:Okapi BM25 (a default value is Solr相似度算法二:Okapi BM25 in absence of a training data) as compared with BM25:
Solr相似度算法二:Okapi BM25