为什么lucene不需要复合索引,但是关系数据库呢?

时间:2022-05-16 03:06:43

Lucene stores index for each field separetly. So when we perform query "fld1:a AND fld2:b" we iterate over Termdocs for first term and second term. This can't be faster. In case of database two separete indexes for fld1 and fld2 will work slow and only one will be used. In that case DB requres composite key for fld1 and fld2.

Lucene为每个领域存储索引。因此,当我们执行查询“fld1:a AND fld2:b”时,我们会在第一学期和第二学期迭代Termdocs。这不可能更快。在数据库的情况下,fld1和fld2的两个separete索引将运行缓慢,只使用一个。在这种情况下,DB需要fld1和fld2的复合键。

My question is. Why Can't DB utilize Lucene index algorithm for executing Boolean queries if it as fast as DB index and dosn't requires different combinations of columns?

我的问题是。为什么DB不能使用Lucene索引算法执行布尔查询,如果它与DB索引一样快并且不需要不同的列组合?

Some details of Lucene Boolean Query search: It utilize interface TermDoc. The main idea in using two methods boolean skipTo(int) and boolean next(). So it is doesn't depend on term order(popular or not popular term) because count of those method calls will be always as most infrequent term(due to skipTo method). So there are no need in hierarchical composite index, it will not bring any additional performance.

Lucene Boolean Query搜索的一些细节:它使用接口TermDoc。使用两个方法boolean skipTo(int)和boolean next()的主要思路。因此它不依赖于术语顺序(流行或非流行术语),因为这些方法调用的计数总是最不常见的术语(由于skipTo方法)。所以没有必要在分层复合索引中,它不会带来任何额外的性能。

TermDocs t1 = searcher.docs(fld1:a);
TermDocs t2 = searcher.docs(fld2:b); 
int doc = -1;
t1.next(); t2.next();
while(t1.doc()!=-1 && t2.doc()!=-1) {
if(t1.doc()<t2.doc()) {
  if(!t1.skipTo(t2.doc)) return;
}
if(t2.doc()<t1.doc()) {
 if(!t2.skipTo(t1.doc)) return;
}
if(t1.doc()==t2.doc()) {
println("found doc:"+t1.doc());
t1.next()
}
}

1 个解决方案

#1


6  

I think @Frank Farmer's comment gives you most of your answer: it's perfectly possible for an RDB to use multiple indexes even if they aren't "composite".

我认为@Frank Farmer的评论为您提供了大部分答案:即使RDB不是“复合”,它也完全有可能使用多个索引。

A more specific question has a harder answer: why don't RDBs use Lucene's multi-index-search paradigm?

一个更具体的问题有一个更难的答案:为什么RDB不使用Lucene的多索引搜索范例?

Recall that Lucene uses an inverted index with a skip list; recall also that these are only efficient if the index is extremely sparse and the number of terms is very high.

回想一下,Lucene使用带有跳过列表的倒排索引;还记得,如果指数非常稀少且术语数量非常高,这些只是有效的。

In the type of column where you're likely to do a query like where a = b, the number of possible bs is probably pretty small, and hence the index will be relatively dense. So it makes more sense to use bitmaps (like PostgreSQL does) and gain the speedup of bit-level parallelism than to store it as a skip list and deal with pointer-chasing.

在您可能进行查询的列类型中,例如a = b,可能的b数可能非常小,因此索引相对较密集。所以使用位图(比如PostgreSQL)更有意义并获得比特级并行的加速,而不是将其存储为跳过列表并处理指针追逐。

I should note that even Lucene uses bitmaps when combining filters with queries, so we might equivalently ask why Lucene doesn't use Lucene's search. My guess is that bitmaps are smaller and therefore more likely to fit in memory.

我应该注意,即使Lucene在将过滤器与查询结合使用时也会使用位图,因此我们可以等同地问为什么Lucene不使用Lucene的搜索。我的猜测是位图较小,因此更有可能适合内存。

To the best of my knowledge, this is not a huge performance gain, so you probably can't make a very strong argument for either bitmaps or skip lists in the general case. But if I had to guess why the PostgreSQL devs went the bitmap route, I think it would be this.

据我所知,这并不是一个巨大的性能提升,因此在一般情况下,你可能无法对位图或跳过列表做出非常强有力的论证。但如果我不得不猜测为什么PostgreSQL开发人员会使用位图路由,我认为就是这样。

#1


6  

I think @Frank Farmer's comment gives you most of your answer: it's perfectly possible for an RDB to use multiple indexes even if they aren't "composite".

我认为@Frank Farmer的评论为您提供了大部分答案:即使RDB不是“复合”,它也完全有可能使用多个索引。

A more specific question has a harder answer: why don't RDBs use Lucene's multi-index-search paradigm?

一个更具体的问题有一个更难的答案:为什么RDB不使用Lucene的多索引搜索范例?

Recall that Lucene uses an inverted index with a skip list; recall also that these are only efficient if the index is extremely sparse and the number of terms is very high.

回想一下,Lucene使用带有跳过列表的倒排索引;还记得,如果指数非常稀少且术语数量非常高,这些只是有效的。

In the type of column where you're likely to do a query like where a = b, the number of possible bs is probably pretty small, and hence the index will be relatively dense. So it makes more sense to use bitmaps (like PostgreSQL does) and gain the speedup of bit-level parallelism than to store it as a skip list and deal with pointer-chasing.

在您可能进行查询的列类型中,例如a = b,可能的b数可能非常小,因此索引相对较密集。所以使用位图(比如PostgreSQL)更有意义并获得比特级并行的加速,而不是将其存储为跳过列表并处理指针追逐。

I should note that even Lucene uses bitmaps when combining filters with queries, so we might equivalently ask why Lucene doesn't use Lucene's search. My guess is that bitmaps are smaller and therefore more likely to fit in memory.

我应该注意,即使Lucene在将过滤器与查询结合使用时也会使用位图,因此我们可以等同地问为什么Lucene不使用Lucene的搜索。我的猜测是位图较小,因此更有可能适合内存。

To the best of my knowledge, this is not a huge performance gain, so you probably can't make a very strong argument for either bitmaps or skip lists in the general case. But if I had to guess why the PostgreSQL devs went the bitmap route, I think it would be this.

据我所知,这并不是一个巨大的性能提升,因此在一般情况下,你可能无法对位图或跳过列表做出非常强有力的论证。但如果我不得不猜测为什么PostgreSQL开发人员会使用位图路由,我认为就是这样。