Lucene AddIndexes (merge) - how to avoid duplicates?

Time: 2023-02-06 03:09:32

How do I make sure that when I merge several temporary indexes (which may or may not contain duplicate documents), I end up with exactly one copy of each document in the main index?

Thanks

1 solution

#1


Here's a way: Provided that each document has an id, and that duplicate documents have the same id:

label the indexes I1..Im
for i in 1..m:
  let Ci = all the indexes except Ii
  for each document Dj in Ii:
    let cur_term = "id:<Dj's id>"
    for each Ik in Ci:
      Ik.deleteDocuments(cur_term)
merge all indexes

The gist is: delete all documents having the same id as the current document from the other indexes. After having done this for all indexes, merge them. I know this is not elegant, but I do not know a better algorithm.
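
To make the loop order concrete, here is a minimal sketch of the same idea in plain Java. It is only a toy model and not Lucene code: each "index" is a mutable map from document id to document body, and the class name DedupMerge and the method mergeWithoutDuplicates are made up for illustration. It also assumes, like the pseudocode, that a single temp index holds at most one document per id. In Lucene itself the delete step would presumably go through something like IndexWriter.deleteDocuments(new Term("id", ...)) and the final step through IndexWriter.addIndexes(...), but check the signatures for your version.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model (hypothetical, not Lucene): each "index" is a mutable map of id -> body. */
final class DedupMerge {

    /**
     * For every index, deletes its ids from all the other indexes,
     * then merges what is left. The input maps are modified in place.
     */
    static Map<String, String> mergeWithoutDuplicates(List<Map<String, String>> indexes) {
        for (int i = 0; i < indexes.size(); i++) {
            // Snapshot the ids of index i before deleting from the others.
            List<String> ids = new ArrayList<>(indexes.get(i).keySet());
            for (String id : ids) {
                for (int k = 0; k < indexes.size(); k++) {
                    if (k != i) {
                        // Counterpart of Ik.deleteDocuments("id:<Dj's id>").
                        indexes.get(k).remove(id);
                    }
                }
            }
        }
        // "merge all indexes": the id sets are now disjoint, so nothing is overwritten.
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> index : indexes) {
            merged.putAll(index);
        }
        return merged;
    }
}
```

Note that with this loop order the copy that survives is the one in the earliest index that contains the id; every later copy is deleted before the merge.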
