How do I update a Lucene.NET index?

Time: 2022-10-10 03:05:55

I'm developing a Desktop Search Engine in Visual Basic 9 (VS2008) using Lucene.NET (v2.0).

I use the following code to initialize the IndexWriter:

Private writer As IndexWriter
writer = New IndexWriter(indexDirectory, New StandardAnalyzer(), False)
writer.SetUseCompoundFile(True)

If I select the same document folder (containing files to be indexed) twice, two different entries for each file in that document folder are created in the index.

I want the IndexWriter to discard any files that are already present in the Index.

What should I do to ensure this?

7 answers

#1


To update a Lucene index you need to delete the old entry and write the new one. So you need to use an IndexReader to find the current item, use the writer to delete it, and then add your new item. The same is true for multiple entries, which I think is what you are trying to do: find all the entries, delete them all, and then write the new entries.
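
The delete-then-add idea can be sketched without any Lucene.NET version in play. The snippet below is only a toy in-memory model in Python: ToyIndex and its methods are hypothetical names made up for illustration, not the Lucene.NET API. It shows why a plain add creates duplicates and how deleting by a unique key first avoids them.

```python
class ToyIndex:
    """A tiny in-memory stand-in for a search index (NOT the Lucene.NET API)."""

    def __init__(self):
        self.docs = []  # each document is a dict of field name -> value

    def add_document(self, doc):
        # A plain add always appends; it never checks for existing copies.
        self.docs.append(doc)

    def delete_documents(self, field, value):
        # Remove every document whose field exactly equals value.
        before = len(self.docs)
        self.docs = [d for d in self.docs if d.get(field) != value]
        return before - len(self.docs)

    def update_document(self, field, value, doc):
        # Update = delete all matches for the key, then add the new version.
        self.delete_documents(field, value)
        self.add_document(doc)


index = ToyIndex()
index.add_document({"path": "C:/docs/a.txt", "body": "first version"})
index.add_document({"path": "C:/docs/a.txt", "body": "first version"})
duplicates = len(index.docs)  # the same file now has two entries

index.update_document("path", "C:/docs/a.txt",
                      {"path": "C:/docs/a.txt", "body": "second version"})
```

Here the file path plays the role of the unique key, which is an assumption; any field that identifies a file unambiguously would do.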

#2


As Steve mentioned, you need to use an instance of IndexReader and call its DeleteDocuments method. DeleteDocuments accepts either an instance of a Term object or Lucene's internal id of the document (it is generally not recommended to use the internal id as it can and will change as Lucene merges segments).

The best way is to use a unique identifier that you've stored in the index specific to your application. For example, in an index of patients in a doctor's office, if you had a field called "patient_id" you could create a term and pass that as an argument to DeleteDocuments. See the following example (sorry, C#):

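int patientID = 12;
IndexReader indexReader = IndexReader.Open( indexDirectory );
// Term values are strings, so convert the numeric id before building the term.
indexReader.DeleteDocuments( new Term( "patient_id", patientID.ToString() ) );
// Close the reader so the deletions are saved to the index.
indexReader.Close();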

Then you could add the patient record again with an instance of IndexWriter. I learned a lot from this article http://www.codeproject.com/KB/library/IntroducingLucene.aspx.

Hope this helps.

#3


There are many out-of-date examples out there on deleting with an id field. The code below will work with Lucene.NET 2.4.

If you're already using an IndexWriter, it's not necessary to open an IndexReader or to access IndexSearcher.Reader. You can use IndexWriter.DeleteDocuments(Term), but the tricky part is making sure you've stored your id field correctly in the first place. Be sure to use Field.Index.NOT_ANALYZED as the index setting on your id field when storing the document. This indexes the field without tokenizing it, which is very important; none of the other Field.Index values will work when used this way:

IndexWriter writer = new IndexWriter(@"\MyIndexFolder", new StandardAnalyzer());
var doc = new Document();
var idField = new Field("id", "MyItemId", Field.Store.YES, Field.Index.NOT_ANALYZED);
doc.Add(idField);
writer.AddDocument(doc);
writer.Commit();

Now you can easily delete or update the document using the same writer:

Term idTerm = new Term("id", "MyItemId");
writer.DeleteDocuments(idTerm);
writer.Commit();
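
To see why the id field has to be indexed without tokenizing, here is a rough Python sketch. The toy_analyze function is a made-up, simplified analyzer (lowercase, split on non-alphanumerics); it only loosely imitates what a real analyzer does and is not Lucene.NET code. The point is that an exact term built from the original id no longer matches once the value has been analyzed.

```python
import re


def toy_analyze(text):
    # Hypothetical simplified analyzer: lowercase, then split on anything
    # that is not a digit or a lowercase ASCII letter.
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def index_terms(value, analyzed):
    # Analyzed fields are stored as tokens; un-analyzed fields as one exact term.
    return toy_analyze(value) if analyzed else [value]


def term_matches(indexed_terms, term):
    # Deleting by term only hits documents that contain that exact term.
    return term in indexed_terms


doc_id = "MyItem-42"
analyzed_terms = index_terms(doc_id, analyzed=True)
not_analyzed_terms = index_terms(doc_id, analyzed=False)

found_when_analyzed = term_matches(analyzed_terms, doc_id)
found_when_not_analyzed = term_matches(not_analyzed_terms, doc_id)
```

With the analyzed field, a delete by the original id would silently match nothing, and the old document would stay in the index next to the new one.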

#4


If you want to delete all content in the index and refill it, you could use this statement:

writer = New IndexWriter(indexDirectory, New StandardAnalyzer(), True)

The last parameter of the IndexWriter constructor determines whether a new index is created, or whether an existing index is opened for the addition of new documents.

#5


There are several options, listed below, which can be used depending on your requirements.

See the code snippet below. [The source code is in C#; please convert it to VB.NET.]

Lucene.Net.Documents.Document doc = ConvertToLuceneDocument(id, data);
Lucene.Net.Store.Directory dir = Lucene.Net.Store.FSDirectory.Open(new DirectoryInfo(UpdateConfiguration.IndexTextFiles));
Lucene.Net.Analysis.Analyzer analyzer = new Lucene.Net.Analysis.Standard.StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
Lucene.Net.Index.IndexWriter indexWriter = new Lucene.Net.Index.IndexWriter(dir, analyzer, false, Lucene.Net.Index.IndexWriter.MaxFieldLength.UNLIMITED);
Lucene.Net.Index.Term idTerm = new Lucene.Net.Index.Term("id", id);

foreach (FileInfo file in new DirectoryInfo(UpdateConfiguration.UpdatePath).EnumerateFiles())
{
        // Scenario 1: Single-step update.
            indexWriter.UpdateDocument(idTerm, doc, analyzer);

        // Scenario 2: Delete the document, then add the new version.
            indexWriter.DeleteDocuments(idTerm);
            indexWriter.AddDocument(doc);

        // Scenario 3: Take the necessary steps if the document does not exist.

            Lucene.Net.Index.IndexReader iReader = Lucene.Net.Index.IndexReader.Open(indexWriter.GetDirectory(), true);
            Lucene.Net.Search.IndexSearcher iSearcher = new Lucene.Net.Search.IndexSearcher(iReader);
            int docCount = iSearcher.DocFreq(idTerm);
            iSearcher.Close();
            iReader.Close();
            if (docCount == 0)
            {
                    //TODO: Take necessary steps
                    //Possible Step 1: add document
                    //indexWriter.AddDocument(doc);

                    //Possible Step 2: raise the error for the unknown document
            }
}
indexWriter.Optimize();
indexWriter.Close();
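
Scenario 3 is the one closest to the original question (skip files that are already in the index). A minimal Python sketch of that check follows; ToyIndex, doc_freq and index_folder are hypothetical names for a toy in-memory model, not Lucene.NET calls, and using the file path as the unique key is an assumption.

```python
class ToyIndex:
    """Toy in-memory index (NOT the Lucene.NET API)."""

    def __init__(self):
        self.docs = []

    def doc_freq(self, field, value):
        # Number of documents whose field exactly equals value.
        return sum(1 for d in self.docs if d.get(field) == value)

    def add_document(self, doc):
        self.docs.append(doc)


def index_folder(index, paths):
    """Add only the files that are not in the index yet; return how many were added."""
    added = 0
    for path in paths:
        if index.doc_freq("path", path) == 0:
            index.add_document({"path": path})
            added += 1
    return added


index = ToyIndex()
files = ["C:/docs/a.txt", "C:/docs/b.txt"]
first_pass = index_folder(index, files)   # both files are new
second_pass = index_folder(index, files)  # same folder selected again
```

Note that this only skips files that are already present; if a file's contents may have changed, the update scenarios above are the better fit.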

#6


Unless you're only modifying a small number of documents (say, less than 10% of the total), it's almost certainly faster to reindex from scratch, although your mileage may vary depending on stored/indexed fields, etc.

That said, I would always index to a temp directory, and then move the new one into place when it's done. That way, there's little downtime while the index is building, and if something goes wrong you still have a good index.
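
The build-then-swap approach could look roughly like the Python sketch below. The names rebuild_index and build_index are made up for illustration, and build_index stands for whatever code actually writes the index files into the directory it is given. The swap uses directory renames, which is an assumption about the setup: on Windows a rename will fail while another process still has files in the live index open.

```python
import os
import shutil
import tempfile


def rebuild_index(live_dir, build_index):
    """Build into a temp directory next to live_dir, then swap it into place."""
    parent = os.path.dirname(os.path.abspath(live_dir))
    temp_dir = tempfile.mkdtemp(prefix="index-build-", dir=parent)
    try:
        build_index(temp_dir)  # may raise; the live index is untouched so far
    except Exception:
        shutil.rmtree(temp_dir, ignore_errors=True)
        raise
    old_dir = live_dir + ".old"
    if os.path.exists(old_dir):
        shutil.rmtree(old_dir)
    if os.path.exists(live_dir):
        os.rename(live_dir, old_dir)  # move the previous index out of the way
    os.rename(temp_dir, live_dir)     # put the freshly built index in place
    shutil.rmtree(old_dir, ignore_errors=True)


def write_version(version):
    # Hypothetical builder: writes one marker file instead of a real index.
    def build(target_dir):
        with open(os.path.join(target_dir, "segments"), "w") as f:
            f.write(version)
    return build


def failing_build(target_dir):
    raise RuntimeError("indexing failed")


root = tempfile.mkdtemp()
live = os.path.join(root, "index")
rebuild_index(live, write_version("v1"))
rebuild_index(live, write_version("v2"))
```

If the build step throws, the temp directory is removed and the live index stays as it was.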

#7


One option is of course to remove a document and then to add the updated version of the document.

Alternatively you can also use the UpdateDocument() method of the IndexWriter class:

writer.UpdateDocument(new Term("patient_id", document.Get("patient_id")), document);

This of course requires you to have a mechanism by which you can locate the document you want to update ("patient_id" in this example).

I have blogged more details with a more complete source code example.
