How to index Wikipedia files in .xml format into Solr

I want to index Wikipedia's XML files into Solr.

But I am getting an error and Solr is unable to index them. Solr expects a specific format for XML files. I changed the schema.xml and data-config.xml files to match the tags used in the Wikipedia files.

It is still unable to index the files. My actual goal is to index Wikipedia, which is a single XML file of 30 GB.

How would I go about indexing all Wikipedia files into Solr?

1 solution

#1

There's an example section in the DataImportHandler documentation for exactly this: indexing Wikipedia.

Basically, you use the DataImportHandler and some XPath expressions to pull the metadata you care about out of the Wikipedia XML, and store it in flat Solr fields.
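
As a rough sketch of what such a data-config.xml might look like (not necessarily identical to the documentation's example): the dump path is a placeholder, the column names are assumptions that must match fields you define in schema.xml, and it assumes a Solr version that still ships the DataImportHandler. The XPath expressions follow the usual MediaWiki export layout of `<mediawiki><page>…</page></mediawiki>`.

```xml
<dataConfig>
  <!-- Read the dump from the local file system -->
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <!-- One Solr document per <page> element.
         stream="true" avoids loading the whole file into memory.
         The url value is a placeholder path, not a real location. -->
    <entity name="page"
            processor="XPathEntityProcessor"
            stream="true"
            forEach="/mediawiki/page/"
            url="/path/to/enwiki-pages-articles.xml"
            transformer="RegexTransformer,DateFormatTransformer">
      <!-- Column names are assumptions; declare matching fields in schema.xml -->
      <field column="id"        xpath="/mediawiki/page/id" />
      <field column="title"     xpath="/mediawiki/page/title" />
      <field column="text"      xpath="/mediawiki/page/revision/text" />
      <field column="timestamp" xpath="/mediawiki/page/revision/timestamp"
             dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss'Z'" />
      <!-- Skip redirect pages instead of indexing them -->
      <field column="$skipDoc"  regex="^#REDIRECT .*" replaceWith="true"
             sourceColName="text" />
    </entity>
  </document>
</dataConfig>
```

For a 30 GB dump, the streaming setting is probably the important part: without it the XPath processor would try to read the whole file before emitting any documents.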

