How to index all the CSV files in a directory with Solr?

Date: 2022-09-06 22:49:43

Given a directory with hundreds of tab-delimited CSV files, none of which has a header in its first row, the column names must be specified by some other means. The files may be located on a local disk or on HDFS.


What is the most efficient way to index these files?

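For the headerless, tab-delimited case specifically, Solr's CSV update handler can ingest the files directly: its `header`, `separator`, and `fieldnames` parameters let you tell Solr that the first row is data and supply the column names yourself. A minimal sketch, assuming a local Solr instance with a hypothetical core named `mycore` and made-up column names:

```python
import glob
import urllib.parse
import urllib.request

# Assumed Solr endpoint and core name -- adjust for your installation.
SOLR_UPDATE = "http://localhost:8983/solr/mycore/update"

def build_csv_update_url(fieldnames, commit=False):
    """Build an update URL for headerless, tab-separated input."""
    params = {
        "header": "false",                 # first row is data, not column names
        "separator": "\t",                 # tab-delimited input
        "fieldnames": ",".join(fieldnames),
        "commit": "true" if commit else "false",
    }
    return SOLR_UPDATE + "?" + urllib.parse.urlencode(params)

def index_directory(pattern, fieldnames):
    """POST every matching file to Solr, committing only after the last one."""
    files = sorted(glob.glob(pattern))
    for i, path in enumerate(files):
        url = build_csv_update_url(fieldnames, commit=(i == len(files) - 1))
        with open(path, "rb") as f:
            req = urllib.request.Request(
                url, data=f.read(), headers={"Content-Type": "text/csv"}
            )
            urllib.request.urlopen(req)
```

Committing once at the end, rather than per file, avoids paying the commit cost hundreds of times.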

1 solution

#1


If you have a lot of files, there are several ways to improve indexing speed:


First, if your data is on a local disk, you can build the index with multiple threads, but note that each thread must write to its own index output directory. Finally, merge them into a single index so that searches are fast.

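The per-thread-directory pattern can be sketched as follows. This is only an outline of the partitioning logic: `index_partition` stands in for whatever actually writes index segments (in a real setup, a Lucene `IndexWriter` per directory), and the final merge would be done with Lucene's `IndexWriter.addIndexes`; here the indexing step is simulated with plain files so the structure is visible.

```python
import concurrent.futures
import os

def index_partition(worker_id, files, out_root):
    """Each worker writes into its own private index directory, never a shared one."""
    out_dir = os.path.join(out_root, f"index-{worker_id}")
    os.makedirs(out_dir, exist_ok=True)
    for path in files:
        # Placeholder for real indexing (e.g. IndexWriter.addDocument per row);
        # here we just record which files this worker handled.
        with open(os.path.join(out_dir, "docs.txt"), "a") as out:
            out.write(path + "\n")
    return out_dir

def parallel_index(files, out_root, workers=4):
    """Round-robin the files across workers, one index directory per worker."""
    partitions = [files[i::workers] for i in range(workers)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        dirs = list(pool.map(index_partition, range(workers), partitions,
                             [out_root] * workers))
    # In a real setup the per-thread indexes would now be merged into one
    # (e.g. with IndexWriter.addIndexes) so queries hit a single index.
    return dirs
```

The round-robin split keeps the per-worker directories roughly balanced regardless of how file sizes are ordered.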

Second, if your data is on HDFS, using Hadoop MapReduce to build the index is very powerful. In addition, some UDF plugins for Pig or Hive can also build an index easily, but you first need to load your data into a Hive table or define a Pig schema, which is simple.

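To make the MapReduce route concrete, a Hadoop Streaming-style mapper is just a script that reads tab-separated lines from stdin and emits one document per line. The sketch below assumes hypothetical field names and a JSON-per-line output format; a real job would feed this output to an indexing reducer or a dedicated Solr indexing tool rather than printing it.

```python
import json
import sys

# Assumed schema -- the files have no header row, so names come from here.
FIELDNAMES = ["id", "name", "price"]

def map_line(line):
    """Turn one tab-separated record into a Solr-style document dict."""
    values = line.rstrip("\n").split("\t")
    return dict(zip(FIELDNAMES, values))

def main(stdin=sys.stdin, stdout=sys.stdout):
    # Hadoop Streaming runs this script over each input split on the cluster.
    for line in stdin:
        if line.strip():
            stdout.write(json.dumps(map_line(line)) + "\n")

if __name__ == "__main__":
    main()
```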

Third, to better understand the methods above, you may want to read "How to make indexing faster".

