How to index all the CSV files in a directory with Solr?

Date: 2022-09-06 22:49:43

Given a directory with hundreds of tab-delimited CSV files, none of which has a header in its first row, the column names must be specified by some other means. The files may be located on a local disk or on HDFS.


What is the most efficient way to index these files?

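For the headerless, tab-delimited case specifically, Solr's CSV update handler can ingest the files directly: its `header`, `separator`, and `fieldnames` parameters let you tell Solr that the first row is data and supply the column names yourself. A minimal sketch, assuming a local Solr instance with a hypothetical core named `mycore` and made-up column names:

```python
import glob
import urllib.parse
import urllib.request

# Assumed Solr endpoint and core name -- adjust for your installation.
SOLR_UPDATE = "http://localhost:8983/solr/mycore/update"

def build_csv_update_url(fieldnames, commit=False):
    """Build an update URL for headerless, tab-separated input."""
    params = {
        "header": "false",                 # first row is data, not column names
        "separator": "\t",                 # tab-delimited input
        "fieldnames": ",".join(fieldnames),
        "commit": "true" if commit else "false",
    }
    return SOLR_UPDATE + "?" + urllib.parse.urlencode(params)

def index_directory(pattern, fieldnames):
    """POST every matching file to Solr, committing only after the last one."""
    files = sorted(glob.glob(pattern))
    for i, path in enumerate(files):
        url = build_csv_update_url(fieldnames, commit=(i == len(files) - 1))
        with open(path, "rb") as f:
            req = urllib.request.Request(
                url, data=f.read(), headers={"Content-Type": "text/csv"}
            )
            urllib.request.urlopen(req)
```

Committing once at the end, rather than per file, avoids paying the commit cost hundreds of times.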

1 solution

#1


If you have a lot of files, there are several ways to improve indexing speed:


First, if your data is on a local disk, you can build the index with multiple threads, but note that each thread must write to its own index output directory. Finally, merge them into a single index so that searches are fast.

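The per-thread-directory pattern can be sketched as follows. This is only an outline of the partitioning logic: `index_partition` stands in for whatever actually writes index segments (in a real setup, a Lucene `IndexWriter` per directory), and the final merge would be done with Lucene's `IndexWriter.addIndexes`; here the indexing step is simulated with plain files so the structure is visible.

```python
import concurrent.futures
import os

def index_partition(worker_id, files, out_root):
    """Each worker writes into its own private index directory, never a shared one."""
    out_dir = os.path.join(out_root, f"index-{worker_id}")
    os.makedirs(out_dir, exist_ok=True)
    for path in files:
        # Placeholder for real indexing (e.g. IndexWriter.addDocument per row);
        # here we just record which files this worker handled.
        with open(os.path.join(out_dir, "docs.txt"), "a") as out:
            out.write(path + "\n")
    return out_dir

def parallel_index(files, out_root, workers=4):
    """Round-robin the files across workers, one index directory per worker."""
    partitions = [files[i::workers] for i in range(workers)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        dirs = list(pool.map(index_partition, range(workers), partitions,
                             [out_root] * workers))
    # In a real setup the per-thread indexes would now be merged into one
    # (e.g. with IndexWriter.addIndexes) so queries hit a single index.
    return dirs
```

The round-robin split keeps the per-worker directories roughly balanced regardless of how file sizes are ordered.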

Second, if your data is on HDFS, using Hadoop MapReduce to build the index is very powerful. In addition, some UDF plugins for Pig or Hive can also build an index easily, but you first need to load your data into a Hive table or define a Pig schema, which is simple.

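To make the MapReduce route concrete, a Hadoop Streaming-style mapper is just a script that reads tab-separated lines from stdin and emits one document per line. The sketch below assumes hypothetical field names and a JSON-per-line output format; a real job would feed this output to an indexing reducer or a dedicated Solr indexing tool rather than printing it.

```python
import json
import sys

# Assumed schema -- the files have no header row, so names come from here.
FIELDNAMES = ["id", "name", "price"]

def map_line(line):
    """Turn one tab-separated record into a Solr-style document dict."""
    values = line.rstrip("\n").split("\t")
    return dict(zip(FIELDNAMES, values))

def main(stdin=sys.stdin, stdout=sys.stdout):
    # Hadoop Streaming runs this script over each input split on the cluster.
    for line in stdin:
        if line.strip():
            stdout.write(json.dumps(map_line(line)) + "\n")

if __name__ == "__main__":
    main()
```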

Third, to better understand the methods above, you may want to read "How to make indexing faster".

