Custom InputFormat or InputReader for Excel files (xls)

Date: 2023-01-15 11:32:20

I need to read an Excel (.xls) file stored on a Hadoop cluster. From my research it appears that I need to create a custom InputFormat for this. I have read many articles, but none of them is helpful from a programming point of view. Could someone help me with sample code for writing a custom InputFormat, so that I can understand the basics of programming an InputFormat and use the Apache POI library to read the Excel file? I have already written a MapReduce program that reads a text file. Now I need to know: even if I somehow manage to code my own custom InputFormat, where would that code go in relation to the MapReduce program I have already written?

PS: converting the .xls file into a .csv file is not an option.

3 Answers

#1


1  

Yes, you should create a RecordReader to read each record from your Excel document. Inside that RecordReader you should use a POI-like API to read from the Excel document. More precisely, follow these steps:

  1. Extend FileInputFormat, create your own CustomInputFormat, and override getRecordReader.

  2. Create a CustomRecordReader by extending RecordReader; here you have to define how to generate a key-value pair from a given FileSplit. So first read the bytes from the FileSplit, and from those buffered bytes read out the desired key and value using POI (see the sketch after this list).
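A minimal sketch of those two steps, not production code. Assumptions: the newer org.apache.hadoop.mapreduce API (where the hook is called createRecordReader rather than the old mapred getRecordReader), the whole .xls fits in memory and is read as a single split, and ExcelInputFormat/ExcelRecordReader are illustrative names:

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;

public class ExcelInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // .xls is binary; it cannot be cut at arbitrary offsets
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new ExcelRecordReader();
    }

    public static class ExcelRecordReader
            extends RecordReader<LongWritable, Text> {

        private Sheet sheet;                           // first sheet of the workbook
        private int currentRow = -1;                   // row index just emitted
        private final LongWritable key = new LongWritable();
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException {
            FileSplit fileSplit = (FileSplit) split;
            Configuration conf = context.getConfiguration();
            FileSystem fs = fileSplit.getPath().getFileSystem(conf);
            try (InputStream in = fs.open(fileSplit.getPath())) {
                sheet = new HSSFWorkbook(in).getSheetAt(0); // POI parses the whole file up front
            }
        }

        @Override
        public boolean nextKeyValue() {
            currentRow++;
            if (currentRow > sheet.getLastRowNum()) {
                return false; // no more rows
            }
            Row row = sheet.getRow(currentRow);
            StringBuilder sb = new StringBuilder();
            if (row != null) {
                for (Cell cell : row) {
                    if (sb.length() > 0) sb.append('\t');
                    sb.append(cell.toString()); // crude cell-to-text conversion
                }
            }
            key.set(currentRow);      // key = row number
            value.set(sb.toString()); // value = tab-joined cells
            return true;
        }

        @Override public LongWritable getCurrentKey() { return key; }
        @Override public Text getCurrentValue() { return value; }

        @Override
        public float getProgress() {
            return sheet == null ? 0f
                    : (float) currentRow / (sheet.getLastRowNum() + 1);
        }

        @Override public void close() { /* workbook was fully read in initialize() */ }
    }
}
```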

You can check my own CustomInputFormat and RecordReader for dealing with custom data objects here: myCustomInputFormat

#2


0  

Your research is correct: you need a custom InputFormat for Hadoop. If you are lucky, somebody has already created one for your use case.

If not, I would suggest looking for a Java library that is able to read Excel files. Since Excel is a proprietary file format, it is unlikely that you will find an implementation that works perfectly.

Once you have found a library that can read Excel files, integrate it with the InputFormat.

To do so, you have to extend Hadoop's FileInputFormat. The RecordReader returned by your ExcelInputFormat's getRecordReader must return the rows from your Excel file. You probably also have to override the getSplits() method (or isSplitable()) to tell the framework not to split the file at all. A driver sketch showing where this plugs in follows below.
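To answer the "where does the code go" part of the question: a hypothetical driver sketch, assuming the ExcelInputFormat from answer #1. The only changes to an existing text-file job are the setInputFormatClass call and a mapper whose input types match the format's LongWritable/Text pairs; ExcelDriver and RowMapper are illustrative names.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ExcelDriver {

    // Same shape as a TextInputFormat mapper: key = row number, value = row text.
    public static class RowMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(key, value); // identity map; replace with real logic
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "excel-read");
        job.setJarByClass(ExcelDriver.class);
        job.setMapperClass(RowMapper.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // The one-line swap: use the custom format instead of TextInputFormat.
        job.setInputFormatClass(ExcelInputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```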

#3


-1  

Alternatively, you can try the HadoopOffice library, which provides a file format for Excel on Hadoop/Spark/etc. It supports many features, including linked workbooks and encryption. It is based on Apache POI. https://github.com/ZuInnoTe/hadoopoffice/wiki
