How do I handle many (100s of) input files in hadoop2, where each file is smaller than 10MB?

Time: 2022-04-04 14:56:51

Let us suppose that I have 200 input files each of size 10MB. //total_size=2GB

How can I make these files get stored in 16 HDFS blocks? //default_block_size=128MB

By doing so, I think 16 mappers will do my work more efficiently than 200 mappers for 200 input files.

3 Solutions

#1 (score: 0)

You cannot store multiple files inside a single block in HDFS; this is a basic rule of HDFS. In your case the HDFS blocks are not used well: of the 128MB in a block, only 10MB is used, and the remaining 118MB cannot be used by any other file, so it stays free. (One thing to note here: HDFS blocks are logical, so each of your blocks will take only 10MB of physical storage even though the block size is set to 128MB.)

In short, in HDFS the file-to-blocks relation is one-to-many, but the blocks-to-files relation cannot be one-to-many: one file can span many blocks, but one block belongs to exactly one file.
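
As a hedged illustration of this point (not part of the original answer), the sketch below lists the block layout of each file in a directory; for 10MB files written with a 128MB block size, each file shows up with a single block of about 10MB. The class name and the input-directory argument are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlocks {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            for (FileStatus status : fs.listStatus(new Path(args[0]))) {
                if (status.isDirectory()) {
                    continue;
                }
                BlockLocation[] blocks =
                        fs.getFileBlockLocations(status, 0, status.getLen());
                System.out.printf("%s: %d block(s), configured block size %d MB%n",
                        status.getPath().getName(),
                        blocks.length,
                        status.getBlockSize() / (1024 * 1024));
                for (BlockLocation block : blocks) {
                    // For a 10MB file this prints one block of length ~10MB.
                    System.out.printf("  offset=%d, length=%d bytes%n",
                            block.getOffset(), block.getLength());
                }
            }
        }
    }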

#2 (score: 1)

The best option is to change the process that writes to HDFS so that it saves files that are equal (or approximately equal) to the block size. This ensures that you make good use of the block size, and when any job is executed on the Hadoop cluster, it will spin up a number of map tasks equal to the number of blocks (splits).

If your input data set is large, an ideal approach is to compress the data further and then save it in HDFS. This will reduce the footprint of the data stored in the cluster and improve the performance of jobs reading that data.
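
One hedged sketch of such a writing process (my example, not the answer's): pack the small files into a single block-compressed SequenceFile before loading them, so that what lands in HDFS is both close to the block size and compressed. The class name and path arguments are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PackSmallFiles {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path inputDir = new Path(args[0]); // directory holding the small files
            Path packed = new Path(args[1]);   // single packed output file

            // Block-compressed SequenceFile: file name as key, raw bytes as value.
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(packed),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class),
                    SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK))) {
                for (FileStatus status : fs.listStatus(inputDir)) {
                    if (status.isDirectory()) {
                        continue;
                    }
                    byte[] contents = new byte[(int) status.getLen()];
                    try (FSDataInputStream in = fs.open(status.getPath())) {
                        IOUtils.readFully(in, contents, 0, contents.length);
                    }
                    writer.append(new Text(status.getPath().getName()),
                                  new BytesWritable(contents));
                }
            }
        }
    }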

#3 (score: 0)

First of all, you cannot store the files that way (in 16 HDFS blocks).

In order to spawn around 16 mappers for the files, you can use CombineFileInputFormat, so that it combines files into one split until the provided size limit is met (boundary cases may differ).

You need to specify mapreduce.input.fileinputformat.split.maxsize and mapreduce.input.fileinputformat.split.minsize.
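
A minimal driver sketch of this setup, assuming a plain-text, map-only job and using CombineTextInputFormat (the text-oriented subclass of CombineFileInputFormat); the class names, mapper, and path arguments are placeholders:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CombineSmallFilesJob {

        // Pass-through mapper used only to keep the sketch self-contained;
        // replace it with your real map logic.
        public static class PassThroughMapper
                extends Mapper<LongWritable, Text, LongWritable, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                context.write(key, value);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Set the two properties the answer mentions; the max size caps
            // each combined split at 128MB.
            conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 128L * 1024 * 1024);
            conf.setLong("mapreduce.input.fileinputformat.split.minsize", 128L * 1024 * 1024);

            Job job = Job.getInstance(conf, "combine-small-files");
            job.setJarByClass(CombineSmallFilesJob.class);
            job.setMapperClass(PassThroughMapper.class);
            job.setNumReduceTasks(0); // map-only, just to illustrate the input format

            // CombineTextInputFormat packs many small files into each input split.
            job.setInputFormatClass(CombineTextInputFormat.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

With the maximum split size at 128MB, the 200 x 10MB inputs should be grouped into roughly 16 splits, and therefore roughly 16 map tasks.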
