Setting the number of map tasks and reduce tasks

Time: 2022-02-22 18:24:57

I am currently running a job where I fixed the number of map tasks to 20, but I am getting a higher number. I also set the number of reduce tasks to zero, but I am still getting a number other than zero. The total time for the MapReduce job to complete is also not displayed. Can someone tell me what I am doing wrong? I am using this command:


hadoop jar Test_Parallel_for.jar Test_Parallel_for Matrix/test4.txt Result 3 \ -D mapred.map.tasks = 20 \ -D mapred.reduce.tasks =0

Output:


11/07/30 19:48:56 INFO mapred.JobClient: Job complete: job_201107291018_0164
11/07/30 19:48:56 INFO mapred.JobClient: Counters: 18
11/07/30 19:48:56 INFO mapred.JobClient:   Job Counters 
11/07/30 19:48:56 INFO mapred.JobClient:     Launched reduce tasks=13
11/07/30 19:48:56 INFO mapred.JobClient:     Rack-local map tasks=12
11/07/30 19:48:56 INFO mapred.JobClient:     Launched map tasks=24
11/07/30 19:48:56 INFO mapred.JobClient:     Data-local map tasks=12
11/07/30 19:48:56 INFO mapred.JobClient:   FileSystemCounters
11/07/30 19:48:56 INFO mapred.JobClient:     FILE_BYTES_READ=4020792636
11/07/30 19:48:56 INFO mapred.JobClient:     HDFS_BYTES_READ=1556534680
11/07/30 19:48:56 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=6026699058
11/07/30 19:48:56 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1928893942
11/07/30 19:48:56 INFO mapred.JobClient:   Map-Reduce Framework
11/07/30 19:48:56 INFO mapred.JobClient:     Reduce input groups=40000000
11/07/30 19:48:56 INFO mapred.JobClient:     Combine output records=0
11/07/30 19:48:56 INFO mapred.JobClient:     Map input records=40000000
11/07/30 19:48:56 INFO mapred.JobClient:     Reduce shuffle bytes=1974162269
11/07/30 19:48:56 INFO mapred.JobClient:     Reduce output records=40000000
11/07/30 19:48:56 INFO mapred.JobClient:     Spilled Records=120000000
11/07/30 19:48:56 INFO mapred.JobClient:     Map output bytes=1928893942
11/07/30 19:48:56 INFO mapred.JobClient:     Combine input records=0
11/07/30 19:48:56 INFO mapred.JobClient:     Map output records=40000000
11/07/30 19:48:56 INFO mapred.JobClient:     Reduce input records=40000000
[hcrc1425n30]s0907855: 

15 Answers

#1


52  

The number of map tasks for a given job is driven by the number of input splits and not by the mapred.map.tasks parameter. For each input split a map task is spawned. So, over the lifetime of a mapreduce job the number of map tasks is equal to the number of input splits. mapred.map.tasks is just a hint to the InputFormat for the number of maps.

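The "hint" behavior can be sketched numerically. In the old API, FileInputFormat derives a per-split goal size from the hint and then caps it at the block size, which is consistent with the log above launching 24 map tasks despite a hint of 20: HDFS_BYTES_READ is about 1.45 GB, which at an assumed 64 MB block size gives 24 splits. A simplified sketch (ignoring per-file boundaries and the slop factor on the last split):

```java
public class SplitCount {
    // Simplified old-API FileInputFormat split sizing:
    // splitSize = max(minSize, min(goalSize, blockSize))
    static long splitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // mapred.map.tasks is only a hint: it sets the goal size, but the
    // block size caps how large a split can get.
    static long numSplits(long totalSize, long numMapsHint, long minSize, long blockSize) {
        long goal = totalSize / Math.max(1, numMapsHint);
        long split = splitSize(goal, minSize, blockSize);
        return (totalSize + split - 1) / split; // ceiling division
    }

    public static void main(String[] args) {
        long total = 1556534680L;       // HDFS_BYTES_READ from the log above
        long block = 64L * 1024 * 1024; // 64 MB block size (an assumption)
        // Hint of 20 maps: goal ~77.8 MB, capped at the 64 MB block -> 24 splits
        System.out.println(numSplits(total, 20, 1, block)); // prints 24
    }
}
```

With these assumed numbers, the hint of 20 is simply too small to matter: the block size already forces 24 splits, hence 24 launched map tasks.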

In your example Hadoop has determined there are 24 input splits and will spawn 24 map tasks in total. But you can control how many map tasks can be executed in parallel by each of the task trackers.


Also, removing the extra spaces around = in -D mapred.reduce.tasks =0 might solve the problem for reduce.


For more information on the number of map and reduce tasks, please look at the URL below:


http://wiki.apache.org/hadoop/HowManyMapsAndReduces


#2


18  

As Praveen mentions above, when using the basic FileInputFormat classes, the number of map tasks is just the number of input splits that constitute the data. The number of reducers is controlled by mapred.reduce.tasks, specified in the way you have it: -D mapred.reduce.tasks=10 would specify 10 reducers. Note that the space after -D is required; if you omit the space, the configuration property is passed along to the relevant JVM, not to Hadoop.


Are you specifying 0 because there is no reduce work to do? In that case, if you're having trouble with the run-time parameter, you can also set the value directly in code. Given a JobConf instance job, call


job.setNumReduceTasks(0);

inside, say, your implementation of Tool.run. That should produce output directly from the mappers. If your job actually produces no output whatsoever (because you're using the framework just for side-effects like network calls or image processing, or if the results are entirely accounted for in Counter values), you can disable output by also calling


job.setOutputFormat(NullOutputFormat.class);

#3


9  

It's important to keep in mind that the MapReduce framework in Hadoop allows us only to


suggest the number of Map tasks for a job


which, like Praveen pointed out above, will correspond to the number of input splits for the task. Unlike its behavior for the number of reducers (which is directly related to the number of files output by the MapReduce job), where we can


demand that it provide n reducers.


#4


7  

To explain it with an example:


Assume your Hadoop input file size is 2 GB and you set the block size to 64 MB, so 32 mapper tasks are set to run, while each mapper processes one 64 MB block to complete the mapper part of your Hadoop job.


==> The number of mappers set to run is completely dependent on 1) the file size and 2) the block size.


Assume you are running Hadoop on a cluster of size 4, and assume you set the mapred.map.tasks and mapred.reduce.tasks parameters in your conf file for the nodes as follows:


Node 1: mapred.map.tasks = 4 and mapred.reduce.tasks = 4
Node 2: mapred.map.tasks = 2 and mapred.reduce.tasks = 2
Node 3: mapred.map.tasks = 4 and mapred.reduce.tasks = 4
Node 4: mapred.map.tasks = 1 and mapred.reduce.tasks = 1

Assume you set the above parameters for the 4 nodes in this cluster. Notice that Node 2 is set to only 2 and 2 respectively, because its processing resources might be smaller (e.g. 2 processors, 2 cores), and Node 4 is set even lower, to just 1 and 1, perhaps because that node has 1 processor and 2 cores and so cannot run more than 1 mapper and 1 reducer task.


So when you run the job, Node 1, Node 2, Node 3 and Node 4 are configured to run a max total of (4+2+4+1) = 11 mapper tasks simultaneously, out of the 32 mapper tasks that need to be completed by the job. After each node completes its map tasks, it takes up the remaining mapper tasks.

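The "waves" arithmetic above can be sketched concretely (slot counts from the hypothetical four-node setup, and the 32 map tasks from the 2 GB / 64 MB example):

```java
public class MapWaves {
    // With per-node slot limits, the cluster runs mappers in "waves":
    // at most (sum of node slots) tasks run at once, repeating until
    // all tasks have finished.
    static int waves(int totalTasks, int[] nodeSlots) {
        int capacity = 0;
        for (int s : nodeSlots) capacity += s;
        return (totalTasks + capacity - 1) / capacity; // ceiling division
    }

    public static void main(String[] args) {
        int[] slots = {4, 2, 4, 1};           // the four nodes above: 11 slots total
        System.out.println(waves(32, slots)); // 32 splits -> 3 waves (11 + 11 + 10)
    }
}
```

In practice the waves are not this clean, since tasks finish at different times and free slots are refilled immediately, but the ceiling gives the minimum number of rounds.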

Now coming to reducers: as you set mapred.reduce.tasks = 0, we only get mapper output written to 32 files (1 file for each mapper task) and no reducer output.


#5


2  

In newer versions of Hadoop, there are the much more granular mapreduce.job.running.map.limit and mapreduce.job.running.reduce.limit, which allow you to limit how many mapper and reducer tasks run at once, irrespective of the HDFS file split size. This is helpful if you are under a constraint not to take up large resources in the cluster.


JIRA


#6


1  

From your log I understood that you have 12 input files, as there are 12 data-local maps generated. Rack-local maps are spawned for the same file if some of the blocks of that file are on some other data node. How many data nodes do you have?


#7


1  

In your example, the -D parts are not picked up:


hadoop jar Test_Parallel_for.jar Test_Parallel_for Matrix/test4.txt Result 3 \ -D mapred.map.tasks = 20 \ -D mapred.reduce.tasks =0

They should come after the classname part like this:


hadoop jar Test_Parallel_for.jar Test_Parallel_for -Dmapred.map.tasks=20 -Dmapred.reduce.tasks=0 Matrix/test4.txt Result 3

A space after -D is allowed though.


Also note that changing the number of mappers is probably a bad idea as other people have mentioned here.


#8


1  

The number of map tasks is directly defined by the number of chunks your input is split into. The size of a data chunk (i.e. the HDFS block size) is controllable and can be set for an individual file, a set of files, or a directory. So, setting a specific number of map tasks in a job is possible, but it involves setting a corresponding HDFS block size for the job's input data. mapred.map.tasks can be used for that too, but only if its provided value is greater than the number of splits for the job's input data.


Controlling the number of reducers via mapred.reduce.tasks is correct. However, setting it to zero is a rather special case: the job's output is then a concatenation of the mappers' outputs (non-sorted). In Matt's answer one can see more ways to set the number of reducers.


#9


0  

One way you can increase the number of mappers is to give your input in the form of split files (you can use the Linux split command). Hadoop Streaming usually assigns as many mappers as there are input files (if there are a large number of files); if not, it will try to split the input into equal-sized parts.


#10


0  

  • Use -D property=value rather than -D property = value (eliminate the extra whitespace). Then -D mapred.reduce.tasks=value will work fine.

  • Setting the number of map tasks doesn't always reflect the value you have set, since it depends on the split size and the InputFormat used.

  • Setting the number of reduces will definitely override the number of reduces set in the cluster/client-side configuration.
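The first bullet can be illustrated with a toy parser. Command-line arguments are split on whitespace by the shell, so "-D mapred.reduce.tasks = 0" arrives as four separate tokens and the value never gets attached to the property. This is not Hadoop's actual GenericOptionsParser, just a minimal sketch of the failure mode:

```java
public class DashDParsing {
    // Toy -D parser: expects the token right after "-D" to be "key=value".
    // Returns the value for the given key, or null if it was not set.
    static String parseD(String[] args, String key) {
        for (int i = 0; i < args.length - 1; i++) {
            if (args[i].equals("-D")) {
                String[] kv = args[i + 1].split("=", 2);
                if (kv.length == 2 && kv[0].equals(key)) return kv[1];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String[] good = {"-D", "mapred.reduce.tasks=0"};
        String[] bad  = {"-D", "mapred.reduce.tasks", "=", "0"};
        System.out.println(parseD(good, "mapred.reduce.tasks")); // 0
        System.out.println(parseD(bad,  "mapred.reduce.tasks")); // null
    }
}
```

In the "bad" case the property name arrives with no "=" in its token, so the setting is silently dropped, which matches the symptom in the question.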

#11


0  

I agree that the number of map tasks depends upon the input splits, but in some scenarios I could see it behave a little differently.


Case 1: I created a simple map-only task, yet it creates 2 output files with the same data. The command I gave is below:


bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar -D mapred.reduce.tasks=0 -input /home/sample.csv -output /home/sample_csv112.txt -mapper /home/amitav/workpython/readcsv.py


Case 2: I restricted the map tasks to 1. The output came correctly with one output file, but one reducer was also launched in the UI screen, even though I restricted the reducer job. The command is given below.


bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar -D mapred.map.tasks=1 mapred.reduce.tasks=0 -input /home/sample.csv -output /home/sample_csv115.txt -mapper /home/amitav/workpython/readcsv.py


#12


0  

The first part has already been answered: "just a suggestion." The second part has also been answered: "remove the extra spaces around =." If both of these didn't work, are you sure you have implemented ToolRunner?


#13


0  

The number of map tasks depends on the file size. If you want n maps, divide the file size by n and set the split sizes as follows:


conf.set("mapred.max.split.size", "41943040"); // maximum split file size in bytes
conf.set("mapred.min.split.size", "20971520"); // minimum split file size in bytes
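The arithmetic behind those values can be sketched as follows. The 400 MB file size is a made-up example; dividing it by a target of 10 maps yields the 40 MB max split size used in the snippet above (actual split counts also depend on block and file boundaries):

```java
public class SplitForNMaps {
    // To aim for roughly n map tasks, set the max split size near fileSize / n.
    static long maxSplitSize(long fileSizeBytes, long n) {
        return fileSizeBytes / n;
    }

    public static void main(String[] args) {
        long fileSize = 419430400L; // a hypothetical 400 MB input file
        long targetMaps = 10;
        System.out.println(maxSplitSize(fileSize, targetMaps)); // 41943040 (40 MB)
    }
}
```

The computed value would then be passed to conf.set("mapred.max.split.size", ...) as shown above.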

#14


-1  

Folks, from this theory it seems we cannot run MapReduce jobs in parallel.


Let's say I configured a total of 5 mapper slots to run on a particular node. I also want to use this in such a way that JOB1 can use 3 mappers and JOB2 can use 2 mappers, so that the jobs can run in parallel. But the above properties are ignored, so how can jobs execute in parallel?


#15


-1  

From what I understand reading the above, it depends on the input files. If there are 100 input files, Hadoop will create 100 map tasks. However, it depends on the node configuration how many can be run at one point in time. If a node is configured to run 10 map tasks, only 10 map tasks will run in parallel, picking 10 different input files out of the 100 available. Map tasks will continue to fetch more files as and when they complete processing a file.

