为什么我的火花流工作有这么多的任务

I wonder why there are so many task number in my spark streaming job ? and it becomes bigger and bigger...

我想知道为什么我的火花流工作中有这么多的任务编号?它变得越来越大......

after 3.2 hours' running, it grow to 120020... and after one day's running, it will grow to one million... why?

运行3.2小时后,它会增长到120020 ......经过一天的运行,它会增长到一百万......为什么?

3 个解决方案

#1

This SparkUI feature means that some stage dependencies might have been computed, or not, but were skipped because their output was already available. Therefore they show as skipped.

此SparkUI功能意味着某些阶段依赖项可能已计算或未计算,但由于其输出已可用而被跳过。因此他们显示为跳过。

Please not the might, meaning that until the job finishes Spark don't know for sure whether it will need to go back and re-compute some stages that were initially skipped.

请不要强制,这意味着在工作结束之前Spark不确定是否需要返回并重新计算最初跳过的一些阶段。

[1] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala#L189

#2

The nature of a streaming application is to run the same process for each batch of data over time. It looks like you're trying to run with a 1-second batch interval and each interval might spawn several jobs. You show 585 jobs in 3.2 hours, not 120020. However, it also kind of looks like your processing finishes in nowhere like 1 second. I imagine your scheduling delay is very very high. This is a symptom of having far too small a batch interval, I would guess.

流应用程序的本质是随着时间的推移为每批数据运行相同的过程。看起来你试图以1秒的批处理间隔运行,每个间隔可能产生几个作业。您在3.2小时内显示585个作业,而不是120020.但是,它看起来有点像您的处理完成无处不在1秒。我想你的调度延迟非常高。我猜这是一个批次间隔太小的症状。

#3

I would strongly recommend that you check the parameter spark.streaming.blockInterval, which is a very important one. By default it's 0.5 seconds, i.e. create one task every 0.5 seconds.

我强烈建议您检查参数spark.streaming.blockInterval,这是一个非常重要的参数。默认情况下为0.5秒,即每0.5秒创建一个任务。

So maybe you can try to increase the spark.streaming.blockInterval to be 1min or 10min then the number of tasks should decrease.

所以也许你可以尝试将spark.streaming.blockInterval增加到1分钟或10分钟,然后任务数量应该减少。

My intuition is simply because your consumer is as fast as the producer, so as the time going, more and more tasks are accumulated for further consumption.

我的直觉仅仅是因为你的消费者和生产者一样快,所以随着时间的推移,越来越多的任务被积累以供进一步消费。

It may due to your Spark cluster's incapacity to process such a large batch. It may also be related the checkpoint interval time, maybe you are setting it too large or too small. It may also be related to your settings of Parallelism, Partitions or Data Locality etc.

这可能是由于您的Spark群集无法处理如此大的批次。它也可能与检查点间隔时间有关,也许你将它设置得太大或太小。它也可能与您的并行,分区或数据位置等设置有关。

good luck

Read this

Tuning Spark Streaming for Throughput

为吞吐量调整Spark Streaming

How-to: Tune Your Apache Spark Jobs (Part 1)

操作方法:调整Apache Spark工作(第1部分)

How-to: Tune Your Apache Spark Jobs (Part 2)

操作方法:调整Apache Spark工作(第2部分)

#1