Can we start a new thread in foreachRDD in Spark Streaming?

Time: 2022-04-19 02:07:32

I want to start a child thread inside foreachRDD.

My situation is:

The job reads from an HDFS directory continuously, and every 100 batches I want to launch a model training task (I will take a snapshot of the RDDs at that point and start the training task on it). The training task takes a very long time (2 hours), and I don't want it to interfere with reading new batches of data.

Is starting a new child thread a good solution? Could the child thread use the SparkContext and the RDDs that belong to the main thread?

1 Solution

#1


You don't need to start a new thread inside RDD operations. To start a new job every hundred batches, you can add a BatchListener that counts the number of completed batches and starts a new job when the count reaches 100. See: BatchListener Example
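
The counting idea can be sketched without any Spark dependency. The snippet below is only an illustration under stated assumptions, not the code from the linked example: `BatchCounter`, `on_batch_completed` and the `task` callable are hypothetical names. In a real job, `on_batch_completed` would presumably be called from the `onBatchCompleted` callback of a Spark `StreamingListener` registered with `StreamingContext.addStreamingListener`. The long-running work is handed to a single-worker executor so that the callback returns immediately and batch processing is not held up. Spark's job scheduling documentation states that jobs may be submitted from separate threads within one SparkContext, so the background task could in principle reuse the same context.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BatchCounter:
    """Counts completed batches and submits `task` every `every` batches.

    Hypothetical helper for illustration; it is not part of Spark.
    """

    def __init__(self, every, task):
        self.every = every
        self.task = task          # callable taking the current batch count
        self.count = 0
        self.lock = threading.Lock()
        # One worker: at most one training run at a time, later runs queue up.
        self.pool = ThreadPoolExecutor(max_workers=1)
        self.futures = []

    def on_batch_completed(self):
        """Call once per finished batch; returns without waiting for the task."""
        with self.lock:
            self.count += 1
            if self.count % self.every == 0:
                # Hand the slow work to the background worker.
                self.futures.append(self.pool.submit(self.task, self.count))

    def close(self):
        """Wait for any queued or running tasks to finish."""
        self.pool.shutdown(wait=True)
```

For example, `BatchCounter(100, train)` with a hypothetical `train(batch_count)` function would submit `train` after batches 100, 200, and so on, while the caller keeps processing new batches.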
