I want to start a child thread in foreachRDD.
My situation is:
The job reads from an HDFS directory continuously, and every 100 batches I want to launch a model training task: I will take a snapshot of the RDDs at that point and start training on it. The training task takes a very long time (2 hours), and I don't want it to hold up reading new batches of data.
Is starting a new child thread a good solution? Could the child thread use the SparkContext and the RDDs created in the main thread?
1 solution
#1
You don't need to start a new thread inside an RDD operation. To start a new job every hundred batches, you can add a BatchListener (for example, a class implementing Spark Streaming's StreamingListener interface) that counts completed batches and starts a new job each time the count reaches a multiple of 100.
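The counting idea in this answer can be sketched without any Spark dependency. The snippet below is only an illustration under stated assumptions: `EveryNBatchesTrigger` and `train_model` are hypothetical names, and in a real streaming job `on_batch_completed` would be called from a listener callback (such as `onBatchCompleted` on a StreamingListener) rather than from a plain loop. Because the original question was about not blocking new batches, the sketch hands the long-running task to a single background worker instead of running it in the callback itself.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Optional


class EveryNBatchesTrigger:
    """Counts completed batches and starts a background task on every n-th one.

    Hypothetical helper for illustration; not part of any Spark API.
    """

    def __init__(self, n: int, task: Callable[[int], object]) -> None:
        self._n = n
        self._task = task
        self._count = 0
        self._lock = threading.Lock()
        # One worker: at most one training run is active at a time.
        # A run that is still busy makes the next one wait in the queue.
        self._executor = ThreadPoolExecutor(max_workers=1)

    def on_batch_completed(self) -> Optional[Future]:
        """Call once per finished batch. Returns a Future if a task was started."""
        with self._lock:
            self._count += 1
            count = self._count
        if count % self._n != 0:
            return None
        # submit() returns immediately, so the caller is not blocked by training.
        return self._executor.submit(self._task, count)

    def shutdown(self) -> None:
        """Wait for any queued or running tasks, then release the worker."""
        self._executor.shutdown(wait=True)


def train_model(batch_count: int) -> str:
    # Placeholder (assumption) for the real, long-running training job.
    return f"trained after {batch_count} batches"


# Simulate 250 completed batches; training should start after batch 100 and 200.
trigger = EveryNBatchesTrigger(100, train_model)
futures = []
for _ in range(250):
    started = trigger.on_batch_completed()
    if started is not None:
        futures.append(started)
trigger.shutdown()
results = [f.result() for f in futures]
```

With a single worker, a training run that outlasts the next 100 batches delays the following run rather than overlapping with it; whether that is the desired behaviour depends on the job.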