Spark StreamingContext loaded from checkpoint has no hadoopConf set

Time: 2021-12-01 20:50:00

Can't recover from checkpointing to Azure Blob Storage with a wasbs://... URL.


Using Standalone Spark 2.0.2 in cluster mode.


val ssc = StreamingContext.getOrCreate(checkpointPath, () => createSSC(), hadoopConf)

I set fs.azure and fs.azure.account.key.$account.blob.core.windows.net on the hadoopConf passed to getOrCreate (via hadoopConf.set), and redundantly inside the createSSC function via sparkSession.sparkContext.hadoopConfiguration.set.

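For context, the setup in question looks roughly like this. This is a sketch, not the original code: the account name, container, batch interval, and the CHECKPOINT_BLOB_KEY environment variable are assumed placeholders.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointPath = "wasbs://container@myaccount.blob.core.windows.net/checkpoints"
val blobKey = sys.env("CHECKPOINT_BLOB_KEY") // assumed source of the key

// hadoopConf handed to getOrCreate so the checkpoint files themselves
// can be read back over wasbs:// on restart.
val hadoopConf = new Configuration()
hadoopConf.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
hadoopConf.set("fs.azure.account.key.myaccount.blob.core.windows.net", blobKey)

def createSSC(): StreamingContext = {
  val spark = SparkSession.builder.getOrCreate()
  // Redundant copy on the live context -- but createSSC only runs on a
  // fresh start; on recovery the context is rebuilt from checkpoint data.
  spark.sparkContext.hadoopConfiguration
    .set("fs.azure.account.key.myaccount.blob.core.windows.net", blobKey)
  val ssc = new StreamingContext(spark.sparkContext, Seconds(10))
  ssc.checkpoint(checkpointPath)
  ssc
}

val ssc = StreamingContext.getOrCreate(checkpointPath, () => createSSC(), hadoopConf)
```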

The job successfully writes checkpoint files while running, and keeps running until I stop it.


When I restart it, the context created from the checkpoint data doesn't have the hadoopConf info needed to re-access wasbs:// storage, and it throws an error saying it can't create the container with anonymous access.


What am I missing? I've found a couple of similar posts about S3 but no clear solution.



More details: this happens after restarting from a checkpoint inside the Kafka 0.10.1.1 connector, and I've confirmed that the sparkContext.hadoopConf attached to that RDD does have the correct key.


1 solution

#1



Workaround:

Put the key in Spark's core-site.xml. I was trying to avoid this because the credentials are a deploy-time setting; I won't bake them in at compile time or Docker image build time.


Before my container calls spark-submit, it now creates the /opt/spark/conf/core-site.xml file from the template below:


<?xml version="1.0"?>
<configuration>

  <property>
    <name>fs.azure</name>
    <value>org.apache.hadoop.fs.azure.NativeAzureFileSystem</value>
  </property>

  <property>
    <name>fs.azure.account.key.[CHECKPOINT_BLOB_ACCOUNT].blob.core.windows.net</name>
    <value>[CHECKPOINT_BLOB_KEY]</value>
  </property>

</configuration>
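The rendering step the container runs before spark-submit can be sketched as a small shell function. The function name, argument order, template path, and environment-variable names are assumptions chosen to match the placeholders in the template above.

```shell
# Sketch of the entrypoint step: render core-site.xml from the template
# before spark-submit runs. Names here are illustrative, not from the post.
render_core_site() {
  template=$1; target=$2; account=$3; key=$4
  # Use '|' as the sed delimiter, since storage account keys may contain '/'.
  sed -e "s|\[CHECKPOINT_BLOB_ACCOUNT\]|${account}|g" \
      -e "s|\[CHECKPOINT_BLOB_KEY\]|${key}|g" \
      "$template" > "$target"
}

# Example call (paths and env var names are assumptions):
# render_core_site /opt/spark/conf/core-site.xml.template \
#                  /opt/spark/conf/core-site.xml \
#                  "$CHECKPOINT_BLOB_ACCOUNT" "$CHECKPOINT_BLOB_KEY"
```

Keeping the substitution in the entrypoint means the key stays a deploy-time input (an env var or mounted secret) rather than being baked into the image.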
