How do I run Dataflow Python on a private subnet?

Time: 2021-11-21 15:33:37

Apache Beam 2.1.0 added support for submitting jobs to the Dataflow runner on private subnetworks and without public IPs, which we need in order to satisfy our firewall rules. I planned to use a Squid proxy to reach apt-get, pip, etc. and install the Python dependencies; a proxy instance is already running, and we set the proxies inside our setup.py script.

python $DIR/submit.py \
       --runner DataflowRunner \
       --no_use_public_ips \
       --subnetwork regions/us-central1/subnetworks/$PRIVATESUBNET \
       --staging_location $BUCKET/staging \
       --temp_location $BUCKET/temp \
       --project $PROJECT \
       --setup_file $DIR/setup.py \
       --job_name $JOB_NAME
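
The flags in the command above can also be assembled in Python before they are handed to the pipeline. This is only a minimal sketch: the helper name dataflow_args and every value passed to it are hypothetical placeholders, not part of the original submit.py. A list like this is the kind of input that Beam's PipelineOptions accepts as its flags argument.

```python
from typing import List


def dataflow_args(project: str, bucket: str, subnet: str,
                  job_name: str, setup_file: str,
                  region: str = "us-central1") -> List[str]:
    """Build the same flag list that the shell command above passes to submit.py."""
    return [
        "--runner", "DataflowRunner",
        # Workers get no external IP address, so they need another route out.
        "--no_use_public_ips",
        "--subnetwork", f"regions/{region}/subnetworks/{subnet}",
        "--staging_location", f"{bucket}/staging",
        "--temp_location", f"{bucket}/temp",
        "--project", project,
        "--setup_file", setup_file,
        "--job_name", job_name,
    ]


# Placeholder values for illustration only (assumptions, not from the question).
args = dataflow_args(
    project="my-project",
    bucket="gs://my-bucket",
    subnet="my-private-subnet",
    job_name="private-subnet-test",
    setup_file="./setup.py",
)
print(" ".join(args))
```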

When I try to run via the Python API, the job errors out during worker startup, before I get a chance to enable the proxy. It looks to me like each worker first tries to install the Dataflow SDK:

[screenshot not preserved]

and during that install it tries to update the requests package, and pip fails to connect:

[screenshot not preserved]

None of my code has been executed at this point, so I can't see a way to set up the proxy before this error occurs. Is there any way to launch Dataflow Python workers on a private subnet?

1 answer

#1


3  

I managed to solve this with a NAT gateway instead of a proxy. I followed the instructions under special configurations, but had to edit one of the steps so that Dataflow worker instances are automatically routed through the gateway:

gcloud compute routes create no-ip-internet-route --network my-network \
    --destination-range 0.0.0.0/0 \
    --next-hop-instance nat-gateway \
    --next-hop-instance-zone us-central1-a \
    --tags dataflow --priority 800

Instead of the no-ip tag, I used the tag dataflow, which is the network tag applied to all Dataflow workers.

The NAT gateway seems like an easier solution than a proxy in this case, since it will route the traffic without having to configure the workers.
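
One way to check that the route is in effect is to test outbound connectivity from an instance that carries the dataflow tag and has no public IP. The sketch below is only an illustration using the Python standard library; the function name can_reach is hypothetical, and using pypi.org on port 443 as the target is an assumption based on the pip failure described in the question.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Example (run on a VM with the dataflow tag and no public IP):
#   print(can_reach("pypi.org"))
```

If this returns False on such an instance, the route, the tag, or the gateway itself is the likely place to look before submitting the job again.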
