Apache Beam 2.1.0 added support for submitting jobs to the Dataflow runner on private subnetworks and without public IPs, which we need in order to satisfy our firewall rules. I planned to use a squid proxy to reach apt-get, pip, etc. and install the Python dependencies; a proxy instance is already running, and we set the proxies inside our setup.py script.
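The setup.py itself is not shown in the question. As a rough sketch of what "setting the proxies inside setup.py" could look like, the helper below builds an environment with the conventional proxy variables and runs an install command under it. The proxy address and both function names are assumptions made for illustration only.

```python
import os
import subprocess

# Hypothetical proxy address; the real host and port are not given in the question.
PROXY_URL = "http://10.0.0.2:3128"


def proxied_env(proxy_url, base_env=None):
    """Return a copy of the environment with the usual proxy variables set."""
    env = dict(os.environ if base_env is None else base_env)
    for name in ("http_proxy", "https_proxy"):
        # Set both spellings, since tools differ in which one they read.
        env[name] = proxy_url
        env[name.upper()] = proxy_url
    return env


def run_through_proxy(command, proxy_url=PROXY_URL):
    """Run one install command (a list of arguments) with the proxy variables set."""
    subprocess.check_call(command, env=proxied_env(proxy_url))
```

A custom build step in setup.py might then call something like run_through_proxy(["pip", "install", "some-package"]). Note that this only takes effect once setup.py runs, which is too late for anything the worker downloads during startup.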
python $DIR/submit.py \
--runner DataflowRunner \
--no_use_public_ips \
--subnetwork regions/us-central1/subnetworks/$PRIVATESUBNET \
--staging_location $BUCKET/staging \
--temp_location $BUCKET/temp \
--project $PROJECT \
--setup_file $DIR/setup.py \
--job_name $JOB_NAME
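submit.py is not shown, so the following is only a sketch of how the same options might be assembled in Python rather than on the command line. The helper name dataflow_flags and its parameters are hypothetical; the flag names are the ones used in the command above. A list like this can be passed to Beam's PipelineOptions.

```python
def dataflow_flags(project, bucket, subnet, setup_file, job_name,
                   region="us-central1"):
    """Build the flags from the shell command as a list of strings."""
    return [
        "--runner", "DataflowRunner",
        # Workers get no external IP address.
        "--no_use_public_ips",
        # Run the workers on the private subnetwork.
        "--subnetwork", "regions/{}/subnetworks/{}".format(region, subnet),
        "--staging_location", "{}/staging".format(bucket),
        "--temp_location", "{}/temp".format(bucket),
        "--project", project,
        "--setup_file", setup_file,
        "--job_name", job_name,
    ]
```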
When I try to run via the Python API, the job errors out during worker startup, before I get a chance to enable the proxy. It looks to me like each worker first tries to install the Dataflow SDK:
and while doing so it tries to update requests and fails to connect to pip:
None of my code has been executed at this point, so I can't see a way to set up the proxy before this error occurs. Is there any way to launch Dataflow Python workers on a private subnet?
1 Answer
I managed to solve this with a NAT gateway instead of a proxy. I followed the instructions under "special configurations", but had to edit one of the steps so that Dataflow worker instances are routed through the gateway automatically:
gcloud compute routes create no-ip-internet-route --network my-network \
--destination-range 0.0.0.0/0 \
--next-hop-instance nat-gateway \
--next-hop-instance-zone us-central1-a \
--tags dataflow --priority 800
Instead of the tag no-ip, I used the tag dataflow, which is the network tag for all Dataflow workers.
The NAT gateway seems like an easier solution than a proxy in this case, since it will route the traffic without having to configure the workers.
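To illustrate why the tagged route takes effect without touching the workers, here is a simplified model of route selection. It assumes that routes are chosen by the most specific destination range first and then by the lowest priority value, and that a pre-existing default internet route has priority 1000; neither detail is stated in the answer, and the Route type and pick_route function are hypothetical.

```python
import ipaddress
from collections import namedtuple

Route = namedtuple("Route", "name dest_range priority tags next_hop")


def pick_route(routes, instance_tags, dest_ip):
    """Pick a route for dest_ip: longest prefix first, then lowest priority value."""
    ip = ipaddress.ip_address(dest_ip)
    candidates = [
        r for r in routes
        if ip in ipaddress.ip_network(r.dest_range)
        # A route without tags applies to every instance in the network.
        and (not r.tags or set(r.tags) & set(instance_tags))
    ]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda r: (-ipaddress.ip_network(r.dest_range).prefixlen, r.priority),
    )


routes = [
    # Assumed default route to the internet gateway (priority 1000 is an assumption).
    Route("default-internet", "0.0.0.0/0", 1000, (), "default-internet-gateway"),
    # The route from the answer: applies only to instances tagged "dataflow".
    Route("no-ip-internet-route", "0.0.0.0/0", 800, ("dataflow",), "nat-gateway"),
]

worker = pick_route(routes, ["dataflow"], "151.101.0.223")
other = pick_route(routes, ["web"], "151.101.0.223")
print(worker.next_hop)  # nat-gateway
print(other.next_hop)   # default-internet-gateway
```

In this model, any instance carrying the dataflow tag picks up the priority 800 route to the NAT gateway, while untagged instances keep using the default route.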