Running a Dataflow job from App Engine

Time: 2021-05-01 15:37:15

I am relatively new to GCP technology. Currently, I am doing a POC to create a scheduled Dataflow job that ingests (inserts) data from Google Cloud Storage into BigQuery. After reading some tutorials and documentation, I came up with the following:

  1. I first created a Dataflow job that reads an Avro file and loads it into BigQuery. This pipeline has been tested and works well.

    (self.pipeline
     | output_table + ': read table ' >> ReadFromAvro(storage_input_path)
     | output_table + ': filter columns' >> beam.Map(self.__filter_columns, columns=columns)
     | output_table + ': write to BigQuery' >> beam.Write(
         beam.io.BigQuerySink(
             output_table,
             create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)))
    
  2. In order to create the scheduled job, I then created a simple web service as follows:


    import logging
    from flask import Flask
    from common.tableLoader import TableLoader
    from ingestion import IngestionToBigQuery
    from common.configReader import ConfigReader

    app = Flask(__name__)


    @app.route('/')
    def hello():
        """Return a friendly HTTP greeting."""
        logging.getLogger().setLevel(logging.INFO)
        config = ConfigReader('columbus-config')  # TODO read from args
        tables = TableLoader('experience')
        ingestor = IngestionToBigQuery(config.configuration, tables.list_of_tables)
        ingestor.ingest_table()
        return 'Hello World!'
    
  3. I also created the app.yaml:


     runtime: python
     env: flex
     entrypoint: gunicorn -b :$PORT recsys_data_pipeline.main:app
     threadsafe: yes
     runtime_config:
       python_version: 2
     resources:
       memory_gb: 2.0
    

Then, I deployed it using the command gcloud app deploy, but I got the following errors:

default[20170417t173837]  ERROR:root:The gcloud tool was not found.
default[20170417t173837]  Traceback (most recent call last):
  File "/env/local/lib/python2.7/site-packages/apache_beam/internal/gcp/auth.py", line 109, in _refresh
    ['gcloud', 'auth', 'print-access-token'], stdout=processes.PIPE)
  File "/env/local/lib/python2.7/site-packages/apache_beam/utils/processes.py", line 52, in Popen
    return subprocess.Popen(*args, **kwargs)
  File "/usr/lib/python2.7/subprocess.py", line 710, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1335, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

From the message above, I found that the error was coming from apache_beam's auth.py module, specifically from the following function:

def _refresh(self, http_request):
  """Gets an access token using the gcloud client."""
  try:
    gcloud_process = processes.Popen(
        ['gcloud', 'auth', 'print-access-token'], stdout=processes.PIPE)
  except OSError as exn:
    logging.error('The gcloud tool was not found.', exc_info=True)
    raise AuthenticationException('The gcloud tool was not found: %s' % exn)
  output, _ = gcloud_process.communicate()
  self.access_token = output.strip()

which is invoked when the credentials (service_account_name and service_account_key_file) are not given:

if google_cloud_options.service_account_name:
  if not google_cloud_options.service_account_key_file:
    raise AuthenticationException(
        'key file not provided for service account.')
  if not os.path.exists(google_cloud_options.service_account_key_file):
    raise AuthenticationException(
        'Specified service account key file does not exist.')

else:
  try:
    credentials = _GCloudWrapperCredentials(user_agent)
    # Check if we are able to get an access token. If not fallback to
    # application default credentials.
    credentials.get_access_token()
    return credentials
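
Based on that check, it looks like these two options could be passed straight to the pipeline so that the gcloud fallback is never reached. A minimal sketch of what I have in mind (all values below are placeholders and I have not verified this; the two service-account flags are the ones read by the snippet above and appear to exist only in this older SDK version):

    import apache_beam as beam

    # Sketch only: project, bucket, service account and key path are placeholders.
    # --service_account_name / --service_account_key_file are the options read by
    # the credentials check quoted above (older Dataflow/Beam SDK only).
    pipeline_args = [
        '--project=my-project',
        '--staging_location=gs://my-bucket/staging',
        '--temp_location=gs://my-bucket/temp',
        '--service_account_name=dataflow-runner@my-project.iam.gserviceaccount.com',
        '--service_account_key_file=/path/to/key.json',  # key file deployed with the app
    ]
    self.pipeline = beam.Pipeline(argv=pipeline_args)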

So I have two questions:


  1. Is there a way to "attach" the credentials (the service_account_name and service_account_key_file) somewhere in my code or in a config file (for instance, in app.yaml)?
  2. What are the best practices for triggering a Dataflow job from Google App Engine?

Thank you so much; any suggestions and comments would be really helpful!

1 Answer

#1



Please take a look at an official example of this at https://github.com/amygdala/gae-dataflow.
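
For completeness, one common pattern for triggering a Dataflow job from App Engine (not necessarily exactly what the linked example does) is to stage the pipeline as a Dataflow template and launch it from a request handler via the Dataflow REST API, so the instance only needs its default credentials and no gcloud binary. A rough sketch, with placeholder project, bucket and template names:

    from googleapiclient.discovery import build

    def launch_ingest_job():
        # Uses the application default credentials available on App Engine.
        dataflow = build('dataflow', 'v1b3')
        request = dataflow.projects().templates().launch(
            projectId='my-project',
            gcsPath='gs://my-bucket/templates/ingest_avro_to_bq',  # pre-staged template
            body={
                'jobName': 'scheduled-ingest',
                'parameters': {'input': 'gs://my-bucket/input/*.avro'},
                'environment': {'tempLocation': 'gs://my-bucket/temp'},
            },
        )
        return request.execute()

This assumes the pipeline has already been staged as a template beforehand; the handler can then be wired up to App Engine cron for scheduling.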
