Dataflow pipeline "lost contact with the service"

Time: 2021-11-21 15:33:31

I'm running into trouble with an Apache Beam pipeline on Google Cloud Dataflow.

The pipeline is simple: read JSON from GCS, extract text from some nested fields, and write back to GCS.
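For reference, the pipeline is roughly the sketch below. The bucket paths and the nested field names are placeholders, and extract_text_from_raw is reconstructed from the stage names in the error message, not the actual code:

    import json

    import apache_beam as beam
    from apache_beam.io import ReadFromTextWithFilename, WriteToText
    from apache_beam.options.pipeline_options import PipelineOptions


    def extract_text_from_raw(element):
        # ReadFromTextWithFilename yields (filename, line) tuples.
        filename, line = element
        record = json.loads(line)
        # 'payload' and 'text' stand in for the real nested field names.
        text = record.get('payload', {}).get('text')
        if text is not None:
            yield text


    def run():
        # Runner, project, region, etc. come from the command line,
        # e.g. --runner=DataflowRunner --project=... --region=us-central1
        options = PipelineOptions()
        with beam.Pipeline(options=options) as p:
            (p
             | ReadFromTextWithFilename('gs://my-bucket/input/*.json')
             | beam.FlatMap(extract_text_from_raw)
             | 'RemoveLineBreaks' >> beam.Map(lambda s: s.replace('\n', ' '))
             | 'FormatText' >> beam.Map(lambda s: s.strip())
             | 'WriteText' >> WriteToText('gs://my-bucket/output/text'))


    if __name__ == '__main__':
        run()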

It works fine when testing with a smaller subset of input files, but when I run it on the full data set, I get the following error (after running fine through around 260M items).

Somehow the "worker eventually lost contact with the service":

  (8662a188e74dae87): Workflow failed. Causes: (95e9c3f710c71bc2): S04:ReadFromTextWithFilename/Read+FlatMap(extract_text_from_raw)+RemoveLineBreaks+FormatText+WriteText/Write/WriteImpl/WriteBundles/Do+WriteText/Write/WriteImpl/Pair+WriteText/Write/WriteImpl/WindowInto(WindowIntoFn)+WriteText/Write/WriteImpl/GroupByKey/Reify+WriteText/Write/WriteImpl/GroupByKey/Write failed., (da6389e4b594e34b): A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service. The work item was attempted on: 
  extract-tags-150110997000-07261602-0a01-harness-jzcn,
  extract-tags-150110997000-07261602-0a01-harness-828c,
  extract-tags-150110997000-07261602-0a01-harness-3w45,
  extract-tags-150110997000-07261602-0a01-harness-zn6v

The stack trace shows a "Failed to update work status" / "Progress reporting thread got error" error:

Exception in worker loop: Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 776, in run
    deferred_exception_details=deferred_exception_details)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 629, in do_work
    exception_details=exception_details)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/utils/retry.py", line 168, in wrapper
    return fun(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 490, in report_completion_status
    exception_details=exception_details)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 298, in report_status
    work_executor=self._work_executor)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/workerapiclient.py", line 333, in report_status
    self._client.projects_locations_jobs_workItems.ReportStatus(request))
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/dataflow/internal/clients/dataflow/dataflow_v1b3_client.py", line 467, in ReportStatus
    config, request, global_params=global_params)
  File "/usr/local/lib/python2.7/dist-packages/apitools/base/py/base_api.py", line 723, in _RunMethod
    return self.ProcessHttpResponse(method_config, http_response, request)
  File "/usr/local/lib/python2.7/dist-packages/apitools/base/py/base_api.py", line 729, in ProcessHttpResponse
    self.__ProcessHttpResponse(method_config, http_response, request))
  File "/usr/local/lib/python2.7/dist-packages/apitools/base/py/base_api.py", line 600, in __ProcessHttpResponse
    http_response.request_url, method_config, request)
HttpError: HttpError accessing <https://dataflow.googleapis.com/v1b3/projects/qollaboration-live/locations/us-central1/jobs/2017-07-26_16_02_36-1885237888618334364/workItems:reportStatus?alt=json>: response: <{'status': '400', 'content-length': '360', 'x-xss-protection': '1; mode=block', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'vary': 'Origin, X-Origin, Referer', 'server': 'ESF', '-content-encoding': 'gzip', 'cache-control': 'private', 'date': 'Wed, 26 Jul 2017 23:54:12 GMT', 'x-frame-options': 'SAMEORIGIN', 'content-type': 'application/json; charset=UTF-8'}>, content <{ "error": { "code": 400, "message": "(7f8a0ec09d20c3a3): Failed to publish the result of the work update. Causes: (7f8a0ec09d20cd48): Failed to update work status. Causes: (afa1cd74b2e65619): Failed to update work status., (afa1cd74b2e65caa): Work \"6306998912537661254\" not leased (or the lease was lost).", "status": "INVALID_ARGUMENT" } } >

And finally:

HttpError: HttpError accessing <https://dataflow.googleapis.com/v1b3/projects/[projectid-redacted]/locations/us-central1/jobs/2017-07-26_18_28_43-10867107563808864085/workItems:reportStatus?alt=json>: response: <{'status': '400', 'content-length': '358', 'x-xss-protection': '1; mode=block', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'vary': 'Origin, X-Origin, Referer', 'server': 'ESF', '-content-encoding': 'gzip', 'cache-control': 'private', 'date': 'Thu, 27 Jul 2017 02:00:10 GMT', 'x-frame-options': 'SAMEORIGIN', 'content-type': 'application/json; charset=UTF-8'}>, content <{ "error": { "code": 400, "message": "(5845363977e915c1): Failed to publish the result of the work update. Causes: (5845363977e913a8): Failed to update work status. Causes: (44379dfdb8c2b47): Failed to update work status., (44379dfdb8c2e88): Work \"9100669328839864782\" not leased (or the lease was lost).", "status": "INVALID_ARGUMENT" } } >
at __ProcessHttpResponse (/usr/local/lib/python2.7/dist-packages/apitools/base/py/base_api.py:600)
at ProcessHttpResponse (/usr/local/lib/python2.7/dist-packages/apitools/base/py/base_api.py:729)
at _RunMethod (/usr/local/lib/python2.7/dist-packages/apitools/base/py/base_api.py:723)
at ReportStatus (/usr/local/lib/python2.7/dist-packages/apache_beam/runners/dataflow/internal/clients/dataflow/dataflow_v1b3_client.py:467)
at report_status (/usr/local/lib/python2.7/dist-packages/dataflow_worker/workerapiclient.py:333)
at report_status (/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py:298)
at report_completion_status (/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py:490)
at wrapper (/usr/local/lib/python2.7/dist-packages/apache_beam/utils/retry.py:168)
at do_work (/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py:629)
at run (/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py:776)

This looks to me like an error in the Dataflow internals. Can anyone confirm? Are there any workarounds?

1 Answer

#1



The HttpError typically appears after the workflow has failed and is part of the failure/teardown process.


It looks like there were other errors reported in your pipeline, such as the "work item was attempted 4 times without success" message above. Note that if the same elements fail 4 times, the work item is marked as failed.

Try looking at the Stack Traces section in the UI to identify the other errors and their stack traces. Since this only occurs on the larger dataset, consider the possibility that there are malformed elements that only exist in the larger dataset.
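If malformed records do turn out to be the cause, one common mitigation is a dead-letter output: wrap the parsing step in a try/except and route bad lines to a tagged side output instead of letting them fail the bundle (and, after 4 attempts, the work item). A minimal sketch, assuming JSON-lines input; the tag, field names, and output paths are illustrative, not taken from your pipeline:

    import json

    import apache_beam as beam
    from apache_beam import pvalue
    from apache_beam.io import WriteToText


    class ExtractTextFn(beam.DoFn):
        """Parses a JSON line; malformed records go to a 'bad' side output."""

        BAD_TAG = 'bad'

        def process(self, line):
            try:
                record = json.loads(line)
                # 'payload' / 'text' are placeholder field names.
                yield record['payload']['text']
            except (ValueError, KeyError, TypeError):
                # Route the raw line to the dead-letter output for inspection.
                yield pvalue.TaggedOutput(self.BAD_TAG, line)


    def run():
        with beam.Pipeline() as p:
            results = (
                p
                | beam.Create(['{"payload": {"text": "ok"}}', 'not json'])
                | beam.ParDo(ExtractTextFn()).with_outputs(
                    ExtractTextFn.BAD_TAG, main='text'))
            results.text | 'WriteGood' >> WriteToText('/tmp/good')
            results.bad | 'WriteBad' >> WriteToText('/tmp/bad')


    if __name__ == '__main__':
        run()

Inspecting the dead-letter files after a run on the full dataset should show whether (and how) elements unique to the larger input are malformed.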
