On a pipeline defined with the latest Apache Beam SDK for Python (2.2.0), I get the error below when running a simple pipeline that reads from and writes to a BigQuery table.
Since a few rows have timestamps with year < 1900, the read operation fails. How can I patch this dataflow_worker package?
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
(4d31192aa4aec063): Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 582, in do_work
    work_executor.execute()
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 167, in execute
    op.start()
  File "dataflow_worker/native_operations.py", line 38, in dataflow_worker.native_operations.NativeReadOperation.start
    def start(self):
  File "dataflow_worker/native_operations.py", line 39, in dataflow_worker.native_operations.NativeReadOperation.start
    with self.scoped_start_state:
  File "dataflow_worker/native_operations.py", line 44, in dataflow_worker.native_operations.NativeReadOperation.start
    with self.spec.source.reader() as reader:
  File "dataflow_worker/native_operations.py", line 48, in dataflow_worker.native_operations.NativeReadOperation.start
    for value in reader:
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/nativefileio.py", line 198, in __iter__
    for record in self.read_next_block():
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/nativeavroio.py", line 95, in read_next_block
    yield self.decode_record(record)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/nativebigqueryavroio.py", line 110, in decode_record
    record, self.source.table_schema)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/nativebigqueryavroio.py", line 104, in _fix_field_values
    record[field.name], field)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/nativebigqueryavroio.py", line 83, in _fix_field_value
    return dt.strftime('%Y-%m-%d %H:%M:%S.%f UTC')
ValueError: year=200 is before 1900; the datetime strftime() methods require year >= 1900
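
For reference, the root cause can be reproduced outside Dataflow in two lines on Python 2.7, where strftime() rejects years before 1900:

    from datetime import datetime

    # Python 2.7's strftime() refuses years before 1900, which is exactly
    # what the Dataflow worker hits when it formats the decoded timestamp.
    datetime(200, 1, 1).strftime('%Y-%m-%d %H:%M:%S.%f UTC')
    # ValueError: year=200 is before 1900; the datetime strftime() methods
    # require year >= 1900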
1 Answer
#1
Unfortunately, you cannot patch it to handle those timestamps, because the dataflow_worker package is part of the internal implementation of Google's runner for Apache Beam, Dataflow. You will have to wait until Google fixes it (should this be identified as a bug), so please report it as soon as possible; strictly speaking, though, this is a limitation of the Python version in use rather than a bug.
The problem comes from strftime, as you can see in the traceback. The Python documentation explicitly mentions that it does not work with any year before 1900. A workaround on your end, though, is to convert the timestamp to a string on the BigQuery side (as described in the BigQuery documentation), and then, in your Beam pipeline, convert it back to a timestamp or to whatever representation suits you best.
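
A minimal sketch of that workaround, assuming a hypothetical table my_project.my_dataset.my_table with a TIMESTAMP column ts (all names here are placeholders, and the exact string layout BigQuery returns for the cast should be checked against your data):

    from datetime import datetime

    import apache_beam as beam


    def parse_ts(row):
        # strptime(), unlike strftime() on Python 2, accepts years before 1900.
        # Drop a trailing UTC offset such as "+00" before parsing; adjust the
        # format string to whatever CAST(... AS STRING) actually returns.
        text = row['ts_str'].split('+')[0].strip()
        return dict(row, ts=datetime.strptime(text, '%Y-%m-%d %H:%M:%S'))


    # Casting in the query means the worker receives a plain STRING and never
    # calls strftime() on a pre-1900 timestamp.
    query = ('SELECT CAST(ts AS STRING) AS ts_str, user_id '
             'FROM `my_project.my_dataset.my_table`')

    with beam.Pipeline() as p:
        rows = (
            p
            | 'ReadFromBQ' >> beam.io.Read(
                beam.io.BigQuerySource(query=query, use_standard_sql=True))
            | 'ParseTs' >> beam.Map(parse_ts))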
There is also an example of converting a datetime object to a string in the same format as the one in your error, in a related answer. On the same question, another answer explains the history of this bug, how it was eventually addressed in Python, and what you can do about it. Unfortunately, the fix amounts to avoiding strftime altogether and using an alternative instead.
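
For the formatting direction, a sketch of one such alternative: plain string interpolation reproduces the '%Y-%m-%d %H:%M:%S.%f UTC' layout from the traceback and works for any year (an illustration, not the exact code from the linked answer):

    from datetime import datetime


    def format_ts(dt):
        # Manual interpolation works for any year, unlike strftime() on Python 2.
        return '%04d-%02d-%02d %02d:%02d:%02d.%06d UTC' % (
            dt.year, dt.month, dt.day,
            dt.hour, dt.minute, dt.second, dt.microsecond)


    print(format_ts(datetime(200, 1, 1)))  # 0200-01-01 00:00:00.000000 UTC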