Identifying the specific input data that caused a Google Dataflow job to fail

Date: 2022-02-02 19:18:35

I had an issue where I was using Dataflow to parse a text file and then put relevant data into BQ. The issue was seemingly caused by a single line of malformed input in the text file. I was able to fix the error, but it got me thinking: if I had a line of input that was hosing Dataflow, is there any way I could find out which line it was? This would make one part of Dataflow debugging much easier, especially if your input file is a few billion lines and you have to track down the one line causing problems.

As an example, let's say I'm posting data I think is an integer to BigQuery. I might create my schema like this:

    List<TableFieldSchema> fields = new ArrayList<>();
    fields.add(new TableFieldSchema().setName("ItemNum").setType("INTEGER"));
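
For reference, this field list would normally be wrapped in a TableSchema before being handed to the BigQuery write; a minimal sketch, assuming the same com.google.api.services.bigquery.model classes as the TableFieldSchema above:

    // Wrap the field list in a TableSchema so it can be passed to the BigQuery sink.
    TableSchema schema = new TableSchema().setFields(fields);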

And I might map the input data into the BigQuery schema with this function:

    public void processElement(ProcessContext c) {
        // Build a BigQuery row from the raw input element and emit it downstream.
        TableRow row = new TableRow();
        row.set("ItemNum", c.element());
        c.output(row);
    }

But when Dataflow hits my malformed input (where it is not an integer), I get an error like this:

Workflow failed. Causes: (30d455a6f7aaaaaa): BigQuery job "dataflow_job_3518531384490999999" in project "project-name" finished with error(s): job error: Could not convert value to integer (bad value or out of range)., error: Could not convert value to integer (bad value or out of range)., error: Could not convert value to integer (bad value or out of range)., error: Could not convert value to integer (bad value or out of range)., error: Could not convert value to integer (bad value or out of range)., error: Could not convert value to integer (bad value or out of range).

In this particular case I should be verifying my input is an integer as expected before trying to put it into BigQuery (and then logging any data that fails validation). But the general question remains--let's say I want to see the input that caused this error, since (I think) I'm performing all appropriate input validation already and have no idea what sort of malformed data might cause this. How would I do that? I'm thinking some sort of try/catch type trick (possibly involving a log message) could work, but I'm not really sure how to make that happen.

Thanks!

1 Answer

#1

The approach you suggest (using a try/catch, logging your parse errors separately) is a good way to go at the moment. We are actively studying options to equip pipeline writers to handle these types of issues.
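
For illustration, here is a minimal sketch of that try/catch approach, assuming the Apache Beam Java SDK, a String input element, and SLF4J logging (Dataflow workers forward these logs to Cloud Logging). The FormatItemFn class name is just a placeholder, and the same method body works inside the older Dataflow SDK's processElement override:

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    class FormatItemFn extends DoFn<String, TableRow> {
        private static final Logger LOG = LoggerFactory.getLogger(FormatItemFn.class);

        @ProcessElement
        public void processElement(ProcessContext c) {
            String line = c.element();
            try {
                // Parse up front so a malformed value fails here, inside the pipeline,
                // rather than later in the opaque BigQuery load job.
                TableRow row = new TableRow();
                row.set("ItemNum", Integer.parseInt(line.trim()));
                c.output(row);
            } catch (NumberFormatException e) {
                // Log the offending element; the exact bad line then shows up in the
                // worker logs in Cloud Logging instead of being lost.
                LOG.error("Dropping malformed input line: {}", line, e);
            }
        }
    }

Rather than only logging, a common extension of this pattern is to emit the failed records to a separate output (a dead-letter PCollection) and write them to GCS or a dedicated BigQuery table, so the bad input can be inspected and replayed later.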
