How to test a Dataflow pipeline that uses BigQuery

Time: 2021-12-05 14:54:57

I'd like to test my pipeline. My pipeline extracts data from BigQuery, then stores the data to GCS and S3. Although there is some information about pipeline testing here, https://cloud.google.com/dataflow/pipelines/testing-your-pipeline, it does not cover the data model for data extracted from BigQuery.

I found the following example for this, but it lacks comments, so it is a little difficult to understand. https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/examples/src/test/java/com/google/cloud/dataflow/examples/cookbook/BigQueryTornadoesTest.java

Are there any good documents on how to test my pipeline?

1 solution

#1


In order to properly integration test your entire pipeline, please create a small amount of sample data stored in BigQuery. Also, please create a sample bucket/folder in S3 and GCS to store your output. Then run your pipeline as you normally would, using PipelineOptions to specify the test BQ table. You can use the DirectPipelineRunner if you want to run locally. It will probably be easiest to create a script which will first run the pipeline, then download the data from S3 and GCS and verify you see what you expect.

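A minimal sketch of what that could look like, written against the pre-Beam Dataflow Java SDK that the linked example uses. The ExportOptions interface, the FormatRowsFn transform, and the field names are all hypothetical placeholders for your own pipeline, and the S3 half of the output is omitted:

```java
import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.Description;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;

public class ExportPipeline {

  // Hypothetical DoFn standing in for your real transform logic:
  // turns each TableRow into one line of text for the output files.
  public static class FormatRowsFn extends DoFn<TableRow, String> {
    @Override
    public void processElement(ProcessContext c) {
      TableRow row = c.element();
      c.output(row.get("name") + "," + row.get("score"));
    }
  }

  // The input table and output path are flags, so an integration test
  // can point the same pipeline at a small test table and test bucket.
  public interface ExportOptions extends PipelineOptions {
    @Description("BigQuery table to read, as project:dataset.table")
    String getInputTable();
    void setInputTable(String value);

    @Description("Output path prefix, e.g. gs://bucket/dir/output")
    String getOutputPath();
    void setOutputPath(String value);
  }

  public static void main(String[] args) {
    ExportOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(ExportOptions.class);
    // Run locally instead of on the Dataflow service.
    options.setRunner(DirectPipelineRunner.class);

    Pipeline p = Pipeline.create(options);
    p.apply(BigQueryIO.Read.from(options.getInputTable()))
     .apply(ParDo.of(new FormatRowsFn()))
     .apply(TextIO.Write.to(options.getOutputPath()));
    p.run();
  }
}
```

The wrapper script would run this with --inputTable pointing at the test table and --outputPath pointing at the test bucket, then download the output files and compare them against a golden copy.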

If you want to just test your pipeline's transforms on some offline data, then please follow the WordCount example.

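For the offline approach, the pattern in BigQueryTornadoesTest is to hand-build TableRow inputs and run a single DoFn through DoFnTester, asserting on its outputs. A sketch of that pattern, using the hypothetical FormatRowsFn from above:

```java
import static org.hamcrest.CoreMatchers.hasItems;
import static org.junit.Assert.assertThat;

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.transforms.DoFnTester;
import org.junit.Test;

public class FormatRowsFnTest {
  @Test
  public void testFormatRows() throws Exception {
    // Hand-built rows stand in for the BigQuery read, so no
    // network access or test table is needed.
    TableRow row = new TableRow().set("name", "alice").set("score", 42);

    // DoFnTester runs a single DoFn in isolation.
    DoFnTester<TableRow, String> fnTester =
        DoFnTester.of(new ExportPipeline.FormatRowsFn());

    // processBatch feeds the inputs through the DoFn and
    // returns everything it emitted.
    assertThat(fnTester.processBatch(row), hasItems("alice,42"));
  }
}
```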
