Reading data from Google Cloud BigQuery

Date: 2022-02-02 15:45:51

I am new to pipelines and to the Google Cloud Dataflow API.

I want to read data from BigQuery using a SQL query. Reading the whole table works fine:

PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
PCollection<TableRow> qData = p.apply(
     BigQueryIO.Read
         .named("Read")
         .from("test:DataSetTest.data"));

But when I use fromQuery, I get an error:

PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
PCollection<TableRow> qData = p.apply(
     BigQueryIO.Read
         .named("Read")
         .fromQuery("SELECT * FROM DataSetTest.data"));

Error:

Exception in thread "main" java.lang.IllegalArgumentException: Validation of query "SELECT * FROM DataSetTest.data" failed. If the query depends on an earlier stage of the pipeline, This validation can be disabled using #withoutValidation.
    at com.google.cloud.dataflow.sdk.io.BigQueryIO$Read$Bound.dryRunQuery(BigQueryIO.java:449)
    at com.google.cloud.dataflow.sdk.io.BigQueryIO$Read$Bound.validate(BigQueryIO.java:432)
    at com.google.cloud.dataflow.sdk.Pipeline.applyInternal(Pipeline.java:357)
    at com.google.cloud.dataflow.sdk.Pipeline.applyTransform(Pipeline.java:267)
    at com.google.cloud.dataflow.sdk.values.PBegin.apply(PBegin.java:47)
    at com.google.cloud.dataflow.sdk.Pipeline.apply(Pipeline.java:151)
    at Test.java.packageid.StarterPipeline.main(StarterPipeline.java:72)
Caused by: java.lang.NullPointerException: Required parameter projectId must be specified.
    at com.google.api.client.repackaged.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:229)
    at com.google.api.client.util.Preconditions.checkNotNull(Preconditions.java:140)
    at com.google.api.services.bigquery.Bigquery$Jobs$Query.<init>(Bigquery.java:1751)
    at com.google.api.services.bigquery.Bigquery$Jobs.query(Bigquery.java:1724)
    at com.google.cloud.dataflow.sdk.io.BigQueryIO$Read$Bound.dryRunQuery(BigQueryIO.java:445)
    ... 6 more

What is the problem here?

UPDATE:

I set the project with options.setProject:

PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
options.setProject("test");
PCollection<TableRow> qData = p.apply(
     BigQueryIO.Read
         .named("Read")
         .fromQuery("SELECT * FROM DataSetTest.data"));

But now I get a different error saying that a table is not found:

Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found { "code" : 404, "errors" : [ { "domain" : "global", "message" : "Not found: Table test:_dataflow_temporary_dataset_737099.dataflow_temporary_table_550832", "reason" : "notFound" } ], "message" : "Not found: Table test:_dataflow_temporary_dataset_737099.dataflow_temporary_table_550832" }

1 Answer

All resources in Google Cloud Platform, including BigQuery tables and Dataflow jobs, are associated with a cloud project. Specifying the project is necessary when interacting with GCP resources.

The exception trace shows that no cloud project is set for the BigQueryIO.Read transform: Caused by: java.lang.NullPointerException: Required parameter projectId must be specified.

Dataflow controls the default value of the cloud project via its PipelineOptions API. Dataflow uses that project by default across its APIs, including BigQueryIO.

Normally, we recommend constructing the PipelineOptions from command-line arguments using the PipelineOptionsFactory.fromArgs(String[]) API. In this case, you'd just pass --project=YOUR_PROJECT on the command line.
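
As a rough illustration of the command-line approach, the sketch below parses the arguments into a GcpOptions view and prints the resulting project. It assumes the Dataflow SDK 1.x for Java (the com.google.cloud.dataflow.sdk packages seen in the stack trace) is on the classpath; the class name ArgsProjectExample and the helper parse are made up for this example.

```java
import com.google.cloud.dataflow.sdk.options.GcpOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

// Hypothetical example class; only the SDK calls come from the Dataflow SDK 1.x.
public class ArgsProjectExample {

  // Parses flags such as --project=YOUR_PROJECT into a GcpOptions view.
  static GcpOptions parse(String[] args) {
    return PipelineOptionsFactory.fromArgs(args).as(GcpOptions.class);
  }

  public static void main(String[] args) {
    GcpOptions options = parse(args);
    // Shows which project BigQueryIO and other transforms would pick up.
    System.out.println("Using project: " + options.getProject());
  }
}
```

Running this with --project=YOUR_PROJECT should report that project; the same options object can then be passed to Pipeline.create.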

Alternatively, this can be set manually in the code, as follows:

// View the generic options as GcpOptions, which exposes the project setter.
GcpOptions gcpOptions = options.as(GcpOptions.class);
gcpOptions.setProject("YOUR_PROJECT");
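
For context, here is a minimal sketch of how the manual setting might fit into the pipeline from the question. It assumes the Dataflow SDK 1.x for Java and valid Google Cloud credentials at run time; the class name ManualProjectExample and the helper optionsWithProject are invented for illustration, and YOUR_PROJECT is a placeholder. The project is set before the pipeline is created, so it is already in place when BigQueryIO validates the query.

```java
import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.options.GcpOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.values.PCollection;

// Hypothetical example class; only the SDK calls come from the Dataflow SDK 1.x.
public class ManualProjectExample {

  // Creates default options and sets the cloud project through the GcpOptions view.
  static PipelineOptions optionsWithProject(String projectId) {
    PipelineOptions options = PipelineOptionsFactory.create();
    options.as(GcpOptions.class).setProject(projectId);
    return options;
  }

  public static void main(String[] args) {
    // Set the project first, then build the pipeline from the same options.
    PipelineOptions options = optionsWithProject("YOUR_PROJECT");
    Pipeline p = Pipeline.create(options);

    // Same query-based read as in the question.
    PCollection<TableRow> qData = p.apply(
        BigQueryIO.Read
            .named("Read")
            .fromQuery("SELECT * FROM DataSetTest.data"));

    p.run();
  }
}
```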

Finally, starting with version 1.4.0 of the Dataflow SDK for Java, Dataflow defaults to the cloud project set via gcloud config set project <project>. You can still override it via PipelineOptions, but you don't need to. This may have worked in some scenarios even before version 1.4.0, but it may not have been reliable across all scenarios or combinations of Cloud SDK and Dataflow SDK versions.
