Reading a BigQuery federated table as a source in Dataflow throws an error

Time: 2021-05-10 15:25:34

I have a federated source in BigQuery which is pointing to some CSV files in GCS.

When I try to read the federated BigQuery table as a source for a Dataflow pipeline, it throws the following error:

    1226 [main] ERROR com.google.cloud.dataflow.sdk.util.BigQueryTableRowIterator  - Error reading from BigQuery table Federated_test_dataflow of dataset CPT_7414_PLAYGROUND : 400 Bad Request
    {
      "code" : 400,
      "errors" : [ {
        "domain" : "global",
        "message" : "Cannot list a table of type EXTERNAL.",
        "reason" : "invalid"
      } ],
      "message" : "Cannot list a table of type EXTERNAL."
    }

Does Dataflow not support federated sources in BigQuery, or am I doing something wrong? I do know that I could read the files from GCS directly into my pipeline, but I'd prefer to work with BigQuery TableRow objects instead due to the design of the application.

    // Reading the EXTERNAL table directly is what triggers the error above.
    PCollection<TableRow> results = pipeline
        .apply("fed-test", BigQueryIO.Read.from("<project_id>:CPT_7414_PLAYGROUND.Federated_test_dataflow"))
        .apply(ParDo.of(new DoFn<TableRow, TableRow>() {
            @Override
            public void processElement(ProcessContext c) throws Exception {
                // Print each row for debugging.
                System.out.println(c.element());
            }
        }));

2 Answers

#1


3  

As Michael says, BigQuery does not support directly reading from EXTERNAL (federated) tables or VIEWs: even a plain read of these effectively requires running a query.

To read from these tables in Dataflow, you can instead use

BigQueryIO.Read.fromQuery("SELECT * FROM table_or_view_name")

which will issue the query and save the result to a temporary table, and then begin the read process. Of course, this will incur the costs of querying on BigQuery, so if you wish to read from the same VIEW or EXTERNAL table repeatedly you may want to manually create the table.
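As a minimal sketch of the query-based workaround: assuming the query source uses BigQuery legacy SQL (where a table is referenced as [project:dataset.table]), the table from the question can be turned into a query string with a small helper. The names FederatedQuery and selectAll and the project id my-project are hypothetical and not part of the Dataflow SDK.

```java
// Hypothetical helper, not part of the Dataflow SDK.
final class FederatedQuery {

    // Builds a legacy-SQL "SELECT *" over the given table.
    // Assumption: legacy SQL table references use the [project:dataset.table] form.
    static String selectAll(String project, String dataset, String table) {
        return String.format("SELECT * FROM [%s:%s.%s]", project, dataset, table);
    }

    public static void main(String[] args) {
        // "my-project" is a placeholder project id.
        String query = selectAll("my-project", "CPT_7414_PLAYGROUND", "Federated_test_dataflow");
        System.out.println(query);
    }
}
```

The resulting string would then be passed to BigQueryIO.Read.fromQuery(query) in place of the BigQueryIO.Read.from(...) call in the question.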

#2


4  

The Dataflow BigQuery source was designed to read BigQuery managed tables of type "TABLE". (The type definition can be found at https://cloud.google.com/bigquery/docs/reference/v2/tables#type.) EXTERNAL and VIEW tables are not supported.

The BigQuery "federated table" feature allows BigQuery to directly query data in places like Google Cloud Storage. Dataflow can also read files from Google Cloud Storage, so you should be able to point your Dataflow computation directly at the sources you want to read.
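If the application is built around TableRow objects, one way to keep that shape while reading the files from GCS directly is to convert each CSV line into a row keyed by column name. The sketch below uses a plain Map as a stand-in for TableRow so that it runs without the SDK. The class name CsvRows, the column names, and the sample line are hypothetical, and the naive split does not handle quoted fields or embedded commas.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: turns one CSV line into a row keyed by column name.
final class CsvRows {

    static Map<String, Object> toRow(String[] columns, String line) {
        // Limit of -1 keeps trailing empty fields instead of dropping them.
        String[] values = line.split(",", -1);
        if (values.length != columns.length) {
            throw new IllegalArgumentException(
                "Expected " + columns.length + " fields but got " + values.length);
        }
        // LinkedHashMap preserves the column order of the file.
        Map<String, Object> row = new LinkedHashMap<>();
        for (int i = 0; i < columns.length; i++) {
            row.put(columns[i], values[i]);
        }
        return row;
    }

    public static void main(String[] args) {
        // Column names and data are made up for illustration.
        Map<String, Object> row = toRow(new String[] {"id", "name"}, "1,alice");
        System.out.println(row);
    }
}
```

In a pipeline, the same conversion would sit inside a DoFn applied to the lines read from GCS, setting each field on a new TableRow instead of a Map.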
