Loading data stored in Google Cloud Storage into BigQuery with a multi-character delimiter

Time: 2021-12-31 15:25:45

I want to load data with a multi-character delimiter into BigQuery. The bq load command currently does not support multi-character delimiters; it supports only single-character delimiters such as '|', '$', or '~'.

I know there is a Dataflow approach that reads data from those files and writes it to BigQuery. But I have a large number of small files (about 400 MB each), and each one has to be written to a separate partition of a table (around 700 partitions). This approach is slow with Dataflow because I currently have to start a separate Dataflow job for each file, using a for loop, to write it to its own table. It has been running for more than 24 hours and is still not complete.

So is there any other approach to load these multiple files, which use a multi-character delimiter, into the individual partitions of a BigQuery table?

2 solutions

#1


1  

From the Dataflow perspective, you can make this easier by loading multiple files in each pipeline. You can have a for loop in your main method while assembling the pipeline, so that a single job contains many Read -> Write to BigQuery steps instead of one job per file.
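As a rough illustration of the for loop described above, here is a minimal sketch using the Apache Beam Python SDK. The delimiter `|~|`, the column names, and the `file_to_table` mapping are all hypothetical placeholders, every column is assumed to be a STRING, and how each destination string addresses a partition is left to the caller. Treat it as one possible shape for the pipeline, not a tuned implementation.

```python
from typing import Dict, List, Optional

DELIMITER = "|~|"  # hypothetical multi-character delimiter
COLUMNS = ["id", "name", "amount"]  # hypothetical column names


def parse_line(line: str) -> Dict[str, str]:
    """Split one raw line on the multi-character delimiter into a row dict."""
    values = line.split(DELIMITER)
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    return dict(zip(COLUMNS, values))


def build_pipeline(file_to_table: Dict[str, str],
                   argv: Optional[List[str]] = None):
    """Assemble ONE pipeline with a Read -> Parse -> Write branch per file.

    file_to_table maps a GCS path or pattern to a destination table string
    (both supplied by the caller; assumed, not taken from the question).
    """
    # Imported here so parse_line can be used without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    pipeline = beam.Pipeline(options=PipelineOptions(argv))
    schema = ",".join(f"{name}:STRING" for name in COLUMNS)

    for i, (path, table) in enumerate(sorted(file_to_table.items())):
        # Step labels must be unique within a pipeline, hence the index.
        (
            pipeline
            | f"Read {i}" >> beam.io.ReadFromText(path)
            | f"Parse {i}" >> beam.Map(parse_line)
            | f"Write {i}" >> beam.io.WriteToBigQuery(
                table,
                schema=schema,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )
    return pipeline
```

Calling `build_pipeline({...}).run()` would then submit a single job covering all listed files, which avoids paying the job start-up cost once per file.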

See also "Strategy for loading data into BigQuery and Google cloud Storage from local disk" for more information.

#2


0  

My lazy approach to these problems: don't parse in Dataflow; just send each row raw to BigQuery, so that every row lands in a single string column.

Then you can parse the rows inside BigQuery with a JavaScript UDF.
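To make the idea concrete, the sketch below builds such a query as a string in Python. It assumes a staging table whose single STRING column is named `line`, hypothetical output column names that are safe to use as SQL identifiers, and a delimiter passed in as the named query parameter `@delim`. None of these names come from the answer. `parse_raw_row` just mirrors locally what the UDF body does.

```python
from typing import List


def parse_raw_row(line: str, delimiter: str) -> List[str]:
    """Local mirror of the UDF body: split a raw line on the delimiter."""
    return line.split(delimiter)


def build_parse_query(staging_table: str, columns: List[str]) -> str:
    """Return BigQuery standard SQL that splits `line` with a JS UDF.

    The delimiter is not inlined; it is expected as the query
    parameter @delim when the query is run.
    """
    select_list = ",\n  ".join(
        f"fields[SAFE_OFFSET({i})] AS {name}" for i, name in enumerate(columns)
    )
    return (
        "CREATE TEMP FUNCTION split_row(line STRING, delim STRING)\n"
        "RETURNS ARRAY<STRING>\n"
        "LANGUAGE js AS 'return line === null ? null : line.split(delim);';\n"
        "SELECT\n"
        f"  {select_list}\n"
        f"FROM (SELECT split_row(line, @delim) AS fields FROM `{staging_table}`)"
    )


if __name__ == "__main__":
    # Hypothetical staging table and column names, for illustration only.
    print(build_parse_query("my_dataset.raw_rows", ["id", "name", "amount"]))
```

`SAFE_OFFSET` is used so that a short or malformed row yields NULLs instead of failing the whole query. The result of the SELECT can then be written to the final table.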
