
Date: 2021-12-31 15:25:45

I want to load data that uses a multi-character delimiter into BigQuery. The bq load command currently does not support multi-character delimiters; it only supports single-character delimiters such as '|', '$', or '~'.


I know there is a Dataflow approach that reads the data from those files and writes it to BigQuery. But I have a large number of small files (each about 400 MB), and each one has to be written to a separate partition of a table (around 700 partitions). This is slow with Dataflow because I currently use a for loop that starts a separate Dataflow job to write each file to a separate table. It has been running for more than 24 hours and is still not complete.


So is there another way to load these files, which use a multi-character delimiter, into their individual BigQuery partitions?


2 Answers



From the Dataflow perspective, you can make this easier by uploading multiple files in each pipeline. You can have a for loop in your main method while assembling the pipeline, essentially having many Read -> Write to BigQuery steps.


See also Strategy for loading data into BigQuery and Google cloud Storage from local disk for more information.
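The "for loop while assembling the pipeline" idea might look roughly like the sketch below, written with the Apache Beam Python SDK. The helper names (`split_row`, `partition_table`, `run`), the `files_by_day` mapping, and the use of a `table$YYYYMMDD` partition decorator as the write target are all assumptions for illustration, not something stated in the answer. Every parsed value stays a string, so the schema passed in would need STRING columns.

```python
from datetime import date


def split_row(line, delimiter, field_names):
    """Split one raw line on a multi-character delimiter into a dict."""
    return dict(zip(field_names, line.split(delimiter)))


def partition_table(table, day):
    """Build a table spec with a partition decorator, e.g. ds.tbl$20211231."""
    return f"{table}${day:%Y%m%d}"


def run(files_by_day, table, schema, delimiter, field_names, argv=None):
    # Imported lazily so the helpers above work without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    with beam.Pipeline(options=PipelineOptions(argv)) as p:
        # One Read -> Write branch per file, all inside a single job.
        for i, (day, path) in enumerate(sorted(files_by_day.items())):
            (
                p
                | f"Read{i}" >> beam.io.ReadFromText(path)
                | f"Parse{i}" >> beam.Map(split_row, delimiter, field_names)
                | f"Write{i}" >> beam.io.WriteToBigQuery(
                    partition_table(table, day),
                    schema=schema,
                    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                )
            )
```

With around 700 files, a single job containing every branch may produce a very large job graph, so grouping the files into a handful of batches (one pipeline per batch) is probably a safer starting point than one pipeline for everything.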




My lazy approach to these problems: Don't parse in Dataflow, just send each row raw to BigQuery (one column per row).


Then you can parse inside BigQuery with a JS UDF.
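As a rough illustration of the parsing step, the sketch below generates a BigQuery query that defines a temporary JavaScript UDF and splits each raw line on the multi-character delimiter. The function name `build_parse_query`, the table `my_dataset.raw_lines`, the column `raw`, and the output field names are hypothetical; it assumes the raw lines were already loaded into a single STRING column.

```python
import json


def build_parse_query(raw_table, raw_column, delimiter, field_names):
    """Return SQL that splits a single raw STRING column with a JS UDF."""
    # json.dumps yields a quoted literal that is also valid JavaScript.
    js_delimiter = json.dumps(delimiter)
    # SAFE_OFFSET returns NULL instead of failing when a row has too few fields.
    select_list = ",\n  ".join(
        f"parts[SAFE_OFFSET({i})] AS {name}"
        for i, name in enumerate(field_names)
    )
    return (
        "CREATE TEMP FUNCTION split_row(line STRING)\n"
        "RETURNS ARRAY<STRING>\n"
        'LANGUAGE js AS r"""\n'
        f"  return line.split({js_delimiter});\n"
        '""";\n'
        "SELECT\n"
        f"  {select_list}\n"
        f"FROM (SELECT split_row({raw_column}) AS parts FROM `{raw_table}`)"
    )


if __name__ == "__main__":
    print(build_parse_query("my_dataset.raw_lines", "raw", "|~|", ["id", "name"]))
```

The generated query can be run with a destination table set to the target partition, so the parsing happens entirely inside BigQuery.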



