Loading data stored in Google Cloud Storage into BigQuery with a multi-character delimiter

Time: 2022-06-19 15:38:26

I want to load data with a multi-character delimiter into BigQuery. The BQ load command currently does not support multi-character delimiters; it supports only single-character delimiters such as '|', '$', '~', etc.

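For reference, this is what the standard load path looks like through the google-cloud-bigquery Python client (the bucket, table, and schema below are placeholders); the field_delimiter option accepts only a single character, which is exactly the limitation described above:

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter="|",  # a single character only; something like "~|~" is rejected
    schema=[
        bigquery.SchemaField("col_a", "STRING"),
        bigquery.SchemaField("col_b", "STRING"),
    ],
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/data/file_001.txt",  # placeholder path
    "my-project.my_dataset.my_table",    # placeholder table
    job_config=job_config,
)
load_job.result()  # wait for the load job to complete
```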

I know there is a Dataflow approach where it reads data from those files and writes to BigQuery. But I have a large number of small files (each around 400 MB) which have to be written to a separate partition of a table (around 700 partitions). This approach is slow with Dataflow because I currently have to start a different Dataflow job for each file, writing it to a separate partition in a for loop. This has been running for more than 24 hours and is still not complete.


So is there any other approach to load these multiple files with a multi-character delimiter into each partition of BigQuery?


2 Solutions

#1


1  

From the Dataflow perspective, you can make this easier by uploading multiple files in each pipeline. You can have a for loop in your main method while assembling the pipeline, essentially having many Read -> Write to BigQuery steps.

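A minimal sketch of that idea with the Apache Beam Python SDK (the answer does not name a language, so Python is assumed here; the project, bucket, file-to-table mapping, column names, and the "~|~" delimiter are all placeholders). Each file gets its own Read -> Parse -> Write branch, but they all run inside a single Dataflow job:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

COLUMNS = ["col_a", "col_b", "col_c"]          # placeholder column names
SCHEMA = "col_a:STRING,col_b:STRING,col_c:STRING"


def run():
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",                   # placeholder project
        region="us-central1",
        temp_location="gs://my-bucket/tmp",     # placeholder bucket
    )

    # Placeholder mapping of input files to destination tables; with a
    # date-partitioned table this could instead use partition decorators
    # such as "my_dataset.my_table$20220101".
    files_to_tables = {
        "gs://my-bucket/data/file_001.txt": "my_dataset.table_001",
        "gs://my-bucket/data/file_002.txt": "my_dataset.table_002",
    }

    with beam.Pipeline(options=options) as p:
        # One Read -> Parse -> Write branch per file, assembled in a for loop.
        for i, (gcs_path, table) in enumerate(files_to_tables.items()):
            (
                p
                | f"Read {i}" >> beam.io.ReadFromText(gcs_path)
                | f"Split {i}" >> beam.Map(
                    lambda line: dict(zip(COLUMNS, line.split("~|~")))
                )
                | f"Write {i}" >> beam.io.WriteToBigQuery(
                    table,
                    schema=SCHEMA,
                    write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
                    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                )
            )


if __name__ == "__main__":
    run()
```

Since all branches live in one pipeline, the per-job startup overhead is paid once instead of once per file, which is what makes the loop-of-jobs approach so slow.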

See also Strategy for loading data into BigQuery and Google cloud Storage from local disk for more information.


#2


0  

My lazy approach to these problems: Don't parse in Dataflow, just send each row raw to BigQuery (one column per row).

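A sketch of the "send each row raw" step with the Apache Beam Python SDK (the file path and staging table name are placeholders): every line is wrapped into a single STRING column, so no delimiter handling happens in Dataflow at all.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | "Read raw" >> beam.io.ReadFromText("gs://my-bucket/data/file_001.txt")  # placeholder path
        | "Wrap" >> beam.Map(lambda line: {"line": line})  # one STRING column per row
        | "Write raw" >> beam.io.WriteToBigQuery(
            "my_dataset.raw_lines",  # placeholder staging table
            schema="line:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```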

Then you can parse inside BigQuery with a JS UDF.

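And a sketch of the parsing step, run as a query through the google-cloud-bigquery Python client (the raw_lines staging table, the column names, and the "~|~" delimiter are the same placeholders as above): a temporary JavaScript UDF splits each raw line into an array, and the outer SELECT maps array positions back to columns.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Parse the staged raw lines inside BigQuery using a temporary JavaScript UDF.
sql = """
CREATE TEMP FUNCTION parse_line(line STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS '''
  return line.split("~|~");
''';

SELECT
  fields[OFFSET(0)] AS col_a,
  fields[OFFSET(1)] AS col_b,
  fields[OFFSET(2)] AS col_c
FROM (
  SELECT parse_line(line) AS fields
  FROM `my_dataset.raw_lines`
);
"""

rows = client.query(sql).result()  # or write to a destination table via QueryJobConfig
```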
