Reading a JSON file with BigQuery to build a table

Date: 2021-08-22 15:29:47

I'm new to Google Dataflow and can't get it to work with JSON. I've read through the documentation but can't solve my problem.

So, following the WordCount example, I figured out how data is loaded from a .csv file with the following line:

PCollection<String> input = p.apply(TextIO.Read.from(options.getInputFile()));

where inputFile is a .csv file in my gcloud bucket. I can transform the lines read from the .csv with:

PCollection<TableRow> table = input.apply(ParDo.of(new ExtractParametersFn()));

(ExtractParametersFn is defined by me.) So far so good!

But then I realized my .csv file was too big and I had to convert it to JSON (https://cloud.google.com/bigquery/preparing-data-for-bigquery). Since BigQueryIO is supposedly better for reading JSON, I tried the following code:

 PCollection<TableRow> table = p.apply(BigQueryIO.Read.from(options.getInputFile()));

(inputFile is now a JSON file, and the output of reading with BigQueryIO is a PCollection of TableRows.) I also tried TextIO (which returns a PCollection of Strings), and neither of the two IO options works.

What am I missing? The documentation is not detailed enough for me to find an answer there, but perhaps some of you have already dealt with this problem?

Any suggestions would be greatly appreciated. :)

1 answer

#1


3  

I believe there are two options to consider:

  1. Use TextIO with TableRowJsonCoder to ingest the JSON files (e.g., as is done in the TopWikipediaSessions example);
  2. Import the JSON files into a BigQuery table (https://cloud.google.com/bigquery/loading-data-into-bigquery), and then use BigQueryIO.Read to read from the table.
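
One thing worth checking before trying the TextIO option: TextIO hands the pipeline one element per line, and TableRowJsonCoder decodes each element as a single JSON object, so the file should be newline-delimited JSON rather than one pretty-printed document. Below is a minimal sketch of such a pre-check in plain Java with no Dataflow dependency. The class and method names are hypothetical, and the brace test is only a rough heuristic, not a JSON parser.

```java
import java.util.Arrays;
import java.util.List;

public class JsonLinesCheck {

    // Heuristic (an assumption of this sketch): a line "looks like" a
    // standalone JSON object if, after trimming, it starts with '{' and
    // ends with '}'. It does not validate the JSON in between.
    public static boolean looksLikeJsonObject(String line) {
        String t = line.trim();
        return t.length() >= 2
                && t.charAt(0) == '{'
                && t.charAt(t.length() - 1) == '}';
    }

    // Returns the 1-based number of the first line that fails the
    // heuristic, or -1 if every line passes.
    public static int firstBadLine(List<String> lines) {
        for (int i = 0; i < lines.size(); i++) {
            if (!looksLikeJsonObject(lines.get(i))) {
                return i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // One object per line: the shape a line-oriented reader expects.
        List<String> newlineDelimited = Arrays.asList(
                "{\"name\": \"a\", \"value\": 1}",
                "{\"name\": \"b\", \"value\": 2}");

        // A JSON array spread over several lines: no line is a full object.
        List<String> prettyPrinted = Arrays.asList(
                "[",
                "  {\"name\": \"a\", \"value\": 1},",
                "  {\"name\": \"b\", \"value\": 2}",
                "]");

        System.out.println(firstBadLine(newlineDelimited)); // -1
        System.out.println(firstBadLine(prettyPrinted));    // 1
    }
}
```

If the file passes, the read itself would follow the TopWikipediaSessions pattern, i.e. TextIO.Read.from(options.getInputFile()).withCoder(TableRowJsonCoder.of()). For the second option, note that BigQueryIO.Read.from(...) takes a BigQuery table reference rather than a file path, which is likely why passing options.getInputFile() to it did not work.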
