Loading a table from Cloud Storage into BigQuery with Python

Posted: 2021-12-29 14:25:43

Could someone please share an example of a job config for uploading a newline-delimited JSON file to a new BigQuery table?


I've been trying to do this based on the Google docs, with no success so far.


1 solution

#1


This example from the GCP samples repository is a good one for loading data from GCS.


The only thing you will have to adapt in your code is setting job.source_format to newline-delimited JSON, like so:


import time
import uuid

from google.cloud import bigquery


def wait_for_job(job):
    """Poll the load job until it finishes (helper from the same GCP sample)."""
    while True:
        job.reload()  # refresh the job state from the API
        if job.state == 'DONE':
            if job.error_result:
                raise RuntimeError(job.errors)
            return
        time.sleep(1)


def load_data_from_gcs(dataset_name, table_name, source):
    bigquery_client = bigquery.Client()
    dataset = bigquery_client.dataset(dataset_name)
    table = dataset.table(table_name)
    job_name = str(uuid.uuid4())  # load jobs need a unique name

    job = bigquery_client.load_table_from_storage(
        job_name, table, source)

    job.source_format = 'NEWLINE_DELIMITED_JSON'
    job.begin()

    wait_for_job(job)

    print('Loaded {} rows into {}:{}.'.format(
        job.output_rows, dataset_name, table_name))

(Ideally the source format would be passed in as a parameter to the function, but this works as an example.)

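Since the question asks for a job config: the load-job configuration that this code produces, as sent to the BigQuery Jobs API, looks roughly like the following (the project, bucket, dataset, and table names are placeholders):

```json
{
  "configuration": {
    "load": {
      "sourceUris": ["gs://your-bucket/data.json"],
      "destinationTable": {
        "projectId": "your-project",
        "datasetId": "your_dataset",
        "tableId": "your_table"
      },
      "sourceFormat": "NEWLINE_DELIMITED_JSON"
    }
  }
}
```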

Also, the table must already exist when you run this code (I looked for schema auto-detection in the Python API, but it seems there isn't one yet).

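For what it's worth, newer versions of the google-cloud-bigquery library do support schema auto-detection, and the legacy load_table_from_storage API has been replaced by load_table_from_uri with a LoadJobConfig. A rough sketch of the same load with the newer client follows; the names are placeholders, and actually calling the function requires GCP credentials:

```python
def load_ndjson_from_gcs(dataset_id, table_id, gcs_uri):
    # Imported inside the function so the sketch can be read (and its
    # signature tested) without the package installed; in real code the
    # import belongs at module top. Requires google-cloud-bigquery >= 1.x.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,  # newer clients infer the schema from the data
    )

    load_job = client.load_table_from_uri(
        gcs_uri,
        f"{client.project}.{dataset_id}.{table_id}",
        job_config=job_config,
    )
    load_job.result()  # blocks until the load job completes

    table = client.get_table(f"{client.project}.{dataset_id}.{table_id}")
    print(f"Loaded {table.num_rows} rows into {dataset_id}.{table_id}.")
```

Usage would be something like load_ndjson_from_gcs("my_dataset", "my_table", "gs://my-bucket/data.json"); with autodetect=True the destination table no longer has to exist beforehand.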
