JSON table schema to a bigquery.TableSchema for BigQuerySink

Time: 2021-02-27 15:30:03

I have a non-trivial table schema (involving nested and repeated fields) defined in JSON format (with the name, type and mode attributes) and stored in a file. It has been used successfully to populate a BigQuery table with the bq load command.

But when I try to do the same thing with Dataflow Python SDK and BigQuerySink, the schema argument needs to be either a comma-separated list of 'name':'type' elements, or a bigquery.TableSchema object.

Is there any convenient way of getting my JSON schema to a bigquery.TableSchema, or do I have to transform it to a name:value list?

2 solutions

#1


6  

Currently you cannot directly specify a JSON schema. You have to specify the schema either as a string that contains a comma-separated list of fields or as a bigquery.TableSchema object.
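For a flat schema, the string form can be produced from the JSON field list with a few lines. This is only a sketch: `to_schema_string` is a hypothetical helper, not part of the SDK, and it assumes the JSON uses the `name`, `type` and `mode` attributes from the question. As far as I know the string form cannot express nested fields or a mode other than nullable, so the helper refuses those instead of silently dropping them.

```python
def to_schema_string(fields):
    """Build 'name:TYPE,name:TYPE' from a flat list of JSON field dicts."""
    parts = []
    for field in fields:
        mode = field.get('mode', 'NULLABLE').upper()
        # Nested records and non-nullable modes need a TableSchema object.
        if field.get('fields') or mode != 'NULLABLE':
            raise ValueError(
                'field %r needs a TableSchema object' % field['name'])
        parts.append('%s:%s' % (field['name'], field['type'].upper()))
    return ','.join(parts)


flat_fields = [
    {'name': 'fullName', 'type': 'string'},
    {'name': 'age', 'type': 'integer', 'mode': 'nullable'},
]
print(to_schema_string(flat_fields))  # fullName:STRING,age:INTEGER
```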

If the schema is complex and contains nested and/or repeated fields, we recommend building a bigquery.TableSchema object.

Here is an example bigquery.TableSchema object with nested and repeated fields.

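# Import path as given in the original answer; with Apache Beam the module is
# apache_beam.io.gcp.internal.clients (the path used in answer #2).
from apitools.clients import bigquery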

table_schema = bigquery.TableSchema()

# 'string' field
field_schema = bigquery.TableFieldSchema()
field_schema.name = 'fullName'
field_schema.type = 'string'
field_schema.mode = 'required'
table_schema.fields.append(field_schema)

# 'integer' field
field_schema = bigquery.TableFieldSchema()
field_schema.name = 'age'
field_schema.type = 'integer'
field_schema.mode = 'nullable'
table_schema.fields.append(field_schema)

# nested field
field_schema = bigquery.TableFieldSchema()
field_schema.name = 'phoneNumber'
field_schema.type = 'record'
field_schema.mode = 'nullable'

area_code = bigquery.TableFieldSchema()
area_code.name = 'areaCode'
area_code.type = 'integer'
area_code.mode = 'nullable'
field_schema.fields.append(area_code)

number = bigquery.TableFieldSchema()
number.name = 'number'
number.type = 'integer'
number.mode = 'nullable'
field_schema.fields.append(number)
table_schema.fields.append(field_schema)

# repeated field
field_schema = bigquery.TableFieldSchema()
field_schema.name = 'children'
field_schema.type = 'string'
field_schema.mode = 'repeated'
table_schema.fields.append(field_schema)
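For comparison, this is roughly how the same four fields would look in the JSON form the question describes (`name`, `type`, `mode`, plus a nested `fields` list for the record). It is a sketch written as a Python dict so it can be checked and dumped to a file; the `describe` helper is hypothetical and only lists the dotted field paths.

```python
import json

schema = {
    'fields': [
        {'name': 'fullName', 'type': 'string', 'mode': 'required'},
        {'name': 'age', 'type': 'integer', 'mode': 'nullable'},
        {'name': 'phoneNumber', 'type': 'record', 'mode': 'nullable',
         'fields': [
             {'name': 'areaCode', 'type': 'integer', 'mode': 'nullable'},
             {'name': 'number', 'type': 'integer', 'mode': 'nullable'},
         ]},
        {'name': 'children', 'type': 'string', 'mode': 'repeated'},
    ]
}


def describe(fields, prefix=''):
    """Return 'path type mode' lines, recursing into nested records."""
    lines = []
    for field in fields:
        path = prefix + field['name']
        lines.append('%s %s %s' % (path, field['type'], field['mode']))
        lines.extend(describe(field.get('fields', []), path + '.'))
    return lines


print(json.dumps(schema, indent=2))
print('\n'.join(describe(schema['fields'])))
```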

#2


2  

I had the same problem. In my case I already had some JSON loaded in BigQuery with an automatically generated schema.

So I was able to get the autogenerated schema with the command:

bq show --format prettyjson my-gcp-project:my-bq-table |jq .schema > my-bq-table.json

The schema can then be transformed into a bigquery.TableSchema with this snippet:

from apache_beam.io.gcp.internal.clients import bigquery


def _get_field_schema(**kwargs):
    field_schema = bigquery.TableFieldSchema()
    field_schema.name = kwargs['name']
    field_schema.type = kwargs.get('type', 'STRING')
    field_schema.mode = kwargs.get('mode', 'NULLABLE')
    fields = kwargs.get('fields')
    if fields:
        for field in fields:
            field_schema.fields.append(_get_field_schema(**field))
    return field_schema


def _inject_fields(fields, table_schema):
    for field in fields:
        table_schema.fields.append(_get_field_schema(**field))


def parse_bq_json_schema(schema):
    table_schema = bigquery.TableSchema()
    _inject_fields(schema['fields'], table_schema)
    return table_schema

It works with the BigQuery JSON schema specification, and if you are lazy like me you can leave out type and mode for any field where the defaults are fine: a field with only a name becomes a nullable string.
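The defaulting behaviour can be checked without Beam installed. The sketch below mirrors the recursion of `_get_field_schema` on plain dicts; `normalize_field` is a hypothetical stand-in rather than an SDK function, and the inline JSON stands in for the contents of `my-bq-table.json`.

```python
import json


def normalize_field(field):
    """Copy a JSON field dict, filling in the same defaults as the snippet."""
    result = {
        'name': field['name'],                  # name is mandatory
        'type': field.get('type', 'STRING'),    # default type
        'mode': field.get('mode', 'NULLABLE'),  # default mode
    }
    nested = field.get('fields')
    if nested:
        # Recurse into nested record fields, as _get_field_schema does.
        result['fields'] = [normalize_field(f) for f in nested]
    return result


raw = json.loads('''
{"fields": [
  {"name": "fullName"},
  {"name": "phoneNumber", "type": "RECORD",
   "fields": [{"name": "areaCode", "type": "INTEGER"}]}
]}
''')
normalized = [normalize_field(f) for f in raw['fields']]
print(json.dumps(normalized, indent=2))
```

With the answer's own function, the equivalent call would be `parse_bq_json_schema(json.load(f))` on the opened schema file, and the returned object is what goes into the sink's schema argument.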
