I have a non-trivial table schema (with nested and repeated fields) defined in JSON (with the name, type, and mode attributes) and stored in a file. I have used it successfully to populate a BigQuery table with the bq load command.
But when I try to do the same thing with the Dataflow Python SDK and BigQuerySink, the schema argument needs to be either a comma-separated list of 'name':'type' elements or a bigquery.TableSchema object.
Is there a convenient way to turn my JSON schema into a bigquery.TableSchema, or do I have to transform it into a name:type list?
2 Answers
#1
6
Currently you cannot specify a JSON schema directly. You have to specify the schema either as a string containing a comma-separated list of fields or as a bigquery.TableSchema object.
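For a flat schema, the string form can be generated from the JSON field list. The following is only a sketch: schema_to_string is a hypothetical helper (not part of any SDK), and it assumes each field is a dict with a name and an optional type. The string form has no slot for a mode, so a REQUIRED mode is not carried over, and nested or repeated fields are rejected.

```python
import json


def schema_to_string(fields):
    """Build 'name:TYPE,name:TYPE' from a list of flat field dicts.

    Hypothetical helper for illustration only. Raises ValueError for
    nested (RECORD) or REPEATED fields, which the string form cannot express.
    """
    parts = []
    for field in fields:
        field_type = field.get("type", "STRING").upper()
        mode = field.get("mode", "NULLABLE").upper()
        if field_type == "RECORD" or mode == "REPEATED":
            raise ValueError(
                "field %r needs a TableSchema object" % field["name"])
        # Only name and type are kept; the mode is dropped.
        parts.append("%s:%s" % (field["name"], field_type))
    return ",".join(parts)


flat_fields = json.loads(
    '[{"name": "fullName", "type": "string"},'
    ' {"name": "age", "type": "integer"}]')
schema_string = schema_to_string(flat_fields)
print(schema_string)  # fullName:STRING,age:INTEGER
```
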
If the schema is complex and contains nested and/or repeated fields, we recommend building a bigquery.TableSchema object.
Here is an example bigquery.TableSchema object with nested and repeated fields:
from apitools.clients import bigquery
table_schema = bigquery.TableSchema()
# 'string' field
field_schema = bigquery.TableFieldSchema()
field_schema.name = 'fullName'
field_schema.type = 'string'
field_schema.mode = 'required'
table_schema.fields.append(field_schema)
# 'integer' field
field_schema = bigquery.TableFieldSchema()
field_schema.name = 'age'
field_schema.type = 'integer'
field_schema.mode = 'nullable'
table_schema.fields.append(field_schema)
# nested field
field_schema = bigquery.TableFieldSchema()
field_schema.name = 'phoneNumber'
field_schema.type = 'record'
field_schema.mode = 'nullable'
area_code = bigquery.TableFieldSchema()
area_code.name = 'areaCode'
area_code.type = 'integer'
area_code.mode = 'nullable'
field_schema.fields.append(area_code)
number = bigquery.TableFieldSchema()
number.name = 'number'
number.type = 'integer'
number.mode = 'nullable'
field_schema.fields.append(number)
table_schema.fields.append(field_schema)
# repeated field
field_schema = bigquery.TableFieldSchema()
field_schema.name = 'children'
field_schema.type = 'string'
field_schema.mode = 'repeated'
table_schema.fields.append(field_schema)
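For comparison with the question's JSON file, the same four fields could be written as a JSON field list along these lines. This is a sketch of the equivalent layout, not output from any tool; describe is a hypothetical helper that only prints an indented outline so the nesting is easy to check.

```python
import json

# Assumed JSON equivalent of the hand-built schema above.
schema_json = """
[
  {"name": "fullName", "type": "STRING", "mode": "REQUIRED"},
  {"name": "age", "type": "INTEGER", "mode": "NULLABLE"},
  {"name": "phoneNumber", "type": "RECORD", "mode": "NULLABLE", "fields": [
    {"name": "areaCode", "type": "INTEGER", "mode": "NULLABLE"},
    {"name": "number", "type": "INTEGER", "mode": "NULLABLE"}
  ]},
  {"name": "children", "type": "STRING", "mode": "REPEATED"}
]
"""
fields = json.loads(schema_json)


def describe(fields, depth=0):
    """Return one 'name TYPE MODE' line per field, indented by nesting depth."""
    lines = []
    for field in fields:
        lines.append("%s%s %s %s" % (
            "  " * depth, field["name"], field["type"], field["mode"]))
        # Recurse into the sub-fields of a RECORD, if any.
        lines.extend(describe(field.get("fields", []), depth + 1))
    return lines


outline = describe(fields)
print("\n".join(outline))
```
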
#2
2
I had the same problem. In my case I already had some JSON loaded into BigQuery with an automatically generated schema.
So I was able to get the autogenerated schema with this command:
bq show --format prettyjson my-gcp-project:my-bq-table |jq .schema > my-bq-table.json
The schema can then be transformed into a bigquery.TableSchema with this snippet:
from apache_beam.io.gcp.internal.clients import bigquery

def _get_field_schema(**kwargs):
    field_schema = bigquery.TableFieldSchema()
    field_schema.name = kwargs['name']
    field_schema.type = kwargs.get('type', 'STRING')
    field_schema.mode = kwargs.get('mode', 'NULLABLE')
    fields = kwargs.get('fields')
    if fields:
        for field in fields:
            field_schema.fields.append(_get_field_schema(**field))
    return field_schema

def _inject_fields(fields, table_schema):
    for field in fields:
        table_schema.fields.append(_get_field_schema(**field))

def parse_bq_json_schema(schema):
    table_schema = bigquery.TableSchema()
    _inject_fields(schema['fields'], table_schema)
    return table_schema
It works with the BigQuery JSON schema specification, and if you are lazy like me you can omit type and mode when you are happy with a field defaulting to a nullable string.
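One detail to watch: parse_bq_json_schema reads schema['fields'], which matches the object that bq show prints, whereas a schema file written for bq load is normally a bare JSON array of fields. A small loader can accept both layouts. This is a sketch; load_schema is a hypothetical name, and the sample strings are made up for illustration.

```python
import json


def load_schema(text):
    """Parse JSON schema text and return a dict shaped like {'fields': [...]}.

    Hypothetical helper: accepts either a bare list of field objects or an
    object that already has a 'fields' list.
    """
    parsed = json.loads(text)
    if isinstance(parsed, list):
        return {"fields": parsed}
    if isinstance(parsed, dict) and isinstance(parsed.get("fields"), list):
        return parsed
    raise ValueError("unrecognized schema layout")


bare_list = '[{"name": "fullName", "type": "STRING"}]'
wrapped = '{"fields": [{"name": "fullName", "type": "STRING"}]}'

normalized = load_schema(bare_list)
# Both layouts normalize to the same dict.
print(normalized == load_schema(wrapped))
```

The result can then be passed to the snippet above, for example parse_bq_json_schema(load_schema(text)), where text is the content of the schema file.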