As per this AWS Forum Thread, does anyone know how to use AWS Glue to create an AWS Athena table whose partitions contain different schemas (in this case different subsets of columns from the table schema)?
At the moment, when I run the crawler over this data and then run a query in Athena, I get the error 'HIVE_PARTITION_SCHEMA_MISMATCH'.
My use case is:
- Partitions represent days
- Files represent events
- Each event is a JSON blob in a single S3 file
- An event contains a subset of columns (dependent on the type of event)
- The 'schema' of the entire table is the full set of columns across all the event types (the Glue crawler puts this together correctly)
- The 'schema' of each partition is the subset of columns for the event types that occurred on that day (hence in Glue each partition potentially has a different subset of columns from the table schema)
- I think this inconsistency is what causes the error in Athena
If I wrote the schema manually, this would work fine, since there would be just one table schema and keys missing from a JSON file would be treated as NULLs.
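A minimal Python sketch of the behaviour described above, using made-up event fields and dates (none of these names come from the real data): the table schema is the union of the columns seen in every partition, each partition only sees a subset, and reading an event against the single table schema turns missing keys into None (NULL).

```python
import json

# Hypothetical events grouped by day partition; field names are invented for illustration.
events_by_day = {
    "2018-01-01": ['{"event_type": "login", "user_id": "u1"}'],
    "2018-01-02": ['{"event_type": "purchase", "user_id": "u2", "amount": 9.99}'],
}


def partition_columns(raw_events):
    """Columns seen in one partition, in first-seen order."""
    cols = []
    for raw in raw_events:
        for key in json.loads(raw):
            if key not in cols:
                cols.append(key)
    return cols


def table_columns(by_day):
    """Union of all partition columns: the single table-level schema."""
    cols = []
    for day in sorted(by_day):
        for col in partition_columns(by_day[day]):
            if col not in cols:
                cols.append(col)
    return cols


def read_row(raw_event, columns):
    """Project one JSON event onto the table schema; missing keys become None."""
    record = json.loads(raw_event)
    return tuple(record.get(col) for col in columns)


schema = table_columns(events_by_day)
rows = [
    read_row(raw, schema)
    for day in sorted(events_by_day)
    for raw in events_by_day[day]
]
```

Here the first day's partition only has two of the three table columns, which is the kind of per-partition difference that seems to trigger the mismatch.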
Thanks in advance!
2 solutions
#1
I had the same issue and solved it by configuring the crawler to update table metadata for preexisting partitions:
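A sketch of one way to set this from Python with boto3. The crawler name `my-crawler` is a placeholder, and it assumes the setting in question is the crawler configuration option that makes partitions inherit their metadata from the table (`AddOrUpdateBehavior` set to `InheritFromTable`).

```python
import json


def build_crawler_configuration():
    """Return the crawler configuration JSON that makes partitions inherit the table schema."""
    config = {
        "Version": 1.0,
        "CrawlerOutput": {
            "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"}
        },
    }
    return json.dumps(config)


def update_crawler(crawler_name):
    """Apply the configuration to an existing crawler (needs AWS credentials)."""
    # Imported here so the configuration can be built and inspected without boto3 installed.
    import boto3

    glue = boto3.client("glue")
    glue.update_crawler(
        Name=crawler_name,
        Configuration=build_crawler_configuration(),
    )


# Example call, with a placeholder crawler name:
# update_crawler("my-crawler")
```

The crawler probably needs to be run again after the update so that the existing partitions pick up the table-level schema.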