How to run ad hoc SQL queries on flat files generated by Hive?

Time: 2022-11-15 00:57:52

We analyze our log data using Hive and store the aggregation results in daily partitioned text files on S3 (let's call them "coarse" aggregations).


These aggregation results are rather small (not more than a few MB per day) and we have a Javascript dashboard that loads and visualizes certain aspects of this data (let's call them "fine-grained" aggregations).


Right now we perform the "fine-grained" aggregations with Javascript code. I would rather use SQL queries here, too, for simplicity. What best practices exist for this kind of problem?


A) We could generate all "fine-grained" aggregations in Hive. However, running Hive on such small data sets takes ages.


B) We could introduce a "fast-access-layer" between S3 and Javascript that can run SQL queries. What query engine would you recommend?


1 solution


Use Presto for fast access to datasets that are not very large. Presto is an in-memory, distributed SQL query engine optimized for interactive queries and star-schema joins (a big fact table joined with small dimension tables). A key feature of Presto is that data moves from memory to memory without intermediate disk writes. You can query your Hive tables through the Presto Hive connector.
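
As a rough illustration of option B, here is a minimal Python sketch of a thin layer that builds a "fine-grained" aggregation query and submits it to Presto over its HTTP statement endpoint. The table name `coarse_aggregations`, the partition column `dt`, the columns `country` and `hits`, the server address, and the user name are all assumptions made for the example; they are not taken from the setup described in the question.

```python
import json
import urllib.request
from datetime import date


def build_daily_query(table: str, day: date, group_col: str, value_col: str) -> str:
    """Build a fine-grained aggregation over one daily partition.

    Assumes the Hive table is partitioned by a string column named `dt`
    (hypothetical). Identifiers are interpolated directly, so they must
    come from trusted code, never from user input.
    """
    return (
        f"SELECT {group_col}, SUM({value_col}) AS total "
        f"FROM {table} "
        f"WHERE dt = '{day.isoformat()}' "
        f"GROUP BY {group_col} "
        f"ORDER BY total DESC"
    )


def run_presto_query(sql: str,
                     server: str = "http://localhost:8080",  # assumed coordinator address
                     user: str = "dashboard",                # assumed user name
                     catalog: str = "hive",
                     schema: str = "default") -> list:
    """Submit a statement to the Presto coordinator and collect all result rows."""
    request = urllib.request.Request(
        f"{server}/v1/statement",
        data=sql.encode("utf-8"),
        headers={
            "X-Presto-User": user,
            "X-Presto-Catalog": catalog,
            "X-Presto-Schema": schema,
        },
        method="POST",
    )
    rows = []
    # Results arrive in pages; keep following `nextUri` until it is absent.
    while request is not None:
        with urllib.request.urlopen(request) as response:
            page = json.load(response)
        if "error" in page:
            raise RuntimeError(page["error"].get("message", "query failed"))
        rows.extend(page.get("data", []))
        next_uri = page.get("nextUri")
        request = (
            urllib.request.Request(next_uri, headers={"X-Presto-User": user})
            if next_uri
            else None
        )
    return rows
```

With a coordinator running and the Hive connector configured, something like `run_presto_query(build_daily_query("coarse_aggregations", date(2022, 11, 14), "country", "hits"))` would return the rows as a list of lists, which a small web endpoint could pass to the dashboard as JSON. In practice an existing Presto client library may be a better choice than hand-written HTTP calls.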
