How can I run ad hoc SQL queries on flat files generated by Hive?

Time: 2021-07-19 21:32:57

We analyze our log data using Hive and store the aggregation results in daily partitioned text files on S3 (let's call them "coarse" aggregations).

These aggregation results are rather small (not more than a few MB per day) and we have a Javascript dashboard that loads and visualizes certain aspects of this data (let's call them "fine-grained" aggregations).

Right now we perform the "fine-grained" aggregations with Javascript code. I would rather use SQL queries here, too, for simplicity. What best practices exist for this kind of problem?

A) We could generate all "fine-grained" aggregations in Hive. However, operating on these small data sets takes ages in Hive.

B) We could introduce a "fast-access-layer" between S3 and Javascript that can run SQL queries. What query engine would you recommend?

1 answer

#1


Use Presto for fast access to datasets that are not very large. Presto is a distributed, in-memory SQL query engine optimized for interactive queries and star-schema joins (a large fact table joined with small dimension tables). A key feature of Presto is that it transfers data from memory to memory between stages, without writing intermediate results to disk. You can query your Hive tables through the Presto Hive connector.
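
As a rough illustration of what a "fine-grained" aggregation could look like once the coarse results are reachable through the Hive connector, here is a minimal sketch that only builds the SQL text. The table name `hive.default.coarse_agg` and the columns `dt`, `page`, and `hits` are hypothetical; the sketch also assumes `dt` is the daily partition column stored as a `YYYY-MM-DD` string, so that the date filter restricts the scan to the matching partitions.

```python
from datetime import date

# Hypothetical name: replace with your own catalog.schema.table.
TABLE = "hive.default.coarse_agg"


def fine_grained_query(start: date, end: date) -> str:
    """Build a Presto query that re-aggregates the coarse daily rows.

    Assumes hypothetical columns: dt (partition, 'YYYY-MM-DD' string),
    page, and hits.
    """
    if start > end:
        raise ValueError("start must not be after end")
    # date.isoformat() always yields YYYY-MM-DD, so no free-form
    # user text is interpolated into the SQL string.
    return (
        "SELECT page, sum(hits) AS total_hits\n"
        f"FROM {TABLE}\n"
        f"WHERE dt BETWEEN '{start.isoformat()}' AND '{end.isoformat()}'\n"
        "GROUP BY page\n"
        "ORDER BY total_hits DESC\n"
        "LIMIT 20"
    )


if __name__ == "__main__":
    print(fine_grained_query(date(2021, 7, 1), date(2021, 7, 7)))
```

The resulting string can be submitted through the Presto CLI or whichever client library your dashboard backend uses. Building the query in one small function keeps the partition filter mandatory, which matters more for latency than the aggregation itself.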
