Switching the CDH Hive engine to Spark: Hive on Spark

Date: 2024-03-16 22:37:19

一、Background

Apache Hive's Hive on Spark and CDH's Hive on Spark are completely different things. The former has strict version pairing requirements: for example, Hive 1.1 only works with Spark 1.2, and Hive 2.2 only with Spark 1.6. Although CDH 5.16.2 ships Hive 1.1, Hive on Spark can be enabled there with a simple configuration change, with no need to upgrade Hive or downgrade Spark.

1. The version compatibility table from the Apache Hive wiki

https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
[Image: Hive-on-Spark version compatibility table from the wiki page above]

2. Configuring Hive on Spark in CDH is straightforward; no Hive upgrade or Spark downgrade is needed.

Cloudera's official documentation covers configuring Hive on Spark on CDH 5.16.x.
What follows is a translation of that document.

二、Overview

The document explains how to run Hive on Spark under CDH. It has four parts:

  1. Configuring Hive on Spark
  2. Dynamic partition pruning for Hive map joins
  3. Using Hive UDFs with Hive on Spark
  4. Troubleshooting Hive on Spark

三、Hardware requirements

[Image: hardware requirements table from the Cloudera documentation]

四、Configuring Hive on Spark

Minimum Required Role: Configurator (also provided by Cluster Administrator, Full Administrator)

That is, the minimum Cloudera Manager role needed to make this change is Configurator; the Cluster Administrator and Full Administrator roles also include this permission.

1. Two steps are required:

{1} Configure the Hive client to use the Spark execution engine, as described in Hive Execution Engines.

CDH Hive supports two execution engines, MapReduce and Spark. The engine can be set in either of two ways:

  • Beeline - (Can be set per query)
  • Cloudera Manager (Affects all queries, not recommended).
    1. Go to the Hive service.
    2. Click the Configuration tab.
    3. Search for “execution”.
    4. Set the Default Execution Engine property to MapReduce or Spark. The default is MapReduce.
    5. Click Save Changes to commit the changes.
    6. Return to the Home page by clicking the Cloudera Manager logo.
    7. Click the icon next to any stale services to invoke the cluster restart wizard.
    8. Click Restart Stale Services.
    9. Click Restart Now.
    10. Click Finish.
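For the Beeline route, the engine is an ordinary session-level property, so it can be switched per connection or per query. A minimal sketch; the HiveServer2 host in the JDBC URL and the database name are placeholders for your cluster:

```shell
# Connect to HiveServer2 and switch this session to the Spark engine.
# "hs2-host:10000" and "default" are placeholders for your environment.
beeline -u "jdbc:hive2://hs2-host:10000/default" \
  -e "SET hive.execution.engine=spark;
      SET hive.execution.engine;"
```

The second `SET` (with no value) prints the property's current value, which confirms the session is now using Spark. The change lasts only for that session, which is why the Beeline approach is listed as "per query" above.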

{2} Identify the Spark service that Hive uses. Cloudera Manager automatically sets this to the configured MapReduce or YARN service and the configured Spark service (see Configuring the Hive Dependency on a Spark Service).

By default, if a Spark service is available, the Hive dependency on it is configured automatically. To change this:

  1. Go to the Hive service, click the Configuration tab, and search for Spark On YARN Service.
    To enable the dependency, select the Spark service; to disable it, select none.
  2. Click Save Changes.
  3. Go to the Spark service.
  4. Add a Spark gateway role to the host running HiveServer2.
  5. Return to the Home page by clicking the Cloudera Manager logo.
  6. Click the icon next to any stale services to invoke the cluster restart wizard.
  7. Click Restart Stale Services.
  8. Click Restart Now.
  9. Click Finish.
  10. In the Hive client, configure the Spark execution engine.

五、Verification

1. Run SQL that would normally trigger MapReduce jobs, and check the logs

{1} Create a table and insert data into it; the log shows a Spark job running rather than an MR job.

[Screenshot: Hive console log showing Spark jobs for the CREATE/INSERT statements]

{2} Then run a SELECT DISTINCT(id) as well; again a Spark job shows up.

[Screenshot: Hive console log showing a Spark job for the SELECT DISTINCT query]
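The two checks above can be sketched as a single Beeline run; the table name `hos_test` and the JDBC URL are placeholders, not names from the original post:

```shell
# Create a table, insert rows, then run a DISTINCT query. With the engine
# set to spark, the console log shows Spark stages instead of MapReduce jobs.
beeline -u "jdbc:hive2://hs2-host:10000/default" -e "
  SET hive.execution.engine=spark;
  CREATE TABLE hos_test (id INT);
  INSERT INTO hos_test VALUES (1), (2), (2);
  SELECT DISTINCT id FROM hos_test;"
```

The plain `CREATE TABLE` is pure metadata and launches no job; it is the `INSERT` and the `SELECT DISTINCT` (which needs a shuffle for deduplication) that appear as Spark jobs in the log.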

2. In the YARN web UI, the application type shows as SPARK, and the link in the last column opens the Spark UI.

[Screenshots: YARN ResourceManager UI listing applications of type SPARK, with links to the Spark UI]
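Besides the web UI, the same check can be made from the command line on any node with the `yarn` client on its PATH:

```shell
# List running YARN applications whose type is SPARK; the Hive on Spark
# session started by HiveServer2 should appear in this list.
yarn application -list -appTypes SPARK
```

Hive on Spark keeps one long-lived Spark application per session, so a single entry here can serve many Hive queries.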