Oozie 4.3.0 Installation Guide (with Python Spark Action Support)

Date: 2021-12-11 20:49:39

My work requires Oozie's Spark action, and Oozie 4.2.0 does not automatically load the jars under a workflow directory's lib subdirectory, so I built from the oozie-master source (version oozie 4.3.0-SNAPSHOT).
To make the Oozie Spark action support Python files, I also modified a few source files; the changes are explained below.


1. Installation Environment

centos: 6.6
jdk: 1.8.0_25
maven: 3.3.9
hadoop: 2.6.0
spark: 1.6.0

For convenience, the root account is used throughout.

2. Building the Package
2.1) Install and configure Maven
Download Maven 3.3.9:

mkdir ~/download   
cd ~/download
wget http://apache.opencas.org/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
tar -zxvf apache-maven-3.3.9-bin.tar.gz -C /opt/
mv /opt/apache-maven-3.3.9 /opt/maven

Add Maven's bin directory to the PATH by appending two lines to /etc/profile:
export MAVEN_HOME=/opt/maven
export PATH=$PATH:$MAVEN_HOME/bin

Save, exit, and apply the change:
source /etc/profile
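
To confirm the setup, check that Maven resolves on the PATH and picks up the expected JDK:

mvn -version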


Edit Maven's settings.xml (conf/settings.xml under the Maven install, or ~/.m2/settings.xml; the <mirror> element below goes inside its <mirrors> section) to use the OSChina mirror for faster downloads. The mirror is not complete; if some artifacts appear to be missing, adjust the mirrorOf value, which is not a concern for this build:

<mirror>
<id>nexus-osc</id>
<name>OSChina Central</name>
<url>http://maven.oschina.net/content/groups/public/</url>
<mirrorOf>*</mirrorOf>
</mirror>


2.2) Download and install Pig
Download Pig:

cd ~/download
wget http://archive.apache.org/dist/pig/pig-0.13.0/pig-0.13.0.tar.gz
tar -zxvf pig-0.13.0.tar.gz -C /opt/
mv /opt/pig-0.13.0 /opt/pig

Add Pig's bin directory to the PATH by appending two lines to /etc/profile:

export PIG_HOME=/opt/pig
export PATH=$PATH:$PIG_HOME/bin
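
After running source /etc/profile again, confirm that Pig resolves:

pig -version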


2.3) Check out the Oozie master branch (the 4.3.0-SNAPSHOT sources):

cd ~/download
git clone https://github.com/apache/oozie.git
cd oozie

2.4) Edit pom.xml in the top-level source directory; the following properties need to be changed:

<targetJavaVersion>1.8</targetJavaVersion>
<hadoop.version>2.6.0</hadoop.version>
<hadoop.majorversion>2</hadoop.majorversion>
<pig.version>0.13.0</pig.version>
<maven.javadoc.opts>-Xdoclint:none</maven.javadoc.opts>
<spark.version>1.6.0</spark.version>

2.5) Patch the source (so that the Oozie Spark action supports Python)
File 1: ~/download/oozie/core/src/main/java/org/apache/oozie/action/hadoop/JavaActionExecutor.java
Line 568:
else if (fileName.endsWith(".jar")) { // .jar files
becomes
else if (fileName.endsWith(".jar") || fileName.endsWith(".py")) { // .jar or .py files


File 2:
~/download/oozie/sharelib/spark/src/main/java/org/apache/oozie/action/hadoop/SparkMain.java
Line 221:
if (!path.startsWith("job.jar") && path.endsWith(".jar")) {
becomes
if (!path.startsWith("job.jar") && (path.endsWith(".jar") || path.endsWith(".py"))) {

2.6) Build and package with:

bin/mkdistro.sh -DskipTests -Phadoop-2 -Dhadoop.version=2.6.0

2.7) When the build finishes, the distribution is under distro/target; in my case the file is oozie-4.3.0-SNAPSHOT-distro.tar.gz.


3. Installing the Oozie Server

3.1) Unpack oozie-4.3.0-SNAPSHOT-distro.tar.gz into /usr/local/ and rename the directory to oozie:

tar -zxvf distro/target/oozie-4.3.0-SNAPSHOT-distro.tar.gz -C /usr/local/
mv /usr/local/oozie-4.3.0-SNAPSHOT /usr/local/oozie

3.2) In /usr/local/oozie, unpack the client, examples, and sharelib tarballs:

cd /usr/local/oozie
tar -zxvf oozie-client-4.3.0-SNAPSHOT.tar.gz
tar -zxvf oozie-examples.tar.gz
tar -zxvf oozie-sharelib-4.3.0-SNAPSHOT.tar.gz

3.3) Create a /user/oozie directory on HDFS and upload the share directory into it:

hadoop fs -mkdir /user/oozie
hadoop fs -copyFromLocal /usr/local/oozie/share /user/oozie
hadoop fs -ls /user/oozie

3.4) Create a libext directory under /usr/local/oozie and copy the jars from Hadoop's share directories into it:

cd /usr/local/oozie
mkdir libext
cp ${HADOOP_HOME}/share/hadoop/*/*.jar libext/
cp ${HADOOP_HOME}/share/hadoop/*/lib/*.jar libext/

If prompted about overwriting duplicates, press Enter to skip them.
To avoid classpath conflicts, delete the following jars from libext (see the cleanup sketch below):
servlet-api-2.5.jar
jasper-compiler-5.5.23.jar
jasper-runtime-5.5.23.jar
jsp-api-2.1.jar
Also search /usr/local/oozie for jetty-all jars:
find /usr/local/oozie -name 'jetty-all*.jar'
Delete any that are found.
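
Putting the cleanup together, a short sketch (the jar versions match the list above; yours may differ):

cd /usr/local/oozie
rm -f libext/servlet-api-2.5.jar libext/jasper-compiler-5.5.23.jar libext/jasper-runtime-5.5.23.jar libext/jsp-api-2.1.jar
find /usr/local/oozie -name 'jetty-all*.jar' -delete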


3.5) Copy mysql-connector-java-5.1.38.jar (use the version matching your MySQL installation) and ext-2.2.zip into /usr/local/oozie/libext.

3.6) Build the war; in /usr/local/oozie/bin run:

./oozie-setup.sh prepare-war

The war file is placed in /usr/local/oozie/oozie-server/webapps.

4. Configuring Oozie
4.1) Set environment variables
Append the following to /etc/profile:

export OOZIE_HOME=/usr/local/oozie
export CATALINA_HOME=/usr/local/oozie/oozie-server
export PATH=${CATALINA_HOME}/bin:${OOZIE_HOME}/bin:$PATH
export OOZIE_URL=http://localhost:11000/oozie
export OOZIE_CONFIG=/usr/local/oozie/conf

4.2) Edit /usr/local/oozie/conf/oozie-site.xml as follows:

<configuration>

<!-- Proxyuser Configuration -->

<property>
<name>oozie.service.ProxyUserService.proxyuser.hue.hosts</name>
<value>*</value>
<description>
List of hosts the '#USER#' user is allowed to perform 'doAs'
operations.

The '#USER#' must be replaced with the username of the user who is
allowed to perform 'doAs' operations.

The value can be the '*' wildcard or a list of hostnames.

For multiple users copy this property and replace the user name
in the property name.
</description>
</property>

<property>
<name>oozie.service.ProxyUserService.proxyuser.hue.groups</name>
<value>*</value>
<description>
List of groups the '#USER#' user is allowed to impersonate users
from to perform 'doAs' operations.

The '#USER#' must be replaced with the username of the user who is
allowed to perform 'doAs' operations.

The value can be the '*' wildcard or a list of groups.

For multiple users copy this property and replace the user name
in the property name.
</description>
</property>

<property>
<name>oozie.db.schema.name</name>
<value>oozie</value>
<description>
Oozie DataBase Name
</description>
</property>

<property>
<name>oozie.service.JPAService.create.db.schema</name>
<value>false</value>
<description>
</description>
</property>

<property>
<name>oozie.service.JPAService.jdbc.driver</name>
<value>com.mysql.jdbc.Driver</value>
<description>
JDBC driver class.
</description>
</property>

<property>
<name>oozie.service.JPAService.jdbc.url</name>
<value>jdbc:mysql://localhost:3306/${oozie.db.schema.name}</value>
<description>
JDBC URL.
</description>
</property>

<property>
<name>oozie.service.JPAService.jdbc.username</name>
<value>oozie</value>
<description>
DB user name.
</description>
</property>

<property>
<name>oozie.service.JPAService.jdbc.password</name>
<value>oozie</value>
<description>
DB user password.
IMPORTANT: if the password is empty, leave a one-space string; the service trims the value,
and if it is empty Configuration assumes it is NULL.
</description>
</property>

<property>
<name>oozie.service.HadoopAccessorService.hadoop.configurations</name>
<value>*=/usr/local/hadoop/etc/hadoop</value>
</property>

<property>
<name>oozie.service.HadoopAccessorService.action.configurations</name>
<value>*=/usr/local/hadoop/etc/hadoop</value>
</property>

<property>
<name>oozie.service.SparkConfigurationService.spark.configurations</name>
<value>*=/usr/local/spark/conf</value>
</property>

<property>
<name>oozie.service.WorkflowAppService.system.libpath</name>
<value>/user/oozie/share/lib</value>
</property>

<property>
<name>oozie.use.system.libpath</name>
<value>true</value>
<description>
Default value of oozie.use.system.libpath. If a user has not specified oozie.use.system.libpath
in job.properties and this value is true, Oozie will include the sharelib jars for the workflow.
</description>
</property>

<property>
<name>oozie.subworkflow.classpath.inheritance</name>
<value>true</value>
</property>

</configuration>

4.3) Set up the MySQL database, then generate the Oozie database script (an oozie.sql file will be created in /usr/local/oozie/bin):

mysql -u root -proot       (log in to MySQL; adjust the user name and password as needed)
create database oozie;     (create a database named oozie)
grant all privileges on oozie.* to 'oozie'@'localhost' identified by 'oozie';   (create a user named oozie with password oozie and grant it access from localhost)
grant all privileges on oozie.* to 'oozie'@'%' identified by 'oozie';   (grant the same access from any host)
FLUSH PRIVILEGES;

In /usr/local/oozie/bin, generate the script:

./ooziedb.sh create -sqlfile oozie.sql

Then run the script; this creates Oozie's tables in the oozie database:
./oozie-setup.sh db create -run -sqlfile /usr/local/oozie/bin/oozie.sql
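
As a sanity check, the new tables can be listed with the credentials configured above:

mysql -u oozie -poozie -e 'show tables;' oozie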


4.4) On the Hadoop cluster's namenode, edit core-site.xml and add:

<property>
<name>hadoop.proxyuser.oozie.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.oozie.groups</name>
<value>*</value>
</property>

After this change there is no need to restart the Hadoop cluster; just run:

hdfs dfsadmin -refreshSuperUserGroupsConfiguration
yarn rmadmin -refreshSuperUserGroupsConfiguration

4.5) Start Oozie with the following commands:
cd /usr/local/oozie
bin/oozied.sh start
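
If startup succeeded, the admin status command (using the OOZIE_URL set in 4.1) should report that the system is up:

bin/oozie admin -oozie http://localhost:11000/oozie -status
The expected output is: System mode : NORMAL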

5. Follow-up Notes
5.1) Modifying the Oozie source so that the Oozie Spark action supports Python
Symptom: when a Python Spark action is submitted, the system cannot find the Python file.
Cause: Oozie starts the Spark driver from an executor node of a MapReduce2 (YARN) launcher job. When that launcher job starts, the Python files under the workflow's lib directory are not added to YARN's distributed file list, so the Spark job fails with a missing Python file. The source changes in 2.5 add the .py files under lib to YARN's distributed file list so they are accessible when Spark starts; a workflow sketch follows below.
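
For illustration, a minimal workflow sketch under the assumptions of this setup (the application path /user/root/pyspark-pi, the pi.py script, and the ${jobTracker}/${nameNode} placeholders are hypothetical). The point is that the Python file sits in the workflow's lib/ directory on HDFS so the patched code distributes it:

cat > workflow.xml <<'EOF'
<workflow-app name="pyspark-pi" xmlns="uri:oozie:workflow:0.5">
<start to="spark-node"/>
<action name="spark-node">
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>yarn-cluster</master>
<name>pyspark-pi</name>
<jar>pi.py</jar>
</spark>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail"><message>${wf:errorMessage(wf:lastErrorNode())}</message></kill>
<end name="end"/>
</workflow-app>
EOF
# upload the app: workflow.xml at the root, pi.py under lib/ so it lands on YARN's distributed file list
hadoop fs -mkdir -p /user/root/pyspark-pi/lib
hadoop fs -put workflow.xml /user/root/pyspark-pi/
hadoop fs -put pi.py /user/root/pyspark-pi/lib/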

5.2) The pi file is found, but the pyspark module still cannot be found
Symptom: analysis showed that the SPARK_HOME variable could not be resolved at run time, even though SPARK_HOME was set in /etc/profile and /usr/local/spark/conf/spark-env.sh on every YARN node.
Cause: /etc/profile is not read by remote (non-login) execution, and spark-env.sh is only read when Spark itself launches a job. An Oozie Spark action first starts a MapReduce2 (YARN) launcher job and then starts the Spark driver on one of its executor nodes; at that point the Python packages must be loaded, but since spark-env.sh was never sourced, SPARK_HOME is not set.
Fix: set the variable in yarn-env.sh:

export SPARK_HOME=/usr/local/spark
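
To round things off, a hypothetical job.properties and submit command for the sketch in 5.1 (host names and ports follow the defaults of this single-node setup and may need adjusting):

cat > job.properties <<'EOF'
nameNode=hdfs://localhost:8020
jobTracker=localhost:8032
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/root/pyspark-pi
EOF
oozie job -oozie http://localhost:11000/oozie -config job.properties -run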