Spark Compile
// Because Spark 1.5 requires Maven 3.3.3, I track branch-1.4 instead
git branch -a
git checkout --track origin/branch-1.4
git tag
git checkout v1.4.1
// Building for Scala 2.11
./dev/change-version-to-2.11.sh
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
// edit sql/catalyst/pom.xml and replace the quasiquotes_2.10 artifactId with quasiquotes_2.11
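// a minimal sketch of that edit (assumes the stock branch-1.4 pom; check the diff before building):
sed -i 's/quasiquotes_2.10/quasiquotes_2.11/' sql/catalyst/pom.xml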
mvn clean package -DskipTests -Pscala-2.11 -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.2.0
// some other options: -Psbt -Pjava8-tests -Phive-thriftserver -Ptest-java-home
// Building a Runnable Distribution
./make-distribution.sh --name custom-spark --tgz -DskipTests -Pscala-2.11 -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.2.0
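// if the build succeeds, the tarball name should follow the spark-<version>-bin-<name>.tgz pattern (a sketch; verify the actual file name):
tar -xzf spark-1.4.1-bin-custom-spark.tgz
cd spark-1.4.1-bin-custom-spark && ./bin/spark-shell --master local[2]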
Note
- If you hit a compile error in the hive-thriftserver module, add the following dependency to the module's pom:
  <dependency>
    <groupId>jline</groupId>
    <artifactId>jline</artifactId>
    <version>0.9.94</version>
  </dependency>
Configuration
Examples ($SPARK_HOME/conf/spark-defaults.conf; a launch-time equivalent is sketched after the list):
spark.master spark://master:7077
spark.master yarn-client
spark.eventLog.enabled true
spark.eventLog.dir hdfs://dmp.zamplus.net:9000/logs/spark
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 2g
spark.executor.extraJavaOptions -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
spark.yarn.jar hdfs://dmp.zamplus.net:9000/libs/spark-assembly-1.4.1-hadoop2.4.0.jar
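These entries can also be passed at launch time; command-line flags take precedence over spark-defaults.conf (a minimal sketch):
$SPARK_HOME/bin/spark-shell --master yarn-client --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=hdfs://dmp.zamplus.net:9000/logs/spark --conf spark.serializer=org.apache.spark.serializer.KryoSerializer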
Help
- Spark website: http://spark.apache.org/
- Configuration: http://spark.apache.org/docs/latest/configuration.html
Spark default Configuration (the flags below are combined into one spark-submit example after this list)
- executors
  - --num-executors (default: 2)
  - --executor-cores (default: 1)
- memory
  - --driver-memory 4g
  - --executor-memory 2g
- Java opts
  - -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
  - spark.driver.extraJavaOptions -XX:PermSize=128M -XX:MaxPermSize=256M (same as --driver-java-options on the command line)
- spark.serializer
  - default: org.apache.spark.serializer.JavaSerializer (KryoSerializer, as set above, is the usual recommendation)
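A hedged example combining the flags above (the class and jar names are placeholders):
$SPARK_HOME/bin/spark-submit --master yarn-client \
  --num-executors 4 --executor-cores 2 \
  --driver-memory 4g --executor-memory 2g \
  --driver-java-options "-XX:PermSize=128M -XX:MaxPermSize=256M" \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  --class com.example.MyApp myapp.jar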
[TODO]
- I don't know why, when I use the spark-shell script, I must add the parameter -Dspark.master=spark://dmp.zamplus.net:7077. This really puzzled me. (A possible workaround is sketched below.)
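Since spark-shell forwards --master to spark-submit, this should be equivalent to the -D system property:
$SPARK_HOME/bin/spark-shell --master spark://dmp.zamplus.net:7077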
Startup script execution (each command below is what the previous one expands to):
$SPARK_HOME/bin/spark-shell
$SPARK_HOME/bin/spark-submit --class org.apache.spark.repl.Main
$SPARK_HOME/bin/spark-class org.apache.spark.deploy.SparkSubmit --class org.apache.spark.repl.Main
$JAVA_HOME/bin/java -cp $SPARK_HOME/lib/spark-assembly-1.4.1-hadoop2.4.0.jar org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --class org.apache.spark.repl.Main
$JAVA_HOME/bin/java -cp $SPARK_HOME/conf/:$SPARK_HOME/lib/spark-assembly-1.4.1-hadoop2.4.0.jar:$SPARK_HOME/lib/datanucleus-api-jdo-3.2.6.jar:$SPARK_HOME/lib/datanucleus-core-3.2.10.jar:$SPARK_HOME/lib/datanucleus-rdbms-3.2.9.jar:/home/wankun/hadoop/etc/hadoop/ -Xms2g -Xmx2g -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit --class org.apache.spark.repl.Main spark-shell
Notes
- The command emitted by org.apache.spark.launcher.Main is separated by '\0' characters.
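To see the exact command the launcher produces, the NUL separators can be made visible with tr (a sketch, assuming the same assembly jar as above):
$JAVA_HOME/bin/java -cp $SPARK_HOME/lib/spark-assembly-1.4.1-hadoop2.4.0.jar org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --class org.apache.spark.repl.Main spark-shell | tr '\0' '\n'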
FAQ
- Q1
Invalid initial heap size: -Xms2g
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
This error is caused by a misconfiguration in spark-defaults.conf: there were two spaces after the spark.driver.memory parameter.
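A quick way to spot this kind of whitespace problem (a sketch; adjust the path to your install):
grep -n '  ' $SPARK_HOME/conf/spark-defaults.conf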