Spark Source Code Reading (1): Startup Scripts

Date: 2022-04-19 18:17:03

Spark Compile


# Spark 1.5 requires Maven 3.3.3, so I track branch-1.4 instead
git branch -a
git checkout --track origin/branch-1.4
git tag 
git checkout v1.4.1

# Building for Scala 2.11
./dev/change-version-to-2.11.sh 
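# (Assumption based on the script's name and the build docs: it rewrites the
#  Scala-version suffix in every module's pom.xml, roughly equivalent to:
#    find . -name pom.xml | xargs sed -i -e 's/\(artifactId.*\)_2\.10/\1_2.11/g')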

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"

# Edit sql/catalyst/pom.xml and replace the quasiquotes_2.10 artifactId name
mvn clean package -DskipTests -Pscala-2.11 -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.2.0
# Some other options: -Psbt -Pjava8-tests -Phive-thriftserver -Ptest-java-home

# Building a runnable distribution
./make-distribution.sh --name custom-spark --tgz -DskipTests -Pscala-2.11 -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.2.0  
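
The distribution lands in a tarball named spark-<version>-bin-<name>.tgz; the exact file name below is an assumption based on the version and --name used above:

# Sanity-check the packaged distribution before deploying it
tar -tzf spark-1.4.1-bin-custom-spark.tgz | head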

Note

  • If you hit a compile error in the hive-thriftserver module, add the following dependency to its pom:
    <dependency>
      <groupId>jline</groupId>
      <artifactId>jline</artifactId>
      <version>0.9.94</version>
    </dependency>

Configuration

Examples

# Either standalone or YARN client mode; only one spark.master takes effect
spark.master                     spark://master:7077
# spark.master                   yarn-client
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://dmp.zamplus.net:9000/logs/spark
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.driver.memory              2g
spark.executor.extraJavaOptions  -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

spark.yarn.jar                    hdfs://dmp.zamplus.net:9000/libs/spark-assembly-1.4.1-hadoop2.4.0.jar
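
For spark.yarn.jar to take effect, the assembly has to actually exist at that HDFS path; uploading it once avoids re-shipping the jar on every YARN submission. A minimal sketch, assuming the paths above:

# Upload the assembly once so YARN containers fetch it from HDFS
hadoop fs -mkdir -p hdfs://dmp.zamplus.net:9000/libs
hadoop fs -put $SPARK_HOME/lib/spark-assembly-1.4.1-hadoop2.4.0.jar hdfs://dmp.zamplus.net:9000/libs/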

Helpful Notes

Spark Default Configuration

  • executors
    • --num-executors (default: 2)
    • --executor-cores (default: 1)
  • memory
    • --driver-memory 4g
    • --executor-memory 2g
  • Java OPTS
    • -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
    • spark.driver.extraJavaOptions -XX:PermSize=128M -XX:MaxPermSize=256M (same as --driver-java-options on the command line)
  • spark.serializer
    • default: org.apache.spark.serializer.JavaSerializer; org.apache.spark.serializer.KryoSerializer (set in the example above) is the common high-performance override
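
Taken together, a submission that overrides each of these defaults looks roughly like the following; the application class and jar are hypothetical placeholders:

# Illustrative values only; tune per cluster
$SPARK_HOME/bin/spark-submit \
  --num-executors 4 \
  --executor-cores 2 \
  --driver-memory 4g \
  --executor-memory 2g \
  --driver-java-options "-XX:PermSize=128M -XX:MaxPermSize=256M" \
  --class com.example.MyApp \
  /path/to/my-app.jar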

[TODO]

  • I don't know why, when I use the spark-shell script, I must add the parameter -Dspark.master=spark://dmp.zamplus.net:7077. This really puzzled me. (One thing to try is sketched below.)
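
Since spark-shell forwards its options to spark-submit, the master can normally be set with the --master flag or in spark-defaults.conf instead of a raw JVM property; whether that removes the need for -Dspark.master here is unverified, so treat this as a sketch:

# Two usual ways to point spark-shell at a standalone master
$SPARK_HOME/bin/spark-shell --master spark://dmp.zamplus.net:7077
# or set it once in conf/spark-defaults.conf:
#   spark.master    spark://dmp.zamplus.net:7077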

Startup script execution

  • $SPARK_HOME/bin/spark-shell
  • $SPARK_HOME/bin/spark-submit --class org.apache.spark.repl.Main
  • $SPARK_HOME/bin/spark-class org.apache.spark.deploy.SparkSubmit --class org.apache.spark.repl.Main
  • $JAVA_HOME/java -cp $SPARK_HOME/lib/spark-assembly-1.4.1-hadoop2.4.0.jar org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --class org.apache.spark.repl.Main
  • $JAVA_HOME/java -cp $SPARK_HOME/conf/:$SPARK_HOME/lib/spark-assembly-1.4.1-hadoop2.4.0.jar:$SPARK_HOME/lib/datanucleus-api-jdo-3.2.6.jar:$SPARK_HOME/lib/datanucleus-core-3.2.10.jar:$SPARK_HOME/lib/datanucleus-rdbms-3.2.9.jar:/home/wankun/hadoop/etc/hadoop/ -Xms2g -Xmx2g -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit --class org.apache.spark.repl.Main spark-shell

Notes

  • The command emitted by org.apache.spark.launcher.Main is a list of arguments separated by '\0' characters; spark-class reads it back into an argument array before exec-ing the JVM.
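
The relevant fragment of bin/spark-class looks roughly like this on the 1.4 branch (a paraphrase, not an exact quote; RUNNER and LAUNCH_CLASSPATH are defined earlier in the script):

# Collect the '\0'-separated argument list printed by the launcher
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
exec "${CMD[@]}"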

FAQ

  • Q1

Invalid initial heap size: -Xms2g
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

This error is caused by a bad entry in spark-defaults.conf: trailing spaces after the spark.driver.memory value get carried into the heap flag, so the JVM rejects what looks like a perfectly valid -Xms2g.
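
A quick way to catch this class of mistake is to scan the config for trailing whitespace (a minimal sketch, assuming the standard conf location):

# Show line numbers of any entries ending in whitespace
grep -nE '[[:space:]]+$' $SPARK_HOME/conf/spark-defaults.conf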