Chapter 1: Introduction to Data Analysis with Spark
The components of Spark:
Spark Core: contains the basic functionality of Spark; it is also home to the API that defines RDDs.
Spark SQL (structured data): the package for working with structured data. It allows querying data via SQL as well as the Apache Hive variant of SQL, and it supports many sources of data, including Hive tables, Parquet, and JSON. It also allows developers to intermix SQL queries with the programmatic data manipulations supported by RDDs in Python, Java, and Scala.
Spark Streaming (real-time): enables processing of live streams of data.
MLlib (machine learning): a library containing common machine learning functionality.
GraphX (graph processing): a library for manipulating graphs.
A Brief History of Spark
Spark is an open source project that has been built, and is maintained, by a thriving and diverse community of developers.
Chapter 2: Downloading Spark and Getting Started
This chapter walks through the process of downloading and running Spark in local mode on a single computer.
You don't need to master Scala, Java, or Python. Spark itself is written in Scala and runs on the Java Virtual Machine (JVM). To run Spark on either your laptop or a cluster, all you need is an installation of Java 6 or newer. If you wish to use the Python API you will also need a Python interpreter (version 2.6 or newer). Spark does not yet work with Python 3.
To download Spark, select the package "Pre-built for Hadoop 2.4 and later".
Tips:
Windows users may run into issues installing Spark; you can use a zip tool to untar the .tar file. Note: install Spark in a directory with no spaces (e.g., C:\spark).
After you untar the file you will get a new directory with the same name, but without the .tar suffix.
Note:
Most of this book includes code in all of Spark’s languages, but interactive shells are
available only in Python and Scala. Because a shell is very useful for learning the API, we recommend using one of these languages for these examples even if you are a Java
developer. The API is similar in every language.
Change into the Spark directory and type bin\pyspark; you will see the Spark logo.
Introduction to Core Spark Concepts
Driver program
|----your application
|----distributed datasets that you defined
Usually we apply many operations on these datasets.
***In the preceding example, the driver program was the Spark shell itself, and you could type in the operations you wanted.
***The driver program accesses Spark through a SparkContext object, which represents the connection to a computing cluster. In pyspark, the SparkContext is automatically created for you as the variable sc; you can print information about this object by typing "sc". Note that the SparkContext API comes in three flavors, for Java, Python, and Scala respectively.
Once you have a SparkContext, you can build RDDs and invoke operations on them, such as count() and first(). The driver program typically manages a number of nodes called executors. When you call an operation on a cluster, different machines might count different ranges of the file. Because we ran the Spark shell locally, it executed all its work on a single machine.
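As a rough plain-Python analogy (no cluster needed; the file contents below are invented for illustration), count() and first() behave like the corresponding operations on a list of lines:

```python
# Invented stand-in for the lines of README.md (illustration only).
lines = ["# Apache Spark", "", "Spark is a fast and general engine for large-scale data."]

# On an RDD: lines.count() returns the number of elements,
# and lines.first() returns the first element.
count = len(lines)   # analogous to lines.count()
first = lines[0]     # analogous to lines.first()
print(count, first)
```

On a real cluster the counting would be split across executors, but the result is the same.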
Passing Functions to Spark
Look at the following example in Python:
lines = sc.textFile("README.md")
pythonLines = lines.filter(lambda line: "Python" in line)
pythonLines.first()
If you are unfamiliar with the lambda syntax: it is a shorthand way to define a function inline in Python or Scala. Alternatively, you can define the function separately and pass its name to Spark, like this:
def hasPython(line):
    # return whether the line contains "Python"
    return "Python" in line

pythonLines = lines.filter(hasPython)
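The lambda form and the named-function form are interchangeable; you can check this in plain Python without Spark (the sample lines below are made up, and Python's built-in filter() stands in for RDD.filter()):

```python
def has_python(line):
    # Return whether the line mentions "Python".
    return "Python" in line

sample = ["Spark runs on the JVM", "the Python API needs an interpreter"]

# The inline lambda and the named function select exactly the same lines.
by_lambda = list(filter(lambda line: "Python" in line, sample))
by_name = list(filter(has_python, sample))
print(by_lambda == by_name)  # both are ["the Python API needs an interpreter"]
```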
Of course you can also write this in Java, but there functions are defined as classes implementing the interface Function:
JavaRDD<String> pythonLines = lines.filter(new Function<String, Boolean>()
{
    public Boolean call(String line)
    {
        return line.contains("Python");
    }
});
Nowadays Java 8 also supports lambda expressions, which make this much more concise.
Spark automatically takes your function (e.g., line.contains("Python")) and ships it to executor nodes. Thus, you can write code in a single driver program and automatically have parts of it run on multiple nodes.
Standalone Applications
Apart from running interactively, Spark can be linked into standalone applications in Java, Python, or Scala. The main difference from using it in the shell is that you need to initialize your own SparkContext; after that, the functions are the same. Remember, when you use the shell, the SparkContext is created automatically for you as "sc" and you can use it directly.
The process of linking to Spark varies by language. In Java and Scala, you give your application a Maven dependency on the spark-core artifact. Maven is a popular package management tool for JVM-based languages that lets you link to libraries in public repositories. You can use Maven itself to build your project, or use other tools that can talk to Maven repositories, including Scala's sbt or Gradle. Popular IDEs like Eclipse also allow you to directly add a Maven dependency to a project.
In Python, you simply write applications as Python scripts, but you must run them using the bin/spark-submit script included in Spark. The spark-submit script includes the Spark dependencies for us in Python; what's more, it sets up the environment for Spark's Python API to function. Simply run your script like this:
bin/spark-submit my_script.py
(Note that you will have to use backslashes instead of forward slashes on Windows.)
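For reference, a minimal my_script.py that spark-submit could run might look like the sketch below. It initializes its own SparkContext via SparkConf, which is what distinguishes a standalone application from the shell (the app name "MyApp" and the "local" master setting are example values):

```python
# Run with: bin/spark-submit my_script.py  (my_script.py is an example name)

def main():
    # Import inside the function so this sketch parses even where pyspark
    # is not installed; spark-submit sets up the real environment.
    from pyspark import SparkConf, SparkContext

    # SparkConf holds the application settings; "local" means run on one machine.
    conf = SparkConf().setMaster("local").setAppName("MyApp")
    sc = SparkContext(conf=conf)  # in the shell, this object already exists as sc

    lines = sc.textFile("README.md")
    print(lines.filter(lambda line: "Python" in line).count())

    sc.stop()  # shut Spark down cleanly when done
```

In a real script you would call main() at the bottom; everything after creating sc works exactly as it did in the shell examples above.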