A First Look at the Spark Streaming Source Code (1)



Having previously studied the Spark Core source code, I will spend the next while digging into Spark Streaming. Let's start from the simplest possible streaming program (Receiver mode) as the entry point. The code is as follows:

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Created by Daxin on 2017/8/4.
 */
object StreamMain {

  def main(args: Array[String]) {
    val conf = new SparkConf()
    conf.setAppName("socketStream")
    // Receiver mode, so the local thread count must be greater than 1; otherwise
    // we can only receive data and never process it.
    conf.setMaster("local[*]") // with conf.setMaster("local[1]") no thread would be left for computation
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")
    val ssc = new StreamingContext(sc, Seconds(2))
    // TODO Ultimately creates and returns a SocketInputDStream
    val line = ssc.socketTextStream("node", 9999)
    val result = line.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    result.print()
    // TODO Core entry point: org.apache.spark.streaming.StreamingContext.start starts the
    // JobScheduler, which begins receiving data and running the computation
    ssc.start()
    ssc.awaitTermination()
  }

}

The program above breaks down into three parts: the boilerplate that creates the ssc, the creation of the DStream and the operators applied to it, and finally starting the ssc. The rest of this post walks through these three parts.


1: Creating the StreamingContext

Just as SparkContext is the entry point for Spark batch processing, StreamingContext is the entry point for Spark Streaming. It provides the functions for creating DStreams, and calling its start method kicks off the streaming computation.


The primary constructor:

class StreamingContext private[streaming] (
    _sc: SparkContext,
    _cp: Checkpoint,
    _batchDur: Duration
  ) extends Logging


The auxiliary constructors:

  /**
* Create a StreamingContext using an existing SparkContext.
* @param sparkContext existing SparkContext
* @param batchDuration the time interval at which streaming data will be divided into batches
*/
def this(sparkContext: SparkContext, batchDuration: Duration) = {
this(sparkContext, null, batchDuration)
}

/**
* Create a StreamingContext by providing the configuration necessary for a new SparkContext.
* @param conf a org.apache.spark.SparkConf object specifying Spark parameters
* @param batchDuration the time interval at which streaming data will be divided into batches
*/
def this(conf: SparkConf, batchDuration: Duration) = {
this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
}

/**
* Create a StreamingContext by providing the details necessary for creating a new SparkContext.
* @param master cluster URL to connect to (e.g. mesos://host:port, spark://host:port, local[4]).
* @param appName a name for your job, to display on the cluster web UI
* @param batchDuration the time interval at which streaming data will be divided into batches
*/
def this(
master: String,
appName: String,
batchDuration: Duration,
sparkHome: String = null,
jars: Seq[String] = Nil,
environment: Map[String, String] = Map()) = {
this(StreamingContext.createNewSparkContext(master, appName, sparkHome, jars, environment),
null, batchDuration)
}

/**
* Recreate a StreamingContext from a checkpoint file.
* @param path Path to the directory that was specified as the checkpoint directory
* @param hadoopConf Optional, configuration object if necessary for reading from
* HDFS compatible filesystems
*/
def this(path: String, hadoopConf: Configuration) =
this(null, CheckpointReader.read(path, new SparkConf(), hadoopConf).orNull, null)

/**
* Recreate a StreamingContext from a checkpoint file.
* @param path Path to the directory that was specified as the checkpoint directory
*/
def this(path: String) = this(path, SparkHadoopUtil.get.conf)

/**
* Recreate a StreamingContext from a checkpoint file using an existing SparkContext.
* @param path Path to the directory that was specified as the checkpoint directory
* @param sparkContext Existing SparkContext
*/
def this(path: String, sparkContext: SparkContext) = {
this(
sparkContext,
CheckpointReader.read(path, sparkContext.conf, sparkContext.hadoopConfiguration).orNull,
null)
}

Every constructor ultimately needs a SparkContext, which makes it clear that Spark Streaming is a stream-processing engine built on top of Spark Core.
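
To make the constructor variants concrete, here is a small sketch (StreamingContextCtors and the checkpoint path "ckptDir" are illustrative names, not from the original post) showing the path the opening example takes plus the common alternatives:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingContextCtors {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ctor-demo").setMaster("local[*]")

    // Reuse an existing SparkContext -- the path taken by StreamMain above
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(2))

    // Alternatives (only one SparkContext may be active per JVM, so pick one):
    //   new StreamingContext(conf, Seconds(2))              // let it create the SparkContext itself
    //   StreamingContext.getOrCreate("ckptDir",             // recover from a checkpoint directory,
    //     () => new StreamingContext(conf, Seconds(2)))     // or build a fresh context if none exists

    ssc.stop(stopSparkContext = true)
  }
}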


2: Creating the DStream

Much like SparkContext, StreamingContext provides a set of functions for creating its abstract dataset, the DStream. For example:

  /**
   * Creates an input stream from TCP source hostname:port. Data is received using
   * a TCP socket and the receive bytes is interpreted as UTF8 encoded `\n` delimited
   * lines.
   * @param hostname      Hostname to connect to for receiving data
   * @param port          Port to connect to for receiving data
   * @param storageLevel  Storage level to use for storing the received objects
   *                      (default: StorageLevel.MEMORY_AND_DISK_SER_2)
   * @see [[socketStream]]
   */
  def socketTextStream(
      hostname: String,
      port: Int,
      storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
    ): ReceiverInputDStream[String] = withNamedScope("socket text stream") {
    socketStream[String](hostname, port, SocketReceiver.bytesToLines, storageLevel)
  }

This section analyzes how a DStream is created in the Receiver-based model, using the socket-based DStream as the main example. First, a look at the DStream class hierarchy:

[Figure: DStream class inheritance hierarchy]

Note: the Flume-related DStream classes come from Spark Streaming's Flume integration library, so that dependency has to be added separately.

The diagram shows that every Receiver-based DStream extends ReceiverInputDStream (whereas, for example, the direct-mode Kafka DStream extends InputDStream directly).


Now let's trace how the following line creates a DStream:

val line = ssc.socketTextStream("node", 9999)
  /**
   * Creates an input stream from TCP source hostname:port. Data is received using
   * a TCP socket and the receive bytes is interpreted as UTF8 encoded `\n` delimited
   * lines.
   * @param hostname      Hostname to connect to for receiving data
   * @param port          Port to connect to for receiving data
   * @param storageLevel  Storage level to use for storing the received objects
   *                      (default: StorageLevel.MEMORY_AND_DISK_SER_2)
   * @see [[socketStream]]
   */
  def socketTextStream(
      hostname: String,
      port: Int,
      storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
    ): ReceiverInputDStream[String] = withNamedScope("socket text stream") {
    socketStream[String](hostname, port, SocketReceiver.bytesToLines, storageLevel)
  }

  /**
   * Creates an input stream from TCP source hostname:port. Data is received using
   * a TCP socket and the receive bytes it interpreted as object using the given
   * converter.
   * @param hostname      Hostname to connect to for receiving data
   * @param port          Port to connect to for receiving data
   * @param converter     Function to convert the byte stream to objects
   * @param storageLevel  Storage level to use for storing the received objects
   * @tparam T            Type of the objects received (after converting bytes to objects)
   */
  def socketStream[T: ClassTag](
      hostname: String,
      port: Int,
      converter: (InputStream) => Iterator[T],
      storageLevel: StorageLevel
    ): ReceiverInputDStream[T] = {
    new SocketInputDStream[T](this, hostname, port, converter, storageLevel)
  }
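
Because socketStream exposes the converter: InputStream => Iterator[T] parameter, a caller can plug in its own decoder. As a rough REPL-style sketch (reusing the ssc from the opening example; utf8Lines is a hypothetical helper that mimics what SocketReceiver.bytesToLines does):

import java.io.{BufferedReader, InputStream, InputStreamReader}
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.ReceiverInputDStream

// Hypothetical converter: decode the socket's byte stream as UTF-8 lines
def utf8Lines(in: InputStream): Iterator[String] = {
  val reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))
  Iterator.continually(reader.readLine()).takeWhile(_ != null)
}

// Equivalent to ssc.socketTextStream("node", 9999), with the converter spelled out
val lines: ReceiverInputDStream[String] =
  ssc.socketStream[String]("node", 9999, utf8Lines, StorageLevel.MEMORY_AND_DISK_SER_2)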


In the end socketTextStream simply delegates to ssc.socketStream, which news up a SocketInputDStream and returns it; no computation is triggered at this point. Next, the SocketInputDStream source:

private[streaming]
class SocketInputDStream[T: ClassTag](
    _ssc: StreamingContext,
    host: String,
    port: Int,
    bytesToObjects: InputStream => Iterator[T],
    storageLevel: StorageLevel
  ) extends ReceiverInputDStream[T](_ssc) {

  def getReceiver(): Receiver[T] = {
    // getReceiver() is invoked later, from
    // org.apache.spark.streaming.scheduler.ReceiverTracker#launchReceivers
    new SocketReceiver(host, port, bytesToObjects, storageLevel)
  }
}


And the SocketReceiver source:

private[streaming]
class SocketReceiver[T: ClassTag](
host: String,
port: Int,
bytesToObjects: InputStream => Iterator[T],
storageLevel: StorageLevel
) extends Receiver[T](storageLevel) with Logging {

private var socket: Socket = _

def onStart() {

logInfo(s"Connecting to $host:$port")
try {
socket = new Socket(host, port)
} catch {
case e: ConnectException =>
restart(s"Error connecting to $host:$port", e)
return
}
logInfo(s"Connected to $host:$port")

// Start the thread that receives data over a connection
new Thread("Socket Receiver") {
setDaemon(true)
override def run() { receive() } // call receive() to start pulling data
}.start()
}

def onStop() {
// in case restart thread close it twice
synchronized {
if (socket != null) {
socket.close()
socket = null
logInfo(s"Closed socket to $host:$port")
}
}
}

/** Create a socket connection and receive data until receiver is stopped */
def receive() {
try {
val iterator = bytesToObjects(socket.getInputStream())
while(!isStopped && iterator.hasNext) {
store(iterator.next()) // the key step: hand each record to Spark
}
if (!isStopped()) {
restart("Socket data stream had no more data")
} else {
logInfo("Stopped receiving")
}
} catch {
case NonFatal(e) =>
logWarning("Error receiving data", e)
restart("Error receiving data", e)
} finally {
onStop()
}
}
}

That completes socketTextStream. At this point org.apache.spark.streaming.dstream.SocketInputDStream#getReceiver has not yet been called, so no SocketReceiver has been created: what ssc.socketTextStream returns is only a description of the data, not the data itself. Actual reception and processing start only after ssc.start() is called.
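
The same Receiver contract that SocketReceiver implements is also the public extension point for user-defined sources. As a hedged sketch (CountingReceiver is a made-up example, not part of Spark), a custom receiver spawns its own thread in onStart(), pushes records with store(), and is wired in through ssc.receiverStream:

import scala.util.control.NonFatal

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical receiver that emits a bounded stream of synthetic records
class CountingReceiver(maxRecords: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {

  def onStart(): Unit = {
    // onStart() must return quickly, so the work happens on a daemon thread
    new Thread("Counting Receiver") {
      setDaemon(true)
      override def run(): Unit = {
        try {
          var i = 0
          while (!isStopped() && i < maxRecords) {
            store(s"record-$i") // hand one record to Spark, just like SocketReceiver does
            i += 1
            Thread.sleep(100)
          }
        } catch {
          case NonFatal(e) => restart("Error generating records", e)
        }
      }
    }.start()
  }

  def onStop(): Unit = {} // nothing to clean up; the thread exits once isStopped() is true
}

// Wiring it up (reusing the ssc from the opening example):
//   val records = ssc.receiverStream(new CountingReceiver(1000))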


3: Starting the StreamingContext (StreamingContext.start())


Before walking through start(), here is a brief description of the components involved:

a: JobScheduler schedules streaming jobs to run on Spark

b: JobGenerator generates the Spark jobs, which are then run in a thread pool


The StreamingContext.start() source:

  /**
* Start the execution of the streams.
*
* @throws IllegalStateException if the StreamingContext is already stopped.
*/
def start(): Unit = synchronized {
state match {
case INITIALIZED =>
startSite.set(DStream.getCreationSite())
StreamingContext.ACTIVATION_LOCK.synchronized {
StreamingContext.assertNoOtherContextIsActive()
try {
validate()

// Start the streaming scheduler in a new thread, so that thread local properties
// like call sites and job groups can be reset without affecting those of the
// current thread.
ThreadUtils.runInNewThread("streaming-start") {
sparkContext.setCallSite(startSite.get)
sparkContext.clearJobGroup()
sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
savedProperties.set(SerializationUtils.clone(sparkContext.localProperties.get()))
//TODO JobScheduler
scheduler.start()
}
state = StreamingContextState.ACTIVE
} catch {
case NonFatal(e) =>
logError("Error starting the context, marking it as stopped", e)
scheduler.stop(false)
state = StreamingContextState.STOPPED
throw e
}
StreamingContext.setActiveContext(this)
}
logDebug("Adding shutdown hook") // force eager creation of logger
shutdownHookRef = ShutdownHookManager.addShutdownHook(
StreamingContext.SHUTDOWN_HOOK_PRIORITY)(stopOnShutdown)
// Registering Streaming Metrics at the start of the StreamingContext
assert(env.metricsSystem != null)
env.metricsSystem.registerSource(streamingSource)
uiTab.foreach(_.attach())
logInfo("StreamingContext started")


case ACTIVE =>
logWarning("StreamingContext has already been started")
case STOPPED =>
throw new IllegalStateException("StreamingContext has already been stopped")
}
}


Next, the JobScheduler.start() code:

  def start(): Unit = synchronized {
if (eventLoop != null) return // scheduler has already been started

logDebug("Starting JobScheduler")
eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)

override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
}
eventLoop.start()

// attach rate controllers of input streams to receive batch completion updates
for {
inputDStream <- ssc.graph.getInputStreams
rateController <- inputDStream.rateController
} ssc.addStreamingListener(rateController)

listenerBus.start()
//TODO Initialize the ReceiverTracker
receiverTracker = new ReceiverTracker(ssc)
//TODO Tracks statistics about the input data, used for UI display and monitoring
inputInfoTracker = new InputInfoTracker(ssc)

val executorAllocClient: ExecutorAllocationClient = ssc.sparkContext.schedulerBackend match {
case b: ExecutorAllocationClient => b.asInstanceOf[ExecutorAllocationClient]
case _ => null
}

executorAllocationManager = ExecutorAllocationManager.createIfEnabled(
executorAllocClient,
receiverTracker,
ssc.conf,
ssc.graph.batchDuration.milliseconds,
clock)
executorAllocationManager.foreach(ssc.addStreamingListener)
//TODO Key step: start the ReceiverTracker, which is responsible for receiving and handling data
receiverTracker.start()
//TODO Key step: start the jobGenerator, which is responsible for producing Spark jobs
jobGenerator.start()
executorAllocationManager.foreach(_.start())
logInfo("Started JobScheduler")
}

The core of this method is the creation, initialization, and startup of the ReceiverTracker and the JobGenerator.
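
One detail worth noting above is the EventLoop[JobSchedulerEvent] that JobScheduler.start() wires up first. Spark's EventLoop is private[spark], but the pattern is roughly a daemon thread draining a blocking queue and dispatching each event to onReceive, as in this illustrative sketch (SimpleEventLoop is not a Spark class):

import java.util.concurrent.LinkedBlockingQueue

// Illustrative only: shows the shape of the event-loop pattern JobScheduler relies on
abstract class SimpleEventLoop[E](name: String) {
  private val queue = new LinkedBlockingQueue[E]()
  @volatile private var stopped = false

  private val thread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      while (!stopped) {
        try {
          val event = queue.take() // block until an event is posted
          onReceive(event)
        } catch {
          case _: InterruptedException => // stop() interrupts the blocked take()
          case e: Throwable            => onError(e)
        }
      }
    }
  }

  def start(): Unit = thread.start()
  def stop(): Unit = { stopped = true; thread.interrupt() }
  def post(event: E): Unit = queue.put(event)

  protected def onReceive(event: E): Unit
  protected def onError(e: Throwable): Unit
}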

Starting the ReceiverTracker (the following code runs on the driver):

  /** Start the endpoint and receiver execution thread. */
  def start(): Unit = synchronized {
    if (isTrackerStarted) {
      throw new SparkException("ReceiverTracker already started")
    }

    if (!receiverInputStreams.isEmpty) {
      // Register the ReceiverTracker endpoint with the RpcEnv (in practice a NettyRpcEnv)
      endpoint = ssc.env.rpcEnv.setupEndpoint(
        "ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
      //TODO Launch the receivers
      if (!skipReceiverLaunch) launchReceivers()
      logInfo("ReceiverTracker started")
      trackerState = Started
    }
  }



The launchReceivers() implementation:

  /**
   * Get the receivers from the ReceiverInputDStreams, distributes them to the
   * worker nodes as a parallel collection, and runs them.
   * <br><br>
   * org.apache.spark.streaming.scheduler.ReceiverTracker#launchReceivers()
   */
  private def launchReceivers(): Unit = {
    //TODO Pull the Receiver out of each ReceiverInputDStream
    val receivers = receiverInputStreams.map { nis =>
      val rcvr = nis.getReceiver()
      rcvr.setReceiverId(nis.id)
      rcvr
    }

    runDummySparkJob()

    logInfo("Starting " + receivers.length + " receivers")
    //TODO endpoint is the reference to the ReceiverTracker's own endpoint,
    //TODO so this is effectively sending a message to itself
    endpoint.send(StartAllReceivers(receivers))
  }


How org.apache.spark.streaming.scheduler.ReceiverTracker.ReceiverTrackerEndpoint#receive handles the StartAllReceivers message:

    override def receive: PartialFunction[Any, Unit] = {
// Local messages
//TODO Handle this message
case StartAllReceivers(receivers) =>
val scheduledLocations = schedulingPolicy.scheduleReceivers(receivers, getExecutors)
for (receiver <- receivers) {
val executors = scheduledLocations(receiver.streamId)
updateReceiverScheduledExecutors(receiver.streamId, executors)
receiverPreferredLocations(receiver.streamId) = receiver.preferredLocation
//TODO Start the receiver
startReceiver(receiver, executors)
}

case RestartReceiver(receiver) =>
// Old scheduled executors minus the ones that are not active any more
val oldScheduledExecutors = getStoredScheduledExecutors(receiver.streamId)
val scheduledLocations = if (oldScheduledExecutors.nonEmpty) {
// Try global scheduling again
oldScheduledExecutors
} else {
val oldReceiverInfo = receiverTrackingInfos(receiver.streamId)
// Clear "scheduledLocations" to indicate we are going to do local scheduling
val newReceiverInfo = oldReceiverInfo.copy(
state = ReceiverState.INACTIVE, scheduledLocations = None)
receiverTrackingInfos(receiver.streamId) = newReceiverInfo
schedulingPolicy.rescheduleReceiver(
receiver.streamId,
receiver.preferredLocation,
receiverTrackingInfos,
getExecutors)
}
// Assume there is one receiver restarting at one time, so we don't need to update
// receiverTrackingInfos
startReceiver(receiver, scheduledLocations)
case c: CleanupOldBlocks =>
receiverTrackingInfos.values.flatMap(_.endpoint).foreach(_.send(c))
case UpdateReceiverRateLimit(streamUID, newRate) =>
for (info <- receiverTrackingInfos.get(streamUID); eP <- info.endpoint) {
eP.send(UpdateRateLimit(newRate))
}
// Remote messages
case ReportError(streamId, message, error) =>
reportError(streamId, message, error)
}

Finally, startReceiver creates the ReceiverSupervisorImpl on the executor side:

    /**
* Start a receiver along with its scheduled executors
*/
private def startReceiver(
receiver: Receiver[_],
scheduledLocations: Seq[TaskLocation]): Unit = {
def shouldStartReceiver: Boolean = {
// It's okay to start when trackerState is Initialized or Started
!(isTrackerStopping || isTrackerStopped)
}

val receiverId = receiver.streamId
if (!shouldStartReceiver) {
onReceiverJobFinish(receiverId)
return
}

val checkpointDirOption = Option(ssc.checkpointDir)
val serializableHadoopConf =
new SerializableConfiguration(ssc.sparkContext.hadoopConfiguration)

// Function to start the receiver on the worker node
val startReceiverFunc: Iterator[Receiver[_]] => Unit =
(iterator: Iterator[Receiver[_]]) => {
if (!iterator.hasNext) {
throw new SparkException(
"Could not start receiver as object not found.")
}
if (TaskContext.get().attemptNumber() == 0) {
val receiver = iterator.next()
assert(iterator.hasNext == false)
//TODO Create the ReceiverSupervisorImpl
val supervisor = new ReceiverSupervisorImpl(
receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
supervisor.start()
supervisor.awaitTermination()
} else {
// It's restarted by TaskScheduler, but we want to reschedule it again. So exit it.
}
}

// Create the RDD using the scheduledLocations to run the receiver in a Spark job
val receiverRDD: RDD[Receiver[_]] =
if (scheduledLocations.isEmpty) {
ssc.sc.makeRDD(Seq(receiver), 1)
} else {
val preferredLocations = scheduledLocations.map(_.toString).distinct
ssc.sc.makeRDD(Seq(receiver -> preferredLocations))
}
receiverRDD.setName(s"Receiver $receiverId")
ssc.sparkContext.setJobDescription(s"Streaming job running receiver $receiverId")
ssc.sparkContext.setCallSite(Option(ssc.getStartSite()).getOrElse(Utils.getCallSite()))

val future = ssc.sparkContext.submitJob[Receiver[_], Unit, Unit](
receiverRDD, startReceiverFunc, Seq(0), (_, _) => Unit, ())
// We will keep restarting the receiver job until ReceiverTracker is stopped
future.onComplete {
case Success(_) =>
if (!shouldStartReceiver) {
onReceiverJobFinish(receiverId)
} else {
logInfo(s"Restarting Receiver $receiverId")
self.send(RestartReceiver(receiver))
}
case Failure(e) =>
if (!shouldStartReceiver) {
onReceiverJobFinish(receiverId)
} else {
logError("Receiver has been stopped. Try to restart it.", e)
logInfo(s"Restarting Receiver $receiverId")
self.send(RestartReceiver(receiver))
}
}(ThreadUtils.sameThread)
logInfo(s"Receiver ${receiver.streamId} started")
}
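
The notable trick in startReceiver is that each receiver is shipped to an executor by wrapping it in a one-element RDD and submitting a job whose single task runs the receiver's supervisor loop. The same "run a long-lived function on an executor via a tiny RDD" idea can be sketched outside Spark's internals as follows (illustrative only; a real receiver task loops until stopped instead of returning):

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative sketch of the receiverRDD + submitJob trick used by startReceiver()
object TinyRddJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("tiny-rdd-job").setMaster("local[2]"))

    val payload = "pretend-this-is-a-receiver"
    val oneElementRdd = sc.makeRDD(Seq(payload), 1) // one partition => one task

    // foreachPartition runs on an executor; a real receiver would loop here
    // until stopped instead of returning immediately
    oneElementRdd.foreachPartition { iter =>
      iter.foreach(p => println(s"running on an executor with: $p"))
    }

    sc.stop()
  }
}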