Spark Learning - Spark Streaming - 15 - A First Look at Spark Streaming

Date: 2022-05-28 20:49:19

1. Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant processing of live data streams. Data can be ingested from many sources such as Kafka, Flume, Kinesis, or TCP sockets, and can be processed with complex algorithms expressed through high-level functions like map, reduce, join, and window. Finally, the processed data can be pushed out to file systems, databases, and live dashboards. In fact, you can also apply Spark's machine learning and graph processing algorithms to data streams.
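To make those high-level functions concrete, here is a minimal sketch of a windowed word count (the host, port, and window sizes are just placeholders, not from the examples below):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedWordCount {
  def main(args: Array[String]): Unit = {
    // local[2]: at least two threads, one for the receiver and one for processing
    val conf = new SparkConf().setMaster("local[2]").setAppName("WindowedWordCount")
    // DStreams are processed as a series of small batches; here one batch per second
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))
    // Count the words seen in the last 30 seconds, recomputed every 10 seconds
    val windowedCounts = words.map(x => (x, 1))
      .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))
    windowedCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

reduceByKeyAndWindow recomputes the counts over a sliding window; both the window length and the slide interval must be multiples of the batch interval.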

As I understand it, Spark Streaming is a middleman between two parties: it takes what the first party says and relays it, in a sensible form, to the other. It is like a phone call (you → phone → her): the phone receives what you say in real time and passes it on immediately. That is why it is called a stream: the data flows continuously, like a small brook. Strictly speaking, Spark Streaming works in micro-batches: the incoming stream is chopped into small batches (one per second in the examples below), and each batch is processed as a regular Spark job.

Let's run an example to see it in action; a working Spark installation is all you need.
Open the first terminal and run:

Last login: Tue Aug 22 08:49:20 2017 from 192.168.1.161
[root@bigdata01 ~]# nc -lk 9999

Nothing shows up yet: nc -lk 9999 simply starts a Netcat server that listens (-l) on port 9999 and keeps listening (-k) after a client disconnects. Leave this window open and open a second terminal:

[root@bigdata01 ~]# /opt/hzjs/spark/bin/run-example streaming.NetworkWordCount localhost 9999 
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hzjs/spark-2.1.1-bin-hadoop2.7/jars_test/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hzjs/spark-2.1.1-bin-hadoop2.7/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17/08/22 09:16:55 WARN SparkConf:
SPARK_WORKER_INSTANCES was detected (set to '2').
This is deprecated in Spark 1.0+.

Please instead use:
- ./spark-submit with --num-executors to specify the number of executors
- Or set SPARK_EXECUTOR_INSTANCES
- spark.executor.instances to configure the number of instances in the spark config.

17/08/22 09:16:59 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
17/08/22 09:16:59 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
-------------------------------------------
Time: 1503364625000 ms
-------------------------------------------


-------------------------------------------
Time: 1503364626000 ms
-------------------------------------------


-------------------------------------------
Time: 1503364627000 ms
-------------------------------------------


-------------------------------------------
Time: 1503364628000 ms
-------------------------------------------

A new batch header scrolls by every second even though we have not typed anything yet: the example runs with a 1-second batch interval, and print() fires for every batch, empty or not.
Now type some text in the first window:

[root@bigdata01 ~]# nc -lk 9999
i am your father
i like you

The second window immediately shows:


-------------------------------------------
Time: 1503365442000 ms
-------------------------------------------


-------------------------------------------
Time: 1503365443000 ms
-------------------------------------------

(i,1)
(your,1)
(am,1)
(father,1)

-------------------------------------------
Time: 1503365444000 ms
-------------------------------------------


-------------------------------------------
Time: 1503365448000 ms
-------------------------------------------

(i,1)
(you,1)
(like,1)

-------------------------------------------
Time: 1503365449000 ms
-------------------------------------------


-------------------------------------------
Time: 1503365450000 ms
-------------------------------------------

So Spark Streaming is a technique for receiving a steady stream of data and processing it in near real time. Note that the counts are per batch: "i" appears with count 1 in both batches above rather than accumulating, because reduceByKey runs independently on each 1-second batch.
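For reference, the NetworkWordCount example we just ran boils down to roughly the following (a sketch of its core logic, not the exact bundled source: the real example reads host and port from the command line and validates its arguments):

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(1)) // 1-second batches

    // Lines of text arriving on localhost:9999, fed by `nc -lk 9999`
    val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _) // per-batch counts
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

To accumulate counts across batches instead, you would use updateStateByKey, which keeps per-key state between batches and requires enabling checkpointing via ssc.checkpoint.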

The first program: the same word count, but fed by a custom Receiver instead of the built-in socket source.

package mystream

import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.receiver.Receiver

/**
 * Custom Receiver that receives data over a socket. Received bytes are interpreted as
 * text, and \n delimited lines are considered records. They are then counted and printed.
 *
 * To run this, first start a Netcat server on the host and port used below
 * `$ nc -lk 8096`
 * and then run this program.
 */
object CustomReceiver {
  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster("local[8]").setAppName("CustomReceiver")
    val ssc = new StreamingContext(sparkConf, Seconds(1))

    // Create an input stream with the custom receiver on the target ip:port and count
    // the words in the input stream of \n delimited text (e.g. generated by 'nc')
    val lines = ssc.receiverStream(new CustomReceiver("192.168.10.83", 8096))
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

class CustomReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Start the thread that receives data over a connection
    new Thread("Socket Receiver") {
      override def run(): Unit = { receive() }
    }.start()
  }

  def onStop(): Unit = {
    // There is nothing much to do, as the thread calling receive()
    // is designed to stop by itself once isStopped() returns true
  }

  /** Create a socket connection and receive data until the receiver is stopped */
  private def receive(): Unit = {
    var socket: Socket = null
    var userInput: String = null
    try {
      println("Connecting to " + host + ":" + port)
      socket = new Socket(host, port)
      println("Connected to " + host + ":" + port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))
      userInput = reader.readLine()
      while (!isStopped && userInput != null) {
        store(userInput) // hand the received line to Spark
        userInput = reader.readLine()
      }
      reader.close()
      socket.close()
      println("Stopped receiving")
      restart("Trying to connect again")
    } catch {
      case e: java.net.ConnectException =>
        restart("Error connecting to " + host + ":" + port, e)
      case t: Throwable =>
        restart("Error receiving data", t)
    }
  }
}
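To try it, start the Netcat server first (nc -lk 8096 on 192.168.10.83, or change the host and port in main to match your machine), then run the program from your IDE or package it for spark-submit. A few notes on the Receiver API: store(...) hands each record to Spark, which groups records into blocks and then into the per-batch RDDs; restart(...) asks the framework to stop the receiver and schedule a reconnect attempt after a disconnect or error; and the isStopped flag makes the receive loop exit cleanly when the StreamingContext shuts down.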

Start the input stream by pasting some text into the nc session:


The run results. Judging by the tokens counted below, the text pasted into the stream was the example's own source code, which is why identifiers such as ssc.start() and bin/run-example show up as "words":

18/01/29 16:55:57 INFO Executor: Running task 4.0 in stage 37.0 (TID 53)
18/01/29 16:55:57 INFO BlockManager: Found block input-0-1517216156600 locally
-------------------------------------------
Time: 1517216157000 ms
-------------------------------------------
(9999`,2)
(stream,2)
(Receiver[String](StorageLevel.MEMORY_AND_DISK_2),1)
(They,1)
(considered,1)
(CustomReceiver,1)
(ssc.start(),1)
(bin/run-example,1)
(,105)
(first,1)
...

18/01/29 16:55:57 INFO Executor: Finished task 2.0 in stage 37.0 (TID 51). 1414 bytes result sent to driver
18/01/29 16:55:57 INFO Executor: Finished task 4.0 in stage 37.0 (TID 53). 1425 bytes result sent to driver
18/01/29 16:55:57 INFO TaskSetManager: Finished task 2.0 in stage 37.0 (TID 51) in 413 ms on localhost (executor driver) (1/5)
18/01/29 16:55:57 INFO TaskSetManager: Finished task 4.0 in stage 37.0 (TID 53) in 413 ms on localhost (executor driver) (2/5)
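One last detail: the "..." after the first ten pairs is not terminal truncation. DStream.print() shows only the first ten elements of each batch by default; print(n) shows the first n.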