Spark Learning - Spark Streaming - 15 - A First Look at Spark Streaming

Date: 2022-05-28 20:49:19

1. Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant processing of live data streams. Data can be ingested from many sources such as Kafka, Flume, Kinesis, or TCP sockets, and can be processed with complex algorithms expressed through high-level functions like map, reduce, join, and window. Finally, the processed data can be pushed out to file systems, databases, and live dashboards. In fact, you can also apply Spark's machine learning and graph processing algorithms to data streams.
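To make those high-level functions concrete, here is a minimal sketch of a windowed word count (the host, port, and window sizes are just placeholders, not from the examples below):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedWordCount {
  def main(args: Array[String]): Unit = {
    // local[2]: at least two threads, one for the receiver and one for processing
    val conf = new SparkConf().setMaster("local[2]").setAppName("WindowedWordCount")
    // DStreams are processed as a series of small batches; here one batch per second
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))
    // Count the words seen in the last 30 seconds, recomputed every 10 seconds
    val windowedCounts = words.map(x => (x, 1))
      .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))
    windowedCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

reduceByKeyAndWindow recomputes the counts over a sliding window; both the window length and the slide interval must be multiples of the batch interval.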

As I understand it, Spark Streaming is a middleman between two parties: it takes what the first party says and relays it, in a sensible form, to the other. It is like a phone call (you → phone → her): the phone receives what you say in real time and passes it on immediately. That is why it is called a stream: the data flows continuously, like a small brook. Strictly speaking, Spark Streaming works in micro-batches: the incoming stream is chopped into small batches (one per second in the examples below), and each batch is processed as a regular Spark job.

Let's run an example to see it in action; a working Spark installation is all you need.
Open the first terminal and run:

Last login: Tue Aug 22 08:49:20 2017 from 192.168.1.161
[root@bigdata01 ~]# nc -lk 9999

Nothing shows up yet: nc -lk 9999 simply starts a Netcat server that listens (-l) on port 9999 and keeps listening (-k) after a client disconnects. Leave this window open and open a second terminal:

[root@bigdata01 ~]# /opt/hzjs/spark/bin/run-example streaming.NetworkWordCount localhost 9999 
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hzjs/spark-2.1.1-bin-hadoop2.7/jars_test/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hzjs/spark-2.1.1-bin-hadoop2.7/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17/08/22 09:16:55 WARN SparkConf:
SPARK_WORKER_INSTANCES was detected (set to '2').
This is deprecated in Spark 1.0+.

Please instead use:
- ./spark-submit with --num-executors to specify the number of executors
- Or set SPARK_EXECUTOR_INSTANCES
- spark.executor.instances to configure the number of instances in the spark config.

17/08/22 09:16:59 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
17/08/22 09:16:59 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
-------------------------------------------
Time: 1503364625000 ms
-------------------------------------------


-------------------------------------------
Time: 1503364626000 ms
-------------------------------------------


-------------------------------------------
Time: 1503364627000 ms
-------------------------------------------


-------------------------------------------
Time: 1503364628000 ms
-------------------------------------------

A new batch header scrolls by every second even though we have not typed anything yet: the example runs with a 1-second batch interval, and print() fires for every batch, empty or not.
Now type some text in the first window:

[root@bigdata01 ~]# nc -lk 9999
i am your father
i like you

The second window immediately shows:


-------------------------------------------
Time: 1503365442000 ms
-------------------------------------------


-------------------------------------------
Time: 1503365443000 ms
-------------------------------------------

(i,1)
(your,1)
(am,1)
(father,1)

-------------------------------------------
Time: 1503365444000 ms
-------------------------------------------


-------------------------------------------
Time: 1503365448000 ms
-------------------------------------------

(i,1)
(you,1)
(like,1)

-------------------------------------------
Time: 1503365449000 ms
-------------------------------------------


-------------------------------------------
Time: 1503365450000 ms
-------------------------------------------

So Spark Streaming is a technique for receiving a steady stream of data and processing it in near real time. Note that the counts are per batch: "i" appears with count 1 in both batches above rather than accumulating, because reduceByKey runs independently on each 1-second batch.
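For reference, the NetworkWordCount example we just ran boils down to roughly the following (a sketch of its core logic, not the exact bundled source: the real example reads host and port from the command line and validates its arguments):

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(1)) // 1-second batches

    // Lines of text arriving on localhost:9999, fed by `nc -lk 9999`
    val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _) // per-batch counts
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

To accumulate counts across batches instead, you would use updateStateByKey, which keeps per-key state between batches and requires enabling checkpointing via ssc.checkpoint.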

The first program: the same word count, but fed by a custom Receiver instead of the built-in socket source.

package mystream

import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.receiver.Receiver

/**
 * Custom Receiver that receives data over a socket. Received bytes are interpreted as
 * text, and \n delimited lines are considered records. They are then counted and printed.
 *
 * To run this, first start a Netcat server on the host and port used below
 * `$ nc -lk 8096`
 * and then run this program.
 */
object CustomReceiver {
  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster("local[8]").setAppName("CustomReceiver")
    val ssc = new StreamingContext(sparkConf, Seconds(1))

    // Create an input stream with the custom receiver on the target ip:port and count
    // the words in the input stream of \n delimited text (e.g. generated by 'nc')
    val lines = ssc.receiverStream(new CustomReceiver("192.168.10.83", 8096))
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

class CustomReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Start the thread that receives data over a connection
    new Thread("Socket Receiver") {
      override def run(): Unit = { receive() }
    }.start()
  }

  def onStop(): Unit = {
    // There is nothing much to do, as the thread calling receive()
    // is designed to stop by itself once isStopped() returns true
  }

  /** Create a socket connection and receive data until the receiver is stopped */
  private def receive(): Unit = {
    var socket: Socket = null
    var userInput: String = null
    try {
      println("Connecting to " + host + ":" + port)
      socket = new Socket(host, port)
      println("Connected to " + host + ":" + port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))
      userInput = reader.readLine()
      while (!isStopped && userInput != null) {
        store(userInput) // hand the received line to Spark
        userInput = reader.readLine()
      }
      reader.close()
      socket.close()
      println("Stopped receiving")
      restart("Trying to connect again")
    } catch {
      case e: java.net.ConnectException =>
        restart("Error connecting to " + host + ":" + port, e)
      case t: Throwable =>
        restart("Error receiving data", t)
    }
  }
}
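To try it, start the Netcat server first (nc -lk 8096 on 192.168.10.83, or change the host and port in main to match your machine), then run the program from your IDE or package it for spark-submit. A few notes on the Receiver API: store(...) hands each record to Spark, which groups records into blocks and then into the per-batch RDDs; restart(...) asks the framework to stop the receiver and schedule a reconnect attempt after a disconnect or error; and the isStopped flag makes the receive loop exit cleanly when the StreamingContext shuts down.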

Start the input stream by pasting some text into the nc session:


The run results. Judging by the tokens counted below, the text pasted into the stream was the example's own source code, which is why identifiers such as ssc.start() and bin/run-example show up as "words":

18/01/29 16:55:57 INFO Executor: Running task 4.0 in stage 37.0 (TID 53)
18/01/29 16:55:57 INFO BlockManager: Found block input-0-1517216156600 locally
-------------------------------------------
Time: 1517216157000 ms
-------------------------------------------
(9999`,2)
(stream,2)
(Receiver[String](StorageLevel.MEMORY_AND_DISK_2),1)
(They,1)
(considered,1)
(CustomReceiver,1)
(ssc.start(),1)
(bin/run-example,1)
(,105)
(first,1)
...

18/01/29 16:55:57 INFO Executor: Finished task 2.0 in stage 37.0 (TID 51). 1414 bytes result sent to driver
18/01/29 16:55:57 INFO Executor: Finished task 4.0 in stage 37.0 (TID 53). 1425 bytes result sent to driver
18/01/29 16:55:57 INFO TaskSetManager: Finished task 2.0 in stage 37.0 (TID 51) in 413 ms on localhost (executor driver) (1/5)
18/01/29 16:55:57 INFO TaskSetManager: Finished task 4.0 in stage 37.0 (TID 53) in 413 ms on localhost (executor driver) (2/5)
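One last detail: the "..." after the first ten pairs is not terminal truncation. DStream.print() shows only the first ten elements of each batch by default; print(n) shows the first n.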