Blog: http://www.fanlegefan.com
Original post: http://www.fanlegefan.com/index.php/2017/07/19/sparkstreaminglizi/
Summary
This post walks through a small Spark Streaming example: read data from Kafka in real time, compute pv, uv, and sum(money), and write the results to Redis. Expressed in SQL, the computation is roughly:
select time, page, count(*) pv, count(distinct user) uv, sum(money) from test group by page, time
Sample data format:
user,page,money,time
smith,iphone4.html,578.02,1500618981283
andrew,mac.html,277.62,1500618981285
smith,note.html,388.56,1500618981285
Push the data to Kafka
Start Kafka
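A minimal sketch of the startup commands, assuming a local single-node install with the bundled ZooKeeper and default config files (adjust paths and versions to your environment):
bin/zookeeper-server-start.sh -daemon config/zookeeper.properties
bin/kafka-server-start.sh -daemon config/server.properties
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test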
Generate data
package com.fan.spark.stream
import java.text.DecimalFormat
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import scala.util.Random
/**
* Created by http://www.fanlegefan.com on 17-7-21.
*/
object ProduceMessage {

  def main(args: Array[String]): Unit = {

    // Kafka producer configuration
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("acks", "all")
    props.put("retries", "0")
    props.put("batch.size", "16384")
    props.put("linger.ms", "1")
    props.put("buffer.memory", "33554432")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)

    val users = Array("jack", "leo", "andy", "lucy", "jim", "smith", "iverson", "andrew")
    val pages = Array("iphone4.html", "huawei.html", "mi.html", "mac.html", "note.html", "book.html", "fanlegefan.com")

    val df = new DecimalFormat("#.00")
    val random = new Random()
    val num = 10

    // Build "user,page,money,timestamp" lines and send them to the test topic
    for (i <- 0 to num) {
      val message = users(random.nextInt(users.length)) + "," + pages(random.nextInt(pages.length)) +
        "," + df.format(random.nextDouble() * 1000) + "," + System.currentTimeMillis()
      producer.send(new ProducerRecord[String, String]("test", Integer.toString(i), message))
      println(message)
    }
    producer.close()
  }
}
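Run the producer a few times so there is some data to consume; one way, assuming the project builds with sbt (the project layout is an assumption):
sbt "runMain com.fan.spark.stream.ProduceMessage"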
Consuming from the console looks like this:
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
andrew,book.html,309.58,1500620213384
jack,book.html,954.01,1500620213456
iverson,book.html,823.07,1500620213456
iverson,iphone4.html,486.76,1500620213456
lucy,book.html,14.00,1500620213457
iverson,note.html,206.30,1500620213457
jack,book.html,25.30,1500620213457
jim,iphone4.html,513.82,1500620213457
lucy,mac.html,677.29,1500620213457
smith,mi.html,571.30,1500620213457
lucy,iphone4.html,113.83,1500620213457
Compute pv, uv, and the accumulated amount
Since the results are written to Redis, the code for obtaining a Redis client is as follows:
package com.fan.spark.stream
import org.apache.commons.pool2.impl.GenericObjectPoolConfig
import redis.clients.jedis.JedisPool
/**
* Created by http://www.fanlegefan.com on 17-7-21.
*/
object RedisClient {

  val redisHost = "127.0.0.1"
  val redisPort = 6379
  val redisTimeout = 30000

  // Lazily initialized connection pool, shared by all tasks in the JVM
  lazy val pool = new JedisPool(new GenericObjectPoolConfig(), redisHost, redisPort, redisTimeout)

  // Destroy the pool when the JVM exits
  lazy val hook = new Thread {
    override def run = {
      println("Execute hook thread: " + this)
      pool.destroy()
    }
  }
  sys.addShutdownHook(hook.run)
}
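On the executor side, each partition borrows a connection from this pool and returns it when done. A minimal usage sketch (the try/finally is an extra safety measure, not something the original job requires):
val jedis = RedisClient.pool.getResource
try {
  jedis.hincrBy("20170721_pv", "mac.html", 1)
} finally {
  jedis.close() // returns the connection to the pool
}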
Spark Streaming processes data in batches; with batchDuration = 10, for example, each batch covers the data received in a 10-second window. Computing pv is straightforward: simply keep adding the counts. Computing uv is not: a user seen in the current 10-second batch may already have appeared in an earlier batch, but Spark only sees one batch at a time and has no way of knowing whether a user occurred before. Naively accumulating per-batch counts would make the daily uv much larger than the real uv. To solve this we use HyperLogLog, which Redis already provides; a quick demo:
redis 127.0.0.1:6379> PFADD mykey a b c d e f g h i j
(integer) 1
redis 127.0.0.1:6379> PFCOUNT mykey
(integer) 10
Think of a b c d e f g h i j as users: each time a user arrives we execute pfadd key user, and pfcount key then returns the deduplicated uv directly. Note that the algorithm is approximate; according to the documentation the error is around 0.8%, which is acceptable for uv counting. You can measure the error yourself if you are curious; I won't do that here.
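The same two commands map directly onto Jedis calls; a minimal sketch, reusing the RedisClient pool defined above (key name chosen only for illustration):
val jedis = RedisClient.pool.getResource
Array("jack", "leo", "jack").foreach(user => jedis.pfadd("20170721_mac.html", user))
println(jedis.pfcount("20170721_mac.html")) // prints 2, the duplicate "jack" is not counted twice
jedis.close()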
The real-time computation code is as follows:
package com.fan.spark.stream
import java.text.SimpleDateFormat
import java.util.Date
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
/**
* Created by http://www.fanlegefan.com on 17-7-21.
*/
object UserActionStreaming {

  def main(args: Array[String]): Unit = {

    val df = new SimpleDateFormat("yyyyMMdd")
    val group = "test"
    val topics = "test"

    val sparkConf = new SparkConf().setAppName("pvuv").setMaster("local[3]")
    val sc = new SparkContext(sparkConf)
    val ssc = new StreamingContext(sc, Seconds(10))
    ssc.checkpoint("/home/work/IdeaProjects/sparklearn/checkpoint")

    val topicSets = topics.split(",").toSet
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> "localhost:9092",
      "group.id" -> group
    )
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc,
      kafkaParams, topicSets)

    stream.foreachRDD(rdd => rdd.foreachPartition(partition => {
      // one Redis connection per partition
      val jedis = RedisClient.pool.getResource
      partition.foreach(tuple => {
        val line = tuple._2
        val arr = line.split(",")
        val user = arr(0)
        val page = arr(1)
        val money = arr(2)
        val day = df.format(new Date(arr(3).toLong))

        // uv: add the user to the page's HyperLogLog for that day
        jedis.pfadd(day + "_" + page, user)
        // pv: increment the page's counter in the day's hash
        jedis.hincrBy(day + "_pv", page, 1)
        // sum: accumulate the amount spent on the page
        jedis.hincrByFloat(day + "_sum", page, money.toDouble)
      })
      // return the connection to the pool
      jedis.close()
    }))

    ssc.start()
    ssc.awaitTermination()
  }
}
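For reference, a sketch of the sbt dependencies these examples assume (artifact names and versions are assumptions; match them to your Spark, Scala, and Kafka versions):
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.1.0",
  "org.apache.kafka" % "kafka-clients" % "0.10.2.1",
  "redis.clients" % "jedis" % "2.9.0"
)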
View the results in Redis
127.0.0.1:6379> keys *
1)"20170721_note.html"
2)"20170721_book.html"
3)"20170721_fanlegefan.com"
4)"20170721_mac.html"
5)"20170721_pv"
6)"20170721_mi.html"
7)"20170721_iphone4.html"
8)"20170721_sum"
9)"20170721_huawei.html"
View pv
127.0.0.1:6379> HGETALL 20170721_pv
1)"mi.html"
2)"112"
3)"note.html"
4)"107"
5)"fanlegefan.com"
6)"124"
7)"huawei.html"
8)"122"
9)"iphone4.html"
10)"92"
11)"mac.html"
12)"103"
13)"book.html"
14)"135"
View sum
127.0.0.1:6379> HGETALL 20170721_sum
1)"mi.html"
2)"56949.65999999999998948"
3)"note.html"
4)"56803.50999999999999801"
5)"fanlegefan.com"
6)"59622.50999999999999801"
7)"huawei.html"
8)"64456.50000000000000711"
9)"iphone4.html"
10)"48643.07000000000001094"
11)"mac.html"
12)"51693.17999999999998906"
13)"book.html"
14)"67724.17999999999999261"
View uv; the test data only contains 8 users, so every page's uv is 8
127.0.0.1:6379> PFCOUNT 20170721_huawei.html
(integer) 8
127.0.0.1:6379> PFCOUNT 20170721_fanlegefan.com
(integer) 8
The data is now in Redis; a scheduled job can push it to MySQL so the front end can display it. That is roughly the idea behind this real-time computation.
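As a minimal sketch of such an export job, assuming a hypothetical MySQL table pv_uv_sum(day, page, pv, uv, money) and local credentials (the Redis keys match the ones written by the streaming job; the MySQL JDBC driver must be on the classpath):
import java.sql.DriverManager
import scala.collection.JavaConverters._

object ExportToMysql {
  def main(args: Array[String]): Unit = {
    val day = "20170721"
    val jedis = RedisClient.pool.getResource
    // hypothetical database, table, and credentials
    val conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "root", "root")
    val ps = conn.prepareStatement(
      "insert into pv_uv_sum(day, page, pv, uv, money) values (?, ?, ?, ?, ?)")
    try {
      // walk the pv hash, then look up uv and sum for each page
      jedis.hgetAll(day + "_pv").asScala.foreach { case (page, pv) =>
        val uv = jedis.pfcount(day + "_" + page)
        val money = jedis.hget(day + "_sum", page)
        ps.setString(1, day)
        ps.setString(2, page)
        ps.setLong(3, pv.toLong)
        ps.setLong(4, uv)
        ps.setDouble(5, money.toDouble)
        ps.executeUpdate()
      }
    } finally {
      ps.close()
      conn.close()
      jedis.close()
    }
  }
}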