[Experimental Data]
The data used in this experiment are the user web click-stream records of a broadband carrier from 00:00 to 01:00 on November 11, 2014, totaling 31,487,488 records. The dataset is stored on HDFS at "/data/13/4/dpi/", with fields separated by tabs. It is also available in Hive as the table "bigdata_cases.dpi".
The fields are defined as follows:
Field	Definition
srcip	IP of the user's client
ad	ID of the user's device
ts	timestamp of the request, in milliseconds since 00:00 on January 1, 1970
url	URL of the page currently requested
ref	URL of the referring page, stored as a base64-encoded value
ua	the user's browser (User-Agent), stored as a base64-encoded value
dstip	IP of the server being accessed
cookie	the user's cookie, stored as a base64-encoded value
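As a quick sanity check outside Hive, the base64-encoded fields can be inspected with a few lines of Python. This is a minimal sketch; the sample value below is the ua field of the fifth sample row shown further down.

```python
import base64

# The ref, ua, and cookie fields are stored base64-encoded.
# Decoding the ua value of the fifth sample row recovers the User-Agent string.
ua_b64 = "TW96aWxsYS80LjAgKGNvbXBhdGlibGU7IE1TSUUgOC4wOyBXaW5kb3dzIE5UIDUuMTsgVHJpZGVudC80LjAp"
ua = base64.b64decode(ua_b64).decode("utf-8")
print(ua)  # Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)
```

The same decoding is done inside Hive later with unbase64().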
The first 5 rows of the dataset:
student1@master:~$ hdfs dfs -cat /data/13/4/dpi/ | head -5
1.171.51.162 none 1415638064719 /album/355986 aHR0cDovL3d3dy56aGl5b3VsYS5jb20v TW96aWxsYS81LjAgKGNvbXBhdGlibGU7IE1TSUUgOS4wOyBXaW5kb3dzIE5UIDYuMTsgV09XNjQ7IFRyaWRlbnQvNS4wOyBTTENDMjsgLk5FVCBDTFIgMi4wLjUwNzI3OyAuTkVUIENMUiAzLjUuMzA3Mjk7IC5ORVQgQ0xSIDMuMC4zMDcyOTsgTWVkaWEgQ2VudGVyIFBDIDYuMDsgTlAwNjsgLk5FVDQuMEM7IC5ORVQ0LjBFOyAzNjBTRSk= 180.166.52.52 UEhQU0VTU0lEPW1uczIyMW5rb200cG5iaml2NjhldGV1cjczOyBDTlpaREFUQTEwMDAwNTc5NzM9MTIzNzAwOTIzOC0xNDE1NjM2NDU5LWh0dHAlMjUzQSUyNTJGJTI1MkZ3ZWliby5jb20lMjUyRiU3QzE0MTU2MzY0NTk7IEhtX2x2dF9lMWU5YjY5MDNkMDFlNGI4NjMxOWRmMzY1ZWQxNTQwZD0xNDE1NDY4MzUyLDE0MTU0ODM2MDMsMTQxNTU0OTM2MywxNDE1NjM4MDIzOyBIbV9scHZ0X2UxZTliNjkwM2QwMWU0Yjg2MzE5ZGYzNjVlZDE1NDBkPTE0MTU2MzgwMjM= 20141111 00
1.193.64.174 none 1415635901445 /?sid=734&tn=title&gotourl=&utm_source=baidu&utm_medium=cpc&utm_campaign=baidupinpaizhuanqu&utm_content=title aHR0cDovL3d3dy5iYWlkdS5jb20vcz93ZD0lRTklOTQlQTYlRTYlQjElOUYlRTQlQjklOEIlRTYlOTglOUYlRTUlQUUlOTglRTclQkQlOTEmcnN2X3NwdD0xJmlzc3A9MSZmPTMmcnN2X2JwPTAmcnN2X2lkeD0yJmllPXV0Zi04JnRuPWJhaWR1aG9tZV9wZyZyc3ZfZW50ZXI9MSZyc3Zfc3VnMz03JnJzdl9zdWc0PTg4MyZyc3Zfc3VnMT02Jm9xPSVFOSU5NCVBNiVFNiVCMSU5RiZyc3Zfc3VnMj0xJnJzcD0xJmlucHV0VD01MDgxJl9fZWlzPTEmX19laXN0PTM4MQ== TW96aWxsYS81LjAgKFdpbmRvd3MgTlQgNi4xOyBXT1c2NCkgQXBwbGVXZWJLaXQvNTM3LjM2IChLSFRNTCwgbGlrZSBHZWNrbykgTWF4dGhvbi80LjMuMS4yMDAwIENocm9tZS8zMC4wLjE1OTkuMTAxIFNhZmFyaS81MzcuMzY= 116.236.224.20 YmRzaGFyZV9maXJzdGltZT0xMzgyMTkzMzE5NTUxOyBhZ19maWQ9WUx6bE1Dd3dCUk9zeUJtRjsgX19hZ19jbV89MTQxMTkxNzQxMjI4NzsgSG1fbHZ0X2Q2MGJjZTQyNjY3NTM5OWIxZGY0NTcyOGI0OGYxMDdkPTE0MTE5MTUwNzksMTQxMTkzMDI3NCwxNDExOTMxMzM4LDE0MTE5OTUxNjM7IF9fdXRtYT0yMTc0ODU3MDkuMTM4NDQ5NTY2OS4xMzgzNzEzMDkzLjE0MTE5MzEzMzkuMTQxMTk5NTE2Ni4yMDsgX191dG16PTIxNzQ4NTcwOS4xNDExOTk1MTY2LjIwLjUudXRtY3NyPWJhaWR1fHV0bWNjbj1iYWlkdXBpbnBhaXpodWFucXV8dXRtY21kPWNwY3x1dG1jY3Q9dGl0bGU7IG9vb29vb289NWExOTk2M2Fvb29vb29vXzVhMTk5NjNh 20141111 00
1.193.64.174 none 1415635909805 /js/datetime/Utility/My97DatePicker/ aHR0cDovL3d3dy5qaW5qaWFuZ2lubnMuY29tL0RlZmF1bHQuaHRtbD91dG1fc291cmNlPWJhaWR1JnV0bV9tZWRpdW09Y3BjJnV0bV9jYW1wYWlnbj1iYWlkdXBpbnBhaXpodWFucXUmdXRtX2NvbnRlbnQ9dGl0bGU= TW96aWxsYS81LjAgKFdpbmRvd3MgTlQgNi4xOyBXT1c2NCkgQXBwbGVXZWJLaXQvNTM3LjM2IChLSFRNTCwgbGlrZSBHZWNrbykgTWF4dGhvbi80LjMuMS4yMDAwIENocm9tZS8zMC4wLjE1OTkuMTAxIFNhZmFyaS81MzcuMzY= 116.236.224.20 YmRzaGFyZV9maXJzdGltZT0xMzgyMTkzMzE5NTUxOyBhZ19maWQ9WUx6bE1Dd3dCUk9zeUJtRjsgX19hZ19jbV89MTQxMTkxNzQxMjI4NzsgSG1fbHZ0X2Q2MGJjZTQyNjY3NTM5OWIxZGY0NTcyOGI0OGYxMDdkPTE0MTE5MTUwNzksMTQxMTkzMDI3NCwxNDExOTMxMzM4LDE0MTE5OTUxNjM7IG9vb29vb289NWExOTk2M2Fvb29vb29vXzVhMTk5NjNhOyBBU1AuTkVUX1Nlc3Npb25JZD16ZDBnc3o1NW9ha2h6ZDQ1a3p6dHVoNDU7IENvb2tpZVRlc3Q9Y29va2lldGVzdDsgdHJhY2U9dHJhY2Vsb2dpZD02MDUxOTY4NjQmdHJhY2VpZD03MzQmY2xpY2tfaWQ9MDsgX191dG10PTE7IF9fdXRtYT0yMTc0ODU3MDkuMTM4NDQ5NTY2OS4xMzgzNzEzMDkzLjE0MTE5OTUxNjYuMTQxNTYzNTg3NC4yMTsgX191dG1iPTIxNzQ4NTcwOS4xLjEwLjE0MTU2MzU4NzQ7IF9fdXRtYz0yMTc0ODU3MDk7IF9fdXRtej0yMTc0ODU3MDkuMTQxNTYzNTg3NC4yMS42LnV0bWNzcj1iYWlkdXx1dG1jY249YmFpZHVwaW5wYWl6aHVhbnF1fHV0bWNtZD1jcGN8dXRtY2N0PXRpdGxl 20141111 00
1.198.181.118 none 1415638269814 /?cmpid=jl_mob_000015&app=jlfh NoDef TW96aWxsYS81LjAgKExpbnV4OyBVOyBBbmRyb2lkIDQuMC40OyB6aC1jbjsgWlRFIFU3OTUrIEJ1aWxkL0lNTTc2RCkgQXBwbGVXZWJLaXQvNTM0LjMwIChLSFRNTCwgbGlrZSBHZWNrbykgVmVyc2lvbi80LjAgTW9iaWxlIFNhZmFyaS81MzQuMzA= 116.228.143.243 X19hZ19jbV89MTQwNTM5MDY4MzE5OA== 20141111 00
1.206.172.59 none 1415636904699 /cms/app/info/doc//24944 aHR0cDovL3d3dy5iYWlkdS5jb20vcz90bj0wNjAwODAwNl8yX3BnJndkPSVCOSVBNCVENyVGNyVDRSVERSVENiVEOCVCNCVGMyVDQSVDMiVCOSVDQSVDNyVFOSVCRiVGNiVDQiVCNSVDMyVGNw== TW96aWxsYS80LjAgKGNvbXBhdGlibGU7IE1TSUUgOC4wOyBXaW5kb3dzIE5UIDUuMTsgVHJpZGVudC80LjAp 222.66.2.231 NoDef 20141111 00
[Experimental Steps]
In the steps hinted below, Step 1 uses Hive for data analysis and data preparation, with all code executed on the big-data computing cluster; Step 2 uses R for data visualization.
- Data analysis and data preparation with Hive
a. Count the page views of all domains
Count the page views of each domain.
hive -e \
"select regexp_extract(parse_url(url, 'HOST'),'([a-z0-9]+).(.*)',2) as domain, count(1) as PV
from bigdata_cases.dpi
group by regexp_extract(parse_url(url, 'HOST'),'([a-z0-9]+).(.*)',2)
order by PV desc;"
The first 10 rows of the result:
domain	PV
6279156
2674312
2547403
1452202
1420461
847723
533678
388944
291111
266182
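The regexp_extract(..., '([a-z0-9]+).(.*)', 2) call used above keeps everything after the first host label, so subdomains of one site collapse into a single domain. A minimal Python sketch of the same regex behavior:

```python
import re

# Group 2 of '([a-z0-9]+).(.*)' captures everything after the first host
# label, so "www.taobao.com" and "s.taobao.com" both map to "taobao.com".
def extract_domain(host):
    m = re.search(r'([a-z0-9]+).(.*)', host)
    return m.group(2) if m else None

print(extract_domain("www.taobao.com"))  # taobao.com
```

Note that the dot between the two groups is unescaped, so it matches any character; this works in practice because the greedy first group stops at the first non-alphanumeric character.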
b. Count the per-minute page views of the top-10 domains by page views
Count the per-minute page views and unique-IP counts of the 10 domains with the highest page views.
hive -e \
"select round((ts-1000*unix_timestamp('2014-11-11 00:00:00'))/60000) as time, regexp_extract(parse_url(url, 'HOST'),'([a-z0-9]+).(.*)',2) as domain, count(1) as PV, count(distinct srcip) as IP
from bigdata_cases.dpi a
inner join (
select regexp_extract(parse_url(url, 'HOST'),'([a-z0-9]+).(.*)',2) as domain,count(1) as PV
from bigdata_cases.dpi
group by regexp_extract(parse_url(url, 'HOST'),'([a-z0-9]+).(.*)',2)
order by PV desc limit 10) b on regexp_extract(parse_url(a.url, 'HOST'),'([a-z0-9]+).(.*)',2) = b.domain
where round((ts-1000*unix_timestamp('2014-11-11 00:00:00'))/60000) between 0 and 60
group by round((ts-1000*unix_timestamp('2014-11-11 00:00:00'))/60000), regexp_extract(parse_url(url, 'HOST'),'([a-z0-9]+).(.*)',2)
order by time, domain;"
The first 10 rows of the result:
time domain PV IP
0.0 22382 8460
0.0 553 464
0.0 15418 11964
0.0 9199 5385
0.0 3573 3038
0.0 15609 6656
0.0 30682 21880
0.0 56171 36787
0.0 10917 1051
0.0 3325 1921
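The minute index in the time column comes from round((ts - 1000*unix_timestamp('2014-11-11 00:00:00'))/60000). A small Python sketch of the same arithmetic, assuming the cluster clock is in UTC+8 so that local midnight of 2014-11-11 is 1415635200000 ms:

```python
# Map a millisecond timestamp to a minute index after midnight, mirroring
# round((ts - 1000*unix_timestamp('2014-11-11 00:00:00'))/60000) in Hive.
def minute_index(ts_ms, midnight_ms):
    return round((ts_ms - midnight_ms) / 60000)

# 1415635901445 is the ts of the second sample row; 1415635200000 ms is
# assumed to be 2014-11-11 00:00:00 in UTC+8.
print(minute_index(1415635901445, 1415635200000))  # 12
```

Python's round uses banker's rounding while Hive rounds half up, so exact half-minute boundaries can differ by one bucket; the sketch is only meant to show the unit conversion.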
c. Count the per-minute page views of e-commerce sites
Count the per-minute page views of the major e-commerce sites.
hive -e \
"select round((ts-1000*unix_timestamp('2014-11-11 00:00:00'))/60000) as time,
case when parse_url(url, 'HOST') like '%.taobao.%' then 'Taobao'
when parse_url(url, 'HOST') like '%.tmall.%' then 'Tmall'
when parse_url(url, 'HOST') like '%.jd.%' then 'JD'
when parse_url(url, 'HOST') like '%.yhd.%' then 'YHD'
when parse_url(url, 'HOST') like '%.amazon.%' then 'Amazon'
when parse_url(url, 'HOST') like '%.vip.%' then 'VIP'
else 'Suning' end as website,
count(1) as PV, count(distinct srcip) as IP
from bigdata_cases.dpi a
where (parse_url(url, 'HOST') like '%.taobao.%' or parse_url(url, 'HOST') like '%.tmall.%' or parse_url(url, 'HOST') like '%.jd.%'
or parse_url(url, 'HOST') like '%.yhd.%' or parse_url(url, 'HOST') like '%.amazon.%' or parse_url(url, 'HOST') like '%.vip.%'
or parse_url(url, 'HOST') like '%.suning.%')
and round((ts-1000*unix_timestamp('2014-11-11 00:00:00'))/60000) between 0 and 60
group by round((ts-1000*unix_timestamp('2014-11-11 00:00:00'))/60000),
case when parse_url(url, 'HOST') like '%.taobao.%' then 'Taobao'
when parse_url(url, 'HOST') like '%.tmall.%' then 'Tmall'
when parse_url(url, 'HOST') like '%.jd.%' then 'JD'
when parse_url(url, 'HOST') like '%.yhd.%' then 'YHD'
when parse_url(url, 'HOST') like '%.amazon.%' then 'Amazon'
when parse_url(url, 'HOST') like '%.vip.%' then 'VIP'
else 'Suning' end
order by time, website;"
The first 10 rows of the result:
time website PV IP
0.0 Amazon 267 202
0.0 JD 9591 5545
0.0 Suning 343 274
0.0 Taobao 55040 39080
0.0 Tmall 65845 41833
0.0 VIP 87 69
0.0 YHD 3328 1923
1.0 Amazon 143 111
1.0 JD 4742 3087
1.0 Suning 232 181
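The CASE expression labels each host with a site name. A plain-Python equivalent of that mapping (the domain keywords are assumptions inferred from the site labels; in the Hive query the WHERE clause already restricts hosts to these seven sites, with Suning as the fallback branch):

```python
# Hypothetical host-to-site mapping mirroring the Hive CASE expression.
SITES = {"taobao": "Taobao", "tmall": "Tmall", "jd": "JD",
         "yhd": "YHD", "amazon": "Amazon", "vip": "VIP", "suning": "Suning"}

def label_site(host):
    # like '%.taobao.%' corresponds to the substring ".taobao." in the host.
    for key, name in SITES.items():
        if f".{key}." in host:
            return name
    return None  # host belongs to none of the seven sites

print(label_site("www.tmall.com"))  # Tmall
```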
d. Count the search volume of Baidu keywords
Count the search volume of each Baidu search keyword, keeping only keywords with at least 20 searches. The keywords returned here are URL-encoded; they will be decoded further in R later.
hive -e \
"select parse_url(url, 'QUERY', 'wd') as keyword, count(1) as PV, count(distinct srcip) as IP
from bigdata_cases.dpi
where url like '%.baidu.%/%wd=%'
group by parse_url(url, 'QUERY', 'wd')
having PV>=20
order by PV desc;"
The first 10 rows of the result:
keyword PV IP
%E5%A4%A9%E6%B0%94 93396 1051
ip 23507 265
%E6%B7%98%E5%AE%9D 9372 4413
%E4%BA%AC%E4%B8%9C 8392 4076
%E5%A4%A9%E7%8C%AB 7032 3352
%E6%B7%98%E5%AE%9D%E7%BD%91 6183 2962
%E4%BA%AC%E4%B8%9C%E5%95%86%E5%9F%8E 4198 2154
%E6%94%AF%E4%BB%98%E5%AE%9D 3710 1832
ji 3613 2561
sh 3267 2126
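The wd= values above are URL-encoded UTF-8. A short Python sketch of the decoding that will later be done with URLdecode in R:

```python
from urllib.parse import unquote

# The top keyword decodes to the Chinese word for "weather";
# the third decodes to "Taobao".
print(unquote("%E5%A4%A9%E6%B0%94"))  # 天气
print(unquote("%E6%B7%98%E5%AE%9D"))  # 淘宝
```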
e. Count the domains referring traffic to Taobao or Tmall
Count the domains that refer traffic to Taobao or Tmall, excluding Taobao and Tmall themselves.
hive -e \
"select regexp_extract(parse_url(cast(unbase64(ref) as string), 'HOST'),'([a-z0-9]+).(.*)',2) as domain,count(1) as PV
from bigdata_cases.dpi
where (parse_url(url, 'HOST') like '%.taobao.%' or parse_url(url, 'HOST') like '%.tmall.%')
and parse_url(cast(unbase64(ref) as string), 'HOST') not like '%.taobao.%'
and parse_url(cast(unbase64(ref) as string), 'HOST') not like '%.tmall.%'
and parse_url(cast(unbase64(ref) as string), 'HOST') not rlike '([0-9]+).([0-9]+).([0-9]+).([0-9]+)'
group by regexp_extract(parse_url(cast(unbase64(ref) as string), 'HOST'),'([a-z0-9]+).(.*)',2)
order by PV desc;"
The first 10 rows of the result:
domain PV
32317
23525
22678
12695
10371
8739
5836
5562
5110
4952
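The query above base64-decodes ref before parsing its host, via parse_url(cast(unbase64(ref) as string), 'HOST'). The same two steps in a short Python sketch, using the ref value of the first sample row:

```python
import base64
from urllib.parse import urlparse

# ref is stored base64-encoded; decode it, then extract the referrer host.
ref_b64 = "aHR0cDovL3d3dy56aGl5b3VsYS5jb20v"  # ref of the first sample row
url = base64.b64decode(ref_b64).decode("utf-8")
print(url)                     # http://www.zhiyoula.com/
print(urlparse(url).hostname)  # www.zhiyoula.com
```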
f. Count the search volume of Baidu keywords referring traffic to Taobao or Tmall
Count the search volume of the Baidu keywords that refer traffic to Taobao or Tmall. The keywords returned here are URL-encoded; they will be decoded further in R later.
hive -e \
"select parse_url(cast(unbase64(ref) as string), 'QUERY', 'wd') as keyword, count(1) as PV, count(distinct srcip) as IP
from bigdata_cases.dpi
where (parse_url(url, 'HOST') like '%.taobao.%' or parse_url(url, 'HOST') like '%.tmall.%')
and cast(unbase64(ref) as string) like '%.baidu.%/%wd=%'
group by parse_url(cast(unbase64(ref) as string), 'QUERY', 'wd')
order by PV desc;"
The first 10 rows of the result:
keyword PV IP
%E5%A4%A9%E7%8C%AB 2576 1745
%E6%B7%98%E5%AE%9D 2378 1766
%E6%B7%98%E5%AE%9D%E7%BD%91 2181 1559
%E5%A4%A9%E7%8C%AB%E5%95%86%E5%9F%8E 673 470
%E8%81%9A%E5%88%92%E7%AE%97 395 268
%E6%B7%98%E5%AE%9D%E5%95%86%E5%9F%8E 262 174
taobao 239 175
%E6%B7%98%E5%AE%9D%E7%BD%91%E9%A6%96%E9%A1%B5 148 116
tmall 131 85
tianmao 110 69
- Data visualization with R
a. Load the required packages
Load the required packages. Copy the result files produced by Hive to a path accessible to R, such as "D:\workspace\".
> library(ggplot2)
> library(wordcloud)
Loading required package: RColorBrewer
b. Plot a histogram of page views across all domains
Plot a histogram of the page views of all domains, with page views on a log-scaled x-axis and the number of domains on a square-root-scaled y-axis.
> data1 <- read.table("D:/workspace/", sep = "\t", stringsAsFactors = FALSE)
> names(data1) <- c("domain", "PV")
> ggplot(data1, aes(x = PV)) + geom_histogram(aes(fill = ..count..)) + scale_fill_gradient("Count",
+ trans = "sqrt", low = "green", high = "red") + scale_x_log10() + scale_y_sqrt()
stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
c. Plot the per-minute page views and unique-IP counts of the top-10 domains
Plot the per-minute page views and unique-IP counts of the 10 domains with the highest page views. The x-axis is the number of minutes after 00:00 on November 11, 2014; the y-axis shows page views on a log scale; line width indicates the average number of pages viewed per unique IP.
> data2 <- read.table("D:/workspace/", sep = "\t")
> names(data2) <- c("time", "domain", "PV", "IP")
> ggplot(data2, aes(x = time, y = PV, group = domain)) + geom_line(aes(colour = domain,
+ size = PV/IP)) + scale_y_log10()
d. Plot the per-minute page views and unique-IP counts of the 7 major e-commerce sites
Plot the per-minute page views and unique-IP counts of the 7 major e-commerce sites. The x-axis is the number of minutes after 00:00 on November 11, 2014; the y-axis shows page views on a log scale; line width indicates the average number of pages viewed per unique IP.
> data3 <- read.table("D:/workspace/", sep = "\t")
> names(data3) <- c("time", "website", "PV", "IP")
> ggplot(data3, aes(x = time, y = PV, group = website)) + geom_line(aes(colour = website,
+ size = PV/IP)) + scale_y_log10()
e. Plot a word cloud of Baidu search keywords
Plot a word cloud of the Baidu search keywords; the larger the font, the higher the search frequency. Only the top 100 keywords by search volume are drawn.
> data4 <- read.table("D:/workspace/", sep = "\t", stringsAsFactors = FALSE)
> names(data4) <- c("keyword", "PV", "IP")
> keywords <- iconv(sapply(data4$keyword[1:200], URLdecode, simplify = TRUE),
+ "UTF-8", "gbk")
> wordcloud(keywords, data4$IP, c(4, 1), random.order = FALSE, random.color = TRUE,
+ col = rainbow(10))
f. Plot a word cloud of the Baidu keywords referring traffic to Taobao or Tmall
Plot a word cloud of the Baidu search keywords that refer traffic to Taobao or Tmall; the larger the font, the higher the search frequency. Only keywords with a search volume greater than 1 are drawn.
> data6 <- read.table("D:/workspace/", sep = "\t", stringsAsFactors = FALSE)
> names(data6) <- c("keyword", "PV", "IP")
> keywords <- iconv(sapply(data6$keyword, URLdecode, simplify = TRUE), "UTF-8",
+ "gbk")
> wordcloud(keywords, data6$IP, c(4, 1), 5, random.order = FALSE, random.color = TRUE,
+ col = rainbow(10))