[实验数据]
本实验所用数据为江苏省南京市出租车从2010年9月1日到2日的GPS位置数据,包含了7,726辆出租车的共33,042,225条位置记录。数据集已经存放在HDFS上,路径为“/data/13/1/gps/”,各字段以制表符分隔。数据集还存放在了Hive上,表名为“bigdata_cases.gps”。
各字段的定义为:
字段 定义
VehicleId 出租车标识符
Time 数据收集时间
Longitude 经度
Latitude 纬度
Speed 速度,以千米/小时计
Direction 方向,与正北方向的夹角度数
PassengerState 载客状态,1为载客,0为空车
数据集的前5行为:
student1@master:~$ hdfs dfs -cat /data/13/1/gps/ | head -5
806584008859 2010-09-01 00:01:42 118.85504299999999 31.939446 0 0 0
806770549907 2010-09-01 00:01:45 118.74909 32.089162000000002 30 6 1
806770943693 2010-09-01 00:01:46 118.856236 32.094287999999999 26 318 1
806914743134 2010-09-01 00:01:48 118.836872 31.931645 0 0 0
806451847007 2010-09-01 00:01:51 118.795597 32.039382000000003 20 10 0
[实验步骤提示]
在以下提示步骤中,步骤1是用Hive做数据分析和数据准备,所有代码在大数据计算集群上执行,步骤2是用R语言做数据可视化。
-
用Hive做数据分析和数据准备
a. 统计每小时出租车的载客情况
统计每小时总共的独立出租车数量以及在该小时载过客的独立出租车数量,并计算出载过客的比例。hive -e \ "select day(Time) as Day, hour(Time) as Hour, count(distinct case when PassengerState=1 then VehicleId else NULL end) as Passenger, count(distinct VehicleId) as Total, count(distinct case when PassengerState=1 then VehicleId else NULL end) / count(distinct VehicleId) as PassengerRatio from bigdata_cases.gps group by day(Time), hour(Time) order by Day, Hour;" \
得到结果:
Day Hour Passenger Total PassengerRatio
1 0 4258 7457 0.5710071074158509
1 1 3760 7453 0.5044948342949148
1 2 3156 7457 0.42322649859192707
1 3 2600 7482 0.3475006682705159
1 4 1219 7462 0.16336102921468776
1 5 1356 7454 0.18191574993292192
1 6 2141 7479 0.2862682176761599
1 7 3798 7494 0.5068054443554844
1 8 5345 7504 0.7122867803837953
1 9 5998 7512 0.7984558040468583
1 10 6198 7508 0.8255194459243473
1 11 6246 7497 0.8331332533013205
1 12 6424 7526 0.8535742758437417
1 13 5810 7486 0.7761154154421587
1 14 6004 7485 0.8021376085504343
1 15 6124 7486 0.818060379374833
1 16 6166 7505 0.8215856095936043
1 17 6085 7507 0.8105767949913414
1 18 5957 7501 0.7941607785628583
1 19 6292 7519 0.836813406038037
1 20 5788 7507 0.7710137205275076
1 21 5644 7491 0.7534374582832732
1 22 5314 7486 0.7098584023510554
1 23 6 11 0.5454545454545454
2 0 6 10 0.6
2 1 5 10 0.5
2 2 527 2179 0.24185406149609912
2 3 245 3115 0.07865168539325842
2 4 420 2747 0.15289406625409538
2 5 156 1881 0.08293460925039872
2 7 4653 7496 0.6207310565635006
2 8 5951 7504 0.793043710021322
2 9 6261 7483 0.8366965120940799
2 10 6324 7500 0.8432
2 11 6231 7497 0.8311324529811924
2 12 6075 7484 0.8117316942811331
2 13 5862 7493 0.7823301748298412
2 14 5956 7508 0.7932871603622802
2 15 6114 7531 0.8118443765768159
2 16 6172 7550 0.8174834437086093
2 17 5943 7561 0.7860071419124455
2 18 5948 7564 0.7863564251718668
2 19 5999 7553 0.794253938832252
2 20 5842 7549 0.773877334746324
2 21 5688 7536 0.7547770700636943
2 22 5481 7535 0.7274054412740544
2 23 5203 7527 0.6912448518666136
b. 统计每小时载客出租车的平均车速
统计每小时正在载客的出租车的平均车速。
hive -e \
"select day(Time) as Day, hour(Time) as Hour,
avg(Speed) as speed
from bigdata_cases.gps
where PassengerState=1
group by day(Time), hour(Time)
order by day, hour;" \
得到结果:
Day Hour Speed
1 0 26.40957333997252
1 1 26.484185465929794
1 2 25.639321183088185
1 3 25.07481635824269
1 4 23.407378384509105
1 5 25.52963627342468
1 6 29.044904348241705
1 7 24.384438390355054
1 8 18.168545668750205
1 9 18.274947503908844
1 10 19.536583655156143
1 11 20.64434407258455
1 12 21.36393650253309
1 13 23.031837592073078
1 14 21.183105455791274
1 15 19.92800990942307
1 16 19.278142206512133
1 17 19.409214001377936
1 18 17.76479025717124
1 19 20.073146396873305
1 20 22.744269452020934
1 21 23.00955957379188
1 22 23.94621109754561
1 23 25.748275862068965
2 0 13.095555555555556
2 1 11.86861313868613
2 2 21.301467562748595
2 3 19.34041468388851
2 4 25.451918616210484
2 5 26.88686302010285
2 7 17.94664964407304
2 8 18.322770974793492
2 9 20.82934053237346
2 10 21.53841368936541
2 11 22.018161086650508
2 12 22.64793409392082
2 13 23.1530217969407
2 14 21.25080310758536
2 15 20.629242391066345
2 16 20.087993670711402
2 17 18.5745157655044
2 18 15.634892457581637
2 19 18.011473668637095
2 20 21.97895318156716
2 21 22.934715533298963
2 22 22.65631225433132
2 23 23.96024022264538
c. 统计每小时载客出租车的利用率
统计每小时所有出租车载客的总时间和行驶的总时间,并计算出载客时间的比例,即为利用率。
hive -e \
"select day() as Day, hour() as Hour,
sum(case when =1 then unix_timestamp() - unix_timestamp() else 0 end) as Passenger,
sum(unix_timestamp() - unix_timestamp()) as Total,
sum(case when =1 then unix_timestamp() - unix_timestamp() else 0 end) / sum(unix_timestamp() - unix_timestamp()) as PassengerRatio
from (
select VehicleId, Time, PassengerState,
lead(Time) over (partition by VehicleId order by Time) as PostTime
from bigdata_cases.gps
) a
group by day(), hour()
order by Day, Hour;" \
得到结果:
Day Hour Passenger Total PassengerRatio
1 0 7095657 26718724 0.265568707547561
1 1 6064334 27223971 0.22275714296051813
1 2 4804754 26990898 0.17801386230276592
1 3 3741590 27248583 0.13731319533202882
1 4 1409216 27171164 0.051864395651213174
1 5 1529474 26881007 0.05689794284864402
1 6 2491744 27014891 0.09223594498308359
1 7 5491842 27089971 0.2027260198986555
1 8 11465755 27143530 0.4224120812584067
1 9 15124532 27521886 0.5495456234358358
1 10 14689703 27472348 0.53470868234488
1 11 14310622 27422351 0.5218597778140904
1 12 13357718 28312497 0.4717958292410592
1 13 11311445 27270035 0.41479393040749674
1 14 13089708 27595631 0.4743398692350974
1 15 14068412 27634367 0.5090911617407412
1 16 14751893 27394349 0.5385013164576388
1 17 13556173 27415544 0.49447032676061436
1 18 13071200 27576153 0.4740037524450927
1 19 12485630 27933084 0.44698358405394834
1 20 11268531 27776822 0.4056810746744174
1 21 12256920 27547401 0.4449392521639337
1 22 84651975 208940995 0.40514775475248405
1 23 8388 41022 0.20447564721369021
2 0 12702 35978 0.3530490855522819
2 1 12478 39939 0.3124264503367636
2 2 1115170 9534887 0.11695681343680318
2 3 424528 7926314 0.05355932152069676
2 4 875639 18153132 0.04823624925990733
2 5 949100 17638080 0.05380971171465375
2 7 3919390 9547005 0.4105360791159112
2 8 14705378 28521412 0.5155908129653609
2 9 14001764 26725269 0.5239148013814192
2 10 13872016 27304917 0.508040951012596
2 11 12864680 27323762 0.47082389313740913
2 12 12156640 27130169 0.4480856717110756
2 13 11046548 27102516 0.4075838567902699
2 14 12666754 27173682 0.4661405105130766
2 15 13725740 27160821 0.5053507035004575
2 16 14459637 27551027 0.5248311433181783
2 17 12361111 26985433 0.45806606104856645
2 18 13820382 27239384 0.5073676409128782
2 19 14369189 27257101 0.5271723137394545
2 20 12115073 27100794 0.4470375665008191
2 21 12152858 27112889 0.4482317616540237
2 22 12834472 27062913 0.47424576947795677
2 23 10539314 26607742 0.3960995262206015
d. 得到各出租车在3时、8时和13时的瞬时载客情况
得到各出租车在3时、8时和13时第一次同步数据时的载客情况,为之后做数据可视化做准备。
hive -e \
"select , , ,
from (
select VehicleId, Longitude, Latitude, PassengerState,
row_number() over (partition by VehicleId order by Time) as Rank
from bigdata_cases.gps
where day(Time)=1 and hour(Time)=3
) a
where =1;" \
hive -e \
"select , , ,
from (
select VehicleId, Longitude, Latitude, PassengerState,
row_number() over (partition by VehicleId order by Time) as Rank
from bigdata_cases.gps
where day(Time)=1 and hour(Time)=8
) a
where =1;" \
hive -e \
"select , , ,
from (
select VehicleId, Longitude, Latitude, PassengerState,
row_number() over (partition by VehicleId order by Time) as Rank
from bigdata_cases.gps
where day(Time)=1 and hour(Time)=13
) a
where =1;" \
e. 得到各载客出租车在3时、8时和13时的瞬时速度
得到各载客出租车在3时、8时和13时第一次同步数据时的瞬时速度,为之后做数据可视化做准备。
hive -e \
"select , , ,
from (
select VehicleId, Longitude, Latitude, Speed,
row_number() over (partition by VehicleId order by Time) as Rank
from bigdata_cases.gps
where day(Time)=1 and hour(Time)=3 and PassengerState=1
) a
where =1;" \
hive -e \
"select , , ,
from (
select VehicleId, Longitude, Latitude, Speed,
row_number() over (partition by VehicleId order by Time) as Rank
from bigdata_cases.gps
where day(Time)=1 and hour(Time)=8 and PassengerState=1
) a
where =1;" \
>
hive -e \
"select , , ,
from (
select VehicleId, Longitude, Latitude, Speed,
row_number() over (partition by VehicleId order by Time) as Rank
from bigdata_cases.gps
where day(Time)=1 and hour(Time)=13 and PassengerState=1
) a
where =1;" \
- 用R语言做数据可视化
a. 载入相关程序包
载入相关程序包。将Hive输出的结果文件复制到R语言可访问的路径如“D:\workspace\”。
执行以下R语言代码:
> library(ggplot2)
> library(ggmap)
> library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
其中,由于程序包ggmap所用的是谷歌的地图服务,需要修改程序包中源代码“get_googlemap.R”和“”中谷歌API的地址,将URL从“/maps/api/”替换为“/maps/api/”,并重新用Rtools编译打包。本实验系统中的RStudio已完成修改,不需要再做替换。
b. 可视化每小时出租车的平均速度和利用率
画出每小时出租车的平均速度和利用率的曲线图,其中横坐标表示小时,纵坐标表示出租车利用率,颜色表示日期,线条粗细表示平均速度。
> data2 <- ("D:/workspace/", sep = "\t")
> data3 <- ("D:/workspace/", sep = "\t")
> names(data2) <- c("Day", "Hour", "Speed")
> names(data3) <- c("Day", "Hour", "Passenger", "Total", "PassengerRatio")
> data23 <- cbind(data2, data3[5])
> data23$Day <- (data23$Day)
> ggplot(data23, aes(x = Hour, y = PassengerRatio, group = Day)) + geom_line(aes(colour = Day,
+ size = Speed))
b. 画出3时、8时和13时出租车载客情况的地理信息图
画出3时、8时和13时出租车载客情况的地理信息图,其中颜色表示载客情况,红色为空车,蓝色为载客。需要对经度做+0.006的校正,对维度做−0.003的校正。
> data4.1 <- ("D:/workspace/", sep = "\t")
> names(data4.1) <- c("VehicleId", "Longitude", "Latitude", "PassengerState")
> data4.1$PassengerState <- (data4.1$PassengerState)
> ggmap(get_googlemap(center = "Nanjing", zoom = 12, maptype = "roadmap"), extent = "device") +
+ geom_point(data = data4.1, aes(x = Longitude + 0.006, y = Latitude - 0.003,
+ colour = PassengerState), alpha = 0.5, size = 2) + labs(title = "2010/9/1 3:00")
Map from URL : /maps/api/staticmap?center=Nanjing&zoom=12&size=640x640&maptype=roadmap&sensor=false
Information from URL : /maps/api/geocode/json?address=Nanjing&sensor=false
Warning: Removed 1051 rows containing missing values (geom_point).
> data4.2 <- ("D:/workspace/", sep = "\t")
> names(data4.2) <- c("VehicleId", "Longitude", "Latitude", "PassengerState")
> data4.2$PassengerState <- (data4.2$PassengerState)
> ggmap(get_googlemap(center = "Nanjing", zoom = 12, maptype = "roadmap"), extent = "device") +
+ geom_point(data = data4.2, aes(x = Longitude + 0.006, y = Latitude - 0.003,
+ colour = PassengerState), alpha = 0.5, size = 2) + labs(title = "2010/9/1 8:00")
Map from URL : /maps/api/staticmap?center=Nanjing&zoom=12&size=640x640&maptype=roadmap&sensor=false
Warning: Removed 1078 rows containing missing values (geom_point).
> data4.3 <- ("D:/workspace/", sep = "\t")
> names(data4.3) <- c("VehicleId", "Longitude", "Latitude", "PassengerState")
> data4.3$PassengerState <- (data4.3$PassengerState)
> ggmap(get_googlemap(center = "Nanjing", zoom = 12, maptype = "roadmap"), extent = "device") +
+ geom_point(data = data4.3, aes(x = Longitude + 0.006, y = Latitude - 0.003,
+ colour = PassengerState), alpha = 0.5, size = 2) + labs(title = "2010/9/1 13:00")
Map from URL : /maps/api/staticmap?center=Nanjing&zoom=12&size=640x640&maptype=roadmap&sensor=false
Warning: Removed 888 rows containing missing values (geom_point).
c. 画出3时、8时和13时载客出租车瞬时速度的地理信息图
画出3时、8时和13时载客出租车瞬时速度的地理信息图,其中颜色表示瞬时速度,越接近红色速度越慢,越接近绿色速度越快。仅考虑车速小于等于80码且大于0的样本。需要对经度做+0.006的校正,对维度做−0.003的校正。
> data5.1 <- ("D:/workspace/", sep = "\t")
> names(data5.1) <- c("VehicleId", "Longitude", "Latitude", "Speed")
> summary(data5.1)
VehicleId Longitude Latitude Speed
Min. :8.064e+11 Min. : 0.0 Min. : 0.00 Min. : 0.00
1st Qu.:8.066e+11 1st Qu.:118.8 1st Qu.:32.03 1st Qu.: 0.00
Median :8.068e+11 Median :118.8 Median :32.05 Median : 7.00
Mean :8.067e+11 Mean :117.8 Mean :31.78 Mean : 16.98
3rd Qu.:8.068e+11 3rd Qu.:118.8 3rd Qu.:32.08 3rd Qu.: 27.00
Max. :8.088e+11 Max. :119.9 Max. :38.86 Max. :196.00
> data5.1 <- filter(data5.1, Speed > 0 & Speed <= 80)
> ggmap(get_googlemap(center = "Nanjing", zoom = 12, maptype = "roadmap"), extent = "device") +
+ geom_point(data = data5.1, aes(x = Longitude + 0.006, y = Latitude - 0.003,
+ colour = Speed), alpha = 0.5, size = 2) + scale_colour_gradient(limits = c(0,
+ 80), trans = "sqrt", low = "red", high = "green") + labs(title = "2010/9/1 3:00")
Map from URL : /maps/api/staticmap?center=Nanjing&zoom=12&size=640x640&maptype=roadmap&sensor=false
Warning: Removed 78 rows containing missing values (geom_point).
> data5.2 <- ("D:/workspace/", sep = "\t")
> names(data5.2) <- c("VehicleId", "Longitude", "Latitude", "Speed")
> data5.2 <- filter(data5.2, Speed > 0 & Speed <= 80)
> ggmap(get_googlemap(center = "Nanjing", zoom = 12, maptype = "roadmap"), extent = "device") +
+ geom_point(data = data5.2, aes(x = Longitude + 0.006, y = Latitude - 0.003,
+ colour = Speed), alpha = 0.5, size = 2) + scale_colour_gradient(limits = c(0,
+ 80), trans = "sqrt", low = "red", high = "green") + labs(title = "2010/9/1 8:00")
Map from URL : /maps/api/staticmap?center=Nanjing&zoom=12&size=640x640&maptype=roadmap&sensor=false
Warning: Removed 345 rows containing missing values (geom_point).
> data5.3 <- ("D:/workspace/", sep = "\t")
> names(data5.3) <- c("VehicleId", "Longitude", "Latitude", "Speed")
> data5.3 <- filter(data5.3, Speed > 0 & Speed <= 80)
> ggmap(get_googlemap(center = "Nanjing", zoom = 12, maptype = "roadmap"), extent = "device") +
+ geom_point(data = data5.3, aes(x = Longitude + 0.006, y = Latitude - 0.003,
+ colour = Speed), alpha = 0.5, size = 2) + scale_colour_gradient(limits = c(0,
+ 80), trans = "sqrt", low = "red", high = "green") + labs(title = "2010/9/1 13:00")
Map from URL : /maps/api/staticmap?center=Nanjing&zoom=12&size=640x640&maptype=roadmap&sensor=false
Warning: Removed 280 rows containing missing values (geom_point).