摘要: 仅用于记录R语言学习过程:
内容提要:
时间与日期数据的处理;
lubridate包;
时间序列介绍及举例
正文:
时间与日期数据的处理
n 导读:
u 时间生成函数:as.Date()
> as.Date(\'2017-02-16\')
[1] "2017-02-16"
> class(as.Date(\'2017-02-16\'))
[1] "Date" #由字符串变成日期
> as.Date(\'20170301\', format = \'%Y%m%d\') #格式要一一对应
[1] "2017-03-01" 格式符号
> as.Date(100,origin = \'2018-08-07\')
[1] "2018-11-15" #从2018-08-07开始100天
格式符号:
%m :月份
%b :对应月份的简称,如Jan
%B:对应月份的全称,如July
%Y:四位数年份
%y:二位数年份
%d: 对应天
u 时间生成函数:ISOdate()
> ISOdate(2018,4,21,15,21,52)
[1] "2018-04-21 15:21:52 GMT"
u as.POSIXlt()函数:
> dts <- c(\'2005-10-21 18:24:24\',\'2017-02-16 19:20:20\')
> as.POSIXlt(dts)
[1] "2005-10-21 18:24:24 CST" "2017-02-16 19:20:20 CST"
format 参数:
%a : Mon;
%A: Monday
%b: jan
%B: July
u strptime()函数:输出日期
u strftime()函数:将日期进行格式化
n julian()函数:计算日期之差
> julian(as.Date(\'2017-02-16\')) #默认是1970-01-01
[1] 17213
attr(,"origin")
[1] "1970-01-01"
origin 可以更改时间起点
> julian(as.Date(\'2017-02-16\'),origin = as.Date(\'2016-02-19\'))
[1] 363
attr(,"origin")
[1] "2016-02-19"
n difftime()函数:计算时间之差
> difftime(as.Date(\'2017-02-16\'), as.Date(\'2016-02-19\'),units = \'weeks\')
Time difference of 51.85714 weeks
n mean() :计算日期的均值
> mean(c(as.Date(\'2017-02-16\'), as.Date(\'2016-02-19\')))
[1] "2016-08-18"
n seq()函数生成时间序列:
> dateline <- seq(as.Date(\'2016-02-19\'),by = \'days\',length.out = 10)
> dateline
[1] "2016-02-19" "2016-02-20" "2016-02-21" "2016-02-22" "2016-02-23" "2016-02-24"
[7] "2016-02-25" "2016-02-26" "2016-02-27" "2016-02-28"
> dateline <- seq(as.Date(\'2016-02-19\'),by = \'weeks\',length.out = 10)
> dateline
[1] "2016-02-19" "2016-02-26" "2016-03-04" "2016-03-11" "2016-03-18" "2016-03-25"
[7] "2016-04-01" "2016-04-08" "2016-04-15" "2016-04-22"
> dateline <- seq(as.Date(\'2016-02-19\'),by = \'2 days\',length.out = 10)
> dateline
[1] "2016-02-19" "2016-02-21" "2016-02-23" "2016-02-25" "2016-02-27" "2016-02-29"
[7] "2016-03-02" "2016-03-04" "2016-03-06" "2016-03-08"
n stringi包中的时间生成函数:stri_datetime_add()函数
> library(stringi)
> my_newtime <- stri_datetime_add(as.Date(\'2017-02-16\'),value = 10,units =\'days\') #value= 10 指10天后,因为units设置的为天
> my_newtime
[1] "2017-02-26 08:00:00 CST"
n 创建时间(也用stringi包):stri_datetime_create()
stri_datetime_create(2014,4,20)
n 时间日期变量解析函数:将字符串转化成特定格式的日期时间变量;或者反过来:stri_datetime_parse()函数
> stri_datetime_parse(c(\'2015-02-27\',\'2015-02-28\'),\'yyyy-MM-dd\')
[1] "2015-02-27 21:28:56 CST" "2015-02-28 21:28:56 CST"
> stri_datetime_parse(c(\'2015-02-27\',\'2015-02-30\'),\'yyyy-MM-dd\')
[1] "2015-02-27 21:29:51 CST" NA #去掉错误日期
> stri_datetime_parse(c(\'2015-02-27\',\'2015-02-30\'),\'yyyy-MM-dd\',lenient = T)
[1] "2015-02-27 21:30:29 CST" "2015-03-02 21:30:29 CST" #自动校正
> stri_datetime_parse(\'2016-6-6\',\'yyyy-M-d\')
[1] "2016-06-06 09:02:23 CST"
lubridate包
n ymd()函数:生成时间变量的函数,顺序为年月日
> ymd(\'20170204\')
[1] "2017-02-04"
> ymd(\'020217\')
[1] "2002-02-17"
> x <- c(\'2009s01s01\',\'2009-02-12\',\'2009 01 30\',\'200814\',\'09.1.1\',\'abd 09 12 08\',\'!!09 ## 12 $$ 12\')
> ymd(x)
[1] "2009-01-01" "2009-02-12" "2009-01-30" "2020-08-14" #可用分隔符 2008,1,4 即可读出来2008年
[5] "2009-01-01" "2009-12-08" "2009-12-12"
类似函数:mdy() 月日年,ymd_hms 时分秒
> ymd_hms(\'20170219142323\')
[1] "2017-02-19 14:23:23 UTC"
n 提取日期变量中的子集,如提取时间
> x_time <-ymd(x)
> month(x_time,label = T)
[1] 1月 2月 1月 8月 1月 12月 12月
12 Levels: 1月 < 2月 < 3月 < 4月 < 5月 < 6月 < ... < 12月
> month(x_time,label = F)
[1] 1 2 1 8 1 12 12
> month(x_time,label = T,abbr = F)
[1] 一月 二月 一月 八月 一月 十二月 十二月
12 Levels: 一月 < 二月 < 三月 < 四月 < 五月 < ... < 十二月
> day(x_time)
[1] 1 12 30 14 1 8 12
> mday(x_time)
[1] 1 12 30 14 1 8 12 #在一个月中是第几天
> wday(x_time) #在一个星期中是第几天
[1] 5 5 6 6 5 3 7
n 修改日期:
> new_date <- now()
> new_date
[1] "2018-08-08 09:26:07 CST"
>
> day(new_date) <- 17
> new_date
[1] "2018-08-17 09:26:07 CST"
>
> month(new_date) <- 12
> new_date
[1] "2018-12-17 09:26:07 CST"
n 生成日期;make_date()函数
> dates <- make_date(year = 2010:2016,month = 1:3,day = 1:5)
> dates
[1] "2010-01-01" "2011-02-02" "2012-03-03" "2013-01-04"
[5] "2014-02-05" "2015-03-01" "2016-01-02"
n 添加时间信息: make_datetime()函数,类似make_date()函数,但是多了时分秒的信息
n 计算时间的近似值:
> x_time <- as.POSIXct(\'2009-11-12 12:01:59\')
> round_date(x_time,unit = \'minute\') #unit可改成hour,month,half year等
[1] "2009-11-12 12:02:00 CST"
n 残缺日期的处理:
> time_t <- c(\'2017-02\',\'201609\',\'2017/5\')
> ymd(time_t)
[1] NA NA NA
Warning message:
All formats failed to parse. No formats found.
> ymd(time_t,truncated = 1) #补全 都补的1号
[1] "2017-02-01" "2016-09-01" "2017-05-01"
n 如何进行时间变量的操作:如加减乘除 先用interval(),再用time_length()函数
> inta <- interval(start =ymd(\'1900,01,01\'),end = ymd(\'1999,1231\'))
> time_length(inta,unit = \'year\') # year也可以改成day
[1] 99.99726
> x <- as.POSIXlt(\'2017-02-03\')
> x + days(10) + hours(12) + minutes(30)
[1] "2017-02-13 12:30:00 UTC"
时间序列
n 生成时间序列:ts()函数
> ts(1:10,frequency = 4,start = c(1998,2)) #10个元素,4个季度,起始季度:1998年第二个季度
Qtr1 Qtr2 Qtr3 Qtr4
1998 1 2 3
1999 4 5 6 7
2000 8 9 10
> ts(1:10,frequency = 12,start = c(1998,2)) #以月为节点,12
Feb Mar Apr May Jun Jul Aug Sep Oct Nov
1998 1 2 3 4 5 6 7 8 9 10
> ts(1:10,frequency = 12,start = c(1998,2),end = c(2001,3)) #起始和终止时间
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1998 1 2 3 4 5 6 7 8 9 10 1
1999 2 3 4 5 6 7 8 9 10 1 2 3
2000 4 5 6 7 8 9 10 1 2 3 4 5
2001 6 7 8
n 时间序列可视化:
> value <- ts(data = sample(0:300,366,replace = TRUE),start =
+ as.Date(\'2016-01-01\'),frequency = 1,end = as.Date(\'2016-12-31\'))
> date <- seq(from = as.Date(\'2016-01-01\'),by = 1,length.out = 366)
> df <- data.frame(value = value,time = date)
> head(df)
value time
1 242 2016-01-01
2 59 2016-01-02
3 21 2016-01-03
4 117 2016-01-04
5 162 2016-01-05
6 151 2016-01-06
> plot(ts(cumsum(1+round(rnorm(100),2)),start = c(1954,7),frequency = 12))
> plot(value)
n xts扩展包中的函数:生成时间序列:xts()函数
> value <- sample(0:100,365,replace = T)
> times <- seq(from = as.Date(\'2017-01-01\'),by = 1,length = 365)
>
> myts <- xts(value,times)
> head(myts)
[,1]
2017-01-01 21
2017-01-02 66
2017-01-03 95
2017-01-04 6
2017-01-05 43
2017-01-06 6
> class(myts)
[1] "xts" "zoo"
u 后续操作:window()函数:从时间序列提取子集,同时也具有赋值功能
> window(myts,start = as.Date(\'2017-01-10\'),end = as.Date(\'2017-02-01\'))
[,1]
2017-01-10 30
2017-01-11 77
2017-01-12 70
2017-01-13 93
2017-01-14 57
2017-01-15 32
2017-01-16 100
2017-01-17 7
2017-01-18 13
2017-01-19 36
2017-01-20 52
2017-01-21 97
2017-01-22 43
2017-01-23 94
2017-01-24 50
2017-01-25 84
2017-01-26 79
2017-01-27 57
2017-01-28 86
2017-01-29 16
2017-01-30 55
2017-01-31 87
2017-02-01 71
赋值:
> window(myts,start = as.Date(\'2017-01-10\'),end = as.Date(\'2017-01-15\')) <- 1:6
> window(myts,start = as.Date(\'2017-01-10\'),end = as.Date(\'2017-01-15\'))
[,1]
2017-01-10 1
2017-01-11 2
2017-01-12 3
2017-01-13 4
2017-01-14 5
2017-01-15 6
u lag()函数:计算滞后,把前一项的值赋给后一项
> head(lag(myts)) #滞后的
[,1]
2017-01-01 NA
2017-01-02 21
2017-01-03 66
2017-01-04 95
2017-01-05 6
2017-01-06 43
> head(myts) #原本的
[,1]
2017-01-01 21
2017-01-02 66
2017-01-03 95
2017-01-04 6
2017-01-05 43
2017-01-06 6
u diff()函数:计算离差值:后一项值减去前一项的值
> head(diff(myts)) #计算离差的
[,1]
2017-01-01 NA
2017-01-02 45
2017-01-03 29
2017-01-04 -89
2017-01-05 37
2017-01-06 -37
时间序列分析
n 《时间序列分析与应用》
n 以co2数据集为例
> data(\'co2\')
> class(co2)
[1] "ts"
> head(co2)
[1] 315.42 316.31 316.50 317.56 318.13 318.00
> training <- co2[1:400]
> class(training)
[1] "numeric"
> ts_training <- ts(training,start = start(co2),frequency = frequency(co2)) #转换成时间序列
> plot(ts_training) #此处可视化
> de_co2 <- decompose(ts_training) #分解
> plot(de_co2) #此处可视化
#建模分析 采用SARIMA模型进行分析
#ARIMA(P 自相关的阶数 ,D 差分的阶数,Q 滑动平均的阶数
# 加载tseries包,运用kpss.test函数判断序列的平稳性
> training <- co2[1:400]
> ts_training <- ts(training,start = start(co2),frequency = frequency(co2))
> kpss.test(ts_training)
KPSS Test for Level Stationarity
data: ts_training
KPSS Level = 7.9126, Truncation lag parameter = 4,
p-value = 0.01 #说明序列是不平稳的
Warning message:
In kpss.test(ts_training) : p-value smaller than printed p-value
#把序列变平稳,用差分的方法处理
> ts_training_diff <- diff(ts_training)
> head(ts_training_diff)
[1] 0.89 0.19 1.06 0.57 -0.13 -1.61
#再次看序列是否平稳
> kpss.test(ts_training_diff) #kpss.test 确定了D 由于是用了一阶差分,所以D为1
KPSS Test for Level Stationarity
data: ts_training_diff
KPSS Level = 0.020359, Truncation lag parameter = 4, p-value =
0.1 #说明序列已经平稳了
Warning message:
In kpss.test(ts_training_diff) : p-value greater than printed p-value
> acf(ts_training) #acf自相关系数,确定P值 p =1
> pacf(ts_training) #偏自相关系数,确定Q值, Q = 1
#模型表达式: SARIMA(1,1,1)*(1,1,1)12 (12为frequency)
> co2_fit <-Arima(ts_training,order = c(1,1,1),
+ seasonal = list(order = c(1,1,1),period =12)) #Arima模型,1,1,1 PDQ
> co2_fore <- forecast(co2_fit,68)
> plot(co2,col= \'red\')
> par(new = T)
> plot(co2_fore)