In many of the models that we use, we assume independent observations. When we have time series data, this assumption is incorrect. A time series is a set of data with sequential observations recorded over regular time intervals. Individual observations within a time series are correlated with prior observations. Time series data are taken at widely varying intervals, such as yearly, quarterly, monthly, daily, etc.; however, you can use similar methods to analyze the data.
Any time individual observations are correlated by time. The trends, seasonality, and other factors that bear on time series data can be extracted effectively to decouple the time components of the data.
Time series allows you to examine when important events occur that may not be easily visible with other forms of analysis. Time series analysis also allows you to make predictions about systems, an important tool in natural systems with management implications.
We’re going to examine data using two popular time series analyses: An ARIMA forecast and a Fourier transform.
ARIMA stands for Autoregressive Integrated Moving Average.
AR = Autoregressive parameter, denoted by p in the model. This value determines the autocorrelation in the time series.
I = Integrated parameter, denoted by d in the model. This value indicates the number of differences that we needed to take in order to make the time series stationary.
MA = Moving Average parameter, denoted by q in the model. This value relates the auto-correlation and lag. Different values denote auto-correlation with varying lags.
Essentially, an ARIMA model uses differencing to detrend data in order to make reasonable predictions about the future. Let’s use the following temperature data to make a prediction about the future.
First, let’s install some libraries, in case you don’t already have them:
#install.packages(c("forecast","tseries", "lubridate", "tidyverse"))
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.3
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(forecast)
## Registered S3 method overwritten by 'xts':
## method from
## as.zoo.xts zoo
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Registered S3 methods overwritten by 'forecast':
## method from
## fitted.fracdiff fracdiff
## residuals.fracdiff fracdiff
library(tseries)
Now import the data we’ll be using. This data comes from Google Earth Engine.
#import data
df <- read_csv("CENMET.csv")
## Parsed with column specification:
## cols(
## date = col_character(),
## prcp = col_double(),
## tmin = col_double(),
## tmax = col_double(),
## elev = col_double()
## )
str(df)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 6939 obs. of 5 variables:
## $ date: chr "2000/1/1" "2000/1/2" "2000/1/3" "2000/1/4" ...
## $ prcp: num 16.944 10.47 2.214 31.28 0.891 ...
## $ tmin: num 271 270 272 272 268 ...
## $ tmax: num 277 275 278 279 277 ...
## $ elev: num 1126 1126 1126 1126 1126 ...
## - attr(*, "spec")=
## .. cols(
## .. date = col_character(),
## .. prcp = col_double(),
## .. tmin = col_double(),
## .. tmax = col_double(),
## .. elev = col_double()
## .. )
The date reads as a character, so we need to change it to a date element before we can create our time series element.
df$date <- ymd(df$date)
str(df)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 6939 obs. of 5 variables:
## $ date: Date, format: "2000-01-01" "2000-01-02" ...
## $ prcp: num 16.944 10.47 2.214 31.28 0.891 ...
## $ tmin: num 271 270 272 272 268 ...
## $ tmax: num 277 275 278 279 277 ...
## $ elev: num 1126 1126 1126 1126 1126 ...
## - attr(*, "spec")=
## .. cols(
## .. date = col_character(),
## .. prcp = col_double(),
## .. tmin = col_double(),
## .. tmax = col_double(),
## .. elev = col_double()
## .. )
Looks good. Now, we can glance at the data.
ggplot(df, aes(date, tmax)) + geom_line() + scale_x_date('month') +
ylab("Tmax") +
xlab("") +
theme_bw()
This data set gives temperature in Kelvin, which is why these numbers look so high. We could adjust that, but there’s no real point for this analysis. Besdies that, there’s a lot of data here and it can be a little difficult to maneuver with it. We’re going to cut it down to just a few years of data.
df1 <- df[c(1:1100),]
ggplot(df1, aes(date, tmax)) + geom_line() + scale_x_date('month') +
ylab("Tmax [K]") +
xlab("") +
theme_bw()
Let’s add a month and a year column for our time series analysis.
#add month again for TS analysis
df1$month<- month(df1$date)
df1$year <- year(df1$date)
Now, let’s quickly check the range and look for any outliers. (Note: ARIMA assumes that there are no outliers.)
#plot month over month DO to see the range and outliers -- We see some variation within months between years
ggplot(df1, aes(date, tmax)) + geom_point(color = "navyblue") +
facet_wrap(~month) + scale_x_date('month') +
ylab("Tmax [K]") +
xlab("Wrap by month for all years")+
theme_bw()
#Create a time series object based on Tmax to pass to tsclean()
TS.Tmax <- ts(df1[,c("tmax")])
Within the forecast package, tsclean() takes care of any outliers and missing data, ensuring that we do not run into any issues with out time series.
#tsclean() function to ID and replace outliers and input missing values if they exist
df1$clean_count <- tsclean(TS.Tmax)
#Graph cleaned data - ignore type as scale warning
ggplot() +
geom_line(data = df1, aes(x = date, y = clean_count)) +
ylab('Clean Count') +
theme_bw()
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
The data are still quite variable, so it may be useful to eliminate some of that variability. Let’s create some moving averages in order to smooth out that variability.
#Get monthly and weekly moving averages (MA) and compare to cleaned daily data which
#still has a lot of variance and volatility in it.
df1$plotdata.ma7 <- ma(df1$clean_count, order=7)
df1$plotdata.ma30 <- ma(df1$clean_count, order=30)
Graphing original data, and weekly and monthly moving averages
ts1 <- ggplot() +
geom_line(data = df1, aes(x = date, y = clean_count, colour = "Counts")) +
geom_line(data = df1, aes(x = date, y = plotdata.ma7, colour = "Weekly Moving Average")) +
geom_line(data = df1, aes(x = date, y = plotdata.ma30, colour = "Monthly Moving Average")) +
ylab('Tmax [K]') +
xlab('Date') +
theme_bw()
ts1
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
## Warning: Removed 6 rows containing missing values (geom_path).
## Warning: Removed 30 rows containing missing values (geom_path).
We can see the highly variable daily time step in red. The slightly smoothed weekly average in blue eliminates some of that initial variability. The monthly moving average is quite smoothed and has likely removed too much from the data.
Now, we take seaonsality, trend, and cycle into account with functions from forecast. Essentially, we are finding all of the sources of variance from our observations.
count_ma = ts(na.omit(df1$plotdata.ma7), frequency = 52)
decomp = stl(count_ma, s.window = "periodic")
deseasonal_cnt <- seasadj(decomp) #This function removes the seasonal component of the data
plot(decomp)
Let’s walk through this new plot. The “data” component is the weekly moving average data that we plotted previously. The “seasonal” component takes seasonal patterns into account. This is important for our data because there is an obvious seasonal pattern. The “trend” component takes trend into account. In our case, it sees the seasonal pattern as a trend. The “remainder” component is the remaining variation in the data after seasonality and trending are removed.
NOTE: We can also accomplish this task with the decompose() function. In this case, we need to note whether the time series should be modeled as additive or multiplicative. As the names indicate, the series can either be expressed as a sum or product of the components of the above graph.
Additive time series Value = Base level + Trend + Seasonality + Error (i.e. random or remainder in the plots)
Multiplicative time series Value = Base Level * Trend * Seasonality * Error (i.e. random or remainder in the plots)
df2 <- decompose(count_ma, type = "additive")
plot(df2)
We can use this information to remove the seasonal component and the random component of the data
deseasonal_df <- count_ma - df2$seasonal
detrend_df <- count_ma - df2$trend
derandomize_df <- count_ma - df2$random
par(mfrow=c(1,3))
plot(count_ma, main = "Temp time series")
plot(detrend_df, main = "Deseasonalized")
plot(derandomize_df, main = "Random component removed")
Let’s remove the random and seasonal components and be left with the trend.
df3 <- deseasonal_df - df2$random
plot(df3)
NOTE: The seasadj() function from above does the work that we just did without the hassle. So we’ll proceed from here.
adf.test(count_ma, alternative = "stationary")
##
## Augmented Dickey-Fuller Test
##
## data: count_ma
## Dickey-Fuller = -2.4612, Lag order = 10, p-value = 0.3831
## alternative hypothesis: stationary
We cannot reject the null hypothesis and we classify the data as non-stationary
In order to build a more accurate model, we need to look for spikes at specific lag points.
Essentially, we are now building out the ARIMA model (recall, AR = p, I = d, MA = q). We will now try to explore the best values for the ARIMA function to generate a reasonable forecast.
#ACF plots display correlation between a series and its lags
Acf(count_ma, main = '')
#PACF plots display correlation between a series and its lags that explained by previous lags
Pacf(count_ma, main = '')
We now attempt to get the “Integrated” portion of the ARIMA model by differencing
count_d1 = diff(deseasonal_cnt, differences = 1) #difference of 1 is sufficient
plot(count_d1)
Let’s test the stationarity now.
adf.test(count_d1, alternative = "stationary")
## Warning in adf.test(count_d1, alternative = "stationary"): p-value smaller
## than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: count_d1
## Dickey-Fuller = -10.057, Lag order = 10, p-value = 0.01
## alternative hypothesis: stationary
#with difference of 1 data, we bring the
#Dickey-Fuller test to -10.201,
#Lag order to 10 and
#p value to 0.01 (or less)
With a single difference, we have made the Dickey-Fuller statistic more negative (better fit) and we can reject the null hypothesis and call this time series “stationary.” You can try different differencing and see how that affects the Dickey-Fuller statistic.
Now, we’ll look for lag points that cross the confidence intervals.
#Look for spikes at specific lag points of the differenced series
par(mfrow=c(2,1))
Acf(count_d1, main='ACF for differenced Series')
Pacf(count_d1, main = 'PACF for differenced Series')
First, let’s take a look without the autoarima function and try to visualize the lags
#Evaluate and iterate - does the model make sense?
fit <- auto.arima(deseasonal_cnt, seasonal = FALSE)
tsdisplay(residuals(fit), lag.max = 60, main = "(1,1,1) Model Residuals") #high lag max to make sure we're seeing all of the lags
We can see that lags are still present under the default pdq values in the auto.arima function. There appear to be significant lags at 7
#graph shows serious lags at 7, so modify for q = 7
fit2 = arima(deseasonal_cnt, order=c(1,1,7))
tsdisplay(residuals(fit2), lag.max = 20, main = "Model Residuals")
#Not perfect yet, but getting better
Let’s now take a look at the fit of our model using known data
#Test model performance with a holdout set
hold <- window(ts(deseasonal_cnt), start = 991)
fit_no_holdout = arima(ts(deseasonal_cnt[-c(991:1090)]), order = c(1,1,7))
fcast_no_holdout <- forecast(fit_no_holdout,h=100)
plot(fcast_no_holdout, main='ARIMA Forecast')
lines(ts(deseasonal_cnt))
This model does not fit very well. The observed data moves outside of the confidence intervals and the straight line behavior is inconsistent with observations. Let’s try a different model fit and see how it looks.
fit3 = arima(deseasonal_cnt, order=c(1,1,12))
tsdisplay(residuals(fit3), lag.max = 20, main = "Model Residuals")