In many of the models that we use, we assume independent observations. When we have time series data, this assumption is incorrect. A time series is a set of data with sequential observations recorded over regular time intervals. Individual observations within a time series are correlated with prior observations. Time series data are taken at widely varying intervals, such as yearly, quarterly, monthly, daily, etc.; however, you can use similar methods to analyze the data.
Any time individual observations are correlated by time. The trends, seasonality, and other factors that bear on time series data can be extracted effectively to decouple the time components of the data.
Time series allows you to examine when important events occur that may not be easily visible with other forms of analysis. Time series analysis also allows you to make predictions about systems, an important tool in natural systems with management implications.
We’re going to examine data using two popular time series analyses: An ARIMA forecast and a Fourier transform.
ARIMA stands for Autoregressive Integrated Moving Average.
AR = Autoregressive parameter, denoted by p in the model. This value determines the autocorrelation in the time series.
I = Integrated parameter, denoted by d in the model. This value indicates the number of differences that we needed to take in order to make the time series stationary.
MA = Moving Average parameter, denoted by q in the model. This value relates the auto-correlation and lag. Different values denote auto-correlation with varying lags.
Essentially, an ARIMA model uses differencing to detrend data in order to make reasonable predictions about the future. Let’s use the following temperature data to make a prediction about the future.
First, let’s install some libraries, in case you don’t already have them:
#install.packages(c("forecast","tseries", "lubridate", "tidyverse"))
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.3
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(forecast)
## Registered S3 method overwritten by 'xts':
## method from
## as.zoo.xts zoo
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Registered S3 methods overwritten by 'forecast':
## method from
## fitted.fracdiff fracdiff
## residuals.fracdiff fracdiff
library(tseries)
Now import the data we’ll be using. This data comes from Google Earth Engine.
#import data
df <- read_csv("CENMET.csv")
## Parsed with column specification:
## cols(
## date = col_character(),
## prcp = col_double(),
## tmin = col_double(),
## tmax = col_double(),
## elev = col_double()
## )
str(df)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 6939 obs. of 5 variables:
## $ date: chr "2000/1/1" "2000/1/2" "2000/1/3" "2000/1/4" ...
## $ prcp: num 16.944 10.47 2.214 31.28 0.891 ...
## $ tmin: num 271 270 272 272 268 ...
## $ tmax: num 277 275 278 279 277 ...
## $ elev: num 1126 1126 1126 1126 1126 ...
## - attr(*, "spec")=
## .. cols(
## .. date = col_character(),
## .. prcp = col_double(),
## .. tmin = col_double(),
## .. tmax = col_double(),
## .. elev = col_double()
## .. )
The date reads as a character, so we need to change it to a date element before we can create our time series element.
df$date <- ymd(df$date)
str(df)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 6939 obs. of 5 variables:
## $ date: Date, format: "2000-01-01" "2000-01-02" ...
## $ prcp: num 16.944 10.47 2.214 31.28 0.891 ...
## $ tmin: num 271 270 272 272 268 ...
## $ tmax: num 277 275 278 279 277 ...
## $ elev: num 1126 1126 1126 1126 1126 ...
## - attr(*, "spec")=
## .. cols(
## .. date = col_character(),
## .. prcp = col_double(),
## .. tmin = col_double(),
## .. tmax = col_double(),
## .. elev = col_double()
## .. )
Looks good. Now, we can glance at the data.
ggplot(df, aes(date, tmax)) + geom_line() + scale_x_date('month') +
ylab("Tmax") +
xlab("") +
theme_bw()
This data set gives temperature in Kelvin, which is why these numbers look so high. We could adjust that, but there’s no real point for this analysis. Besdies that, there’s a lot of data here and it can be a little difficult to maneuver with it. We’re going to cut it down to just a few years of data.
df1 <- df[c(1:1100),]
ggplot(df1, aes(date, tmax)) + geom_line() + scale_x_date('month') +
ylab("Tmax [K]") +
xlab("") +
theme_bw()
Let’s add a month and a year column for our time series analysis.
#add month again for TS analysis
df1$month<- month(df1$date)
df1$year <- year(df1$date)
Now, let’s quickly check the range and look for any outliers. (Note: ARIMA assumes that there are no outliers.)
#plot month over month DO to see the range and outliers -- We see some variation within months between years
ggplot(df1, aes(date, tmax)) + geom_point(color = "navyblue") +
facet_wrap(~month) + scale_x_date('month') +
ylab("Tmax [K]") +
xlab("Wrap by month for all years")+
theme_bw()
#Create a time series object based on Tmax to pass to tsclean()
TS.Tmax <- ts(df1[,c("tmax")])
Within the forecast package, tsclean() takes care of any outliers and missing data, ensuring that we do not run into any issues with out time series.
#tsclean() function to ID and replace outliers and input missing values if they exist
df1$clean_count <- tsclean(TS.Tmax)
#Graph cleaned data - ignore type as scale warning
ggplot() +
geom_line(data = df1, aes(x = date, y = clean_count)) +
ylab('Clean Count') +
theme_bw()
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
The data are still quite variable, so it may be useful to eliminate some of that variability. Let’s create some moving averages in order to smooth out that variability.
#Get monthly and weekly moving averages (MA) and compare to cleaned daily data which
#still has a lot of variance and volatility in it.
df1$plotdata.ma7 <- ma(df1$clean_count, order=7)
df1$plotdata.ma30 <- ma(df1$clean_count, order=30)
Graphing original data, and weekly and monthly moving averages
ts1 <- ggplot() +
geom_line(data = df1, aes(x = date, y = clean_count, colour = "Counts")) +
geom_line(data = df1, aes(x = date, y = plotdata.ma7, colour = "Weekly Moving Average")) +
geom_line(data = df1, aes(x = date, y = plotdata.ma30, colour = "Monthly Moving Average")) +
ylab('Tmax [K]') +
xlab('Date') +
theme_bw()
ts1
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
## Warning: Removed 6 rows containing missing values (geom_path).
## Warning: Removed 30 rows containing missing values (geom_path).
We can see the highly variable daily time step in red. The slightly smoothed weekly average in blue eliminates some of that initial variability. The monthly moving average is quite smoothed and has likely removed too much from the data.
Now, we take seaonsality, trend, and cycle into account with functions from forecast. Essentially, we are finding all of the sources of variance from our observations.
count_ma = ts(na.omit(df1$plotdata.ma7), frequency = 52)
decomp = stl(count_ma, s.window = "periodic")
deseasonal_cnt <- seasadj(decomp) #This function removes the seasonal component of the data
plot(decomp)
Let’s walk through this new plot. The “data” component is the weekly moving average data that we plotted previously. The “seasonal” component takes seasonal patterns into account. This is important for our data because there is an obvious seasonal pattern. The “trend” component takes trend into account. In our case, it sees the seasonal pattern as a trend. The “remainder” component is the remaining variation in the data after seasonality and trending are removed.
NOTE: We can also accomplish this task with the decompose() function. In this case, we need to note whether the time series should be modeled as additive or multiplicative. As the names indicate, the series can either be expressed as a sum or product of the components of the above graph.
Additive time series Value = Base level + Trend + Seasonality + Error (i.e. random or remainder in the plots)
Multiplicative time series Value = Base Level * Trend * Seasonality * Error (i.e. random or remainder in the plots)
df2 <- decompose(count_ma, type = "additive")
plot(df2)
We can use this information to remove the seasonal component and the random component of the data
deseasonal_df <- count_ma - df2$seasonal
detrend_df <- count_ma - df2$trend
derandomize_df <- count_ma - df2$random
par(mfrow=c(1,3))
plot(count_ma, main = "Temp time series")
plot(detrend_df, main = "Deseasonalized")
plot(derandomize_df, main = "Random component removed")
Let’s remove the random and seasonal components and be left with the trend.
df3 <- deseasonal_df - df2$random
plot(df3)
NOTE: The seasadj() function from above does the work that we just did without the hassle. So we’ll proceed from here.
adf.test(count_ma, alternative = "stationary")
##
## Augmented Dickey-Fuller Test
##
## data: count_ma
## Dickey-Fuller = -2.4612, Lag order = 10, p-value = 0.3831
## alternative hypothesis: stationary
We cannot reject the null hypothesis and we classify the data as non-stationary
In order to build a more accurate model, we need to look for spikes at specific lag points.
Essentially, we are now building out the ARIMA model (recall, AR = p, I = d, MA = q). We will now try to explore the best values for the ARIMA function to generate a reasonable forecast.
#ACF plots display correlation between a series and its lags
Acf(count_ma, main = '')
#PACF plots display correlation between a series and its lags that explained by previous lags
Pacf(count_ma, main = '')
We now attempt to get the “Integrated” portion of the ARIMA model by differencing
count_d1 = diff(deseasonal_cnt, differences = 1) #difference of 1 is sufficient
plot(count_d1)
Let’s test the stationarity now.
adf.test(count_d1, alternative = "stationary")
## Warning in adf.test(count_d1, alternative = "stationary"): p-value smaller
## than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: count_d1
## Dickey-Fuller = -10.057, Lag order = 10, p-value = 0.01
## alternative hypothesis: stationary
#with difference of 1 data, we bring the
#Dickey-Fuller test to -10.201,
#Lag order to 10 and
#p value to 0.01 (or less)
With a single difference, we have made the Dickey-Fuller statistic more negative (better fit) and we can reject the null hypothesis and call this time series “stationary.” You can try different differencing and see how that affects the Dickey-Fuller statistic.
Now, we’ll look for lag points that cross the confidence intervals.
#Look for spikes at specific lag points of the differenced series
par(mfrow=c(2,1))
Acf(count_d1, main='ACF for differenced Series')
Pacf(count_d1, main = 'PACF for differenced Series')
First, let’s take a look without the autoarima function and try to visualize the lags
#Evaluate and iterate - does the model make sense?
fit <- auto.arima(deseasonal_cnt, seasonal = FALSE)
tsdisplay(residuals(fit), lag.max = 60, main = "(1,1,1) Model Residuals") #high lag max to make sure we're seeing all of the lags
We can see that lags are still present under the default pdq values in the auto.arima function. There appear to be significant lags at 7
#graph shows serious lags at 7, so modify for q = 7
fit2 = arima(deseasonal_cnt, order=c(1,1,7))
tsdisplay(residuals(fit2), lag.max = 20, main = "Model Residuals")
#Not perfect yet, but getting better
Let’s now take a look at the fit of our model using known data
#Test model performance with a holdout set
hold <- window(ts(deseasonal_cnt), start = 991)
fit_no_holdout = arima(ts(deseasonal_cnt[-c(991:1090)]), order = c(1,1,7))
fcast_no_holdout <- forecast(fit_no_holdout,h=100)
plot(fcast_no_holdout, main='ARIMA Forecast')
lines(ts(deseasonal_cnt))
This model does not fit very well. The observed data moves outside of the confidence intervals and the straight line behavior is inconsistent with observations. Let’s try a different model fit and see how it looks.
fit3 = arima(deseasonal_cnt, order=c(1,1,12))
tsdisplay(residuals(fit3), lag.max = 20, main = "Model Residuals")
fcast1 <- forecast(fit3, h=60)
plot(fcast1)
This fit gives us a clearer trend. Let’s fit it once more with observed data.
hold <- window(ts(deseasonal_cnt), start = 991)
fit_no_holdout = arima(ts(deseasonal_cnt[-c(991:1090)]), order = c(1,1,12))
fcast_no_holdout <- forecast(fit_no_holdout,h=100)
plot(fcast_no_holdout, main='ARIMA Forecast')
lines(ts(deseasonal_cnt)) #a bit better, but not ideal
The model is getting better, but let’s find the ideal values. Let’s try the autofit provided by the forecast package, which optimizes the p,d,q values
#get auto fit p,d,q values
auto.arima(deseasonal_cnt, seasonal = FALSE) #ideal ARIMA is (p=1,d=0,q=4)
## Series: deseasonal_cnt
## ARIMA(1,0,4) with non-zero mean
##
## Coefficients:
## ar1 ma1 ma2 ma3 ma4 mean
## 0.9789 0.9287 0.2028 0.3216 0.5486 287.5877
## s.e. 0.0062 0.0351 0.0573 0.0322 0.0595 2.1928
##
## sigma^2 estimated as 0.2825: log likelihood=-862.18
## AIC=1738.36 AICc=1738.46 BIC=1773.34
fit4 = arima(deseasonal_cnt, order=c(1,0,4))
tsdisplay(residuals(fit4), lag.max = 20, main = "Model Residuals")
fcast2 <- forecast(fit4, h=60)
plot(fcast2)
This model looks much better. let’s try fitting this with observed data.
hold <- window(ts(deseasonal_cnt), start = 991)
fit_no_holdout = arima(ts(deseasonal_cnt[-c(991:1090)]), order = c(1,0,4))
fcast_no_holdout <- forecast(fit_no_holdout,h=100)
plot(fcast_no_holdout, main='ARIMA Forecast')
lines(ts(deseasonal_cnt))
This looks much better than the previous attempts.
This code has been sourced and modified from João Neto’s “Fourier Transform: A R Tutorial” 2013, http://www.di.fc.ul.pt/~jpn/r/fourier/fourier.html.
The interactive provided by the link has been sourced thorugh “An Interactive Guide To The Fourier Transform” by “Better Explained.”
Before we get into what the Fourier Transform is, lets review what a wave is, in particular a sine wave
Here we have a simple sine wave. It has a total of four basic components.
1. Amplitude: how tall is the wave
2. Frequency: how often does it repeat itself, 1/period
3. Phase: at what part of the cycle does the wave start
4. Translations: where is the waves’s center of mass
xs <- seq(0,4*pi,pi/100)
wave.1 <- sin(xs)
par(mfrow = c(1, 2))
plot(xs,wave.1,type="l"); abline(h=0,lty=3)
Lets play around with different waves.
#Simple function to plot a wave
wave.func <- function(amp = 1, freq = 4, phase = 0, trans = 0, length = 30) {
xs <- seq(0,length,1/100)
wave <- trans + amp*sin(freq*xs+phase)
plot(xs,wave,type="l")
abline(h=0,lty=3)
return(wave)
}
x = wave.func(amp = 3, freq = 4, phase = 1, trans = 1)
y = wave.func(amp = 3, freq = .5, phase = 1, trans = 1)
z = wave.func(amp = 5, freq = 1, phase = 3, trans = 2)
Now lets look at a complex wave. Lets linearly combine the three previous waves.
xs <- seq(0,30,1/100)
wave.total = x+y+z
plot(xs,wave.total,type="l")
As we can see it repeats itself over time. The idea that we can combine simple waves into more complex ones is at the core of the Forier Series.
It was Joseph Fourier who demonstrated that “any periodic wave can be represented by a sum of simple sine waves.”
\[ f(t) = a_0 + \sum_{k} a_k * sin(kwt+p_k)\] Where:
\(a_0\) is the DC component
w = 2\(\pi f_0\) where \(f_0\) is the fundamental frequency of original wave
\(a_k * sin(kwt+p_k)\) is called the harmonic, note the harmonic has to be less than the overall wave
This is all good but what about nonrepeating patterns?
Instead of simple sine waves, the fourier transform (FT) breaks the complex wave down into circular motions.
The FT sends the trajectory(your timeseries) through a series of filters.
* the FT pulls a “wave” compnent from your signal and leaves a remainder
* it then iterates untill there is no remainder left
* these filters are independent
Three principle components of the circular wave
1. Strength: how strong is the signal
2. Speed: how fast is it repeating, related to freuqency
3. Delay: where does it start
Very similar to the descriptions of a simple wave
Here is a link to a great animation demonstrating this consept.
https://betterexplained.com/examples/fourier/?cycles=1,2,3
The FT goes through your data and breaks them in to thier circular wave compoentns.
\[X_k = \sum_{n = 0}^{N-1}x_ne^{-i*2\pi*k*n/N}\]
Can’t have a frequency bigger than 1/(total time). Mind you I am glossing over underlying imaginary numbers behind this. For more details there is a great video: https://www.youtube.com/watch?v=spUNpyF58BY
Ok then lets look at an example of this in aciton and what it will look like.
This is a function to create a complex wave so we can generate some examples.
plot.fourier <- function(fourier.series, f.0, ts) {
w <- 2*pi*f.0
trajectory <- sapply(ts, function(t) fourier.series(t,w))
plot(ts, trajectory, type="l", xlab="time", ylab="f(t)"); abline(h=0,lty=3)
}
This is a function to visualize the results of a fft.
plot.frequency.spectrum <- function(X.k, xlimits=c(0,length(X.k))) {
plot.data <- cbind(0:(length(X.k)-1), Mod(X.k))
# TODO: why this scaling is necessary?
plot.data[2:length(X.k),2] <- 2*plot.data[2:length(X.k),2]
plot(plot.data, t="h", lwd=2, main="",
xlab="Frequency (Hz)", ylab="Strength",
xlim=xlimits, ylim=c(0,max(Mod(plot.data[,2]))))
}
Lets generate a wave to work with.
acq.freq <- 100 # data acquisition (sample) frequency (Hz)
time <- 6 # measuring time interval (seconds)
ts <- seq(0,time-1/acq.freq,1/acq.freq) # vector of sampling time-points (s)
f.0 <- 1/time
dc.component <- 1
component.freqs <- c(3,7,10) # frequency of signal components (Hz)
component.delay <- c(0,0,0) # delay of signal components (radians)
component.strength <- c(1.5,.5,.75) # strength of signal components
f <- function(t,w) {
dc.component +
sum( component.strength * sin(component.freqs*w*t + component.delay))
}
plot.fourier(f,f.0,ts=ts)
Of course we wouldn’t have the full funciton in the real world so lets resample the trajectory.
w <- 2*pi*f.0
trajectory <- sapply(ts, function(t) f(t,w))
head(trajectory,n=30)
## [1] 1.000000 1.162132 1.323161 1.481997 1.637568 1.788836 1.934801
## [8] 2.074515 2.207089 2.331703 2.447610 2.554146 2.650736 2.736896
## [15] 2.812242 2.876489 2.929454 2.971057 3.001324 3.020382 3.028458
## [22] 3.025877 3.013056 2.990502 2.958803 2.918623 2.870694 2.815807
## [29] 2.754805 2.688575
X.k <- fft(trajectory) # find all harmonics with fft()
plot.frequency.spectrum(X.k, xlimits=c(0,20))
Here we have the major frequencies and their relative strength.
Ok so that was cool, lets add some noise.
set.seed(101)
acq.freq <- 200
time <- 1
w <- 2*pi/time
ts <- seq(0,time,1/acq.freq)
trajectory <- 3*rnorm(101) + 3*sin(3*w*ts)
## Warning in 3 * rnorm(101) + 3 * sin(3 * w * ts): longer object length is
## not a multiple of shorter object length
plot(trajectory, type="l")
X.k <- fft(trajectory)
plot.frequency.spectrum(X.k,xlimits=c(0,acq.freq/2))
Lets try adding some nonstationarity.
trajectory1 <- trajectory + 25*ts # let's create a linear trend
plot(trajectory1, type="l")
X.k <- fft(trajectory1)
plot.frequency.spectrum(X.k,xlimits=c(0,acq.freq/2))
It does not work on trending timeseries.
We first must detrend the TS, lets use a simple linear regression.
trend <- lm(trajectory1 ~ts)
detrended.trajectory <- trend$residuals
plot(detrended.trajectory, type="l")
And now our signal is back.
X.k <- fft(detrended.trajectory)
plot.frequency.spectrum(X.k,xlimits=c(0,acq.freq/2))
Let’s try this with some real data.
library(zoo) #this is used to turn date in to an index
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
weather <- read.csv("CENMET.csv") # daily conditions (1 Hz = 1 day)
weather <- weather[order(nrow(weather):1),]
plot(index(weather$date), weather$tmax, type="l")
trend <- lm(tmax ~ index(date), data = weather)
abline(trend, col="red")
#Although this does not have a trend, lets detrend it anyway
detrended.trajectory <- trend$residuals
plot(detrended.trajectory, type="l", main="detrended time series")
#There is also a package called GeneCycle which will perform the fft
library(GeneCycle)
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## Loading required package: longitudinal
## Loading required package: corpcor
## Loading required package: fdrtool
##
## Attaching package: 'GeneCycle'
## The following object is masked from 'package:forecast':
##
## is.constant
f.data <- GeneCycle::periodogram(detrended.trajectory)
harmonics <- 1:50
plot(f.data$freq[harmonics]*length(detrended.trajectory),
f.data$spec[harmonics]/sum(f.data$spec),
xlab="Harmonics (Hz)", ylab="Amplitute Density", type="h")