Regression is one of the most widely used approaches in data science for predicting a target variable. For example, linear regression is used to predict house prices (the target variable) based on input variables like the number of rooms, location, available amenities, and more. But sometimes the data points are collected over a period of time along with timestamps; such data is referred to as ‘time-series data’. Some examples of time-series data include hourly tracking of weather conditions, heart activity monitoring, and more. Time-series data can be modeled in special ways, one of which is the autoregressive approach.
We know that time-series data is ordered chronologically, which gives rise to serial dependence. In simple words, the values in a time series depend on the values that came before them. The earlier observations, counted by how many time steps back they lie, are often referred to as ‘lags’. This correlation between the present and past values of a time series is called autocorrelation. When autocorrelation is present in the data, regression analysis can be performed to extract trends and patterns. Widely used regression analysis methods include autoregressive (AR) models, moving average (MA) models, and more advanced models like ARIMA. In this blog, we’ll be covering autoregressive models with practical examples.
We use the past values to predict or forecast the future values in autoregressive models. Are you wondering how it is different from normal regression?
In the case of normal regression, the target value Y is predicted by giving an input variable X to a trained model (the model is trained on past values of X and Y). In the case of time series, the past data points of Y themselves act as the input variables for predicting the current data point. The data at times (t-2) and (t-1) acts as the input for predicting the data at time (t). Hence, this is just a self-regression, which is why it is called autoregression. In simple terms, the regression is against the past values of the target itself.
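To make the difference from normal regression concrete, the sketch below builds the input columns of an autoregression from a single toy series using the pandas `shift` method. The series values and the column names `y_lag1` and `y_lag2` are made up for illustration; they are not taken from the dataset used later in this blog.

```python
import pandas as pd

# Toy series; the values are illustrative only.
y = pd.Series([10, 12, 13, 15, 18, 21], name="y")

# In autoregression the inputs are simply earlier values of the target.
frame = pd.DataFrame({
    "y_lag2": y.shift(2),  # value at time t-2
    "y_lag1": y.shift(1),  # value at time t-1
    "y": y,                # value at time t (the target)
})

# The first two rows have no complete history, so they are dropped.
frame = frame.dropna().reset_index(drop=True)
print(frame)
```

Each remaining row pairs a target value with the two values that preceded it, which is exactly the table a regression routine would be trained on.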
We create a model where the target variable depends on its past values measured at fixed time lags. The data points (t and t+1) are separated by a time lag. This time lag could differ from situation to situation. It could be a difference of 1 hour in case of health monitoring, 1 day in case of stock market prices, 1 week in case of product sales, and so on.
In any AR model, the target Y is represented as a linear combination of previous values of Y. Here arises an important question: how many previous values of Y should be used in the model? This is determined by the parameter p of the model, which is called the ‘order’ of the model.
Let’s start by considering first-order models. Say we are working with daily data and want to predict today’s value Y(t). We use only the value of Y that was measured yesterday, Y(t-1). First-order models are written as AR(1). In standard notation, the first-order AR model is:
Y(t) = c + φ1 · Y(t-1) + ε(t)
Here c is a constant, φ1 is the coefficient on the previous value, and ε(t) is the noise term. There will always be some noise due to random factors that cannot be predicted.
Similarly, we can also express second-order models, AR(2), where the two most recent past values are used. Extending this, a model of order p, written AR(p), can be expressed as follows, where c is a constant, φ1 to φp are the coefficients on the past values, and ε(t) is the noise term:
Y(t) = c + φ1 · Y(t-1) + φ2 · Y(t-2) + ... + φp · Y(t-p) + ε(t)
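As a rough illustration of what an AR(2) model means in practice, the sketch below simulates a series from assumed coefficients and then recovers them by regressing each value on its two predecessors with ordinary least squares. The constant, the two coefficients and the sample size are arbitrary choices made for this example, and NumPy's `lstsq` is used here only to keep the idea visible; it is not the estimation routine used later in this blog.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed "true" parameters for the simulation (chosen for illustration).
c, phi1, phi2 = 1.0, 0.6, -0.3
n = 5000

# Generate Y(t) = c + phi1*Y(t-1) + phi2*Y(t-2) + noise.
y = np.zeros(n)
for t in range(2, n):
    y[t] = c + phi1 * y[t - 1] + phi2 * y[t - 2] + rng.normal()

# Regress Y(t) on a constant, Y(t-1) and Y(t-2) using ordinary least squares.
X = np.column_stack([np.ones(n - 2), y[1:-1], y[:-2]])
target = y[2:]
estimates, *_ = np.linalg.lstsq(X, target, rcond=None)

print("estimated c, phi1, phi2:", np.round(estimates, 2))
```

With a long enough series the estimates should land close to the assumed values, which shows that fitting an AR model is ordinary regression applied to lagged copies of the same variable.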
Now that you have got a grip on the idea behind AR models, let us learn the method of implementing the same on actual data.
Autoregressive models can be implemented easily using the statsmodels.tsa.ar_model module of the statsmodels library. The steps of the process are as follows:
Step 1: Know your data
The first step in any data science problem is to understand the data in front of you. To make this easier to follow, we are using the Airline passenger traffic dataset, which can be downloaded from Kaggle. The dataset has two columns, one with the timestamp and the other with the number of passengers. Start by importing the libraries and reading the CSV file. Have a look at the code snippet below:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data=pd.read_csv('/kaggle/input/airline-passenger-traffic/airline-passenger-traffic(1).csv', header = None)
data.columns = ['Month','Passengers']
Fig: Displaying rows of our dataset
Observe that the passenger values are recorded once per month, so the time lag here is one month.
Let us visualize it using the matplotlib package.
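A minimal plotting sketch could look like the following. It assumes the `Month` column holds date strings that pandas can parse and that the DataFrame has the `Month` and `Passengers` columns named earlier; the helper name `plot_passengers`, the figure size and the title are choices made for this example.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_passengers(data):
    """Plot the 'Passengers' column against the 'Month' column."""
    months = pd.to_datetime(data["Month"])  # assumes parseable date strings
    fig, ax = plt.subplots(figsize=(12, 4))
    ax.plot(months, data["Passengers"])
    ax.set_xlabel("Month")
    ax.set_ylabel("Passengers")
    ax.set_title("Airline passenger traffic")
    return ax

# Usage with the DataFrame loaded above:
# plot_passengers(data)
# plt.show()
```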
Fig: Plot of the passenger data against months
Step 2: Check stationarity of time series data
An important step before applying autoregressive models is to ensure that the time series is stationary. A time series without any trend or seasonality is called stationary. Statistical properties such as the mean and variance should not depend on the time of measurement. Often, the time series is non-stationary and must be converted before applying models.
How to find out if a time series is stationary?
The Augmented Dickey-Fuller (ADF) test comes to our rescue! In this test, the null hypothesis is that the series is non-stationary and some trend is present. The alternative hypothesis is that the series is stationary. If you are familiar with basic statistics, you would know that if the p-value is less than 0.05, we can reject the null hypothesis and treat the series as stationary.
To conduct this test, we have a simple function called 'adfuller()' from the statsmodels library. Look at the code snippet.
from statsmodels.tsa.stattools import adfuller
adfuller_test = adfuller(data['Passengers'])
# adfuller returns a tuple; the p-value is its second element
print('p-value: %f' % adfuller_test[1])
#> p-value: 0.993020
The p-value is 0.99, which means the series is non-stationary.
How to convert it to a stationary series?
Commonly used methods include applying transformations such as the log transform or the Box-Cox transformation, and applying differencing. Differencing helps in removing a varying mean. In this method, we calculate the difference between consecutive data points of the time series and use these differences in place of the original data. You can try out a combination of these methods and decide which is best for the data. In this case, the log transform and differencing have been applied.
data['Passengers'] = np.log(data['Passengers'])
data['Passengers'] = data['Passengers'] - data['Passengers'].shift(1)
You can compare the data before and after the transformation and observe that the trend and seasonality have been largely removed. To confirm, run the ADF test again.
Now, the p-value is far less than 0.05, which means the time series is stationary. You can proceed to the next step.
Step 3: Partial autocorrelation plot
The next step is to decide the order of the autoregressive model. This is decided from the partial autocorrelation function (PACF) plot. The statsmodels library allows you to produce this plot easily using the plot_pacf function:
from statsmodels.graphics.tsaplots import plot_pacf
# Plot only the transformed series; drop the NaN created by differencing
plot_pacf(data['Passengers'].dropna(), ax=plt.gca(), lags=20)
Fig: Partial Autocorrelation plot for the data
In the above plot, you can observe that the lags after lag 12 are insignificant. So you can choose 12 as the order of your AR model.
Step 4: Train your model
We have finally arrived at the last stage. You can import the “AutoReg” class for easy training. Divide your dataset into train and test sets. Fit the model on training data by specifying the order.
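from statsmodels.tsa.ar_model import AutoReg
# Use only the transformed series and drop the NaN created by differencing
series = data['Passengers'].dropna().reset_index(drop=True)
train_set = series.iloc[:120]
test_set = series.iloc[120:]
# Order 12, as read from the PACF plot in Step 3
ar_model = AutoReg(train_set, lags=12).fit()
print(ar_model.summary())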
The model summary, available through ar_model.summary(), lists the fitted coefficient for each lag along with fit statistics.
Now, you have trained an autoregressive model successfully. Next, it helps to know why AR models are preferred, along with their advantages and limitations.
Data scientists prefer autoregressive models because they can forecast recurring patterns and, through the autocorrelation function, can also indicate a lack of randomness in the data. Another advantage of AR over other models is that it requires less data. A main application of the autoregressive model is in econometric modeling, for example to forecast future security prices in the market. But we have to understand that AR models rely on past values to predict a similar future. In the case of an economic depression, an unexpected financial crisis, a war, and so on, these models will fail to predict the outcome. Sometimes, it might be advantageous to use the Vector Autoregressive (VAR) model over AR. A single VAR model can be used to predict multiple time series variables.
I hope you now understand the importance of autocorrelation and stationarity, and how to find the order of an AR model. An alternative to autoregressive models is the Moving Average (MA) model, in which the error terms of previous forecasts are modeled. AR and MA models are combined in ARMA models. Autoregressive models have also been extended into more advanced models like VAR, ARIMA, SARIMA, and more. I hope you found this blog interesting and useful.