This text continues a series of articles giving brief overviews of the main data-analysis methods. Last time we covered classification methods; now we turn to forecasting methods. By forecasting we mean predicting a specific numeric value for a new observation or for future periods. The article lists each method's name, a brief description, and a Python script. Such an outline can be useful before an interview, in a competition, or when starting a new project. It is assumed that the reader already knows these methods but needs to refresh them in memory quickly.
Least squares regression (OLS). The dependence of one factor on another is modeled as a linear equation $y = X\beta + \varepsilon$, and the coefficients are estimated by minimizing the loss function (the sum of squared errors):

$\min_\beta \|y - X\beta\|^2$

Solving this minimization problem gives the estimated parameters:

$\hat{\beta} = (X^\top X)^{-1} X^\top y$
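As a sanity check, the closed-form estimate can be computed directly with NumPy; a minimal sketch (the data are arbitrary illustrative numbers):

import numpy as np

# toy data: an intercept column plus one regressor
X = np.column_stack([np.ones(7), np.arange(1, 8)])
y = np.array([1, 3, 4, 5, 2, 3, 4])

# closed-form OLS estimate: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta_hat)  # intercept and slope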
If the data satisfy the Gauss-Markov conditions:
- the mathematical expectation of the error is 0;
- homoskedasticity (the error variance is constant);
- no multicollinearity among the regressors;
- the regressors are deterministic (fixed, non-random) values;
- the errors are normally distributed;

then, by the Gauss-Markov theorem, the estimates have the following properties:
- Linearity: under a linear transformation of the vector Y, the estimates also transform linearly.
- Unbiasedness: the mathematical expectation of the estimate equals the true parameter value.
- Consistency: as the sample size grows, the estimates tend to their true values.
- Efficiency: the estimates have the smallest variance among linear unbiased estimators.
- Normality: the estimates are normally distributed.
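Unbiasedness and consistency are easy to illustrate empirically; a minimal simulation sketch (the true coefficients, noise level, and sample sizes are arbitrary assumptions):

import numpy as np

rng = np.random.default_rng(0)
true_beta = np.array([2.0, 0.5])  # assumed true intercept and slope

for n in (20, 200, 2000):
    estimates = []
    for _ in range(1000):
        x = rng.uniform(0, 10, n)
        X = np.column_stack([np.ones(n), x])
        y = X @ true_beta + rng.normal(0, 1, n)  # normal errors, constant variance
        estimates.append(np.linalg.lstsq(X, y, rcond=None)[0])
    est = np.array(estimates)
    # the mean stays near true_beta (unbiasedness), the spread shrinks (consistency)
    print(n, est.mean(axis=0), est.std(axis=0))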
import statsmodels.api as sm

Y = [1, 3, 4, 5, 2, 3, 4]
X = range(1, 8)
X = sm.add_constant(X)      # add the intercept column
model = sm.OLS(Y, X)        # ordinary least squares
results = model.fit()
print(results.summary())
print(results.predict(X))   # fitted values for the training data
- Generalized least squares (GLS). Used when the Gauss-Markov conditions of homoskedasticity (constant variance) of the residuals and of non-correlation of the residuals with each other are not satisfied. GLS takes the covariance matrix of the residuals into account when computing the parameters of the regression equation. The matrix of estimated parameters:

$\hat{\beta} = (X^\top \Omega^{-1} X)^{-1} X^\top \Omega^{-1} y$

where $\Omega$ is the covariance matrix of the residuals. Note that for $\Omega = I$ (the identity matrix) we obtain ordinary least squares.
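The formula can be applied directly in NumPy; a minimal sketch with an assumed diagonal $\Omega$ (heteroskedastic but uncorrelated residuals):

import numpy as np

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), np.arange(n)])
omega = np.diag(np.linspace(1, 10, n))  # assumed residual covariance matrix
y = X @ np.array([1.0, 0.3]) + rng.multivariate_normal(np.zeros(n), omega)

# GLS estimate: (X' O^{-1} X)^{-1} X' O^{-1} y
omega_inv = np.linalg.inv(omega)
beta_gls = np.linalg.inv(X.T @ omega_inv @ X) @ X.T @ omega_inv @ y
print(beta_gls)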
import numpy as np
import statsmodels.api as sm
from scipy.linalg import toeplitz

data = sm.datasets.longley.load(as_pandas=False)
data.exog = sm.add_constant(data.exog)
# residuals of a plain OLS fit
ols_resid = sm.OLS(data.endog, data.exog).fit().resid
# estimate the AR(1) autocorrelation of the residuals
res_fit = sm.OLS(ols_resid[1:], ols_resid[:-1]).fit()
rho = res_fit.params
# build the AR(1) covariance structure: sigma_ij = rho^|i-j| (16 observations)
order = toeplitz(np.arange(16))
sigma = rho ** order
gls_model = sm.GLS(data.endog, data.exog, sigma=sigma)
gls_results = gls_model.fit()
print(gls_results.summary())
print(gls_results.predict(data.exog))
- Weighted least squares (WLS). A special case of GLS, used when the residuals are heteroskedastic (their variance is not constant) but uncorrelated, and the relative precision of each observation is known. Each observation is assigned a weight, typically inversely proportional to its error variance, so that noisier observations influence the fit less (a manual equivalent is sketched after the script below).
import statsmodels.api as sm

Y = [1, 3, 4, 5, 2, 3, 4]
X = range(1, 8)
X = sm.add_constant(X)
# each observation gets its own weight (here simply 1..7)
wls_model = sm.WLS(Y, X, weights=list(range(1, 8)))
results = wls_model.fit()
print(results.summary())
print(results.predict(X))
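Mechanically, WLS is just the estimator $\hat{\beta} = (X^\top W X)^{-1} X^\top W y$ with a diagonal weight matrix $W$; a minimal sketch on the same toy data:

import numpy as np

X = np.column_stack([np.ones(7), np.arange(1, 8)])
y = np.array([1, 3, 4, 5, 2, 3, 4])
W = np.diag(np.arange(1, 8))  # same weights as in the statsmodels call above

beta_wls = np.linalg.inv(X.T @ W @ X) @ X.T @ W @ y
print(beta_wls)  # matches the statsmodels WLS coefficients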
- Two-stage least squares (TSLS/2SLS). Used when an explanatory variable is endogenous, i.e. correlated with the error term, so that OLS and WLS estimates are biased. Instrumental variables are brought in: at the first stage the endogenous regressor is regressed on the instruments, and at the second stage its fitted values replace it in the main equation (the two stages are also sketched by hand after the script below).
from linearmodels import IV2SLS
from linearmodels.datasets import meps

data = meps.load()
data = data.dropna()
controls = ['totchr', 'female', 'age', 'linc', 'blhisp']
instruments = ['ssiratio', 'lowincome', 'multlc', 'firmsz']
data['const'] = 1
controls = ['const'] + controls
# ldrugexp is the dependent variable, hi_empunion is the endogenous regressor
iv_model = IV2SLS(data.ldrugexp, data[controls], data.hi_empunion, data[instruments])
res_2sls = iv_model.fit()
print(res_2sls)
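The two stages can also be run by hand as two OLS fits on the same data; a sketch (note the second-stage standard errors reported this way are not corrected, which is why a dedicated IV routine is preferred):

import statsmodels.api as sm

# stage 1: regress the endogenous variable on instruments and controls
stage1 = sm.OLS(data.hi_empunion, data[instruments + controls]).fit()
data['hi_hat'] = stage1.fittedvalues

# stage 2: replace the endogenous regressor with its fitted values
stage2 = sm.OLS(data.ldrugexp, data[['hi_hat'] + controls]).fit()
print(stage2.params)  # point estimates match the 2SLS coefficients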
- ARIMA. Used for forecasting time series. Auto-Regressive (the current value is regressed on p previous values of Y), Integrated (the series is differenced d times to make it stationary), Moving Average (the current value also depends on q previous forecast errors).
from datetime import datetime
from pandas import read_csv
from statsmodels.tsa.arima.model import ARIMA

# shampoo-sales.csv stores months as '1-01', '1-02', ... meaning 1901-01, 1901-02, ...
def parser(x):
    return datetime.strptime('190' + x, '%Y-%m')

series = read_csv('shampoo-sales.csv', header=0, index_col=0).squeeze('columns')
series.index = [parser(d) for d in series.index]
model = ARIMA(series, order=(5, 1, 0))  # p=5, d=1, q=0
model_fit = model.fit()
print(model_fit.summary())
print(model_fit.forecast())             # one-step-ahead forecast
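The order of differencing d is typically chosen by checking the series for stationarity; a minimal sketch using the augmented Dickey-Fuller test from statsmodels on the series loaded above:

from statsmodels.tsa.stattools import adfuller

# a p-value above ~0.05 suggests non-stationarity: difference once more
print(adfuller(series)[1])                   # raw series
print(adfuller(series.diff().dropna())[1])   # after first differencing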
- GARCH. Generalized AutoRegressive Conditional Heteroskedasticity: used when a time series exhibits heteroskedasticity, i.e. the variance of its errors changes over time (volatility clustering).
import numpy as np
import pandas as pd
import pyflux as pf
from pandas_datareader import DataReader
from datetime import datetime

# daily JPMorgan prices; the log returns are the series to model
jpm = DataReader('JPM', 'yahoo', datetime(2006, 1, 1), datetime(2016, 3, 10))
returns = pd.DataFrame(np.diff(np.log(jpm['Adj Close'].values)))
returns.index = jpm.index.values[1:]
returns.columns = ['JPM Returns']

model = pf.GARCH(returns, p=1, q=1)
x = model.fit()
x.summary()
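The GARCH(1,1) variance equation itself is $\sigma^2_t = \omega + \alpha\,\varepsilon^2_{t-1} + \beta\,\sigma^2_{t-1}$; a minimal sketch of the recursion with assumed illustrative parameter values:

import numpy as np

omega, alpha, beta = 1e-5, 0.1, 0.85                  # assumed parameters
eps = np.random.default_rng(2).normal(0, 0.01, 500)   # stand-in for return shocks

sigma2 = np.empty_like(eps)
sigma2[0] = eps.var()
for t in range(1, len(eps)):
    # today's variance reacts to yesterday's squared shock and yesterday's variance
    sigma2[t] = omega + alpha * eps[t-1]**2 + beta * sigma2[t-1]
print(sigma2[-5:])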
If an important method is missing, please mention it in the comments and the article will be updated. Thank you for your attention.