Abstract on forecasting methods

This text continues a series of articles devoted to brief descriptions of the main data analysis methods. Last time we covered classification methods; now we turn to forecasting methods. By forecasting we mean predicting a specific numeric value for a new observation or for future periods. The article lists the methods, gives a brief description of each, and provides a Python script. Such an abstract can be useful before an interview, in a competition, or when starting a new project. It is assumed that the reader already knows these methods but needs to quickly refresh them in memory.


Least squares regression. The dependence of one variable on another is represented as an equation, and the coefficients are estimated by minimizing the loss function (the sum of squared errors):

$$\sum_{i=1}^{n} \bigl(y_i - (a x_i + b)\bigr)^2 \to \min$$


Solving this minimization problem yields the parameter estimates:

$$a = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$$

$$b = \frac{\sum_{i=1}^{n} y_i - a \sum_{i=1}^{n} x_i}{n}$$
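
As a quick sanity check, these closed-form expressions can be computed directly. A minimal NumPy sketch on made-up data (np.polyfit is used only to confirm the result):

 #imports
import numpy as np

#closed-form estimates on toy data
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)
a = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
b = (np.sum(y) - a * np.sum(x)) / n

#result
print(a, b)                  # slope and intercept from the formulas
print(np.polyfit(x, y, 1))   # [slope, intercept] - should match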


Graphical representation (figure omitted).

If the data satisfy the Gauss-Markov conditions:

  • $E(\varepsilon_i) = 0$ - the expected value of the error is 0
  • $\sigma^2(\varepsilon_i) = \mathrm{const}$ - homoskedasticity
  • $\mathrm{cov}(\varepsilon_i, \varepsilon_j) = 0,\ i \neq j$ - no autocorrelation of the errors
  • $x_i$ - deterministic (non-random) values
  • $\varepsilon \sim N(0, \sigma^2)$ - the errors are normally distributed

Then, by the Gauss-Markov theorem, the OLS estimates have the following properties:

  • Linearity - the estimates are linear in the vector Y: a linear transformation of Y transforms the estimates linearly.
  • Unbiasedness - the expected value of the estimates equals the true parameter values.
  • Consistency - as the sample size increases, the estimates converge to the true values.
  • Efficiency - the estimates have the smallest variance among all linear unbiased estimators.
  • Normality - the estimates are normally distributed (when the errors are normal).

 #imports
import statsmodels.api as sm

#model fit
Y = [1, 3, 4, 5, 2, 3, 4]
X = list(range(1, 8))
X = sm.add_constant(X)        # add the intercept column
model = sm.OLS(Y, X)
results = model.fit()

#result
print(results.summary())      # coefficient table and diagnostics
print(results.predict(X))     # fitted values
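
Whether the assumptions above actually hold can be checked on the fitted model. A short sketch using standard statsmodels diagnostics (Durbin-Watson for autocorrelation, Breusch-Pagan for heteroskedasticity, Jarque-Bera for normality), applied to the results object from the example above:

 #imports
from statsmodels.stats.stattools import durbin_watson, jarque_bera
from statsmodels.stats.diagnostic import het_breuschpagan

#diagnostics on the residuals
print(durbin_watson(results.resid))         # about 2 means no autocorrelation
print(het_breuschpagan(results.resid, X))   # LM statistic and p-value
print(jarque_bera(results.resid))           # JB statistic, p-value, skew, kurtosis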


- GLS. Generalized least squares is used when the Gauss-Markov conditions of homoskedasticity (constant variance) of the residuals and of uncorrelated residuals are not satisfied. The goal of GLS is to take the covariance matrix of the residuals into account when computing the parameters of the regression equation. The vector of estimated parameters:

$$a^{*} = (X^T \Omega^{-1} X)^{-1} X^T \Omega^{-1} Y$$

where $\Omega$ is the covariance matrix of the residuals. Note that for $\Omega = I$ (the identity matrix) this reduces to ordinary least squares.
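
The matrix formula itself is easy to evaluate directly. A minimal NumPy sketch with a toy design matrix (for $\Omega = I$ the result coincides with the OLS solution, illustrating the note above):

 #imports
import numpy as np

#GLS estimate straight from the formula
def gls_estimate(X, Y, Omega):
	Oinv = np.linalg.inv(Omega)   # inverse covariance matrix of the residuals
	return np.linalg.inv(X.T @ Oinv @ X) @ X.T @ Oinv @ Y

X = np.column_stack([np.ones(4), [1., 2., 3., 4.]])   # toy design matrix with a constant
Y = np.array([1., 3., 2., 5.])
print(gls_estimate(X, Y, np.eye(4)))
print(np.linalg.lstsq(X, Y, rcond=None)[0])           # OLS solution - should match

In statsmodels, the same is done with sm.GLS: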

 #imports
import numpy as np
import statsmodels.api as sm
from scipy.linalg import toeplitz

#model fit
data = sm.datasets.longley.load(as_pandas=False)
data.exog = sm.add_constant(data.exog)
ols_resid = sm.OLS(data.endog, data.exog).fit().resid
res_fit = sm.OLS(ols_resid[1:], ols_resid[:-1]).fit()  # AR(1) regression of the residuals
rho = res_fit.params                                   # estimated autocorrelation
order = toeplitz(np.arange(16))                        # |i - j| for the 16 observations
sigma = rho**order                                     # AR(1) covariance structure
gls_model = sm.GLS(data.endog, data.exog, sigma=sigma)
gls_results = gls_model.fit()

#result
print(gls_results.summary())
print(gls_results.predict(data.exog))

- WLS. Weighted least squares, a special case of GLS: it is used when the residuals are heteroskedastic but uncorrelated, and each observation is assigned its own weight (usually inversely proportional to the error variance).


 #imports
import statsmodels.api as sm

#model fit
Y = [1, 3, 4, 5, 2, 3, 4]
X = list(range(1, 8))
X = sm.add_constant(X)                                # add the intercept column
wls_model = sm.WLS(Y, X, weights=list(range(1, 8)))   # one weight per observation
results = wls_model.fit()

#result
print(results.summary())
print(results.predict(X))   # fitted values

- TSLS. Two-stage least squares: used when some regressors are endogenous, i.e. correlated with the error term. In the first stage the endogenous regressors are regressed on instrumental variables; in the second stage the original equation is estimated with the first-stage fitted values in place of the endogenous regressors.


 #imports
from linearmodels import IV2SLS
from linearmodels.datasets import meps

#model fit
data = meps.load()
data = data.dropna()
controls = ['totchr', 'female', 'age', 'linc', 'blhisp']
instruments = ['ssiratio', 'lowincome', 'multlc', 'firmsz']
data['const'] = 1
controls = ['const'] + controls
# hi_empunion is treated as endogenous and instrumented by the variables above
ivmod = IV2SLS(data.ldrugexp, data[controls], data.hi_empunion, data[instruments])
res_2sls = ivmod.fit()

#result
print(res_2sls)
print(res_2sls.params)

- ARIMA. Used for forecasting time series. Autoregressive (the series depends on its own previous values of Y), integrated (the series is differenced to make it stationary), moving average (the model includes past forecast errors).

 #imports
from datetime import datetime
from pandas import read_csv
from statsmodels.tsa.arima.model import ARIMA

#model fit
def parser(x):
	return datetime.strptime('190' + x, '%Y-%m')   # dates in the file look like '1-01'

series = read_csv('shampoo-sales.csv', header=0, index_col=0).squeeze('columns')
series.index = [parser(d) for d in series.index]
model = ARIMA(series, order=(5, 1, 0))   # AR order 5, one difference, no MA terms
model_fit = model.fit()

#result
print(model_fit.summary())
print(model_fit.forecast())   # one-step-ahead forecast

- GARCH. Generalized autoregressive conditional heteroskedasticity: used when there is heteroskedasticity (time-varying variance) in the time series.


 #imports
import numpy as np
import pyflux as pf
import pandas as pd
from pandas_datareader import DataReader
from datetime import datetime

#model fit
jpm = DataReader('JPM', 'yahoo', datetime(2006, 1, 1), datetime(2016, 3, 10))
returns = pd.DataFrame(np.diff(np.log(jpm['Adj Close'].values)))   # daily log returns
returns.index = jpm.index.values[1:]
returns.columns = ['JPM Returns']
model = pf.GARCH(returns, p=1, q=1)

#result
x = model.fit()
x.summary()
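
pyflux has not been updated in years and the Yahoo endpoint used above is often unavailable, so as a fallback the same GARCH(1,1) can be fitted with the arch package. A sketch on synthetic returns (substitute a real return series as above):

 #imports
import numpy as np
from arch import arch_model

#model fit on placeholder returns
np.random.seed(0)
returns = np.random.normal(0, 1, 1000)   # synthetic return series for illustration
am = arch_model(returns, vol='GARCH', p=1, q=1)
res = am.fit(disp='off')

#result
print(res.summary())
print(res.forecast(horizon=5).variance)   # 5-step-ahead variance forecast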

If I missed an important method, please write about it in the comments and the article will be supplemented. Thank you for your attention.

