This text continues a series of articles giving brief overviews of the main data-analysis methods. Last time we covered classification methods; now we turn to forecasting methods. By forecasting we mean predicting a specific numeric value for a new observation or for future periods. The article lists each method's name, a brief description, and a Python script. Such an outline can be useful before an interview, in a competition, or when starting a new project. It is assumed that the reader already knows these methods but needs to refresh them in memory quickly.
Least squares regression (OLS). The dependence of one factor on another is modeled as a linear equation $y = X\beta + \varepsilon$, and the coefficients are estimated by minimizing the loss function (the sum of squared errors):

$\min_\beta \|y - X\beta\|^2$

Solving this minimization problem gives the estimated parameters:

$\hat{\beta} = (X^\top X)^{-1} X^\top y$
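As a sanity check, the closed-form estimate can be computed directly with NumPy; a minimal sketch (the data are arbitrary illustrative numbers):

import numpy as np

# toy data: an intercept column plus one regressor
X = np.column_stack([np.ones(7), np.arange(1, 8)])
y = np.array([1, 3, 4, 5, 2, 3, 4])

# closed-form OLS estimate: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta_hat)  # intercept and slope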
If the data satisfy the Gauss-Markov conditions:
- the mathematical expectation of the error is 0;
- homoskedasticity (the error variance is constant);
- no multicollinearity among the regressors;
- the regressors are deterministic (fixed, non-random) values;
- the errors are normally distributed;

then, by the Gauss-Markov theorem, the estimates have the following properties:
- Linearity: under a linear transformation of the vector Y, the estimates also transform linearly.
- Unbiasedness: the mathematical expectation of the estimate equals the true parameter value.
- Consistency: as the sample size grows, the estimates tend to their true values.
- Efficiency: the estimates have the smallest variance among linear unbiased estimators.
- Normality: the estimates are normally distributed.
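Unbiasedness and consistency are easy to illustrate empirically; a minimal simulation sketch (the true coefficients, noise level, and sample sizes are arbitrary assumptions):

import numpy as np

rng = np.random.default_rng(0)
true_beta = np.array([2.0, 0.5])  # assumed true intercept and slope

for n in (20, 200, 2000):
    estimates = []
    for _ in range(1000):
        x = rng.uniform(0, 10, n)
        X = np.column_stack([np.ones(n), x])
        y = X @ true_beta + rng.normal(0, 1, n)  # normal errors, constant variance
        estimates.append(np.linalg.lstsq(X, y, rcond=None)[0])
    est = np.array(estimates)
    # the mean stays near true_beta (unbiasedness), the spread shrinks (consistency)
    print(n, est.mean(axis=0), est.std(axis=0))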
import statsmodels.api as sm

Y = [1, 3, 4, 5, 2, 3, 4]
X = range(1, 8)
X = sm.add_constant(X)      # add the intercept column
model = sm.OLS(Y, X)        # ordinary least squares
results = model.fit()
print(results.summary())
print(results.predict(X))   # fitted values for the training data
- Generalized least squares (GLS). Used when the Gauss-Markov conditions of homoskedasticity (constant variance) of the residuals and of non-correlation of the residuals with each other are not satisfied. GLS takes the covariance matrix of the residuals into account when computing the parameters of the regression equation. The matrix of estimated parameters:

$\hat{\beta} = (X^\top \Omega^{-1} X)^{-1} X^\top \Omega^{-1} y$

where $\Omega$ is the covariance matrix of the residuals. Note that for $\Omega = I$ (the identity matrix) we obtain ordinary least squares.
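The formula can be applied directly in NumPy; a minimal sketch with an assumed diagonal $\Omega$ (heteroskedastic but uncorrelated residuals):

import numpy as np

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), np.arange(n)])
omega = np.diag(np.linspace(1, 10, n))  # assumed residual covariance matrix
y = X @ np.array([1.0, 0.3]) + rng.multivariate_normal(np.zeros(n), omega)

# GLS estimate: (X' O^{-1} X)^{-1} X' O^{-1} y
omega_inv = np.linalg.inv(omega)
beta_gls = np.linalg.inv(X.T @ omega_inv @ X) @ X.T @ omega_inv @ y
print(beta_gls)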
import numpy as np
import statsmodels.api as sm
from scipy.linalg import toeplitz

data = sm.datasets.longley.load(as_pandas=False)
data.exog = sm.add_constant(data.exog)
# residuals of a plain OLS fit
ols_resid = sm.OLS(data.endog, data.exog).fit().resid
# estimate the AR(1) autocorrelation of the residuals
res_fit = sm.OLS(ols_resid[1:], ols_resid[:-1]).fit()
rho = res_fit.params
# build the AR(1) covariance structure: sigma_ij = rho^|i-j| (16 observations)
order = toeplitz(np.arange(16))
sigma = rho ** order
gls_model = sm.GLS(data.endog, data.exog, sigma=sigma)
gls_results = gls_model.fit()
print(gls_results.summary())
print(gls_results.predict(data.exog))
- Weighted least squares (WLS). A special case of GLS, used when the residuals are heteroskedastic (their variance is not constant) but uncorrelated, and the relative precision of each observation is known. Each observation is assigned a weight, typically inversely proportional to its error variance, so that noisier observations influence the fit less (a manual equivalent is sketched after the script below).
import statsmodels.api as sm

Y = [1, 3, 4, 5, 2, 3, 4]
X = range(1, 8)
X = sm.add_constant(X)
# each observation gets its own weight (here simply 1..7)
wls_model = sm.WLS(Y, X, weights=list(range(1, 8)))
results = wls_model.fit()
print(results.summary())
print(results.predict(X))
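Mechanically, WLS is just the estimator $\hat{\beta} = (X^\top W X)^{-1} X^\top W y$ with a diagonal weight matrix $W$; a minimal sketch on the same toy data:

import numpy as np

X = np.column_stack([np.ones(7), np.arange(1, 8)])
y = np.array([1, 3, 4, 5, 2, 3, 4])
W = np.diag(np.arange(1, 8))  # same weights as in the statsmodels call above

beta_wls = np.linalg.inv(X.T @ W @ X) @ X.T @ W @ y
print(beta_wls)  # matches the statsmodels WLS coefficients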
- Two-stage least squares (TSLS/2SLS). Used when an explanatory variable is endogenous, i.e. correlated with the error term, so that OLS and WLS estimates are biased. Instrumental variables are brought in: at the first stage the endogenous regressor is regressed on the instruments, and at the second stage its fitted values replace it in the main equation (the two stages are also sketched by hand after the script below).
from linearmodels import IV2SLS
from linearmodels.datasets import meps

data = meps.load()
data = data.dropna()
controls = ['totchr', 'female', 'age', 'linc', 'blhisp']
instruments = ['ssiratio', 'lowincome', 'multlc', 'firmsz']
data['const'] = 1
controls = ['const'] + controls
# ldrugexp is the dependent variable, hi_empunion is the endogenous regressor
iv_model = IV2SLS(data.ldrugexp, data[controls], data.hi_empunion, data[instruments])
res_2sls = iv_model.fit()
print(res_2sls)
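The two stages can also be run by hand as two OLS fits on the same data; a sketch (note the second-stage standard errors reported this way are not corrected, which is why a dedicated IV routine is preferred):

import statsmodels.api as sm

# stage 1: regress the endogenous variable on instruments and controls
stage1 = sm.OLS(data.hi_empunion, data[instruments + controls]).fit()
data['hi_hat'] = stage1.fittedvalues

# stage 2: replace the endogenous regressor with its fitted values
stage2 = sm.OLS(data.ldrugexp, data[['hi_hat'] + controls]).fit()
print(stage2.params)  # point estimates match the 2SLS coefficients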
- ARIMA. Used for forecasting time series. Auto-Regressive (the current value is regressed on p previous values of Y), Integrated (the series is differenced d times to make it stationary), Moving Average (the current value also depends on q previous forecast errors).
from datetime import datetime
from pandas import read_csv
from statsmodels.tsa.arima.model import ARIMA

# shampoo-sales.csv stores months as '1-01', '1-02', ... meaning 1901-01, 1901-02, ...
def parser(x):
    return datetime.strptime('190' + x, '%Y-%m')

series = read_csv('shampoo-sales.csv', header=0, index_col=0).squeeze('columns')
series.index = [parser(d) for d in series.index]
model = ARIMA(series, order=(5, 1, 0))  # p=5, d=1, q=0
model_fit = model.fit()
print(model_fit.summary())
print(model_fit.forecast())             # one-step-ahead forecast
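The order of differencing d is typically chosen by checking the series for stationarity; a minimal sketch using the augmented Dickey-Fuller test from statsmodels on the series loaded above:

from statsmodels.tsa.stattools import adfuller

# a p-value above ~0.05 suggests non-stationarity: difference once more
print(adfuller(series)[1])                   # raw series
print(adfuller(series.diff().dropna())[1])   # after first differencing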
- GARCH. Generalized AutoRegressive Conditional Heteroskedasticity: used when a time series exhibits heteroskedasticity, i.e. the variance of its errors changes over time (volatility clustering).
import numpy as np
import pandas as pd
import pyflux as pf
from pandas_datareader import DataReader
from datetime import datetime

# daily JPMorgan prices; the log returns are the series to model
jpm = DataReader('JPM', 'yahoo', datetime(2006, 1, 1), datetime(2016, 3, 10))
returns = pd.DataFrame(np.diff(np.log(jpm['Adj Close'].values)))
returns.index = jpm.index.values[1:]
returns.columns = ['JPM Returns']

model = pf.GARCH(returns, p=1, q=1)
x = model.fit()
x.summary()
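The GARCH(1,1) variance equation itself is $\sigma^2_t = \omega + \alpha\,\varepsilon^2_{t-1} + \beta\,\sigma^2_{t-1}$; a minimal sketch of the recursion with assumed illustrative parameter values:

import numpy as np

omega, alpha, beta = 1e-5, 0.1, 0.85                  # assumed parameters
eps = np.random.default_rng(2).normal(0, 0.01, 500)   # stand-in for return shocks

sigma2 = np.empty_like(eps)
sigma2[0] = eps.var()
for t in range(1, len(eps)):
    # today's variance reacts to yesterday's squared shock and yesterday's variance
    sigma2[t] = omega + alpha * eps[t-1]**2 + beta * sigma2[t-1]
print(sigma2[-5:])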
If an important method is missing, please mention it in the comments and the article will be updated. Thank you for your attention.