= Time Series Summary
:toc:
:toc-placement: left
toc::[]
== Summary
* mostly from https://otexts.org/fpp2/[Forecasting: Principles and Practice (Hyndman, Athanasopoulos)]
* "stationary": properties do not depend on absolute time (e.g. fixed time seasons; but may have cycles due to dependence on past values)
* centered moving average will smooth out seasons (e.g. 1/8,1/4,1/4,1/4,1/8)
* consider transforming the data if you want to restrict it to a range (e.g. log for positive values) or the variance is changing (e.g. Box-Cox transform)
** -> remember to apply the correct back-transformation, since the naive back-transform biases the mean
* you may consider modelling effects as additive or multiplicative
* forecasting with co-integrated features may require special treatment
* include spike dummy variables for special events
* you need a stationary time series for ARIMA-like models -> use the KPSS test -> difference if the p-value is small (otherwise variance/interval estimates are incorrect)
* you can include features (rather than the target alone); they need to be stationary, and if differencing, ideally difference all of them at the same time
* could include lagged features, if lagged effects reasonable
* use AICc or CV to choose models (but differencing cannot be found with this since changes data)
* for ARIMA(p,d,q), if p=0 or q=0: can detect model from behaviour of ACF/PACF plots
* test residuals:
** use Ljung-Box test to see if residuals are white noise
** use AR(1) or Cochrane-Orcutt test to check for autocorrelation in residuals
** -> if not, you are missing information and may need a better model (more features, transformations, ...)
* usually prediction intervals are too narrow since many choices not accounted for
* bagging: do many predictions and average
=== Successful models
* SARIMAX: but does not work if period large (e.g. yearly seasons on weekly data)
* damped Holt-Winters
* auto.arima (R): may need to force to search more exhaustively
* ETS (Error-Trend-Season): non-linear ETS is not ARIMA
* Seasons could be handled with Fourier terms asciimath:[cos/sin (2*pi*k*t/T)] "harmonic regression"
* TBATS
* LSTM, RNN
* XGB, ANN
* GARCH
* Prophet (Facebook)
* CausalImpact (Google)
=== Special models
* Hierarchical/grouped models: when want to model parts and whole
* Vector Autoregression (VAR): when interactions between multiple targets
* Low integer count prediction: there are special methods for this; e.g. predict non-zero values and time between them
== Plots
* seasonal subseries plot: x-axis month; within each month show the development across years (and the mean of that month over all years)
* scatter plot (matrix) of y's
* autocorrelation with diff lags
* ACF of residuals
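A minimal sketch of these plots with statsmodels/matplotlib (Series `s` and model residuals `resid` are assumptions):

[source,python]
----
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, axes = plt.subplots(3, 1, figsize=(8, 9))
plot_acf(s, lags=40, ax=axes[0])                               # autocorrelation at different lags
plot_pacf(s, lags=40, ax=axes[1])                              # partial autocorrelation
plot_acf(resid, lags=40, ax=axes[2], title="ACF of residuals")
plt.tight_layout()
plt.show()
----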
== Tests
* Portmanteau test: test residuals whether they could be white noise
** Box-Pierce test: (scaled) sum of the first h squared residual autocorrelations; use h=10 for non-seasonal, h=2*season for seasonal, but h<=T/5
** Ljung-Box test: more accurate than Box-Pierce; chi^2 test
* Breusch-Godfrey test (Lagrange Multiplier) test for serial correlation: joint hypothesis that there is no autocorrelation in the residuals up to a certain specified order; small p-value -> significant autocorrelation remaining in the residuals.
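A sketch of the Ljung-Box test with statsmodels (residuals `resid` assumed; h follows the rule of thumb above):

[source,python]
----
from statsmodels.stats.diagnostic import acorr_ljungbox

h = 10  # non-seasonal rule of thumb; use 2*season for seasonal data, capped at T/5
lb = acorr_ljungbox(resid, lags=[h], return_df=True)
print(lb)  # small lb_pvalue -> residuals are probably not white noise
----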
=== Unit root test for stationarity
* determine whether differencing needed for stationarity
* KPSS Kwiatkowski-Phillips-Schmidt-Shin test: differencing needed if small p-value
* also check if seasonal differencing needed
* with autocorrelation:
** while coef estimate is still unbiased, it may be off because a feature with predictive power is not in the model
** (hidden) variance of estimates increased
** estimated std errors too narrow; all statistics too optimistic
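A sketch of first and seasonal differencing with pandas (monthly Series `s` assumed):

[source,python]
----
d1 = s.diff().dropna()              # first difference
d12 = s.diff(12).dropna()           # seasonal difference (lag 12 for monthly data)
d12_1 = s.diff(12).diff().dropna()  # seasonal, then first difference if still non-stationary
----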
== Autocorrelation
* trend -> ACF drops slowly
* white noise -> 95% within asciimath:[+-2/sqrt(T)] [T is length of time series]; if large deviations or too many -> not white noise
* model still unbiased if autocorrelated residuals, but prediction intervals may be wrong
* (Durbin-Watson tests for first order autocorrelation only; also does not work if target is reused as future feature; 0: pos 1st autocorr, 2: no 1st autocorr, 4: neg 1st autocorr)
* AR(1) more powerful test; Cochrane-Orcutt almost same result
* to fix:
** add feature
** change functional form
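A quick first-order check with statsmodels (residuals `resid` assumed):

[source,python]
----
from statsmodels.stats.stattools import durbin_watson

dw = durbin_watson(resid)  # ~2: no 1st-order autocorrelation; towards 0: positive; towards 4: negative
print(f"Durbin-Watson: {dw:.2f}")
----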
== Box-Cox transform
* asciimath:[(y^lambda-1)/lambda] (or asciimath:[ln(y)] if asciimath:[lambda=0])
* best if seasonal variation about equal sizes
* does not work if asciimath:[y<0]
* forecasts not sensitive on asciimath:[lambda]; but may have large effects on prediction intervals
*! note that the back-transform biases the mean -> needs correction if you want to add up results (e.g. aggregate); requires the forecast variance
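A sketch of the transform and its naive back-transform with scipy (strictly positive array `y` assumed):

[source,python]
----
from scipy.stats import boxcox
from scipy.special import inv_boxcox

y_t, lam = boxcox(y)           # lambda chosen by maximum likelihood; y must be > 0
y_back = inv_boxcox(y_t, lam)  # naive back-transform; for forecasts this gives the median,
                               # so bias-correct (using the forecast variance) if you need the mean
----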
== Metrics
* MAE for median
* RMSE for mean
* MAPE: mean of percentage error (only makes sense if meaningful zero); heavier weight on negative error
* (sMAPE: asciimath:[mean(2*|y_t - hat y_t|/(y_t + hat y_t))]; not recommended!)
* MASE mean absolute scaled error: scale errors based on training MAE from a simple forecast (e.g. naive forecasts)
* Hyndman, R. J., & Koehler, A. B. (2006). Another look at measures of forecast accuracy. International Journal of Forecasting, 22, 679–688.
== Prediction intervals
* https://otexts.org/fpp2/prediction-intervals.html
* 1 step ahead: stddev about same as that of residuals
* for basic methods, there are equations (e.g. asciimath:[sigma*sqrt(h)] for naive method)
* bootstrap method: iteratively cumulate errors
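A numpy sketch of the naive-method interval mentioned above (`resid` and the last observation `y_last` are assumptions):

[source,python]
----
import numpy as np

sigma = np.std(resid, ddof=1)            # roughly the 1-step-ahead forecast stddev
h = np.arange(1, 13)                     # forecast horizons
half_width = 1.96 * sigma * np.sqrt(h)   # 95% interval width for the naive method
lower, upper = y_last - half_width, y_last + half_width
----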
== Selecting features
* https://otexts.org/fpp2/selecting-predictors.html
* try multiple model versions; e.g. all feature subsets
* or do backward feature selection; not necessarily best model, but always good model
* (forward selection if all features are too many)
* (Adjusted R^2: equivalent to minimizing stderr; selects too many)
* LOO-CV (fast methods)
* AIC (for large T same as CV method)
* AICc: corrected, since for small T selects too many features
* (BIC: for large T similar to leave-v-out where asciimath:[v=T(1-1/(ln(T)-1))])
* -> use AIC/CV; FPP uses AICc
== Exponential smoothing
* average between last observation and previous forecast
* capture trend and seasonality
* asciimath:[hat y_(T+1)=alpha*y_T+alpha(1-alpha)*y_(T-1)+alpha(1-alpha)^2*y_(T-2)+...]
* need to choose smoothing and initial parameters
* Holt extended method to allow for trend: forecast/level/trend equation
* Gardner & McKenzie: added dampening to a flat line (parameter 0.8...0.98)
* can include seasons
* forecast by iterating model (and setting unknown errors to zero)
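A sketch of a damped Holt-Winters fit with statsmodels (monthly Series `s` assumed; the keyword is `damped_trend` in recent versions, `damped` in older ones):

[source,python]
----
from statsmodels.tsa.holtwinters import ExponentialSmoothing

fit = ExponentialSmoothing(
    s, trend="add", damped_trend=True, seasonal="add", seasonal_periods=12
).fit()                      # smoothing, damping and initial states estimated from the data
forecast = fit.forecast(12)  # iterate the model forward, future errors set to zero
----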
== ARIMA
* capture autocorrelation
* needs to be stationary
* cyclic may be stationary if it depends on past values and not on absolute time
* ACF:
** non-stationary: drop slowly, r1 often large and positive
** stationary: drops quickly
* to make stationary -> difference, then check the differenced series (e.g. Ljung-Box test; want large p-value)
* https://otexts.org/fpp2/MA.html
* AR: autoregression of target
* MA: regression on past forecast errors
* AR theoretically same as MA with large order
* with some constraints the MA model is invertible and also the same as an AR model with large order; invertible when recent observations are more predictive than distant ones(?)
* for stationary data the coefficients are restricted:
** AR(1): asciimath:[-1<phi_1<1]
** AR(2): asciimath:[-1<phi_2<1], asciimath:[phi_1+phi_2<1], asciimath:[phi_2-phi_1<1]
** AR(3): complex restrictions
* -> linear model on (differenced) target and past prediction errors
* also includes a constant term
* ARIMA(p,d,q):
** order of autoregr.
** degree of differencing
** order of moving average
* with c=0: differencing 0/1/2 will model zero/constant/line
* with c!=0: differencing 0/1/2 will model mean/line/quadratic
* for cycles need p>=2; then asciimath:[cos omega = -phi_1(1-phi_2)/(4*phi_2)]
* sometime possible to tell p and q from ACF/PACF plots
* PACF is like last coef in AR(k) model
* finding p/q from ACF/PACF plot:
**! ACF/PACF plots only help if just one of p>0 or q>0
** ARIMA(p,d,0): ACF exp decay or sine; PACF spike p, but none beyond (sharp drop)
** ARIMA(0,d,q): PACF exp decay or sine; ACF spike q, but none beyond (sharp drop)
* auto.arima in R may need to be forced to search more parameters
* estimation methods complex and different software can give different results
* AIC can be used to find p and q (penalty includes asciimath:[p+q+bool(c!=0)]); see the sketch after this list
*! d parameter cannot be found from AIC since it changes data and AIC not comparable
* auto.arima:
** https://otexts.org/fpp2/arima-r.html
** unit root test, minimize AIC, fit MLE
** d<=2 determined from KPSS test
** initial (0,0) (2,2) (1,0) (0,1); if d<2: use constant, but also fit (0,0) without constant
** start from best of initial and vary parameters to find better models
** `approximation=FALSE` and `stepwise=FALSE` for more search
* prediction intervals are complicated; for stationary models (d=0) they converge to a constant width
*! prediction intervals too narrow since variation in parameters, model order, etc. is not accounted for
* seasonal ARIMA (seasonal part (P,Q)):
** (0,1): single spike at the seasonal lag in the ACF, decay at seasonal lags in the PACF
** (1,0): decay at seasonal lags in the ACF, single spike at the seasonal lag in the PACF
* Books:
** Box, G. E. P., Jenkins, G. M., Reinsel, G. C., & Ljung, G. M. (2015). Time series analysis: Forecasting and control (5th ed). Hoboken, New Jersey: John Wiley & Sons.
** math background: Brockwell, P. J., & Davis, R. A. (2016). Introduction to time series and forecasting (3rd ed). New York, USA: Springer.
** alternative auto.arima approach: Peña, D., Tiao, G. C., & Tsay, R. S. (Eds.). (2001). A course in time series analysis. New York, USA: John Wiley & Sons.
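A minimal sketch in Python of fitting ARIMA and comparing orders by information criterion (statsmodels; series `s` assumed); this mimics only the model-comparison step, not the full auto.arima search:

[source,python]
----
import itertools
from statsmodels.tsa.arima.model import ARIMA

d = 1  # choose d from a unit root test (e.g. KPSS), not from the information criterion
best = None
for p, q in itertools.product(range(3), range(3)):
    res = ARIMA(s, order=(p, d, q)).fit()
    # FPP prefers AICc; recent statsmodels also exposes res.aicc
    if best is None or res.aic < best[0]:
        best = (res.aic, (p, d, q), res)
print("best order:", best[1])
----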
=== ARIMA vs ETS
* linear ETS is special case of ARIMA
* non-linear ETS no equivalent ARIMA
* some ARIMA not a ETS
* all ETS non-stationary, some ARIMA stationary
* AIC cannot be used to compare ARIMA and ETS because diff model class and diff MLE
* but could use time series CV to compare
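A rolling-origin (expanding window) CV sketch to compare the two model classes on the same series `s` (assumed; the ARIMA order and ETS configuration are placeholders; slow since each model is refit at every origin):

[source,python]
----
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.holtwinters import ExponentialSmoothing

h = 1
errors = {"arima": [], "ets": []}
for t in range(len(s) // 2, len(s) - h):   # expanding training window
    train, actual = s.iloc[:t], s.iloc[t + h - 1]
    fc_arima = ARIMA(train, order=(1, 1, 1)).fit().forecast(h).iloc[-1]
    fc_ets = ExponentialSmoothing(train, trend="add").fit().forecast(h).iloc[-1]
    errors["arima"].append(actual - fc_arima)
    errors["ets"].append(actual - fc_ets)
print({k: float(np.sqrt(np.mean(np.square(v)))) for k, v in errors.items()})  # RMSE per model
----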
== Dynamic regression
* https://otexts.org/fpp2/estimation.html
* normal linear regression, but let (auto-correlated) error asciimath:[nu] follow an ARIMA model (with white noise error asciimath:[epsilon])
* cannot just minimize the (auto-correlated) asciimath:[nu]: coefficient estimates would no longer be best, statistical tests and AIC would be incorrect, p-values would be too small -> minimize the white-noise asciimath:[epsilon] from the ARIMA model instead
*! requires that the target and all features are stationary (otherwise coefficients will not be consistent)
* exception for stationarity requirement: co-integrated (linear combination of non-stationary is actually stationary)
* to preserve the relationship between variables, apply the same differencing to all of them at once ("model in differences")
* now only need ARMA model (not ARIMA)
* to forecast, you also need to predict features (e.g. by assuming past averages)
* Pankratz, A. E. (1991): "Forecasting with dynamic regression models."
* Generalization of dynamic regression models, "transfer function models": Box, G. E. P., Jenkins, G. M., Reinsel, G. C., & Ljung, G. M. (2015). "Time series analysis: Forecasting and control" (5th ed)
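In Python, regression with ARIMA errors can be approximated via statsmodels SARIMAX with exogenous regressors (sketch; `y`, `X`, `X_future` are assumptions):

[source,python]
----
from statsmodels.tsa.statespace.sarimax import SARIMAX

# target y regressed on features X, with ARIMA(1,1,1) errors handled in the state space model
fit = SARIMAX(y, exog=X, order=(1, 1, 1)).fit(disp=False)
forecast = fit.forecast(steps=len(X_future), exog=X_future)  # future feature values must be supplied
----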
=== Trend modeling
* options:
** deterministic trend: constant term in target prediction
** stochastic trend: using differencing (asciimath:[d=1]) in error model (will have wider prediction intervals; but that may be good because it considers changing trends)
== Seasons with Harmonic regression
* https://otexts.org/fpp2/dhr.html
* for long seasonal trends: dynamic regression with Fourier terms often better; especially since ARIMA does not like large periods (24 for hourly just about fine)
* also, seasonal differencing of daily data at a lag of one year is just noisy
* -> use harmonic regression for seasons and ARMA for short-term
* only drawback is that seasons are assumed to be constant
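A sketch of building the Fourier features by hand (numpy; the period and number of harmonics K are placeholder choices):

[source,python]
----
import numpy as np

def fourier_terms(t, period, K):
    """sin/cos pairs for harmonics k = 1..K of the given period."""
    return np.column_stack(
        [f(2 * np.pi * k * t / period) for k in range(1, K + 1) for f in (np.sin, np.cos)]
    )

t = np.arange(len(s))                              # s: the series (assumed)
X_seasonal = fourier_terms(t, period=365.25, K=5)  # e.g. yearly season on daily data
# use X_seasonal as features in a dynamic regression with ARMA errors
----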
== Lagged features
* include lagged features if it makes sense (due to delayed effects)
* choose number of lags by AIC
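A sketch with pandas (DataFrame `df` with a feature column `x` is an assumption):

[source,python]
----
# add lags 1..3 of feature x; drop rows where the lags are undefined
for lag in range(1, 4):
    df[f"x_lag{lag}"] = df["x"].shift(lag)
df_model = df.dropna()
----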
== Hierarchical/grouped time series
* when target can be split into (hierarchical) categories (e.g. type or region)
* want "coherent" add up of series
* there is always: [all-intermediate-level-targets] = [Summing matrix] * [Bottom-level-targets] (see the sketch after this list)
* Wickramasuriya, S. L., Athanasopoulos, G., & Hyndman, R. J. (2018). Optimal forecast reconciliation for hierarchical and grouped time series through trace minimization
* "Grouped time series": Also independent (crossed) categorisation and not just a single tree
* bottom-up approach: just predict bottom level and sum
* top-down approach: predict top level and then predict proportions of categories; may perform well
* check https://otexts.org/fpp2/top-down.html for more
* middle-out approach: predict from middle and expand by summing or using proportions
* generally incoherent forecasts can be made coherent with a "reconciliation matrix"
* optimal reconciliation matrix: https://otexts.org/fpp2/reconciliation.html; "MinT" approach
** no top-down approach can be unbiased
** need to estimate forecast error variance -> use approximations
* further reading:
** Gross & Sohl (1990) provide a good introduction to the top-down approaches.
** The reconciliation methods were developed in a series of papers, which are best read in the following order: Hyndman et al. (2011), Athanasopoulos et al. (2009), Hyndman, Lee, & Wang (2016), Wickramasuriya et al. (2018).
** Athanasopoulos, Hyndman, Kourentzes, & Petropoulos (2017) extends the reconciliation approach to deal with temporal hierarchies.
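A minimal numpy sketch of the summing-matrix relation for a two-level hierarchy with bottom series A and B (hypothetical numbers):

[source,python]
----
import numpy as np

# rows: [Total, A, B]; columns: bottom-level series [A, B]
S = np.array([[1, 1],
              [1, 0],
              [0, 1]])
y_bottom = np.array([3.0, 5.0])   # bottom-level forecasts (bottom-up approach)
y_all = S @ y_bottom              # coherent forecasts at all levels: [8, 3, 5]
----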
== Complex, multiple seasonality
* https://otexts.org/fpp2/complexseasonality.html
* `msts`, `mstl` classes in R: specify all frequencies
* or use harmonic regression (include Fourier terms for each frequency asciimath:[T]): asciimath:[sin/cos ((2*pi*k*t/T))]; optimize number of terms by AICc
* or TBATS model by De Livera, Hyndman, Snyder: combination of Fourier terms, exponential smoothing, Box-Cox; automated
** however prediction intervals often too wide
** do not allow for features
* more sophisticated:
** Hyndman, R. J., & Fan, S. (2010). Density forecasting for long-term peak electricity demand. IEEE Transactions on Power Systems, 25(2), 1142–1153.
** Fan, S., & Hyndman, R. J. (2012). Short-term load forecasting based on a semi-parametric additive model. IEEE Transactions on Power Systems, 27(1), 134–141.
== Vector autoregression (VAR) where bi-directional influence
* target influences features
* influence cross-terms
* choose how many variables (choose only few by information criterion) and how many lags to use
* for co-integrated features use "vector error correction model"
* predict recursively with 1-step predictions
* `VARselect()` from the R `vars` package selects the number of lags; Hyndman suggests the BIC (SC) criterion since AIC tends to choose too many lags
* making cross-influence possible seems crude, but VAR is useful when:
** forecasting a collection of related variables where no explicit interpretation is required
** testing whether one variable is useful in forecasting another (the basis of Granger causality tests);
** impulse response analysis, where the response of one variable to a sudden but temporary change in another variable is analysed;
** forecast error variance decomposition, where the proportion of the forecast variance of each variable is attributed to the effects of the other variables
* Vector ARMA is a generalization but also more complicated
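A sketch of a VAR fit with statsmodels (DataFrame `df` of the related, stationary series is assumed):

[source,python]
----
from statsmodels.tsa.api import VAR

res = VAR(df).fit(maxlags=8, ic="bic")                     # BIC/SC tends to pick fewer lags than AIC
forecast = res.forecast(df.values[-res.k_ar:], steps=12)   # recursive 1-step predictions
----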
== Neural networks
* use asciimath:[p] lags and asciimath:[k] nodes in a simple (autoregressive) neural network NNAR(p,k)
* `nnetar()`
* since not a stochastic model, no clear prediction intervals -> pick random errors from a distribution while predicting and simulate possible future paths
== Bootstrapping and bagging
* "blocked bootstrap" random sections of remainder time series (i.e. without trend ans season) select and joined together
* -> measure of forecast uncertainty
== General
* all can be additive or multiplicative
* remove calendar effects (e.g. number of days in month)
* forecast variable y: dependent, explained variable
* predictor variable x: independent, explanatory variable
* forecast inefficient if residuals autocorrelated or correlated with target
* could use spike or step function for interventions or changes
* lagged values can be useful
* for seasons of period asciimath:[m] could use first Fourier features asciimath:[sin/cos (2*pi*t*k/m)]
* regression with Fourier terms: "harmonic regression"
* https://otexts.org/fpp2/regression-matrices.html
* CV statistics can be calculated directly from the hat matrix asciimath:[H=X(X'X)^(-1)X'] with asciimath:[CV=1/T sum (e_t/(1-h_t))^2] where asciimath:[h_t] are the diagonal elements of the hat matrix [hence no need to fit many models]; see the sketch at the end of this section
* when multi-collinearity: coefficient estimates unreliable, relation between features unreliable; for good software not a problem [unless perfect collinearity?]
* forecasts unreliable if new feature values lie outside the previously observed range
* "seasonally adjusted": remove seasonal
* if season order 4: 2x4 MA (moving average, first 2 then 4) will give weights (1/8,1/4,1/4,1/4,1/8) and hence remove quarterly season; MA is centered here
* classical decomposition: detrend with MA (m for odd, 2xm for even order season) -> remove averaged curve from original; average out months
* for monthly/quarterly seasons, some institutions have developed simple averaging methods: https://otexts.org/fpp2/x11.html
* https://otexts.org/fpp2/stl.html[STL Seasonal and Trend decomposition using Loess]
* "stationary": when properties do not depend on time (every set of measurements if time invariant distribution)
* FPP prefers to use AICc for minimization
* one could plot complex root with unit circle (https://otexts.org/fpp2/arima-r.html) to see how stationary and invertible the model is -> roots close to unit circle may make the model unstable (but many implementations only return well-behaved models)
* for forecasting: use predicted values as features; use zero for future errors; https://otexts.org/fpp2/arima-forecasting.html
* forecasting with co-integrated: "Applied time series modelling and forecasting" (Richard Harris and Robert Sollis)
* almost all model give too narrow(!) prediction intervals since some uncertainty not considered (e.g. ETS 95% may cover only 70%)
* Bagging: predict many bootstrapped versions and average
* one could need 13 Fourier terms with asciimath:[2*pi*j*t/52.18] if you have weekly data and yearly seasons
* for holiday effects use dummy variables (but STL, ETS, TBATS do not allow features) -> rather use dynamic regression models
* for predicting low integer counts: https://otexts.org/fpp2/counts.html
** Croston's method: often used, even though not great; predict non-zero values and also length of zero count runs
** better methods for low count data: Christou, V., & Fokianos, K. (2015). On count time series prediction. Journal of Statistical Computation and Simulation, 85(2), 357–373.
* to limit forecasts to a range, one could transform them https://otexts.org/fpp2/limits.html (e.g. log for positive counts)
* if you want to predict aggregates (e.g. 3 month sum when having monthly data), you can just sum, but may need to correct prediction intervals (possible correlations)
* ETS and ARIMA with differencing allow for condition changes, but dynamic regression models have no evolution
* consider whether you want 1-step or multi-step forecasts in the train/test evaluation
* missing values can be estimated
* you could replace outliers with missing values
* ideally do not use the future information to decide on missing or outlier values
* more possible time series issues: last chapter of "Principles of business forecasting" (Ord, J. K., Fildes, R., & Kourentzes, N. (2017))
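A numpy sketch of the hat-matrix shortcut for the regression CV statistic mentioned above (design matrix `X` with an intercept column and target `y` are assumptions):

[source,python]
----
import numpy as np

beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta                                     # residuals of the full fit
h_diag = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # diagonal of the hat matrix
cv = np.mean((e / (1 - h_diag)) ** 2)                # leave-one-out CV without refitting
----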
== Models to try
* ETS
* (S)ARIMA(X)
* GARCH
* TBATS
* Prophet (Facebook)
* CausalImpact
* Shallow ANN
== Cookbook
* check ACF of residuals
** see tests above
** helpful if normal, for easier prediction intervals
* residuals correlate with feature -> may need non-linear feature transformation
* any feature which is not in model correlates with residuals? -> include
* residuals vs fitted values -> if pattern, then heteroscedasticity and maybe transform target
* outliers which are "influential observations"?
* for ARIMA:
** https://otexts.org/fpp2/arima-r.html
** plot
** stabilize variance if needed
** check stationarity with the ACF (needs to drop quickly)
** make stationary by differencing if needed
** check ACF/PACF if (p,0) or (0,q) model appropriate -> use AICc to find better models
** check residuals by ACF and Portmanteau test to verify that they are white noise
* for CV test set you can use business metric again and also different differencing parameters
[source,python]
----
from statsmodels.tsa.stattools import kpss

# KPSS null hypothesis: the series is (level) stationary
p_value = kpss(s)[1]
if p_value > 0.05:
    print("Stationary. Good!")
----
* alternatively want `adfuller(s)[1] < 0.05` (ADF null hypothesis is non-stationarity; but this test has less power)