From Line Charts to LSTMs: Mastering Time Series Forecasting
Time series data has a certain rhythm to it. The kind of rhythm you don't notice at first—until you start looking. That's what happened to me a few years ago while working on a project forecasting customer support volumes. At first, it felt simple: plot some data, draw a trend line, maybe use Excel's FORECAST function.
But like most things in data science, time series forecasting reveals its depth slowly. One baseline fails, a pattern emerges, and before you know it—you're exploring mathematical models, learning about stationarity, seasonality, and the sweet magic of memory in neural nets.
This comprehensive guide walks you through the entire landscape of time series forecasting—from statistical classics to modern deep learning.
What is Time Series Forecasting?
Time Series: A sequence of data points indexed in time order.
Forecasting: Predicting future values based on historical patterns.
Key Characteristics:
- Temporal ordering: Order matters (unlike tabular ML)
- Temporal dependence: Current value depends on past values
- Seasonality: Repeating patterns at regular intervals
- Trend: Long-term increase or decrease
- Noise: Random fluctuations
Applications:
- Sales and demand forecasting
- Stock price prediction
- Weather forecasting
- Energy consumption
- Anomaly detection
- Capacity planning
Core Time Series Concepts
Components of Time Series
Every time series can be decomposed into:
1. Trend (T): Long-term movement
- Increasing, decreasing, or stationary
2. Seasonality (S): Regular periodic patterns
- Daily, weekly, monthly, quarterly, yearly
3. Cyclic (C): Non-fixed period fluctuations
- Business cycles, economic cycles
4. Residual/Noise (ε): Random variation
Decomposition Models:
- Additive: Y = T + S + ε (constant seasonal variation)
- Multiplicative: Y = T × S × ε (seasonal variation proportional to the level)
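You can inspect these components directly with a classical decomposition. A minimal sketch using statsmodels (the file name and column are placeholders for a monthly series):

```python
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Hypothetical monthly series: any pandas Series with a DatetimeIndex works
sales = pd.read_csv("sales.csv", index_col="date", parse_dates=True)["value"]

# Split the series into trend, seasonal, and residual components
result = seasonal_decompose(sales, model="additive", period=12)
result.plot()  # stacked panels: observed, trend, seasonal, residual
plt.show()
```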
Stationarity
Definition: A time series is stationary if its statistical properties don't change over time.
Requirements:
- Constant mean: μ(t) = μ
- Constant variance: σ²(t) = σ²
- Autocovariance independent of time: Cov(Yₜ, Yₜ₊ₕ) depends only on h, not t
Why it matters: Most classical statistical models (e.g., ARIMA) require stationarity
Testing for Stationarity:
- Visual: Plot and look for trends/seasonality
- Augmented Dickey-Fuller test: Tests null hypothesis of non-stationarity
- KPSS test: Tests null hypothesis of stationarity
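Both tests are one call in statsmodels. A sketch, assuming series is a pandas Series:

```python
from statsmodels.tsa.stattools import adfuller, kpss

# ADF: null hypothesis is a unit root (non-stationary); small p-value suggests stationarity
adf_stat, adf_p, *_ = adfuller(series, autolag="AIC")

# KPSS: null hypothesis is stationarity; small p-value suggests non-stationarity
kpss_stat, kpss_p, *_ = kpss(series, regression="c", nlags="auto")

print(f"ADF p-value: {adf_p:.3f} | KPSS p-value: {kpss_p:.3f}")
```

Using the two together is a useful cross-check: if ADF says non-stationary and KPSS agrees, difference the series and test again.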
Achieving Stationarity:
- Differencing: Y'ₜ = Yₜ - Yₜ₋₁ (removes trend)
- Seasonal differencing: Y'ₜ = Yₜ - Yₜ₋ₛ (removes seasonality)
- Log transformation: Stabilizes variance
- Detrending: Subtract a fitted trend
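Each of these is a one-liner with pandas and NumPy. A sketch, assuming y is a monthly pandas Series:

```python
import numpy as np

y_diff = y.diff()              # first difference: removes a linear trend
y_sdiff = y.diff(12)           # seasonal difference at lag 12 for monthly data
y_log = np.log(y)              # log transform: stabilizes growing variance
y_log_diff = np.log(y).diff()  # common combination: log first, then difference
```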
Autocorrelation
Autocorrelation Function (ACF): Correlation between series and its lagged values
ACF(k) = Corr(Yₜ, Yₜ₋ₖ)
Partial Autocorrelation Function (PACF): Correlation after removing effect of intermediate lags
Use Cases:
- Identify AR and MA orders in ARIMA
- Detect seasonality
- Check for randomness
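statsmodels ships plotting helpers for both. A sketch, assuming y is a pandas Series:

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, axes = plt.subplots(2, 1, figsize=(10, 6))
plot_acf(y, lags=40, ax=axes[0])    # slow decay hints at trend; spikes at 12, 24 hint at yearly seasonality
plot_pacf(y, lags=40, ax=axes[1])   # a sharp cutoff suggests the AR order
plt.tight_layout()
plt.show()
```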
Chapter 1: ARIMA – The Old Reliable
I met ARIMA (AutoRegressive Integrated Moving Average) during a sales forecasting project. The data looked promising—no wild seasonal fluctuations, just a gentle upward trend.
ARIMA Components
ARIMA(p, d, q) combines three components:
1. AR (AutoRegressive) - p: Regression on its own lagged values
Yₜ = c + φ₁Yₜ₋₁ + φ₂Yₜ₋₂ + ... + φₚYₜ₋ₚ + εₜ
- p: Number of lag observations
- Current value depends on previous p values
- PACF helps determine p
2. I (Integrated) - d: Differencing to achieve stationarity
First difference: Y'ₜ = Yₜ - Yₜ₋₁
Second difference: Y''ₜ = Y'ₜ - Y'ₜ₋₁
- d: Number of differencing operations
- d=0: Stationary (ARMA model)
- d=1: Linear trend
- d=2: Quadratic trend (rare)
3. MA (Moving Average) - q: Regression on past forecast errors
Yₜ = μ + εₜ + θ₁εₜ₋₁ + θ₂εₜ₋₂ + ... + θqεₜ₋q
- q: Number of lagged forecast errors
- Smooths out short-term fluctuations
- ACF helps determine q
Complete ARIMA(p,d,q):
(1 - φ₁L - ... - φₚLᵖ)(1 - L)ᵈYₜ = (1 + θ₁L + ... + θqL^q)εₜ
Where L is the lag operator: LYₜ = Yₜ₋₁
Model Selection
Choosing p, d, q:
1. Differencing order (d):
- Plot series and ACF
- Apply differencing until stationary
- Confirm with ADF test
- Usually d ∈ {0, 1, 2}
2. AR order (p):
- Look at PACF plot
- PACF cuts off after lag p → AR(p)
- Significant lags in PACF suggest AR component
3. MA order (q):
- Look at ACF plot
- ACF cuts off after lag q → MA(q)
- Significant lags in ACF suggest MA component
Information Criteria:
- AIC (Akaike): AIC = 2k - 2·ln(L)
- BIC (Bayesian): BIC = k·ln(n) - 2·ln(L)
- Lower is better
- BIC penalizes complexity more heavily than AIC
Auto-ARIMA: Automatically searches for best (p,d,q) using stepwise algorithm
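One widely used implementation is pmdarima's auto_arima. A sketch with illustrative search bounds (check the library docs for current defaults):

```python
import pmdarima as pm

# Stepwise search over (p, d, q) that minimizes AIC; d is picked via unit-root tests
model = pm.auto_arima(
    y,
    start_p=0, max_p=5,
    start_q=0, max_q=5,
    d=None,              # let the test choose the differencing order
    seasonal=False,      # set seasonal=True and m=12 for SARIMA
    stepwise=True,
    information_criterion="aic",
)
print(model.summary())
forecast = model.predict(n_periods=10)
```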
Implementation Considerations
Assumptions:
- Stationarity (after differencing)
- Linear relationships
- Homoscedasticity (constant variance)
- No autocorrelation in residuals
Diagnostics:
- Residual plots: Should look like white noise
- ACF of residuals: No significant autocorrelation
- Ljung-Box test: Tests for autocorrelation in residuals
- Normality test: Q-Q plot for normality
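Here is what fitting and checking residuals can look like with statsmodels (a sketch; the (1,1,1) order is only illustrative):

```python
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.stats.diagnostic import acorr_ljungbox

res = ARIMA(y, order=(1, 1, 1)).fit()
print(res.summary())

# Standardized residuals, histogram, Q-Q plot, and correlogram in one figure
res.plot_diagnostics(figsize=(10, 8))

# Ljung-Box: large p-values mean no leftover autocorrelation in the residuals
print(acorr_ljungbox(res.resid, lags=[10, 20]))

# Point forecasts with confidence intervals
print(res.get_forecast(steps=10).summary_frame())
```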
Limitations:
- Requires stationarity
- Linear model (can't capture non-linear patterns)
- Struggles with structural breaks
- Needs sufficient data (min ~50-100 observations)
- No exogenous variables (use ARIMAX)
When ARIMA Shines:
- Stationary or trend-stationary data (stationary once the trend is removed)
- Short to medium-term forecasts (1-10 periods ahead)
- Data without complex seasonality
- Economic indicators, sales with stable growth
- When interpretability matters
- As a robust baseline
Lesson:
ARIMA is great when you're forecasting something stable—say, next quarter's orders in a predictable B2B business. But it's brittle when patterns shift or strong seasonality creeps in.
Chapter 2: SARIMA – ARIMA Learns the Seasons
Eventually, I worked on airline passenger data. This time, I saw seasonality—those neat annual peaks during summer holidays.
Enter SARIMA (Seasonal ARIMA). It extends ARIMA with seasonal components. Now the model could understand that spikes in July aren't outliers—they're expected.
SARIMA Notation: (p,d,q)(P,D,Q)ₛ
Non-seasonal part (p,d,q): Same as ARIMA
Seasonal part (P,D,Q)ₛ:
- P: Seasonal AR order
- D: Seasonal differencing order
- Q: Seasonal MA order
- s: Seasonal period (12 for monthly with yearly seasonality, 7 for daily with weekly)
Mathematical Form:
φₚ(L)·Φ_P(Lˢ)·(1 - L)ᵈ·(1 - Lˢ)ᴰ·Yₜ = θq(L)·Θ_Q(Lˢ)·εₜ
where φₚ and θq are the non-seasonal AR and MA polynomials of orders p and q, and Φ_P and Θ_Q are the seasonal polynomials of orders P and Q.
Example: SARIMA(1,1,1)(1,1,1)₁₂
- AR(1): Last month affects this month
- I(1): One non-seasonal difference
- MA(1): Last error affects this month
- Seasonal AR(1): Same month last year affects this month
- Seasonal I(1): One seasonal difference
- Seasonal MA(1): Error from same month last year affects this month
- Period: 12 months
Selection Process
1. Determine s: Domain knowledge (7 for daily-weekly, 12 for monthly-yearly)
2. Apply seasonal differencing if needed:
D=1: Y'ₜ = Yₜ - Yₜ₋ₛ
3. Apply non-seasonal differencing if needed
4. Examine ACF/PACF at:
- Early lags → non-seasonal (p, q)
- Seasonal lags (s, 2s, 3s, ...) → seasonal (P, Q)
5. Use Auto-ARIMA or grid search with AIC/BIC
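In statsmodels this is handled by SARIMAX. A sketch mirroring the (1,1,1)(1,1,1)₁₂ example above:

```python
from statsmodels.tsa.statespace.sarimax import SARIMAX

model = SARIMAX(
    y,                              # e.g. monthly passenger counts
    order=(1, 1, 1),                # non-seasonal (p, d, q)
    seasonal_order=(1, 1, 1, 12),   # seasonal (P, D, Q, s)
)
res = model.fit(disp=False)
print(res.summary())

forecast = res.get_forecast(steps=24)   # two years ahead for monthly data
print(forecast.predicted_mean)
print(forecast.conf_int())              # uncertainty widens with the horizon
```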
Limitations:
- 7 parameters: (p, d, q, P, D, Q, s) - tuning complexity explodes
- Rigid seasonality: Assumes constant seasonal pattern
- Computational cost: More parameters = slower
- Overfitting risk: Easy to overfit with many parameters
When SARIMA Shines:
- Strong, stable seasonality (retail, call centers)
- Regular patterns (monthly sales, quarterly earnings)
- Medium-term forecasts
- When seasonality period is known
Tradeoff:
SARIMA loves structure. It's perfect for call centers or retail demand with regular seasons. But real-world data? It's rarely that disciplined. Pattern changes require retraining.
Chapter 3: Holt-Winters – Smoothing the Ride
One day, a stakeholder wanted a "simple model that works fast." I turned to Holt-Winters Exponential Smoothing.
Exponential Smoothing Family
Simple Exponential Smoothing (no trend, no seasonality):
ŷₜ₊₁ = αyₜ + (1-α)ŷₜ
- α ∈ [0,1]: Smoothing parameter
- Recent observations weighted more heavily
Holt's Linear Trend (with trend):
Level: lₜ = αyₜ + (1-α)(lₜ₋₁ + bₜ₋₁)
Trend: bₜ = β(lₜ - lₜ₋₁) + (1-β)bₜ₋₁
Forecast: ŷₜ₊ₕ = lₜ + h·bₜ
Holt-Winters Seasonal (trend + seasonality):
Additive Seasonality (constant seasonal variation):
Level: lₜ = α(yₜ - sₜ₋ₛ) + (1-α)(lₜ₋₁ + bₜ₋₁)
Trend: bₜ = β(lₜ - lₜ₋₁) + (1-β)bₜ₋₁
Seasonal: sₜ = γ(yₜ - lₜ) + (1-γ)sₜ₋ₛ
Forecast: ŷₜ₊ₕ = lₜ + h·bₜ + sₜ₋ₛ₊ₕ
Multiplicative Seasonality (proportional variation):
Forecast: ŷₜ₊ₕ = (lₜ + h·bₜ) × sₜ₋ₛ₊ₕ
Parameters:
- α: Level smoothing (0-1)
- β: Trend smoothing (0-1)
- γ: Seasonal smoothing (0-1)
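The whole family lives in statsmodels' ExponentialSmoothing. A sketch, assuming additive trend and seasonality with a 12-month period:

```python
from statsmodels.tsa.holtwinters import ExponentialSmoothing

model = ExponentialSmoothing(
    y,
    trend="add",            # Holt's linear trend
    seasonal="add",         # use "mul" for multiplicative seasonality
    seasonal_periods=12,
)
fit = model.fit()           # smoothing parameters (alpha, beta, gamma) chosen by the optimizer
print(fit.params)           # fitted smoothing parameters and initial states
forecast = fit.forecast(12) # next 12 periods
```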
Advantages:
- No stationarity required: Works with trends and seasonality
- Fast: Computationally efficient
- Simple: Few parameters, easy to understand
- Online learning: Update as new data arrives
- Interpretable: Clear components (level, trend, season)
Limitations:
- Linear assumptions: Can't capture complex non-linear patterns
- Fixed seasonality: Seasonal pattern doesn't evolve
- Poor with shocks: Can't adapt to structural breaks
- No covariates: Can't incorporate external variables
- Forecast degrades: Long-term forecasts follow straight line + season
When COVID Hit:
The model crumbled. Holt-Winters assumes tomorrow looks like a weighted average of yesterday and last year's same month. It can't deal with shocks, anomalies, or complex lags.
Insight:
Holt-Winters is lightweight and interpretable. Perfect for fast baselines, business reports, and stable environments. But don't expect it to handle chaos or regime changes.
Chapter 4: Prophet – Facebook's Gift to Practitioners
Before diving into XGBoost, let me introduce Prophet—Facebook's practical forecasting tool designed for business time series.
How Prophet Works
Additive Model:
y(t) = g(t) + s(t) + h(t) + εₜ
Where:
- g(t): Piecewise linear or logistic growth trend
- s(t): Seasonal component (Fourier series)
- h(t): Holiday effects
- εₜ: Error term
- Trend: Piecewise linear (or logistic) growth with flexible changepoints that adapt to trend changes
- Seasonality: Fourier series for multiple seasonalities (daily, weekly, yearly)
- Holidays: User-specified holidays with custom effects
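In code, the public API is deliberately small. A sketch (Prophet expects a DataFrame with a ds datestamp column and a y target column; the 90-day horizon and US holidays are just examples):

```python
from prophet import Prophet

m = Prophet(yearly_seasonality=True, weekly_seasonality=True)
m.add_country_holidays(country_name="US")     # optional built-in holiday calendar

m.fit(df)                                     # df has columns 'ds' and 'y'
future = m.make_future_dataframe(periods=90)  # extend 90 days beyond the training data
forecast = m.predict(future)                  # yhat, yhat_lower, yhat_upper per row

fig = m.plot_components(forecast)             # trend, holidays, weekly and yearly seasonality
```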
Why Prophet is Special:
- Handles missing data: No need for imputation
- Trend changepoints: Automatically detects shifts
- Multiple seasonalities: Daily + weekly + yearly simultaneously
- Holiday effects: Incorporate domain knowledge
- Robust to outliers: Uses robust regression
- Uncertainty intervals: Provides confidence bands
- Minimal tuning: Sensible defaults work well out of the box
- Interpretable: Clear trend and seasonal components
Best For:
- Business forecasting (sales, users, revenue)
- Daily data with sub-daily patterns
- Multiple seasonal effects
- Known holidays/events
- When you need uncertainty estimates
- Quick prototyping
Limitations:
- Assumes additive components
- Less flexible than deep learning
- Can struggle with complex interactions
- Not ideal for very high-frequency data (sub-hourly intervals)
Chapter 5: XGBoost – The Hacker's Choice
Frustrated with classical models, I turned to XGBoost. It's not a time series model, but it's a powerful regression algorithm—if you reshape the data right.
Time Series as Supervised Learning
Key Idea: Convert time series to supervised regression problem
Original Data:
Date | Sales
2024-01-01 | 100
2024-01-02 | 120
2024-01-03 | 115
Transformed (lag features):
Date | lag_1 | lag_2 | lag_7 | Sales (target)
2024-01-08 | 130 | 125 | 100 | 135
2024-01-09 | 135 | 130 | 120 | 140
Feature Engineering for Time Series
1. Lag Features:
df['lag_1'] = df['value'].shift(1)
df['lag_7'] = df['value'].shift(7)
df['lag_30'] = df['value'].shift(30)
2. Rolling Window Statistics:
df['rolling_mean_7'] = df['value'].rolling(7).mean()
df['rolling_std_7'] = df['value'].rolling(7).std()
df['rolling_max_30'] = df['value'].rolling(30).max()
3. Temporal Features:
df['day_of_week'] = df['date'].dt.dayofweek
df['month'] = df['date'].dt.month
df['quarter'] = df['date'].dt.quarter
df['is_weekend'] = df['day_of_week'].isin([5,6])
4. Cyclical Encoding (for periodic features):
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
5. Domain-Specific:
- Holiday indicators
- Marketing campaign flags
- Weather data
- Economic indicators
- Competitor prices
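Putting the pieces above together, here is a minimal end-to-end sketch (file name, column names, and the 28-day validation window are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from xgboost import XGBRegressor

df = pd.read_csv("sales.csv", parse_dates=["date"]).sort_values("date")

# Lag, rolling, and calendar features; shift before rolling so the target never leaks
for lag in (1, 7, 30):
    df[f"lag_{lag}"] = df["value"].shift(lag)
df["rolling_mean_7"] = df["value"].shift(1).rolling(7).mean()
df["day_of_week"] = df["date"].dt.dayofweek
df["month_sin"] = np.sin(2 * np.pi * df["date"].dt.month / 12)
df["month_cos"] = np.cos(2 * np.pi * df["date"].dt.month / 12)
df = df.dropna()

# Chronological split: the validation period always comes after the training period
features = [c for c in df.columns if c not in ("date", "value")]
train, valid = df.iloc[:-28], df.iloc[-28:]

model = XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=5)
model.fit(train[features], train["value"])
preds = model.predict(valid[features])
```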
Advantages:
- Non-linear: Captures complex patterns
- Handles covariates: External variables easily incorporated
- Feature interactions: Learns interactions automatically
- Robust: Less sensitive to outliers than ARIMA
- Flexible: Works with any time granularity
Challenges:
- Manual feature engineering: Critical for success
- No inherent temporal awareness: Must create lag features
- Multi-step forecasting: Needs recursive or direct strategy
- Leakage risk: Easy to accidentally leak future information
- Missing values: Need careful handling in lags
Multi-Step Forecasting Strategies:
Recursive (Iterative):
- Predict t+1, use it to predict t+2, etc.
- Error accumulates
- Uses same model
Direct:
- Train separate model for each horizon
- No error accumulation
- More models to maintain
Multiple Output:
- Single model predicts all horizons
- Balanced approach
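As a concrete example of the recursive strategy, here is a sketch assuming a fitted model whose only features are lag_1 and lag_7 (a real pipeline would rebuild every lag, rolling, and calendar feature at each step):

```python
import numpy as np

def recursive_forecast(model, history, horizon=14):
    """Roll forward, feeding each prediction back in as the newest observation."""
    history = list(history)          # most recent values, oldest first, at least 7 long
    preds = []
    for _ in range(horizon):
        x = np.array([[history[-1], history[-7]]])   # lag_1 and lag_7
        y_hat = float(model.predict(x)[0])
        preds.append(y_hat)
        history.append(y_hat)        # the prediction becomes the next lag_1
    return preds
```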
When XGBoost Shines:
- Rich external features (weather, holidays, events)
- Non-linear patterns
- Multiple interacting variables
- Short to medium-term forecasts
- When you have time for feature engineering
Tradeoff:
XGBoost is a beast when you have rich metadata and time-aware features. But it needs manual feature engineering—and lots of it. Think of it as a powerful tool that requires you to explicitly tell it about time.
Chapter 6: RNNs and LSTMs – Memory Meets Prediction
Eventually, I reached the deep end: Recurrent Neural Networks. LSTMs, to be precise.
Why RNNs for Time Series?
Unlike feedforward networks, RNNs have memory—they maintain hidden states that capture information from previous time steps.
Basic RNN:
hₜ = tanh(Wₓₕxₜ + Wₕₕhₜ₋₁ + bₕ)
yₜ = Wₕᵧhₜ + bᵧ
Where:
- hₜ: Hidden state at time t
- xₜ: Input at time t
- yₜ: Output at time t
Problem: Vanishing gradients make learning long-term dependencies difficult
LSTM (Long Short-Term Memory)
LSTMs solve the vanishing gradient problem with gating mechanisms:
Architecture:
Forget gate: fₜ = σ(Wf·[hₜ₋₁, xₜ] + bf)
Input gate: iₜ = σ(Wi·[hₜ₋₁, xₜ] + bi)
Output gate: oₜ = σ(Wo·[hₜ₋₁, xₜ] + bo)
Cell state: C̃ₜ = tanh(Wc·[hₜ₋₁, xₜ] + bc)
Cₜ = fₜ⊙Cₜ₋₁ + iₜ⊙C̃ₜ
Hidden state: hₜ = oₜ⊙tanh(Cₜ)
Gates:
- Forget gate: What to forget from cell state
- Input gate: What new information to store
- Output gate: What to output from cell state
- Cell state: Long-term memory
- Hidden state: Short-term memory
GRU (Gated Recurrent Unit)
Simpler alternative to LSTM with fewer parameters:
- Combines forget and input gates into single "update gate"
- Merges cell state and hidden state
- Faster training, similar performance
Training LSTMs for Forecasting
Sequence-to-One:
- Input: Past n time steps
- Output: Single future value
- For one-step-ahead forecasting (see the windowing sketch after this list)
Sequence-to-Sequence:
- Input: Past n time steps
- Output: Future m time steps
- For multi-horizon forecasting
Many-to-Many:
- Input: Sequence
- Output: Sequence of same length
- For sequence labeling/prediction
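Building sequence-to-one training samples is mostly array slicing. A sketch, assuming values is a 1-D NumPy array that has already been scaled:

```python
import numpy as np

def make_windows(values, lookback=24):
    """Turn a 1-D series into (samples, lookback, 1) inputs and next-step targets."""
    X, y = [], []
    for i in range(len(values) - lookback):
        X.append(values[i : i + lookback])
        y.append(values[i + lookback])
    X = np.array(X)[..., np.newaxis]   # add a feature axis: (samples, lookback, 1)
    return X, np.array(y)

X, y = make_windows(values, lookback=24)
```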
Architecture Design:
Layers:
```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

model = Sequential([
    LSTM(64, return_sequences=True, input_shape=(lookback, features)),  # pass the full sequence to the next LSTM
    Dropout(0.2),                                                       # regularize between recurrent layers
    LSTM(32, return_sequences=False),                                   # keep only the final hidden state
    Dropout(0.2),
    Dense(16, activation='relu'),
    Dense(1)                                                            # one-step-ahead forecast
])
model.compile(optimizer='adam', loss='mse')
```
Key Hyperparameters:
- Units: Number of LSTM cells (32, 64, 128)
- Lookback window: How many past steps to consider
- Layers: Number of stacked LSTMs
- Dropout: Regularization (0.2-0.5)
- Batch size: Training batch size
- Learning rate: Adam optimizer (typically 0.001)
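A minimal training loop around the architecture above (a sketch: series is assumed to be a 1-D NumPy array, the scaler choice, 80/20 chronological split, and callback settings are illustrative, and make_windows is the helper from the earlier sketch):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.callbacks import EarlyStopping

scaler = MinMaxScaler()
scaled = scaler.fit_transform(series.reshape(-1, 1)).ravel()   # LSTMs are sensitive to scale

X, y = make_windows(scaled, lookback=24)
split = int(0.8 * len(X))                                      # chronological split, never shuffle
X_train, X_val = X[:split], X[split:]
y_train, y_val = y[:split], y[split:]

model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=100,
    batch_size=32,
    callbacks=[EarlyStopping(patience=10, restore_best_weights=True)],
)
preds = scaler.inverse_transform(model.predict(X_val))          # back to the original units
```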
Why LSTMs Excel:
- Automatic feature learning: No manual lag features needed
- Long-term dependencies: Captures patterns across long sequences
- Multivariate: Handles multiple input features naturally
- Non-linear: Captures complex relationships
- Flexible architecture: Can be customized for different tasks
Challenges:
- Data hungry: Needs thousands of observations
- Computational cost: GPU recommended, slow training
- Hyperparameter tuning: Many choices (units, layers, dropout, lookback)
- Black box: Hard to interpret predictions
- Overfitting: Easy to overfit on small data
- Sensitive to scaling: Requires normalization
Real Example:
We built a stacked LSTM model to forecast energy consumption. It worked beautifully—capturing hourly fluctuations, weekend dips, even holidays.
But the model was heavy, black-boxed, and sensitive to noise. Training took hours, tuning was art. Business users found it hard to trust without interpretability.
When to Use RNNs/LSTMs:
- Long sequences: > 100 time steps
- Complex patterns: Multiple interacting seasonalities
- Multivariate data: Multiple related time series
- High-frequency data: Hourly, minutely, second-level
- Sufficient data: > 10K observations
- Applications: Energy forecasting, traffic prediction, sensor data, financial tick data
Takeaway:
RNNs shine when the sequence is king and patterns are complex. But use them when the stakes are high, data is rich, computational resources available, and you're okay with less interpretability.
The Final Lesson
Over time, I learned this: no single model rules them all. Each comes with strengths, assumptions, and blind spots. The art of time series forecasting is not about finding the best model—but the right model for the business need, the data, and the team.
| Model | Best For | Weakness |
| ------------ | ----------------------------------- | ---------------------------------- |
| ARIMA | Trend-based, short-term predictions | No seasonality, needs stationarity |
| SARIMA | Seasonal business patterns | Complex tuning, rigid assumptions |
| Holt-Winters | Smooth, regular seasonal data | Fails with volatility |
| Prophet | Business series with holidays and multiple seasonalities | Additive assumptions, limited flexibility |
| XGBoost | Feature-rich tabular time data | Needs careful feature engineering |
| RNN (LSTM) | High-frequency, long-sequence data | Black-box, resource intensive |
So before picking a model, ask yourself:
- Do I need explainability?
- Is the data seasonal or volatile?
- Can I derive rich features?
- Do I have enough data to train a neural net?
The right answers won’t just improve your forecast—they’ll make your stakeholders trust it too.