Sequence Models

In deep learning, the key to processing sequential data (such as text, audio, or stock prices) lies in understanding the temporal dependencies between data points.

1. Intuition

Sequence Dependency: Reading a Story

When you read a sentence or listen to someone speak, your expectation for the next word depends entirely on the context.

  • IID Data: Like reading random entries in a dictionary; there is no connection between them.
  • Sequence Models: Like reading a mystery novel. If you skip the first two chapters and jump to the end, the identity of the culprit makes no sense.

Core Point: The meaning of the current input (word, note, or frame) is deeply coupled with its “position” and the “foreshadowing” provided by earlier data.

Error Accumulation: The Gossip Train

Why is long-term multi-step prediction so difficult?

  • It is like a game of “Gossip Train” (also known as “Telephone”).
  • One-step Prediction: Each person hears the correct sentence and passes it on. Even with minor errors, the overall message remains fairly reliable.
  • Multi-step Prediction: The first person mishears, and the second person passes on that mistaken signal. By the end, the message is completely distorted.

In models, this means that once a prediction deviates from the truth, that deviation becomes part of the next input, so errors compound with every additional step.

2. Knowledge Points

  • Autoregressive Models:

    • The core idea is to use past observations of a sequence to predict the current value.
    • Mathematical representation: $x_t \sim P(x_t \mid x_{t-1}, \dots, x_1)$
    • Challenge: the number of conditioning inputs grows with $t$, so the model must handle an ever-growing, variable-length history.
  • Markov Property:

    • To simplify autoregressive models, we assume that only a recent window of history is necessary.
    • First-order Markov: The current value depends only on the previous step, $x_t \sim P(x_t \mid x_{t-1})$.
    • $k$-th order Markov: The current value depends on the previous $k$ steps, $x_t \sim P(x_t \mid x_{t-1}, \ldots, x_{t-k})$. This significantly reduces parameter count.
  • Latent Autoregressive Models:

    • Instead of looking back at the entire raw history, these models maintain an internal “summary” state $h_t$ (hidden state).
    • Prediction is based on the hidden state: $\hat{x}_t = f(h_t)$
    • State Update: $h_t = g(h_{t-1}, x_{t-1})$. This forms the foundation for Recurrent Neural Networks (RNNs); a minimal sketch of this update appears after this list.
  • One-step vs. Multi-step Prediction:

    • One-step (1-step-ahead): Predicting $x_{t+1}$ using the true ground-truth $x_t$.
    • Multi-step (k-step-ahead): Predicting $x_{t+k}$ where intermediate steps use the model’s own previously predicted values $\hat{x}$; the two modes are contrasted in the sketch after this list.
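
The following is a minimal, illustrative PyTorch sketch of these ideas, not taken from the original notes: a small MLP plays the role of a fixed-window ($\tau$-th order Markov) autoregressive model on a toy noisy sine wave, and the two prediction passes at the end contrast one-step-ahead prediction (every window holds ground truth) with multi-step prediction (the model’s own outputs are fed back in). The data, the names (`net`, `tau`, `n_train`), and the hyperparameters are all assumptions chosen for illustration.

```python
import torch
from torch import nn

# Toy sequence: a noisy sine wave stands in for real sequential data (assumption).
T = 1000
time = torch.arange(1, T + 1, dtype=torch.float32)
x = torch.sin(0.01 * time) + torch.normal(0.0, 0.2, (T,))

# Fixed window (tau-th order Markov assumption):
# predict x_t from the previous tau observations (x_{t-tau}, ..., x_{t-1}).
tau = 4
features = torch.stack([x[i: T - tau + i] for i in range(tau)], dim=1)  # (T - tau, tau)
labels = x[tau:].reshape(-1, 1)

# A small MLP acts as the autoregressive model f(x_{t-tau}, ..., x_{t-1}) -> x_t.
net = nn.Sequential(nn.Linear(tau, 10), nn.ReLU(), nn.Linear(10, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=0.01)

n_train = 600
for _ in range(200):  # full-batch training on the first n_train windows
    optimizer.zero_grad()
    loss = loss_fn(net(features[:n_train]), labels[:n_train])
    loss.backward()
    optimizer.step()

with torch.no_grad():
    # One-step-ahead prediction: every input window contains ground-truth values.
    onestep_preds = net(features)

    # Multi-step prediction: beyond the training range, feed the model's own
    # predictions back in; errors accumulate as the horizon grows.
    multistep = torch.zeros(T)
    multistep[: n_train + tau] = x[: n_train + tau]
    for i in range(n_train + tau, T):
        multistep[i] = net(multistep[i - tau: i].reshape(1, tau)).item()
```

Plotting `onestep_preds` against `multistep` beyond `n_train` makes the divergence described by the “Gossip Train” analogy visible.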
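
A correspondingly minimal sketch of a latent autoregressive model, again an illustrative assumption rather than a reference implementation: a fixed-size hidden summary is updated as $h_t = g(h_{t-1}, x_{t-1})$ and the prediction $\hat{x}_t = f(h_t)$ is read out of it, which is exactly the recurrence that RNNs build on. The class name `LatentAR` and its sizes are hypothetical.

```python
import torch
from torch import nn

class LatentAR(nn.Module):
    """Minimal latent autoregressive model (illustrative): instead of the full
    history, it carries a fixed-size hidden summary h_t forward in time."""

    def __init__(self, hidden_size: int = 16):
        super().__init__()
        # g: update the summary from the previous summary and previous observation.
        self.g = nn.Linear(hidden_size + 1, hidden_size)
        # f: read a prediction for x_t out of the current summary.
        self.f = nn.Linear(hidden_size, 1)

    def forward(self, xs: torch.Tensor) -> torch.Tensor:
        # xs: (batch, seq_len) observed sequence.
        batch, seq_len = xs.shape
        h = torch.zeros(batch, self.g.out_features)  # h_0
        preds = []
        for t in range(1, seq_len):
            # h_t = g(h_{t-1}, x_{t-1})
            h = torch.tanh(self.g(torch.cat([h, xs[:, t - 1: t]], dim=1)))
            # x_hat_t = f(h_t)
            preds.append(self.f(h))
        return torch.cat(preds, dim=1)  # predictions for x_1, ..., x_{seq_len-1}

# Usage: predict each step of a toy batch from its running summary.
model = LatentAR()
xs = torch.randn(8, 20)   # 8 toy sequences of length 20
x_hat = model(xs)         # shape (8, 19)
```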

3. Key Points

  • Q: Why do sequence models often decay to a constant during long-term extrapolation?

    • A: This is due to error accumulation. Since the model lacks access to future ground truth, small initial errors are compounded through recursive feedback of its own predictions. As the prediction horizon increases, the model loses its ability to track specific dependencies and converges to a statistical mean or constant (a worked example follows this list).
  • Q: If we can use fixed-window (Markovian) models, why do we need latent variable models?

    • A: Fixed windows (like CNNs or N-grams) only capture local dependencies and struggle with long-range associations. Latent variable models, through dynamically updated “memory units,” can theoretically retain information from much further back without needing a massive input window.
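
A small worked example, using an assumed AR(1) process rather than anything from the original notes, makes the “decay to a constant” concrete. For a zero-mean first-order autoregressive process with coefficient $|a| < 1$, the one-step prediction is $\hat{x}_{t+1} = a\,x_t$. Recursively feeding predictions back in gives

$$\hat{x}_{t+k} = a\,\hat{x}_{t+k-1} = a^2\,\hat{x}_{t+k-2} = \dots = a^{k}\,x_t \xrightarrow{k \to \infty} 0,$$

so the multi-step forecast collapses to the unconditional mean (here $0$), while the one-step forecast stays anchored to the latest true observation $x_t$.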
