Pattern Recognition Through Time¶

Overview¶

Temporal pattern recognition covers signals that evolve over time:

Speech recognition: audio waveform → phonemes → words.
Handwriting recognition: pen stroke sequences → characters.
Video-based face recognition: temporal consistency improves accuracy over single frames.
These domains share a language-like structure — sequences of discrete or continuous observations generated by underlying hidden states.

An HMM is defined by:

\(S = \{s_1, \ldots, s_N\}\) — set of hidden states
\(O = \{o_1, \ldots, o_M\}\) — set of observable symbols
\(A = [a_{ij}]\) — state transition probabilities, \(a_{ij} = P(q_{t+1} = s_j | q_t = s_i)\)
\(B = [b_j(k)]\) — emission probabilities, \(b_j(k) = P(o_t = v_k | q_t = s_j)\)
\(\pi = [\pi_i]\) — initial state distribution, \(\pi_i = P(q_1 = s_i)\)

Evaluation — \(P(O | \lambda)\): probability of observation sequence given model. Solved by the Forward algorithm (or Backward): dynamic programming over time steps.

\[\alpha_t(j) = \left[ \sum_i \alpha_{t-1}(i) \, a_{ij} \right] b_j(o_t)\]
Decoding — most likely state sequence given observations. Solved by the Viterbi algorithm: like Forward but replaces summation with max.

\[\delta_t(j) = \max_i \left[ \delta_{t-1}(i) \, a_{ij} \right] b_j(o_t)\]

Backtrack through stored argmax pointers to recover the optimal path.
Learning — adjust model parameters \(\lambda = (A, B, \pi)\) to maximize \(P(O | \lambda)\). Solved by the Baum-Welch algorithm (special case of EM): iteratively re-estimates transition and emission probabilities using forward-backward statistics.

DTW aligns two temporal sequences that may vary in speed or duration.

Computes an optimal warping path through a cost matrix \(D(i, j)\):

\[D(i, j) = d(x_i, y_j) + \min\{D(i-1, j),\; D(i, j-1),\; D(i-1, j-1)\}\]

where \(d(x_i, y_j)\) is a local distance measure (e.g., Euclidean).
Constraints: monotonicity, continuity, boundary conditions (start and end must align).
Complexity: \(O(nm)\) for sequences of length \(n\) and \(m\).
Used in speech recognition (template matching), gesture recognition, and time-series classification.

DTW: deterministic alignment, no probabilistic model, good for template matching with few classes.
HMMs: probabilistic generative model, handles variability better, scales to large vocabularies.
Modern systems often use HMMs (or their neural successors like CTC/attention models) for large-scale recognition.