State Estimation

The Nature of Mathematical Modeling (draft)

$ \def\%#1{\mathbf{#1}} \def\mat#1{\mathbf{#1}} \def\*#1{\vec{#1}} \def\ve#1{\vec{#1}} \def\ds#1#2{\cfrac{d #1}{d #2}} \def\dd#1#2{\cfrac{d^2 #1}{d #2^2}} \def\ps#1#2{\cfrac{\partial #1}{\partial #2}} \def\pp#1#2{\cfrac{\p^2 #1}{\p #2^2}} \def\p{\partial} \def\ba{\begin{eqnarray*}} \def\ea{\end{eqnarray*}} \def\eps{\epsilon} \def\del{\nabla} \def\disp{\displaystyle} \def\la{\langle} \def\ra{\rangle} \def\unit#1{~({\rm #1)}} \def\units#1#2{~\left(\frac{\rm #1}{\rm #2}\right)} \def\argmax#1{\mathrel{\mathop{\mathrm {argmax}}\limits_{#1}}} $

Our study of estimating parameters from observations has presumed that there are unchanging parameters to be estimated. For many (if not most) applications this is not so: not only are the parameters varying, but finding their variation in time may be the goal of the data analysis. This chapter and the next bring time back into the picture. Here we will look at the problem of estimating a time-dependent set of parameters that describe the state of a system, given measurements of observable quantities along with some kind of model for the relationship between the observations and the underlying state. For example, in order to navigate, an airplane must know where it is. Many relevant signals arrive at the airplane, such as radar echoes, GPS messages, and gyroscopic measurements. The first task is to reduce these raw signals to position estimates, and then these estimates must be combined along with any relevant past information to provide the best overall estimate of the plane's position. Closely related tasks are smoothing, noise reduction, and signal separation, using the collected data set to provide the best estimate of previous states (given new measurements, where do we think the airplane was?), and prediction, using the data to forecast a future state (where is the airplane going?). These tasks are often described as filtering problems, even though they really are general algorithm questions, because they evolved from early implementations in analog filters.

We will start with the simple example of matched filters to detect a known signal, extend that to Wiener filters to separate a signal from noise, and then turn to the much more general, useful, and important Kalman filters. Much of estimation theory is based on linear techniques; since the world is not always obligingly linear, we will next look at how nonlinearity makes estimation more difficult, and simpler. The chapter closes with the use of Hidden Markov Models to help find models as well as states.

Matched Filters

Consider a signal $x(t)$ passed through a linear filter with impulse response $f(t)$ (go back to Chapter {ode} if you need a review of linear systems theory). The frequency domain response of the output $Y(\omega)$ will be the product of the Fourier transforms of the input and the filter $$ Y(\omega) = X(\omega) F(\omega) ~~~~, $$ and the time domain response will be the convolution $$ y(t) = x(t) * f(t) = \int_0^T x(t-u)f(u)~du ~~~~, $$ where the limits of the integral are the interval during which the signal has been applied to the filter. The magnitude of the output can be bounded by Schwarz's inequality: $$ \ba y^2(t) &=& \left| \int_0^T x(t-u)f(u)~du \right|^2 \\ &\leq& \int_0^T |x(t-u)|^2~du~\int_0^T|f(u)|^2~du ~~~~. \ea $$ By inspection, this bound will be saturated (reach its maximum value) if $$ f(u) = A~x^*(t-u) $$ for any constant $A$. The filter will produce the maximum output for a given input signal if the impulse response of the filter is proportional to the complex conjugate of the signal reversed in time. This is called a matched filter, and is used routinely to detect and time known signals. For example, to measure the arrival time of radar echoes, the output from a filter matched to the transmitted pulses goes to a comparator, and the time when the output exceeds a preset threshold is used to determine when a pulse has arrived.

Wiener Filters

Next, consider a time-invariant filter with impulse response $f(t)$ that receives an input $x(t)+\eta(t)$ and produces an output $y(t)$, with $x(t)$ a desired signal and $\eta(t)$ noise added to the signal (such as from the front end of an amplifier). In the time domain the output is the convolution $$ \tag{fit-filt} y(t) = \int_{-\infty}^{\infty} f(u) [ x(t-u) + \eta(t-u) ] ~ du ~~~~, $$ for now assuming that the signals are defined for all time. How should the filter be designed to make $y(t)$ as close as possible to $x(t)$? One way to do this is by minimizing the mean square error between them (in Chapter {fit-chap} we saw that this implicitly assumes Gaussian statistics, but is an assumption that is commonly and relatively reliably used more broadly). This problem was solved for a linear filter by Norbert Wiener at MIT's Radiation Laboratory in the 1940s, therefore the solution is called a Wiener filter.

The expected value of the error at time $t$ is $$ \la E^2 \ra = \la [x(t+\alpha) - y(t)]^2 \ra ~~~~, $$ where the average is over an ensemble of realizations of the noise process. An offset $\alpha$ has been added to cover the three cases of:

  • $\alpha < 0:$ {\em smoothing} the past
  • $\alpha = 0:$ {\em filtering} the present
  • $\alpha > 0:$ {\em predicting} the future

Substituting in equation (fit-filt), $$ \la E^2 \ra = \la x^2(t+\alpha) \ra - 2\int_{-\infty}^\infty f(u) \underbrace{\la x(t+\alpha) [x(t-u)+\eta(t-u)] \ra}_{ \disp \equiv C_{x,x+\eta}(\alpha+u)} ~du $$

$$ + \int_{-\infty}^{\infty} \int_{-\infty}^\infty f(u) f(v) \underbrace{\la [x(t-u)+\eta(t-u)][x(t-v)+\eta(t-v)] \ra}_{ \disp \equiv C_{x+\eta,x+\eta}(u-v)} ~du~dv ~~. $$

We must find the $f(t)$ that minimizes the sum of these integrals over the correlation functions. Since the first term does not depend on the filter function $f(t)$ it can't contribute to the minimization and we will drop it. Because of the double integral we can't use the Euler-Lagrange equation derived in Chapter {vary}, but we can use a similar argument. Assume that $f(t)$ is the optimal filter that we are looking for, and let $g(t)$ be any arbitrary filter added to it, giving a new filter $f(t)+\eps g(t)$. In terms of this the new error is $$ \ba \la E^2 \ra &=& \int_{-\infty}^\infty \int_{-\infty}^\infty [f(u)+\eps g(u)][f(v)+\eps g(v)] C_{x+\eta,x+\eta}(u-v)~du~dv \\ & & -2\int_{-\infty}^\infty [f(u)+\eps g(u)] C_{x,x+\eta}(\alpha+u)~du~~~~. \ea $$ We can now differentiate with respect to $\eps$ and look for the minimum at $\eps=0$: $$ \ba \left. \ps{\la E^2 \ra}{\eps}\right|_{\eps=0} &=& 0 \\ &=& \int_{-\infty}^\infty \int_{-\infty}^\infty g(u) f(v) C_{x+\eta,x+\eta}(u-v)~du~dv \\ & & + \int_{-\infty}^\infty \int_{-\infty}^\infty f(u) g(v) C_{x+\eta,x+\eta}(u-v)~du~dv \\ & & + 2 \eps \int_{-\infty}^\infty \int_{-\infty}^\infty g(u) g(v) C_{x+\eta,x+\eta}(u-v)~du~dv \\ & & - 2 \int_{-\infty}^\infty g(u) C_{x,x+\eta}(\alpha+u)~du ~~~~. \ea $$ The first two terms are the same (interchanging dummy integration variables and using the symmetry of the correlation functions), and the third one vanishes at $\eps=0$, so we're left with $$ \int{-\infty}^\infty g(\tau) \left[ - C{x,x+\eta}(\alpha+\tau)

  + \int_{-\infty}^\infty f(u) C_{x+\eta,x+\eta}(u-\tau)\right] ~du~d\tau = 0
$$ Since $g(\tau)$ is arbitrary, the only way this can be equal to zero for all choices of $g$ is if the term in brackets vanishes $$

\tag{est-hopf} \int{-\infty}^\infty f(u) C{x+\eta,x+\eta}(u-\tau) ~du = C_{x,x+\eta}(\alpha+\tau) ~~~~. $$

This is now a simpler integral equation to be solved for $f$, but there is a crucial subtlety in equation (est-hopf). If we solve it for all $\tau$, positive and negative, then it will require that the filter function $f$ be defined for both positive and negative times. This is a noncausal filter. The only way that the filter can have access to the signal at all times (other than by being psychic) is if it is applied after the full time record has been recorded, which is fine for off-line applications. If we don't mind a noncausal filter, the convolution in equation (est-hopf) is easily solved by taking (two-sided) Laplace transforms $$ F(s) C_{x+\eta,x+\eta}(s) = C_{x,x+\eta}(s) e^{\alpha s} $$ and so $$ F(s) = \frac{ C_{x,x+\eta}(s) e^{\alpha s} } { C_{x+\eta,x+\eta}(s) } ~~~~. $$ This has a simple interpretation: the Wiener filter rolls the response off when the signal is much smaller than the noise, and sets the gain to unity if the signal is much larger than the noise. Prediction or smoothing is done simply by the complex phase shift of a linear system.

If a causal filter is needed so that the Wiener filter can be used in real-time then $f(\tau)$ must vanish for negative $\tau$, and equation ({est-hopf}) must be solved only for $\tau \ge 0$. In this case it is called the Wiener--Hopf equation, and is much more difficult to solve, although there are many special techniques available for doing so because it is so important {Brown:92}. Beyond Wiener filters, signal separation for nonlinear systems requires the more general time series techniques to be introduced in Chapter {time}.

Kalman Filters

Wiener filters are impressively optimal, but practically not very useful. It is important to remember that everything is optimal with respect to something. In the case of Wiener filters we found the "best" linear time-invariant filter, but by design it is therefore linear time-invariant. The result is an almost trivial kind of signal separation, simply cutting off the response where the signal is small compared to the noise. Furthermore, it does not easily generalize to more complex problems with multiple degrees of freedom.

Screenshot from 2026-05-03 20-56-32.png

Figure: Update of an internal state and an external observable.

Now consider the general system shown in Figure {fittrans}. There is an internal state $\vec x$, for example, the position, velocity, and acceleration of an airplane as well as the orientations of the control surfaces. Its state is updated in discrete time according to a distribution function $p(\vec x_{t+1}|\vec x_t)$, which includes both the deterministic and random influences. For the airplane, the deterministic part is the aerodynamics, and the random part includes factors such as turbulence and control errors. The internal state is not directly accessible, but rather must be inferred from measurements of observables $\vec y$ (such as the airplane's pitot tube, GPS receiver, and radar returns), which are related to $\vec x$ by a relation $p(\vec y|\vec x)$ that can include a random component due to errors in the measurement process. How should the measurements be combined to estimate the system's state? Further, is it possible to iteratively update the state estimate given new measurements without recalculating it from scratch? Kalman filters provide a general solution to this important problem.

To start, let's assume that there are just two random variables, $x$ and $y$, that have a joint probability distribution $p(x,y)$. Given a measurement of $y$, what function $\hat x(y)$ should we use to estimate $x$? Once again, we will do this by picking the estimate that minimizes the mean square error over the distribution. This means that we want to minimize $$ \ba \la [x - \hat x(y)]^2 \ra &=& \int\int [x - \hat x(y)]^2~p(x,y)~dx~dy \\ &=& \int\int [x^2 - 2 x \hat x(y) + \hat x^2(y)] \underbrace{p(x,y)}_{\disp p(x|y) p(y)} dx~dy \\ &=& \int x^2 \underbrace{\int p(x,y)~dy}_{\disp p(x)}~dx - 2\int\hat x(y) \underbrace{\int x~p(x|y)dx}_ {\disp \equiv \la x|y \ra} ~p(y)~dy \\ & & + \int \hat x^2(y) \underbrace{\int p(x,y)~dx}_{\disp p(y)} ~dy \\ &=& \int x^2~p(x)~dx + \int [\hat x^2(y) - 2 \hat x(y) \la x|y \ra ]~p(y)~dy \\ &=& \int x^2~p(x)~dx \\ & & + \int [\hat x^2(y) - 2 \hat x(y) \la x|y \ra + \la x|y \ra^2 - \la x|y \ra^2 ]~p(y)~dy \\ &=& \int x^2~p(x)~dx \\\ & & - \int \la x|y \ra^2 ~p(y)~dy + \int[ \hat x(y)-\la x|y\ra]^2 ~p(y)~dy ~. \ea $$ All integrals are over the limits of the distribution, and in the last line we completed the square. The first two terms don't depend on the unknown estimator $\hat x(y)$, and so are irrelevant to the minimization. The last term is the product of two non-negative functions, which will be minimized if the left hand one vanishes: $$ \hat x(y) = \la x|y \ra = \int x ~p(x|y)~dx ~~~~. $$ In retrospect, this is perhaps an obvious result: the minimum mean square estimator simply is the expected value. This result easily generalizes to multi-dimensional distributions.

Now let's assume that the system's update rule is linear, with additive noise $\vec\eta$ $$ \vec x_t = \mat{A}_{t-1}\cdot\vec x_{t-1} + \vec\eta_t $$ (we will later relax the assumption of linear updates), and assume a linear relationship between the state and the observable with additive noise $\vec\eps$ $$ \vec y_t = \mat{B}_t\cdot\vec x_t + \vec\eps_t ~~~~. $$ The noise sources are assumed to be uncorrelated in time, but can have correlations among the components, as measured by the noise covariance matrices $\mat{N}^x$ and $\mat{N}^y$ $$ \mat{N}^x =\la \vec\eta \vec\eta^T \ra ~~~~ \la \eta_i(t) \eta_j(t') \ra = N_{ij}^x \delta_{tt'} $$ $$ \mat{N}^y =\la \vec\eps \vec\eps^T \ra ~~~~ \la \eps_i(t) \eps_j(t') \ra = N_{ij}^y \delta_{tt'} $$ (where as usual $\ve{\eps}^T$ is the transpose of $\ve{\eps}\hspace{2.5pt}$). The two noise sources are taken to be uncorrelated with each other $$ \la \vec\eta \vec\eps^T \ra = \vec 0 $$ and to have zero mean $$ \la \vec\eta \ra = \la \vec\eps \ra = \vec 0 ~~~~. $$

The elements of Kalman filtering are shown in Figure {fitkalmn}. $\vec x_t$ is the true (but inaccessible) state of the system at time $t$, $\vec y_t$ the observable, and $\mat{E}_t$ is the covariance matrix of the error in the estimate of $\ve{x}$. The notation $\vec x_{n|m}$ represents the best estimate for $\vec x_n$ given the record of measurements up to time $m$ $$ \vec x_{n|m} = \la \vec x_n | \vec x_m, \vec x_{m-1}, \ldots \ra = \int \vec x_n~p(\vec x_n | \vec x_m, \vec x_{m-1}, \ldots ) ~ d\vec x_n ~~~~. $$ The first step in Kalman filtering is to use the best estimate of the previous system state, $\vec x_{t-1|t-1}$, to predict the new state $\vec x_{t|t-1}$. This is then used to predict the observable $\vec y_{t|t-1}$. Then, when the true new observable $\vec y_t$ is measured, it and the estimate $\vec y_{t|t-1}$ are combined to estimate the new internal state $\vec x_{t|t}$. There are two very important and perhaps nonobvious elements of this figure. First, the state estimate updates are done on just the previous state, without needing the full record, but (given the assumptions of the model) this provides just as good an estimate. Second, this estimate will be much better than if the new observable alone was used to estimate the internal state. Kalman filtering is an example of {\em recursive estimation}: to determine the present estimate it is necessary to know the previous one, which in turn depends on the one before that, and so forth back to the initial conditions.

Screenshot from 2026-05-03 20-57-07.png

Figure: Steps in Kalman filtering.

To do the first prediction step, recall that if two variables $a$ and $b$ with probabilities $p_a(a)$ and $p_b(b)$ are added, then the distribution for their sum $c=a+b$ is the convolution $$ p(c) = \int p_b(b) p_a(c-b) ~db ~~~~. $$ Since $\vec x_t$ depends only on the previous value $\vec x_{t-1}$ plus the noise term, the expected value will depend only on the previous expected value $\vec x_{t-1|t-1}$: $$ \vec x_{t|t-1} = \int \vec x_t~p(\vec x_t|\vec x_{t-1})~d\vec x_t ~~~~. $$ The conditional distribution $p(\vec x_t|\vec x_{t-1})$ consists of the deterministic distribution $\delta(\vec x_{t} - \mat{A}_t\cdot\vec x_{t-1})$ convolved by the (zero mean) noise distribution $p_\eta$, so $$ \vec x_{t|t-1} = \int \vec xt~p\eta (\vec x_t

  - \mat{A}_{t-1} \cdot \vec x_{t-1}) ~d\vec x_t ~~~~,
$$ and since the noise distribution is zero mean $$

\vec x{t|t-1} = \mat{A}{t-1}\cdot\vec x{t-1|t-1} ~~~~. $$ Similarly, $$ \vec y{t|t-1} = \mat{B}t\cdot\vec x{t|t-1} ~~~~. $$

This gives us the estimates for the new internal state and observable. To update the internal state estimate, these can be linearly combined with the new observation $\vec y_t$ $$ \vec x_{t|t} = \vec x_{t|t-1} + \mat{K}_t\cdot(\vec y_t-\vec y_{t|t-1}) ~~~~. $$ The matrix $\mat{K}_t$ is called the Kalman gain matrix; we will derive the optimal form for it. Given this estimate we can define the error covariance matrix in terms of the (inaccessible) true state $\vec x_t$ by $$ \mat{E}_{t|t} = \la (\vec x_t-\vec x_{t|t}) (\vec x_t-\vec x_{t|t})^T \ra ~~~~. $$ The difference between the true state and the estimate is $$ \vec x_t - \vec x_{t|t} = \vec x_t - \vec x_{t|t-1} - \mat{K}_t\cdot (\vec y_t - \vec y_{t|t-1}) ~~~~, $$ and the difference between the predicted and the true observation is $$ \tag{est-obs} y_t - y_{t|t-1} = \mat{B}_t\cdot\vec x_t + \vec\eps_t - \mat{B}_t\cdot\vec x_{t|t-1} ~~~~. $$ Combining these, $$ \ba \vec x_t - \vec x_{t|t} &=& \vec x_t - \vec x_{t|t-1} - \mat{K}_t \mat{B}_t \cdot (\vec x_t - \vec x_{t|t-1}) - \mat{K}_t\cdot\vec\eps_t \\ &=& (\mat{I} - \mat{K}_t\mat{B}_t) \cdot (\vec x_t - \vec x_{t|t-1}) - \mat{K}_t\cdot\vec\eps_t ~~~~. \ea $$ Therefore the error matrix is updated by $$ \ba \tag{est-error} \mat{E}_{t|t} &=& (\mat{I} - \mat{K}_t \mat{B}_t) ~ \la (\vec x_t - \vec x_{t|t-1}) (\vec x_t - \vec x_{t|t-1})^T \ra ~ (\mat{I} - \mat{K}_t \mat{B}_t)^T \\ & & +~\mat{K}_t \la \vec\eps_t \vec\eps_t^T \ra \mat{K}_t^T \\ &=& (\mat{I} - \mat{K}_t \mat{B}_t) ~ \mat{E}_{t|t-1} ~ (\mat{I} - \mat{K}_t \mat{B}_t)^T + \mat{K}_t \mat{N}_t^y \mat{K}_t^T \ea $$ (there are no cross terms because the measurement noise $\vec\eps_t$ is independent of the state estimation error $\vec x_t - \vec x_{t|t-1})$. The diagonal terms of the error covariance matrix are the state errors; we want to choose the Kalman gain matrix $\mat{K}$ to minimize the sum of the diagonal terms of the matrix, i.e., minimize the trace $$ {\rm Tr}(\mat{E}_{t|t}) = \left\langle|\vec x_t - \vec x_{t|t}|^2\right\rangle ~~~~. $$

To do this minimization, we will use two matrix identities $$ \frac{d~{\rm Tr}(\mat{A}\mat{B})}{d\mat{A}} = \mat{B}^T ~~~~ ({\rm if~} \mat{A}\mat{B} {\rm ~is~square}) $$ and $$ \frac{d~{\rm Tr}(\mat{A}\mat{C}\mat{A}^T)}{d\mat{A}} = 2\mat{A}\mat{C} ~~~~ ({\rm if~} \mat{C} {\rm ~is~symmetric}) ~~~~, $$ where $$ \left(\frac{df}{d\mat{A}}\right)_{ij} \equiv \frac{df}{dA_{ij}} $$ (these can be proved by writing out the components). Equation (est-error) can be expanded out as $$ \ba \tag{est-errorexp} \mat{E}{t|t} &=& \mat{E}{t|t-1} - \mat{K}_t\mat{B}t\mat{E}{t|t-1}

   - \mat{E}_{t|t-1}\mat{B}_t^T\mat{K}_t^T \\

& & + \mat{K}_t (\mat{B}t\mat{E}{t|t-1}\mat{B}_t^T + \mat{N}^y_t)\mat{K}t^T \ea $$ (recalling that $(\mat{A}\mat{B})^T = \mat{B}^T\mat{A}^T$). Applying the two matrix identities to take the derivative of the trace of this equation with respect to $\mat{K}$, and using the fact that the trace is unchanged by taking the transpose ${\rm Tr} (\mat{E}{t|t-1} \mat{B}_t^T \mat{K}t^T)$ $= {\rm Tr} ([\mat{E}{t|t-1} \mat{B}_t^T \mat{K}_t^T]^T)$ $={\rm Tr} (\mat{K}_t \mat{B}t \mat{E}{t|t-1})$, gives $$ \frac{d~{\rm Tr}(\mat{E}_{t|t})}{d~\mat{K}_t} = -2(\mat{B}_t\mat{P}_{t|t-1})^T + 2\mat{K}_t(\mat{B}_t\mat{E}_{t|t-1} \mat{B}_t^T + \mat{N}_t^y) = 0 ~~~~. $$ This equation defines the Kalman gain matrix that makes the error extremal; checking the second derivative shows that this is a minimum. Solving for the optimal gain matrix, $$ \mat{K}_t = \mat{E}_{t|t-1} \mat{B}_t^T \left(\mat{B}_t \mat{E}_{t|t-1} \mat{B}_t^T + \mat{N}_t^y\right)^{-1} ~~~~. $$ Substituting the gain matrix back into equation (\ref{est-errorexp}), the third and fourth terms cancel, leaving $$ \mat{E}_{t|t} = \mat{E}_{t|t-1} - \mat{E}_{t|t-1}\mat{B}_t^T (\mat{B}_t\mat{E}_{t|t-1}\mat{B}_t^T + \mat{N}^y_t)^{-1} \mat{B}_t\mat{E}_{t|t-1} $$ or $$ \mat{E}_{t|t} = (\mat{I} - \mat{K}_t\mat{B}_t) \mat{E}_{t|t-1} ~~~~. $$ This gives the update rule for the error matrix given a new measurement of the observable.

The last piece that we need is the predicted error after the state prediction step $\vec x_{t+1|t} = \mat{A}_t\cdot\vec x_{t|t}$, which will be $$ \ba \tag{est-errorpred} \mat{E}_{t+1|t} &=& \la (\vec x_{t+1} - \vec x_{t+1|t}) (\vec x_{t+1} - \vec x_{t+1|t})^T \ra \\ &=& \la (\mat{A}_t\cdot\vec x_t + \vec\eta_t - \mat{A}_t\cdot\vec x_{t|t}) (\mat{A}_t\cdot\vec x_t+\vec\eta_t-\mat{A}_t\cdot\vec x_{t|t})^T\ra\\ &=& \la (\mat{A}_t \cdot (\vec x_t - \vec x_{t|t}) + \vec\eta_t) (\mat{A}_t \cdot (\vec x_t - \vec x_{t|t}) + \vec\eta_t)^T \ra \\ &=& \la \mat{A}_t \cdot (\vec x_t - \vec x_{t|t})(\vec x_t - \vec x_{t|t})^T \cdot \mat{A}^T \ra + \la \vec\eta_t \vec\eta_t^T \ra \\ &=& \mat{A}_t \mat{E}_{t|t} \mat{A}_t^T + \mat{N}^x_t ~~~~. \ea $$

This completes the derivation of the Kalman filter, the linear estimator with the minimum square error. Recapping, the procedure starts with an initial estimate for the state $\vec x_{t|t-1}$ and error $\mat{E}_{t|t-1}$, and then the sequence is:

  • Estimate the new observable
    $$
       \vec y_{t|t-1} = \mat{B}_t\cdot\vec x_{t|t-1} ~~~~.
    $$
  • Measure a new value for the observable
    $$
       \ve{y}_t ~~~~.
    $$
  • Compute the Kalman gain matrix
    $$
       \mat{K}_t = \mat{E}_{t|t-1} \mat{B}_t^T \left(\mat{B}_t\mat{E}_{t|t-1}
       \mat{B}_t^T + \mat{N}_t^y\right)^{-1} ~~~~.
    $$
  • Estimate the new state
    $$
       \vec x_{t|t}=\vec x_{t|t-1}+\mat{K}_t\cdot(\vec y_t-\vec y_{t|t-1}) ~~~~.
    $$
  • Update the error matrix
    $$
       \mat{E}_{t|t} = (\mat{I} - \mat{K}_t\mat{B}_t) ~\mat{E}_{t|t-1} ~~~~.
    $$
  • Predict the new state
    $$
       \vec x_{t+1|t} = \mat{A}_{t}\cdot\vec x_{t|t} ~~~~.
    $$
  • Predict the new error
    $$
       \mat{E}_{t+1|t}=\mat{A}_t \mat{E}_{t|t} \mat{A}_t^T + \mat{N}_t^x ~~~~.
    $$
    
    Remarkably, the iterative application of this recursive algorithm gives the best estimate of $\ve{x}(t)$ from the history of $\ve{y}(t)$ that can be made by a linear estimator; it cannot be improved by analyzing the entire data set off-line {Catlin:89}.

Stepping back from the details of the derivation, these equations have very natural limits. If $\mat{B}\rightarrow 0$ (the observable $\ve{y}$ does not depend on the internal state $\ve{x}$) or $\mat{N}^y\rightarrow\infty$ (the observable is dominated by measurement noise) then $\mat{K}\rightarrow 0$ and the measurements are not used in the state estimate. Conversely, if $\mat{N}^y\rightarrow 0$ and $\mat{B}\rightarrow\mat{I}$ (there is no noise in the observable, and the transformation from the internal state reduces to the identity matrix) then the update replaces the internal state with the new measurement. Problem \ref{zkf} looks at the suggestive form of these equations for the case of small measurement noise.

Nonlinearity and Entrainment

The derivation of the Kalman filter has assumed linearity in two places: linear observables and dynamics, and linear updates of the state following a new measurement. The former can be relaxed by local linearization; we'll return to the latter in the next chapter.

The nonlinear governing equations now are $$ \vec x_t = \vec f(\vec x_{t-1}) + \vec\eta_t ~~~~~~~~ \vec y_t = \vec g(\vec x_t) + \vec\eps_t ~~~~. $$ The system governing equation is needed to predict the new state $\vec x_t = \vec f(\vec x_{t-1})$, and to predict the new error $$ \ba \mat{E}_{t+1|t} &=& \la (\vec x_{t+1} - \vec x_{t+1|t}) (\vec x_{t+1} - \vec x_{t+1|t})^T \ra \\ &=& \la [\vec f(\vec x_t) + \vec\eta_t - \vec f(\vec x_{t|t})] [\vec f(\vec x_t)+\vec\eta_t-\vec f(x_{t|t})]^T\ra ~~~. \ea $$ If the prediction error is not large (\ie the noise $\vec\eta$ is small), then $\vec f$ can be replaced by its local linearization $$ \ba \vec f(\vec x_t) - \vec f(\vec x_{t|t}) &\approx& \left. \ps{\vec f}{\vec x}\right|_{\disp \vec x_{t|t}} \cdot (\vec x_t - \vec x_{t|t}) \\ &\equiv& \mat{A}_t \cdot (\vec x_t - \vec x_{t|t}) ~~~~. \ea $$ With this revised definition for $\mat{A}$ then equation ({est-errorpred}) can be used as before. Similarly, the observable equation appears in the derivation of the Kalman gain matrix as $$ \ba \vec y_t - \vec y_{t|t-1} &=& \vec g(\vec x_t) + \vec\eps_t - \vec g(\vec x_{t|t-1}) \\ &\approx& \left. \ps{\vec g}{\vec x}\right|_{\disp \vec x_{t|t-1}} \cdot (\vec x_t - \vec x_{t|t-1}) + \vec\eps_t \\ &\equiv& \mat{B}_t \cdot (\vec x_t - \vec x_{t|t-1}) + \vec\eps_t ~~~~. \ea $$ Once again, by taking $\mat{B}$ to be the local linearization this is the same as equation (est-obs). Redefining the Kalman filter to use local linearizations of nonlinear observables and dynamics in the gain and error calculations gives the extended Kalman filter (the nonlinear functions can be retained in the the state and observable predictions). As with most things nonlinear it is no longer possible to prove the same kind of optimality results about an extended Kalman filter, a liability that is more than made up for by its broader applicability.

The magic of Kalman filtering happens in the step $$ \tag{coupleq} \vec x_{t|t}=\vec x_{t|t-1}+\mat{K}_t\cdot(\vec y_t-\vec y_{t|t-1}) ~~~~. $$ A correction is added to the internal state based on the difference between what you predicted and what you observed, scaled by how much you trust your predictions versus the observations. Officially, to be able to apply this you must know enough about the system to be able to calculate the noise covariances in both the dynamics and the measurements. In practice this is often not the case, particularly since the ``noise'' represents all aspects of the system not covered by the model. Then the noise terms become adjustable parameters that are selected to give satisfactory performance (Problem {est-prob} provides an example of this trade-off).

The success of Kalman filtering even when it is not formally justified hints at the power of equation (coupleq). Many nonlinear systems share the property that a small interaction with an independent copy of the system can cause their states to become synchronized. This process is called entrainment. For example, let $d\ve{x}/dt = \ve{f}(\ve{x})$, and take $d\ve{x}'/dt = \ve{f}(\ve{x}')$ to obey the same governing equation but have different initial conditions. Then if we couple one of the degrees of the freedom of the two systems with a linear correction that seeks to drive those variables to the same value, $$ \ds{x_i}{t} = f_i(x_i) + \eps(x_i'-x_i) ~~~~, $$ then for most choices of $f$, $\eps$, and $i$, $\ve{x}$ will approach $\ve{x}'$ as long as $x_i$ interacts with the other components of $\ve{x}$ and there is dissipation to damp out errors. Because dissipation reduces the dimension of the subspace of a system's configuration space that it actually uses {Temam:88}, it's needed to separate the tugs from the coupling between the systems from the internal evolution of the system. $\eps$ is a small parameter that controls the trade-off between responding quickly and ignoring noise.

Entrainment requires that the largest Lyapunov exponent associated with the coupling between the systems is negative {Pecora:97}; this does not even require the systems to be identical {Parlitz:97}. A formerly-familiar example is provided by mechanical clocks on the wall of a clock shop; the vibrations coupled through the wall could entrain the clock mechanisms so that they would tick in synchrony.

Entrainment can be used to design systems whose simple dynamics replaces complicated algorithms for the job of state estimation. An example is spread spectrum acquisition, a very important task in engineering practice. A transmitter that seeks to make optimal use of a communications channel uses a linear feedback shift register (LFSR, Chapter stoch) to generate ideal pseudo-random noise as a modulation source. To decode the message, the receiver must maintain a copy of the transmitter's shift register that remains faithful even if there is noise in the transmission or the two system's clocks drift apart. This is conventionally done by a coding search for the best setting of the receiver's shift register {Simon:94}.

An LFSR uses a binary recursion $$ x_{n} = \sum_{i=1}^N a_i x_{n-i} ~~~~ (\rm{mod}~2) ~~~~, $$ with the $a_i$'s chosen to make the $z$-transform irreducible. It's possible to add small perturbations to this discrete function if the LFSR is replaced by an analog feedback shift register (AFSR) {Gershenfeld:95}, $$ x_n = \frac{1}{2} \left[ 1-\cos\left( \pi \sum_{i=1}^N a_i x_{n-i} \right)\right]~~~~. $$ This analog function matches the value of the LFSR for binary arguments. Because the magnitude of the slope of the map is less than 1 at the digital values, these are stable fixed points \cite{Guckenheimer:83} that attract an arbitrary initial condition of the register onto the LFSR sequence. If we now add to this a small correction $$ x_n = \frac{1}{2} \left[ 1-\cos\left( \pi \sum_{i=1}^N a_i x_{n-i} \right)\right] + \eps(x_n'-x_n) ~~~~, $$ where $x_n'$ is a signal received from another shift register, then the two systems can entrain. This is shown in Figure {afsr}. If $\eps$ is large the receiver locks quickly but it will also try to follow any modulation and noise in the signal; if $\eps$ is small it will take longer to lock but will result in a cleaner estimate of the transmitter's state.

Screenshot from 2026-05-03 20-57-34.png

Figure: Entrainment of an analog feedback shift register ($\circ$) with a linear feedback shift register ($+$).

Hidden Markov Models

The job of a Kalman filter is to provide an estimate of the internal state given a history of measurements of an external observable. It presumes, however, that you already know how to calculate the transition probabilities, and further that you're not interested in the probability distribution for the internal state. A Hidden Markov Model (HMM) addresses these limitations.

For example, consider coin flipping done by a corrupt referee who has two coins, one biased and the other fair, with the biased coin occasionally switched in surreptitiously. The observable is whether the outcome is heads or tails; the hidden variable is which coin is being used. Figure~\ref{markhide} shows this situation. There are transition probabilities between the hidden states $A$ and $B$, and emission probabilities associated with the observables 0 and 1. Given this architecture, and a set of measurements of the observable, our task is to deduce both the fixed transition probabilities and the changing probabilities to see the internal states. This is a discrete HMM; it's also possible to use HMMs with continuous time models {Rabiner:89}.

Screenshot from 2026-05-03 20-57-59.png

Figure: A Hidden Markov Model.

Just as with the relationship between AR and MA models (Section {arma}), an HMM can be approximated by an ordinary Markov model, but the latter might require an enormous order to capture the behavior of the former because the rules for dynamics in the present can depend on a change of models that occurred a long time ago.

We'll assume that the internal state $x$ can take on $N$ discrete values, for convenience taken to be $x=\{1,\ldots,N\}$. Call $x_t$ the internal state at time $t$, and let $\{y_1,\ldots,y_T\}$ be a set of measurements of the observable. An HMM is specified by three sets of probabilities: $p(x_{t+1}|x_t)$, the internal transitions, $p(y_t|x_t)$, the emission of an observable given the internal state, and $p(x_1)$, the initial distribution of the internal states.

The key quantity to estimate is $p(x_t,x_{t+1},y_1,\ldots,y_T)$, the probability to see a pair of internal states along with the observations. From this we can find the probability to see a transition given the observations, $$ \ba p(x_{t+1}|x_t,y_1,\ldots,y_T) &=& \frac{p(x_t,x_{t+1},y_1,\ldots,y_T)}{p(x_t,y_1,\ldots,y_T)} \\ &=& \frac{p(x_t,x_{t+1},y_1,\ldots,y_T)} {\sum_{x_{t+1}=1}^N p(x_t,x_{t+1},y_1,\ldots,y_T)} ~~~, \ea $$ and then the absolute transition probability can be estimated by averaging over the record $$ \tag{hmmestx} p(x_{t+1}=j|x_t=i) \approx \frac{1}{T} \sum_{t'=1}^T p(x_{t'+1}=j|x_{t'}=i,y_1,\ldots,y_T) ~~~~. $$ Similarly, the probability of seeing an internal state is $$ \ba p(x_t|y_1,\ldots,y_T) &=& \frac{p(x_t,y_1,\ldots,y_T)}{p(y_1,\ldots,y_T)} \\ &=& \frac{\sum_{x_{t+1}=1}^N p(x_t,x_{t+1},y_1,\ldots,y_T)} {\sum_{x_t=1}^N \sum_{x_{t+1}=1}^N p(x_t,x_{t+1},y_1,\ldots,y_T)} ~~~, \ea $$ which can be used to estimate the observable probability by another sum over the data $$ \tag{hmmesty} p(y_t=j|x_t=i) \approx \frac{\sum_{t'|y_{t'}=j} p(x_{t'}=i|y_1,\ldots,y_T)} {\sum_{t'=1}^T p(x_{t'}=i|y_1,\ldots,y_T)} ~~~~, $$ as well as the absolute probability of an internal state $$ p(x=i) \approx \frac{1}{T} \sum_{t=1}^T p(x_t=i|y_1,\ldots,y_T) ~~~~. $$

There is a problem lurking in the estimation of these quantities. Consider the probability of the model to produce the observations $p(y_1, \ldots, y_T)$. Since we don't know the sequence of internal states we have to sum over all of the possibilities $$ p(y_1,\ldots,y_T) = \sum_{x_1=1}^N \cdots \sum_{x_T=1}^N p(x_1,\ldots,x_T,y_1,\ldots,y_T) ~~~~. $$ This is a set of $T$ sums over $N$ terms, requiring $N^T$ operations. That's a a big number! A model with 10 internal states and an observed sequence of 100 points requires adding $10^{100}$ terms, which is larger than the number of atoms in the universe ($\sim$$10^{70}$). The problem may be seen in Figure {marktrel}. The observed outputs are written across the top, with the possible internal states under them. The exponential explosion comes in the number of different paths through this trellis.

Screenshot from 2026-05-03 20-58-16.png

Figure: Hidden Markov Model trellis.

The trellis also points to a solution: each column depends only on the previous column, and so we are doing far too much work by recalculating each column over and over for every path that passes through it. Let's start with the last step. Notice that it can be written as a sum over the internal states, $$ \ba p(y_1,\ldots,y_T) &=& \sum_{x_T=1}^N p(x_T,y_1,\ldots,y_T) \\ &=& \sum_{x_T=1}^N p(y_T|x_T,y_1,\ldots,y_{T-1}) ~p(x_T,y_1,\ldots,y_{T-1}) ~~~~. \ea $$ Because the output probability depends only on the internal state this can be simplified to $$ p(y_1,\ldots,y_T) = \sum_{x_T=1}^N p(y_T|x_T) ~ p(x_T,y_1,\ldots,y_{T-1}) ~~~~. $$ Factored again over the previous step, $$ \ba p(y_1,\ldots,y_T) &=& \sum_{x_T=1}^N p(y_T|x_T) \sum_{x_{T-1}=1}^N p(x_T,x_{T-1},y_1,\ldots,y_{T-1}) \\ &=& \sum_{x_T=1}^N p(y_T|x_T) \sum_{x_{T-1}=1}^N p(x_T|x_{T-1},y_1,\ldots,y_{T-1}) ~p(x_{T-1},y_1,\ldots,y_{T-1}) \\ &=& \sum_{x_T=1}^N p(y_T|x_T) \sum_{x_{T-1}=1}^N p(x_T|x_{T-1})~p(x_{T-1},y_1,\ldots,y_{T-1}) ~~~~, \ea $$ dropping the dependence of the internal transition probability on anything but the previous state. Continuing in this fashion back to the beginning we find that $$ \ba p(y_1,\ldots,y_T) &=& \sum_{x_T=1}^N p(y_T|x_T) \sum_{x_{T-1}=1}^N p(x_T|x_{T-1}) ~p(y_{T-1}|x_{T-1}) \\ &\cdots& \sum_{x_2=1}^N p(x_3|x_2)~p(y_2|x_2) \sum_{x_1=1}^N p(x_2|x_1)~p(y_1|x_1)~p(x_1) ~~~~. \ea $$ The $x_1$ sum has $N$ terms and must be done for all values of $x_2$, a total of $N^2$ operations. Since there are $T$ of these, the cost of the calculation drops to ${\cal O}(N^2T)$ -- quite a saving over $N^T$. As in so many other areas, a hard problem becomes easy if it is written recursively. For an HMM this is called the \trm{forward algorithm}{Forward algorithm}.

The same idea works in reverse. Start with the probability to see a sequence of observables given a starting initial state, and factor it over the first step: $$ \ba p(y_t,\ldots,y_T|x_t) &=& \sum_{x_{t+1}=1}^N p(x_{t+1},y_t,\ldots,y_T|x_t) \\ &=& \sum_{x_{t+1}=1}^N p(y_t|x_t,x_{t+1},y_{t+1},\ldots,y_T) ~p(x_{t+1},y_{t+1},\ldots,y_T|x_t) \\ &=& p(y_t|x_t) \sum_{x_{t+1}=1}^N p(y_{t+1},\ldots,y_T|x_t,x_{t+1}) ~ p(x_{t+1}|x_t) \\ &=& p(y_t|x_t) \sum_{x_{t+1}=1}^N p(x_{t+1}|x_t) ~p(y_{t+1},\ldots,y_T|x_{t+1}) ~~~. \ea $$ Continuing on to the end, $$ \ba & & p(y_t,\ldots,y_T|x_t) ~=~ p(y_t|x_t) \sum_{x_{t+1}=1}^N p(x_{t+1}|x_t)~p(y_{t+1}|x_{t+1}) \\ & & \times \sum_{x_{t+2}=1}^N p(x_{t+2}|x_{t+1}) ~ p(y_{t+2}|x_{t+2}) ~\cdots\sum_{x_{T-1}=1}^N p(x_{T-1}|x_{T-2})~p(y_{T-1}|x_{T-1}) \\ & & \times \sum_{x_T=1}^N p(x_T|x_{T-1})~p(y_T|x_T) ~~~~. \ea $$ This is called (can you guess?) the backwards algorithm.

Now return to the probability to see a pair of internal states and the observations. This can be factored as $$ \ba p(x_t,x_{t+1},y_1,\ldots,y_T) &=& p(x_t,y_1,\ldots,y_t)~p(x_{t+1}|x_t,y_1,\ldots,y_t) \\ & & p(y_{t+1},\ldots,y_T|x_t,x_{t+1},y_t,\ldots,y_T) ~, \ea $$ or dropping irrelevant variables, $$ p(x_t,x_{t+1},y_1,\ldots,y_T) = p(x_t,y_1,\ldots,y_t)~p(x_{t+1}|x_t) ~p(y_{t+1},\ldots,y_T|x_{t+1}) ~~. $$ There are three factors on the right. The first is what we find from the forward algorithm, the middle one is the transition probability specified by the HMM, and the last is the result of the backward algorithm. Therefore this quantity can be calculated for all points by a linear-time pass through the data.

Once that's been done the resulting distributions can be used to update the transition probabilities according to equations (hmmestx) and (hmmesty). This procedure can then be iterated, first using the transition probabilities and the observables to update the estimate of the internal probabilities, then using the internal probabilities to find new values of the transition probabilities. Going back and forth between finding probabilities given parameters and finding the most likely parameters given probabilities is just the Expectation-Maximization (EM) algorithm that we saw in Section {EM}, which in the context of HMMs is called the Baum--Welch algorithm. It finds the maximum likelihood parameters starting from initial guesses for them. For a model with continuous parameters the M step becomes a maximization with respect to the parameterized distribution of internal states.

The combination of the forward-backward algorithm and EM finds the parameters of an HMM but it provides no guidance into choosing the architecture. The need to specify the architecture is the weakness, and strength, of using HMMs. In applications where there is some a priori insight into the internal states it is straightforward to build that in. A classic example, which helped drive the development of HMMs, is in speech recognition. Here the outputs can be parameters for a sound synthesis model, say ARMA coefficients (Section arma), and the internal states are phonemes and then words. It's hard to recognize these primitives from just a short stretch of sound, but the possible utterances are a strong function of what has preceeded them. The same thing applies to many other recognition tasks, such as reading handwriting, where a scrawled letter can be interpreted based on its context. An HMM provides the means to express these ideas.

The most important application of an HMM comes not on the training data but in applying the resulting model to deduce the hidden states given new data. To do this we want to find the most likely sequence of states given the data, $$ \argmax{x_1\ldots x_T} p(x_1,\ldots,x_T|y_1,\ldots,y_T) = \argmax{x_1\ldots x_T} \frac{p(x_1,\ldots,x_T,y_1,\ldots,y_T)} {p(y_1,\ldots,y_T)} $$ $$ = \argmax{x_1\ldots x_T} p(x_1,\ldots,x_T,y_1,\ldots,y_T) $$ (the denominator can be dropped because it doesn't affect the maximization). ${\rm argmax}_x f(x)$ is defined to be the argument $x$ that gives $f$ the maximum value, as compared to ${\rm max}_x f(x)$ which is the value of $f$ at the maximum.

Naively this requires checking the likelihood of every path through the trellis, an ${\cal O}(N^T)$ calculation. Not surprisingly, the same recursive trick that we used before also works here. Start by factoring out the final step and dropping terms that are irrelevant to the distribution, $$ \ba & & \max_{x_1 \ldots x_T} p(x_1,\ldots,x_T,y_1,\ldots,y_T) \\ &=& \max_{x_1 \ldots x_T} p(x_T,y_T|x_1,\ldots,x_{T-1},y_1,\ldots,y_{T-1}) ~ p(x_1,\ldots,x_{T-1},y_1,\ldots,y_{T-1}) \\ &=& \max_{x_1 \ldots x_T} p(x_T,y_T|x_{T-1}) ~ p(x_1,\ldots,x_{T-1},y_1,\ldots,y_{T-1}) \\ &=& \max_{x_1 \ldots x_T} p(y_T|x_T,x_{T-1}) ~ p(x_T|x_{T-1}) ~ p(x_1,\ldots,x_{T-1},y_1,\ldots,y_{T-1}) \\ &=& \max_{x_T} ~ p(y_T|x_T) \max_{x_1 \ldots x_{T-1}} p(x_T|x_{T-1}) ~ p(x_1,\ldots,x_{T-1},y_1,\ldots,y_{T-1}) ~~. \ea $$ Continuing in this fashion back to the beginning, $$ \ba & & \max_{x_1 \ldots x_T} p(x_1,\ldots,x_T,y_1,\ldots,y_T) \\ &=& \max_{x_T}~p(y_T|x_T) ~\max_{x_{T-1}}~p(x_T|x_{T-1})~p(y_{T-1}|x_{T-1}) \\ & & \cdots~\max_{x_2}~p(x_3|x_2)~p(y_2|x_2) ~\max_{x_1}~p(x_2|x_1)~p(y_1|x_1)~p(x_1) ~~. \ea $$ This is now once again an ${\cal O}(N^2T)$ calculation. It is called the Viterbi algorithm, and is very important beyond HMMs in decoding signals sent through noisy channels that have had correlations introduced by a convolutional coder {Sklar:88}. There is one subtlety in implementing it: each maximization has a set of outcomes based on the unknown value of the following step. This is handled by using the maximum value for each outcome and keeping track of which one was used at each step, then backtracking from the end of the calculation once the final maximization is known.

Figures {fittrans} and {marktrel} were used to help explain Kalman filtering and HMMs by drawing the connections among the variables. It's possible to go much further with such diagrams, using them to write down probabilistic models with more complex dependencies than what we've covered, and applying graphical techniques to arrive at the kinds of simplifications we found to make the estimation problems tractable {Jordan:99}. Such architectural complexity is useful when there is advance knowledge to guide it; the next chapter turns to the opposite limit.

  • Factor graphs
$$ \ba m_{x_i\rightarrow f_i}(x_i) &=& \prod_{f_j~(\sim f_i)} m_{f_j\rightarrow x_i}(x_i) \\ m_{f_i\rightarrow x_i}(x_i) &=& \sum_{\sim x_i} f_i(\{\vec{x}\}_i) \prod_{x_j~(\sim x_i)} m_{x_j\rightarrow f_i}(x_j) \ea $$

References

Brown, Robert Grover, & Hwang, Patrick~Y.C. (1997). Introduction to Random Signals and Applied Kalman Filtering. 3rd edn. New York, NY: Wiley.

A good practical introduction to estimation theory.

Catlin, Donald E. (1989). Estimation, Control, and the Discrete Kalman Filter. Applied Mathematical Sciences, vol. 71. New York, NY: Springer Verlag.

The rigorous mathematical background of Kalman filtering.

Honerkamp, Josef (1994). Stochastic Dynamical Systems: Concepts, Numerical Methods, Data Analysis. New York, NY: VCH. Translated by Katja Lindenberg.

The connection between estimation theory and the theory of stochastic dynamical systems.

Duda, Richard O., & Hart, Peter E. (1973). Pattern Classification and Scene Analysis. New York, NY: Wiley.

Therrien, Charles W. (1989). Decision, Estimation, and Classification: An Introduction to Pattern Recognition and Related Topics. New York, NY: Wiley.

Fukunaga, Keinosuke (1990). Introduction to Statistical Pattern Recognition. 2nd edn. Boston, MA: Academic Press.

These are classic pattern recognition texts.

Problems

  1. What is the approximate Kalman gain matrix in the limit of small measurement noise? How is the error matrix updated in this case?

  2. Take as a test signal a periodically modulated sinusoid with noise added, $$ y_n = \sin[0.1 t_n + 4 \sin(0.01 t_n)] + \eta

    \equiv \sin(\theta_n) + \eta ~~~~,
    

    $$ where $\eta$ is a Gaussian noise process with $\sigma=0.1$. Design an extended Kalman filter to estimate the noise-free signal. Use a two-component state vector $\ve{x}_n = (\thetan,\theta{n-1})$, and assume for the internal model a linear extrapolation $\theta_{n+1} = \theta_n+(\thetan-\theta{n-1})$. Take the system noise matrix $\mat{N}^x$ to be diagonal, and plot the predicted value of $y$ versus the measured value of $y$ if the standard deviation of the system noise is chosen to be $10^{-1}$, $10^{-3}$, and $10^{-5}$. Use the identity matrix for the initial error estimate.

  3. Given an HMM trellis (Figure {marktrel}), work out the probability to see an internal state given the observations $p(x_t|y_1,\ldots,y_T)$ in terms of forward and backward terms.

    1. Generate observations from the model in Figure {markhide}, with a fair and a biased coin as the internal states, and probabilities to switch between them.

    2. Use knowledge of this model and those observations to estimate the probabilities for which coin was used when, and compare with the correct values.


(c) Neil Gershenfeld 5/2/26