Function Fitting

The Nature of Mathematical Modeling (draft)

$ \def\%#1{\mathbf{#1}} \def\mat#1{\mathbf{#1}} \def\*#1{\vec{#1}} \def\ve#1{\vec{#1}} \def\ds#1#2{\cfrac{d #1}{d #2}} \def\dd#1#2{\cfrac{d^2 #1}{d #2^2}} \def\ps#1#2{\cfrac{\partial #1}{\partial #2}} \def\pp#1#2{\cfrac{\p^2 #1}{\p #2^2}} \def\p{\partial} \def\ba{\begin{eqnarray*}} \def\ea{\end{eqnarray*}} \def\eps{\epsilon} \def\del{\nabla} \def\disp{\displaystyle} \def\la{\langle} \def\ra{\rangle} \def\unit#1{~({\rm #1)}} \def\units#1#2{~\left(\frac{\rm #1}{\rm #2}\right)} $

The goal of function fitting is to choose values for the parameters in a function to best describe a set of data. There are many possible reasons to do this. If a specific meaningful form for the function with a small number of free parameters is known in advance, this is called parametric fitting, and finding the parameter values themselves may be the goal. For example, an exponential decay constant might be sought to determine a reaction rate. If the form of the function is not known, and so a very flexible function with many free parameters is used, this becomes nonparametric fitting (although this distinction is often vague). One reason to do nonparametric fitting is to try to find and describe underlying trends in noisy data, in which case the fit must build in some prior beliefs and posterior observations about what defines the difference between the signal and the noise. In function approximation there is no noise; the goal is to take known (and perhaps laboriously calculated) values of a function and find a new function that is easier to evaluate and that can interpolate between, or extrapolate beyond, the known values.

It's useful to view function fitting in a context such as the Minimum Description Length principle (MDL) {Rissanen:86}, or the related Algorithmic Information Theory {Chaitin:90}. One extreme is to report the observed data itself as your model. This is not hard to do, but the ``model'' is very large and has no ability to generalize. Another extreme is to report the smallest amount of information needed to describe the data, but this may require a great deal of supporting documentation about how to use the model. A tidy computer program is of no use to a Martian unless it comes with a complete description of a computer that can run it. The best model typically lies between these extremes: there is some amount of information about the data, and some about the model architecture. According to MDL, the sum of these two kinds of information taken together should be as small as possible. While this principle cannot be applied directly (like so many other attractive ideas, it includes a solution to the halting problem of deciding if an arbitrary program will terminate, which is known to be impossible {Turing:36}{Chaitin:94}), it is a useful guiding principle that can be made explicit given specific assumptions about a problem.

In this chapter we will look at the basic features of function fitting: the general principles by which data can be used to constrain a model, the (often overlooked) connection with the choice of an error measure, how to fit a model with linear and nonlinear parameters, and the limits on what we can expect to learn from fitting. We will not be particularly concerned with the functional form used; coming chapters will look in much more detail at the representation of data, functions, and optimization strategies.

Model Estimation

The general fitting problem has three ingredients: a model architecture (which we'll call $m$) that has a set of adjustable parameters $\phi$, and measured data $d$. The goal is to find values of the parameters that lead to the best agreement (in some sense) between the predictions of the model and the data. An example of $m$ might be a polynomial of a given order, where the $\phi$ are the coefficients.

A reasonable way to go about finding the best coefficients is to ask for the $\phi$ that are most likely given the choice of the model and the measured data. This means that we want to find the $\phi$ that maximizes $p(\phi|d,m)$. In Random Systems we saw that our job is then to find $$ \ba \max_\phi~p(\phi|d,m) &=& \max_\phi~\frac{p(\phi,d,m)}{p(d,m)} \\ &=& \max_\phi~\frac{p(d|\phi,m)~p(\phi|m)~p(m)}{p(d|m)~p(m)} \\ &=& \max_\phi~\frac{p(d|\phi,m)~p(\phi|m)}{p(d|m)} \\ &=& \max_\phi~\frac{p(d|\phi,m)~p(\phi|m)}{\int_\phi p(d,\phi|m)~d\phi} \\ &=& \max_\phi~\frac{p(d|\phi,m)~p(\phi|m)}{\int_\phi p(d|\phi,m) ~p(\phi|m)~d\phi} \\ &=& \max_\phi~\frac{{\rm likelihood}\times{\rm prior}}{\rm evidence} ~~~~. \tag{Bayesian} \ea $$ Much of the work in Bayesian model estimation goes into the integration for the evidence term. But this does not affect a single maximization over $\phi$; it comes in making comparisons among competing model architectures. If we decide in advance that we are going to stick with one architecture then equation (Bayesian) can be simplified by dropping the conditioning on the model: $$ \max_\phi~p(\phi|d) = \max_\phi~\frac{p(d|\phi)~p(\phi)}{p(d)} ~~~~. \tag{MAP} $$ Now the evidence term has become a simple prior on the likelihood of the data set. Even this can usually be dropped; it's relevant only in combining multiple data sets of varying pedigrees. Finding parameters with equation (MAP) is called Maximum A Posteriori estimation.

MAP still requires putting a prior on the parameters. This is a very powerful idea, to be explored in the next chapter, but if we make the simplest choice of a uniform prior $p(\phi)=p(d)=1$ then we're left with $$ \max_\phi~p(\phi|d) = \max_\phi~p(d|\phi) ~~~~. \tag{ML} $$ This is the easiest kind of model estimation of all, called Maximum Likelihood (ML). That is what we will now apply.

Least Squares

Let's assume that we are given a set of $N$ noisy measurements of a quantity $y_n$ as a function of a variable $x_n$, and we seek to find values for coefficients $\phi$ in a function $y_n = y(x_n, \phi)$ that describes their relationship (the generalization to vector variables will be straightforward). In Section {stoch-clt} we learned that in the absence of any other information the Central Limit Theorem tells us that the most reasonable choice for the distribution of a random variable is Gaussian, and so we will make that choice for the distribution of errors in $y_n$. Problem {maxent} will use an entropy argument to reach the same conclusion. In practice many systems do not honor this insight and have non-Gaussian error distributions; the real reason why Gaussianity is so commonly (and frequently implicitly) assumed is that it leads to a particularly simple and useful error model: least squares.

If the errors do have a Gaussian distribution around the true value $y(x_n,\phi)$ then the probability to observe a value between $y$ and $y+dy$ is given by a Gaussian centered on the correct value $$ p(y)~dy = \frac{1}{\sqrt{2\pi\sigma_n^2}} ~e^{-[y-y(x_n,\phi)]^2/(2\sigma_n^2)}~dy ~~~~. $$ The variance $\sigma_n^2$ might depend on quantities such as the noise in a photodetector or the number of samples that are measured.

We will further assume that the errors between samples are independent as well as identically distributed (iid). This means that the probability to see the entire data set is given by the product of the probabilities to see each point, $$ p({\rm data}|{\rm model}) = \prod_{n=1}^N \frac{1}{\sqrt{2\pi\sigma_n^2}} ~e^{-[y_n-y(x_n,\phi)]^2/(2\sigma_n^2)}~~~~. $$
We seek the $\phi$ that maximizes this probability. If $p$ is maximal then so is its logarithm (the log-likelihood), and since the log of a product is equal to the sum of the logs, this becomes $$ -\log p({\rm data}|{\rm model}) = \sum_{n=1}^N \frac{[y_n-y(x_n,\phi)]^2}{2\sigma_n^2} + \frac{1}{2}\log(2\pi\sigma_n^2) ~~~~. $$ Because we've moved the minus sign to the left hand side we now want to find the $\phi$ that minimizes the right hand side. The first term measures the distance between the data and the model, and the second one catches us if we try to cheat and make a model with a huge variance that explains everything equally well (or poorly). We can drop the second term since it does not depend on the parameters $\phi$ that we are adjusting, and so we want to find the values that satisfy $$ \min_{\phi} \sum_{n=1}^N \frac{[y_n-y(x_n,\phi)]^2}{2\sigma_n^2} ~~~~. $$ If the variances $\sigma_n^2$ are constant (in particular, if we set $\sigma_n^2=1$ when we have no idea at all what it should be) this reduces to $$ \min_{\phi} \sum_{n=1}^N [y_n-y(x_n,\phi)]^2 ~~~~. $$ This is the familiar least squares error measure. It is the maximum likelihood estimator for data with normally distributed errors, but it is used much more broadly because it is simple, convenient, and frequently not too far off from an optimal choice for a particular problem. An example of where least squares might be a bad choice for an error measure is a bimodal data set: the least squares error is minimized by a point between the two peaks, but such a point has very little probability of actually occurring in the data set.

Instead of the square of the deviation between the model and the data, other powers can be used as an error measure. The first power (the magnitude of the difference) is the maximum likelihood estimator if the errors have a two-sided exponential distribution, and higher powers place more emphasis on outliers.

Linear Least Squares

Once we've chosen our error measure we need to find the parameters for the distribution that minimizes it. Perhaps the most important example of such a technique is linear least squares, because it is straightforward to implement and broadly applicable.

To do a least squares fit we will start by expanding our unknown function as a linear sum of $M$ known basis functions $f_m$ $$ y(x) = \sum_{m=1}^M a_m f_m(x) ~~~~. $$ We want to find the coefficients $a_m$ that minimize the sum of the squared errors between this model and a set of $N$ given observations $y_n(x_n)$. The basis functions $f_m$ need not be orthogonal, but they must be linearly independent (otherwise some terms would be redundant); it is the coefficients $a_m$ that enter linearly. For example, the $f_m$ could be polynomial terms, with the $a_m$ as the coefficients of the polynomial.

A least squares fit can be written as a matrix problem $$ \left( \begin{array}{c c c c} f_1(x_1) & f_2(x_1) & \cdots & f_M(x_1) \\ f_1(x_2) & f_2(x_2) & \cdots & f_M(x_2) \\ f_1(x_3) & f_2(x_3) & \cdots & f_M(x_3) \\ \vdots & \vdots & \vdots & \vdots \\ f_1(x_{N-1}) & f_2(x_{N-1}) & \cdots & f_M(x_{N-1}) \\ f_1(x_N) & f_2(x_N) & \cdots & f_M(x_N) \end{array} \right) \left( \begin{array}{c} a_1 \\ a_2 \\ \vdots \\ a_M \end{array} \right) = \left( \begin{array}{c} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_{N-1} \\ y_N \end{array} \right) ~~~~. \tag{fit-mat} $$ If we have the same number of free parameters as data points then the matrix will be square, and so the coefficients can be found by inverting the matrix and multiplying it by the observations. As long as the matrix is not singular (which would happen if our basis functions were linearly dependent), this inversion could be done exactly and our fit would pass through all of the data points. If our data are noisy this is a bad idea; we'd like to have many more observations than we have model parameters. We can do this if we use the pseudo-inverse of the matrix to minimize the least squared error with the Singular Value Decomposition (SVD) that we saw in Linear Algebra and Calculus.
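As a concrete sketch (with a hypothetical quadratic basis and synthetic data, not an example from the text), the rectangular matrix problem can be assembled and solved in the least squares sense with NumPy; `lstsq` computes the pseudo-inverse solution via an SVD internally:

```python
import numpy as np

# Sketch: fit y = a1 + a2*x + a3*x^2 to noisy synthetic data by
# building the N x M matrix of basis functions f_m(x_n) and solving
# the rectangular system in the least squares sense.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
a_true = np.array([1.0, -2.0, 3.0])
y = a_true[0] + a_true[1]*x + a_true[2]*x**2 + 0.01*rng.standard_normal(x.size)

A = np.stack([np.ones_like(x), x, x**2], axis=1)   # columns are f_m(x_n)
a_fit, res, rank, sv = np.linalg.lstsq(A, y, rcond=None)
```

With many more observations than parameters the recovered coefficients agree with the generating ones to within the noise.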

Singular Value Decomposition

The computational cost of finding the SVD of an $N\times M$ matrix is ${\cal O} (NM^2+M^3)$. This is comparable to the ordinary inversion of an $M\times M$ matrix, which is ${\cal O} (M^3)$, but the prefactor is larger. Because of its great practical significance, good SVD implementations are available in most mathematical packages. We can now see that it is ideal for solving equation ({fit-mat}). If the basis functions are chosen to be polynomials, the matrix to be inverted is called a Vandermonde matrix. For example, let's say that we want to fit a 2D bilinear model $z = a_0 + a_1 x + a_2 y + a_3 x y$. Then we must invert $$ \left( \begin{array}{c c c c} 1 & x_1 & y_1 & x_1 y_1 \\ 1 & x_2 & y_2 & x_2 y_2 \\ 1 & x_3 & y_3 & x_3 y_3 \\ \vdots & \vdots & \cdots & \vdots \\ 1 & x_{N-1} & y_{N-1} & x_{N-1} y_{N-1} \\ 1 & x_N & y_N & x_N y_N \end{array} \right) \left( \begin{array}{c} a_0 \\ a_1 \\ a_2 \\ a_3 \end{array} \right) = \left( \begin{array}{c} z_1 \\ z_2 \\ z_3 \\ \vdots \\ z_{N-1} \\ z_N \end{array} \right) ~~~~. $$ If the matrix is square ($M=N$), the solution can go through all of the data points. These are interpolating polynomials, like those we used for finite elements. A rectangular matrix ($M<N$) is the relevant case for fitting data. If there are any singular values near zero (given noise and the finite numerical precision they won't be exactly zero) it means that some of our basis functions are nearly linearly dependent and should be removed from the fit. If they are left in, SVD will find the set of coefficients with the smallest overall magnitude, but it is best for numerical accuracy (and convenience in using the fit) to remove the terms with small singular values. The SVD inverse will then provide the coefficients that give the best least squares fit. 
Choosing where to cut off the spectrum of singular values depends on the context; a reasonable choice is the largest singular value weighted by the computer's numerical precision, or by the fraction of noise in the data {Golub:96}.
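A minimal sketch of this procedure (assuming the bilinear example above, with synthetic noiseless data): take the SVD, zero the reciprocals of singular values below a threshold set relative to the largest one, and apply the resulting pseudo-inverse.

```python
import numpy as np

# Least squares via the SVD A = U diag(w) V^T: the pseudo-inverse
# replaces 1/w_j by 0 for singular values below a cutoff relative
# to the largest one, removing nearly linearly dependent terms.
def svd_solve(A, y, rcond=1e-10):
    U, w, Vt = np.linalg.svd(A, full_matrices=False)   # w is sorted descending
    winv = np.where(w > rcond * w[0], 1.0 / w, 0.0)    # cut small singular values
    return Vt.T @ (winv * (U.T @ y))

# bilinear model z = a0 + a1 x + a2 y + a3 x y on synthetic data
rng = np.random.default_rng(1)
x, y = rng.random(100), rng.random(100)
z = 1.0 + 2.0*x - 3.0*y + 0.5*x*y
A = np.stack([np.ones_like(x), x, y, x*y], axis=1)
a = svd_solve(A, z)
```

The `rcond` parameter plays the role of the cutoff discussed above; raising it toward the noise fraction in the data discards more weakly determined terms.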

In addition to removing small singular values to eliminate terms which are weakly determined by the data, another good idea is to scale the expansion terms so that the magnitudes of the coefficients are comparable, or even better to rescale the data to have unit variance and zero mean (almost always a good idea in fitting). If 100 is a typical value for $x$ and $y$, then $10^4$ will be a typical value for their product. This means that the coefficient of the $xy$ term must be $\sim 100$ times smaller than the coefficient of the $x$ or $y$ terms. For higher powers this problem will be even worse, ranging from an inconvenience in examining the output from the fit, to a serious loss of numerical precision as a result of multiplying very large numbers by very small numbers.

Nonlinear Least Squares

Now consider the general case of a model with $M$ parameters $\ve{a}$ entering, possibly nonlinearly, in a function $f(x,\ve{a})$. The error to be minimized is $$ \ba \chi^2(\ve{a}) &=& \sum_{n=1}^N \frac{[y_n-f(x_n,\ve{a})]^2}{\sigma_n^2} \\ &=& (\ve{y}-\ve{f})^T\mat{W}(\ve{y}-\ve{f}) \tag{error} \ea $$ where $\ve{y} = (y_1,\ldots,y_N)$ is the vector of $N$ observations, $\ve{f} = (f(x_1,\ve{a}),\ldots,f(x_N,\ve{a}))$ is the vector of model predictions, and $\mat{W}$ is the diagonal matrix of weights $W_{ii} = 1/\sigma_i^2$ (a more general error model could use an off-diagonal $\mat{W}$).

Gradient Descent

The gradient of the error with respect to the parameters is

$$ \ba \ps{\chi^2}{\ve{a}} &=& 2(\ve{y}-\ve{f})^T\mat{W}\ps{}{\ve{a}}(\ve{y}-\ve{f}) \\ &=& -2(\ve{y}-\ve{f})^T\mat{W}\ps{\ve{f}}{\ve{a}} \\ &=& -2(\ve{y}-\ve{f})^T\mat{W}\mat{J} \ea $$

where $\mat{J}$ is the $N\times M$ Jacobian matrix of partial derivatives

$$J_{ij} = \ps{f_i}{a_j}$$

Gradient descent takes a step of size $\alpha$ down the gradient:

$$ \ve{a}_{new} = \ve{a}_{old}+\alpha\mat{J}^T\mat{W}(\ve{y}-\ve{f}) $$

Because the gradient vanishes at the minimum, gradient descent slows down just as it approaches the solution.
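The step rule above can be sketched for an assumed linear model $f(x,\ve{a}) = a_1 + a_2 x$ with unit weights ($\mat{W}=\mat{I}$) and a fixed small step size $\alpha$ (all illustrative choices):

```python
import numpy as np

# Gradient descent on chi^2 for f(x; a) = a1 + a2*x with unit weights:
# repeatedly step along J^T (y - f), the downhill gradient direction.
x = np.linspace(0.0, 1.0, 50)
y = 1.0 + 2.0*x                              # noiseless synthetic line
J = np.stack([np.ones_like(x), x], axis=1)   # J_ij = df(x_i)/da_j
a = np.zeros(2)
alpha = 0.01
for _ in range(5000):
    r = y - J @ a            # residual y - f
    a += alpha * (J.T @ r)   # step in the gradient direction
```

Note how many iterations even this trivial problem takes; the error shrinks by a constant factor per step, which is the slowdown near the minimum mentioned above.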

Newton's Method

Newton's method instead uses a second-order expansion of the error,

$$ \chi^2(\ve{a}+\ve{h}) \approx \chi^2(\ve{a})+[\del\chi^2(\ve{a})]^T\ve{h}+\frac{1}{2}\ve{h}^T\mat{H}\ve{h} $$

where $\mat{H}$ is the Hessian matrix of second derivatives

$$H_{ij}=\frac{\partial^2\chi^2}{\partial a_i\partial a_j}$$

Setting the gradient of this expansion to zero and solving for the step $\ve{h}$ gives

$$ \ba 0 &=& \del\chi^2(\ve{a})+\mat{H}\ve{h}\\ \ve{h} &=& -\mat{H}^{-1}\del\chi^2(\ve{a}) \ea $$

Here the Hessian sets the step size as well as the direction, and the step is exact for quadratic functions.
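As a check of that last point, here is a hypothetical quadratic error surface for which a single Newton step lands exactly on the minimum:

```python
import numpy as np

# For a quadratic chi^2(a) = (a - a_min)^T A (a - a_min), the gradient
# is 2A(a - a_min) and the Hessian is H = 2A, so one Newton step
# h = -H^{-1} grad moves from any starting point straight to a_min.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])        # assumed positive definite
a_min = np.array([1.0, -2.0])

a = np.array([10.0, 10.0])        # arbitrary start
grad = 2.0 * A @ (a - a_min)
H = 2.0 * A
a_new = a - np.linalg.solve(H, grad)
```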

Gauss-Newton Method

The Gauss-Newton method is a special case of Newton's method for least squares problems.

It starts from a first-order expansion of the function,

$$ \ve{f}(\ve{a}+\ve{h}) \approx \ve{f}(\ve{a})+\mat{J}\ve{h} $$

Substituting this into equation (error) gives a quadratic approximation to the error,

$$ \chi^2(\ve{a}+\ve{h}) \approx \ve{y}^T\mat{W}\ve{y}+ \ve{f}^T\mat{W}\ve{f} -2\ve{y}^T\mat{W}\ve{f} -2(\ve{y}-\ve{f})^T\mat{W}\mat{J}\ve{h} +\ve{h}^T\mat{J}^T\mat{W}\mat{J}\ve{h} ~~~~. $$

Taking the derivative with respect to $\ve{h}$,

$$ \ps{}{\ve{h}}\chi^2(\ve{a}+\ve{h}) \approx -2(\ve{y}-\ve{f})^T\mat{W}\mat{J}+ 2\ve{h}^T\mat{J}^T\mat{W}\mat{J} $$

and setting this equal to 0 to find the minimum gives the Gauss-Newton step

$$ \ve{h} = \left[\mat{J}^T\mat{W}\mat{J}\right]^{-1}\mat{J}^T\mat{W}(\ve{y}-\ve{f}) $$

This step is unreliable away from the minimum, where the quadratic approximation to the error fails.

Levenberg-Marquardt Method

The Levenberg-Marquardt method interpolates between gradient descent and Gauss-Newton by adding an adjustable diagonal term: $$ \ve{h} = \left[\mat{J}^T\mat{W}\mat{J}+\lambda\mat{I}\right]^{-1}\mat{J}^T\mat{W}(\ve{y}-\ve{f}) $$

For $\lambda=0$ this reduces to Gauss-Newton, while for large $\lambda$ it approaches gradient descent with step size $1/\lambda$. A simple strategy for adjusting $\lambda$:

  • Start with $\lambda$ large.
  • Take a step.
  • If the error decreases, keep the step and decrease $\lambda$.
  • If the error increases, reject the step and increase $\lambda$.

A local (per-parameter) rather than a global $\lambda$ can also be used.

The starting guess needs to be sufficiently close to the solution, and the method does not handle local minima; these limitations will be discussed in Search.
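Putting the pieces together, here is a minimal Levenberg-Marquardt sketch for an assumed exponential model $f(x,\ve{a}) = a_1 e^{-a_2 x}$ with unit weights; the model, data, and schedule for $\lambda$ are all illustrative choices, not a prescription from the text:

```python
import numpy as np

# Levenberg-Marquardt sketch: h = (J^T J + lambda I)^{-1} J^T (y - f),
# with lambda decreased after a successful step and increased after
# a rejected one (unit weights W = I assumed).
def f(x, a):
    return a[0] * np.exp(-a[1] * x)

def jac(x, a):                       # J_ij = df(x_i)/da_j
    e = np.exp(-a[1] * x)
    return np.stack([e, -a[0] * x * e], axis=1)

def levmar(x, y, a, lam=1000.0, steps=200):
    chi2 = np.sum((y - f(x, a))**2)
    for _ in range(steps):
        J = jac(x, a)
        r = y - f(x, a)
        h = np.linalg.solve(J.T @ J + lam * np.eye(a.size), J.T @ r)
        chi2_new = np.sum((y - f(x, a + h))**2)
        if chi2_new < chi2:          # keep the step, move toward Gauss-Newton
            a, chi2, lam = a + h, chi2_new, lam / 10.0
        else:                        # reject the step, move toward gradient descent
            lam *= 10.0
    return a

rng = np.random.default_rng(2)
x = np.linspace(0.0, 5.0, 200)
y = f(x, [2.0, 0.7]) + 0.01 * rng.standard_normal(x.size)
a_fit = levmar(x, y, np.array([1.0, 1.0]))
```

Starting with $\lambda$ large makes the early steps conservative gradient-like moves; as steps succeed, $\lambda$ shrinks and the iteration switches over to fast Gauss-Newton convergence.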

Errors

The preceding fitting procedures find the best values for adjustable parameters, but there's no guarantee that the best is particularly good. As important as finding the values is estimating their errors. These come in two flavors: statistical errors from the experimental sampling procedure, and systematic errors due to biases in the measuring and modeling process. There are a number of good techniques for estimating the former; bounding the latter can be more elusive and requires a detailed analysis of how the data are acquired and handled.

It's frequently the case that there is an error model for the measurements known in advance, such as counting statistics or the presence of Johnson noise in an amplifier {Gershenfeld:00pit}. For a linear fitting procedure it is then straightforward to propagate this through the calculation. If there is an error $\ve{\zeta}$ in a measurement, it causes an error $\ve{\eta}$ in an SVD fit: $$ \ve{v}+\ve{\eta} = \mat{V}\cdot\mat{W}^{-1}\cdot\mat{U}^T\cdot(\ve{b}+\ve{\zeta}) ~~~~. $$ Since this is a linear equation, these random variables are related by $$ \ve{\eta} = \mat{V}\cdot\mat{W}^{-1}\cdot\mat{U}^T\cdot\ve{\zeta} ~~~~. $$ If the components of $\vec{\zeta}$ are zero-mean with a variance $\sigma_\zeta^2$ then the variance in the $i$th component of $\vec{\eta}$ is


$$ \ba \sigma_{\eta,i}^2 &=& \la \eta_{i} \eta_{i} \ra \\ &=& \left\langle \sum_j V_{ij} \frac{1}{w_j} \sum_k U_{kj} \zeta_{k} \sum_l V_{il} \frac{1}{w_l} \sum_m U_{ml} \zeta_{m} \right\rangle \\ &=& \sum_j \sum_l V_{ij} V_{il} \frac{1}{w_j} \frac{1}{w_l} \sum_k \sum_m U_{kj} U_{ml} \underbrace{ \la\zeta_{k}\zeta_{m}\ra}_{\disp \sigma_{\zeta}^2\delta_{km}} \\ &=& \sum_j \sum_l V_{ij} V_{il} \frac{1}{w_j} \frac{1}{w_l} \underbrace{\sum_k U_{kj} U_{kl}}_{\disp \delta_{jl}} \sigma_{\zeta}^2 \\ &=& \sigma_{\zeta}^2 \sum_j \frac{V_{ij}^2}{w_j^2} ~~~~. \ea $$

In the second line we've assumed that the measurement errors are an uncorrelated random variable with a fixed variance $\sigma_\zeta^2$; if this is not the case then their covariance must be carried through the calculation. In the third line we've used the orthonormality of $\mat{U}$. Therefore, the error in the fit relative to that in the data is $$ \tag{svderror} \frac{\sigma_{\eta,i}^2}{\sigma_\zeta^2} = \sum_j \frac{V_{ij}^2}{w_j^2} ~~~~. $$
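Equation (svderror) can be checked numerically with a hypothetical design matrix: propagate an ensemble of Gaussian measurement errors through the SVD fit and compare the observed coefficient variances with the formula (a Monte Carlo sketch, not part of the derivation above):

```python
import numpy as np

# Monte Carlo check of the error propagation formula: fit coefficients
# via v = V diag(1/w) U^T b, add noise of variance sigma_zeta^2 to b,
# and compare coefficient variances with sigma_zeta^2 * sum_j V_ij^2 / w_j^2.
rng = np.random.default_rng(3)
x = np.linspace(-1.0, 1.0, 40)
A = np.stack([np.ones_like(x), x, x**2], axis=1)   # assumed quadratic basis
U, w, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T

sigma_zeta = 0.1
predicted = sigma_zeta**2 * np.sum(V**2 / w**2, axis=1)

b = x**2 - x                                       # any noiseless data vector
fits = np.array([V @ ((U.T @ (b + sigma_zeta * rng.standard_normal(b.size))) / w)
                 for _ in range(20000)])
observed = fits.var(axis=0)
```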

Absent such insight into an error model for the measurements it's necessary to use the observations themselves to estimate the fitting errors. If you are in the fortunate position of being able to generate unlimited amounts of data this can be done following its definition by analyzing an ensemble of independent data sets and then reporting the distribution of fitting results. But in the much more likely circumstance of limited data it's not possible to produce an ensemble of measurements.

Or is it? Bootstrap resampling is based on the apparently-circular reasoning that the data set is drawn from the distribution that describes it and hence drawing from the data approximates that distribution. This is done by sampling with replacement. A random number generator is used to choose elements of the data set, with an element remaining in the original data set after it is chosen so that it can reappear in the derived one multiple times. Random selection continues until a new data set has been produced of the same size as the original one. This one uses the same elements, but it is an independent sampling of them. These data can then be fit, and the resampling done again as many times as desired. Problem {SVDerror} looks at an example of this.
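A sketch of the procedure for a hypothetical straight-line fit: resample the points with replacement, refit each resampled set, and take the spread of the refit parameters as the error estimate.

```python
import numpy as np

# Bootstrap error bar on the slope of a least squares line fit:
# each resampled data set draws N indices with replacement.
rng = np.random.default_rng(4)
N = 100
x = rng.random(N)
y = 1.0 + 2.0*x + 0.1*rng.standard_normal(N)

def fit_slope(x, y):
    A = np.stack([np.ones_like(x), x], axis=1)
    return np.linalg.lstsq(A, y, rcond=None)[0][1]

slopes = []
for _ in range(1000):
    idx = rng.integers(0, N, N)        # sample with replacement
    slopes.append(fit_slope(x[idx], y[idx]))
slope, slope_err = np.mean(slopes), np.std(slopes)
```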

Bootstrap error estimation was originally viewed as statistically suspect because it appears to violate the deeply-held belief that data should not be reused in an analysis. But, over time, both theoretical and experimental work has shown that, while it is not as reliable as using truly independent data sets, the bootstrap can provide useful information about errors that is not available in a single fit {Efron:94}. Its typical performance can in fact be quite good, although the worst-case errors in the error estimates can be quite bad.

Beyond bootstrap, in fitting functions that will be used for making forecasts the most important error is how well the model generalizes on data not seen in the fitting procedure. This can be evaluated through cross-validation, discussed in Section {fit-overfit}.

Functions

Functions can be fit to interpolate among and extrapolate beyond observations. There is a recurring tension between describing too little (using a function that does not have enough flexibility to follow the data) and too much (using a function that is so flexible that it fits noise, or produces artifacts absent in the original data). Along with matching the data, supporting concerns will be how easy a family of functions is to use and to understand. Most anyone who has done any kind of mathematical modeling has made such choices, often without being aware of the problems of old standbys and the power of less well-known alternatives.

We'll now look at some important classes of functions for fitting data along with broadly applicable principles for their use; Chapter {dense} will continue this discussion with ways to represent probability distributions, and Chapter {ml} will conclude with how function fitting is generalized in machine learning.

Polynomials

Padé Approximants

As we've already seen many times, a polynomial expansion has the form (in 1D) $$ y(x) = \sum_{n=0}^N a_n x^n ~~~~. $$ One failing of polynomials is that they have a hard time matching a function that has a pole. Since $(1-x)^{-1}=1+x+x^2+x^3+\ldots$, a simple pole is equivalent to an infinite-order power series. This is more serious than you might think because even if a function is evaluated only along the real axis, a nearby complex pole can slow down the convergence of a polynomial expansion. An obvious improvement is to use a ratio of polynomials, $$ y(x) = \frac{\sum_{n=0}^N a_n x^n}{1 + \sum_{m=1}^M b_m x^m} ~~~~. $$ This is called a Padé approximant {Baker:96}. By convention the constant term in the denominator is taken to be 1 since it's always possible to rescale the coefficients by multiplying the numerator and denominator by a constant. The order of the approximation is written $[N/M]$.

One application of Padé approximants is for functional approximation. If a function has a known Taylor expansion $$ f(x) = \sum_{l=0}^\infty c_l x^l ~~~~, $$ we can ask that it be matched up to some order by a Padé approximant $$ \frac{\sum_{n=0}^N a_n x^n}{1 + \sum_{m=1}^M b_m x^m} = \sum_{l=0}^L c_l x^l ~~~~. $$ The coefficients can be found by multiplying both sides by the denominator of the left hand side, $$ \sum_{n=0}^N a_n x^n = \sum_{l=0}^L c_l x^l + \sum_{m=1}^M \sum_{l=0}^L b_m c_l x^{l+m} ~~~~, $$ and equating powers of $x$. For powers $n\leq N$ this gives $$ a_n = c_n + \sum_{m=1}^{\min(n,M)} b_m c_{n-m} ~~~~(n = 0, \ldots, N) $$ and for higher powers $$ 0 = c_l + \sum_{m=1}^M b_m c_{l-m} ~~~~(l = N+1, \ldots, N+M) ~~~~. $$ The latter relationship gives us $M$ equations that can be solved to find the $M$ $b_m$'s in terms of the $c$'s, which can then be plugged into the former equation to find the $a$'s.
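These equations can be solved directly; as an illustrative check (assuming $f(x)=e^x$, whose Taylor coefficients are $c_l = 1/l!$), here is the $[2/2]$ approximant computed from the matching conditions:

```python
import numpy as np
from math import factorial

# Pade coefficients from Taylor coefficients: the M linear equations
# 0 = c_l + sum_m b_m c_{l-m} (l = N+1..N+M) determine the b's,
# then a_n = c_n + sum_m b_m c_{n-m} gives the numerator.
N, M = 2, 2
c = np.array([1.0 / factorial(l) for l in range(N + M + 1)])   # exp(x)

C = np.array([[c[l - m] for m in range(1, M + 1)]
              for l in range(N + 1, N + M + 1)])
b = np.linalg.solve(C, -c[N + 1:N + M + 1])
a = np.array([c[n] + sum(b[m - 1] * c[n - m]
                         for m in range(1, min(n, M) + 1))
              for n in range(N + 1)])

def pade(x):
    num = np.polyval(a[::-1], x)                # sum_n a_n x^n
    den = np.polyval(np.r_[1.0, b][::-1], x)    # 1 + sum_m b_m x^m
    return num / den
```

For this case the procedure reproduces the known $[2/2]$ approximant $(1+x/2+x^2/12)/(1-x/2+x^2/12)$, which already matches $e^x$ far better than the fourth-order Taylor polynomial built from the same coefficients.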

Alternatively, given a data set $\{y_i,x_i\}$ we can look for the best agreement by a Padé approximant. Relabeling the coefficients, this can be written as $$ \frac{\sum_{n=0}^N a_n x^n_i}{1 + \sum_{n=N+1}^{M+N} a_n x^{n-N}_i} = y_i ~~~~. $$ Once again multiplying both sides by the denominator of the left hand side and rearranging terms gives $$ \sum_{n=0}^N x_i^n a_n - \sum_{n=N+1}^{N+M} x_i^{n-N} y_i a_n = y_i ~~~~, $$ or $\mat{M}\cdot\ve{a} = \ve{y}$, where the $i$th row of $\mat{M}$ is $(x_i^0, \ldots, x_i^N, -x_i^1 y_i, \ldots, -x_i^M y_i)$, $\ve{a} = (a_0, \ldots, a_{N+M})$, and $\ve{y}$ is the vector of observations. This can now be solved by SVD to find $\ve{a}=\mat{M}^{-1}\cdot\ve{y}$, the set of coefficients that minimize the sum of the squares of the errors.

When Padé approximants are appropriate they work very well. They can converge remarkably quickly (Problem {Padeprob}), and can have an almost mystical ability to find analytical structure in a function that lets the approximation be useful far beyond where there are any small parameters in an expansion. However, their use requires care: after all, they're guaranteed to have poles whether or not the function being approximated does, and the domain and rate of convergence can be difficult to anticipate.

Splines

Another problem with global polynomials is that they have a hard time following local behavior, resulting in undesirable artifacts such as overshooting and ringing. In Chapter {finite} we saw one solution to this problem: define polynomials in finite elements, and then patch these local functions together with appropriate matching conditions. Splines use cubic polynomials to match their value and first and second derivatives at the boundaries (although the term is sometimes used even if just slopes are being matched). There are almost as many types of splines as there are applications, differing in how the constraints are imposed. Splines are not commonly used for data fitting or functional approximation, because they require finding and keeping track of the boundaries as well as the coefficients, but they are the workhorse of computer graphics, where they are used to represent curves and surfaces.

Natural splines pass through a given set of points, matching the derivatives as they cross the points. This is the most obvious way to define a spline, but it has the liability that moving one point affects all of the others. To be useful for an interactive application such as computer graphics it must be possible to make a local change and not have to recompute the entire curve. Nonuniform B-splines provide a clever solution to this problem {Foley:90}. The "B" part refers to the bases the polynomials provide, and the "nonuniform" part refers to the flexibility in defining how they join together.

A nonuniform B-spline curve is defined by $$ \ba \ve{x}_i(t) &=& \ve{c}_{i-3}\phi_{i-3}(t) + \ve{c}_{i-2}\phi_{i-2}(t) \\ & & +\ve{c}_{i-1}\phi_{i-1}(t) + \ve{c}_{i}\phi_{i}(t) ~~~~(t_i \leq t < t_{i+1})~~. \ea $$ $t$ parameterizes the position along the curve. The $\ve{c}_i$'s are control points, defined at given values of $t$ called knots, $t_i$. The $\phi_i$'s are the basis (or blending) functions that weight the control points. The basis functions are found recursively: $$ \ba \phi_i^{(1)}(t) &=& \left\{ \begin{array}{l l} 1 & (t_i \leq t < t_{i+1}) \\ 0 & {\rm otherwise} \end{array} \right. \\ \phi_i^{(2)}(t) &=& \frac{t-t_i}{t_{i+1}-t_i}\phi_i^{(1)}(t) +\frac{t_{i+2}-t}{t_{i+2}-t_{i+1}}\phi_{i+1}^{(1)}(t) \\ \phi_i^{(3)}(t) &=& \frac{t-t_i}{t_{i+2}-t_i}\phi_i^{(2)}(t) +\frac{t_{i+3}-t}{t_{i+3}-t_{i+1}}\phi_{i+1}^{(2)}(t) \\ \phi_i(t) \equiv \phi_i^{(4)}(t) &=& \frac{t-t_i}{t_{i+3}-t_i}\phi_i^{(3)}(t) +\frac{t_{i+4}-t}{t_{i+4}-t_{i+1}}\phi_{i+1}^{(3)}(t) ~~. \ea $$ The first line sets the basis function to zero except in intervals where the knots are increasing. The second linearly interpolates between these constants, the third does a linear interpolation between these linear functions to give quadratic polynomials, and the last line does one more round of linear interpolation to provide the final cubics. This hierarchy of linear interpolation preserves the original normalization, so that in each interval the basis functions sum to 1. This means that the curve is bounded by the polygon that has the control points as vertices, with the basis functions controlling how close it comes to them.
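The recursion can be transcribed directly; here is a sketch for the uniform knot sequence $[1~2~3~4~5~6~7~8]$ used in Figure {nurb1}, with the partition-of-unity property on an interior interval as a check (note the code uses 0-indexed knot arrays):

```python
import numpy as np

# The B-spline basis recursion from the text: phi_i^(1) is an
# indicator on [t_i, t_{i+1}), and each higher level linearly
# interpolates the level below.  For uniform knots the four cubic
# bases defined on an interior knot interval sum to 1 there.
def phi(i, k, t, knots):
    if k == 1:
        return np.where((knots[i] <= t) & (t < knots[i + 1]), 1.0, 0.0)
    left = (t - knots[i]) / (knots[i + k - 1] - knots[i]) * phi(i, k - 1, t, knots)
    right = (knots[i + k] - t) / (knots[i + k] - knots[i + 1]) * phi(i + 1, k - 1, t, knots)
    return left + right

knots = np.arange(1.0, 9.0)                      # [1 2 3 4 5 6 7 8]
t = np.linspace(4.0, 5.0, 50, endpoint=False)    # the interval [4, 5)
total = sum(phi(i, 4, t, knots) for i in range(4))
```

(For repeated knots, as in the second sequence below, the recursion needs the usual convention that terms with a zero denominator are dropped; that case is omitted here for brevity.)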

Figure {nurb1} shows the construction of the basis functions for the knot sequence $[1~2~3~4~5~6~7~8]$. The curve will approach each control point in turn, but not quite reach it because the maximum of the basis functions is less than 1. Figure {nurb2} repeats the construction for the knot sequence $[0~0~0~0~1~1~1~1]$. The repetitions force the curve to go through the first and last control points, and the two intermediate control points set the slopes at the boundaries. In the context of computer graphics the B-spline is called a B\'ezier curve, and the basis functions are called Bernstein polynomials.


Figure: Construction of non-uniform B-spline basis functions for the knot sequence $[1~2~3~4~5~6~7~8]$.


Figure: Construction of nonuniform B-spline basis functions for the knot sequence $[0~0~0~0~1~1~1~1]$.

B-splines can be generalized to higher dimensions, parameterized by more than one degree of freedom. Another important extension is to divide the polynomials for each coordinate by a scale-setting polynomial. These are called nonuniform rational B-splines, or NURBS. Along with the other benefits of ratios of polynomials described in the last section, this is convenient for implementing transformations such as a change of perspective. Nonuniform nonrational B-splines are a special case with the divisor polynomial equal to 1.

Orthogonal Functions

In a polynomial fit the coefficients at each order are intimately connected. Dropping a high-order term gives a completely different function; it's necessary to re-fit to find a lower-order approximation. For many applications it's convenient to have a functional representation that lets successive terms be added to improve the agreement without changing the coefficients that have already been found. For example, if an image is stored this way then the fidelity with which it is retrieved can be varied by changing the number of terms used based on the available display, bandwidth, and processor.

Expansions that let successive corrections be added without changing lower-order approximations are done with orthogonal functions, the functional analog to the orthogonal transformations that we saw in the last chapter. We have been writing an expansion of a function as a sum of scaled basis terms $$ y(\vec x) = \sum_{i=1}^M a_i f_i(\vec x) \tag{func-basis} $$ without worrying about how the $f_i$'s relate to each other. However, if we choose them to be orthonormal $$ \int_{-\infty}^{\infty} f_i(\vec x) f_j(\vec x) ~d\vec x = \delta_{ij} ~~~~, $$ then the expansion coefficients can be found by projection: $$ \ba \tag{func-orthog} \int_{-\infty}^{\infty} f_i(\vec x) y(\vec x) ~d\vec x &=& \int_{-\infty}^{\infty}f_i(\vec x)\sum_{j=1}^M a_j f_j(\vec x) ~d\vec x \\ &=& \sum_{j=1}^M a_j \delta_{ij} \\ &=& a_i ~~~~. \ea $$ Such an expansion is useful if we can analytically or numerically do the integral on the left hand side of equation ({func-orthog}).
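As a sketch of this projection (using the first three normalized Legendre polynomials on $[-1,1]$ as an assumed orthonormal set, and a midpoint-rule grid to approximate the integrals), the coefficients come straight back out:

```python
import numpy as np

# Projection onto an orthonormal basis: with f_i orthonormal, the
# coefficient a_i of y = sum a_i f_i is the integral of f_i * y.
# Here f_i = sqrt((2i+1)/2) P_i(x), scaled Legendre polynomials.
N = 20000
dx = 2.0 / N
x = -1.0 + dx * (np.arange(N) + 0.5)        # midpoint grid on [-1, 1]
f = [np.sqrt(0.5) * np.ones_like(x),
     np.sqrt(1.5) * x,
     np.sqrt(2.5) * 0.5 * (3*x**2 - 1)]
a_true = np.array([0.3, -1.2, 2.0])
y = sum(ai * fi for ai, fi in zip(a_true, f))
a_rec = np.array([np.sum(fi * y) * dx for fi in f])   # projection integrals
```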

Assume that we're given an arbitrary complete set of basis functions $\phi_i(\vec x)$ that can represent any function of interest, for example polynomials up to a specified order. An orthonormal set $f_i$ can be constructed by Gram--Schmidt orthogonalization. This is exactly the same procedure used to construct orthogonal vectors, with the dot product between vectors replaced by an integral over the product of functions. The first orthonormal function is found by scaling the first basis function $$ f_1(\vec x) = \frac{\phi_1(\ve{x})} {[\int_{-\infty}^{\infty} \phi_1(\vec x)\phi_1(\ve{x})~d\vec x]^{1/2}} $$ so that it is normalized: $$ \int_{-\infty}^\infty f_1(\ve{x}) f_1(\ve{x}) ~d\ve{x} ~=~ \frac{\int_{-\infty}^\infty \phi_1(\ve{x})\phi_1(\ve{x})~d\ve{x}} {\int_{-\infty}^{\infty} \phi_1(\vec x)\phi_1(\ve{x})~d\ve{x}} ~=~ 1 ~~~~. $$ The next member of the set is found by subtracting off the $f_1$ component of $\phi_2$ $$ f_2'(\ve{x}) = \phi_2(\ve{x}) - f_1(\ve{x}) \int_{-\infty}^\infty f_1(\ve{x}) \phi_2(\ve{x}) ~d\vec x $$ to produce a function that is orthogonal to $f_1$ $$ \ba \int_{-\infty}^\infty f_1(\ve{x}) f_2'(\ve{x})~d\ve{x} &=& \int_{-\infty}^\infty f_1(\ve{x}) \phi_2(\ve{x}) ~d\vec x \\ & & -\underbrace{\int_{-\infty}^\infty f_1(\ve{x}) f_1(\ve{x}) ~d\vec x}_ {\disp 1} \int_{-\infty}^\infty f_1(\ve{x}) \phi_2(\ve{x}) ~d\vec x \\ &=& 0 ~~~~, \ea $$ and then normalizing it $$ f_2(\ve{x}) = \frac{f_2'(\ve{x})} {[\int_{-\infty}^\infty f_2'(\ve{x})f_2'(\ve{x})~d\vec x]^{1/2}} ~~~~. $$ The third one is found by subtracting off these first two components and normalizing: $$ \ba f_3'(\ve{x}) &=& \phi_3 - f_1\int_{-\infty}^\infty f_1 \phi_3 ~d\ve{x} - f_2\int_{-\infty}^\infty f_2 \phi_3 ~d\vec x \\ f_3(\ve{x}) &=& \frac{f_3'(\ve{x})} {[\int_{-\infty}^\infty f_3'(\ve{x})f_3'(\ve{x})~d\vec x]^{1/2}} ~~~~, \ea $$ and so forth until the basis is constructed.
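The construction above can be sketched numerically by representing each function by its values at quadrature nodes, so the integral inner product becomes a weighted sum. Orthogonalizing the monomials $1, x, x^2, x^3$ over $[-1,1]$ this way recovers (up to roundoff) the normalized Legendre polynomials. An illustrative Python sketch, with the discretization choice (32 Gauss--Legendre nodes) arbitrary:

```python
import numpy as np

# Discretize the inner product <u, v> = int_{-1}^{1} u(x) v(x) dx with
# Gauss-Legendre quadrature; functions are represented by their values
# at the quadrature nodes x.
x, w = np.polynomial.legendre.leggauss(32)
inner = lambda u, v: np.sum(w * u * v)

def gram_schmidt(phis):
    """Orthonormalize a list of sampled basis functions phi_i."""
    fs = []
    for phi in phis:
        g = phi.copy()
        for f in fs:                         # subtract existing components
            g -= f * inner(f, phi)
        fs.append(g / np.sqrt(inner(g, g)))  # normalize
    return fs

# Monomials orthogonalize into the normalized Legendre polynomials.
fs = gram_schmidt([x**0, x, x**2, x**3])
```

For example, the third member comes out as $\sqrt{5/8}\,(3x^2-1)$, the normalized second Legendre polynomial.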

If a set of experimental observations $\{y_n,\ve{x}_n\}_{n=1}^N$ is available instead of a functional form $y(\ve{x})$ then it is not possible to directly evaluate the projection integrals in equation ({func-orthog}). We can still use orthonormal functions in this case, but they must now be constructed to be orthonormal with respect to the unknown probability density $p(\ve{x})$ of the measurements (assuming that the system is stationary so that the density exists). If we use the same expansion (equation {func-basis}) and now choose the $f_i$'s so that $$ \la f_i(\ve{x}) f_j(\ve{x}) \ra = \int_{-\infty}^\infty f_i(\ve{x}) f_j(\ve{x}) p(\ve{x}) ~d\ve{x} = \delta_{ij} ~~~~, $$ then the coefficients are found by the expectation $$ \ba \la y(\ve{x}) f_i(\ve{x}) \ra &=& \int_{-\infty}^\infty y(\ve{x}) f_i(\ve{x}) p(\ve{x}) ~d\ve{x} \\ &=& a_i ~~~~. \ea $$ An integral over a distribution can be approximated by an average over variables drawn from the distribution, providing an experimental means to find the coefficients $$ \ba a_i &=& \int_{-\infty}^\infty y(\ve{x}) f_i(\ve{x}) p(\ve{x})~d\ve{x} \\ &\approx& \frac{1}{N}\sum_{n=1}^N y_n f_i(\ve{x}_n) ~~~~. \ea $$ But where do these magical $f_i$'s come from? The previous functions that we constructed were orthogonal with respect to a flat distribution. To orthogonalize with respect to the experimental distribution all we need to do is replace the integrals with sums over the data, $$ \ba f_1(\vec x) &=& \frac{\phi_1(\ve{x})} {[\int_{-\infty}^\infty\phi_1(\ve{x})\phi_1(\ve{x})p(\ve{x})~d\ve{x}]^{1/2}} \\ &\approx& \frac{\phi_1(\ve{x})} {[\frac{1}{N}\sum_{n=1}^N\phi_1(\ve{x}_n)\phi_1(\ve{x}_n)]^{1/2}} \\ f_2'(\ve{x}) &=& \phi_2(\ve{x}) - f_1(\ve{x}) \int_{-\infty}^\infty f_1(\ve{x}) \phi_2(\ve{x}) p(\ve{x})~d\ve{x} \\ &\approx& \phi_2(\ve{x}) - f_1(\ve{x}) \frac{1}{N}\sum_{n=1}^N f_1(\ve{x}_n)\phi_2(\ve{x}_n) \ea $$ and so forth. 
This construction lets functions be fit to data by evaluating experimental expectations rather than doing an explicit search for the fit coefficients. Further work can be saved if the starting functions are chosen to satisfy known constraints in the problem, such as using sine functions to force the expansion to vanish at the boundaries of a rectangular region.
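A hedged Python sketch of this data-driven construction, with all the specific choices (Gaussian-distributed samples, a monomial starting basis, the target $y = x^2$ plus noise) purely illustrative: the monomials are orthogonalized with respect to the empirical distribution, and the expansion coefficients are then found by averaging rather than searching:

```python
import numpy as np

rng = np.random.default_rng(0)

# Data: x drawn from a distribution unknown to the fitter, y = x^2 + noise.
N = 10000
xd = rng.normal(0, 1, N)
yd = xd**2 + 0.01 * rng.normal(size=N)

# Basis phi_i(x) = x^i, kept as monomial coefficient vectors so the
# orthogonalized functions can be evaluated anywhere, not just at the data.
def evalp(c, x):
    return np.polynomial.polynomial.polyval(x, c)

# Empirical inner product: <u v> approximated by an average over the data.
inner = lambda cu, cv: np.mean(evalp(cu, xd) * evalp(cv, xd))

M = 3
fs = []
for i in range(M):
    c = np.zeros(M); c[i] = 1.0            # phi_i = x^i
    for fprev in fs:
        c = c - fprev * inner(fprev, c)    # subtract lower components
    fs.append(c / np.sqrt(inner(c, c)))    # normalize w.r.t. the data density

# Expansion coefficients by averaging: a_i ~ (1/N) sum_n y_n f_i(x_n)
a = [np.mean(yd * evalp(fi, xd)) for fi in fs]

def fit(x):
    return sum(ai * evalp(fi, x) for ai, fi in zip(a, fs))
```

Since $x^2$ lies in the span of the first three monomials, the fit reproduces it closely even at points outside the bulk of the data.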

Basis functions can of course be orthogonalized with respect to known distributions. For polynomials, many of these families have been named because of their value in mathematical methods, including the Hermite polynomials (orthogonal with respect to $e^{-x^2}$), the Laguerre polynomials (orthogonal with respect to $e^{-x}$), and the Chebyshev polynomials (orthogonal with respect to $(1-x^2)^{-1/2}$) {Arfken:95}.

Radial Basis Functions

Even dressed up as a family of orthogonal functions, there is a serious problem inherent in the use of any kind of polynomial expansion: the only way to improve a fit is to add higher-order terms that diverge ever more quickly. Matching data that isn't similarly diverging requires a delicate balancing of the coefficients, resulting in a function that is increasingly "wiggly." Even if it passes near the training data it will be useless for interpolation or extrapolation. This is particularly true for functions that have isolated features, such as discontinuities or sharp peaks.

Radial basis functions (RBFs) offer a sensible alternative to the hopeless practice of fighting divergences with still faster divergences. The idea is to use as the basis a set of identical functions that depend only on the radius from the data point to a set of "representative" locations $\vec{c}_i$: $$ y(\ve{x}) = \sum_{i=1}^M f(|\vec{x}-\vec{c}_i|;\ve{a}_i) ~~~~. $$ The $\ve{a}_i$ are coefficients associated with the $i$th center $\ve{c}_i$. Now extra terms can be added without changing how quickly the model diverges, and the centers can be placed where they're needed to improve the fit in particular areas. And unlike splines the basis functions are defined everywhere, eliminating the need to keep track of element boundaries.

If the centers are fixed and the coefficients enter linearly then fitting an RBF becomes a linear problem $$ y = \sum_{i=1}^M a_i f(|\vec{x}-\vec{c}_i|) ~~~~. $$ Common choices for these $f$ include linear ($f(r)=r$), cubic ($f(r)=r^3$), and fixed-width Gaussian ($f(r) = \exp(-r^2)$). $f(r)=r^2$ is not included in this list because it is equivalent to a single global second-order polynomial ($|\vec x - \vec c|^2 = \sum_j [x_j - c_j]^2$). For example, fitting $f(r)=r^3$ requires a singular value decomposition to invert $$ \left( \begin{array}{c c c c} |\vec x_1-\vec c_1|^3&|\vec x_1-\vec c_2|^3&\cdots&|\vec x_1-\vec c_M|^3\\ |\vec x_2-\vec c_1|^3&|\vec x_2-\vec c_2|^3&\cdots&|\vec x_2-\vec c_M|^3\\ |\vec x_3-\vec c_1|^3&|\vec x_3-\vec c_2|^3&\cdots&|\vec x_3-\vec c_M|^3\\ \vdots & \vdots & \ddots & \vdots \\ |\vec x_{N-1}-\vec c_1|^3&|\vec x_{N-1}-\vec c_2|^3&\cdots&|\vec x_{N-1}-\vec c_M|^3\\ |\vec x_N-\vec c_1|^3&|\vec x_N-\vec c_2|^3&\cdots&|\vec x_N-\vec c_M|^3 \end{array} \right) \left( \begin{array}{c} a_1 \\ a_2 \\ \vdots \\ a_M \end{array} \right) = \left( \begin{array}{c} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_{N-1} \\ y_N \end{array} \right) $$ to find the solution with the least squares error.
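In Python this least-squares problem can be set up and solved directly; `numpy.linalg.lstsq` uses an SVD-based routine internally. A minimal 1D sketch, with the sample data, noise level, and center placement chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Design matrix for a cubic RBF fit y(x) = sum_i a_i |x - c_i|^3 in 1D.
def rbf_design(x, centers):
    return np.abs(x[:, None] - centers[None, :])**3

N, M = 200, 10
x = np.linspace(0, 1, N)
y = np.sin(2 * np.pi * x) + 0.05 * rng.normal(size=N)
centers = np.linspace(0, 1, M)               # fixed, uniformly placed centers

A = rbf_design(x, centers)
a, *_ = np.linalg.lstsq(A, y, rcond=None)    # SVD-based least squares

yhat = A @ a
rms = np.sqrt(np.mean((yhat - y)**2))        # residual ~ the noise level
```

With ten centers the piecewise-cubic span is flexible enough that the residual approaches the injected noise level.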

Linear and cubic basis functions appear to violate the spirit of locality of RBFs because they expand the unknown function in a set of globally diverging functions, which might seem to require the same kind of delicate cancellation that dooms a high-order polynomial expansion. This intuition misses the fact that all of the terms have the same order, and so this can be a sensible thing to do. In fact, a possibly counter-intuitive result is that these diverging basis functions can have better error convergence properties than local ones {Powell:92}. They share the crucial attribute that if the data are rescaled so that $r\rightarrow\alpha r$, then $r^3\rightarrow (\alpha r)^3 = \alpha^3 r^3$: rescaling simply changes the amplitude of the term, which we are going to fit anyway. But for a Gaussian, $\exp(-r^2) \rightarrow \exp(-\alpha^2r^2)$ changes the variance of the Gaussian, which cannot be absorbed into the amplitude of the term.

There are many strategies for choosing the $\ve{c}_i$: randomly, uniformly, drawn from the data, uniformly on the support of the data (regions where the probability to see a point is nonzero). If the basis coefficients enter nonlinearly (for example, changing the variance of a Gaussian) then an iterative nonlinear search is needed and RBFs are closely related to the techniques to be discussed in Chapter {dense}. If the centers are allowed to move (for example, by varying the mean of a Gaussian), RBFs become a particular case of the general theory of approximation by clustering (Chapter {dense}). The last decision in using an RBF is the number of centers $M$, to be discussed in the next section.

Overfitting

Overfitting is a simple idea with deep implications. Any reasonable model with $N$ free parameters should be able to exactly fit $N$ data points, but that is almost always a bad idea. The goal of fitting usually is to be able to interpolate or extrapolate. If the data are at all noisy, a model that passes through every point is carefully fitting idiosyncrasies of the noise that do not generalize. On the other hand, if too few parameters are used, the model may be forced to not only ignore the noise but also miss the meaningful behavior of the data. Successful function fitting requires balancing underfitting, where there are model mismatch errors because the model cannot describe the data, with overfitting, where there are model estimation errors from poor parameter choices in a flexible model. This is another example of a bias--variance tradeoff: you must choose between a reliable estimate of a biased model, or a poor estimate of a model that is capable of a better fit.


Figure {fitcross}: Cross-validation.

Unlike the parameters of a model that are determined by fitting, the hyper-parameters that control the architecture of a model must be determined by some kind of procedure outside the fitting. Bootstrap sampling (Section {fiterrors}) provides a way to estimate the reliability of a model's parameters, but by mixing up the data used for training and evaluating the model it can easily lead to overconfidence in a model's ability to generalize on entirely new measurements. Cross-validation separates these functions. The idea is to fit your model on part of the data set, but evaluate the fit on another part of the data set that you have withheld (for example, by randomly choosing a subset of the data). Now repeat this procedure as you vary the complexity of the model, such as the number of RBF centers. The in-sample error will continue to decrease if the model is any good, but the out-of-sample error will initially decrease but then stop improving and possibly start increasing once the model begins to overfit (Figure {fitcross}). The best model is the one at the minimum of this curve (although the curve usually won't be this nice). Once this set of hyper-parameters has been found the model can be retrained on the full data set to make predictions for completely new values. The essential observation is that the only way to distinguish between a signal and noise, without any extra information, is by testing the model's generalization ability. Cross-validation is a sensible, useful way to do that, albeit with a great deal of associated debate about when and how to use it properly, e.g. {Goutte:97}.
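The procedure might be sketched as follows in Python, using the cubic RBF fit from earlier in the chapter with a simple 50/50 train/test split; all specific choices (target function, noise level, candidate values of $M$) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# Cubic RBF design matrix in 1D.
def design(x, centers):
    return np.abs(x[:, None] - centers[None, :])**3

# Noisy samples of a sine; half for training, half held out for validation.
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)
train, test = x[:100], x[100:]
ytr, yte = y[:100], y[100:]

def out_of_sample_error(M):
    """Train with M centers, report rms error on the held-out half."""
    centers = np.linspace(0, 1, M)
    a, *_ = np.linalg.lstsq(design(train, centers), ytr, rcond=None)
    resid = design(test, centers) @ a - yte
    return np.sqrt(np.mean(resid**2))

# Sweep the hyper-parameter M; pick the minimum of the validation curve.
errs = {M: out_of_sample_error(M) for M in (2, 5, 10, 20, 50)}
best_M = min(errs, key=errs.get)
```

Having chosen `best_M` this way, the model would then be retrained on all 200 points before making predictions.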

A strict Bayesian (correctly) views cross-validation as an {\em ad hoc} solution to finding hyper-parameters. It is better to use equation ({Bayesian}) and integrate over all possible models {Buntine:91}{Besag:95}. This is an enormous amount of work, and requires specifying priors on all the variables in the problem. There are some problems where this effort is justified, but for the common case of many equally acceptable models, cross-validation is a quick and easy way to find one of them.

Finally, just as the Fisher information (Section {Fisher}) can be used to bound the performance of an estimator, it is sometimes possible to relate the descriptive power of a model architecture to the amount of data needed to use it, in advance of making any measurements. The Vapnik--Chervonenkis (VC) dimension is equal to the size of the largest set of points that can be classified in all possible ways by a model, and for various architectures it can be used to find the worst-case scaling of the number of points required to learn a model to a given uncertainty {Vapnik:71}{Kearns:94}. Unfortunately, as with so many other grand unifying principles in learning theory, the real-world problems where such guidance is most needed are those where it is most difficult to apply.

Regularization

So far we've handled the trade-off between underfitting and overfitting by varying the model size. A more general framework for constraining flexible models is provided by the powerful concept of regularization {Weigend:90}{Girosi:95}. This is the name for the knob that any good algorithm should have that controls a trade-off between matching the data and enforcing your prior beliefs. As we saw in Section {modest}, maximum {\em a posteriori} estimation requires finding $$ \max_{\rm model}~p({\rm model}|{\rm data}) = \max_{\rm model} \frac{p({\rm data}|{\rm model})~p({\rm model})} {p({\rm data})} ~~~~, $$ or taking logarithms $$ \ba \max_{\rm model}~\log p({\rm model}|{\rm data}) &=& \log p({\rm data}|{\rm model}) \\ & & + \log p({\rm model}) - \log p({\rm data}) ~~. \ea $$ The first term on the right hand side measures how well the model matches the data. This could be a Gaussian error model, or it might be defined to be the profit in a stock trading model. The second term, called a prior, expresses advance beliefs about what kind of model is reasonable. The final term comes in only if training is done over multiple data sets. Taken together, these components define a cost function or penalty function to be searched to find the best model (using methods to be covered in the next chapter).

If there truly is no prior knowledge at all, then setting $p({\rm model})=1$ reduces to maximum likelihood estimation. But that's rarely the case, and even simple priors have great value. Although priors can be formally expressed as a probability distribution, they're usually just added to the cost function with a Lagrange multiplier. A common prior is the belief that the model should be locally linear unless the data forces it to do otherwise; this is measured by the integral of the square of the curvature of the model. If a least squares (Gaussian) error model is used then the quantity to be minimized for a 1D problem is $$ \tag{curve-reg} I = \sum_{n=1}^N [y_n-f(x_n,\ve{a})]^2 + \lambda \int \left[\dd{f}{x}\right]^2 dx ~~~~, $$ where $\ve{a}$ is the set of coefficients to be determined and the bounds of integration are over the interval being fit. The first term gives the agreement between the data and the model, and the second term depends on the model alone. Making the sum of both terms extremal by solving $\partial I/\partial\ve{a}=0$ (Chapter {vary}) for a particular value of $\lambda$ gives a set of equations to be solved to find the corresponding $\ve{a}$. The best value for $\lambda$ can then be found iteratively by a procedure such as cross-validation. Problem \ref{overfit} gives an example of the use of this regularizer.
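A discrete sketch of this regularizer in Python: on a grid the curvature integral becomes a sum of squared second differences, and minimizing the resulting quadratic cost reduces to a single linear solve. The grid size, noise level, and values of $\lambda$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Noisy samples of a smooth function on a uniform grid.
n = 100
x = np.linspace(0, 1, n)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=n)

# (n-2) x n second-difference operator approximating d^2 f / dx^2.
h = x[1] - x[0]
D = np.diff(np.eye(n), 2, axis=0) / h**2

def smooth(lam):
    """Minimize sum (y_j - f_j)^2 + lam * sum (f'')^2 over grid values f.

    Setting the gradient to zero gives (I + lam * D^T D) f = y.
    """
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)

f_rough = smooth(1e-12)   # essentially unregularized: follows the noise
f_smooth = smooth(1e-6)   # moderate lambda: suppresses the noise curvature
```

Increasing $\lambda$ trades data agreement for smoothness; here the smoothed solution has both lower curvature and a smaller error against the underlying sine.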

Other common regularizers include the entropy of a distribution, expressing the belief that the distribution should be flat, and the total magnitude of a model's coefficients, seeking the most parsimonious model. Regularization plays a central role in the theory of inverse problems {Parker:77}{Engl:96}, the essential task that arises throughout science and engineering of deducing the state of a system from a set of measurements that underdetermine it.

Flexible models can have an impressive ability to find a useful model from a huge space of possibilities, but they aren't magic, and they don't replace the need to think {Gershenfeld:93c}. Clever training alone is not a cure for bad data. In practical applications much of the progress comes from collecting good training data sets, finding appropriate features to use for the model inputs and outputs, and imposing domain-specific constraints on the model to improve its ability to generalize from limited observations. Salvation lies in these details. Given unlimited data and unlimited resolution almost any reasonable model should work with data in whatever form you present it, but in the real world of imperfect measurements, insight into model architectures must be enhanced by a combination of advance analysis of a system and experimentation with the data.

Curse of Dimensionality

We've seen two broad classes of basis functions: those with linear coefficients $$ y(\ve{x}) = \sum_{i=1}^M a_i f_i(\vec{x}) ~~~~, \tag{func-lin} $$ and those with coefficients that enter nonlinearly $$ y(\ve{x}) = \sum_{i=1}^M f_i(\vec{x};\vec{a}_i) ~~~~. \tag{func-nonlin} $$ The former has a single global least-squares minimum that can be found by singular value decomposition; in general the latter requires an iterative search with the attendant problems of avoiding local minima and deciding when to stop. It's impossible to find an algorithm that can solve an arbitrary nonlinear search problem in a single step because the function $f$ could be chosen to implement the mapping performed by a general-purpose computer from an input program $\ve{x}$ to an output $y$; if such an omniscient search algorithm existed it would be able to solve computationally intractable problems {Garey:79}.

For a nonparametric fit where we have no idea what the functional form should be, why would we ever want to use nonlinear coefficients? The answer is the curse of dimensionality. This ominous-sounding problem is invoked by a few different incantations, all relating to serious difficulties that occur as the dimensionality of a problem is increased. Let's say that we want to do a second-order polynomial fit. In 1D this is simply $$ y = a_0 + a_1 x + a_2 x^2 ~~~~. $$ In 2D, we have $$ y = a_0 + a_1 x_1 + a_2 x_2 + a_3 x_1 x_2 + a_4 x_1^2 + a_5 x_2^2 ~~~~. $$ We've gone from three to six terms. In 3D we need ten terms: $$ \ba y &=& a_0 + a_1 x_1 + a_2 x_2 + a_3 x_3 + a_4 x_1 x_2 + a_5 x_1 x_3 + a_6 x_2 x_3 \\ & & + a_7 x_1^2 + a_8 x_2^2 + a_9 x_3^2 ~~~~, \ea $$ and in $d$ dimensions we need $1+2d+d(d-1)/2$ terms. As $d$ is increased the quadratic $d^2$ part quickly dominates. If we try to get around this by leaving off some of the polynomial terms then we do not let those particular variables interact at that order. As we increase the order of the polynomial, the exponent of the number of terms required increases accordingly. A 10th-order fit in 10D will require on the order of $10^{10}$ terms, clearly prohibitive. This rapid increase in the number of terms required is an example of the curse of dimensionality. Because of this, many algorithms that give impressive results for a low-dimensional test problem fail miserably for a realistic high-dimensional problem.

The reason to use nonlinear coefficients is to avoid the curse. Given some mild assumptions, a linear expansion of the form of equation ({func-lin}) will have a typical approximation error of ${\cal O}(1/M^{2/d})$ when fitting an arbitrary but reasonable function (one with a bound on the first moment of its Fourier transform), where $M$ is the number of terms and $d$ is the dimension of the space, while a nonlinear expansion of the form of equation ({func-nonlin}) will have an error of order ${\cal O}(1/M)$ {Barron:93}. In low dimensions the difference is small, but in high dimensions it is essential: exponentially more terms are needed if linear coefficients are used. The nonlinear coefficients are much more powerful, able to steer the basis functions where they are needed in a high-dimensional space. Therefore, in low dimensions it's crazy not to use linear coefficients, while in high dimensions it is crazy to use them.

There is another sense of the curse of dimensionality to be mindful of. In a high-dimensional space, everything is surface. To understand this consider the volume of a hyper-sphere of radius $r$ in a $d$-dimensional space: $$ V(r) = \frac{\pi^{d/2}r^d}{(d/2)!} $$ (noninteger factorials are defined by the gamma function $\Gamma(n+1)=n!$). Now let's look at the fraction of volume in a shell of thickness $\eps$, compared to the total volume of the sphere, in the limit that $d\rightarrow\infty$: $$ \ba \frac{V(r)-V(r-\eps)}{V(r)} &=& \frac{r^d-(r-\eps)^d}{r^d} \\ &=& 1-\underbrace{\left(1-\frac{\eps}{r}\right)^d}_{\disp \lim_{d\rightarrow\infty} = 0} \\ &=& 1 ~~~~. \ea $$ As $d\rightarrow\infty$ all of the volume is in a thin shell near the surface. This is because as $d$ is increased there is more surface available. Now think about what this implies for data collection. "Typical" points come from the interior of a distribution, not near the edges, but in a high-dimensional space distributions are essentially all edges. This means that in a huge data set there might not be a single point in the interior of the distribution, and so any analysis will be dominated by edge effects. Most obviously, the data will appear to be drawn from a lower-dimensional distribution {Gershenfeld:92}. Even if your basis functions work in high dimensions you may not be able to collect enough data to use them.
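This shell-volume calculation is easy to verify numerically; a short Python sketch (the particular dimensions and shell thickness are illustrative):

```python
import numpy as np

# Fraction of a unit d-sphere's volume within eps of its surface:
# 1 - (1 - eps)^d, which approaches 1 as the dimension d grows.
def shell_fraction(d, eps=0.01):
    return 1.0 - (1.0 - eps)**d

fracs = {d: shell_fraction(d) for d in (1, 10, 100, 1000)}

# Equivalently, radii of points drawn uniformly from the unit d-ball are
# distributed as u^(1/d) for u uniform on [0, 1]: almost all of the
# samples lie close to the surface when d is large.
rng = np.random.default_rng(4)
r = rng.uniform(size=100000)**(1.0 / 100)   # d = 100
median_radius = np.median(r)                # near 0.5^(1/100), about 0.993
```

Already at $d=100$ the median sample sits within one percent of the surface, so a data set drawn from such a distribution has essentially no interior points.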

All is not lost, however; in the next chapter we'll see that lurking behind these ideas is also what could be called a blessing of dimensionality.

Estimation, Fisher Information, and the Cramér-Rao Inequality

We've seen increasingly powerful techniques to fit functions to data and find errors. Is there no limit to this cleverness? Unfortunately, and not surprisingly, there is indeed a limit on how much information about unknown parameters can be extracted from a set of measurements. This chapter closes with a view of the information in a probability distribution that sets a limit on the accuracy of measurements.

Let $p_\alpha(x)$ be a probability distribution that depends on a parameter $\alpha$ (such as the variance of a Gaussian); the goal is to estimate the value of $\alpha$ from a series of measurements of $x$. Let $f(x_1,x_2,\ldots,x_N)$ be the estimator of $\alpha$. It is biased if $\langle f(x_1,x_2,\ldots,x_N)\rangle \neq \alpha$, and it is consistent if $\lim_{N\rightarrow\infty} f(x_1,x_2,\ldots,x_N) = \alpha$. An estimator $f_1$ dominates $f_2$ if $\langle (f_1(x_1,x_2,\ldots,x_N)-\alpha)^2\rangle \leq \langle (f_2(x_1,x_2,\ldots,x_N)-\alpha)^2\rangle$. This raises the question: what is the minimum variance possible for an unbiased estimator of $\alpha$? The answer is given by the Cramér--Rao bound.

Start by defining the score: $$ V = \frac{\partial}{\partial\alpha} \log p_\alpha(x) = \frac{\partial_\alpha p_\alpha(x)}{p_\alpha(x)} ~~~~. $$ The expected value of the score is $$ \ba \langle V \rangle & = & \int_{-\infty}^{\infty} p_\alpha(x) \frac{\partial_\alpha p_\alpha(x)}{p_\alpha(x)}~dx \\ & = & \int_{-\infty}^{\infty} \partial_\alpha p_\alpha(x)~dx \\ & = & \partial_\alpha \int_{-\infty}^{\infty} p_\alpha(x)~dx \\ & = & \partial_\alpha 1 \\ & = & 0 ~~~~. \ea $$ This means that $\sigma^2(V) = \langle V^2 \rangle$. The variance of the score is called the Fisher information: $$ J(\alpha) = \langle [\partial_\alpha \log p_\alpha(x)]^2 \rangle ~~~~. $$ The score for a set of independent, identically distributed variables is the sum of the individual scores $$ \ba V(x_1,x_2,\ldots,x_N) & = & \partial_\alpha \log p_\alpha(x_1,x_2,\ldots,x_N) \\ & = & \sum_{n=1}^N \partial_\alpha \log p_\alpha(x_n) \\ & = & \sum_{n=1}^N V(x_n) ~~~~, \ea $$ and so the Fisher information for the set is $$ \ba J_N(\alpha) & = & \langle [\partial_\alpha \log p_\alpha(x_1,x_2,\ldots,x_N)]^2\rangle \\ & = & \langle V^2(x_1,x_2,\ldots,x_N) \rangle \\ & = & \left\langle \left(\sum_{n=1}^N V(x_n)\right)^2\right\rangle \\ & = & \sum_{n=1}^N \langle V^2(x_n) \rangle \\ & = & N~J(\alpha) \ea $$ (remember that the individual scores are uncorrelated).

The Cramér-Rao inequality states that the mean square error of an unbiased estimator $f$ of $\alpha$ is lower bounded by the reciprocal of the Fisher information: $$ \sigma^2(f) \geq \frac{1}{J(\alpha)} ~~~~. $$ To prove this, start with the Cauchy--Schwarz inequality $$ \ba \langle (V-\la V\ra)(f-\la f\ra)\ra^2 & \leq & \la (V-\la V\ra)^2\ra \la (f-\la f\ra)^2\ra \\ \la Vf - \la V \ra f - V \la f \ra + \la V\ra \la f\ra \ra^2 & \leq & \la V^2 - 2 V \la V \ra + \la V \ra ^2 \ra \la (f-\la f\ra)^2\ra \\ \la Vf \ra^2 & \leq & \la V^2 \ra \la (f-\la f\ra)^2\ra \\ \la Vf \ra^2 & \leq & J(\alpha) ~ \sigma^2(f) \ea $$ (remember $\la V \ra = 0$). The lefthand side equals one: $$ \ba \langle Vf \rangle & = & \int_{-\infty}^\infty \frac{\partial_\alpha p_\alpha(x)}{p_\alpha(x)} f(x) p_\alpha(x)~dx \\ & = & \int_{-\infty}^\infty \partial_\alpha p_\alpha(x) f(x)~dx \\ & = & \partial_\alpha \int_{-\infty}^\infty p_\alpha(x) f(x)~dx \\ & = & \partial_\alpha \langle f(x) \rangle \\ & = & \partial_\alpha \alpha \\ & = & 1 ~~~~, \ea $$ thus proving the Cramér--Rao inequality.
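A Monte Carlo sketch of the bound (in Python, with illustrative parameter choices) for the familiar case of estimating the mean of a Gaussian with known variance, where the sample mean is unbiased and attains the bound:

```python
import numpy as np

rng = np.random.default_rng(5)

# For N(mu, sigma^2) with sigma known, the score w.r.t. mu gives Fisher
# information J(mu) = 1/sigma^2 per sample, so the Cramer-Rao bound on an
# unbiased estimator from N samples is sigma^2 / N; the sample mean attains it.
mu, sigma, N, trials = 1.0, 2.0, 50, 20000
bound = sigma**2 / N

samples = rng.normal(mu, sigma, size=(trials, N))
est = samples.mean(axis=1)       # sample-mean estimator, once per trial
var_est = est.var()              # should match sigma^2 / N up to Monte Carlo error
```

The empirical variance of the estimator lands on the bound to within sampling fluctuations, while its average confirms it is unbiased.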

Just like the information theoretic channel capacity, this sets a lower limit on what is possible but does not provide any guidance in finding the minimum variance unbiased estimator. The inequality measures how much information the distribution provides about a parameter. Not surprisingly, the Fisher information can be related to the entropy of the distribution; this is done by de Bruijn's identity {Cover:06}. Roughly, the entropy measures the volume and the Fisher information measures the surface of the distribution.

One final caution about the Cramér-Rao bound. Not only may it not be reachable in practice, but it may be misleading because it is a bound on unbiased estimators. Unbiased does not necessarily mean better: it is possible for a biased estimator to dominate an unbiased one (as well as have other desirable characteristics). However, just like channel capacity, although it should not be taken too literally it does provide a good rough estimate of what is plausible and what is not.

References

Press, William H., Teukolsky, Saul A., Vetterling, William T., & Flannery, Brian P. (2007). Numerical Recipes: The Art of Scientific Computing. 3rd edn. Cambridge: Cambridge University Press.

Numerical Recipes is particularly strong for function fitting.

Cover, Thomas M., & Thomas, Joy A. (2006). Elements of Information Theory. 2nd edn. New York: Wiley-Interscience.

Good coverage of the many connections between information theory and statistics.

Problems

  1. Pick a function with linear coefficients, randomly sample from it with added noise, use a SVD to fit the function coefficients, and evaluate the fit errors:

    1. With equation ({svderror})
    2. By fitting an ensemble of independent data sets
    3. By bootstrap sampling from one data set
  2. Pick a function with nonlinear coefficients, randomly sample from it with added noise, and write a Levenberg-Marquardt function to fit the function parameters.

    1. Fit a function with linear coefficients to these data, using cross-validation to find the optimum model size.
  3. This problem is harder. An alternative way to choose among models is to select the one that makes the weakest assumptions about the data; this is the purpose of maximum entropy methods. Assume that what is measured is a set of expectation values for functions $f_i$ of a random variable $x$, $$ \tag{fit-exp} \la f_i(x) \ra = \int_{-\infty}^\infty p(x) f_i(x) ~ dx ~~~~. $$

    1. Given these measurements, find the compatible normalized probability distribution $p(x)$ that maximizes the differential entropy $$ S = -\int_{-\infty}^\infty p(x) \log p(x) ~ dx ~~~~. $$
    2. What is the maximum entropy distribution if we know only the second moment $$ \sigma^2 = \int_{-\infty}^\infty p(x) ~ x^2 ~ dx ~~~~? $$
    3. Now consider the reverse situation. Let's say that we know that a data set $\{x_n\}_{n=1}^N$ was drawn from a Gaussian distribution with variance $\sigma^2$ and unknown mean $\mu$. Try to find an optimal estimator of the mean (one that is unbiased and has the smallest possible error in the estimate).

(c) Neil Gershenfeld 3/31/26