Density Estimation

The Nature of Mathematical Modeling (draft)

$ \def\%#1{\mathbf{#1}} \def\mat#1{\mathbf{#1}} \def\*#1{\vec{#1}} \def\ve#1{\vec{#1}} \def\ds#1#2{\cfrac{d #1}{d #2}} \def\dd#1#2{\cfrac{d^2 #1}{d #2^2}} \def\ps#1#2{\cfrac{\partial #1}{\partial #2}} \def\pp#1#2{\cfrac{\p^2 #1}{\p #2^2}} \def\p{\partial} \def\ba{\begin{eqnarray*}} \def\ea{\end{eqnarray*}} \def\eps{\epsilon} \def\del{\nabla} \def\disp{\displaystyle} \def\la{\langle} \def\ra{\rangle} \def\unit#1{~({\rm #1)}} \def\units#1#2{~\left(\frac{\rm #1}{\rm #2}\right)} \def\argmax#1{\mathrel{\mathop{\mathrm {argmax}}\limits_{#1}}} $

We've learned more and more about how to describe and fit functions, but the decision to fit a function can itself be unduly restrictive. Instead of assuming a noise model for the data (such as Gaussianity), why not go ahead and deduce the underlying probability distribution from which the data were drawn? Given the distribution (or probability density) any other quantity of interest, such as a conditional forecast of a new observation, can be derived. Perhaps because this is such an ambitious goal it is surprisingly poorly covered in the literature and in a typical education, but in fact it is not only possible but extremely useful.

This chapter will cover density estimation at three levels of generality and complexity. First will be methods based on binning the data that are easy to implement but that can require impractical amounts of data. This is solved by casting density estimation as a problem in functional approximation, but that approach can have trouble representing all of the things that a density can do. The final algorithms are based on clustering, merging the desirable features of the preceeding ones while avoiding the attendant liabilities.

Histogramming, Sorting, and Trees

Given a set of measurements $\{x_n\}_{n=1}^N$ that were drawn from an unknown distribution $p(x)$, the simplest way to model the density is by histogramming $x$. If $N_i$ is the number of observations between $x_i$ and $x_{i+1}$, then the probability $p(x_i)$ to see a new point in the bin between $x_i$ and $x_{i+1}$ is approximately $N_i/N$ (by the Law of Large Numbers).

This idea is simple but not very useful. The first problem is the amount of storage required. Let's say that we're looking at a 15-dimensional data set sampled at a 16-bit resolution. Binning it would require an array of $2^{16\times 15} \approx 10^{72}$ elements, more elements than there are atoms in the universe! But this is an unreasonable approach because most of those bins would be empty. The storage requirement becomes feasible if only occupied bins are allocated. This might appear to be circular reasoning, but in fact can be done quite simply.

The trick is to perform a lexicographic sort. The digits of each vector are appended to make one long number, in the preceeding example $15\times 16=240$ bits long. Then the one-dimensional string of numbers is sorted, requiring ${\cal O}(N\log N)$ operations. A final pass through the sorted numbers counts the number of times each number appears, giving the occupancy of the multi-dimensional bin uniquely indexed by that number.

Since the resolution is known in advance this can further be reduced to a linear time algorithm. Sorting takes ${\cal O}(N\log N)$ when the primitive operation is a comparison of two elements, but it can be done in linear time at a fixed resolution by sorting on a tree. This is shown schematically in Figure {treesort}. Each bit of the appended number is used to decide to branch right or left in the tree. If a node already exists then the count of points passing through it is incremented by one, and if the node does not exist then it is created. Descending through the tree takes a time proportional to the number of bits, and hence the dimension, and the total running time is proportional to the number of points. The storage depends on the amount of space the data set occupies (an idea that will be revisited in Chapter {time} to characterize a time series).

Screenshot from 2026-05-03 21-02-40.png Figure: Multi-dimensional binning on a binary tree.

Sorting on a tree makes it tractable to accumulate a multi-dimensional histogram but it leaves another problem: the number of points in each bin is variable. Assuming Poisson counting errors, this means that the relative uncertainty of the count of each bin ranges from 100\% for a bin with a single point in it to a small fraction for a bin with a large occupancy. Many operations (such as taking a logarithm in calculating an entropy) will magnify the influence of the large errors associated with the low probability bins.

The solution to this problem is to use bins that have a fixed occupancy rather than a fixed size. These can be constructed by using a k-D tree {Preparata:85}, shown for 2D in Figure {densekd}. A vertical cut is initially chosen at a location $x_1^1$ that has half of the data on the left and half on the right. Then each of these halves is cut horizontally at $y_1^1$ and $y_2^1$ to leave half of the data on top and half on bottom. These four bins are next cut horizontally at $x_{1-4}^2$, and so forth. In a higher-dimensional space the cuts cycle over the directions of the axes. By construction each bin has the same number of points, but the volume of the bin varies. Since the probability is estimated by the number of points divided by the volume, instead of varying the number of points for a given volume as was done before we can divide the fixed number of points by the varying volumes. This gives a density estimate with constant error per bin.

Screenshot from 2026-05-03 21-03-00.png

Figure: $k$-D tree binning in 2D.

Building a $k$-D tree is more work than fixed-mass binning on a tree, particularly if new points need to be added later (which can require moving many of the $k$-D tree partitions). There is a simple trick that approximates fixed-volume binning while retaining the convenience of a linear-time tree. The idea is to interleave the bits of each coordinate of the point to be sorted rather than appending them, so that each succeeding bit is associated with a different direction. Then the path that the sorted sequence of numbers takes through the space is a fractal curve {Mandlebrot:83}, shown in 2D in Figure {fractree}. This has the property that the average distance between two points in the space is proportional to the linear distance along the curve connecting them. Therefore the difference between the indices of two bins estimates the volume that covers them.

Screenshot from 2026-05-03 21-03-16.png

Figure: Fractal binning in 2D.

Even histogramming with constant error per bin still has a serious problem. Let's say that we're modestly looking at three-dimensional data sampled at an 8-bit resolution. If it is random with a uniform probability distribution and is histogrammed in $M$ bins then the probability of each bin will be roughly $1/M$. The entropy of this distribution is $$ \ba H &=& -\sum_{i=1}^M p_i\log_2 p_i \\ &=& \log_2 M \\ \Rightarrow 2^H &=& M ~~~~. \ea $$ But we know that the entropy is $H=8\times 3=24$ bits, therefore there must be $2^{24}\approx 10^7$ occupied bins or the entropy will be underestimated. Since the smallest possible data set would have one point per bin this means that at least 10 million points must be used.

The real problem is that histogramming has no notion of neighborhood. Two adjacent bins could represent different letters in an alphabet just as well as nearby parts of a space, ignoring the strong constraint that neighboring points might be expected to behave similarly. What's needed is some kind of functional approximation that can make predictions about the density in locations where points haven't been observed.

Fitting Distributions

Let's say that we're given a set of data points $\{x_n\}_{n=1}^N$ and want to infer the density $p(x)$ from which they were drawn. If we're willing to add some kind of prior belief, for example that nearby points behave similarly, then we can hope to find a functional representation for $p(x)$. This must of course be normalized to be a valid density, and a possible further constraint is compact support: the density is restricted to be nonzero on a bounded set. In this section we'll assume that the density vanishes outside the interval [0,1]; the generalization is straightforward.

One way to relate this problem to what we've already studied is to introduce the cumulative distribution $$ P(x) \equiv \int_0^x p(x)~dx ~~~~. $$ If we could find the cumulative distribution then the density follows by differentiation, $$ p(x) = \ds{P}{x} ~~~~. $$ There is in fact an easy way to find a starting guess for $P(x)$: sort the data set. $P(x_i)$ is defined to be the fraction of points below $x_i$, therefore if $i$ is the index of where point $x_i$ appears in the sorted set, then $$ y_i \equiv \frac{i}{N+1} \approx P(x_i) ~~~~. $$ The denominator is taken to be $N+1$ instead of $N$ because the normalization constraint is effectively an extra point $P(1)=1$.

To estimate the error, recognize that $P(x_i)$ is the probability to draw a new point below $x_i$, and that $NP(x_i)$ is the expected number of such points in a data set of $N$ points. If we assume Gaussian errors then the variance around this mean is $NP(x_i)[1-P(x_i)]$ {Gershenfeld:00_pit}. Since $y_i = i/(N+1)$, the standard deviation in $y_i$ is the standard deviation in the number of points below it, $i$, divided by $N+1$, $$ \ba \sigma_i &=& \frac{1}{N+1} \left\{ N P(x_i)[1-P(x_i)] \right\}^{1/2} \\ &=& \frac{1}{N+1} \left[ N \frac{i}{N+1} \left(1-\frac{i}{N+1}\right) \right]^{1/2} ~~~~. \tag{trigexpand} \ea $$ This properly vanishes at the boundaries where we know that $P(0)=0$ and $P(1)=1$.

We can now fit a function to the data set $\{x_i,y_i,\sigma_i\}$. A convenient parameterization is $$ \ba P(x) &=& x+\sum_{m=1}^M a_m \sin(m\pi x) \\ \Rightarrow p(x) &=& 1+\sum_{m=1}^M a_m m\pi \cos(m\pi x) ~~~~, \ea $$ which enforces the boundary conditions $P(0)=0$ and $P(1)=1$ regardless of the choice of coefficients (this representation also imposes the constraint that $dp/dx=0$ at $x=0$ and 1, which may or may not be appropriate).

One way to impose the prior that nearby points behave similarly is to use the integral square curvature of $P(x)$ as a regularizer (Section {regularization}), which seeks to minimize the quantity $$ I_0 = \int_0^1 \left( \dd{P}{x} \right)^2~dx ~~~~. $$ The model mismatch is given by $$ I_1 = \frac{1}{N} \sum_{i=1}^N \left[ \frac{P(x_i)-y_i}{\sigma_i} \right]^2 ~~~~. $$ The expected value of $I_1$ is 1; we don't want to minimize it because passing exactly through the data is a very unlikely event given the uncertainty. Therefore we want to impose this constraint by introducing a Lagrange multiplier in a variational sum $$ I = I_0 + \lambda I_1 ~~~~. $$ Solving $$ \ps{I}{a_m}=0 $$ for all $m$ gives the set of coefficients $\{a_m\}$ as a function of $\lambda$, and then given the coefficients we can do a one-dimensional search to find the value of $\lambda$ that results in $I_1=1$.

Plugging in the expansion (trigexpand) and doing the derivatives and integrals shows that if the matrix $\mat{A}$ is defined by $$ A_{lm} = 2\pi^4 l^2 m^2 \left\{ \begin{array}{l l} \displaystyle\frac{1}{2(l^2-m^2)\pi} \left\{ (l+m)\sin[(l-m)\pi] \right. \\ ~~~~~~~~~~~ \left. -(l-m)\sin[(l+m)\pi] \right\} & (l\ne m)~~, \\ \frac{1}{2} - \sin(2m\pi)/(4m\pi) & (l=m) \end{array}\right. $$ the matrix $\mat{B}$ by $$ B_{lm} = \frac{2}{N} \sum_{i=1}^N \sin(l\pi x_i) \sin(m\pi x_i) / \sigma_i^2 ~~~~, $$ and the vector $\vec{c}$ by $$ c_m = \frac{2}{N} \sum_{i=1}^N (y_i-x_i) \sin(m\pi x_i) / \sigma_i^2 ~~~~, $$ then the vector of expansion coefficients $\vec{a}=(a_1, \ldots, a_M)$ is given by $$ \vec{a} = \lambda (\mat{A}+\lambda\mat{B})^{-1}\cdot\vec{c} ~~~~. $$

An example is shown in Figure {densereg}. Fifty points were drawn from a Gaussian distribution and the cumulative density was fit using 50 basis functions. Even though there are as many coefficients as data points, the regularization insures that the resulting density does not overfit the data. This figure also demonstrates that the presence of the regularizer biases the estimate of the variance upwards, because the curvature is minimized if the fitted curve starts above the cumulative histogram and ends up below it.

Screenshot from 2026-05-03 21-03-35.png

Figure: Cumulative histogram (solid) and regularized fit (dashed), for the points shown along the axis drawn from a Gaussian distribution. Differentiating gives the resulting density estimate.

Expectation-Maximization (E-M)

  • apparently circular logic
    • use model to update latent (hidden) variables
    • use latent variables to updade model
  • iteration maximizes likelihood
  • can use momentum for local minima

  • Clustering

    • group similar data points
  • k-means
    • E-M clustering algorithm
    • choose a set of anchor points randomly from data
    • iteration
      • assign data points to the closest anchor
      • update anchor to the mean of closest points
    • can assess number of clusters by plotting total distance of points to anchors versus their number and look for the "elbow"
  • Voronoi tesselation
    • partition of space closest to anchor points

example: k-means

import matplotlib.pyplot as plt
from scipy.spatial import Voronoi,voronoi_plot_2d
import numpy as np
import time
#
# k-means parameters
#
npts = 1000
nsteps = 1000
momentum = 0.
xs = [0,5,10]
ys = [0,10,5]
np.random.seed(10)
#
# generate random data
#
x = np.array([])
y = np.array([])
for i in range(len(xs)):
    x = np.append(x,np.random.normal(loc=xs[i],scale=1,size=npts))
    y = np.append(y,np.random.normal(loc=ys[i],scale=1,size=npts))

def kmeans(x,y,momentum,nclusters):
    #
    # choose starting points
    #
    indices = np.random.uniform(low=0,high=len(x),size=nclusters).astype(int)
    mux = x[indices]
    muy = y[indices]
    #
    # do k-means iteration
    #
    for i in range(nsteps):
        #
        # find closest points
        #
        X = np.outer(x,np.ones(len(mux)))
        Y = np.outer(y,np.ones(len(muy)))
        Mux = np.outer(np.ones(len(x)),mux)
        Muy = np.outer(np.ones(len(x)),muy)
        distances = np.sqrt((X-Mux)**2+(Y-Muy)**2)
        mins = np.argmin(distances,axis=1)
        #
        # update means
        #
        for i in range(len(mux)):
            index = np.where(mins == i)
            mux[i] = np.sum(x[index])/len(index[0])
            muy[i] = np.sum(y[index])/len(index[0])
    #
    # find distances
    #
    distances = 0
    for i in range(len(mux)):
        index = np.where(mins == i)
        distances += np.sum(np.sqrt((x[index]-mux[i])**2+(y[index]-muy[i])**2))
    return mux,muy,distances

def plot_kmeans(x,y,mux,muy):
    xmin = np.min(x)
    xmax = np.max(x)
    ymin = np.min(y)
    ymax = np.max(y)
    fig,ax = plt.subplots()
    plt.plot(x,y,'.')
    plt.plot(mux,muy,'r.',markersize=20)
    plt.xlim(xmin,xmax)
    plt.ylim(xmin,xmax)
    plt.title(f"{len(mux)} clusters")
    plt.show()

def plot_Voronoi(x,y,mux,muy):
    xmin = np.min(x)
    xmax = np.max(x)
    ymin = np.min(y)
    ymax = np.max(y)
    fig,ax = plt.subplots()
    plt.plot(x,y,'.')
    vor = Voronoi(np.stack((mux,muy),axis=1))
    voronoi_plot_2d(vor,ax=ax,show_points=True,show_vertices=False,point_size=20)
    plt.xlim(xmin,xmax)
    plt.ylim(xmin,xmax)
    plt.title(f"{len(mux)} clusters")
    plt.show()

distances = np.zeros(5)

mux,muy,distances[0] = kmeans(x,y,momentum,1)
plot_kmeans(x,y,mux,muy)

mux,muy,distances[1] = kmeans(x,y,momentum,2)
plot_kmeans(x,y,mux,muy)

mux,muy,distances[2] = kmeans(x,y,momentum,3)
plot_Voronoi(x,y,mux,muy)

mux,muy,distances[3] = kmeans(x,y,momentum,4)
plot_Voronoi(x,y,mux,muy)

mux,muy,distances[4] = kmeans(x,y,momentum,5)
plot_Voronoi(x,y,mux,muy)

fig,ax = plt.subplots()
plt.plot(np.arange(1,6),distances,'o')
plt.xlabel('number of clusters')
plt.ylabel('total distances to clusters')
ax.xaxis.get_major_locator().set_params(integer=True)
plt.show()

Mixture Density Estimation

Fitting a function lets us generalize our density estimate away from measured points, but it unfortunately does not handle many other kinds of generalization very well. The first problem is that it can be hard to express local beliefs with a global prior. For example, if a smooth curve has some discontinuities, then a smoothness prior will round out the discontinuities, and a maximum entropy prior will fit the discontinuity but miss the smoothness (Figure {regprob}). What's needed is a way to express a statement like "the density is smooth everywhere, except for where it isn't."

Screenshot from 2026-05-03 21-03-54.png

Figure: The problem with global regularization. A smoothness prior (dashed line) misses the discontinuty in the data (solid line); a maximum entropy prior (dotted) misses the smoothness.

Another problem is with the kinds of functions that need to be represented. Consider points distributed on a low-dimensional surface in a high-dimensional space. Transverse to the surface the distribution is very narrow, something that is hard to expand with trigonometric or polynomial basis functions.

These problems with capturing local behavior suggest that density estimation should be done using local rather than global functions. Kernel density estimation {Silverman:86} does this by placing some kind of smooth bump, such as a Gaussian, on each data point. An obvious disadvantage of this approach is that the resulting model requires retaining all of the data. A better approach is to find interesting places to put a smaller number of local functions that can model larger neighborhoods. This is done by mixture models {McLachlan:88}, which are closely connected to the problem of splitting up a data set by clustering, and are an example of unsupervised learning. Unlike function fitting with a known target, the algorithm must learn for itself where the interesting places in the data set are.

In $D$ dimensions a mixture model can be written by factoring the density over multivariate Gaussians $$ \ba \tag{Gaussfact} p(\ve{x}) &=& \sum_{m=1}^M p(\ve{x},c_m) \\ &=& \sum_{m=1}^M p(\ve{x}|c_m)~p(c_m) \\ &=& \sum_{m=1}^M \frac{|\mat{C}_m^{-1}|^{1/2}}{(2\pi)^{D/2}} e^{-(\ve{x}-\ve{\mu}_m)^T\cdot\mat{C}_M^{-1}\cdot(\ve{x}-\ve{\mu}_m)/2} ~p(c_m) ~~, \ea $$ where $|\cdot|^{1/2}$ is the square root of the determinant, and $c_m$ refers to the $m$th Gaussian with mean $\ve{\mu}_m$ and covariance matrix $\mat{C}_m$. The challenge of course is to find these parameters.

If we had a single Gaussian, the mean value $\ve{\mu}$ could be estimated simply by averaging the data, $$ \ba \ve{\mu} &=& \int_{-\infty}^\infty \ve{x}~p(\ve{x})~d\ve{x} \\ &\approx& \frac{1}{N} \sum_{n=1}^N \ve{x}_n ~~~~. \ea $$ The second line follows because an integral over a density can be approximated by a sum over variables drawn from the density; we don't know the density but by definition it is the one our data set was taken from. This idea can be extended to more Gaussians by recognizing that the $m$th mean is the integral with respect to the conditional distribution, $$ \ba \ve{\mu}_m &=& \int \ve{x}~p(\ve{x}|c_m)~\ve{dx} \\ &=& \int \ve{x}~\frac{p(c_m|\ve{x})}{p(c_m)} p(\ve{x})~d\ve{x} \\ &\approx& \frac{1}{Np(c_m)} \sum_{n=1}^N \ve{x}_n~p(c_m|\ve{x}_n) ~~~~. \ea $$ Similarly, the covariance matrix could be found from $$ \mat{C}_m \approx \frac{1}{Np(c_m)} \sum_{n=1}^N (\ve{x}_n-\ve{\mu}_m)(\ve{x}_n-\ve{\mu}_m)^T~p(c_m|x_n) ~~~~,\n $$ and the expansion weights by $$ \ba p(c_m) &=& \int_{-\infty}^\infty p(\ve{x},c_m)~d\ve{x} \\ &=& \int_{-\infty}^\infty p(c_m|\ve{x})~p(\ve{x})~d\ve{x} \\ &\approx& \frac{1}{N} \sum_{n=1}^N p(c_m|\ve{x}_n) ~~~~. \ea $$

But how do we find the posterior probability $p(c_m|\ve{x})$ used in these sums? By definition it is $$ \ba \tag{posterior} p(c_m|\ve{x}) &=& \frac{p(\ve{x},c_m)}{p(\ve{x})} \\ &=& \frac{p(\ve{x}|c_m)~p(c_m)}{\sum_{m=1}^M p(\ve{x}|c_m)~p(c_m)} ~~~~, \ea $$ which we can calculate using the definition (equation Gaussfact). This might appear to be circular reasoning, and it is! The probabilities of the points can be calculated if we know the parameters of the distributions (means, variances, weights), and the parameters can be found if we know the probabilities. Since we start knowing neither, we can start with a random guess for the parameters and go back and forth, iteratively updating the probabilities and then the parameters. Calculating an expected distribution given parameters, then finding the most likely parameters given a distribution, is called Expectation-Maximization (EM), and it converges to the maximum likelihood distribution starting from the initial guess {Dempster:77}.

To see where this magical property comes from let's take the log-likelihood $L$ of the data set (the log of the product of the probabilities to see each point) and differentiate with respect to the mean of the $m$th Gaussian: $$ \ba \del_{\ve{\mu}_m}L &=& \del_{\ve{\mu}_m}\log\prod_{n=1}^N p(\ve{x}_n) \\ &=& \sum_{n=1}^N \del_{\ve{\mu}_m}\log p(\ve{x}_n) \\ &=& \sum_{n=1}^N \frac{1}{p(\ve{x}_n)} \del_{\ve{\mu}_m} p(\ve{x}_n) \\ &=& \sum_{n=1}^N \frac{1}{p(\ve{x}_n)} p(\ve{x}_n,c_m) \mat{C}_m^{-1}\cdot(\ve{x}_n-\ve{\mu}_m) \\ &=& \sum_{n=1}^N p(c_m|\ve{x}_n)\mat{C}_m^{-1}\cdot\ve{x}_n -\sum_{n=1}^N p(c_m|\ve{x}_n)\mat{C}_m^{-1}\cdot\ve{\mu}_m \\ &=& Np(c_m)\mat{C}_m^{-1}\cdot\left[ \frac{1}{Np(c_m)}\sum_{n=1}^N \ve{x}_n~p(c_m|\ve{x}_n) -\ve{\mu}_m\right] ~. \ea $$ Writing the change in the mean after one EM iteration as $\delta\ve{\mu}_m$ and recognizing that the term on the left in the bracket is the update rule for the mean, we see that $$ \delta\ve{\mu}_m = \frac{\mat{C}_m}{Np(c_m)}\cdot\del_{\ve{\mu}_m} L ~~~~. $$ Now look back at Section {nllsq}, where we wrote a gradient descent update step in terms of the gradient of a cost function suitably scaled. In one EM update the mean moves in the direction of the gradient of the log-likelihood, scaled by the Gaussian's covariance matrix divided by the weight. The weight is a positive number, and the covariance matrix is positive definite (it has positive eigenvalues {Strang:86}), therefore the result is to increase the likelihood. The changes in the mean stop at the maximum when the gradient vanishes. Similar equations apply to the variances and weights {Xu:96}.

This is as close as data analysis algorithms come to a free lunch. We're maximizing a quantity of interest (the log-likelihood) merely by repeated function evaluations. The secret is in equation (posterior). The numerator measures how strongly one Gaussian predicts a point, and the denominator measures how strongly all the Gaussians predict the point. The ratio gives the fraction of the point to be associated with each Gaussian. Each point has one unit of explanatory power, and it gives it out based on how well it is predicted. As a Gaussian begins to have a high probability at a point, that point effectively disappears from the other Gaussians. The Gaussians therefore start exploring the data set in parallel, "eating" points that they explain well and forcing the other Gaussians towards ones they don't. The interaction among the Gaussians happens in the denominator of the posterior term, collectively performing a high-dimensional search.

The EM cycle finds the local maximum of the likelihood that can be reached from the starting condition, but that usually is not a significant limitation because there are so many equivalent arrangements that are equally good. The Gaussians can for example be started with random means and variances large enough to "feel" the entire data set. Stopping can be determined by the convergence of their positions, or of the likelihood estimage. And because EM is doing gradient ascent on the log-likelihood, all of the lessons from Chapter {opt} apply - a filtered updated combining new and old estimates can be included to control the learning rate, and momentum from the change in prior estimates can be included to climb out of local extrema.

example: mixture model

import matplotlib.pyplot as plt
from scipy.spatial import Voronoi,voronoi_plot_2d
import numpy as np
import time
#
# Gaussuam mixture model parameters
#
npts = 3000
nclusters = 3
nsteps = 25
nplot = 100
xs = [0,5,10]
ys = [0,10,5]
np.random.seed(0)
#
# generate random data
#
x = np.array([])
y = np.array([])
for i in range(len(xs)):
    x = np.append(x,np.random.normal(loc=xs[i],scale=1,size=npts//nclusters))
    y = np.append(y,np.random.normal(loc=ys[i],scale=1,size=npts//nclusters))
#
# choose starting points and initialize
#
indices = np.random.uniform(low=0,high=len(x),size=nclusters).astype(int)
mux = x[indices]
muy = y[indices]
varx = (np.max(x)-np.min(x))**2
vary = (np.max(y)-np.min(y))**2
pc = np.ones(nclusters)/nclusters
#
# plot before iteration
#
fig,ax = plt.subplots()
plt.plot(x,y,'.')
plt.errorbar(mux,muy,xerr=np.sqrt(varx),yerr=np.sqrt(vary),fmt='r.',markersize=20)
plt.autoscale()
plt.title('before iteration')
plt.show()
#
# do E-M iterations
#
for i in range(nsteps):
    #
    # construct matrices
    #
    xm = np.outer(x,np.ones(nclusters))
    ym = np.outer(y,np.ones(nclusters))
    muxm = np.outer(np.ones(npts),mux)
    muym = np.outer(np.ones(npts),muy)
    varxm = np.outer(np.ones(npts),varx)
    varym = np.outer(np.ones(npts),varx)
    pcm = np.outer(np.ones(npts),pc)
    #
    # use model to update probabilities
    #
    pvgc = (1/np.sqrt(2*np.pi*varxm))*\
       np.exp(-(xm-muxm)**2/(2*varxm))*\
       (1/np.sqrt(2*np.pi*varym))*\
       np.exp(-(ym-muym)**2/(2*varym))
    pvc = pvgc*np.outer(np.ones(npts),pc)
    pcgv = pvc/np.outer(np.sum(pvc,1),np.ones(nclusters))
    #a
    # use probabilities to update model
    #
    pc = np.sum(pcgv,0)/npts
    mux = np.sum(xm*pcgv,0)/(npts*pc)
    muy = np.sum(ym*pcgv,0)/(npts*pc)
    varx = 0.1+np.sum((xm-muxm)**2*pcgv,0)/(npts*pc)
    vary = 0.1+np.sum((ym-muym)**2*pcgv,0)/(npts*pc)
#
# plot after iteration
#
fig,ax = plt.subplots()
plt.plot(x,y,'.')
plt.errorbar(mux,muy,xerr=np.sqrt(varx),yerr=np.sqrt(vary),fmt='r.',markersize=20)
plt.autoscale()
plt.title('after iteration')
plt.show()
#
# plot distribution
#
xplot = np.linspace(np.min(x),np.max(x),nplot)
yplot = np.linspace(np.min(y),np.max(y),nplot)
(X,Y) = np.meshgrid(xplot,yplot)
p = np.zeros((nplot,nplot))
for c in range(nclusters):
    p += np.exp(-(X-mux[c])**2/(2*varx[c]))/np.sqrt(2*np.pi*varx[c])\
       *np.exp(-(Y-muy[c])**2/(2*vary[c]))/np.sqrt(2*np.pi*vary[c])\
       *pc[c]
fig, ax = plt.subplots(subplot_kw={"projection":"3d"})
ax.plot_surface(X,Y,p)
plt.title('probability distribution')
plt.show()

Cluster-Weighted Modeling

As appealing as mixture density estimation is, there is still a final problem. Figure {denseker} shows the result of doing EM with a mixture of Gaussians on random data uniformly distributed over the interval $[0,1]$. The individual distributions are shown as dashed lines, and the sum by a solid line. The correct answer is a constant, but because that is hard to represent in this basis we get a very bumpy distribution that depends on the precise details of how the Gaussians overlap. In gaining locality we've lost the ability to model simple functional dependence.

Screenshot from 2026-05-03 21-04-12.png

Figure: Mixture density estimation with Gaussians for data uniformly distributed between 0 and 1.

The problem is that a Gaussian captures only proximity; anything nontrivial must come from the overlap of multiple Gaussians. A better alternative is to base the expansion of a density around models that can locally describe more complex behavior. This powerful idea has re-emerged in many flavors under many names, including Bayesian networks {Buntine:94}, mixtures of experts {Jordan:94}, and gated experts {Weigend:95}. The version that we will cover, cluster-weighted modeling {Gershenfeld:99_cwm}, is just general enough to be able to describe many situations, but not so general that it presents difficult architectural decisions.

The goal now is to capture the functional dependence in a system as part of a density estimate. Let's start with a set of $N$ observations $\{y_n,\ve{x}_n\}_{n=1}^N$, where the $\ve{x}_n$ are known inputs and the $y_n$ are measured outputs. $y$ might be the exchange rate between the dollar and the mark, and $\ve{x}$ a group of currency indicators. Or $y$ could be a future value of a signal and $\ve{x}$ a vector of past lagged values. For simplicity we'll take $y$ to be a scalar, but the generalization to vector $\ve{y}$ is straightforward.

Screenshot from 2026-05-03 21-04-25.png

Figure: The spaces in cluster-weighted modeling.

Given the joint density $p(y,\ve{x})$ we could find any derived quantity of interest, such as a conditional forecast $\la y|\ve{x}\ra$. We'll proceed by expanding the density again, but now in terms of explanatory clusters that contain three terms: a weight $p(c_m)$, a domain of influence in the input space $p(\vec x|c_m)$, and a dependence in the output space $p(y|\ve{x},c_m)$: $$ \ba p(y,{\vec x}) &=& \sum_{m=1}^M p(y,{\vec x},c_m) \\ &=& \sum_{m=1}^M p(y,{\vec x}|c_m)~p(c_m) \\ &=& \sum_{m=1}^M p(y|{\vec x},c_m)~p({\vec x}|c_m)~p(c_m) ~~~~. \ea $$ These terms are shown in Figure {cwmspace}. As before, $p(c_m)$ is a number that measures the fraction of the data set explained by the cluster. The input term could be taken to be a $D$-dimensional separable Gaussian, $$ p({\vec x}|c_m) = \prod_{d=1}^D \frac{1}{\sqrt{2\pi\sigma_{m,d}^2}} e^{-(x_d-\mu_{m,d})^2/2\sigma_{m,d}^2} ~~~~, $$ or one using the full covariance matrix, $$ p({\vec x}|c_m) = \frac{|{\bf C}_m^{-1}|^{1/2}}{(2\pi)^{D/2}} e^{-({\vec x}-\vec\mu_m)^T \cdot {\bf C}_m^{-1} \cdot ({\vec x}-\vec\mu_m)/2} ~~~~. $$ Separable Gaussians require storage and computation linear in the dimension of the space while nonseparable ones require quadratic storage and a matrix inverse at each step. Conversely, using the covariance matrix lets one cluster capture a linear relationship that would require many separable clusters to describe. Therefore covariance clusters are preferred in low-dimensional spaces, but cannot be used in high-dimensional spaces.

The output term will also be taken to be a Gaussian, but with a new twist: $$ p(y|{\vec x},c_m) = \frac{1}{\sqrt{2\pi\sigma_{m,y}^2}} e^{-[y-f({\vec x},\vec\beta_m)]^2/2\sigma_{m,y}^2} ~~~~. $$ The mean of the Gaussian is now a function $f$ that depends on $\vec{x}$ and a set of parameters $\vec{\beta}_m$. The reason for this can be seen by calculating the conditional forecast $$ \ba \tag{condforc} \langle y|{\vec x} \rangle &=& \int y~p(y|{\vec x})~dy \\ &=& \int y~\frac{p(y,{\vec x})}{p({\vec x})}~dy \\ &=& \frac{\sum_{m=1}^M \int y~p(y|{\vec x},c_m)~dy~p({\vec x}|c_m)~p(c_m)} {\sum_{m=1}^M p({\vec x}|c_m)~p(c_m)} \\ &=& \frac{\sum_{m=1}^M f({\vec x},{\vec\beta}_m)~p({\vec x}|c_m)~p(c_m)} {\sum_{m=1}^M p({\vec x}|c_m)~p(c_m)} ~~~~. \ea $$ We see that in the forecast the Gaussians are used to control the interpolation among the local functions, rather than directly serving as the basis for functional approximation. This means that $f$ can be chosen to reflect a prior belief on the local relationship between $\vec x$ and $y$, for example locally linear, and even one cluster is capable of modeling this behavior. Using locally linear models says nothing about the global smoothness because there is no constraint that the clusters need to be near each other, so this architecture can handle the combination of smoothness and discontinuity posed in Figure {regprob}.

Equation (condforc) looks like a kind of network model, but it is a derived property of a more general density estimate rather than an assumed form. We can also predict other quantities of interest such as the error in our forecast of $y$, $$ \ba \langle \sigmay^2|{\vec x} \rangle &=& \int (y-\langle y|{\vec x} \rangle)^2 ~ p(y|{\vec x}) ~ dy \ &=& \int (y^2-\langle y|{\vec x} \rangle^2) ~ p(y|{\vec x}) ~ dy \ &=& \frac{\sum{m=1}^M [\sigma_{m,y}^2 + f({\vec x},{\vec\beta}_m)^2]~ p({\vec x}|c_m)~p(cm)}{\sum{m=1}^M p({\vec x}|c_m)~p(c_m)}

   - \langle y | {\vec x} \rangle^2 . 

\ea $$ Note that the use of Gaussian clusters does not assume a Gaussian error model; there can be multiple output clusters associated with one input value to capture multimodal or other kinds of non-Gaussian distributions.

The estimation of the parameters will follow the EM algorithm we used in the last section. From the forward probabilities we can calculate the posteriors $$ \ba p(c_m|y,{\vec x}) &=& \frac{p(y,{\vec x},c_m)}{p(y,{\vec x})} \\ &=& \frac{p(y|{\vec x},c_m)~p(\ve{x}|c_m)~p(c_m)} {\sum_{m=1}^M p(y,{\vec x},c_m)} \\ &=& \frac{p(y|{\vec x},c_m)~p(\ve{x}|c_m)~p(c_m)} {\sum_{m=1}^M p(y|{\vec x},c_m)~p(\ve{x}|c_m)~p(c_m)} \ea $$ which are used to update the cluster weights $$ \ba p(c_m) &=& \int p(y,{\vec x},c_m)~dy~d\vec x \\ &=& \int p(c_m|y,{\vec x})~p(y,{\vec x})~dy~d\vec x \\ &\approx& \frac{1}{N} \sum_{n=1}^N p(c_m|y_n,{\vec x}_n) ~~~~. \ea $$ A similar calculation is used to find the new means, $$ \ba \vec{\mu}_m^{new} &=& \int {\vec x} ~ p({\vec x}|c_m) ~ d{\vec x} \\ &=& \int {\vec x} ~ p(y,{\vec x}|c_m) ~ dy~d{\vec x} \\ &=& \int {\vec x} ~ \frac{p(c_m|y,{\vec x})}{p(c_m)} ~ p(y,{\vec x}) ~ dy~d{\vec x} \\ &\approx& \frac{1}{N~p(c_m)} \sum_{n=1}^N {\vec x}_n ~ p(c_m|y_n,{\vec x}_n) \\ &=& \frac{\sum_{n=1}^N {\vec x}_n~p(c_m|y_n,{\vec x}_n)} {\sum_{n=1}^N p(c_m|y_n,{\vec x}_n)} \\ &\equiv& \langle {\vec x} \rangle_m \ea $$ (where the last line defines the cluster-weighted expectation value). The second line introduces $y$ as a variable that is immediately integrated over, an apparently meaningless step that in fact is essential. Because we are using the stochastic sampling trick to evaluate the integrals and guide the cluster updates, this permits the clusters to respond to both where data are in the input space and how well their models work in the output space. A cluster won't move to explain nearby data if they are better explained by another cluster's model, and if two clusters' models work equally well then they will separate to better explain where the data are.

The cluster-weighted expectations are also used to update the variances $$ \sigma_{m,d}^{2,new} = \langle (x_d-\mu_{m,d})^2 \rangle_m $$ or covariances $$ [{\bf C}_m]_{ij}^{new} = \langle (x_i-\mu_i)(x_j-\mu_j) \rangle_m ~~~~. $$

The model parameters are found by choosing the values that maximize the cluster-weighted log-likelihood: $$ \ba 0 &=& \ps{}{\ve{\beta}_m}\log\prod_{n=1}^N p(y_n,\ve{x}_n) \\ &=& \sum_{n=1}^N \ps{}{\ve{\beta}_m}\log p(y_n,\ve{x}_n) \\ &=& \sum_{n=1}^N \frac{1}{p(y_n,\ve{x}_n)} \ps{p(y_n,\ve{x}_n)}{\beta_m}\\ &=& \sum_{n=1}^N \frac{1}{p(y_n,\ve{x}_n)} p(y_n,\ve{x}_n,c_m) \frac{y_n-f(\ve{x}_n,\ve{\beta}_m)}{\sigma_{m,y}^2} \ps{f(\ve{x}_n,\ve{\beta}_m)}{\ve{\beta}_m} \\ &=& \frac{1}{\sigma_{m,y}^2} \sum_{n=1}^N p(c_m|y_n,\ve{x}_n) [y_n-f(\ve{x}_n,\ve{\beta}_m)] \ps{f(\ve{x}_n,\ve{\beta}_m)}{\ve{\beta}_m} \\ &=& \frac{1}{Np(c_m)} \sum_{n=1}^N p(c_m|y_n,\ve{x}_n) [y_n-f(\ve{x}_n,\ve{\beta}_m)] \ps{f(\ve{x}_n,\ve{\beta}_m)}{\ve{\beta}_m} \\ &=& \left\langle [y-f(\ve{x},\ve{\beta})] \ps{f(\ve{x},\ve{\beta}_m)}{\ve{\beta}_m} \right\rangle_m ~~~~. \ea $$ In the last few lines remember that since the left hand side is zero we can multiply or divide the right hand side by a constant.

If we choose a local model that has linear coefficients, $$ f({\vec x},\ve{\beta}_m) = \sum_{i=1}^I \beta_{m,i} f_i({\vec x}) ~~~~, \tag{local} $$ then this gives for the $j$th coefficient $$ \ba 0 &=& \langle [y-f({\vec x},{\vec\beta}_m)] f_j({\vec x}) \rangle_m \ &=& \underbrace{\langle yf_j({\vec x}) \ranglem}{\displaystyle a_j}

   - \sum_{i=1}^I \beta_{m,i}~
   \underbrace{\langle f_j({\vec x})f_i({\vec x})\rangle_m}_{\displaystyle B_{ji}}
   \\

\Rightarrow \ve{\beta}m &=& B^{-1}\cdot{\vec a} ~~~~. \ea $$ Finally, once we've found the new model parameters then the new output width is $$ \sigma{m,y}^{2,new} % = \langle [y - \langle y|{\vec x} \rangle]^2 \rangle_m = \langle [y - f({\vec x},{\vec\beta}_m)]^2 \rangle_m ~~~~. $$ It's a good idea to add a small constant to the input and output variance estimates to prevent a cluster from shrinking down to zero width for data with no functional dependence; this constant reflects the underlying resolution of the data.

Iterating this sequence of evaluting the forward and posterior probabilities at the data points, then maximizing the likelihood of the parameters, finds the most probable set of parameters that can be reached from the starting configuration. The fundamental algorithm parameter is $M$, the number of clusters. This controls overfitting, which can be determined by cross-validation, and by adaptively pruning small clusters and dividing large ones. The other essential choice to make is $f$, the form of the local model. This can be based on past practice for a domain, and should be simple enough that the local model cannot overfit. It's even possible to include more than one type of local model and let the clustering find where they are most relevant.

Cluster-weighted modeling then assembles the local models into a global model that can handle nonlinearity (by the nonlinear interpolation among local models), discontinuity (since nothing forces clusters to stay near each other), non-Gaussianity (by the overlap of multiple clusters), nonstationarity (by giving absolute time as an input variable so that clusters can grow or shrink as needed to find the locally stationary time scales), find low-dimensional structure in high-dimensional spaces (models are allocated only where there are data to explain), predict not only errors from the width of the output distribution but also errors that come from forecasting away from where training observations were made (by using the input density estimate), build on experience (by reducing to a familiar global version of the local model in the limit of one cluster), and the resulting model has a transparent architecture (the parameters are easy to interpret, and by definition the clusters find ``interesting'' places to represent the data) {Gershenfeld:99_cwm}.

Taken together those specifications are a long way from the simple models we used at the outset of the study of function fitting. Looking beyond cluster-weighted modeling, it's possible to improve on the EM search by doing a full Bayesian integration over prior distributions on all the parameters {Richardson:97}, but this generality is not needed for the common case of weak priors on the parameters and many equally acceptable solutions. And the model can be further generalized to arbitrary probabilistic networks {Buntine:96}, a step that is useful if there is some {\it a priori} insight into the architecture.

An important special case of cluster-weighted modeling is observables that are restricted to a discrete set of values such as events, patterns, or conditions. Let's now let $\{s_n,\ve{x}_n\}_{n=1}^N$ be the training set of pairs of input points $\ve{x}$ and discrete output states $s$. Once again we'll factor the joint density over clusters $$ p(s,\vec{x}) = \sum_{m=1}^M p(s|\vec{x},c_m)~p(\vec{x}|c_m)~p(c_m) ~~~~, $$ but now the output term is simply a histogram of the probability to see each state for each cluster (with no explicit $\ve{x}$ dependence). The histogram is found by generalizing the simple binning we used at the beginning of the chapter to the cluster-weighted expectation of a delta function that equals one on points with the correct state and zero otherwise: $$ \ba p(s|\vec{x},c_m) &=& \frac{1}{Np(c_m)} \sum_{n=1}^N \delta_{s_n=s} p(c_m|s_n,\ve{x}_n) \\ &=& \frac{1}{Np(c_m)} \sum_{n|s_n=s} p(c_m|s_n,\ve{x}_n) ~~~~. \ea $$ Everything else is the same as the real-valued output case. An example is shown in Figure {denses}, for two states $s$ (labelled "1" and "2"). The clusters all started equally likely to predict either state, but after convergence they have specialized in one of the data types as well as dividing the space.

This model can predict the probability to see each state given a new point by $$ p(s|\vec{x}) = \frac{p(s,\vec{x})}{\sum_s p(s,\vec{x})} ~~~~. $$ For applications where computational complexity or speed is a concern there are easy approximations that can be made. It is possible to use just the cluster variances instead of the full covariance matrix, and to restrict the output models to one state per cluster rather than keeping the complete histogram. If the goal is simply to predict the most likely state rather than the distribution of outcomes, then evaluating the density can be approximated by comparing the square distances to the clusters divided by the variances. Finally, keeping only the cluster means reduces to a linear-time search to find the nearest cluster, $\min_m |\ve{x}-\ve{\mu}_m|$, which defines a Voronoi tesselation of the space by the clusters {Preparata:85}.

The final approximation of simply looking up the state of the cluster nearest a new point might appear to not need such a general clustering to start with. A much simpler algorithm would be to iteratively assign points to the nearest cluster and then move the cluster to the mean of the points associated with it. This is called the k-means algorithm {MacQueen:67}{Hartigan:75}. It doesn't work very well, though, because in the iteration a cluster cannot be influenced by a large number of points that are just over the boundary with another cluster. By embedding the cluster states in Gaussian kernels as we have done, this is a \trm{soft clustering}{Soft clustering} unlike the hard clustering done by k-means. The result is usually a faster search and a better model.

Predicting states is part of the domain of pattern recognition. State prediction can also be used to approximate a set of inputs by a common output; vector quantization applies this to a feature vector describing a signal in order to compress it for transmission and storage {Nasrabi:88}{Gersho:82}. Alternatively, it can be used to retrieve information associated with related inputs stored in a content-addressable memory* {Kohonen:89}{Bechtel:91}. We'll pick up those problems in Chapter {ml}.

example: variance cluster classifier

import matplotlib.pyplot as plt
import numpy as np
#
# state cluster-weighted modeling parameters
#
dim = 2
nstates = 2
nclusters = 5
nsamples = 100
momentum = 0.9
rate = 0.1
min_var = 0.1
niterations = 100
np.random.seed(10)
#
# generate data
#
def translate(dx,dy,a):
   return(np.array((a[0,:]+dx,a[1,:]+dy)))
def scale(sx,sy):
   return(np.array(([sx,0],[0,sy])))
def rotate(theta):
   theta = np.pi*theta/180
   return(np.array(([np.cos(theta),-np.sin(theta)],[np.sin(theta),np.cos(theta)])))
zero_states = scale(3,1)@np.random.randn(dim,nsamples)
zero_states = np.hstack((zero_states,
   translate(20,0,rotate(90)@scale(3,1)@np.random.randn(dim,nsamples))))
nzeros = zero_states.shape[1]
one_states = translate(5,-10,rotate(45)@scale(5,1)@np.random.randn(dim,nsamples))
one_states = np.hstack((one_states,
   translate(5,10,rotate(-45)@scale(5,1)@np.random.randn(dim,nsamples))))
one_states = np.hstack((one_states,
   translate(11,0,rotate(90)@scale(4,1)@np.random.randn(dim,nsamples))))
nones = one_states.shape[1]
points = np.hstack((zero_states,one_states))
states = np.append(np.zeros(nzeros,dtype=int),np.ones(nones,dtype=int))
npts = len(states)
#
# initialize arrays
#
indices = np.random.randint(0,npts,nclusters)
means = points[:,indices]
dmean = np.zeros((dim,nclusters))
variances = np.outer(np.var(points,axis=1),np.ones(nclusters))
pc = np.ones(nclusters)/nclusters # p(c)
pxgc = np.zeros((nclusters,npts)) # p(x|c)
psgxc = np.ones((nclusters,nstates))/nstates # p(s|xc)
#
# plot data
#
plt.plot(zero_states[0,:],zero_states[1,:],ls='',marker='$0$',markeredgecolor='green')
plt.plot(one_states[0,:],one_states[1,:],ls='',marker='$1$',markeredgecolor='orange')
plt.xlim([-10,30])
plt.ylim([-20,20])
#
# plot clusters
#
def plot_var():
   for cluster in range(means.shape[1]):
      x = means[0,cluster]
      y = means[1,cluster]
      type = np.argmax(psgxc[cluster,:])
      dx = np.sqrt(variances[0,cluster])
      dy = np.sqrt(variances[1,cluster])
      plt.plot([x-dx,x+dx],[y,y],'k')
      plt.plot([x,x],[y-dy,y+dy],'k')
      plt.plot([x],[y],'k',marker=f'${type}$',markersize=15)
plot_var()
plt.title('variance clusters before iteration')
plt.show()
#
# do E-M iteration
#
for i in range(niterations):
   for cluster in range(nclusters):
      mean = np.outer(means[:,cluster],np.ones(npts))
      variance = np.outer(variances[:,cluster],np.ones(npts))
      pxgc[cluster,:] = np.prod(np.exp(-(points-mean)**2/(2*variance))\
         /np.sqrt(2*np.pi*variance),0)
   pxc = pxgc*np.outer(pc,np.ones(npts))
   pcgx = pxc/np.outer(np.ones(nclusters),np.sum(pxc,0))
   psxc = psgxc[:,states]*pxc
   pcgsx = psxc/np.outer(np.ones(nclusters),np.sum(psxc,0))
   pc = np.sum(pcgsx,1)/npts
   for cluster in range(nclusters):
      newmean = momentum*dmean[:,cluster]\
         +np.sum(points*np.outer(np.ones(dim),pcgsx[cluster,:]),1)\
         /np.sum(np.outer(np.ones(dim),pcgsx[cluster,:]),1)
      dmean[:,cluster] = newmean-means[:,cluster]
      means[:,cluster] += rate*dmean[:,cluster]
      m = np.outer(means[:,cluster],np.ones(npts))
      variances[:,cluster] = np.sum((points-m)**2\
          *np.outer(np.ones(dim),pcgsx[cluster,:]),1)\
         /np.sum(np.outer(np.ones(dim),pcgsx[cluster,:]),1)\
         +min_var
   for state in range(nstates):
      index = np.argwhere(states == state)
      psgxc[:,state] = np.sum(pcgsx[:,index],1)[:,0]/np.sum(pcgsx[:,:],1)
#
# plot data
#
plt.plot(zero_states[0,:],zero_states[1,:],ls='',marker='$0$',markeredgecolor='green')
plt.plot(one_states[0,:],one_states[1,:],ls='',marker='$1$',markeredgecolor='orange')
plt.xlim([-10,30])
plt.ylim([-20,20])
#
# plot clusters
#
plot_var()
#
# plot decision boundaries
#
ngrid = 100
xmin = np.min(points[0,:])
xmax = np.max(points[0,:])
ymin = np.min(points[1,:])
ymax = np.max(points[1,:])
x = np.linspace(xmin,xmax,ngrid)
y = np.linspace(ymin,ymax,ngrid)
mx,my = np.meshgrid(x,y)
x = np.reshape(mx,(ngrid*ngrid))
y = np.reshape(my,(ngrid*ngrid))
plotpoints = np.vstack((x,y))
pxgc = np.zeros((nclusters,ngrid*ngrid))
for cluster in range(nclusters):
   mean = np.outer(means[:,cluster],np.ones(ngrid*ngrid))
   variance = np.outer(variances[:,cluster],np.ones(ngrid*ngrid))
   pxgc[cluster,:] = np.prod(np.exp(-(plotpoints-mean)**2/(2*variance))\
      /np.sqrt(2*np.pi*variance),0)
pxc = pxgc*np.outer(pc,np.ones(ngrid*ngrid))
p = np.sum(np.outer(psgxc[:,0],np.ones(ngrid*ngrid))*pxc,0)/np.sum(pxc,0)
p = np.reshape(p,(ngrid,ngrid))
plt.contour(mx,my,p,[0.5])
plt.title('variance clusters and decision boundaries after iteration')
plt.show()
#
# plot probability surface
#
fig, ax = plt.subplots(subplot_kw={"projection":"3d"})
ax.plot_wireframe(mx,my,p,rstride=10,cstride=10,color='gray')
plt.plot(zero_states[0,:],zero_states[1,:],zdir='z',ls='',marker='r$0$',markeredgecolor='green')
plt.plot(one_states[0,:],one_states[1,:],zdir='z',ls='',marker='$1$',markeredgecolor='orange')
plt.title('probability of state 0')
plt.show()

example: covariance cluster classifiers

import matplotlib.pyplot as plt
import numpy as np
#
# state cluster-weighted modeling parameters
#
dim = 2
nstates = 2
nclusters = 5
nsamples = 100
momentum = 0.9
rate = 0.1
min_var = 0.1
niterations = 100
np.random.seed(10)
#
# generate data
#
def translate(dx,dy,a):
   return(np.array((a[0,:]+dx,a[1,:]+dy)))
def scale(sx,sy):
   return(np.array(([sx,0],[0,sy])))
def rotate(theta):
   theta = np.pi*theta/180
   return(np.array(([np.cos(theta),-np.sin(theta)],[np.sin(theta),np.cos(theta)])))
zero_states = scale(3,1)@np.random.randn(dim,nsamples)
zero_states = np.hstack((zero_states,
   translate(20,0,rotate(90)@scale(3,1)@np.random.randn(dim,nsamples))))
nzeros = zero_states.shape[1]
one_states = translate(5,-10,rotate(45)@scale(5,1)@np.random.randn(dim,nsamples))
one_states = np.hstack((one_states,
   translate(5,10,rotate(-45)@scale(5,1)@np.random.randn(dim,nsamples))))
one_states = np.hstack((one_states,
   translate(11,0,rotate(90)@scale(4,1)@np.random.randn(dim,nsamples))))
nones = one_states.shape[1]
points = np.hstack((zero_states,one_states))
states = np.append(np.zeros(nzeros,dtype=int),np.ones(nones,dtype=int))
npts = len(states)
#
# initialize arrays
#
indices = np.random.randint(0,npts,nclusters)
means = points[:,indices]
dmean = np.zeros((dim,nclusters))
covariances = np.zeros((nclusters,2,2))
covariances[:,0,0] = np.var(points[0,:])
covariances[:,1,1] = np.var(points[1,:])
pc = np.ones(nclusters)/nclusters # p(c)
pxgc = np.zeros((nclusters,npts)) # p(x|c)
psgxc = np.ones((nclusters,nstates))/nstates # p(s|xc)
#
# plot data
#
plt.plot(zero_states[0,:],zero_states[1,:],ls='',marker='$0$',markeredgecolor='green')
plt.plot(one_states[0,:],one_states[1,:],ls='',marker='$1$',markeredgecolor='orange')
plt.xlim([-10,30])
plt.ylim([-20,20])
#
# plot clusters
#
def plot_covar():
   for cluster in range(means.shape[1]):
      x = means[0,cluster]
      y = means[1,cluster]
      type = np.argmax(psgxc[cluster,:])
      w,v = np.linalg.eig(covariances[cluster])
      w = np.sqrt(w)
      plt.plot([x-w[0]*v[0,0],x+w[0]*v[0,0]],[y-w[0]*v[1,0],y+w[0]*v[1,0]],'k')
      plt.plot([x-w[1]*v[0,1],x+w[1]*v[0,1]],[y-w[1]*v[1,1],y+w[1]*v[1,1]],'k')
      plt.plot([x],[y],'k',marker=f'${type}$',markersize=15)
plot_covar()
plt.title('covariance clusters before iteration')
plt.show()
#
# do E-M iteration
#
for i in range(niterations):
   for cluster in range(nclusters):
      mean = np.outer(means[:,cluster],np.ones(npts))
      cinv = np.linalg.pinv(covariances[cluster])
      pxgc[cluster,:] = (np.sqrt(np.linalg.det(cinv))/(2*np.pi)**(dim/2))\
         *np.exp(-np.sum((points-mean)*(cinv@(points-mean)),0)/2)
   pxc = pxgc*np.outer(pc,np.ones(npts))
   pcgx = pxc/np.outer(np.ones(nclusters),np.sum(pxc,0))
   psxc = psgxc[:,states]*pxc
   pcgsx = psxc/np.outer(np.ones(nclusters),np.sum(psxc,0))
   pc = np.sum(pcgsx,1)/npts
   for cluster in range(nclusters):
      newmean = momentum*dmean[:,cluster]\
         +np.sum(points*np.outer(np.ones(dim),pcgsx[cluster,:]),1)\
         /np.sum(np.outer(np.ones(dim),pcgsx[cluster,:]),1)
      dmean[:,cluster] = newmean-means[:,cluster]
      means[:,cluster] += rate*dmean[:,cluster]
      m = np.outer(means[:,cluster],np.ones(npts))
      for d in range(dim):
         covariances[cluster][:,d] = np.sum(np.outer(\
            np.ones(dim),points[d,:]-m[d,:])*(points-m)*\
            np.outer(np.ones(dim),pcgsx[cluster,:]),1)\
            /np.sum(np.outer(np.ones(dim),pcgsx[cluster,:]),1)\
            +min_var
   for state in range(nstates):
      index = np.argwhere(states == state)
      psgxc[:,state] = np.sum(pcgsx[:,index],1)[:,0]/np.sum(pcgsx[:,:],1)
#
# plot data
#
plt.plot(zero_states[0,:],zero_states[1,:],ls='',marker='$0$',markeredgecolor='green')
plt.plot(one_states[0,:],one_states[1,:],ls='',marker='$1$',markeredgecolor='orange')
plt.xlim([-10,30])
plt.ylim([-20,20])
#
# plot clusters
#
plot_covar()
#
# plot decision boundary
#
ngrid = 100
xmin = np.min(points[0,:])
xmax = np.max(points[0,:])
ymin = np.min(points[1,:])
ymax = np.max(points[1,:])
x = np.linspace(xmin,xmax,ngrid)
y = np.linspace(ymin,ymax,ngrid)
mx,my = np.meshgrid(x,y)
x = np.reshape(mx,(ngrid*ngrid))
y = np.reshape(my,(ngrid*ngrid))
plotpoints = np.vstack((x,y))
pxgc = np.zeros((nclusters,ngrid*ngrid))
for cluster in range(nclusters):
   mean = np.outer(means[:,cluster],np.ones(ngrid*ngrid))
   cinv = np.linalg.pinv(covariances[cluster])
   pxgc[cluster,:] = (np.sqrt(np.linalg.det(cinv))/(2*np.pi)**(dim/2))\
      *np.exp(-np.sum((plotpoints-mean)*(cinv@(plotpoints-mean)),0)/2)
pxc = pxgc*np.outer(pc,np.ones(ngrid*ngrid))
p = np.sum(np.outer(psgxc[:,0],np.ones(ngrid*ngrid))*pxc,0)/np.sum(pxc,0)
p = np.reshape(p,(ngrid,ngrid))
plt.contour(mx,my,p,[0.5])
plt.title('covariance clusters and decision boundaries after iteration')
plt.show()
#
# plot probability surface
#
fig, ax = plt.subplots(subplot_kw={"projection":"3d"})
ax.plot_wireframe(mx,my,p,rstride=10,cstride=10,color='gray')
plt.plot(zero_states[0,:],zero_states[1,:],zdir='z',ls='',marker='r$0$',markeredgecolor='green')
plt.plot(one_states[0,:],one_states[1,:],zdir='z',ls='',marker='$1$',markeredgecolor='orange')
plt.title('probability of state 0')
plt.show()

example: function cluster fit

import matplotlib.pyplot as plt
import numpy as np
#
# function cluster-weighted modeling parameters
#
nclusters = 5
xmin = -5
xmax = 5
npts = 100
minvar = .01
niterations = 100
np.random.seed(10)
#
x = np.arange(xmin,xmax,(xmax-xmin)/npts)
nplot = npts
y = np.tanh(x)+0.25*np.random.rand(npts)
xplot = np.copy(x)
mux = xmin+(xmax-xmin)*np.random.rand(nclusters)
beta = 0.1*np.random.rand(3,nclusters)
varx = np.ones(nclusters)
vary = np.ones(nclusters)
pc = np.ones(nclusters)/nclusters
#
# plot before iteration
#
pxgc = np.exp(-(np.outer(x,np.ones(nclusters)) # p(x|c) \
  -np.outer(np.ones(npts),mux))**2\
  /(2*np.outer(np.ones(npts),varx)))\
  /np.sqrt(2*np.pi*np.outer(np.ones(npts),varx))
pygxc = np.exp(-(np.outer(y,np.ones(nclusters)) # p(y|xc) \
  -(np.outer(np.ones(npts),beta[0,:])+np.outer(x,beta[1,:])\
  +np.outer(x*x,beta[2,:])))**2\
  /(2*np.outer(np.ones(npts),vary)))\
  /np.sqrt(2*np.pi*np.outer(np.ones(npts),vary))
pxc = pxgc*np.outer(np.ones(npts),pc) # p(xc)
pxyc = pygxc*pxc # p(yxc)
pcgxy = pxyc/np.outer(np.sum(pxyc,1),np.ones(nclusters)) # p(c|xy)
plt.plot(x,y,'.')
yplot = np.sum((np.outer(np.ones(nplot),beta[0,:])\
  +np.outer(x,beta[1,:])+np.outer(x*x,beta[2,:]))*pxc,1)/np.sum(pxc,1)
plt.plot(xplot,yplot)
yvarplot = np.sum((vary+(np.outer(np.ones(nplot),beta[0,:])\
  +np.outer(x,beta[1,:])+np.outer(x*x,beta[2,:]))**2)*pxc,1)\
  /np.sum(pxc,1)-yplot**2
ystdplot = np.sqrt(yvarplot)
plt.plot(xplot,yplot+ystdplot,'k--')
plt.plot(xplot,yplot-ystdplot,'k--')
yplot = .8*pxc/np.max(pxc)-2 # for positioning on plot
plt.plot(xplot,yplot,'k')
plt.title('function clusters before iteration')
plt.show()
#
# E-M iteration
#
for step in range(niterations):
   #
   pxgc = np.exp(-(np.outer(x,np.ones(nclusters)) # p(x|c) \
      -np.outer(np.ones(npts),mux))**2\
      /(2*np.outer(np.ones(npts),varx)))\
      /np.sqrt(2*np.pi*np.outer(np.ones(npts),varx))
   pygxc = np.exp(-(np.outer(y,np.ones(nclusters)) # p(y|xc) \
      -(np.outer(np.ones(npts),beta[0,:])+np.outer(x,beta[1,:])\
      +np.outer(x*x,beta[2,:])))**2\
      /(2*np.outer(np.ones(npts),vary)))\
      /np.sqrt(2*np.pi*np.outer(np.ones(npts),vary))
   pxc = pxgc*np.outer(np.ones(npts),pc) # p(xc)
   pxyc = pygxc*pxc # p(yxc)
   pcgxy = pxyc/np.outer(np.sum(pxyc,1),np.ones(nclusters)) # p(c|xy)
   pc = np.sum(pcgxy,0)/npts # p(c)
   mux = np.sum(np.outer(x,np.ones(nclusters))*pcgxy,0)/(npts*pc)
   varx = minvar+np.sum((np.outer(x,np.ones(nclusters))\
      -np.outer(np.ones(npts),mux))**2*pcgxy,0)/(npts*pc)
   a = np.array([\
      np.sum(np.outer(y*1,np.ones(nclusters))*pcgxy,0)/(npts*pc),\
      np.sum(np.outer(y*x,np.ones(nclusters))*pcgxy,0)/(npts*pc),\
      np.sum(np.outer(y*x*x,np.ones(nclusters))*pcgxy,0)/(npts*pc)\
      ])
   B = np.array([[\
      np.ones(nclusters),\
      np.sum(np.outer(1*x,np.ones(nclusters))*pcgxy,0)/(npts*pc),\
      np.sum(np.outer(1*x*x,np.ones(nclusters))*pcgxy,0)/(npts*pc),\
      ],[\
      np.sum(np.outer(x*1,np.ones(nclusters))*pcgxy,0)/(npts*pc),\
      np.sum(np.outer(x*x,np.ones(nclusters))*pcgxy,0)/(npts*pc),\
      np.sum(np.outer(x*x*x,np.ones(nclusters))*pcgxy,0)/(npts*pc),\
      ],[\
      np.sum(np.outer(x*x*1,np.ones(nclusters))*pcgxy,0)/(npts*pc),\
      np.sum(np.outer(x*x*x,np.ones(nclusters))*pcgxy,0)/(npts*pc),\
      np.sum(np.outer(x*x*x*x,np.ones(nclusters))*pcgxy,0)/(npts*pc),\
      ]])
   for c in range(nclusters):
      beta[:,c] = np.linalg.inv(B[:,:,c])@a[:,c]
   vary = minvar+np.sum((np.outer(y,np.ones(nclusters))\
      -(np.outer(np.ones(npts),beta[0,:])+np.outer(x,beta[1,:])\
      +np.outer(x*x,beta[2,:])))**2\
      *pcgxy,0)/(npts*pc)
   #
plt.plot(x,y,'.')
yplot = np.sum((np.outer(np.ones(nplot),beta[0,:])\
  +np.outer(x,beta[1,:])+np.outer(x*x,beta[2,:]))*pxc,1)/np.sum(pxc,1)
plt.plot(xplot,yplot)
yvarplot = np.sum((vary+(np.outer(np.ones(nplot),beta[0,:])\
  +np.outer(x,beta[1,:])+np.outer(x*x,beta[2,:]))**2)*pxc,1)\
  /np.sum(pxc,1)-yplot**2
ystdplot = np.sqrt(yvarplot)
plt.plot(xplot,yplot+ystdplot,'k--')
plt.plot(xplot,yplot-ystdplot,'k--')
yplot = .8*pxc/np.max(pxc)-2 # for positioning on plot
plt.plot(xplot,yplot
,'k')
plt.title('function clusters after iteration')
plt.show()

References

Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis.

Traditional density estimation.

Problems

  1. Generate a data set by sampling 100 points from $y = \tanh(x)$ between $x = -5$ to 5, adding random noise with a magnitude of 0.25 to $y$. Fit it with a cluster-weighted model using a local linear function in the clusters, $y = a+bx$. Plot the data and the resulting forecast $\la y|x\ra$, uncertainty $\la\sigma_y^2|x\ra$, and cluster probabilities $p({\vec x}|c_m)p(c_m)$ as the number of clusters is increased (starting with one), and animate their evolution during the EM iterations.

  2. Generate a 2D distribution of data labeled with discrete states as shown in Figure {densecovar}, fit it with a cluster-weighted model, and plot the decision boundaries.


(c) Neil Gershenfeld 5/2/26