Transforms¶
The Nature of Mathematical Modeling (draft)¶
$ \def\%#1{\mathbf{#1}} \def\mat#1{\mathbf{#1}} \def\*#1{\vec{#1}} \def\ve#1{\vec{#1}} \def\ds#1#2{\cfrac{d #1}{d #2}} \def\dd#1#2{\cfrac{d^2 #1}{d #2^2}} \def\ps#1#2{\cfrac{\partial #1}{\partial #2}} \def\pp#1#2{\cfrac{\p^2 #1}{\p #2^2}} \def\p{\partial} \def\ba{\begin{eqnarray*}} \def\ea{\end{eqnarray*}} \def\eps{\epsilon} \def\del{\nabla} \def\disp{\displaystyle} \def\la{\langle} \def\ra{\rangle} \def\unit#1{~({\rm #1)}} \def\units#1#2{~\left(\frac{\rm #1}{\rm #2}\right)} $
We start our study of data with the easily overlooked question of whether it is best analyzed in the form that it is given (hint: the answer is frequently no). This is a question about representation -- what's the best way to view the data to highlight the features of interest? The goal will be to boil a set of measurements down to a smaller set that is more independent, freeing subsequent analysis from having to rediscover the structure. A good representation can go a long way towards solving a difficult problem, and conversely a bad one can doom an otherwise well-intentioned effort.
Orthogonal Transforms¶
We will frequently be concerned with orthogonal transformations. Because they are easily inverted they preserve information, an important feature since we don't want our transformation to throw away anything unless we tell it to.
A matrix is orthogonal if its inverse is equal to its transpose, $$ \mat{M}^T\mat{M} = \mat{I} ~~~~, $$ where as usual the transpose is denoted by $$ \mat{M}^T_{ij} \equiv \mat{M}_{ji} $$ and $\mat{I}$ is the identity matrix with 1s on the diagonal and 0s elsewhere. Multiplication of a vector by an orthogonal matrix defines an orthogonal transformation on the vector. For a complex matrix the adjoint is the complex conjugate of the transpose $$ \mat{M}^\dagger_{ij} \equiv \mat{M}^*_{ji} ~~~~. $$ If the adjoint is the inverse, $$ \mat{M}^\dagger\mat{M} = \mat{I} ~~~~, $$ then $\mat{M}$ is unitary. The column or row vectors $\ve{v}_i$ of an orthogonal matrix are not only orthogonal, $\ve{v}_i\ve{v}_j = 0~(i\ne j)$, they are orthonormal, $\ve{v}_i\ve{v}_j = \delta_{ij}$, but by convention the matrix itself is still usually just called orthogonal.
An important property of the adjoint is that it interchanges the order of a product of matrices: $$ (\mat{A}\mat{B})^\dagger = \mat{B}^\dagger\mat{A}^\dagger ~~~~. $$ Remember also that matrix multiplication is distributive $(\mat{A}(\mat{B}+\mat{C}) = \mat{A}\mat{B}+\mat{A}\mat{C})$ and associative $(\mat{A}(\mat{B}\mat{C}) = (\mat{A}\mat{B})\mat{C})$, but need not be commutative $(\mat{A}\mat{B}\ne \mat{B}\mat{A})$.
Now consider a linear transformation on a column vector $\vec{x}$ to a new one $\ve{y}=\mat{M}\ve{x}$. The Euclidean norm of $\vec{x}$ is its length as measured by the sum of the squares of the elements, $$ |\ve{x}|^2 = \ve{x}^\dagger\ve{x} = \sum_i x_i^* x_i = \sum_i |x_i|^2 ~~~~. $$ If $\mat{M}$ is unitary, then the norm of $\vec{y}$ is $$ \ba |\ve{y}|^2 &=& |(\mat{M}\ve{x})^\dagger(\mat{M}\ve{x})| \\ &=& |(\ve{x}^\dagger\mat{M}^\dagger)(\mat{M}\ve{x})| \\ &=& |\ve{x}^\dagger(\mat{M}^\dagger\mat{M})\ve{x}| \\ &=& |\ve{x}^\dagger\ve{x}| \\ &=& |\ve{x}|^2 ~~~~. \ea $$ A unitary (or orthogonal) transformation preserves the norm of a vector. It rotates a data point to a new location, but doesn't change its distance from the origin. This means that it can rearrange the points but not do something as nasty as make some of them disappear.
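This norm preservation is easy to check numerically; a minimal sketch using NumPy, generating a random orthogonal matrix from a QR factorization (the $\mat{Q}$ factor satisfies $\mat{Q}^T\mat{Q}=\mat{I}$):

```python
import numpy as np

# generate a random orthogonal matrix: the Q factor of a QR
# decomposition satisfies Q^T Q = I
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))

x = rng.normal(size=5)
y = Q @ x   # an orthogonal transformation of x

print(np.allclose(Q.T @ Q, np.eye(5)))                    # Q is orthogonal
print(np.allclose(np.linalg.norm(y), np.linalg.norm(x)))  # the norm is preserved
```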
Fourier Transforms¶
The Discrete Fourier Transformation (DFT) is a familiar example of a unitary transformation (Problem {trans-DFT}). Given a data vector $\{x_0, x_1, \ldots, x_{N-1}\}$, the DFT is defined by $$ \ba X_f &=& \frac{1}{\sqrt{N}}\sum_{n=0}^{N-1} e^{2\pi i f n/N} x_n \\ &\equiv& \sum_{n=0}^{N-1} M_{fn} x_n \\ &=& \mat{M}\cdot\ve{x} ~~~~, \ea $$ and the corresponding inverse transform by $$ x_n = \frac{1}{\sqrt{N}} \sum_{f=0}^{N-1} e^{-2\pi ifn/N} X_f ~~~~. $$ The $X_f$ are the coefficients for an expansion of the vector in a basis of periodic complex functions. For real-valued signals and transforms the related Discrete Cosine Transformation (DCT) can be used (Problem {trans-DCT}).
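As a check that the $1/\sqrt{N}$ normalization makes this transform unitary, the matrix $M_{fn}$ can be built explicitly; a small sketch (note that NumPy's own `np.fft` uses the opposite sign convention in the exponent):

```python
import numpy as np

N = 8
n = np.arange(N)
# the DFT matrix M_{fn} = exp(2 pi i f n / N) / sqrt(N) defined above
M = np.exp(2j * np.pi * np.outer(n, n) / N) / np.sqrt(N)

# unitarity: the adjoint is the inverse
print(np.allclose(M.conj().T @ M, np.eye(N)))

# the inverse transform recovers the original vector
x = np.random.randn(N)
X = M @ x
print(np.allclose(M.conj().T @ X, x))
```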
Computing the DFT requires multiplying the data vector by the transform matrix. Finding one element needs $N$ multiplies and adds, and there are $N$ elements, so this appears to be an ${\cal O}(N^2)$ algorithm. Remarkably, and significantly, this is not the case. Notice that the DFT can be split into two sums as follows: $$ \ba X_f &=& \frac{1}{\sqrt{N}}\sum_{n=0}^{N-1} e^{2\pi i f n/N} x_n \\ &=& \frac{1}{\sqrt{N}}\sum_{n=0}^{N/2-1} e^{2\pi i f (2n)/N} x_{2n} + \frac{1}{\sqrt{N}}\sum_{n=0}^{N/2-1} e^{2\pi i f (2n+1)/N} x_{2n+1} \\ &=& \frac{1}{\sqrt{N}}\sum_{n=0}^{N/2-1} e^{2\pi i f (2n)/N} x_{2n} + \frac{e^{2\pi if/N}}{\sqrt{N}} \sum_{n=0}^{N/2-1} e^{2\pi i f (2n)/N} x_{2n+1} \\ &=& \frac{1}{\sqrt{N}}\sum_{n=0}^{N/2-1} e^{2\pi ifn/(N/2)} x_{2n} + \frac{e^{2\pi if/N}}{\sqrt{N}} \sum_{n=0}^{N/2-1} e^{2\pi ifn/(N/2)} x_{2n+1} \\ &=& X_f^{even} + e^{2\pi if/N} X_f^{odd} ~~~~. \ea $$ Instead of one $N$-point transform we've broken it into two $N/2$-point transforms, one on the even points and one on the odd ones. This requires ${\cal O}[(N/2)^2]+{\cal O}[(N/2)^2]={\cal O}(N^2/2)$ steps to do the transforms and one final multiplication and addition to combine each element, instead of the original ${\cal O}(N^2)$ steps. The even and odd transforms can likewise be split, and so forth, until we've broken the calculation into $N$ single-point transforms. Reassembling each of them through the hierarchical factoring (in a butterfly) takes $\log_2 N$ adds and multiplies, for a total of ${\cal O}(N\log_2 N)$ steps. If $N=10^8$, doing this requires ${\cal O}(10^9)$ steps (about a second at 1 GFlop), versus ${\cal O}(10^{16})$ operations for the DFT (a few months at a GFlop). Quite a savings! The modern incarnation of this clever idea is called the Fast Fourier Transform (FFT) and is associated with Cooley and Tukey {Cooley:65}, but it has a long history dating all the way back to Gauss in 1805.
It is an example of the powerful algorithm design principle of divide-and-conquer: if you can't solve a difficult problem, split it into successively smaller problems until they can be solved and then recombine them to find the answer {Aho:74}.
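The recursive even/odd split can be sketched in a few lines; this version uses NumPy's unnormalized, negative-exponent convention so it can be compared directly against `np.fft.fft` (the length must be a power of 2):

```python
import numpy as np

def fft(x):
    # divide: split the transform into even- and odd-indexed halves
    N = len(x)
    if N == 1:
        return x
    even = fft(x[0::2])
    odd = fft(x[1::2])
    # conquer: combine the halves with the twiddle factors e^{-2 pi i f / N}
    w = np.exp(-2j * np.pi * np.arange(N // 2) / N)
    return np.concatenate([even + w * odd,    # frequencies 0 .. N/2-1
                           even - w * odd])   # frequencies N/2 .. N-1

x = np.random.randn(16)
print(np.allclose(fft(x), np.fft.fft(x)))
```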
The clarity of the FFT implementation hides many subtleties in its application {Oppenheim:09}. The highest frequency possible in a DFT is $f=1/2$; beyond that the $2\pi$ periodicity of the exponential will wrap still higher components in $x_n$ onto lower frequencies. This is the phenomenon of aliasing, and it requires that a signal be sampled at more than twice the highest frequency of interest (the Nyquist rate). And since the transform is done over a finite time it is equivalent to transforming an infinite series multiplied by a finite-length pulse. Because multiplication in the time domain is equal to convolution in the frequency domain, and the Fourier transform of a pulse is a sinc function $\sin(2\pi f\Delta T)/(\pi f)$, sharp features in the transform get spread out by the finite window and spurious side-lobes appear. There are many other ways to window data with weighting functions other than a rectangular step, in order to optimize desired attributes such as spectral resolution, sidelobe suppression, or phase uniformity. Finally, remember that the discrete sampling of the spectrum done by the DFT can miss important features that lie between the points of the transform.
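The effect of the finite window can be seen by transforming a sinusoid whose frequency falls between DFT bins; a small sketch comparing the implicit rectangular window with a Hann window:

```python
import numpy as np

N = 256
t = np.arange(N)
x = np.sin(2 * np.pi * 10.5 * t / N)   # 10.5 cycles: between DFT bins

rect = np.abs(np.fft.rfft(x))                  # implicit rectangular window
hann = np.abs(np.fft.rfft(x * np.hanning(N)))  # Hann-windowed

# the rectangular window leaks far more energy into distant side-lobes;
# the Hann window trades a slightly wider main lobe for lower leakage
print(rect[50] > 10 * hann[50])
```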
The FFT is one of the most important algorithms in all of numerical mathematics. Beyond the many applications we've already seen for Fourier transforms it crops up in places where you might not expect it, such as speeding up the multiplication of two long numbers (which is really just a convolution {Knuth:97}). When Cooley (then at IBM) first presented the FFT, IBM concluded that it was so significant it should be put in the public domain to prevent anyone from trying to patent it, and so it was published openly. Ironically, its very success has made this kind of behavior less common now.
Example¶
- Dual-Tone Multi-Frequency (DTMF) phone signals
import numpy as np
from IPython.display import Audio, display
digits = "1234567890"
rate = 44100
def DTMF_tone(digit,duration,rate,amplitude):
    frequencies = {
        '1':(697,1209),'2':(697,1336),'3':(697,1477),
        '4':(770,1209),'5':(770,1336),'6':(770,1477),
        '7':(852,1209),'8':(852,1336),'9':(852,1477),
        '*':(941,1209),'0':(941,1336),'#':(941,1477)}
    low,high = frequencies[digit]
    t = np.linspace(0,duration,int(rate*duration),endpoint=False)
    tone = amplitude*(np.sin(2*np.pi*low*t)+np.sin(2*np.pi*high*t))
    return tone
def generate_DTMF(digits,duration=0.2,silence=0.1,rate=44100,amplitude=0.5):
    data = np.array([])
    gap = np.zeros(int(rate*silence)) # silent gap between tones
    for digit in digits:
        tone = DTMF_tone(digit,duration,rate,amplitude)
        data = np.concatenate((data,tone,gap))
    return data
data = generate_DTMF(digits,rate=rate)
data = data/np.max(np.abs(data))*(2**15-1)
data = data.astype(np.int16)
display(Audio(data,rate=rate))
import matplotlib.pyplot as plt
plt.plot(data)
plt.title("audio")
plt.show()
plt.plot(data[24000:25000])
plt.title("waveform")
plt.show()
import matplotlib.pyplot as plt
import numpy as np
fft = np.fft.fft(data)
frequencies = np.fft.fftfreq(len(data),d=1/rate)
plt.plot(frequencies,abs(fft))
plt.xlim(0,2000)
plt.title('FFT')
plt.xlabel('frequency (Hz)')
plt.ylabel('magnitude')
plt.show()
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.mlab as mlab
plt.specgram(data,Fs=rate,scale='linear',cmap='inferno',window=mlab.window_none)
plt.title("spectrogram")
plt.xlabel('time (s)')
plt.ylabel('frequency (Hz)')
plt.ylim(0,2000)
plt.show()
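Going the other way, the FFT can be used to decode a digit by locating the two spectral peaks; a minimal sketch, using the same keypad frequencies as above (the digit '5' here is just an example):

```python
import numpy as np

rate = 44100
rows = [697, 770, 852, 941]        # DTMF row frequencies (Hz)
cols = [1209, 1336, 1477]          # DTMF column frequencies (Hz)
keypad = [['1','2','3'],['4','5','6'],['7','8','9'],['*','0','#']]

# synthesize a single tone, here for the digit '5' (770 Hz + 1336 Hz)
t = np.linspace(0, 0.2, int(rate*0.2), endpoint=False)
tone = np.sin(2*np.pi*770*t) + np.sin(2*np.pi*1336*t)

# decode by finding the strongest row and column frequencies in the spectrum
spectrum = np.abs(np.fft.rfft(tone))
freqs = np.fft.rfftfreq(len(tone), d=1/rate)
strength = lambda f: spectrum[np.argmin(np.abs(freqs - f))]
row = max(rows, key=strength)
col = max(cols, key=strength)
print(keypad[rows.index(row)][cols.index(col)])   # -> 5
```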
Filters¶
- sampling
  - aliasing
    - sampling too slowly, so that high and low frequencies appear the same
  - Nyquist sampling rate
    - twice the highest frequency to prevent aliasing
- digital filter
  - input $x_n$, output $y_n$, time steps $n$
  - $y_n = \sum_{i=0}^M b_i x_{n-i} - \sum_{i=1}^N a_i y_{n-i}$
- common types
  - low-pass: frequencies below a cutoff
  - high-pass: frequencies above a cutoff
  - band-pass: frequencies in a band
- common uses
  - reducing noise
  - removing unwanted signals, varying baselines
  - isolating features of interest
- filter bank
  - an array of band-pass filters used for feature detection
- Butterworth filters
  - designed to have a flat response in the pass band
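The difference equation above can be implemented directly; a sketch checked against SciPy's `lfilter`, which uses the same convention with $a_0=1$:

```python
import numpy as np
from scipy.signal import lfilter

def apply_filter(b, a, x):
    # y_n = sum_i b_i x_{n-i} - sum_{i>=1} a_i y_{n-i}  (with a_0 = 1)
    y = np.zeros(len(x))
    for n in range(len(x)):
        for i in range(len(b)):
            if n - i >= 0:
                y[n] += b[i] * x[n - i]
        for i in range(1, len(a)):
            if n - i >= 0:
                y[n] -= a[i] * y[n - i]
    return y

b = [0.5]          # feed-forward coefficients
a = [1.0, -0.5]    # feedback coefficients: a simple one-pole low-pass
x = np.random.randn(100)
print(np.allclose(apply_filter(b, a, x), lfilter(b, a, x)))
```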
Example¶
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import butter,filtfilt
#
# generate data for filtering example and plot
#
npts = 1000
noise = 0.2
rate = 20000
order = 5
np.random.seed(50)
x = np.linspace(-10,10,npts)
data = np.tanh(x)+np.random.normal(0,noise,npts)
#
plt.plot(data)
plt.title('time series')
plt.show()
#
fft = np.fft.rfft(data)
frequencies = np.fft.rfftfreq(len(data),d=1/rate)
plt.plot(frequencies,np.abs(fft))
plt.title('time series FFT')
plt.xlabel('frequency (Hz)')
plt.ylabel('magnitude')
plt.xscale('log')
plt.yscale('log')
plt.xlim(10,5000)
plt.show()
#
# filter and plot
#
low = 500
b,a = butter(order,low,btype='low',fs=rate,output='ba',analog=False)
filtlow = filtfilt(b,a,data)
plt.plot(filtlow)
plt.title('low pass time series')
plt.show()
#
fft = np.fft.rfft(filtlow)
frequencies = np.fft.rfftfreq(len(filtlow),d=1/rate)
plt.plot(frequencies,np.abs(fft))
plt.title('low pass FFT')
plt.xlabel('frequency (Hz)')
plt.ylabel('magnitude')
plt.xscale('log')
plt.yscale('log')
plt.xlim(10,5000)
plt.show()
#
high = 500
b,a = butter(order,high,btype='high',fs=rate,output='ba',analog=False)
filthigh = filtfilt(b,a,data)
plt.plot(filthigh)
plt.title('high pass time series')
plt.show()
#
fft = np.fft.rfft(filthigh)
frequencies = np.fft.rfftfreq(len(filthigh),d=1/rate)
plt.plot(frequencies,np.abs(fft))
plt.title('high pass FFT')
plt.xlabel('frequency (Hz)')
plt.ylabel('magnitude')
plt.xscale('log')
plt.yscale('log')
plt.xlim(10,5000)
plt.show()
#
low = 40
high = 750
b,a = butter(order,[low,high],btype='band',fs=rate,output='ba',analog=False)
filtband = filtfilt(b,a,data)
plt.plot(filtband)
plt.title('band pass time series')
plt.show()
#
fft = np.fft.rfft(filtband)
frequencies = np.fft.rfftfreq(len(filtband),d=1/rate)
plt.plot(frequencies,np.abs(fft))
plt.title('band pass FFT')
plt.xlabel('frequency (Hz)')
plt.ylabel('magnitude')
plt.xscale('log')
plt.yscale('log')
plt.xlim(10,5000)
plt.show()
Wavelets¶
Figure: Division of time-frequency spaces by the coefficients of discrete transforms.
Wavelets are families of orthogonal transformations that generalize Fourier transforms in a very important way by introducing locality (Figure {wavelets}). Trigonometric functions are defined everywhere. This makes them good at describing global properties, such as the frequency of a signal, but very bad at describing locally varying properties. On the other hand, a time series represents a signal as a series of local impulses, which have an infinite spectrum of Fourier coefficients. A sine wave is most conveniently expressed in the frequency domain, and a step function is much more naturally defined in the time domain. In between these extremes lie most signals of interest, for which neither a global nor a local representation is best. A short-time Fourier transform (STFT) tries to bridge the extremes by transforming short windows of data, but it has the problem that a single window size must serve all frequencies: low-frequency estimates need big windows to be meaningful, while high-frequency estimates need small windows to be relevant. Wavelets provide exactly this happy compromise, retaining a useful notion of both location and frequency.
Wavelets can be understood as a hierarchical filter bank, shown in Figure {filters}. A signal is applied to two filters, one passing the high-frequency part of the signal and the other passing the low-frequency part. Then, the low-frequency part goes through a pair of filters, separating it into a new high-frequency component and an even lower-frequency one. This procedure is continued until the signals at the bottom are left with a single point. Since we don't want the transform to throw away information unless we explicitly decide to, each of these steps is done invertibly.
Figure: Interpretation of the wavelet transform as a hierarchical filter bank.
The earliest wavelets were based on expanding a function in terms of rectangular steps, the Haar wavelets {Haar:10}. This is usually a very poor approximation; we will instead start with the Daubechies wavelets, which are among the simplest but still most important families {Daubechies:88}. Given a record of $N$ points $x_n$, the first step is to write down a linear filter $$ y_n = \sum_{i=0}^{M-1} b_i x_{n-i} $$ that is zero for "smooth" signals. To design it we certainly want it to vanish for a constant, so that (taking the order $M=4$ for example) $$ b_0\cdot 1 + b_1\cdot 1 + b_2\cdot 1 + b_3\cdot 1= 0 ~~~~. $$ The next thing that we could ask for is that it vanish for a linear ramp $$ b_0\cdot 0 + b_1\cdot 1 + b_2\cdot 2 + b_3\cdot 3 = 0 ~~~~. $$ Since this is a linear filter it will then vanish for any $x = \alpha n + \beta$. It will turn out that for a fourth-order wavelet this is all that we can do; given the other constraints to be included six terms will be needed if we want it to vanish for a quadratic curve, and so forth. Next, we want to define another filter $$ z_n = \sum_{i=0}^{M-1} c_i x_{n-i} $$ that responds exactly oppositely, being large for smooth signals and small for nonsmooth signals. A linear filter is just a convolution of the signal with the filter's coefficients, so the series of the coefficients is the signal that the filter responds maximally to (Chapter {est}). Therefore, if the output of our second filter vanishes when the coefficients of the first one are input to it, it will be as unlike the first one as two linear filters can be. This means that we want $$ \sum_{i=0}^{M-1} c_i b_i = 0 $$ (remember that because a linear filter is a convolution, the associated time series flips the order of the coefficients, and so both have the same index in this sum). A pair of filters with this property are called quadrature mirror filters. For $M=4$ the equation to be solved is $$ c_0 b_0 + c_1 b_1 + c_2 b_2 + c_3 b_3 = 0 ~~~~. 
$$ By inspection, this can be enforced by flipping the order of the coefficients as well as the sign of every other one: $$ b_0 = c_3 ~~~ b_1 = -c_2 ~~~ b_2 = c_1 ~~~ b_3 = -c_0 ~~~~. $$ We now have two filters: one is large for the smooth parts of the signal, and the other for the nonsmooth parts. To apply them to an input vector we can write it as a matrix problem: $$ \left[ \begin{array}{c@{\hspace{8.8pt}}c@{\hspace{8.8pt}}c@{\hspace{8.8pt}}% c@{\hspace{8.8pt}}c@{\hspace{8.8pt}}c@{\hspace{8.8pt}}% c@{\hspace{8.8pt}}c@{\hspace{8.8pt}}c@{\hspace{8.8pt}}% c@{\hspace{8.8pt}}c} c_0 & c_1 & c_2 & c_3 & & & & & & & \\ c_3 & -c_2 & c_1 & -c_0 & & & & & & & \\ & & c_0 & c_1 & c_2 & c_3 & & & & & \\ & & c_3 & -c_2 & c_1 & -c_0 & & & & & \\ & & & & & & \ddots & & & & \\ & & & & & & & c_0 & c_1 & c_2 & c_3 \\ & & & & & & & c_3 & -c_2 & c_1 & -c_0 \\ c_2 & c_3 & & & & & & & & c_0 & c_1 \\ c_1 & -c_0 & & & & & & & & c_3 & -c_2 \end{array} \right] \left( \begin{array}{c} x_0 \\ x_1 \\ \vdots \\ \vdots \\ \vdots \\ \vdots \\ \vdots \\ x_{N-1} \end{array} \right) = \left( \begin{array}{c} x_0^{(1)} \\ w_0^{(1)} \\ x_1^{(1)} \\ w_1^{(1)} \\ \vdots \\ x_{N/2-2}^{(1)} \\ w_{N/2-2}^{(1)} \\ x_{N/2-1}^{(1)} \\ w_{N/2-1}^{(1)} \end{array} \right)\rule[-70pt]{0pt}{80pt} $$ (all empty matrix elements are 0). Such a representation of a moving filter is called a circulant matrix. Periodic boundary conditions were used to wrap around the coefficients, but it's also possible to define special coefficients for the boundaries to avoid that if necessary. I've called the output of the "smooth" filter $x^{(1)}$, and the output of the ``nonsmooth'' filter $w^{(1)}$. Each component has half as many points as the original series. The former is a lower resolution description of the signal, and the latter contains the fine structure that was lost in the smoothing.
It is convenient if the transformation is orthogonal so that the inverse is just the transpose. Requiring that the matrix times its transpose results in the identity matrix gives two nontrivial equations $$ c_0^2+c_1^2+c_2^2+c_3^2 = 1 $$ $$ c_2c_0+c_3c_1 = 0 ~~~~. $$ We also had two equations for the filter $$ c_3-c_2+c_1-c_0 = 0 $$ $$ -c_2+2c_1-3c_0 = 0 $$ (written in terms of the $c$'s instead of the $b$'s). This is four equations in four unknowns, which can be solved to find $$ \ba c_0 = \frac{1+\sqrt{3}}{4\sqrt{2}} &~~~~& c_1 = \frac{3+\sqrt{3}}{4\sqrt{2}} \\ c_2 = \frac{3-\sqrt{3}}{4\sqrt{2}} &~~~~& c_3 = \frac{1-\sqrt{3}}{4\sqrt{2}} ~~~~. \ea $$ Using these coefficients, the transformation can be inverted by using the transpose $$ \left[ \begin{array}{c c c c c c c c c} c_0 & c_3 & & & & & & c_2 & c_1 \\ c_1 & -c_2 & & & & & & c_3 & -c_0 \\ c_2 & c_1 & c_0 & c_3 & & & & & \\ c_3 & -c_0 & c_1 & -c_2 & & & & & \\ & & c_2 & c_1 & c_0 & c_3 & & & \\ & & c_3 & -c_0 & c_1 &-c_2 & & & \\ & & & & \ddots & & & & \\ & & & c_2 & c_1 & c_0 & c_3 & & \\ & & & c_3 & -c_0 & c_1 & -c_2 & & \\ & & & & & c_2 & c_1 & c_0 & c_3 \\ & & & & & c_3 & -c_0 & c_1 & -c_2 \end{array} \right] \left( \begin{array}{c} x_0^{(1)} \\ w_0^{(1)} \\ \vdots \\ \vdots \\ \vdots \\ \vdots \\ \vdots \\ x_{N/2-1}^{(1)} \\ w_{N/2-1}^{(1)} \end{array} \right) = \left( \begin{array}{c} x_0 \\ x_1 \\ \vdots \\ \vdots \\ \vdots \\ \vdots \\ \vdots \\ x_{N-2} \\ x_{N-1} \end{array} \right) $$
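These four conditions on the coefficients are easy to verify numerically:

```python
import numpy as np

# the Daubechies D4 coefficients found above
c = np.array([1 + np.sqrt(3), 3 + np.sqrt(3),
              3 - np.sqrt(3), 1 - np.sqrt(3)]) / (4 * np.sqrt(2))

print(np.isclose(np.sum(c**2), 1))               # normalization
print(np.isclose(c[2]*c[0] + c[3]*c[1], 0))      # orthogonality of shifted rows
print(np.isclose(c[3] - c[2] + c[1] - c[0], 0))  # filter vanishes for a constant
print(np.isclose(-c[2] + 2*c[1] - 3*c[0], 0))    # filter vanishes for a ramp
```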
Let's now multiply the output from the filters by another orthonormal matrix that splits the two types of results: $$ \left[ \begin{array}{c c c c c c c c c c c} 1 & 0 & & & & & & & & & \\ & 0 & 1 & & & & & & & & \\ & & & & & & & 1 & 0 & & \\ & & & & & & & & 0 & 1 & 0 \\ & & & & & \ddots & & & & & \\ 0 & 1 & 0 & & & & & & & & \\ & & 0 & 1 & & & & & & & \\ & & & & & & & & 1 & 0 & \\ & & & & & & & & & 0 & 1 \end{array} \right] \left( \begin{array}{c} x_0^{(1)} \\ w_0^{(1)} \\ x_1^{(1)} \\ w_1^{(1)} \\ \vdots \\ x_{N/2-2}^{(1)} \\ w_{N/2-2}^{(1)} \\ x_{N/2-1}^{(1)} \\ w_{N/2-1}^{(1)} \end{array} \right) $$ $$ = \left( \begin{array}{c} x_0^{(1)} \\ x_1^{(1)} \\ \vdots \\ x_{N/2-2}^{(1)} \\ x_{N/2-1}^{(1)} \\ w_0^{(1)} \\ w_1^{(1)} \\ \vdots \\ w_{N/2-1}^{(1)} \end{array} \right) $$ The first half of the resulting vector is a smoothed version of the original signal at half the time resolution, and the second half contains the details lost in the smoothing. The original series can be recovered by multiplying by the transposes of the two matrices used. We can now go ahead and do the same sequence of operations on the new $x$'s, to give a version at even lower resolution as well as some more ``detail'' coefficients. 
Repeating the filtering and shuffling operations until we're left with just two $x$ values and so can go no further gives the following sequence of coefficient vectors: $$ \begin{array}{c c c c c c c c} x_0 & x_0^{(1)} & x_0^{(1)} & x_0^{(2)} & x_0^{(2)} & & x_0^{(\log_2N-1)} & x_0^{(\log_2N-1)} \\ x_1 & w_0^{(1)} & x_1^{(1)} & w_0^{(2)} & x_1^{(2)} & & w_0^{(\log_2N-1)} & x_1^{(\log_2N-1)} \\ \vdots & x_1^{(1)} & \vdots & x_1^{(2)} & \vdots & & x_1^{(\log_2N-1)} & w_0^{(\log_2N-1)} \\ \vdots & w_1^{(1)} & \vdots & w_1^{(2)} & x_{N/4-1}^{(2)} & & w_1^{(\log_2N-1)} & w_1^{(\log_2N-1)} \\ \vdots & \vdots & \vdots & \vdots & w_0^{(2)} & & w_0^{(\log_2N-2)} & w_0^{(\log_2N-2)} \\ \vdots & \vdots & \vdots & \vdots & w_1^{(2)} & & \vdots & \vdots \\ \vdots & \vdots & \vdots & x_{N/4-1}^{(2)} & \vdots & \cdots & \vdots & \vdots \\ \vdots & \vdots & x_{N/2-1}^{(1)} & w_{N/4-1}^{(2)} & w_{N/4-1}^{(2)} & & \vdots & \vdots \\ \vdots & \vdots & w_0^{(1)} & w_0^{(1)} & w_0^{(1)} & & \vdots & \vdots \\ \vdots & \vdots & w_1^{(1)} & w_1^{(1)} & w_1^{(1)} & & \vdots & \vdots \\ \vdots & x_{N/2-1}^{(1)} & \vdots & \vdots & \vdots & & \vdots & \vdots \\ x_{N-1} & w_{N/2-1}^{(1)} & w_{N/2-1}^{(1)} & w_{N/2-1}^{(1)} & w_{N/2-1}^{(1)} & & w_{N/2-1}^{(1)} & w_{N/2-1}^{(1)} \end{array} $$
This defines the Discrete Wavelet Transformation (DWT), and the final $w$'s are the wavelet coefficients. They represent structure at many scales as well as at many locations. If any of the wavelet coefficients are small they can be set to zero to approximate the original series with less information, but the beauty of this kind of compression is that it can find important regions in the time-frequency space rather than projecting all of the information onto the frequency axis (as done by an FFT) or the time axis (by impulses).
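A sketch of one level of this transform, building the circulant filter matrix for a short periodic signal and checking that the transpose inverts it:

```python
import numpy as np

# Daubechies D4 smoothing coefficients and the quadrature mirror filter
c = np.array([1 + np.sqrt(3), 3 + np.sqrt(3),
              3 - np.sqrt(3), 1 - np.sqrt(3)]) / (4 * np.sqrt(2))
b = np.array([c[3], -c[2], c[1], -c[0]])

# one level of the wavelet transform as a circulant matrix,
# with periodic wrap-around of the coefficients
N = 16
W = np.zeros((N, N))
for k in range(N // 2):
    for i in range(4):
        W[2*k,   (2*k + i) % N] = c[i]   # "smooth" row
        W[2*k+1, (2*k + i) % N] = b[i]   # "nonsmooth" (detail) row

x = np.sin(2 * np.pi * np.arange(N) / N)
y = W @ x

print(np.allclose(W.T @ W, np.eye(N)))   # the transform is orthogonal
print(np.allclose(W.T @ y, x))           # the transpose inverts it
```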
A sine wave looks like, well, a sine wave. What does a wavelet look like? We can find out by setting one of the wavelet coefficients to 1 and all the others to 0, and then running the inverse wavelet transform back to find the $x$ series that produces it (just as inverting a Fourier transform of an impulse gives a sinusoidal function). Problem {trans-wave} shows that this results in quite a curious looking function. To understand it, consider that after one pass of the smoothing filter, $$ x^{(1)}_n = c_0 x_{2n} + c_1 x_{2n+1} + c_2 x_{2n+2} + c_3 x_{2n+3} ~~~~. $$ If a function exists that satisfies $$ X_n = c_0 X_{2n} + c_1 X_{2n+1} + c_2 X_{2n+2} + c_3 X_{2n+3} $$ then it will be unchanged by the smoothing (this is a *dilation equation*, instead of a difference or differential equation). The associated wavelet function is $$ W_n = c_3 X_{2n} - c_2 X_{2n+1} + c_1 X_{2n+2} - c_0 X_{2n+3} ~~~~. $$ In the limit of many iterations, so that $n$ approaches a continuous variable, these are the basis functions that are invariant under the transformation. Remarkably, in the continuum limit these apparently innocent and certainly useful functions are very complicated, not even differentiable.
We've been looking at fourth-order wavelets; higher orders are similarly defined. For each two additional coefficients used it's possible to go to one higher derivative of the function that can be matched. Beyond order 6, the coefficients must be found numerically. Our wavelets also have had compact support (they are zero everywhere except for where they are defined); this is convenient numerically but can be relaxed in order to get other benefits such as the analytical form and simple spectrum of the harmonic wavelets {Newland:94}.
Just as a high-dimensional Fourier transform can be done by transforming each axis in turn, wavelets can be extended to higher dimensions by transforming each axis separately {Press:07}. This restricts the wavelets to the axes of the space; it is also possible to define more general multi-dimensional wavelets. The state of the art in wavelets has advanced rapidly since their introduction; see for example {Chui:94}. Beyond wavelets there are other time-frequency transforms, such as Wigner functions {Hlawatsch:92}, which first arose as a probabilistic representation of quantum mechanics for studying semi-classical systems {Balazs:84}.
Principal Components¶
Wavelets were constructed based on the assumption that time and frequency are the interesting axes against which a signal can be viewed. This certainly need not be true, and doesn't even apply to a set of measurements that have no particular temporal or spatial ordering. Rather than designing one transform to apply to all data we might hope to do better by customizing a transform to provide the best representation for a given data set (where "best" of course will reflect some combination of what we hope to achieve and what we know how to accomplish).
Let's once again let $\ve{x}$ be a measurement vector, and $\ve{y} = \mat{M}\ve{x}$ be a transformation to a new set of variables with more desirable properties. A reasonable definition of "best" is to ask that the covariance matrix of $\ve{y}$ be diagonal, so that each of its elements is uncorrelated. To find the required transformation, the covariance matrix of $\ve{y}$ can be related to that of $\ve{x}$: $$ \ba \mat{C}_y &=& \la (\ve{y}-\la\ve{y}\ra)(\ve{y}-\la\ve{y}\ra)^T \ra \\ &=& \la [\mat{M}(\ve{x}-\la\ve{x}\ra)] [\mat{M}(\ve{x}-\la\ve{x}\ra)]^T \ra \\ &=& \la [\mat{M}(\ve{x}-\la\ve{x}\ra)] [(\ve{x}-\la\ve{x}\ra)^T\mat{M}^T] \ra \\ &=& \mat{M}\la(\ve{x}-\la\ve{x}\ra)(\ve{x}-\la\ve{x}\ra)^T\ra \mat{M}^T \\ &=& \mat{M}\mat{C}_x\mat{M}^T ~~~~. \ea $$ Because $\mat{C}_x$ is a real symmetric matrix it's possible to find an orthonormal set of eigenvectors {Golub:96}. Now consider what happens if the columns of $\mat{M}^T$ are taken to be these eigenvectors. After multiplication by $\mat{C}_x$ each eigenvector is returned multiplied by its corresponding eigenvalue. Then because of the orthonormality, the multiplication of this matrix by $\mat{M}$ gives zeros off-diagonal, and returns the values of the eigenvalues on the diagonal. Therefore $\mat{C}_y$ is a diagonal matrix as desired, with the eigenvalues measuring the explained variance of the principal axes. If there are linear correlations among the elements of $\ve{x}$ then some of the eigenvalues will vanish; these components of $\ve{y}$ can be dropped from subsequent analysis. For real data sets the elements might not be exactly equal to zero, but the relative magnitudes of them let the important components be found and the less important ones be ignored. Such variable subset selection is frequently the key to successful modeling.
Use of the covariance matrix of a set of measurements to find a transformation to new variables that are uncorrelated is called Principal Components Analysis (PCA). If the scale of the data components varies widely it's standard to standardize the data, by subtracting off the mean and dividing each dimension by its standard deviation, so that they all have the same scale.
Example¶
import numpy as np
eps = 1e-10
npts = 1000000
np.set_printoptions(precision=3,suppress=True)
x = np.random.randn(1,npts) # generate random points
x = np.concatenate((x,np.random.randn(1,npts))) # add random points in a second dimension
x = np.concatenate((x,np.reshape(x[0,:]+x[1,:],(1,npts)))) # create a third dimension adding the other two
Cx = np.cov(x) # find the covariance
print('covariance of x:')
print(Cx)
v,Mt = np.linalg.eig(Cx) # find the eigenvalues and eigenvectors
print('eigenvalues of covariance of x:')
print(v)
M = np.transpose(Mt)
index = (v > eps)
M = M[index,:] # threshold dimensions based on the eigenvalues
y = np.matmul(M,x) # map the data onto the principal components
Cy = np.cov(y) # find the new covariance
print('covariance of y:')
print(Cy)
covariance of x:
[[1.    0.001 1.001]
 [0.001 1.002 1.003]
 [1.001 1.003 2.005]]
eigenvalues of covariance of x:
[ 3.007  1.    -0.   ]
covariance of y:
[[ 3.007 -0.   ]
 [-0.     1.   ]]
In the related Karhunen-Loève Transform (KLT) {Fukunaga:90}, a measurement vector $\ve{y}$ (such as a time series, or the values of the pixels in an image) is expanded in a sum over orthonormal basis vectors $\ve{\phi}_i$ with expansion coefficients $x_i$, $$ \ve{y} = \sum_i x_i \ve{\phi}_i ~~~~. $$ Given an ensemble of measurements of $\ve{y}$, the goal is to choose a set of $\ve{\phi}_i$ that make the $x_i$'s as independent as possible. Defining $\mat{M}$ to be a matrix that has the $\ve{\phi}_i$ as column vectors, the expansion of $\ve{y}$ can be written as $\ve{y}=\mat{M}\ve{x}$, where $\ve{x}$ is a column vector of the expansion coefficients. We've already seen that the covariance matrices of $\ve{y}$ and $\ve{x}$ are related by $$ \mat{C}_y = \mat{M}\mat{C}_x\mat{M}^T $$ or $$ \mat{M}^T\mat{C}_y\mat{M} = \mat{C}_x ~~~~. $$ Therefore if we choose the $\ve{\phi}_i$ to be the eigenvectors of $\mat{C}_y$ then $\mat{C}_x$ will be diagonal. Since the $\ve{\phi}_i$ are orthonormal, given a new measurement $\ve{y}$ the expansion coefficients can be found from $$ \ve{y}\ve{\phi}_j = \sum_i x_i \ve{\phi}_i\ve{\phi}_j = \sum_i x_i \delta_{ij} = x_j ~~~~. $$ This provides a convenient way to do lossy compression for storage or communications, by using only the significant coefficients to partially reconstruct a data vector from the bases.
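A sketch of this kind of compression, with synthetic 4-dimensional measurements that actually have only two independent degrees of freedom (the mixing matrix here is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
s = rng.normal(size=(2, 5000))
A = np.array([[1.0, 0.5], [0.2, 1.0], [0.7, 0.7], [0.3, 0.9]])
y = A @ s                        # 4D data with only 2D structure

Cy = np.cov(y)
vals, phi = np.linalg.eigh(Cy)   # columns of phi are the basis vectors
order = np.argsort(vals)[::-1]
phi = phi[:, order[:2]]          # keep the two largest eigenvalues

x = phi.T @ y                    # expansion coefficients by projection
y_hat = phi @ x                  # reconstruction from only 2 coefficients
print(np.allclose(y, y_hat))     # the discarded modes carried no variance
```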
Independent Components¶
PCA starts with the covariance matrix of all of the original variables and then throws out the insignificant components. Factor Analysis directly seeks a smaller set of variables that can explain the covariance structure of the observations {Hair:98}. Independent Components Analysis (ICA) goes further to search for a transformation that makes the new variables independent $(p(y_i,y_j)=p(y_i)p(y_j))$ rather than just uncorrelated {Comon:94}{Bell:95}{Hyvarinen:04}.
In the blind source separation problem, unknown sources $\vec s$ are mixed in observations $\vec x$ by an unknown matrix, $\vec x = \mat A \vec s$. If weights $\mat W = \mat A^{-1}$ could be found then the sources could be separated by $\vec s = \mat W \vec x$, but how can this be done without information about either $\mat A$ or $\vec s$? The surprising answer is that this can be possible as long as the sources are non-Gaussian. In Section {stoch-clt} we saw that a sum of random variables approaches a Gaussian distribution for almost any distribution of the variables. Conversely, as long as the distributions do not start out as Gaussians, the departure from Gaussianity can be used as a signature for separating them. This is the basis for ICA.
Gaussianity can be tested with the kurtosis (4th cumulant), but because that raises variables to the fourth power it's sensitive to outliers. Entropy is another test (Problem {maxent} showed that the entropy of a Gaussian is maximal among continuous distributions with a given variance), but that requires an estimate of the probability distribution function. A simpler approximation is to use a contrast function {Hyvarinen:99}. For a vector of weights $\vec w$ corresponding to one component of $\vec s$ (i.e., one row of $\mat W$), this approach seeks to maximize the expected value of a function relative to its value for a Gaussian, $\max \langle f(\vec w \vec x) \rangle$, where the observations are used to evaluate the expectation. A convenient example is $f(u) = \log \cosh (u)$ {Hyvarinen:00}.
Because ICA can't determine absolute scale factors in $\mat W$, it's conventional in ICA to sphere the data with PCA, so that it has a diagonal covariance matrix with unit variances. The weight vectors are then taken to have $|\vec w|^2=1$ to reflect that undetermined norm. To find them we want to make extremal $$ F = \langle f(\vec w \vec x) \rangle - \lambda \vec w \vec w $$ with the Lagrange multiplier enforcing the normalization. Taking the gradient, $$ \ps{F}{\vec w} = \langle \vec x f'(\vec w \vec x) \rangle - \lambda \vec w ~~~~. $$ The Jacobian can be approximated by $$ \ba \pp{F}{\vec w} &=& \langle \vec x \vec{x}^T f''(\vec w \vec x) \rangle - \lambda \mat I \\ &\approx& \langle \vec x \vec{x}^T \rangle \langle f''(\vec w \vec x) \rangle - \lambda \mat I \\ &\approx& \left( \langle f''(\vec w \vec x) \rangle - \lambda \right) \mat I \ea $$ because of the initial diagonalization of the covariance matrix. A Newton root-finding step subtracts the gradient divided by the Jacobian: $$ \vec w \leftarrow \vec w - \left( \langle \vec x f'(\vec w \vec x) \rangle - \lambda \vec w \right) / \left( \langle f''(\vec w \vec x) \rangle - \lambda \right) ~~~~. $$ Multiplying both sides by $\left( \lambda - \langle f''(\vec w \vec x) \rangle \right)$, $$ \vec w \left( \lambda - \langle f''(\vec w \vec x) \rangle \right) \leftarrow \vec w \left( \lambda - \langle f''(\vec w \vec x) \rangle \right) + \left( \langle \vec x f'(\vec w \vec x) \rangle - \lambda \vec w \right) ~~~~. $$ Since we'll be normalizing the weights, the multiplicative factor on the left side can be dropped and the right side simplified: $$ \vec w \leftarrow \langle \vec x f'(\vec w \vec x) \rangle - \vec w \langle f''(\vec w \vec x) \rangle ~~~~. $$ After this step the normalization is restored with $$ \vec w \leftarrow \frac{\vec w}{\left| \vec w \right|} $$ (this corresponds to choosing the Lagrange multiplier). Starting with random normalized weights, this iteration seeks the least-Gaussian mixture.
Multiple weights can either be found by orthogonalizing serially to span the remaining subspace, or be calculated jointly {Hyvarinen:99}. This is a global linear ICA transformation; it can also be done nonlinearly and locally {Hyvarinen:04}.
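The iteration derived above can be sketched end to end on a toy mixture. This is a minimal illustration with $f(u) = \log\cosh(u)$, so $f'(u) = \tanh(u)$ and $f''(u) = 1 - \tanh^2(u)$; the sources, mixing matrix, and convergence tolerance are arbitrary choices, not part of the method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two non-Gaussian sources: a sine wave and uniform noise
n = 5000
time = np.linspace(0, 8, n)
S = np.c_[np.sin(2 * np.pi * time), rng.uniform(-1, 1, n)]

# Mix them with an "unknown" matrix A to form the observations
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])
X = S @ A.T

# Sphere the data with PCA: zero mean, identity covariance
X = X - X.mean(axis=0)
vals, vecs = np.linalg.eigh(X.T @ X / n)
Z = (X @ vecs) / np.sqrt(vals)

# Fixed-point iteration for one weight vector:
# w <- <x f'(w x)> - w <f''(w x)>, then renormalize
w = rng.standard_normal(2)
w /= np.linalg.norm(w)
for _ in range(100):
    u = np.tanh(Z @ w)
    w_new = (Z * u[:, None]).mean(axis=0) - w * (1 - u**2).mean()
    w_new /= np.linalg.norm(w_new)
    if abs(abs(w_new @ w) - 1) < 1e-12:  # converged (up to a sign flip)
        w = w_new
        break
    w = w_new

# The recovered component matches one source up to sign and scale
s_est = Z @ w
```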
In Machine Learning we'll see another way to accomplish something similar, with a neural network autoencoder.
Compressed Sensing¶
- usually measure, then compress
- can instead compress, then measure
- sparsity is a kind of universal prior
- example: total variation (TV)
- fitting error alone is under-constrained
- L2 norm: sum of squares
- match the samples and minimize the L2 norm: pseudo-inverse
- L0 norm: number of nonzero elements
- match the samples and minimize the L0 norm: combinatorial, NP-hard
- L1 norm: sum of absolute values
- match the samples and minimize the L1 norm: compressed sensing
Figure: Constant L1 and L2 norms.
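The difference between the L2 and L1 penalties can be seen in a small numerical sketch. A signal that is sparse in the DCT basis is randomly undersampled; the minimum-L2 (pseudo-inverse) solution matches the samples but smears the coefficients, while an L1-penalized fit recovers the spikes. The sampling fraction, penalty weight, and the simple iterative soft-thresholding solver are illustrative choices (better algorithms exist for this minimization):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 256, 40  # signal length and number of random samples

# Orthogonal DCT-II matrix; its inverse is its transpose
i = np.arange(N)[:, None]
j = np.arange(N)[None, :]
D = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * j + 1) * i / (2 * N))
D[0, :] = np.sqrt(1.0 / N)

# A signal that is sparse in the DCT basis: two nonzero coefficients
f_true = np.zeros(N)
f_true[10] = f_true[30] = 1.0
t = D.T @ f_true  # the time series

# Keep a random subset of M of the N samples
idx = rng.choice(N, M, replace=False)
B = D.T[idx]  # maps DCT coefficients to the sampled values
y = t[idx]

# L2: the minimum-norm (pseudo-inverse) solution matches the samples
# but spreads energy over all of the coefficients
f_l2 = np.linalg.pinv(B) @ y

# L1: iterative soft thresholding for min (1/2)||y - B f||^2 + alpha ||f||_1
# (a unit step is safe here because the rows of B are orthonormal)
alpha = 0.01
f_l1 = np.zeros(N)
for _ in range(2000):
    g = f_l1 + B.T @ (y - B @ f_l1)                         # gradient step
    f_l1 = np.sign(g) * np.maximum(np.abs(g) - alpha, 0.0)  # shrink

# f_l1 recovers the two spikes; f_l2 does not
```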
Fornasier, Massimo, & Rauhut, Holger. (2011). Compressive sensing. Pages 187–228 of: Handbook of Mathematical Methods in Imaging. Springer.
Donoho, David L. (2006). Compressed sensing. IEEE Transactions on Information Theory, 52(4), 1289–1306.
Candes, Emmanuel J., & Tao, Terence. (2006). Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Transactions on Information Theory, 52(12), 5406–5425.
Candes, Emmanuel J., Romberg, Justin, & Tao, Terence. (2006). Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2), 489–509.
Candes, Emmanuel J., Romberg, Justin K., & Tao, Terence. (2006). Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8), 1207–1223.
Kim, Seung-Jean, Koh, Kwangmoo, Lustig, Michael, & Boyd, Stephen. (2007). An efficient method for compressed sensing. Pages III-117 of: 2007 IEEE International Conference on Image Processing (ICIP), vol. 3. IEEE.
Yang, Allen Y., Sastry, Shankar S., Ganesh, Arvind, & Ma, Yi. (2010). Fast l1-minimization algorithms and an application in robust face recognition: A review. Pages 1849–1852 of: 2010 17th IEEE International Conference on Image Processing (ICIP). IEEE.
Wakin, Michael, Becker, Stephen, Nakamura, Eric, Grant, Michael, Sovero, Emilio, Ching, Daniel, Yoo, Juhwan, Romberg, Justin, Emami-Neyestanak, Azita, & Candes, Emmanuel. (2012). A nonuniform sampler for wideband spectrally-sparse environments. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2(3), 516–529.
Yoo, Juhwan, Becker, Stephen, Loh, Matthew, Monge, Manuel, Candes, Emmanuel, & Emami-Neyestanak, Azita. (2012). A 100MHz–2GHz 12.5x sub-Nyquist rate receiver in 90nm CMOS. Pages 31–34 of: 2012 IEEE Radio Frequency Integrated Circuits Symposium. IEEE.
Tropp, Joel A., Laska, Jason N., Duarte, Marco F., Romberg, Justin K., & Baraniuk, Richard G. (2009). Beyond Nyquist: Efficient sampling of sparse bandlimited signals. arXiv preprint arXiv:0902.0026.
Data set sources¶
- awesome public datasets
- awesome public real-time datasets
- Kaggle
- data.gov
- UC Irvine
- OpenML
- AWS
- sklearn.datasets
References¶
Golub, Gene H., & Van Loan, Charles F. (1996). {Matrix Computations}. 3rd edn. Baltimore, MD: Johns Hopkins University Press.
Everything you always wanted to know about transformations with matrices.
Fukunaga, Keinosuke (1990). {Introduction to Statistical Pattern Recognition}. 2nd edn. Boston, MA: Academic Press.
Much of the effort in pattern recognition goes into finding good representations.
Problems¶
- Calculate the inverse wavelet transform, using Daubechies fourth-order coefficients, of a vector of length $2^{12}$, with a 1 in the 5th and 30th places and zeros elsewhere.
- Choose a data set, perform PCA, plot a low-dimensional projection of the data set, plot the sorted magnitude of the PCA components, and plot the data set projected onto the first few components.
- This problem is harder.
- Generate and plot a time series $\{t_j\}$ for the sum of two sine waves at 697 and 1209 Hz (the DTMF tone for the number 1 key), sampling it at 10,000 samples per second for 0.25 second ($N$ = 2500 points).
- Calculate and plot the Discrete Cosine Transform (DCT) coefficients $\{f_i\}$ for these data, defined by their multiplication by the matrix $f_i = \sum_{j=0}^{N-1}D_{ij}t_j$, where $$ D_{ij} = \left\{ \begin{array}{l} \sqrt{\frac{1}{N}} ~~~~ (i = 0)\\ \sqrt{\frac{2}{N}} \cos\left(\frac{\pi(2j+1)i}{2N}\right) ~~~~ (1 \le i \le N-1) \end{array} \right. $$
- Plot the inverse transform of the $\{f_i\}$ by multiplying them by the inverse of the DCT matrix (which is equal to its transpose) and verify that it matches the time series.
- Randomly sample and plot a subset $\{t_k'\}$ of 5% of the $\{t_j\}$ ($M = 125$ points).
- Starting with a random guess for the DCT coefficients $\{f_i'\}$, use gradient descent to minimize the error at the sample points $$ \min_{\{f_i'\}}\sum_{k=0}^{M-1} \left( t_k' - \sum_{i=0}^{N-1} D_{ik} f_i' \right)^2 $$ and plot the resulting estimated coefficients.
- The preceding minimization is under-constrained; it becomes well-posed if a norm of the DCT coefficients is minimized subject to a constraint of agreeing with the sampled points. One of the simplest (but not best, see Chapter {conopt}) ways to do this is by adding a penalty term to the minimization. Repeat the gradient descent minimization using the L2 norm: $$ \min_{\{f_i'\}}\sum_{k=0}^{M-1} \left( t_k' - \sum_{i=0}^{N-1} D_{ik} f_i' \right)^2 + \sum_{i=0}^{N-1} f_i'^2 $$ and plot the resulting estimated coefficients.
- Repeat the gradient descent minimization using the L1 norm: $$ \min_{\{f_i'\}}\sum_{k=0}^{M-1} \left( t_k' - \sum_{i=0}^{N-1} D_{ik} f_i' \right)^2 + \sum_{i=0}^{N-1} \left|f_i'\right| $$ and plot the resulting estimated coefficients, compare to the L2 norm estimate, and compare $M$ to the Nyquist sampling limit of twice the highest frequency.
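As a starting point for the problems above, the DCT matrix from the stated definition can be built and checked directly. This is a sketch at a small $N$ (the problem itself uses $N = 2500$) that verifies the orthogonality claim and the round trip through the transform:

```python
import numpy as np

N = 16  # small size for a quick check

# Build the DCT matrix from the definition in the problem
i = np.arange(N)[:, None]
j = np.arange(N)[None, :]
D = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * j + 1) * i / (2 * N))
D[0, :] = np.sqrt(1.0 / N)

# The matrix is orthogonal: its inverse is its transpose
print(np.allclose(D @ D.T, np.eye(N)))  # True

# Round trip: forward transform then inverse recovers the time series
t = np.sin(2 * np.pi * np.arange(N) / N)
f = D @ t
print(np.allclose(D.T @ f, t))  # True
```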
(c) Neil Gershenfeld 3/16/26