Neural Networks¶

The Nature of Mathematical Modeling (draft)¶

$ \def\%#1{\mathbf{#1}} \def\mat#1{\mathbf{#1}} \def\*#1{\vec{#1}} \def\ve#1{\vec{#1}} \def\ds#1#2{\cfrac{d #1}{d #2}} \def\dd#1#2{\cfrac{d^2 #1}{d #2^2}} \def\ps#1#2{\cfrac{\partial #1}{\partial #2}} \def\pp#1#2{\cfrac{\p^2 #1}{\p #2^2}} \def\p{\partial} \def\ba{\begin{eqnarray*}} \def\ea{\end{eqnarray*}} \def\eps{\epsilon} \def\del{\nabla} \def\disp{\displaystyle} \def\la{\langle} \def\ra{\rangle} \def\unit#1{~({\rm #1)}} \def\units#1#2{~\left(\frac{\rm #1}{\rm #2}\right)} $

import matplotlib.pyplot as plt
import numpy as np
scale = .6
width = 7*scale
height = 6*scale
linewidth = 2.5
circlesize = 18
pointsize = 5
textsize = 14
fig,ax = plt.subplots(figsize=(width,height))
ax.axis([0,width,0,height])
ax.set_axis_off()
def wire(x0,y0,x1,y1,width):
    x0 = scale*x0
    x1 = scale*x1
    y0 = scale*y0
    y1 = scale*y1
    plt.plot([x0,x1],[y0,y1],'-',linewidth=width,color=(0.6,0.6,0.6))
def point(x,y,size):
    x = scale*x
    y = scale*y
    plt.plot(x,y,'ko',markersize=size)
def circle(x,y,size):
    x = scale*x
    y = scale*y
    plt.plot(x,y,'ko',markersize=size,markeredgewidth=size/8,
        markerfacecolor='white',markeredgecolor='black')
def text(x,y,text,size):
    x = scale*x
    y = scale*y
    ax.text(x,y,text,
        ha='center',va='center',
        math_fontfamily='cm',
        fontsize=size,color='black')
#
wire(.5,.5,1,2,linewidth)
wire(.5,.5,2,2,linewidth)
wire(.5,.5,3,2,linewidth)
wire(.5,.5,4,2,linewidth)
wire(1.5,.5,1,2,linewidth)
wire(1.5,.5,2,2,linewidth)
wire(1.5,.5,3,2,linewidth)
wire(1.5,.5,4,2,linewidth)
wire(2.5,.5,1,2,linewidth)
wire(2.5,.5,2,2,linewidth)
wire(2.5,.5,3,2,linewidth)
wire(2.5,.5,4,2,linewidth)
wire(3.5,.5,1,2,linewidth)
wire(3.5,.5,2,2,linewidth)
wire(3.5,.5,3,2,linewidth)
wire(3.5,.5,4,2,linewidth)
wire(4.5,.5,1,2,linewidth)
wire(4.5,.5,2,2,linewidth)
wire(4.5,.5,3,2,linewidth)
wire(4.5,.5,4,2,linewidth)
#
wire(1.5,4,2,5.5,linewidth)
wire(1.5,4,3,5.5,linewidth)
wire(2.5,4,2,5.5,linewidth)
wire(2.5,4,3,5.5,linewidth)
wire(3.5,4,2,5.5,linewidth)
wire(3.5,4,3,5.5,linewidth)
#
circle(.5,.5,circlesize)
circle(1.5,.5,circlesize)
circle(2.5,.5,circlesize)
circle(3.5,.5,circlesize)
circle(4.5,.5,circlesize)
#
circle(1,2,circlesize)
circle(2,2,circlesize)
circle(3,2,circlesize)
circle(4,2,circlesize)
#
point(2.5,2.6,pointsize)
point(2.5,3,pointsize)
point(2.5,3.4,pointsize)
#
circle(1.5,4,circlesize)
circle(2.5,4,circlesize)
circle(3.5,4,circlesize)
#
circle(2,5.5,circlesize)
circle(3,5.5,circlesize)
#
text(6.2,.5,'input layer',textsize)
text(5.4,1.25,'weights',textsize)
text(6.5,2,'activation functions',textsize)
text(4.2,3,'hidden layers',textsize)
text(4.6,4,'nodes',textsize)
text(5.4,5.5,'output layer, loss',textsize)
#
plt.show()

Figure: A Multi-Layer Perceptron (MLP) Deep Neural Network (DNN)

Taxonomy¶

supervised (regression, classification), unsupervised, reinforcement (reward, next chapter)
introduces hierarchy in models

History¶

neuroscience diverge, converge

Hasson, Uri, Samuel A. Nastase, and Ariel Goldstein. "Direct fit to nature: An evolutionary perspective on biological and artificial neural networks." Neuron 105, no. 3 (2020): 416-434.

Lillicrap, T. P., Santoro, A., Marris, L., Akerman, C. J., & Hinton, G. (2020). Backpropagation and the brain. Nature Reviews Neuroscience, 21(6), 335-346.

model neurons with a threshold function

McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5, 115-133.

Perceptrons apply neuron models

$f(\ve{w}\cdot\ve{x}+b)$

Rosenblatt, F. (1957). The perceptron, a perceiving and recognizing automaton Project Para. Cornell Aeronautical Laboratory.

Minsky, Marvin, and Seymour A. Papert. Perceptrons: An introduction to computational geometry. MIT press, 2017.
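
A minimal sketch of the limitation behind the XOR history that follows, assuming a hard threshold at zero and a coarse, hypothetical grid of candidate weights. A single unit $f(\ve{w}\cdot\ve{x}+b)$ can be found for AND and OR; none is found for XOR, which is expected because XOR is not linearly separable (the failed search illustrates this, it does not prove it).

```python
import itertools
import numpy as np

# the four binary input patterns and three target logic functions
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = {"AND": [0, 0, 0, 1], "OR": [0, 1, 1, 1], "XOR": [0, 1, 1, 0]}

def perceptron(w, b, x):
    """Threshold unit: 1 if w.x + b > 0, else 0."""
    return (x @ w + b > 0).astype(int)

# brute-force search over a coarse grid of weights and biases (illustrative only)
grid = np.arange(-2, 2.5, 0.5)
solutions = {}
for name, t in targets.items():
    solutions[name] = None
    for w1, w2, b in itertools.product(grid, repeat=3):
        if np.array_equal(perceptron(np.array([w1, w2]), b, X), t):
            solutions[name] = (w1, w2, b)
            break

for name, sol in solutions.items():
    print(name, "->", sol if sol is not None else "no single unit found")
```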

XOR history
- AI winter
Hopfield
- https://www.nobelprize.org/prizes/physics/2024/press-release/
Rumelhart

Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. "Learning representations by back-propagating errors." Nature 323, no. 6088 (1986): 533-536.

deep learning
multiple layers $f(g(...))$

LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. "Deep learning." Nature 521, no. 7553 (2015): 436-444.

Functions¶

Breadth vs Depth¶

linear vs nonlinear coefficients in functions
under mild assumptions linear depth vs exponential breadth

one layer polynomial of degree $k$

second layer polynomial of degree $q$

polynomial of a polynomial is degree $kq$, using $k+q$ units

a single layer would need $kq$ units

can't represent all terms, but can represent desired terms

avoids the curse of dimensionality
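
The degree counting above can be checked numerically. This is a sketch with two arbitrary, hypothetical polynomials (the names `inner`, `outer`, and `compose` are illustrative); it composes them by Horner's rule and reports the degree and how many of the possible terms actually appear.

```python
import numpy as np

def compose(outer, inner):
    """Coefficients (highest power first) of outer(inner(x)), by Horner's rule."""
    result = np.array([outer[0]], dtype=float)
    for c in outer[1:]:
        result = np.polyadd(np.polymul(result, inner), [c])
    return result

k, q = 3, 4
inner = np.array([1.0, 0.0, 1.0, 1.0])        # first layer, degree k: x^3 + x + 1
outer = np.array([1.0, 0.0, 0.0, 0.0, 2.0])   # second layer, degree q: x^4 + 2
composed = compose(outer, inner)

degree = len(composed) - 1
nonzero_terms = int(np.count_nonzero(composed))
print(f"degree {degree} = {k}*{q}; {nonzero_terms} of {degree+1} possible terms are nonzero")
```

The composed polynomial reaches degree $kq$, but not every lower-order term is present, which is the sense in which depth represents desired terms rather than all terms.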

Telgarsky, M. (2016, June). Benefits of depth in neural networks. In Conference on learning theory (pp. 1517-1539). PMLR.

Poggio, T., Mhaskar, H., Rosasco, L., Miranda, B., & Liao, Q. (2017). Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review. International Journal of Automation and Computing, 14(5), 503-519.

[Simon:26]

Activation¶

step function, logic, not differentiable
tanh
- regression -1 to 1
sigmoid
- logic 0 to 1
ReLU
- easy to calculate, fixes vanishing activation slope
leaky ReLU
- fixes zero inhibition slope

Nair, Vinod, and Geoffrey E. Hinton. "Rectified linear units improve restricted boltzmann machines." In Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807-814. 2010.

Xu, J., Li, Z., Du, B., Zhang, M., & Liu, J. (2020, July). Reluplex made more practical: Leaky ReLU. In 2020 IEEE Symposium on Computers and Communications (ISCC) (pp. 1-7). IEEE.

import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(-3,3,100)
plt.plot(x,1/(1+np.exp(-x)),label='sigmoid')
plt.plot(x,np.tanh(x),label='tanh')
plt.plot(x,np.where(x < 0,0,x),label='ReLU')
plt.plot(x,np.where(x < 0,0.1*x,x),'--',label='leaky ReLU')
plt.legend()
plt.show()
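
The slope remarks in the list above can be made concrete by evaluating the derivatives of the plotted activation functions. A small sketch; the sample points and the leak of 0.1 (matching the plot) are illustrative choices.

```python
import numpy as np

def sigmoid(x):
    return 1/(1+np.exp(-x))

# analytic slopes of the activation functions
def d_sigmoid(x):
    s = sigmoid(x)
    return s*(1-s)

def d_tanh(x):
    return 1-np.tanh(x)**2

def d_relu(x):
    return np.where(x < 0, 0.0, 1.0)

def d_leaky_relu(x, leak=0.1):
    return np.where(x < 0, leak, 1.0)

x = np.array([-5.0, -1.0, 1.0, 5.0])
for name, slope in [("sigmoid", d_sigmoid), ("tanh", d_tanh),
                    ("ReLU", d_relu), ("leaky ReLU", d_leaky_relu)]:
    print(f"{name:>10}:", np.round(slope(x), 4))
```

The sigmoid and tanh slopes shrink toward zero for large inputs of either sign, the ReLU slope stays at 1 for positive inputs but is 0 for negative ones, and the leaky ReLU keeps a small nonzero slope there.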

Training¶

Preprocessing¶

pre-process data, zero mean unit variance, sphering, standardization, ICA
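
A minimal sketch of standardization and sphering, on hypothetical correlated data (the mixing matrix and offsets are made up for illustration). Standardization rescales each variable separately; sphering also removes the correlations between them.

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical correlated 2D data with nonzero mean and unequal scales
mixing = np.array([[3.0, 1.0], [1.0, 0.5]])
data = rng.normal(size=(1000, 2)) @ mixing + np.array([5.0, -2.0])

# standardization: zero mean and unit variance for each variable
standardized = (data - data.mean(axis=0)) / data.std(axis=0)

# sphering (whitening): rotate onto the covariance eigenvectors and rescale,
# so the variables are also uncorrelated
centered = data - data.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))
sphered = centered @ eigenvectors / np.sqrt(eigenvalues)

print("standardized covariance:\n", np.round(np.cov(standardized, rowvar=False), 2))
print("sphered covariance:\n", np.round(np.cov(sphered, rowvar=False), 2))
```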

Loss¶

mean square error for regression
cross entropy for classification $-\la \sum y_i\log p_i\ra$

$y_i$ = 1 for state $i$, 0 otherwise

$p_i$ is the predicted probability to be in state $i$

vanishes if $p_i=1$ for the true state $i$

one-hot encoding: one 1, all the rest 0
logits: unnormalized probability predictions
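
The pieces above fit together as follows. A sketch assuming a softmax is used to turn logits into probabilities (the function names and the example logits are illustrative): labels are one-hot encoded, and the cross entropy is averaged over samples.

```python
import numpy as np

def softmax(logits):
    """Normalize logits into probabilities (shifted by the row maximum for stability)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

def cross_entropy(logits, labels, classes):
    """Average of -sum_i y_i log p_i over samples, with y one-hot encoded."""
    y = np.eye(classes)[labels]          # one-hot: one 1, all the rest 0
    p = softmax(logits)
    return -np.mean(np.sum(y * np.log(p), axis=1))

labels = np.array([0, 2])
confident = np.array([[20.0, 0.0, 0.0], [0.0, 0.0, 20.0]])   # p close to 1 for the true state
uniform = np.zeros((2, 3))                                   # p = 1/3 for every state
print(f"confident: {cross_entropy(confident, labels, 3):.4f}")
print(f"uniform:   {cross_entropy(uniform, labels, 3):.4f}  (log 3 = {np.log(3):.4f})")
```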

Backpropagation¶

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1985). Learning internal representations by error propagation. California Univ San Diego La Jolla Inst for Cognitive Science.

inputs $x_j$

combine with weights

$$ y_i = \sum_j w_{ij} x_j $$

can add bias for fixed values and to adjust sensitivity

$$ y_i = \sum_j w_{ij} x_j + b_i $$

output through activation function

$$ x_i = f(y_i) $$

hidden layer

$$ \ba x_i &=& f\left[\sum_j w_{ij} f(y_j)\right]\\ x_i &=& f\left[\sum_j w_{ij} f\left[\sum_k w_{jk} x_k\right]\right] \ea $$

hidden layers

$$ \ba x_i &=& f\left[\sum_j w_{ij} f\left[\sum_k w_{jk} f(y_k)\right]\right]\\ x_i &=& f\left[\sum_j w_{ij} f\left[\sum_k w_{jk} f\left[\sum_l w_{kl} x_l\right]\right]\right] \ea $$

loss, data $d$

$$ \chi^2 = \sum_n \sum_i \left[x_{i,n}-d_{i,n}\right]^2 $$

gradient descent, back-propagation

$$ w_{ij} \rightarrow w_{ij} - \alpha \ps{\chi^2}{w_{ij}} $$$$ b_i \rightarrow b_i - \beta \ps{\chi^2}{b_i} $$

last layer

$$ \ba \ps{\chi^2}{w_{ij}} &=& \sum_n \sum_{i'} 2\left(x_{i',n}-d_{i',n}\right) \ps{x_{i',n}}{w_{ij}}\\ &=& \sum_n 2\left(x_{i,n}-d_{i,n}\right) f'\left(y_{i,n}\right) x_{j,n}\\ &\equiv& \sum_n \Delta_{i,n} x_{j,n} \ea $$$$ \ba \ps{\chi^2}{b_i} &=& \sum_n \sum_{i'} 2\left(x_{i',n}-d_{i',n}\right) \ps{x_{i',n}}{b_i}\\ &=& \sum_n 2\left(x_{i,n}-d_{i,n}\right) f'\left(y_{i,n}\right)\\ &\equiv& \sum_n \Delta_{i,n} \ea $$

next layer

$$ \ba \ps{\chi^2}{w_{jk}} &=& \sum_n \sum_i 2\left(x_{i,n}-d_{i,n}\right) \ps{x_{i,n}}{w_{jk}}\\ &=& \sum_n \sum_i 2\left(x_{i,n}-d_{i,n}\right) f'\left(y_{i,n}\right) w_{ij} f'\left(y_{j,n}\right) x_{k,n}\\ &=& \sum_n \sum_i \Delta_{i,n} w_{ij} f'\left(y_{j,n}\right) x_{k,n}\\ &=& \sum_n f'\left(y_{j,n}\right) \sum_i w_{ij} \Delta_{i,n} x_{k,n}\\ &\equiv& \sum_n \Delta_{j,n} x_{k,n} \ea $$$$ \ba \ps{\chi^2}{b_j} &=& \sum_n \sum_i 2\left(x_{i,n}-d_{i,n}\right) \ps{x_{i,n}}{b_j}\\ &=& \sum_n \sum_i 2\left(x_{i,n}-d_{i,n}\right) f'\left(y_{i,n}\right) w_{ij} f'\left(y_{j,n}\right)\\ &=& \sum_n \sum_i \Delta_{i,n} w_{ij} f'\left(y_{j,n}\right)\\ &=& \sum_n f'\left(y_{j,n}\right) \sum_i w_{ij} \Delta_{i,n}\\ &\equiv& \sum_n \Delta_{j,n} \ea $$

next layer

$$ \ba \ps{\chi^2}{w_{kl}} &=& \sum_n \sum_i 2\left(x_{i,n}-d_{i,n}\right) \ps{x_{i,n}}{w_{kl}}\\ &=& \sum_n \sum_i 2\left(x_{i,n}-d_{i,n}\right) f'\left(y_{i,n}\right) \sum_j w_{ij} f'\left(y_{j,n}\right) w_{jk} f'\left(y_{k,n}\right) x_{l,n}\\ &=& \sum_n \sum_i \Delta_{i,n} \sum_j w_{ij} f'\left(y_{j,n}\right) w_{jk} f'\left(y_{k,n}\right) x_{l,n}\\ &=& \sum_n \sum_j \Delta_{j,n} w_{jk} f'\left(y_{k,n}\right) x_{l,n}\\ &=& \sum_n f'\left(y_{k,n}\right) \sum_j w_{jk} \Delta_{j,n} x_{l,n}\\ &\equiv& \sum_n \Delta_{k,n} x_{l,n} \ea $$$$ \ba \ps{\chi^2}{b_k} &=& \sum_n \sum_i 2\left(x_{i,n}-d_{i,n}\right) \ps{x_{i,n}}{b_k}\\ &=& \sum_n \sum_i 2\left(x_{i,n}-d_{i,n}\right) f'\left(y_{i,n}\right) \sum_j w_{ij} f'\left(y_{j,n}\right) w_{jk} f'\left(y_{k,n}\right)\\ &=& \sum_n \sum_i \Delta_{i,n} \sum_j w_{ij} f'\left(y_{j,n}\right) w_{jk} f'\left(y_{k,n}\right)\\ &=& \sum_n \sum_j \Delta_{j,n} w_{jk} f'\left(y_{k,n}\right)\\ &=& \sum_n f'\left(y_{k,n}\right) \sum_j w_{jk} \Delta_{j,n}\\ &\equiv& \sum_n \Delta_{k,n} \ea $$
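
The $\Delta$ recursion derived above can be written as a short loop and checked against a finite difference. This is a sketch under stated assumptions: tanh for $f$, hypothetical layer widths, random data, and weight matrices stored with shape (outputs, inputs) to match the $w_{ij}$ indexing.

```python
import numpy as np

f = np.tanh
def fprime(y):
    return 1 - np.tanh(y)**2

def forward(weights, biases, x):
    """Return the activations x and pre-activations y of every layer."""
    xs, ys = [x], []
    for W, b in zip(weights, biases):
        ys.append(xs[-1] @ W.T + b)      # y_i = sum_j w_ij x_j + b_i
        xs.append(f(ys[-1]))             # x_i = f(y_i)
    return xs, ys

def chi2(weights, biases, x, d):
    return np.sum((forward(weights, biases, x)[0][-1] - d)**2)

def backward(weights, biases, x, d):
    """Gradients of chi^2 from the Delta recursion, last layer first."""
    xs, ys = forward(weights, biases, x)
    delta = 2*(xs[-1] - d)*fprime(ys[-1])            # Delta_i
    grads_W, grads_b = [], []
    for layer in reversed(range(len(weights))):
        grads_W.insert(0, delta.T @ xs[layer])       # sum_n Delta_i,n x_j,n
        grads_b.insert(0, delta.sum(axis=0))         # sum_n Delta_i,n
        if layer > 0:                                # Delta_j = f'(y_j) sum_i w_ij Delta_i
            delta = fprime(ys[layer-1])*(delta @ weights[layer])
    return grads_W, grads_b

rng = np.random.default_rng(0)
sizes = [3, 5, 4, 2]                                 # hypothetical layer widths
weights = [rng.normal(size=(o, i)) for i, o in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=o) for o in sizes[1:]]
x, d = rng.normal(size=(7, 3)), rng.normal(size=(7, 2))

grads_W, grads_b = backward(weights, biases, x, d)
# central finite-difference check on one weight of the first layer
eps = 1e-5
up = [W.copy() for W in weights]
down = [W.copy() for W in weights]
up[0][1, 2] += eps
down[0][1, 2] -= eps
numeric = (chi2(up, biases, x, d) - chi2(down, biases, x, d))/(2*eps)
print(f"backprop {grads_W[0][1, 2]:.6f}  finite difference {numeric:.6f}")
```

The backward loop reuses each layer's $\Delta$ to form the next one, so the cost of all the gradients is comparable to one extra forward pass.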

forward, backward training passes
stochastic gradient descent: train on random subsets of large data sets

Bottou, Léon. "Large-scale machine learning with stochastic gradient descent." In Proceedings of COMPSTAT'2010, pp. 177-186. Physica-Verlag HD, 2010.
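
A minimal sketch of stochastic gradient descent, assuming a hypothetical linear data set and a mean square error loss (the rate and batch size are illustrative). Each step estimates the gradient from a small random subset rather than from all of the data.

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical linear data set: d = 2 x0 - 3 x1 + noise
x = rng.normal(size=(10000, 2))
true_w = np.array([2.0, -3.0])
d = x @ true_w + 0.1*rng.normal(size=10000)

w = np.zeros(2)
rate, batch_size = 0.1, 32
for step in range(500):
    batch = rng.integers(0, len(x), size=batch_size)   # random subset of the data
    error = x[batch] @ w - d[batch]
    gradient = 2*x[batch].T @ error/batch_size          # gradient of the batch mean square error
    w = w - rate*gradient
print("estimated weights:", np.round(w, 2))
```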

need to set a learning rate
can use momentum to avoid local minima
ADAM combines adaptive rate and momentum
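
The update rules can be sketched on a toy problem. This assumes a hypothetical ill-conditioned quadratic loss and commonly used default constants; the momentum rule is the classical heavy-ball form, and the Adam rule follows the moment averages and bias corrections described in the Kingma and Ba reference.

```python
import numpy as np

curvature = np.array([1.0, 50.0])             # hypothetical ill-conditioned quadratic bowl

def loss(w):
    return 0.5*np.sum(curvature*w**2)

def gradient(w):
    return curvature*w

def momentum_descent(w, rate=0.01, mu=0.9, steps=300):
    velocity = np.zeros_like(w)
    for _ in range(steps):
        velocity = mu*velocity - rate*gradient(w)   # decaying sum of past gradient steps
        w = w + velocity
    return w

def adam(w, rate=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=300):
    m, v = np.zeros_like(w), np.zeros_like(w)
    for t in range(1, steps+1):
        g = gradient(w)
        m = beta1*m + (1-beta1)*g                   # momentum: average gradient
        v = beta2*v + (1-beta2)*g**2                # adaptive rate: average squared gradient
        m_hat, v_hat = m/(1-beta1**t), v/(1-beta2**t)   # bias correction for the zero start
        w = w - rate*m_hat/(np.sqrt(v_hat)+eps)
    return w

start = np.array([5.0, 5.0])
print(f"start    loss {loss(start):.3g}")
print(f"momentum loss {loss(momentum_descent(start)):.3g}")
print(f"Adam     loss {loss(adam(start)):.3g}")
```

Dividing by the root of the averaged squared gradient makes the Adam step size roughly independent of the curvature in each direction, which is why one rate can serve both the steep and the shallow axis here.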

Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Initialization¶

Regularization¶

variables can be >> data
need to prevent overfitting
separate training and testing data
can do early stopping

Caruana, Rich, Steve Lawrence, and Lee Giles. "Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping." In NIPS, pp. 402-408. 2000.

can penalize sum square weights

Krogh, A., and Hertz, J. (1991). A simple weight decay can improve generalization. Advances in neural information processing systems, 4.
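
A sketch of the sum-square-weight penalty, assuming a hypothetical linear model with more weights than data points so that the penalized minimum can be written in closed form. In gradient descent the same penalty adds a term proportional to each weight to its gradient, which shrinks the weights toward zero on every step.

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical underdetermined problem: 20 weights, only 10 noisy data points
x = rng.normal(size=(10, 20))
d = x[:, 0] + 0.1*rng.normal(size=10)

def fit(decay):
    """Minimize sum (x.w - d)^2 + decay * sum w^2 for a linear model (closed form)."""
    return np.linalg.solve(x.T @ x + decay*np.eye(20), x.T @ d)

for decay in [1e-3, 1e-1, 1e1]:
    w = fit(decay)
    residual = np.sum((x @ w - d)**2)
    print(f"decay {decay:g}: sum square weights {np.sum(w**2):.3f}, training error {residual:.3f}")
```

A larger penalty trades a worse fit to the training data for smaller weights.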

can do dropout, randomly drop nodes during training to prevent co-adaptation

Wager, S., Wang, S., & Liang, P. S. (2013). Dropout training as adaptive regularization. Advances in neural information processing systems, 26.
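
A minimal sketch of dropout in its commonly used "inverted" form, which is an assumption here rather than something specified above: node outputs are zeroed at random during training and the survivors are rescaled, so nothing needs to change at test time. The keep probability and the activations are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, keep=0.8, training=True):
    """Inverted dropout: zero each activation with probability 1-keep while training,
    and rescale the survivors so the expected value is unchanged."""
    if not training:
        return activations
    mask = rng.random(activations.shape) < keep
    return activations*mask/keep

hidden = np.ones((10000, 4))                      # hypothetical hidden-layer activations
dropped = dropout(hidden)
print(f"fraction kept: {np.mean(dropped != 0):.3f}")
print(f"mean activation: {dropped.mean():.3f} (1.000 without dropout)")
```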

Examples¶

XOR¶

sklearn¶

from sklearn.neural_network import MLPClassifier
import numpy as np
X = [[0,0],[0,1],[1,0],[1,1]]
y = [0,1,1,0]
classifier = MLPClassifier(solver='lbfgs',hidden_layer_sizes=(4,),activation='tanh',random_state=1)
classifier.fit(X,y)
print(f"score: {classifier.score(X,y)}")
print("Predictions:")
print(np.c_[X,classifier.predict(X)])

score: 1.0
Predictions:
[[0 0 0]
 [0 1 1]
 [1 0 1]
 [1 1 0]]

Jax, Flax, Optax¶

Jax¶

import jax
import jax.numpy as jnp
from jax import random,grad,jit
#
# init random key
#
key = random.PRNGKey(0)
#
# XOR training data
#
X = jnp.array([[0,0],[0,1],[1,0],[1,1]],dtype=jnp.int8)
y = jnp.array([0,1,1,0],dtype=jnp.int8).reshape(4,1)
#
# forward pass
#
@jit
def forward(params,layer_0):
    Weight1,bias1,Weight2,bias2 = params
    layer_1 = jnp.tanh(layer_0@Weight1+bias1)
    layer_2 = jax.nn.sigmoid(layer_1@Weight2+bias2)
    return layer_2
#
# loss function
#
@jit
def loss(params):
    ypred = forward(params,X)
    return jnp.mean((ypred-y)**2)
#
# gradient update step
#
@jit
def update(params,rate=1):
    gradient = grad(loss)(params)
    return jax.tree.map(lambda params,gradient:params-rate*gradient,params,gradient)
#
# parameter initialization
#
def init_params(key):
    key1,key2 = random.split(key)
    Weight1 = 0.5*random.normal(key1,(2,4))
    bias1 = jnp.zeros(4)
    Weight2 = 0.5*random.normal(key2,(4,1))
    bias2 = jnp.zeros(1)
    return (Weight1,bias1,Weight2,bias2)
#
# initialize parameters
#
params = init_params(key)
#
# training steps
#
for step in range(500):
    params = update(params,rate=10)
    if step%100 == 0:
        print(f"step {step:4d} loss={loss(params):.4f}")
#
# evaluate fit
#
pred = forward(params,X)
jnp.set_printoptions(precision=2)
print("\nPredictions:")
print(jnp.c_[X,pred])

step    0 loss=0.3047
step  100 loss=0.0008
step  200 loss=0.0004
step  300 loss=0.0002
step  400 loss=0.0002

Predictions:
[[0.   0.   0.  ]
 [0.   1.   0.99]
 [1.   0.   0.99]
 [1.   1.   0.01]]

MNIST¶

from sklearn.neural_network import MLPClassifier
import numpy as np
import matplotlib.pyplot as plt
xtrain = np.load('datasets/MNIST/xtrain.npy')
ytrain = np.load('datasets/MNIST/ytrain.npy')
xtest = np.load('datasets/MNIST/xtest.npy')
ytest = np.load('datasets/MNIST/ytest.npy')
print(f"read {xtrain.shape[1]} byte data records, {xtrain.shape[0]} training examples, {xtest.shape[0]} testing examples\n")
classifier = MLPClassifier(solver='adam',hidden_layer_sizes=(100,),activation='relu',random_state=1,verbose=True,tol=0.05)
classifier.fit(xtrain,ytrain)
print(f"\ntest score: {classifier.score(xtest,ytest)}\n")
predictions = classifier.predict(xtest)
fig,axs = plt.subplots(1,5)
for i in range(5):
    axs[i].imshow(np.reshape(xtest[i],(28,28)))
    axs[i].axis('off')
    axs[i].set_title(f"predict: {predictions[i]}")
plt.tight_layout()
plt.show()

read 784 byte data records, 60000 training examples, 10000 testing examples

Iteration 1, loss = 3.36992820
Iteration 2, loss = 1.13264743
Iteration 3, loss = 0.67881654
Iteration 4, loss = 0.44722898
Iteration 5, loss = 0.31655389
Iteration 6, loss = 0.23663579
Iteration 7, loss = 0.19165519
Iteration 8, loss = 0.15617156
Iteration 9, loss = 0.13629980
Iteration 10, loss = 0.11865439
Iteration 11, loss = 0.11459503
Iteration 12, loss = 0.10146799
Iteration 13, loss = 0.09842103
Iteration 14, loss = 0.09300270
Iteration 15, loss = 0.08931920
Iteration 16, loss = 0.08818319
Iteration 17, loss = 0.09585389
Training loss did not improve more than tol=0.050000 for 10 consecutive epochs. Stopping.

test score: 0.958

import jax
import jax.numpy as jnp
from jax import random,grad,jit
import matplotlib.pyplot as plt
#
# hyperparameters
#
data_size = 28*28
hidden_size = data_size//10
output_size = 10
batch_size = 5000
train_steps = 25
learning_rate = 0.5
#
# init random key
#
key = random.PRNGKey(0)
#
# load MNIST data
#
xtrain = jnp.load('datasets/MNIST/xtrain.npy')
ytrain = jnp.load('datasets/MNIST/ytrain.npy')
xtest = jnp.load('datasets/MNIST/xtest.npy')
ytest = jnp.load('datasets/MNIST/ytest.npy')
print(f"read {xtrain.shape[1]} byte data records, {xtrain.shape[0]} training examples, {xtest.shape[0]} testing examples\n")
#
# forward pass
#
@jit
def forward(params,layer_0):
    Weight1,bias1,Weight2,bias2 = params
    layer_1 = jnp.tanh(layer_0@Weight1+bias1)
    layer_2 = layer_1@Weight2+bias2
    return layer_2
#
# loss function
#
@jit
def loss(params,xtrain,ytrain):
    logits = forward(params,xtrain)
    probs = jnp.exp(logits)/jnp.sum(jnp.exp(logits),axis=1,keepdims=True)
    error = 1-jnp.mean(probs[jnp.arange(len(ytrain)),ytrain])
    return error
#
# gradient update step
#
@jit
def update(params,xtrain,ytrain,rate):
    gradient = grad(loss)(params,xtrain,ytrain)
    return jax.tree.map(lambda params,gradient:params-rate*gradient,params,gradient)
#
# parameter initialization
#
def init_params(key,xsize,hidden,output):
    key1,key = random.split(key)
    Weight1 = 0.01*random.normal(key1,(xsize,hidden))
    bias1 = jnp.zeros(hidden)
    key2,key = random.split(key)
    Weight2 = 0.01*random.normal(key2,(hidden,output))
    bias2 = jnp.zeros(output)
    return (Weight1,bias1,Weight2,bias2)
#
# initialize parameters
#
params = init_params(key,data_size,hidden_size,output_size)
#
# train
#
print(f"starting loss: {loss(params,xtrain,ytrain):.3f}\n")
for batch in range(0,len(ytrain),batch_size):
    xbatch = xtrain[batch:batch+batch_size]
    ybatch = ytrain[batch:batch+batch_size]
    print(f"batch {batch}: ",end='')
    for step in range(train_steps):
        params = update(params,xbatch,ybatch,rate=learning_rate)
    print(f"loss {loss(params,xbatch,ybatch):.3f}")
#
# test
#
logits = forward(params,xtest)
probs = jnp.exp(logits)/jnp.sum(jnp.exp(logits),axis=1,keepdims=True)
error = 1-jnp.mean(probs[jnp.arange(len(ytest)),ytest])
print(f"\ntest loss: {error:.3f}\n")
#
# plot
#
fig,axs = plt.subplots(1,5)
for i in range(5):
    axs[i].imshow(jnp.reshape(xtest[i],(28,28)))
    axs[i].axis('off')
    axs[i].set_title(f"predict: {jnp.argmax(probs[i])}")
plt.tight_layout()
plt.show()

read 784 byte data records, 60000 training examples, 10000 testing examples

starting loss: 0.899

batch 0: loss 0.381
batch 5000: loss 0.253
batch 10000: loss 0.198
batch 15000: loss 0.130
batch 20000: loss 0.114
batch 25000: loss 0.100
batch 30000: loss 0.097
batch 35000: loss 0.084
batch 40000: loss 0.082
batch 45000: loss 0.090
batch 50000: loss 0.077
batch 55000: loss 0.050

test loss: 0.085

Architectures¶

DNN/MLP¶

classifier
- logits
- confusion matrix
regression
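
For the classifier case, a sketch of going from logits to a confusion matrix. The logits and labels are made-up values for illustration, and `confusion_matrix` is a hypothetical helper written here, not a library call.

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, classes):
    """Entry [i, j] counts samples of true class i that were predicted as class j."""
    matrix = np.zeros((classes, classes), dtype=int)
    np.add.at(matrix, (true_labels, predicted_labels), 1)
    return matrix

# hypothetical logits for six samples and three classes
logits = np.array([[2.0, 0.1, -1.0],
                   [0.3, 1.5, 0.2],
                   [0.0, 0.2, 3.0],
                   [1.2, 1.0, 0.1],
                   [0.1, 0.4, 0.3],
                   [2.5, 0.0, 2.4]])
true_labels = np.array([0, 1, 2, 1, 1, 2])
predicted = np.argmax(logits, axis=1)        # the largest logit is the predicted class
matrix = confusion_matrix(true_labels, predicted, 3)
print(matrix)
print(f"accuracy: {np.trace(matrix)/matrix.sum():.3f}")
```

The diagonal counts correct predictions; off-diagonal entries show which classes are mistaken for which.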

Convolutional (CNN)¶

import matplotlib.pyplot as plt
import numpy as np
scale = .6
width = 7*scale
height = 7*scale
linewidth = 2.5
circlesize = 18
pointsize = 5
textsize = 14
fig,ax = plt.subplots(figsize=(width,height))
ax.axis([0,width,0,height])
ax.set_axis_off()
def wire(x0,y0,x1,y1,width):
    x0 = scale*x0
    x1 = scale*x1
    y0 = scale*y0
    y1 = scale*y1
    plt.plot([x0,x1],[y0,y1],'-',linewidth=width,color=(0.6,0.6,0.6))
def point(x,y,size):
    x = scale*x
    y = scale*y
    plt.plot(x,y,'ko',markersize=size)
def circle(x,y,size):
    x = scale*x
    y = scale*y
    plt.plot(x,y,'ko',markersize=size,markeredgewidth=size/8,
        markerfacecolor='white',markeredgecolor='black')
def text(x,y,text,size):
    x = scale*x
    y = scale*y
    ax.text(x,y,text,
        ha='center',va='center',
        math_fontfamily='cm',
        fontsize=size,color='black')
#
wire(.5,.5,1.5,2,linewidth)
wire(1.5,.5,1.5,2,linewidth)
wire(2.5,.5,1.5,2,linewidth)
wire(1.5,.5,2.5,2,linewidth)
wire(2.5,.5,2.5,2,linewidth)
wire(3.5,.5,2.5,2,linewidth)
wire(2.5,.5,3.5,2,linewidth)
wire(3.5,.5,3.5,2,linewidth)
wire(4.5,.5,3.5,2,linewidth)
wire(3.5,.5,4.5,2,linewidth)
wire(4.5,.5,4.5,2,linewidth)
wire(5.5,.5,4.5,2,linewidth)
#
wire(1.5,2,2,3.5,linewidth)
wire(2.5,2,2,3.5,linewidth)
wire(3.5,2,4,3.5,linewidth)
wire(4.5,2,4,3.5,linewidth)
#
circle(.5,.5,circlesize)
circle(1.5,.5,circlesize)
circle(2.5,.5,circlesize)
circle(3.5,.5,circlesize)
circle(4.5,.5,circlesize)
circle(5.5,.5,circlesize)
#
circle(1.5,2,circlesize)
circle(2.5,2,circlesize)
circle(3.5,2,circlesize)
circle(4.5,2,circlesize)
#
circle(2,3.5,circlesize)
circle(4,3.5,circlesize)
#
point(3,4,pointsize)
point(3,4.4,pointsize)
point(3,4.8,pointsize)
#
text(7.2,.5,'input layer',textsize)
text(7.4,1.4,'shared filter weights',textsize)
text(6.,2.8,'pooling layer',textsize)
#
plt.show()

Figure: A Convolutional Neural Network (CNN)

pattern recognition, YOLO
want invariance to translation, rotation
huge number of inputs, e.g. pixels
find feature maps

Hubel, David H., and Torsten N. Wiesel. "Receptive fields and functional architecture of monkey striate cortex." The Journal of physiology 195, no. 1 (1968): 215-243.

LeCun, Yann, Koray Kavukcuoglu, and Clément Farabet. "Convolutional networks and applications in vision." In Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on, pp. 253-256. IEEE, 2010.

filter layers
pooling layers
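
A one-dimensional sketch of a filter layer followed by a pooling layer, with a hypothetical hand-chosen edge filter standing in for learned weights. Because the same weights are applied at every position, translating the input translates the feature map, while its largest response is unchanged.

```python
import numpy as np

def filter_layer(signal, weights):
    """Slide one shared set of weights across the input (a 'valid' cross-correlation)."""
    return np.correlate(signal, weights, mode='valid')

def pooling_layer(features, size=2):
    """Max pooling: keep the largest response in each non-overlapping window."""
    trimmed = features[:len(features)//size*size]
    return trimmed.reshape(-1, size).max(axis=1)

edge_filter = np.array([-1.0, 0.0, 1.0])       # hypothetical filter that responds to rising edges
signal = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0, 0], dtype=float)
shifted = np.roll(signal, 1)                    # same pattern, translated by one sample

for name, s in [("original", signal), ("shifted", shifted)]:
    features = filter_layer(s, edge_filter)
    print(f"{name}: features {features}, pooled {pooling_layer(features)}, max {features.max()}")
```

The filter needs only three weights regardless of the input length, which is what keeps the parameter count manageable for inputs such as images.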

Recurrent (RNN)¶

import matplotlib.pyplot as plt
import numpy as np
scale = .6
width = 7*scale
height = 7*scale
linewidth = 2.5
circlesize = 18
pointsize = 5
textsize = 14
fig,ax = plt.subplots(figsize=(width,height))
ax.axis([0,width,0,height])
ax.set_axis_off()
def wire(x0,y0,x1,y1,width):
    x0 = scale*x0
    x1 = scale*x1
    y0 = scale*y0
    y1 = scale*y1
    plt.plot([x0,x1],[y0,y1],'-',linewidth=width,color=(0.6,0.6,0.6))
def arrow(x0,y0,x1,y1,width):
    x0 = scale*x0
    x1 = scale*x1
    y0 = scale*y0
    y1 = scale*y1
    ax.annotate('',xy=(x1,y1),xytext=(x0,y0),
        arrowprops=dict(color=(0.6,0.6,0.6),width=width,headwidth=3*width,headlength=3*width))
def point(x,y,size):
    x = scale*x
    y = scale*y
    plt.plot(x,y,'ko',markersize=size)
def circle(x,y,size):
    x = scale*x
    y = scale*y
    plt.plot(x,y,'ko',markersize=size,markeredgewidth=size/8,
        markerfacecolor='white',markeredgecolor='black')
def text(x,y,text,size):
    x = scale*x
    y = scale*y
    ax.text(x,y,text,
        ha='center',va='center',
        math_fontfamily='cm',
        fontsize=size,color='black')
#
wire(.5,.5,1,2,linewidth)
wire(.5,.5,2,2,linewidth)
wire(.5,.5,3,2,linewidth)
wire(.5,.5,4,2,linewidth)
wire(1.5,.5,1,2,linewidth)
wire(1.5,.5,2,2,linewidth)
wire(1.5,.5,3,2,linewidth)
wire(1.5,.5,4,2,linewidth)
wire(2.5,.5,1,2,linewidth)
wire(2.5,.5,2,2,linewidth)
wire(2.5,.5,3,2,linewidth)
wire(2.5,.5,4,2,linewidth)
wire(3.5,.5,1,2,linewidth)
wire(3.5,.5,2,2,linewidth)
wire(3.5,.5,3,2,linewidth)
wire(3.5,.5,4,2,linewidth)
wire(4.5,.5,1,2,linewidth)
wire(4.5,.5,2,2,linewidth)
wire(4.5,.5,3,2,linewidth)
wire(4.5,.5,4,2,linewidth)
#
wire(1.5,4,2,5.5,linewidth)
wire(1.5,4,3,5.5,linewidth)
wire(2.5,4,2,5.5,linewidth)
wire(2.5,4,3,5.5,linewidth)
wire(3.5,4,2,5.5,linewidth)
wire(3.5,4,3,5.5,linewidth)
#
wire(3.5,4,5,4,linewidth)
wire(5,4,5,2,linewidth)
arrow(5,2,4.3,2,linewidth)
#
circle(.5,.5,circlesize)
circle(1.5,.5,circlesize)
circle(2.5,.5,circlesize)
circle(3.5,.5,circlesize)
circle(4.5,.5,circlesize)
#
circle(1,2,circlesize)
circle(2,2,circlesize)
circle(3,2,circlesize)
circle(4,2,circlesize)
#
point(2.5,2.6,pointsize)
point(2.5,3,pointsize)
point(2.5,3.4,pointsize)
#
circle(1.5,4,circlesize)
circle(2.5,4,circlesize)
circle(3.5,4,circlesize)
#
circle(2,5.5,circlesize)
circle(3,5.5,circlesize)
#
text(6.2,.5,'input layer',textsize)
text(5.9,1.25,'feed forward',textsize)
text(6.3,3,'feedback',textsize)
text(4.8,5.5,'output layer',textsize)
#
plt.show()

Figure: A Recurrent Neural Network (RNN)

introduces time, memory
MLP uses a fixed window, like an FIR filter
RNN is like an IIR filter

Pineda, Fernando J. "Generalization of back-propagation to recurrent neural networks." Physical review letters 59, no. 19 (1987): 2229.

unroll, do backprop through time
has issues with vanishing, diverging gradients
LSTM adds cells with memories

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
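
The filter analogy and the gradient problem can both be seen in a scalar linear recurrence. This is a deliberately simplified sketch, not a trained RNN: the feedback values and the three-tap window are illustrative.

```python
import numpy as np

def fir_window(x, taps):
    """Fixed window: the output depends only on the last len(taps) inputs."""
    return np.convolve(x, taps)[:len(x)]

def recurrent(x, feedback):
    """Scalar linear recurrence h_t = feedback*h_{t-1} + x_t: the state carries memory."""
    h, states = 0.0, []
    for value in x:
        h = feedback*h + value
        states.append(h)
    return np.array(states)

impulse = np.zeros(50)
impulse[0] = 1.0
print("window response is nonzero for", np.count_nonzero(fir_window(impulse, np.ones(3))), "steps")
print("recurrent response is nonzero for", np.count_nonzero(recurrent(impulse, 0.9)), "steps")

# unrolled through time, d h_T / d h_0 = feedback**T for this linear unit
for feedback in [0.9, 1.0, 1.1]:
    print(f"feedback {feedback}: gradient over 100 steps = {feedback**100:.3g}")
```

The same repeated factor that gives the recurrence its long memory makes the unrolled gradient shrink or grow geometrically with the number of time steps.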

Residual (ResNet)¶

import matplotlib.pyplot as plt
import numpy as np
scale = .6
width = 7*scale
height = 7*scale
linewidth = 2.5
circlesize = 18
pointsize = 5
textsize = 14
fig,ax = plt.subplots(figsize=(width,height))
ax.axis([0,width,0,height])
ax.set_axis_off()
def wire(x0,y0,x1,y1,width):
    x0 = scale*x0
    x1 = scale*x1
    y0 = scale*y0
    y1 = scale*y1
    plt.plot([x0,x1],[y0,y1],'-',linewidth=width,color=(0.6,0.6,0.6))
def arrow(x0,y0,x1,y1,width):
    x0 = scale*x0
    x1 = scale*x1
    y0 = scale*y0
    y1 = scale*y1
    ax.annotate('',xy=(x1,y1),xytext=(x0,y0),
        arrowprops=dict(color=(0.6,0.6,0.6),width=width,headwidth=3*width,headlength=3*width))
def point(x,y,size):
    x = scale*x
    y = scale*y
    plt.plot(x,y,'ko',markersize=size)
def circle(x,y,size):
    x = scale*x
    y = scale*y
    plt.plot(x,y,'ko',markersize=size,markeredgewidth=size/8,
        markerfacecolor='white',markeredgecolor='black')
def text(x,y,text,size):
    x = scale*x
    y = scale*y
    ax.text(x,y,text,
        ha='center',va='center',
        math_fontfamily='cm',
        fontsize=size,color='black')
#
wire(.5,.5,1,2,linewidth)
wire(.5,.5,2,2,linewidth)
wire(.5,.5,3,2,linewidth)
wire(.5,.5,4,2,linewidth)
wire(1.5,.5,1,2,linewidth)
wire(1.5,.5,2,2,linewidth)
wire(1.5,.5,3,2,linewidth)
wire(1.5,.5,4,2,linewidth)
wire(2.5,.5,1,2,linewidth)
wire(2.5,.5,2,2,linewidth)
wire(2.5,.5,3,2,linewidth)
wire(2.5,.5,4,2,linewidth)
wire(3.5,.5,1,2,linewidth)
wire(3.5,.5,2,2,linewidth)
wire(3.5,.5,3,2,linewidth)
wire(3.5,.5,4,2,linewidth)
wire(4.5,.5,1,2,linewidth)
wire(4.5,.5,2,2,linewidth)
wire(4.5,.5,3,2,linewidth)
wire(4.5,.5,4,2,linewidth)
#
wire(1.5,4,2,5.5,linewidth)
wire(1.5,4,3,5.5,linewidth)
wire(2.5,4,2,5.5,linewidth)
wire(2.5,4,3,5.5,linewidth)
wire(3.5,4,2,5.5,linewidth)
wire(3.5,4,3,5.5,linewidth)
#
arrow(5,4,3.9,4,linewidth)
wire(5,4,5,2,linewidth)
wire(5,2,4,2,linewidth)
#
circle(.5,.5,circlesize)
circle(1.5,.5,circlesize)
circle(2.5,.5,circlesize)
circle(3.5,.5,circlesize)
circle(4.5,.5,circlesize)
#
circle(1,2,circlesize)
circle(2,2,circlesize)
circle(3,2,circlesize)
circle(4,2,circlesize)
#
point(2.5,2.6,pointsize)
point(2.5,3,pointsize)
point(2.5,3.4,pointsize)
#
circle(1.5,4,circlesize)
circle(2.5,4,circlesize)
circle(3.5,4,circlesize)
#
circle(2,5.5,circlesize)
circle(3,5.5,circlesize)
#
text(6.2,.5,'input layer',textsize)
text(5.9,1.25,'feed forward',textsize)
text(6.,3,'residual',textsize)
text(4.8,5.5,'output layer',textsize)
#
plt.show()

Figure: A Residual Neural Network (ResNet)

feed layers forward
intervening layers learn residuals
helps with vanishing/diverging gradients
used for hundreds, thousands of layers
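
A sketch of why the skip connection helps, assuming a hypothetical stack of 50 tanh layers with small random weights. It compares the input-to-output gradient of a plain stack with that of a stack where each layer computes $x + f(Wx)$.

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth = 16, 50
layers = [0.05*rng.normal(size=(width, width)) for _ in range(depth)]  # hypothetical small weights

def jacobian(x, residual):
    """Derivative of the final activations with respect to the input, by the chain rule."""
    total = np.eye(width)
    for W in layers:
        y = W @ x
        local = (1 - np.tanh(y)**2)[:, None]*W        # d tanh(W x) / d x
        if residual:
            x, local = x + np.tanh(y), np.eye(width) + local   # skip connection adds the identity
        else:
            x = np.tanh(y)
        total = local @ total
    return total

x0 = rng.normal(size=width)
for name, residual in [("plain", False), ("residual", True)]:
    print(f"{name:>8}: gradient norm {np.linalg.norm(jacobian(x0, residual)):.3g}")
```

In the plain stack the gradient is a product of small matrices and vanishes with depth; with the skip connection each factor is the identity plus a small correction, so the product stays of order one.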

Autoencoder (AE)¶

import matplotlib.pyplot as plt
import numpy as np
scale = .6
width = 7*scale
height = 5.5*scale
linewidth = 2.5
circlesize = 18
pointsize = 5
textsize = 14
linesize = 3
fig,ax = plt.subplots(figsize=(width,height))
ax.axis([0,width,0,height])
ax.set_axis_off()
def wire(x0,y0,x1,y1,width):
    x0 = scale*x0
    x1 = scale*x1
    y0 = scale*y0
    y1 = scale*y1
    plt.plot([x0,x1],[y0,y1],'-',linewidth=width,color=(0.6,0.6,0.6))
def point(x,y,size):
    x = scale*x
    y = scale*y
    plt.plot(x,y,'ko',markersize=size)
def circle(x,y,size):
    x = scale*x
    y = scale*y
    plt.plot(x,y,'ko',markersize=size,markeredgewidth=size/8,
        markerfacecolor='white',markeredgecolor='black')
def text(x,y,text,size):
    x = scale*x
    y = scale*y
    ax.text(x,y,text,
        ha='center',va='center',
        math_fontfamily='cm',
        fontsize=size,color='black')
def arrow(x0,y0,x1,y1,width):
    x0 = scale*x0
    x1 = scale*x1
    y0 = scale*y0
    y1 = scale*y1
    ax.annotate('',xy=(x1,y1),xytext=(x0,y0),
        arrowprops=dict(color=(0.6,0.6,0.6),width=width,headwidth=3*width,headlength=3*width))
#
#wire(3.5,4,3,5.5,linewidth)
#
circle(.5,.5,circlesize)
circle(1.5,.5,circlesize)
circle(2.5,.5,circlesize)
circle(3.5,.5,circlesize)
circle(4.5,.5,circlesize)
#
point(2.5,1.2,pointsize)
point(2.5,1.6,pointsize)
point(2.5,2,pointsize)
#
circle(2,2.6,circlesize)
circle(3,2.6,circlesize)
#
point(2.5,3.2,pointsize)
point(2.5,3.6,pointsize)
point(2.5,4,pointsize)
#
circle(.5,4.7,circlesize)
circle(1.5,4.7,circlesize)
circle(2.5,4.7,circlesize)
circle(3.5,4.7,circlesize)
circle(4.5,4.7,circlesize)
#
text(6.2,.5,'input layer',textsize)
text(4.8,2.6,'latent layer',textsize)
text(9,2.6,'training data',textsize)
text(6.4,4.7,'output layer',textsize)
#
arrow(7.5,2.81,4.8,0.8,linesize)
arrow(7.5,2.79,4.8,4.4,linesize)
#
plt.show()

Figure: An Autoencoder

learn to predict the input through a bottleneck

finds a lower-dimensional representation

unsupervised learning

not reliable for generation outside of training set

will see VAE in Machine Learning

masked autoencoders
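
A sketch of the bottleneck idea for the simplest case, a linear autoencoder on hypothetical data. For a linear encoder and decoder with a squared-error loss, the best bottleneck is known to span the leading principal components, so an SVD is swapped in here in place of gradient training; a nonlinear autoencoder would be trained like the networks above.

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical data: 5 observed variables driven by 2 hidden ones, plus a little noise
latent_true = rng.normal(size=(500, 2))
data = latent_true @ rng.normal(size=(2, 5)) + 0.01*rng.normal(size=(500, 5))
data = data - data.mean(axis=0)

# the SVD stands in for training: its leading rows give the optimal linear bottleneck
_, _, components = np.linalg.svd(data, full_matrices=False)
encoder = components[:2].T            # 5 inputs -> 2 latent variables
decoder = components[:2]              # 2 latent variables -> 5 outputs

latent = data @ encoder
reconstruction = latent @ decoder
error = np.mean((reconstruction - data)**2)/np.mean(data**2)
print(f"latent shape {latent.shape}, relative reconstruction error {error:.2e}")
```

No labels are used: the training target is the input itself, and the two latent variables are the lower-dimensional representation.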

Attention¶

Edge ML¶

embedded devices, real-time applications
model quantization and pruning
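
A minimal sketch of both techniques on a hypothetical weight matrix, assuming symmetric 8-bit quantization with a single scale and magnitude pruning of half the weights. Real deployments typically use finer-grained scales and retrain after pruning.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(64, 64)).astype(np.float32)   # hypothetical trained weights

# symmetric 8-bit quantization: store int8 values plus one floating-point scale
scale = np.abs(weights).max()/127
quantized = np.round(weights/scale).astype(np.int8)
restored = quantized.astype(np.float32)*scale
print(f"bytes: {weights.nbytes} -> {quantized.nbytes}, "
      f"max error {np.abs(restored - weights).max():.4f} (scale/2 = {scale/2:.4f})")

# magnitude pruning: zero the half of the weights with the smallest magnitude
threshold = np.quantile(np.abs(weights), 0.5)
pruned = np.where(np.abs(weights) < threshold, 0, weights)
print(f"fraction of weights pruned: {np.mean(pruned == 0):.2f}")
```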

Packages¶

References¶

[Ekman:21] Ekman, M. (2021). Learning deep learning: Theory and practice of neural networks, computer vision, NLP, and transformers using Tensorflow.
- A good balance between breadth and depth.
[Fleuret:24] The Little Book of Deep Learning, François Fleuret (2024)
- https://fleuret.org/public/lbdl.pdf
- A concise (and freely available) survey.
[Simon:26] Simon, Jamie, Daniel Kunin, Alexander Atanasov, Enric Boix-Adserà, Blake Bordelon, Jeremy Cohen, Nikhil Ghosh et al. "There Will Be a Scientific Theory of Deep Learning." arXiv preprint arXiv:2604.21691 (2026).

Problems¶

Train a neural network to classify the data set you used for the PCA problem in Transforms.
Train an unsupervised neural network to recognize noisy samples of DTMF tones.
Train a neural network to predict the output of a Linear Feedback Shift Register, and verify its ability to continue the LFSR sequence.