Neural Networks¶
The Nature of Mathematical Modeling (draft)¶
$ \def\%#1{\mathbf{#1}} \def\mat#1{\mathbf{#1}} \def\*#1{\vec{#1}} \def\ve#1{\vec{#1}} \def\ds#1#2{\cfrac{d #1}{d #2}} \def\dd#1#2{\cfrac{d^2 #1}{d #2^2}} \def\ps#1#2{\cfrac{\partial #1}{\partial #2}} \def\pp#1#2{\cfrac{\p^2 #1}{\p #2^2}} \def\p{\partial} \def\ba{\begin{eqnarray*}} \def\ea{\end{eqnarray*}} \def\eps{\epsilon} \def\del{\nabla} \def\disp{\displaystyle} \def\la{\langle} \def\ra{\rangle} \def\unit#1{~({\rm #1)}} \def\units#1#2{~\left(\frac{\rm #1}{\rm #2}\right)} $
import matplotlib.pyplot as plt
import numpy as np
scale = .6
width = 7*scale
height = 6*scale
linewidth = 2.5
circlesize = 18
pointsize = 5
textsize = 14
fig,ax = plt.subplots(figsize=(width,height))
ax.axis([0,width,0,height])
ax.set_axis_off()
def wire(x0,y0,x1,y1,width):
    x0 = scale*x0
    x1 = scale*x1
    y0 = scale*y0
    y1 = scale*y1
    plt.plot([x0,x1],[y0,y1],'-',linewidth=width,color=(0.6,0.6,0.6))
def point(x,y,size):
    x = scale*x
    y = scale*y
    plt.plot(x,y,'ko',markersize=size)
def circle(x,y,size):
    x = scale*x
    y = scale*y
    plt.plot(x,y,'ko',markersize=size,markeredgewidth=size/8,
        markerfacecolor='white',markeredgecolor='black')
def text(x,y,text,size):
    x = scale*x
    y = scale*y
    ax.text(x,y,text,
        ha='center',va='center',
        math_fontfamily='cm',
        fontsize=size,color='black')
#
wire(.5,.5,1,2,linewidth)
wire(.5,.5,2,2,linewidth)
wire(.5,.5,3,2,linewidth)
wire(.5,.5,4,2,linewidth)
wire(1.5,.5,1,2,linewidth)
wire(1.5,.5,2,2,linewidth)
wire(1.5,.5,3,2,linewidth)
wire(1.5,.5,4,2,linewidth)
wire(2.5,.5,1,2,linewidth)
wire(2.5,.5,2,2,linewidth)
wire(2.5,.5,3,2,linewidth)
wire(2.5,.5,4,2,linewidth)
wire(3.5,.5,1,2,linewidth)
wire(3.5,.5,2,2,linewidth)
wire(3.5,.5,3,2,linewidth)
wire(3.5,.5,4,2,linewidth)
wire(4.5,.5,1,2,linewidth)
wire(4.5,.5,2,2,linewidth)
wire(4.5,.5,3,2,linewidth)
wire(4.5,.5,4,2,linewidth)
#
wire(1.5,4,2,5.5,linewidth)
wire(1.5,4,3,5.5,linewidth)
wire(2.5,4,2,5.5,linewidth)
wire(2.5,4,3,5.5,linewidth)
wire(3.5,4,2,5.5,linewidth)
wire(3.5,4,3,5.5,linewidth)
#
circle(.5,.5,circlesize)
circle(1.5,.5,circlesize)
circle(2.5,.5,circlesize)
circle(3.5,.5,circlesize)
circle(4.5,.5,circlesize)
#
circle(1,2,circlesize)
circle(2,2,circlesize)
circle(3,2,circlesize)
circle(4,2,circlesize)
#
point(2.5,2.6,pointsize)
point(2.5,3,pointsize)
point(2.5,3.4,pointsize)
#
circle(1.5,4,circlesize)
circle(2.5,4,circlesize)
circle(3.5,4,circlesize)
#
circle(2,5.5,circlesize)
circle(3,5.5,circlesize)
#
text(6.2,.5,'input layer',textsize)
text(5.4,1.25,'weights',textsize)
text(6.5,2,'activation functions',textsize)
text(4.2,3,'hidden layers',textsize)
text(4.6,4,'nodes',textsize)
text(5.4,5.5,'output layer, loss',textsize)
#
plt.show()
Figure: A Multi-Layer Perceptron (MLP) Deep Neural Network (DNN)
Taxonomy¶
supervised (regression, classification), unsupervised, reinforcement (reward, next chapter)
introduces hierarchy in models
History¶
- neuroscience diverge, converge
Hasson, Uri, Samuel A. Nastase, and Ariel Goldstein. "Direct fit to nature: An evolutionary perspective on biological and artificial neural networks." Neuron 105, no. 3 (2020): 416-434.
Lillicrap, T. P., Santoro, A., Marris, L., Akerman, C. J., & Hinton, G. (2020). Backpropagation and the brain. Nature Reviews Neuroscience, 21(6), 335-346.
- model neurons with a threshold function
McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5, 115-133.
- Perceptrons apply neuron models
$f(\ve{w}\cdot\ve{x}+b)$
Rosenblatt, F. (1957). The perceptron, a perceiving and recognizing automaton Project Para. Cornell Aeronautical Laboratory.
Minsky, Marvin, and Seymour A. Papert. Perceptrons: An introduction to computational geometry. MIT press, 2017.
XOR history
- AI winter
Hopfield
Rumelhart
Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. "Learning representations by back-propagating errors." nature 323, no. 6088 (1986): 533-536.
deep learning
multiple layers $f(g(...))$
LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. "Deep learning." Nature 521, no. 7553 (2015): 436-444.
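The XOR limitation of a single perceptron noted in the history above can be reproduced in a few lines. This is a minimal sketch, assuming a step activation and the classic perceptron learning rule; the function name `train_perceptron`, the epoch count, and the rate are illustrative choices rather than anything taken from the cited papers.

```python
import numpy as np

def train_perceptron(X, t, epochs=100, rate=1.0):
    """Train one threshold unit f(w.x+b) with the perceptron rule; return accuracy."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x, target in zip(X, t):
            pred = float(w @ x + b > 0)        # step activation
            w += rate * (target - pred) * x    # weights change only on mistakes
            b += rate * (target - pred)
    preds = (X @ w + b > 0).astype(float)
    return np.mean(preds == t)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
and_accuracy = train_perceptron(X, np.array([0, 0, 0, 1], dtype=float))
xor_accuracy = train_perceptron(X, np.array([0, 1, 1, 0], dtype=float))
print(f"AND accuracy: {and_accuracy}, XOR accuracy: {xor_accuracy}")
```

AND is linearly separable, so the rule converges; no single line separates the XOR classes, so at most three of the four points can be classified correctly.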
Functions¶
Breadth vs Depth¶
linear vs nonlinear coefficients in functions
under mild assumptions linear depth vs exponential breadth
one layer polynomial of degree $k$
second layer polynomial of degree $q$
polynomial of a polynomial is degree $kq$, using $k+q$ units
a single layer would need $kq$ units
can't represent all terms, but can represent desired terms
avoids the curse of dimensionality
Telgarsky, M. (2016, June). Benefits of depth in neural networks. In Conference on learning theory (pp. 1517-1539). PMLR.
Poggio, T., Mhaskar, H., Rosasco, L., Miranda, B., & Liao, Q. (2017). Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review. International Journal of Automation and Computing, 14(5), 503-519.
[Simon:26]
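The degree counting above can be checked numerically. This is a small sketch using `np.poly1d`, whose call operator substitutes one polynomial into another; the two example polynomials are arbitrary choices made for illustration.

```python
import numpy as np

k, q = 3, 4
inner = np.poly1d([1, 0, 0, 1])        # first layer, degree k: x^3 + 1
outer = np.poly1d([1, 0, 0, 0, -1])    # second layer, degree q: x^4 - 1
composed = outer(inner)                # polynomial of a polynomial
nonzero_terms = np.count_nonzero(composed.coeffs)
print(f"composed degree: {composed.order} (k*q = {k*q})")
print(f"nonzero terms: {nonzero_terms} of {composed.order + 1} possible")
```

The composition reaches degree $kq$ with only $k+q$ units, but only a subset of the possible terms has nonzero coefficients, which is the sense in which depth represents the desired terms rather than all of them.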
Activation¶
- step function: logic, not differentiable
- tanh: regression, -1 to 1
- sigmoid: logic, 0 to 1
- ReLU: easy to calculate, fixes vanishing activation slope
- leaky ReLU: fixes zero inhibition slope
Nair, Vinod, and Geoffrey E. Hinton. "Rectified linear units improve restricted boltzmann machines." In Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807-814. 2010.
Xu, J., Li, Z., Du, B., Zhang, M., & Liu, J. (2020, July). Reluplex made more practical: Leaky ReLU. In 2020 IEEE Symposium on Computers and communications (ISCC) (pp. 1-7). IEEE.
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(-3,3,100)
plt.plot(x,1/(1+np.exp(-x)),label='sigmoid')
plt.plot(x,np.tanh(x),label='tanh')
plt.plot(x,np.where(x < 0,0,x),label='ReLU')
plt.plot(x,np.where(x < 0,0.1*x,x),'--',label='leaky ReLU')
plt.legend()
plt.show()
Training¶
Preprocessing¶
- pre-process data, zero mean unit variance, sphering, standardization, ICA
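As a minimal sketch of two of these steps, the following standardizes and then spheres (whitens) a synthetic data set. The data, the mixing matrix, and the variable names are assumptions made for illustration; ICA is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic correlated data (assumed for illustration): 500 samples, 3 features
mixing = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.2]])
X = rng.normal(size=(500, 3)) @ mixing + np.array([5.0, -1.0, 10.0])

# standardization: zero mean and unit variance for each feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# sphering: also remove correlations, so the covariance becomes the identity
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
X_white = Xc @ eigvecs @ np.diag(eigvals**-0.5) @ eigvecs.T

print(np.round(np.cov(X_white, rowvar=False), 3))
```

Standardization rescales each input separately; sphering additionally rotates and scales so that the inputs are uncorrelated.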
Loss¶
mean square error for regression
cross entropy for classification $-\la \sum y_i\log p_i\ra$
$y_i$ = 1 for state $i$, 0 otherwise
$p_i$ is the predicted probability to be in state $i$
vanishes if $p_i=1$ for the true state
one-hot encoding: one 1, all the rest 0
logits: unnormalized probability predictions
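The relationship between logits, one-hot labels, and the cross entropy can be sketched as follows. The function name `cross_entropy` and the example values are illustrative; subtracting the row maximum before exponentiating is a common stabilization that does not change the result.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross entropy from logits (N,K) and integer class labels (N,)."""
    shifted = logits - logits.max(axis=1, keepdims=True)    # avoid overflow in exp
    log_probs = shifted - np.log(np.sum(np.exp(shifted), axis=1, keepdims=True))
    one_hot = np.eye(logits.shape[1])[labels]                # one 1, all the rest 0
    return -np.mean(np.sum(one_hot * log_probs, axis=1))

labels = np.array([0, 2])
uniform = np.zeros((2, 3))                                   # no preference among 3 states
confident = np.array([[20.0, 0.0, 0.0], [0.0, 0.0, 20.0]])   # nearly certain and correct
loss_uniform = cross_entropy(uniform, labels)
loss_confident = cross_entropy(confident, labels)
print(loss_uniform, loss_confident)
```

Equal logits give a loss of $\log 3$ for three states, and the loss approaches zero as the predicted probability of the true state approaches 1.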
Backpropagation¶
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1985). Learning internal representations by error propagation. California Univ San Diego La Jolla Inst for Cognitive Science.
inputs $x_j$
combine with weights
$$ y_i = \sum_j w_{ij} x_j $$
can add bias for fixed values and to adjust sensitivity
$$ y_i = \sum_j w_{ij} x_j + b_i $$
output through activation function
$$ x_i = f(y_i) $$
hidden layer
$$ \ba x_i &=& f\left[\sum_j w_{ij} f(y_j)\right]\\ x_i &=& f\left[\sum_j w_{ij} f\left[\sum_k w_{jk} x_k\right]\right] \ea $$
hidden layers
$$ \ba x_i &=& f\left[\sum_j w_{ij} f\left[\sum_k w_{jk} f(y_k)\right]\right]\\ x_i &=& f\left[\sum_j w_{ij} f\left[\sum_k w_{jk} f\left[\sum_l w_{kl} x_l\right]\right]\right] \ea $$
loss, data $d$
$$ \chi^2 = \sum_n \sum_i \left[x_{i,n}-d_{i,n}\right]^2 $$
gradient descent, back-propagation
$$ w_{ij} \rightarrow w_{ij} - \alpha \ps{\chi^2}{w_{ij}} $$
$$ b_i \rightarrow b_i - \beta \ps{\chi^2}{b_i} $$
last layer
$$ \ba \ps{\chi^2}{w_{ij}} &=& \sum_n \sum_{i'} 2\left(x_{i',n}-d_{i',n}\right) \ps{x_{i',n}}{w_{ij}}\\
&=& \sum_n 2\left(x_{i,n}-d_{i,n}\right) f'\left(y_{i,n}\right) x_{j,n}\\
&\equiv& \sum_n \Delta_{i,n} x_{j,n} \ea $$
$$ \ba \ps{\chi^2}{b_i} &=& \sum_n \sum_{i'} 2\left(x_{i',n}-d_{i',n}\right) \ps{x_{i',n}}{b_i}\\
&=& \sum_n 2\left(x_{i,n}-d_{i,n}\right) f'\left(y_{i,n}\right)\\
&\equiv& \sum_n \Delta_{i,n} \ea $$
next layer
$$ \ba \ps{\chi^2}{w_{jk}} &=& \sum_n \sum_i 2\left(x_{i,n}-d_{i,n}\right) \ps{x_{i,n}}{w_{jk}}\\
&=& \sum_n \sum_i 2\left(x_{i,n}-d_{i,n}\right) f'\left(y_{i,n}\right) w_{ij} f'\left(y_{j,n}\right) x_{k,n}\\
&=& \sum_n \sum_i \Delta_{i,n} w_{ij} f'\left(y_{j,n}\right) x_{k,n}\\
&=& \sum_n f'\left(y_{j,n}\right) \sum_i w_{ij} \Delta_{i,n} x_{k,n}\\
&\equiv& \sum_n \Delta_{j,n} x_{k,n} \ea $$
$$ \ba \ps{\chi^2}{b_j} &=& \sum_n \sum_i 2\left(x_{i,n}-d_{i,n}\right) \ps{x_{i,n}}{b_j}\\
&=& \sum_n \sum_i 2\left(x_{i,n}-d_{i,n}\right) f'\left(y_{i,n}\right) w_{ij} f'\left(y_{j,n}\right)\\
&=& \sum_n \sum_i \Delta_{i,n} w_{ij} f'\left(y_{j,n}\right)\\
&=& \sum_n f'\left(y_{j,n}\right) \sum_i w_{ij} \Delta_{i,n}\\
&\equiv& \sum_n \Delta_{j,n} \ea $$
next layer
$$ \ba \ps{\chi^2}{w_{kl}} &=& \sum_n \sum_i 2\left(x_{i,n}-d_{i,n}\right) \ps{x_{i,n}}{w_{kl}}\\
&=& \sum_n \sum_i 2\left(x_{i,n}-d_{i,n}\right) f'\left(y_{i,n}\right) \sum_j w_{ij} f'\left(y_{j,n}\right) w_{jk} f'\left(y_{k,n}\right) x_{l,n}\\
&=& \sum_n \sum_i \Delta_{i,n} \sum_j w_{ij} f'\left(y_{j,n}\right) w_{jk} f'\left(y_{k,n}\right) x_{l,n}\\
&=& \sum_n \sum_j \Delta_{j,n} w_{jk} f'\left(y_{k,n}\right) x_{l,n}\\
&=& \sum_n f'\left(y_{k,n}\right) \sum_j w_{jk} \Delta_{j,n} x_{l,n}\\
&\equiv& \sum_n \Delta_{k,n} x_{l,n} \ea $$
$$ \ba \ps{\chi^2}{b_k} &=& \sum_n \sum_i 2\left(x_{i,n}-d_{i,n}\right) \ps{x_{i,n}}{b_k}\\
&=& \sum_n \sum_i 2\left(x_{i,n}-d_{i,n}\right) f'\left(y_{i,n}\right) \sum_j w_{ij} f'\left(y_{j,n}\right) w_{jk} f'\left(y_{k,n}\right)\\
&=& \sum_n \sum_i \Delta_{i,n} \sum_j w_{ij} f'\left(y_{j,n}\right) w_{jk} f'\left(y_{k,n}\right)\\
&=& \sum_n \sum_j \Delta_{j,n} w_{jk} f'\left(y_{k,n}\right)\\
&=& \sum_n f'\left(y_{k,n}\right) \sum_j w_{jk} \Delta_{j,n}\\
&\equiv& \sum_n \Delta_{k,n} \ea $$
forward, backward training passes
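The $\Delta$ recursion derived above can be checked against a finite-difference gradient. This is a sketch for one hidden layer, assuming a tanh activation and random data; the array names follow the index notation of the derivation and are otherwise arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N, nk, nj, ni = 5, 3, 4, 2             # samples, inputs, hidden nodes, outputs
x_k = rng.normal(size=(N, nk))          # inputs x_k
d = rng.normal(size=(N, ni))            # data d_i
w_jk = rng.normal(size=(nj, nk))        # hidden-layer weights and biases
b_j = rng.normal(size=nj)
w_ij = rng.normal(size=(ni, nj))        # output-layer weights and biases
b_i = rng.normal(size=ni)

f = np.tanh
def fprime(y):
    return 1 - np.tanh(y)**2

def forward(w_jk, b_j, w_ij, b_i):
    y_j = x_k @ w_jk.T + b_j
    x_j = f(y_j)
    y_i = x_j @ w_ij.T + b_i
    x_i = f(y_i)
    return y_j, x_j, y_i, x_i

def chi2(w_jk, b_j, w_ij, b_i):
    return np.sum((forward(w_jk, b_j, w_ij, b_i)[3] - d)**2)

# backward pass: last layer first, then propagate Delta to the layer below
y_j, x_j, y_i, x_i = forward(w_jk, b_j, w_ij, b_i)
delta_i = 2 * (x_i - d) * fprime(y_i)           # Delta_{i,n}
grad_w_ij = delta_i.T @ x_j                     # sum_n Delta_{i,n} x_{j,n}
grad_b_i = delta_i.sum(axis=0)                  # sum_n Delta_{i,n}
delta_j = fprime(y_j) * (delta_i @ w_ij)        # f'(y_j) sum_i w_ij Delta_{i,n}
grad_w_jk = delta_j.T @ x_k                     # sum_n Delta_{j,n} x_{k,n}
grad_b_j = delta_j.sum(axis=0)                  # sum_n Delta_{j,n}

# central finite difference on one hidden-layer weight
eps = 1e-6
w_plus = w_jk.copy(); w_plus[0, 0] += eps
w_minus = w_jk.copy(); w_minus[0, 0] -= eps
numeric = (chi2(w_plus, b_j, w_ij, b_i) - chi2(w_minus, b_j, w_ij, b_i)) / (2 * eps)
print(f"backprop: {grad_w_jk[0, 0]:.6f}  finite difference: {numeric:.6f}")
```

The forward pass stores $y$ and $x$ for each layer; the backward pass reuses them, so each layer's gradient costs one matrix product instead of a full re-evaluation per weight.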
stochastic gradient descent: train on random subsets of large data sets
Bottou, Léon. "Large-scale machine learning with stochastic gradient descent." In Proceedings of COMPSTAT'2010, pp. 177-186. Physica-Verlag HD, 2010.
need to set a learning rate
can use momentum to help escape shallow local minima and damp oscillations
ADAM combines adaptive rate and momentum
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
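The way Adam combines momentum with a per-parameter adaptive rate can be sketched on a poorly conditioned quadratic. The update follows the rule in the Kingma and Ba reference with its default $\beta_1$, $\beta_2$, and $\epsilon$; the test function, the rate, and the step count are assumptions chosen for illustration.

```python
import numpy as np

def adam_minimize(grad, w, steps=500, rate=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    m = np.zeros_like(w)                      # running mean of gradients (momentum)
    v = np.zeros_like(w)                      # running mean of squared gradients
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)            # bias corrections for zero initialization
        v_hat = v / (1 - beta2**t)
        w = w - rate * m_hat / (np.sqrt(v_hat) + eps)
    return w

curvature = np.array([1.0, 100.0])            # very different slopes per direction
target = np.array([1.0, -2.0])
loss = lambda w: np.sum(curvature * (w - target)**2)
grad = lambda w: 2 * curvature * (w - target)

w0 = np.zeros(2)
w = adam_minimize(grad, w0)
print(f"loss: {loss(w0):.3f} -> {loss(w):.6f}")
```

Dividing by the root of the squared-gradient average makes the step size roughly independent of the gradient scale in each direction, so one learning rate serves both the steep and the shallow coordinate.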
Initialization¶
- He
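He initialization is commonly defined as zero-mean Gaussian weights with variance $2/n_{in}$, where $n_{in}$ is the number of inputs to a node, chosen so that ReLU layers roughly preserve the signal size. A minimal sketch, with layer sizes assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 784, 78
W = rng.normal(size=(fan_in, fan_out)) * np.sqrt(2.0 / fan_in)   # He-scaled weights

x = rng.normal(size=(1000, fan_in))       # unit-variance inputs
h = np.maximum(0, x @ W)                  # ReLU layer output
mean_square = np.mean(h**2)
print(f"weight std: {W.std():.4f}, mean square activation: {mean_square:.3f}")
```

The mean square activation stays near 1, so stacking such layers neither shrinks nor inflates the signal at the start of training.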
Regularization¶
number of parameters can be >> number of data points
need to prevent overfitting
separate training and testing data
can do early stopping
Caruana, Rich, Steve Lawrence, and Lee Giles. "Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping." In NIPS, pp. 402-408. 2000.
- can penalize sum square weights
Krogh, A., and Hertz, J. (1991). A simple weight decay can improve generalization. Advances in neural information processing systems, 4.
- can do dropout, randomly drop nodes during training to prevent co-adaptation
Wager, S., Wang, S., & Liang, P. S. (2013). Dropout training as adaptive regularization. Advances in neural information processing systems, 26.
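Two of these regularizers are short enough to sketch directly. The dropout below uses the common "inverted" convention, which rescales the surviving activations so their expected value is unchanged; that convention, the function names, and the parameter values are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate, rng):
    """Zero a random fraction `rate` of the activations and rescale the rest."""
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

def l2_penalty(weights, strength):
    """Sum-square weight penalty and its gradient, to be added to the loss."""
    return strength * np.sum(weights**2), 2 * strength * weights

h = np.ones(100000)
h_drop = dropout(h, rate=0.5, rng=rng)           # applied during training only
penalty, penalty_grad = l2_penalty(np.array([1.0, -2.0, 3.0]), strength=0.1)
print(f"fraction dropped: {np.mean(h_drop == 0):.3f}, mean after dropout: {h_drop.mean():.3f}")
print(f"penalty: {penalty:.2f}, gradient: {penalty_grad}")
```

The penalty gradient is proportional to the weights themselves, which is why the sum-square penalty is also called weight decay.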
Examples¶
XOR¶
sklearn¶
from sklearn.neural_network import MLPClassifier
import numpy as np
X = [[0,0],[0,1],[1,0],[1,1]]
y = [0,1,1,0]
classifier = MLPClassifier(solver='lbfgs',hidden_layer_sizes=(4,),activation='tanh',random_state=1)
classifier.fit(X,y)
print(f"score: {classifier.score(X,y)}")
print("Predictions:")
print(np.c_[X,classifier.predict(X)])
score: 1.0
Predictions:
[[0 0 0]
 [0 1 1]
 [1 0 1]
 [1 1 0]]
Jax, Flax, Optax¶
Jax¶
import jax
import jax.numpy as jnp
from jax import random,grad,jit
#
# init random key
#
key = random.PRNGKey(0)
#
# XOR training data
#
X = jnp.array([[0,0],[0,1],[1,0],[1,1]],dtype=jnp.int8)
y = jnp.array([0,1,1,0],dtype=jnp.int8).reshape(4,1)
#
# forward pass
#
@jit
def forward(params,layer_0):
    Weight1,bias1,Weight2,bias2 = params
    layer_1 = jnp.tanh(layer_0@Weight1+bias1)
    layer_2 = jax.nn.sigmoid(layer_1@Weight2+bias2)
    return layer_2
#
# loss function
#
@jit
def loss(params):
    ypred = forward(params,X)
    return jnp.mean((ypred-y)**2)
#
# gradient update step
#
@jit
def update(params,rate=1):
    gradient = grad(loss)(params)
    return jax.tree.map(lambda params,gradient:params-rate*gradient,params,gradient)
#
# parameter initialization
#
def init_params(key):
    key1,key2 = random.split(key)
    Weight1 = 0.5*random.normal(key1,(2,4))
    bias1 = jnp.zeros(4)
    Weight2 = 0.5*random.normal(key2,(4,1))
    bias2 = jnp.zeros(1)
    return (Weight1,bias1,Weight2,bias2)
#
# initialize parameters
#
params = init_params(key)
#
# training steps
#
for step in range(500):
    params = update(params,rate=10)
    if step%100 == 0:
        print(f"step {step:4d} loss={loss(params):.4f}")
#
# evaluate fit
#
pred = forward(params,X)
jnp.set_printoptions(precision=2)
print("\nPredictions:")
print(jnp.c_[X,pred])
step    0 loss=0.3047
step  100 loss=0.0008
step  200 loss=0.0004
step  300 loss=0.0002
step  400 loss=0.0002

Predictions:
[[0. 0. 0. ]
 [0. 1. 0.99]
 [1. 0. 0.99]
 [1. 1. 0.01]]
MNIST¶
from sklearn.neural_network import MLPClassifier
import numpy as np
import matplotlib.pyplot as plt
xtrain = np.load('datasets/MNIST/xtrain.npy')
ytrain = np.load('datasets/MNIST/ytrain.npy')
xtest = np.load('datasets/MNIST/xtest.npy')
ytest = np.load('datasets/MNIST/ytest.npy')
print(f"read {xtrain.shape[1]} byte data records, {xtrain.shape[0]} training examples, {xtest.shape[0]} testing examples\n")
classifier = MLPClassifier(solver='adam',hidden_layer_sizes=(100,),activation='relu',random_state=1,verbose=True,tol=0.05)
classifier.fit(xtrain,ytrain)
print(f"\ntest score: {classifier.score(xtest,ytest)}\n")
predictions = classifier.predict(xtest)
# show the first five test digits with their predicted labels
fig,axs = plt.subplots(1,5)
for i in range(5):
    axs[i].imshow(np.reshape(xtest[i],(28,28)))
    axs[i].axis('off')
    axs[i].set_title(f"predict: {predictions[i]}")
plt.tight_layout()
plt.show()
read 784 byte data records, 60000 training examples, 10000 testing examples

Iteration 1, loss = 3.36992820
Iteration 2, loss = 1.13264743
Iteration 3, loss = 0.67881654
Iteration 4, loss = 0.44722898
Iteration 5, loss = 0.31655389
Iteration 6, loss = 0.23663579
Iteration 7, loss = 0.19165519
Iteration 8, loss = 0.15617156
Iteration 9, loss = 0.13629980
Iteration 10, loss = 0.11865439
Iteration 11, loss = 0.11459503
Iteration 12, loss = 0.10146799
Iteration 13, loss = 0.09842103
Iteration 14, loss = 0.09300270
Iteration 15, loss = 0.08931920
Iteration 16, loss = 0.08818319
Iteration 17, loss = 0.09585389
Training loss did not improve more than tol=0.050000 for 10 consecutive epochs. Stopping.

test score: 0.958
import jax
import jax.numpy as jnp
from jax import random,grad,jit
import matplotlib.pyplot as plt
#
# hyperparameters
#
data_size = 28*28
hidden_size = data_size//10
output_size = 10
batch_size = 5000
train_steps = 25
learning_rate = 0.5
#
# init random key
#
key = random.PRNGKey(0)
#
# load MNIST data
#
xtrain = jnp.load('datasets/MNIST/xtrain.npy')
ytrain = jnp.load('datasets/MNIST/ytrain.npy')
xtest = jnp.load('datasets/MNIST/xtest.npy')
ytest = jnp.load('datasets/MNIST/ytest.npy')
print(f"read {xtrain.shape[1]} byte data records, {xtrain.shape[0]} training examples, {xtest.shape[0]} testing examples\n")
#
# forward pass
#
@jit
def forward(params,layer_0):
    Weight1,bias1,Weight2,bias2 = params
    layer_1 = jnp.tanh(layer_0@Weight1+bias1)
    layer_2 = layer_1@Weight2+bias2
    return layer_2
#
# loss function
#
@jit
def loss(params,xtrain,ytrain):
    logits = forward(params,xtrain)
    probs = jnp.exp(logits)/jnp.sum(jnp.exp(logits),axis=1,keepdims=True)
    error = 1-jnp.mean(probs[jnp.arange(len(ytrain)),ytrain])
    return error
#
# gradient update step
#
@jit
def update(params,xtrain,ytrain,rate):
    gradient = grad(loss)(params,xtrain,ytrain)
    return jax.tree.map(lambda params,gradient:params-rate*gradient,params,gradient)
#
# parameter initialization
#
def init_params(key,xsize,hidden,output):
    key1,key = random.split(key)
    Weight1 = 0.01*random.normal(key1,(xsize,hidden))
    bias1 = jnp.zeros(hidden)
    key2,key = random.split(key)
    Weight2 = 0.01*random.normal(key2,(hidden,output))
    bias2 = jnp.zeros(output)
    return (Weight1,bias1,Weight2,bias2)
#
# initialize parameters
#
params = init_params(key,data_size,hidden_size,output_size)
#
# train
#
print(f"starting loss: {loss(params,xtrain,ytrain):.3f}\n")
for batch in range(0,len(ytrain),batch_size):
    xbatch = xtrain[batch:batch+batch_size]
    ybatch = ytrain[batch:batch+batch_size]
    print(f"batch {batch}: ",end='')
    for step in range(train_steps):
        params = update(params,xbatch,ybatch,rate=learning_rate)
    print(f"loss {loss(params,xbatch,ybatch):.3f}")
#
# test
#
logits = forward(params,xtest)
probs = jnp.exp(logits)/jnp.sum(jnp.exp(logits),axis=1,keepdims=True)
error = 1-jnp.mean(probs[jnp.arange(len(ytest)),ytest])
print(f"\ntest loss: {error:.3f}\n")
#
# plot
#
fig,axs = plt.subplots(1,5)
for i in range(5):
    axs[i].imshow(jnp.reshape(xtest[i],(28,28)))
    axs[i].axis('off')
    axs[i].set_title(f"predict: {jnp.argmax(probs[i])}")
plt.tight_layout()
plt.show()
read 784 byte data records, 60000 training examples, 10000 testing examples

starting loss: 0.899

batch 0: loss 0.381
batch 5000: loss 0.253
batch 10000: loss 0.198
batch 15000: loss 0.130
batch 20000: loss 0.114
batch 25000: loss 0.100
batch 30000: loss 0.097
batch 35000: loss 0.084
batch 40000: loss 0.082
batch 45000: loss 0.090
batch 50000: loss 0.077
batch 55000: loss 0.050

test loss: 0.085
Architectures¶
DNN/MLP¶
- classifier
- logits
- confusion matrix
- regression
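For a classifier, the predicted class is the largest logit, and a confusion matrix counts how often each true class receives each prediction. A small sketch with made-up logits; the function name and the values are illustrative only.

```python
import numpy as np

def confusion_matrix(true_labels, logits, num_classes):
    """Rows index the true class, columns the predicted class."""
    predicted = np.argmax(logits, axis=1)
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(matrix, (true_labels, predicted), 1)   # accumulate repeated pairs
    return matrix

true_labels = np.array([0, 0, 1, 1, 2, 2])
logits = np.array([[2.0, 0.1, 0.0],    # predicts 0
                   [0.2, 1.5, 0.0],    # predicts 1
                   [0.0, 3.0, 0.1],    # predicts 1
                   [0.0, 2.0, 0.5],    # predicts 1
                   [0.1, 0.0, 1.0],    # predicts 2
                   [1.2, 0.0, 0.3]])   # predicts 0
cm = confusion_matrix(true_labels, logits, 3)
accuracy = np.trace(cm) / np.sum(cm)
print(cm)
print(f"accuracy: {accuracy:.3f}")
```

The diagonal holds the correct classifications; off-diagonal entries show which classes are mistaken for which.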
Convolutional (CNN)¶
import matplotlib.pyplot as plt
import numpy as np
scale = .6
width = 7*scale
height = 7*scale
linewidth = 2.5
circlesize = 18
pointsize = 5
textsize = 14
fig,ax = plt.subplots(figsize=(width,height))
ax.axis([0,width,0,height])
ax.set_axis_off()
def wire(x0,y0,x1,y1,width):
    x0 = scale*x0
    x1 = scale*x1
    y0 = scale*y0
    y1 = scale*y1
    plt.plot([x0,x1],[y0,y1],'-',linewidth=width,color=(0.6,0.6,0.6))
def point(x,y,size):
    x = scale*x
    y = scale*y
    plt.plot(x,y,'ko',markersize=size)
def circle(x,y,size):
    x = scale*x
    y = scale*y
    plt.plot(x,y,'ko',markersize=size,markeredgewidth=size/8,
        markerfacecolor='white',markeredgecolor='black')
def text(x,y,text,size):
    x = scale*x
    y = scale*y
    ax.text(x,y,text,
        ha='center',va='center',
        math_fontfamily='cm',
        fontsize=size,color='black')
#
wire(.5,.5,1.5,2,linewidth)
wire(1.5,.5,1.5,2,linewidth)
wire(2.5,.5,1.5,2,linewidth)
wire(1.5,.5,2.5,2,linewidth)
wire(2.5,.5,2.5,2,linewidth)
wire(3.5,.5,2.5,2,linewidth)
wire(2.5,.5,3.5,2,linewidth)
wire(3.5,.5,3.5,2,linewidth)
wire(4.5,.5,3.5,2,linewidth)
wire(3.5,.5,4.5,2,linewidth)
wire(4.5,.5,4.5,2,linewidth)
wire(5.5,.5,4.5,2,linewidth)
#
wire(1.5,2,2,3.5,linewidth)
wire(2.5,2,2,3.5,linewidth)
wire(3.5,2,4,3.5,linewidth)
wire(4.5,2,4,3.5,linewidth)
#
circle(.5,.5,circlesize)
circle(1.5,.5,circlesize)
circle(2.5,.5,circlesize)
circle(3.5,.5,circlesize)
circle(4.5,.5,circlesize)
circle(5.5,.5,circlesize)
#
circle(1.5,2,circlesize)
circle(2.5,2,circlesize)
circle(3.5,2,circlesize)
circle(4.5,2,circlesize)
#
circle(2,3.5,circlesize)
circle(4,3.5,circlesize)
#
point(3,4,pointsize)
point(3,4.4,pointsize)
point(3,4.8,pointsize)
#
text(7.2,.5,'input layer',textsize)
text(7.4,1.4,'shared filter weights',textsize)
text(6.,2.8,'pooling layer',textsize)
#
plt.show()
Figure: A Convolutional Neural Network (CNN)
pattern recognition, YOLO
want invariance to translation, rotation
huge number of inputs, e.g. pixels
find feature maps
Hubel, David H., and Torsten N. Wiesel. "Receptive fields and functional architecture of monkey striate cortex." The Journal of physiology 195, no. 1 (1968): 215-243.
LeCun, Yann, Koray Kavukcuoglu, and Clément Farabet. "Convolutional networks and applications in vision." In Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on, pp. 253-256. IEEE, 2010.
filter layers
pooling layers
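The structure in the figure, six inputs feeding four filter outputs that are pooled down to two, can be sketched in one dimension. The filter weights and input values are arbitrary illustrative choices, and max pooling is assumed as the pooling operation.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])     # six inputs
w = np.array([1.0, 2.0, 1.0])                    # one 3-tap filter shared by all positions

# filter layer: slide the same weights across the input (4 outputs)
feature_map = np.correlate(x, w, mode='valid')

# pooling layer: keep the maximum of each adjacent pair (2 outputs)
pooled = feature_map.reshape(-1, 2).max(axis=1)

print(feature_map)   # [ 4.  8. 12. 16.]
print(pooled)        # [ 8. 16.]
```

Sharing three weights across all positions replaces the 24 independent weights a fully connected layer of the same size would need, and a shifted input produces a correspondingly shifted feature map.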
Recurrent (RNN)¶
import matplotlib.pyplot as plt
import numpy as np
scale = .6
width = 7*scale
height = 7*scale
linewidth = 2.5
circlesize = 18
pointsize = 5
textsize = 14
fig,ax = plt.subplots(figsize=(width,height))
ax.axis([0,width,0,height])
ax.set_axis_off()
def wire(x0,y0,x1,y1,width):
    x0 = scale*x0
    x1 = scale*x1
    y0 = scale*y0
    y1 = scale*y1
    plt.plot([x0,x1],[y0,y1],'-',linewidth=width,color=(0.6,0.6,0.6))
def arrow(x0,y0,x1,y1,width):
    x0 = scale*x0
    x1 = scale*x1
    y0 = scale*y0
    y1 = scale*y1
    ax.annotate('',xy=(x1,y1),xytext=(x0,y0),
        arrowprops=dict(color=(0.6,0.6,0.6),width=width,headwidth=3*width,headlength=3*width))
def point(x,y,size):
    x = scale*x
    y = scale*y
    plt.plot(x,y,'ko',markersize=size)
def circle(x,y,size):
    x = scale*x
    y = scale*y
    plt.plot(x,y,'ko',markersize=size,markeredgewidth=size/8,
        markerfacecolor='white',markeredgecolor='black')
def text(x,y,text,size):
    x = scale*x
    y = scale*y
    ax.text(x,y,text,
        ha='center',va='center',
        math_fontfamily='cm',
        fontsize=size,color='black')
#
wire(.5,.5,1,2,linewidth)
wire(.5,.5,2,2,linewidth)
wire(.5,.5,3,2,linewidth)
wire(.5,.5,4,2,linewidth)
wire(1.5,.5,1,2,linewidth)
wire(1.5,.5,2,2,linewidth)
wire(1.5,.5,3,2,linewidth)
wire(1.5,.5,4,2,linewidth)
wire(2.5,.5,1,2,linewidth)
wire(2.5,.5,2,2,linewidth)
wire(2.5,.5,3,2,linewidth)
wire(2.5,.5,4,2,linewidth)
wire(3.5,.5,1,2,linewidth)
wire(3.5,.5,2,2,linewidth)
wire(3.5,.5,3,2,linewidth)
wire(3.5,.5,4,2,linewidth)
wire(4.5,.5,1,2,linewidth)
wire(4.5,.5,2,2,linewidth)
wire(4.5,.5,3,2,linewidth)
wire(4.5,.5,4,2,linewidth)
#
wire(1.5,4,2,5.5,linewidth)
wire(1.5,4,3,5.5,linewidth)
wire(2.5,4,2,5.5,linewidth)
wire(2.5,4,3,5.5,linewidth)
wire(3.5,4,2,5.5,linewidth)
wire(3.5,4,3,5.5,linewidth)
#
wire(3.5,4,5,4,linewidth)
wire(5,4,5,2,linewidth)
arrow(5,2,4.3,2,linewidth)
#
circle(.5,.5,circlesize)
circle(1.5,.5,circlesize)
circle(2.5,.5,circlesize)
circle(3.5,.5,circlesize)
circle(4.5,.5,circlesize)
#
circle(1,2,circlesize)
circle(2,2,circlesize)
circle(3,2,circlesize)
circle(4,2,circlesize)
#
point(2.5,2.6,pointsize)
point(2.5,3,pointsize)
point(2.5,3.4,pointsize)
#
circle(1.5,4,circlesize)
circle(2.5,4,circlesize)
circle(3.5,4,circlesize)
#
circle(2,5.5,circlesize)
circle(3,5.5,circlesize)
#
text(6.2,.5,'input layer',textsize)
text(5.9,1.25,'feed forward',textsize)
text(6.3,3,'feedback',textsize)
text(4.8,5.5,'output layer',textsize)
#
plt.show()
Figure: A Recurrent Neural Network (RNN)
introduces time, memory
MLP uses a fixed window, like an FIR filter
RNN is like an IIR filter
Pineda, Fernando J. "Generalization of back-propagation to recurrent neural networks." Physical review letters 59, no. 19 (1987): 2229.
unroll, do backprop through time
has issues with vanishing, diverging gradients
LSTM adds cells with memories
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
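The filter analogy above can be made concrete with impulse responses. This sketch compares a fixed window (FIR) with a single linear recurrent state (IIR); the tap values and the feedback weight `a` are assumptions for illustration, and the nonlinearity of a real RNN is omitted.

```python
import numpy as np

impulse = np.zeros(10)
impulse[0] = 1.0

# FIR: output depends on a fixed window of past inputs, like an MLP on a window
taps = np.array([0.5, 0.3, 0.2])
fir = np.convolve(impulse, taps)[:len(impulse)]

# IIR: output depends on a fed-back state, like a recurrent node
a = 0.8
state = 0.0
iir = []
for x in impulse:
    state = a * state + x
    iir.append(state)
iir = np.array(iir)

print(np.round(fir, 3))   # zero once the impulse leaves the window
print(np.round(iir, 3))   # decays as a**t but never exactly ends

# the sensitivity of the state to an input T steps earlier is a**T
print(0.8**50, 1.2**50)   # vanishes for |a|<1, diverges for |a|>1
```

The last line is the scalar version of the gradient problem in backprop through time: unrolling multiplies the same feedback weight once per step.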
Residual (ResNet)¶
import matplotlib.pyplot as plt
import numpy as np
scale = .6
width = 7*scale
height = 7*scale
linewidth = 2.5
circlesize = 18
pointsize = 5
textsize = 14
fig,ax = plt.subplots(figsize=(width,height))
ax.axis([0,width,0,height])
ax.set_axis_off()
def wire(x0,y0,x1,y1,width):
    x0 = scale*x0
    x1 = scale*x1
    y0 = scale*y0
    y1 = scale*y1
    plt.plot([x0,x1],[y0,y1],'-',linewidth=width,color=(0.6,0.6,0.6))
def arrow(x0,y0,x1,y1,width):
    x0 = scale*x0
    x1 = scale*x1
    y0 = scale*y0
    y1 = scale*y1
    ax.annotate('',xy=(x1,y1),xytext=(x0,y0),
        arrowprops=dict(color=(0.6,0.6,0.6),width=width,headwidth=3*width,headlength=3*width))
def point(x,y,size):
    x = scale*x
    y = scale*y
    plt.plot(x,y,'ko',markersize=size)
def circle(x,y,size):
    x = scale*x
    y = scale*y
    plt.plot(x,y,'ko',markersize=size,markeredgewidth=size/8,
        markerfacecolor='white',markeredgecolor='black')
def text(x,y,text,size):
    x = scale*x
    y = scale*y
    ax.text(x,y,text,
        ha='center',va='center',
        math_fontfamily='cm',
        fontsize=size,color='black')
#
wire(.5,.5,1,2,linewidth)
wire(.5,.5,2,2,linewidth)
wire(.5,.5,3,2,linewidth)
wire(.5,.5,4,2,linewidth)
wire(1.5,.5,1,2,linewidth)
wire(1.5,.5,2,2,linewidth)
wire(1.5,.5,3,2,linewidth)
wire(1.5,.5,4,2,linewidth)
wire(2.5,.5,1,2,linewidth)
wire(2.5,.5,2,2,linewidth)
wire(2.5,.5,3,2,linewidth)
wire(2.5,.5,4,2,linewidth)
wire(3.5,.5,1,2,linewidth)
wire(3.5,.5,2,2,linewidth)
wire(3.5,.5,3,2,linewidth)
wire(3.5,.5,4,2,linewidth)
wire(4.5,.5,1,2,linewidth)
wire(4.5,.5,2,2,linewidth)
wire(4.5,.5,3,2,linewidth)
wire(4.5,.5,4,2,linewidth)
#
wire(1.5,4,2,5.5,linewidth)
wire(1.5,4,3,5.5,linewidth)
wire(2.5,4,2,5.5,linewidth)
wire(2.5,4,3,5.5,linewidth)
wire(3.5,4,2,5.5,linewidth)
wire(3.5,4,3,5.5,linewidth)
#
arrow(5,4,3.9,4,linewidth)
wire(5,4,5,2,linewidth)
wire(5,2,4,2,linewidth)
#
circle(.5,.5,circlesize)
circle(1.5,.5,circlesize)
circle(2.5,.5,circlesize)
circle(3.5,.5,circlesize)
circle(4.5,.5,circlesize)
#
circle(1,2,circlesize)
circle(2,2,circlesize)
circle(3,2,circlesize)
circle(4,2,circlesize)
#
point(2.5,2.6,pointsize)
point(2.5,3,pointsize)
point(2.5,3.4,pointsize)
#
circle(1.5,4,circlesize)
circle(2.5,4,circlesize)
circle(3.5,4,circlesize)
#
circle(2,5.5,circlesize)
circle(3,5.5,circlesize)
#
text(6.2,.5,'input layer',textsize)
text(5.9,1.25,'feed forward',textsize)
text(6.,3,'residual',textsize)
text(4.8,5.5,'output layer',textsize)
#
plt.show()
feed layers forward
intervening layers learn residuals
helps with vanishing/diverging gradients
used for hundreds, thousands of layers
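A residual block adds its input to the output of the intervening layers, so those layers only need to learn the difference from the identity. A minimal sketch, assuming a two-layer ReLU block with hypothetical weight names `W1` and `W2`:

```python
import numpy as np

def residual_block(x, W1, W2):
    """Return x plus a learned correction F(x) = relu(x W1) W2."""
    return x + np.maximum(0, x @ W1) @ W2

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W1 = 0.1 * rng.normal(size=(8, 8))
W2 = 0.1 * rng.normal(size=(8, 8))

out = residual_block(x, W1, W2)
identity_out = residual_block(x, W1, np.zeros((8, 8)))   # zero residual
print(np.allclose(identity_out, x))
```

With a zero residual the block passes its input through unchanged, and the derivative of the output with respect to the input always contains an identity term, which is what keeps gradients from vanishing across many stacked blocks.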
Autoencoder (AE)¶
import matplotlib.pyplot as plt
import numpy as np
scale = .6
width = 7*scale
height = 5.5*scale
linewidth = 2.5
circlesize = 18
pointsize = 5
textsize = 14
linesize = 3
fig,ax = plt.subplots(figsize=(width,height))
ax.axis([0,width,0,height])
ax.set_axis_off()
def wire(x0,y0,x1,y1,width):
    x0 = scale*x0
    x1 = scale*x1
    y0 = scale*y0
    y1 = scale*y1
    plt.plot([x0,x1],[y0,y1],'-',linewidth=width,color=(0.6,0.6,0.6))
def point(x,y,size):
    x = scale*x
    y = scale*y
    plt.plot(x,y,'ko',markersize=size)
def circle(x,y,size):
    x = scale*x
    y = scale*y
    plt.plot(x,y,'ko',markersize=size,markeredgewidth=size/8,
        markerfacecolor='white',markeredgecolor='black')
def text(x,y,text,size):
    x = scale*x
    y = scale*y
    ax.text(x,y,text,
        ha='center',va='center',
        math_fontfamily='cm',
        fontsize=size,color='black')
def arrow(x0,y0,x1,y1,width):
    x0 = scale*x0
    x1 = scale*x1
    y0 = scale*y0
    y1 = scale*y1
    ax.annotate('',xy=(x1,y1),xytext=(x0,y0),
        arrowprops=dict(color=(0.6,0.6,0.6),width=width,headwidth=3*width,headlength=3*width))
#
#wire(3.5,4,3,5.5,linewidth)
#
circle(.5,.5,circlesize)
circle(1.5,.5,circlesize)
circle(2.5,.5,circlesize)
circle(3.5,.5,circlesize)
circle(4.5,.5,circlesize)
#
point(2.5,1.2,pointsize)
point(2.5,1.6,pointsize)
point(2.5,2,pointsize)
#
circle(2,2.6,circlesize)
circle(3,2.6,circlesize)
#
point(2.5,3.2,pointsize)
point(2.5,3.6,pointsize)
point(2.5,4,pointsize)
#
circle(.5,4.7,circlesize)
circle(1.5,4.7,circlesize)
circle(2.5,4.7,circlesize)
circle(3.5,4.7,circlesize)
circle(4.5,4.7,circlesize)
#
text(6.2,.5,'input layer',textsize)
text(4.8,2.6,'latent layer',textsize)
text(9,2.6,'training data',textsize)
text(6.4,4.7,'output layer',textsize)
#
arrow(7.5,2.81,4.8,0.8,linesize)
arrow(7.5,2.79,4.8,4.4,linesize)
#
plt.show()
Figure: An Autoencoder
learn to predict the input through a bottleneck
finds a lower-dimensional representation
unsupervised learning
not reliable for generation outside of training set
will see VAE in Machine Learning
masked autoencoders
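The bottleneck idea can be sketched with a linear autoencoder matching the figure: five inputs, a two-node latent layer, and five outputs trained to reproduce the inputs. The synthetic data (five-dimensional points lying on a two-dimensional subspace), the learning rate, and the step count are assumptions for illustration; a practical autoencoder would use nonlinear layers.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
X = rng.normal(size=(N, 2)) @ rng.normal(size=(2, 5))   # 5-D data with 2-D structure

We = 0.1 * rng.normal(size=(5, 2))    # encoder: input -> latent
Wd = 0.1 * rng.normal(size=(2, 5))    # decoder: latent -> output

def loss(We, Wd):
    return np.sum((X @ We @ Wd - X)**2) / N

initial = loss(We, Wd)
rate = 0.005
for step in range(5000):
    latent = X @ We
    residual = latent @ Wd - X                    # output minus input
    grad_Wd = 2 * latent.T @ residual / N
    grad_We = 2 * X.T @ residual @ Wd.T / N
    We -= rate * grad_We
    Wd -= rate * grad_Wd
final = loss(We, Wd)
print(f"reconstruction loss: {initial:.3f} -> {final:.5f}")
```

No labels are used: the training target is the input itself, and the two latent values are the learned lower-dimensional representation.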
Attention¶
Edge ML¶
embedded devices, real-time applications
model quantization and pruning
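Both techniques can be sketched on a single weight array. The scheme below is symmetric 8-bit quantization with one scale factor, and magnitude pruning of the smallest half of the weights; these specific choices, and the random weights, are assumptions for illustration rather than a recommendation for any particular device.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=1000)

# quantization: store 8-bit integers plus one floating-point scale
scale = np.max(np.abs(weights)) / 127
quantized = np.round(weights / scale).astype(np.int8)
dequantized = quantized * scale
max_error = np.max(np.abs(weights - dequantized))

# pruning: zero the weights with the smallest magnitudes
threshold = np.median(np.abs(weights))
pruned = np.where(np.abs(weights) < threshold, 0.0, weights)
sparsity = np.mean(pruned == 0)

print(f"scale: {scale:.5f}, max quantization error: {max_error:.5f}")
print(f"sparsity after pruning: {sparsity:.2f}")
```

The rounding error is bounded by half the scale, and the pruned array can be stored and multiplied in sparse form; in practice both steps are usually followed by checking or retraining for lost accuracy.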
Packages¶
References¶
- [Ekman:21] Ekman, M. (2021). Learning deep learning: Theory and practice of neural networks, computer vision, NLP, and transformers using Tensorflow.
- A good balance between breadth and depth.
- [Fleuret:24] The Little Book of Deep Learning, François Fleuret (2024)
- https://fleuret.org/public/lbdl.pdf
- A concise (and freely available) survey
- [Simon:26] Simon, Jamie, Daniel Kunin, Alexander Atanasov, Enric Boix-Adserà, Blake Bordelon, Jeremy Cohen, Nikhil Ghosh et al. "There Will Be a Scientific Theory of Deep Learning." arXiv preprint arXiv:2604.21691 (2026).
Problems¶
Train a neural network to classify the data set you used for the PCA problem in Transforms.
Train an unsupervised neural network to recognize noisy samples of DTMF tones.
Train a neural network to predict the output of a Linear Feedback Shift Register, and verify its ability to continue the LFSR sequence.
(c) Neil Gershenfeld 4/18/26