Neural Networks

The Nature of Mathematical Modeling (draft)

$ \def\%#1{\mathbf{#1}} \def\mat#1{\mathbf{#1}} \def\*#1{\vec{#1}} \def\ve#1{\vec{#1}} \def\ds#1#2{\cfrac{d #1}{d #2}} \def\dd#1#2{\cfrac{d^2 #1}{d #2^2}} \def\ps#1#2{\cfrac{\partial #1}{\partial #2}} \def\pp#1#2{\cfrac{\p^2 #1}{\p #2^2}} \def\p{\partial} \def\ba{\begin{eqnarray*}} \def\ea{\end{eqnarray*}} \def\eps{\epsilon} \def\del{\nabla} \def\disp{\displaystyle} \def\la{\langle} \def\ra{\rangle} \def\unit#1{~({\rm #1)}} \def\units#1#2{~\left(\frac{\rm #1}{\rm #2}\right)} $

  • history diverging, converging
  • curse, blessing of dimensionality
  • hidden layers latent variables
  • vanishing, exploding gradients
  • quantization
  • edge
import matplotlib.pyplot as plt
import numpy as np
scale = .6
width = 7*scale
height = 6*scale
linewidth = 2.5
circlesize = 18
pointsize = 5
textsize = 14
fig,ax = plt.subplots(figsize=(width,height))
ax.axis([0,width,0,height])
ax.set_axis_off()
def wire(x0,y0,x1,y1,width):
    x0 = scale*x0
    x1 = scale*x1
    y0 = scale*y0
    y1 = scale*y1
    plt.plot([x0,x1],[y0,y1],'-',linewidth=width,color=(0.6,0.6,0.6))
def point(x,y,size):
    x = scale*x
    y = scale*y
    plt.plot(x,y,'ko',markersize=size)
def circle(x,y,size):
    x = scale*x
    y = scale*y
    plt.plot(x,y,'ko',markersize=size,markeredgewidth=size/8,
        markerfacecolor='white',markeredgecolor='black')
def text(x,y,text,size):
    x = scale*x
    y = scale*y
    ax.text(x,y,text,
        ha='center',va='center',
        math_fontfamily='cm',
        fontsize=size,color='black')
#
wire(.5,.5,1,2,linewidth)
wire(.5,.5,2,2,linewidth)
wire(.5,.5,3,2,linewidth)
wire(.5,.5,4,2,linewidth)
wire(1.5,.5,1,2,linewidth)
wire(1.5,.5,2,2,linewidth)
wire(1.5,.5,3,2,linewidth)
wire(1.5,.5,4,2,linewidth)
wire(2.5,.5,1,2,linewidth)
wire(2.5,.5,2,2,linewidth)
wire(2.5,.5,3,2,linewidth)
wire(2.5,.5,4,2,linewidth)
wire(3.5,.5,1,2,linewidth)
wire(3.5,.5,2,2,linewidth)
wire(3.5,.5,3,2,linewidth)
wire(3.5,.5,4,2,linewidth)
wire(4.5,.5,1,2,linewidth)
wire(4.5,.5,2,2,linewidth)
wire(4.5,.5,3,2,linewidth)
wire(4.5,.5,4,2,linewidth)
#
wire(1.5,4,2,5.5,linewidth)
wire(1.5,4,3,5.5,linewidth)
wire(2.5,4,2,5.5,linewidth)
wire(2.5,4,3,5.5,linewidth)
wire(3.5,4,2,5.5,linewidth)
wire(3.5,4,3,5.5,linewidth)
#
circle(.5,.5,circlesize)
circle(1.5,.5,circlesize)
circle(2.5,.5,circlesize)
circle(3.5,.5,circlesize)
circle(4.5,.5,circlesize)
#
circle(1,2,circlesize)
circle(2,2,circlesize)
circle(3,2,circlesize)
circle(4,2,circlesize)
#
point(2.5,2.6,pointsize)
point(2.5,3,pointsize)
point(2.5,3.4,pointsize)
#
circle(1.5,4,circlesize)
circle(2.5,4,circlesize)
circle(3.5,4,circlesize)
#
circle(2,5.5,circlesize)
circle(3,5.5,circlesize)
#
text(6.2,.5,'input layer',textsize)
text(5.4,1.25,'weights',textsize)
text(4.2,3,'hidden layers',textsize)
text(4.8,5.5,'output layer',textsize)
#
plt.show()

Figure: A Multi-Layer Perceptron (MLP) Deep Neural Network (DNN)

supervised (regression, classification), unsupervised, reinforcement

neuroscience diverge, converge

Hasson, Uri, Samuel A. Nastase, and Ariel Goldstein. "Direct fit to nature: An evolutionary perspective on biological and artificial neural networks." Neuron 105, no. 3 (2020): 416-434.

Lillicrap, T. P., Santoro, A., Marris, L., Akerman, C. J., & Hinton, G. (2020). Backpropagation and the brain. Nature Reviews Neuroscience, 21(6), 335-346.

hierarchy missing until now

model neurons

McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5, 115-133.

perceptrons

Rosenblatt, F. (1957). The perceptron, a perceiving and recognizing automaton (Project Para). Cornell Aeronautical Laboratory.

Minsky, Marvin, and Seymour A. Papert. Perceptrons: An introduction to computational geometry. MIT Press, 1969 (reissued 2017).

XOR history

deep

LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. "Deep learning." Nature 521, no. 7553 (2015): 436-444.

linear vs nonlinear coefficients in functions

under mild assumptions linear depth vs exponential breadth

Telgarsky, M. (2016, June). Benefits of depth in neural networks. In Conference on learning theory (pp. 1517-1539). PMLR.

Poggio, T., Mhaskar, H., Rosasco, L., Miranda, B., & Liao, Q. (2017). Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review. International Journal of Automation and Computing, 14(5), 503-519.

Functions

Breadth vs Depth

Activation

step function, logic switch

not differentiable

tanh

sigmoid

softmax

saturation

ReLU

Nair, Vinod, and Geoffrey E. Hinton. "Rectified linear units improve restricted boltzmann machines." In Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807-814. 2010.

leaky ReLU

Xu, J., Li, Z., Du, B., Zhang, M., & Liu, J. (2020, July). Reluplex made more practical: Leaky ReLU. In 2020 IEEE Symposium on Computers and Communications (ISCC) (pp. 1-7). IEEE.

GeLU

Hendrycks, D., & Gimpel, K. (2016). Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.

import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(-3,3,100)
plt.plot(x,1/(1+np.exp(-x)),label='sigmoid')
plt.plot(x,np.tanh(x),label='tanh')
plt.plot(x,np.where(x < 0,0,x),label='ReLU')
plt.plot(x,np.where(x < 0,0.1*x,x),'--',label='leaky ReLU')
plt.plot(x,0.5*x*(1+np.tanh(np.sqrt(2/np.pi)*(x+0.044715*x**3))),':',label='GELU')
plt.legend()
plt.show()

Training

Preprocessing

pre-process data, zero mean unit variance, sphering, standardization
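A minimal sketch of standardization, assuming a data matrix with one sample per row; the synthetic data here is illustrative:

```python
import numpy as np

# synthetic data: 100 samples, 3 features with very different scales (illustrative)
rng = np.random.default_rng(0)
X = rng.normal(loc=[1,10,-5],scale=[0.1,3,20],size=(100,3))

# standardize: zero mean, unit variance per feature
mean = X.mean(axis=0)
std = X.std(axis=0)
Xs = (X-mean)/std
```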

Loss

mean square error for regression

cross entropy for classification
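Both losses are a few lines each; the predictions and targets below are made-up values for illustration:

```python
import numpy as np

# regression: mean square error between predictions and targets
ypred = np.array([0.9,0.2,0.4])
ytrue = np.array([1.0,0.0,0.5])
mse = np.mean((ypred-ytrue)**2)

# classification: cross entropy, -log of the probability assigned
# to the true class, averaged over samples
probs = np.array([[0.7,0.2,0.1],[0.1,0.8,0.1]]) # predicted class distributions
labels = np.array([0,1])                        # true class indices
cross_entropy = -np.mean(np.log(probs[np.arange(len(labels)),labels]))
```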

Backpropagation

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1985). Learning internal representations by error propagation. California Univ San Diego La Jolla Inst for Cognitive Science.

inputs $x_j$

combine with weights

$$ y_i = \sum_j w_{ij} x_j $$

can add bias for fixed values and to adjust sensitivity

$$ y_i = \sum_j w_{ij} x_j + b_i $$

output through activation function

$$ x_i = f(y_i) $$
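The three steps above (weighted sum, bias, activation) for one layer, with illustrative weights and inputs:

```python
import numpy as np

x = np.array([0.5,-1.0,2.0])   # inputs x_j
W = np.array([[0.1,0.2,-0.3],
              [0.4,-0.5,0.6]]) # weights w_ij: 2 outputs, 3 inputs
b = np.array([0.1,-0.2])       # biases b_i

y = W@x+b                      # y_i = sum_j w_ij x_j + b_i
out = np.tanh(y)               # x_i = f(y_i)
```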

hidden layer

$$ \ba x_i &=& f\left[\sum_j w_{ij} f(y_j)\right]\\ x_i &=& f\left[\sum_j w_{ij} f\left[\sum_k w_{jk} x_k\right]\right] \ea $$

hidden layers

$$ \ba x_i &=& f\left[\sum_j w_{ij} f\left[\sum_k w_{jk} f(y_k)\right]\right]\\ x_i &=& f\left[\sum_j w_{ij} f\left[\sum_k w_{jk} f\left[\sum_l w_{kl} x_l\right]\right]\right] \ea $$

loss, data $d$

$$ \chi^2 = \sum_n \sum_i \left[x_{i,n}-d_{i,n}\right]^2 $$

gradient descent, back-propagation

$$ w_{ij} \rightarrow w_{ij} - \alpha \ps{\chi^2}{w_{ij}} $$$$ b_i \rightarrow b_i - \beta \ps{\chi^2}{b_i} $$

last layer

$$ \ba \ps{\chi^2}{w_{ij}} &=& \sum_n \sum_{i'} 2\left(x_{i',n}-d_{i',n}\right) \ps{x_{i',n}}{w_{ij}}\\ &=& \sum_n 2\left(x_{i,n}-d_{i,n}\right) f'\left(y_{i,n}\right) x_{j,n}\\ &\equiv& \sum_n \Delta_{i,n} x_{j,n} \ea $$$$ \ba \ps{\chi^2}{b_i} &=& \sum_n \sum_{i'} 2\left(x_{i',n}-d_{i',n}\right) \ps{x_{i',n}}{b_i}\\ &=& \sum_n 2\left(x_{i,n}-d_{i,n}\right) f'\left(y_{i,n}\right)\\ &\equiv& \sum_n \Delta_{i,n} \ea $$

next layer

$$ \ba \ps{\chi^2}{w_{jk}} &=& \sum_n \sum_i 2\left(x_{i,n}-d_{i,n}\right) \ps{x_{i,n}}{w_{jk}}\\ &=& \sum_n \sum_i 2\left(x_{i,n}-d_{i,n}\right) f'\left(y_{i,n}\right) w_{ij} f'\left(y_{j,n}\right) x_{k,n}\\ &=& \sum_n \sum_i \Delta_{i,n} w_{ij} f'\left(y_{j,n}\right) x_{k,n}\\ &=& \sum_n f'\left(y_{j,n}\right) \sum_i w_{ij} \Delta_{i,n} x_{k,n}\\ &\equiv& \sum_n \Delta_{j,n} x_{k,n} \ea $$$$ \ba \ps{\chi^2}{b_j} &=& \sum_n \sum_i 2\left(x_{i,n}-d_{i,n}\right) \ps{x_{i,n}}{b_j}\\ &=& \sum_n \sum_i 2\left(x_{i,n}-d_{i,n}\right) f'\left(y_{i,n}\right) w_{ij} f'\left(y_{j,n}\right)\\ &=& \sum_n \sum_i \Delta_{i,n} w_{ij} f'\left(y_{j,n}\right)\\ &=& \sum_n f'\left(y_{j,n}\right) \sum_i w_{ij} \Delta_{i,n}\\ &\equiv& \sum_n \Delta_{j,n} \ea $$

next layer

$$ \ba \ps{\chi^2}{w_{kl}} &=& \sum_n \sum_i 2\left(x_{i,n}-d_{i,n}\right) \ps{x_{i,n}}{w_{kl}}\\ &=& \sum_n \sum_i 2\left(x_{i,n}-d_{i,n}\right) f'\left(y_{i,n}\right) \sum_j w_{ij} f'\left(y_{j,n}\right) w_{jk} f'\left(y_{k,n}\right) x_{l,n}\\ &=& \sum_n \sum_i \Delta_{i,n} \sum_j w_{ij} f'\left(y_{j,n}\right) w_{jk} f'\left(y_{k,n}\right) x_{l,n}\\ &=& \sum_n \sum_j \Delta_{j,n} w_{jk} f'\left(y_{k,n}\right) x_{l,n}\\ &=& \sum_n f'\left(y_{k,n}\right) \sum_j w_{jk} \Delta_{j,n} x_{l,n}\\ &\equiv& \sum_n \Delta_{k,n} x_{l,n} \ea $$$$ \ba \ps{\chi^2}{b_k} &=& \sum_n \sum_i 2\left(x_{i,n}-d_{i,n}\right) \ps{x_{i,n}}{b_k}\\ &=& \sum_n \sum_i 2\left(x_{i,n}-d_{i,n}\right) f'\left(y_{i,n}\right) \sum_j w_{ij} f'\left(y_{j,n}\right) w_{jk} f'\left(y_{k,n}\right)\\ &=& \sum_n \sum_i \Delta_{i,n} \sum_j w_{ij} f'\left(y_{j,n}\right) w_{jk} f'\left(y_{k,n}\right)\\ &=& \sum_n \sum_j \Delta_{j,n} w_{jk} f'\left(y_{k,n}\right)\\ &=& \sum_n f'\left(y_{k,n}\right) \sum_j w_{jk} \Delta_{j,n}\\ &\equiv& \sum_n \Delta_{k,n} \ea $$

forward, backward training passes
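The $\Delta$ recursions above can be checked numerically; this sketch runs back-propagation by hand on a one-hidden-layer tanh network (random illustrative data, no biases) and compares one component of the analytic gradient against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5,3))  # 5 samples, 3 inputs
d = rng.normal(size=(5,2))  # targets
W1 = rng.normal(size=(3,4)) # input -> hidden weights
W2 = rng.normal(size=(4,2)) # hidden -> output weights

def chi2(W1,W2):
    h = np.tanh(X@W1)
    out = np.tanh(h@W2)
    return np.sum((out-d)**2)

# analytic gradient by back-propagation, f'(y) = 1 - tanh(y)^2
h = np.tanh(X@W1)
out = np.tanh(h@W2)
Delta2 = 2*(out-d)*(1-out**2)       # Delta at the output layer
gW2 = h.T@Delta2
Delta1 = (Delta2@W2.T)*(1-h**2)     # propagate back through the hidden layer
gW1 = X.T@Delta1

# finite-difference check on one weight
eps = 1e-6
W1p = W1.copy(); W1p[0,0] += eps
numeric = (chi2(W1p,W2)-chi2(W1,W2))/eps
```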

stochastic gradient descent

Bottou, Léon. "Large-scale machine learning with stochastic gradient descent." In Proceedings of COMPSTAT'2010, pp. 177-186. Physica-Verlag HD, 2010.

incremental, batch updates for large data sets

learning rate

momentum, local minima

ADAM adaptive rate, momentum

Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
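A sketch of the Adam update from that paper, applied to an illustrative one-dimensional quadratic loss (default β values; the rate α is arbitrary here):

```python
import numpy as np

def gradient(w):
    return 2*(w-3.0)         # gradient of the illustrative loss (w-3)^2

w = np.array([0.0])
m = np.zeros_like(w)         # first-moment estimate (momentum)
v = np.zeros_like(w)         # second-moment estimate (adaptive rate)
alpha,beta1,beta2,eps = 0.1,0.9,0.999,1e-8

for t in range(1,1001):
    g = gradient(w)
    m = beta1*m+(1-beta1)*g  # update biased moment estimates
    v = beta2*v+(1-beta2)*g**2
    mhat = m/(1-beta1**t)    # bias correction
    vhat = v/(1-beta2**t)
    w = w-alpha*mhat/(np.sqrt(vhat)+eps)
```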

Regularization

hyperparameters

validation vs testing data

overfitting

early stopping

Caruana, Rich, Steve Lawrence, and Lee Giles. "Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping." In NIPS, pp. 402-408. 2000.

penalize sum square weights

Krogh, A., and Hertz, J. (1991). A simple weight decay can improve generalization. Advances in neural information processing systems, 4.
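Penalizing the sum of squared weights adds a term $2\lambda w$ to each weight's gradient, shrinking it toward zero; a sketch with the data gradient zeroed out to isolate the decay (λ and the rate are illustrative):

```python
import numpy as np

lam = 0.01                    # regularization strength (illustrative)
rate = 0.1
w = np.array([2.0,-3.0,0.5])

def data_grad(w):
    return np.zeros_like(w)   # pretend the data gradient is zero

# the weight-decay term alone shrinks w geometrically toward zero
for step in range(100):
    w = w-rate*(data_grad(w)+2*lam*w)
```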

dropout, randomly drop units during training to prevent co-adaptation

Wager, S., Wang, S., & Liang, P. S. (2013). Dropout training as adaptive regularization. Advances in neural information processing systems, 26.
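A sketch of (inverted) dropout: each unit is kept with probability p during training, and survivors are rescaled by 1/p so the expected activation is unchanged; at test time the layer is used as-is. Values here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.8                          # keep probability
h = rng.normal(size=(1000,50))   # a batch of hidden-layer activations

mask = rng.random(h.shape) < p   # randomly keep units
h_drop = np.where(mask,h/p,0.0)  # zero the rest, rescale survivors by 1/p
```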

Examples

XOR

from sklearn.neural_network import MLPClassifier
import numpy as np
X = [[0,0],[0,1],[1,0],[1,1]]
y = [0,1,1,0]
classifier = MLPClassifier(solver='lbfgs',hidden_layer_sizes=(4),activation='tanh',random_state=1)
classifier.fit(X,y)
print(f"score: {classifier.score(X,y)}")
print("Predictions:")
np.c_[X,classifier.predict(X)]
score: 1.0
Predictions:
array([[0, 0, 0],
       [0, 1, 1],
       [1, 0, 1],
       [1, 1, 0]])
import jax
import jax.numpy as jnp
from jax import random,grad,jit
#
# init random key
#
key = random.PRNGKey(0)
#
# XOR training data
#
X = jnp.array([[0,0],[0,1],[1,0],[1,1]],dtype=jnp.int8)
y = jnp.array([0,1,1,0],dtype=jnp.int8).reshape(4,1)
#
# forward pass
#
@jit
def forward(params,layer_0):
    Weight1,bias1,Weight2,bias2 = params
    layer_1 = jnp.tanh(layer_0@Weight1+bias1)
    layer_2 = jax.nn.sigmoid(layer_1@Weight2+bias2)
    return layer_2
#
# loss function
#
@jit
def loss(params):
    ypred = forward(params,X)
    return jnp.mean((ypred-y)**2)
#
# gradient update step
#
@jit
def update(params,rate=1):
    gradient = grad(loss)(params)
    return jax.tree.map(lambda params,gradient:params-rate*gradient,params,gradient)
#
# parameter initialization
#
def init_params(key):
    key1,key2 = random.split(key)
    Weight1 = 0.5*random.normal(key1,(2,4))
    bias1 = jnp.zeros(4)
    Weight2 = 0.5*random.normal(key2,(4,1))
    bias2 = jnp.zeros(1)
    return (Weight1,bias1,Weight2,bias2)
#
# initialize parameters
#
params = init_params(key)
#
# training steps
#
for step in range(500):
    params = update(params,rate=10)
    if step%100 == 0:
        print(f"step {step:4d} loss={loss(params):.4f}")
#
# evaluate fit
#
pred = forward(params,X)
jnp.set_printoptions(precision=2)
print("\nPredictions:")
print(jnp.c_[X,pred])
step    0 loss=0.3047
step  100 loss=0.0008
step  200 loss=0.0004
step  300 loss=0.0002
step  400 loss=0.0002

Predictions:
[[0.   0.   0.  ]
 [0.   1.   0.99]
 [1.   0.   0.99]
 [1.   1.   0.01]]

MNIST

from sklearn.neural_network import MLPClassifier
import numpy as np
import matplotlib.pyplot as plt
xtrain = np.load('datasets/MNIST/xtrain.npy')
ytrain = np.load('datasets/MNIST/ytrain.npy')
xtest = np.load('datasets/MNIST/xtest.npy')
ytest = np.load('datasets/MNIST/ytest.npy')
print(f"read {xtrain.shape[1]} byte data records, {xtrain.shape[0]} training examples, {xtest.shape[0]} testing examples\n")
classifier = MLPClassifier(solver='adam',hidden_layer_sizes=(100),activation='relu',random_state=1,verbose=True,tol=0.05)
classifier.fit(xtrain,ytrain)
print(f"\ntest score: {classifier.score(xtest,ytest)}\n")
predictions = classifier.predict(xtest)
fig,axs = plt.subplots(1,5)
for i in range(5):
    axs[i].imshow(np.reshape(xtest[i],(28,28)))
    axs[i].axis('off')
    axs[i].set_title(f"predict: {predictions[i]}")
plt.tight_layout()
plt.show()
read 784 byte data records, 60000 training examples, 10000 testing examples

Iteration 1, loss = 3.36992820
Iteration 2, loss = 1.13264743
Iteration 3, loss = 0.67881654
Iteration 4, loss = 0.44722898
Iteration 5, loss = 0.31655389
Iteration 6, loss = 0.23663579
Iteration 7, loss = 0.19165519
Iteration 8, loss = 0.15617156
Iteration 9, loss = 0.13629980
Iteration 10, loss = 0.11865439
Iteration 11, loss = 0.11459503
Iteration 12, loss = 0.10146799
Iteration 13, loss = 0.09842103
Iteration 14, loss = 0.09300270
Iteration 15, loss = 0.08931920
Iteration 16, loss = 0.08818319
Iteration 17, loss = 0.09585389
Training loss did not improve more than tol=0.050000 for 10 consecutive epochs. Stopping.

test score: 0.958

import jax
import jax.numpy as jnp
from jax import random,grad,jit
import matplotlib.pyplot as plt
#
# hyperparameters
#
data_size = 28*28
hidden_size = data_size//10
output_size = 10
batch_size = 5000
train_steps = 25
learning_rate = 0.5
#
# init random key
#
key = random.PRNGKey(0)
#
# load MNIST data
#
xtrain = jnp.load('datasets/MNIST/xtrain.npy')
ytrain = jnp.load('datasets/MNIST/ytrain.npy')
xtest = jnp.load('datasets/MNIST/xtest.npy')
ytest = jnp.load('datasets/MNIST/ytest.npy')
print(f"read {xtrain.shape[1]} byte data records, {xtrain.shape[0]} training examples, {xtest.shape[0]} testing examples\n")
#
# forward pass
#
@jit
def forward(params,layer_0):
    Weight1,bias1,Weight2,bias2 = params
    layer_1 = jnp.tanh(layer_0@Weight1+bias1)
    layer_2 = layer_1@Weight2+bias2
    return layer_2
#
# loss function
#
@jit
def loss(params,xtrain,ytrain):
    logits = forward(params,xtrain)
    logits = logits-jnp.max(logits,axis=1,keepdims=True) # shift logits for numerical stability
    probs = jnp.exp(logits)/jnp.sum(jnp.exp(logits),axis=1,keepdims=True)
    error = 1-jnp.mean(probs[jnp.arange(len(ytrain)),ytrain])
    return error
#
# gradient update step
#
@jit
def update(params,xtrain,ytrain,rate):
    gradient = grad(loss)(params,xtrain,ytrain)
    return jax.tree.map(lambda params,gradient:params-rate*gradient,params,gradient)
#
# parameter initialization
#
def init_params(key,xsize,hidden,output):
    key1,key = random.split(key)
    Weight1 = 0.01*random.normal(key1,(xsize,hidden))
    bias1 = jnp.zeros(hidden)
    key2,key = random.split(key)
    Weight2 = 0.01*random.normal(key2,(hidden,output))
    bias2 = jnp.zeros(output)
    return (Weight1,bias1,Weight2,bias2)
#
# initialize parameters
#
params = init_params(key,data_size,hidden_size,output_size)
#
# train
#
print(f"starting loss: {loss(params,xtrain,ytrain):.3f}\n")
for batch in range(0,len(ytrain),batch_size):
    xbatch = xtrain[batch:batch+batch_size]
    ybatch = ytrain[batch:batch+batch_size]
    print(f"batch {batch}: ",end='')
    for step in range(train_steps):
        params = update(params,xbatch,ybatch,rate=learning_rate)
    print(f"loss {loss(params,xbatch,ybatch):.3f}")
#
# test
#
logits = forward(params,xtest)
logits = logits-jnp.max(logits,axis=1,keepdims=True) # shift logits for numerical stability
probs = jnp.exp(logits)/jnp.sum(jnp.exp(logits),axis=1,keepdims=True)
error = 1-jnp.mean(probs[jnp.arange(len(ytest)),ytest])
print(f"\ntest loss: {error:.3f}\n")
#
# plot
#
fig,axs = plt.subplots(1,5)
for i in range(5):
    axs[i].imshow(jnp.reshape(xtest[i],(28,28)))
    axs[i].axis('off')
    axs[i].set_title(f"predict: {jnp.argmax(probs[i])}")
plt.tight_layout()
plt.show()
read 784 byte data records, 60000 training examples, 10000 testing examples

starting loss: 0.899

batch 0: loss 0.381
batch 5000: loss 0.253
batch 10000: loss 0.198
batch 15000: loss 0.130
batch 20000: loss 0.114
batch 25000: loss 0.100
batch 30000: loss 0.097
batch 35000: loss 0.084
batch 40000: loss 0.082
batch 45000: loss 0.090
batch 50000: loss 0.077
batch 55000: loss 0.050

test loss: 0.085

Architectures

Convolutional

import matplotlib.pyplot as plt
import numpy as np
scale = .6
width = 7*scale
height = 7*scale
linewidth = 2.5
circlesize = 18
pointsize = 5
textsize = 14
fig,ax = plt.subplots(figsize=(width,height))
ax.axis([0,width,0,height])
ax.set_axis_off()
def wire(x0,y0,x1,y1,width):
    x0 = scale*x0
    x1 = scale*x1
    y0 = scale*y0
    y1 = scale*y1
    plt.plot([x0,x1],[y0,y1],'-',linewidth=width,color=(0.6,0.6,0.6))
def point(x,y,size):
    x = scale*x
    y = scale*y
    plt.plot(x,y,'ko',markersize=size)
def circle(x,y,size):
    x = scale*x
    y = scale*y
    plt.plot(x,y,'ko',markersize=size,markeredgewidth=size/8,
        markerfacecolor='white',markeredgecolor='black')
def text(x,y,text,size):
    x = scale*x
    y = scale*y
    ax.text(x,y,text,
        ha='center',va='center',
        math_fontfamily='cm',
        fontsize=size,color='black')
#
wire(.5,.5,1.5,2,linewidth)
wire(1.5,.5,1.5,2,linewidth)
wire(2.5,.5,1.5,2,linewidth)
wire(1.5,.5,2.5,2,linewidth)
wire(2.5,.5,2.5,2,linewidth)
wire(3.5,.5,2.5,2,linewidth)
wire(2.5,.5,3.5,2,linewidth)
wire(3.5,.5,3.5,2,linewidth)
wire(4.5,.5,3.5,2,linewidth)
wire(3.5,.5,4.5,2,linewidth)
wire(4.5,.5,4.5,2,linewidth)
wire(5.5,.5,4.5,2,linewidth)
#
wire(1.5,2,2,3.5,linewidth)
wire(2.5,2,2,3.5,linewidth)
wire(3.5,2,4,3.5,linewidth)
wire(4.5,2,4,3.5,linewidth)
#
circle(.5,.5,circlesize)
circle(1.5,.5,circlesize)
circle(2.5,.5,circlesize)
circle(3.5,.5,circlesize)
circle(4.5,.5,circlesize)
circle(5.5,.5,circlesize)
#
circle(1.5,2,circlesize)
circle(2.5,2,circlesize)
circle(3.5,2,circlesize)
circle(4.5,2,circlesize)
#
circle(2,3.5,circlesize)
circle(4,3.5,circlesize)
#
point(3,4,pointsize)
point(3,4.4,pointsize)
point(3,4.8,pointsize)
#
text(7.2,.5,'input layer',textsize)
text(7.4,1.4,'shared filter weights',textsize)
text(6.,2.8,'pooling layer',textsize)
#
plt.show()

Figure: A Convolutional Neural Network (CNN)

pattern recognition

want invariance to translation, rotation

huge number of inputs, e.g. pixels

find feature maps

Hubel, David H., and Torsten N. Wiesel. "Receptive fields and functional architecture of monkey striate cortex." The Journal of physiology 195, no. 1 (1968): 215-243.

LeCun, Yann, Koray Kavukcuoglu, and Clément Farabet. "Convolutional networks and applications in vision." In Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on, pp. 253-256. IEEE, 2010.

filter layers

pooling layers
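A sketch of the two layer types in one dimension: a single shared filter slid along the input, followed by non-overlapping max pooling (the signal and filter here are illustrative):

```python
import numpy as np

x = np.array([0,0,1,1,0,0,1,0.]) # 1D input signal
w = np.array([1.0,-1.0])         # shared filter weights (an edge detector)

# filter layer: the same weights applied at every position (a correlation)
conv = np.array([np.dot(w,x[i:i+len(w)]) for i in range(len(x)-len(w)+1)])

# pooling layer: downsample by taking the max over windows of 2
pool = conv[:len(conv)//2*2].reshape(-1,2).max(axis=1)
```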

Recurrent

import matplotlib.pyplot as plt
import numpy as np
scale = .6
width = 7*scale
height = 7*scale
linewidth = 2.5
circlesize = 18
pointsize = 5
textsize = 14
fig,ax = plt.subplots(figsize=(width,height))
ax.axis([0,width,0,height])
ax.set_axis_off()
def wire(x0,y0,x1,y1,width):
    x0 = scale*x0
    x1 = scale*x1
    y0 = scale*y0
    y1 = scale*y1
    plt.plot([x0,x1],[y0,y1],'-',linewidth=width,color=(0.6,0.6,0.6))
def arrow(x0,y0,x1,y1,width):
    x0 = scale*x0
    x1 = scale*x1
    y0 = scale*y0
    y1 = scale*y1
    ax.annotate('',xy=(x1,y1),xytext=(x0,y0),
        arrowprops=dict(color=(0.6,0.6,0.6),width=width,headwidth=3*width,headlength=3*width))
def point(x,y,size):
    x = scale*x
    y = scale*y
    plt.plot(x,y,'ko',markersize=size)
def circle(x,y,size):
    x = scale*x
    y = scale*y
    plt.plot(x,y,'ko',markersize=size,markeredgewidth=size/8,
        markerfacecolor='white',markeredgecolor='black')
def text(x,y,text,size):
    x = scale*x
    y = scale*y
    ax.text(x,y,text,
        ha='center',va='center',
        math_fontfamily='cm',
        fontsize=size,color='black')
#
wire(.5,.5,1,2,linewidth)
wire(.5,.5,2,2,linewidth)
wire(.5,.5,3,2,linewidth)
wire(.5,.5,4,2,linewidth)
wire(1.5,.5,1,2,linewidth)
wire(1.5,.5,2,2,linewidth)
wire(1.5,.5,3,2,linewidth)
wire(1.5,.5,4,2,linewidth)
wire(2.5,.5,1,2,linewidth)
wire(2.5,.5,2,2,linewidth)
wire(2.5,.5,3,2,linewidth)
wire(2.5,.5,4,2,linewidth)
wire(3.5,.5,1,2,linewidth)
wire(3.5,.5,2,2,linewidth)
wire(3.5,.5,3,2,linewidth)
wire(3.5,.5,4,2,linewidth)
wire(4.5,.5,1,2,linewidth)
wire(4.5,.5,2,2,linewidth)
wire(4.5,.5,3,2,linewidth)
wire(4.5,.5,4,2,linewidth)
#
wire(1.5,4,2,5.5,linewidth)
wire(1.5,4,3,5.5,linewidth)
wire(2.5,4,2,5.5,linewidth)
wire(2.5,4,3,5.5,linewidth)
wire(3.5,4,2,5.5,linewidth)
wire(3.5,4,3,5.5,linewidth)
#
wire(3.5,4,5,4,linewidth)
wire(5,4,5,2,linewidth)
arrow(5,2,4.3,2,linewidth)
#
circle(.5,.5,circlesize)
circle(1.5,.5,circlesize)
circle(2.5,.5,circlesize)
circle(3.5,.5,circlesize)
circle(4.5,.5,circlesize)
#
circle(1,2,circlesize)
circle(2,2,circlesize)
circle(3,2,circlesize)
circle(4,2,circlesize)
#
point(2.5,2.6,pointsize)
point(2.5,3,pointsize)
point(2.5,3.4,pointsize)
#
circle(1.5,4,circlesize)
circle(2.5,4,circlesize)
circle(3.5,4,circlesize)
#
circle(2,5.5,circlesize)
circle(3,5.5,circlesize)
#
text(6.2,.5,'input layer',textsize)
text(5.9,1.25,'feed forward',textsize)
text(6.3,3,'feedback',textsize)
text(4.8,5.5,'output layer',textsize)
#
plt.show()

Figure: A Recurrent Neural Network (RNN)

introduces time, memory

MLP uses a fixed window, like an FIR filter

RNN is like an IIR filter

Pineda, Fernando J. "Generalization of back-propagation to recurrent neural networks." Physical review letters 59, no. 19 (1987): 2229.

unroll, do backprop through time
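A sketch of the unrolling: the same feedback weights are reused at every time step, which is why gradients through many steps involve repeated products of those weights (all values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
Wx = rng.normal(scale=0.5,size=(3,4)) # input -> hidden weights
Wh = rng.normal(scale=0.5,size=(4,4)) # hidden -> hidden feedback weights
xs = rng.normal(size=(6,3))           # a sequence of 6 input vectors

h = np.zeros(4)                       # initial hidden state
states = []
for x in xs:                          # unroll over time, reusing the same weights
    h = np.tanh(x@Wx+h@Wh)
    states.append(h)
states = np.array(states)
```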

issue: vanishing, exploding gradients

LSTM

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.

Autoencoder

import matplotlib.pyplot as plt
import numpy as np
scale = .6
width = 7*scale
height = 5.5*scale
linewidth = 2.5
circlesize = 18
pointsize = 5
textsize = 14
fig,ax = plt.subplots(figsize=(width,height))
ax.axis([0,width,0,height])
ax.set_axis_off()
def wire(x0,y0,x1,y1,width):
    x0 = scale*x0
    x1 = scale*x1
    y0 = scale*y0
    y1 = scale*y1
    plt.plot([x0,x1],[y0,y1],'-',linewidth=width,color=(0.6,0.6,0.6))
def point(x,y,size):
    x = scale*x
    y = scale*y
    plt.plot(x,y,'ko',markersize=size)
def circle(x,y,size):
    x = scale*x
    y = scale*y
    plt.plot(x,y,'ko',markersize=size,markeredgewidth=size/8,
        markerfacecolor='white',markeredgecolor='black')
def text(x,y,text,size):
    x = scale*x
    y = scale*y
    ax.text(x,y,text,
        ha='center',va='center',
        math_fontfamily='cm',
        fontsize=size,color='black')
#
#wire(3.5,4,3,5.5,linewidth)
#
circle(.5,.5,circlesize)
circle(1.5,.5,circlesize)
circle(2.5,.5,circlesize)
circle(3.5,.5,circlesize)
circle(4.5,.5,circlesize)
#
point(2.5,1.2,pointsize)
point(2.5,1.6,pointsize)
point(2.5,2,pointsize)
#
circle(2,2.6,circlesize)
circle(3,2.6,circlesize)
#
point(2.5,3.2,pointsize)
point(2.5,3.6,pointsize)
point(2.5,4,pointsize)
#
circle(.5,4.7,circlesize)
circle(1.5,4.7,circlesize)
circle(2.5,4.7,circlesize)
circle(3.5,4.7,circlesize)
circle(4.5,4.7,circlesize)
#
text(6.2,.5,'input layer',textsize)
text(4.8,2.6,'latent layer',textsize)
text(6.4,4.7,'output layer',textsize)
#
plt.show()

Figure: An Autoencoder

learn to predict the input through a bottleneck

finds a lower-dimensional representation

unsupervised
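A sketch of the idea with scikit-learn's MLPRegressor: train the network to output its own input through a one-unit bottleneck, on synthetic data that lies near a one-dimensional curve embedded in three dimensions:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
t = rng.uniform(-1,1,size=(500,1)) # 1D latent variable
X = np.hstack([t,t**2,-t])         # embedded in 3D
X += 0.01*rng.normal(size=X.shape) # small noise

# a one-unit bottleneck forces a 1D internal representation
auto = MLPRegressor(hidden_layer_sizes=(8,1,8),activation='tanh',
    solver='lbfgs',max_iter=2000,random_state=1)
auto.fit(X,X)                      # the target is the input itself
```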

Packages

References

  • [Ekman:21] Ekman, M. (2021). Learning deep learning: Theory and practice of neural networks, computer vision, NLP, and transformers using Tensorflow.
    • A good balance between breadth and depth.
  • [Fleuret:24] Fleuret, F. (2024). The Little Book of Deep Learning.

Problems

  1. Train and test a neural network classifier on the data set you used for the PCA problem in Transforms.

  2. Train a neural network autoencoder to recognize DTMF tones.

  3. Train a recurrent neural network to predict the output of a Linear Feedback Shift Register, and verify its ability to continue a LFSR sequence.