Neural Networks¶
The Nature of Mathematical Modeling (draft)¶
$ \def\%#1{\mathbf{#1}} \def\mat#1{\mathbf{#1}} \def\*#1{\vec{#1}} \def\ve#1{\vec{#1}} \def\ds#1#2{\cfrac{d #1}{d #2}} \def\dd#1#2{\cfrac{d^2 #1}{d #2^2}} \def\ps#1#2{\cfrac{\partial #1}{\partial #2}} \def\pp#1#2{\cfrac{\p^2 #1}{\p #2^2}} \def\p{\partial} \def\ba{\begin{eqnarray*}} \def\ea{\end{eqnarray*}} \def\eps{\epsilon} \def\del{\nabla} \def\disp{\displaystyle} \def\la{\langle} \def\ra{\rangle} \def\unit#1{~({\rm #1)}} \def\units#1#2{~\left(\frac{\rm #1}{\rm #2}\right)} $
- history diverging, converging
- curse, blessing of dimensionality
- hidden layers latent variables
- vanishing, exploding gradients
- quantization
- edge
import matplotlib.pyplot as plt
import numpy as np
scale = .6
width = 7*scale
height = 6*scale
linewidth = 2.5
circlesize = 18
pointsize = 5
textsize = 14
fig,ax = plt.subplots(figsize=(width,height))
ax.axis([0,width,0,height])
ax.set_axis_off()
def wire(x0,y0,x1,y1,width):
x0 = scale*x0
x1 = scale*x1
y0 = scale*y0
y1 = scale*y1
plt.plot([x0,x1],[y0,y1],'-',linewidth=width,color=(0.6,0.6,0.6))
def point(x,y,size):
x = scale*x
y = scale*y
plt.plot(x,y,'ko',markersize=size)
def circle(x,y,size):
x = scale*x
y = scale*y
plt.plot(x,y,'ko',markersize=size,markeredgewidth=size/8,
markerfacecolor='white',markeredgecolor='black')
def text(x,y,text,size):
x = scale*x
y = scale*y
ax.text(x,y,text,
ha='center',va='center',
math_fontfamily='cm',
fontsize=size,color='black')
#
wire(.5,.5,1,2,linewidth)
wire(.5,.5,2,2,linewidth)
wire(.5,.5,3,2,linewidth)
wire(.5,.5,4,2,linewidth)
wire(1.5,.5,1,2,linewidth)
wire(1.5,.5,2,2,linewidth)
wire(1.5,.5,3,2,linewidth)
wire(1.5,.5,4,2,linewidth)
wire(2.5,.5,1,2,linewidth)
wire(2.5,.5,2,2,linewidth)
wire(2.5,.5,3,2,linewidth)
wire(2.5,.5,4,2,linewidth)
wire(3.5,.5,1,2,linewidth)
wire(3.5,.5,2,2,linewidth)
wire(3.5,.5,3,2,linewidth)
wire(3.5,.5,4,2,linewidth)
wire(4.5,.5,1,2,linewidth)
wire(4.5,.5,2,2,linewidth)
wire(4.5,.5,3,2,linewidth)
wire(4.5,.5,4,2,linewidth)
#
wire(1.5,4,2,5.5,linewidth)
wire(1.5,4,3,5.5,linewidth)
wire(2.5,4,2,5.5,linewidth)
wire(2.5,4,3,5.5,linewidth)
wire(3.5,4,2,5.5,linewidth)
wire(3.5,4,3,5.5,linewidth)
#
circle(.5,.5,circlesize)
circle(1.5,.5,circlesize)
circle(2.5,.5,circlesize)
circle(3.5,.5,circlesize)
circle(4.5,.5,circlesize)
#
circle(1,2,circlesize)
circle(2,2,circlesize)
circle(3,2,circlesize)
circle(4,2,circlesize)
#
point(2.5,2.6,pointsize)
point(2.5,3,pointsize)
point(2.5,3.4,pointsize)
#
circle(1.5,4,circlesize)
circle(2.5,4,circlesize)
circle(3.5,4,circlesize)
#
circle(2,5.5,circlesize)
circle(3,5.5,circlesize)
#
text(6.2,.5,'input layer',textsize)
text(5.4,1.25,'weights',textsize)
text(4.2,3,'hidden layers',textsize)
text(4.8,5.5,'output layer',textsize)
#
plt.show()
Figure: A Multi-Layer Perceptron (MLP) Deep Neural Network (DNN)
supervised (regression, classification), unsupervised, reinforcement
neuroscience diverge, converge
Hasson, Uri, Samuel A. Nastase, and Ariel Goldstein. "Direct fit to nature: An evolutionary perspective on biological and artificial neural networks." Neuron 105, no. 3 (2020): 416-434.
Lillicrap, T. P., Santoro, A., Marris, L., Akerman, C. J., & Hinton, G. (2020). Backpropagation and the brain. Nature Reviews Neuroscience, 21(6), 335-346.
hierarchy missing until now
model neurons
McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5, 115-133.
perceptrons
Rosenblatt, F. (1957). The perceptron: A perceiving and recognizing automaton (Project Para). Cornell Aeronautical Laboratory.
Minsky, Marvin, and Seymour A. Papert. Perceptrons: An introduction to computational geometry. MIT press, 2017.
XOR history
deep
LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. "Deep learning." Nature 521, no. 7553 (2015): 436-444.
linear vs nonlinear coefficients in functions
under mild assumptions linear depth vs exponential breadth
Telgarsky, M. (2016, June). Benefits of depth in neural networks. In Conference on learning theory (pp. 1517-1539). PMLR.
Poggio, T., Mhaskar, H., Rosasco, L., Miranda, B., & Liao, Q. (2017). Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review. International Journal of Automation and Computing, 14(5), 503-519.
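The depth advantage can be illustrated with the tent map: one composition is a single small ReLU-style layer, and each additional layer of depth doubles the number of linear pieces, which a single hidden layer could only match with exponentially many units. A minimal sketch (the grid size and the piece-counting method are illustrative choices):

```python
import numpy as np

# the tent map is one small piecewise-linear layer:
# f(x) = 2x for x < 1/2, 2(1-x) for x > 1/2
def tent(x):
    return 2*np.minimum(x, 1 - x)

# composing it k times (depth k) produces 2^k - 1 breakpoints,
# i.e. 2^k linear pieces, from only ~k units of hardware
x = np.linspace(0, 1, 1025)
y = x
for _ in range(5):
    y = tent(y)

# count the linear pieces by sign changes of the slope
slope = np.diff(y)
pieces = 1 + np.count_nonzero(np.abs(np.diff(np.sign(slope))) > 0)
print(pieces)   # 2^5 = 32 pieces from depth 5
```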
Functions¶
Breadth vs Depth¶
Activation¶
step function, logic switch
not differentiable
tanh
sigmoid
softmax
saturation
ReLU
Nair, Vinod, and Geoffrey E. Hinton. "Rectified linear units improve restricted boltzmann machines." In Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807-814. 2010.
leaky ReLU
Xu, J., Li, Z., Du, B., Zhang, M., & Liu, J. (2020, July). Reluplex made more practical: Leaky ReLU. In 2020 IEEE Symposium on Computers and communications (ISCC) (pp. 1-7). IEEE.
GeLU
Hendrycks, D., & Gimpel, K. (2016). Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415.
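GELU weights its input by the Gaussian CDF; a minimal sketch using the tanh approximation given in the Hendrycks & Gimpel paper (not a library implementation):

```python
import numpy as np

def gelu(x):
    # tanh approximation of x * Phi(x), from Hendrycks & Gimpel (2016)
    return 0.5*x*(1 + np.tanh(np.sqrt(2/np.pi)*(x + 0.044715*x**3)))

# smooth near zero, ~identity for large positive x, ~zero for large negative x
print(gelu(np.array([-3.0, 0.0, 3.0])))
```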
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(-3,3,100)
plt.plot(x,1/(1+np.exp(-x)),label='sigmoid')
plt.plot(x,np.tanh(x),label='tanh')
plt.plot(x,np.where(x < 0,0,x),label='ReLU')
plt.plot(x,np.where(x < 0,0.1*x,x),'--',label='leaky ReLU')
plt.legend()
plt.show()
Training¶
Preprocessing¶
pre-process data, zero mean unit variance, sphering, standardization
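These preprocessing steps can be sketched directly; the synthetic data here is a hypothetical stand-in for real inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(1000, 2))   # hypothetical raw data

# standardization: zero mean, unit variance per feature
mu, sigma = X.mean(axis=0), X.std(axis=0)
Xs = (X - mu)/sigma

# sphering (whitening): also rotate and rescale away correlations
C = np.cov(Xs, rowvar=False)
vals, vecs = np.linalg.eigh(C)
Xw = Xs @ vecs/np.sqrt(vals)

print(Xs.mean(axis=0), Xs.std(axis=0))
print(np.cov(Xw, rowvar=False))   # ~identity
```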
Loss¶
mean squared error for regression
cross entropy for classification
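Both losses can be written in a few lines of NumPy; a minimal sketch (the max-subtraction in the cross entropy is the usual numerical-stability trick):

```python
import numpy as np

def mse(pred, target):
    # mean squared error for regression
    return np.mean((pred - target)**2)

def cross_entropy(logits, labels):
    # cross entropy for classification, from raw scores ("logits")
    logits = logits - logits.max(axis=1, keepdims=True)   # for stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(logp[np.arange(len(labels)), labels])

print(mse(np.array([1.0, 2.0]), np.array([1.5, 2.0])))   # 0.125
logits = np.array([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
print(cross_entropy(logits, np.array([0, 1])))
```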
Backpropagation¶
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1985). Learning internal representations by error propagation. California Univ San Diego La Jolla Inst for Cognitive Science.
inputs $x_j$
combine with weights
$$ y_i = \sum_j w_{ij} x_j $$

can add bias for fixed values and to adjust sensitivity

$$ y_i = \sum_j w_{ij} x_j + b_i $$

output through activation function

$$ x_i = f(y_i) $$

hidden layer

$$ \ba x_i &=& f\left[\sum_j w_{ij} f(y_j)\right]\\ x_i &=& f\left[\sum_j w_{ij} f\left[\sum_k w_{jk} x_k\right]\right] \ea $$

hidden layers

$$ \ba x_i &=& f\left[\sum_j w_{ij} f\left[\sum_k w_{jk} f(y_k)\right]\right]\\ x_i &=& f\left[\sum_j w_{ij} f\left[\sum_k w_{jk} f\left[\sum_l w_{kl} x_l\right]\right]\right] \ea $$

loss, data $d$
$$ \chi^2 = \sum_n \sum_i \left[x_{i,n}-d_{i,n}\right]^2 $$

gradient descent, back-propagation

$$ w_{ij} \rightarrow w_{ij} - \alpha \ps{\chi^2}{w_{ij}} $$

$$ b_i \rightarrow b_i - \beta \ps{\chi^2}{b_i} $$

last layer
$$ \ba \ps{\chi^2}{w_{ij}} &=& \sum_n \sum_{i'} 2\left(x_{i',n}-d_{i',n}\right) \ps{x_{i',n}}{w_{ij}}\\ &=& \sum_n 2\left(x_{i,n}-d_{i,n}\right) f'\left(y_{i,n}\right) x_{j,n}\\ &\equiv& \sum_n \Delta_{i,n} x_{j,n} \ea $$

$$ \ba \ps{\chi^2}{b_i} &=& \sum_n \sum_{i'} 2\left(x_{i',n}-d_{i',n}\right) \ps{x_{i',n}}{b_i}\\ &=& \sum_n 2\left(x_{i,n}-d_{i,n}\right) f'\left(y_{i,n}\right)\\ &\equiv& \sum_n \Delta_{i,n} \ea $$

next layer

$$ \ba \ps{\chi^2}{w_{jk}} &=& \sum_n \sum_i 2\left(x_{i,n}-d_{i,n}\right) \ps{x_{i,n}}{w_{jk}}\\ &=& \sum_n \sum_i 2\left(x_{i,n}-d_{i,n}\right) f'\left(y_{i,n}\right) w_{ij} f'\left(y_{j,n}\right) x_{k,n}\\ &=& \sum_n \sum_i \Delta_{i,n} w_{ij} f'\left(y_{j,n}\right) x_{k,n}\\ &=& \sum_n f'\left(y_{j,n}\right) \sum_i w_{ij} \Delta_{i,n} x_{k,n}\\ &\equiv& \sum_n \Delta_{j,n} x_{k,n} \ea $$

$$ \ba \ps{\chi^2}{b_j} &=& \sum_n \sum_i 2\left(x_{i,n}-d_{i,n}\right) \ps{x_{i,n}}{b_j}\\ &=& \sum_n \sum_i 2\left(x_{i,n}-d_{i,n}\right) f'\left(y_{i,n}\right) w_{ij} f'\left(y_{j,n}\right)\\ &=& \sum_n \sum_i \Delta_{i,n} w_{ij} f'\left(y_{j,n}\right)\\ &=& \sum_n f'\left(y_{j,n}\right) \sum_i w_{ij} \Delta_{i,n}\\ &\equiv& \sum_n \Delta_{j,n} \ea $$

next layer

$$ \ba \ps{\chi^2}{w_{kl}} &=& \sum_n \sum_i 2\left(x_{i,n}-d_{i,n}\right) \ps{x_{i,n}}{w_{kl}}\\ &=& \sum_n \sum_i 2\left(x_{i,n}-d_{i,n}\right) f'\left(y_{i,n}\right) \sum_j w_{ij} f'\left(y_{j,n}\right) w_{jk} f'\left(y_{k,n}\right) x_{l,n}\\ &=& \sum_n \sum_i \Delta_{i,n} \sum_j w_{ij} f'\left(y_{j,n}\right) w_{jk} f'\left(y_{k,n}\right) x_{l,n}\\ &=& \sum_n \sum_j \Delta_{j,n} w_{jk} f'\left(y_{k,n}\right) x_{l,n}\\ &=& \sum_n f'\left(y_{k,n}\right) \sum_j w_{jk} \Delta_{j,n} x_{l,n}\\ &\equiv& \sum_n \Delta_{k,n} x_{l,n} \ea $$

$$ \ba \ps{\chi^2}{b_k} &=& \sum_n \sum_i 2\left(x_{i,n}-d_{i,n}\right) \ps{x_{i,n}}{b_k}\\ &=& \sum_n \sum_i 2\left(x_{i,n}-d_{i,n}\right) f'\left(y_{i,n}\right) \sum_j w_{ij} f'\left(y_{j,n}\right) w_{jk} f'\left(y_{k,n}\right)\\ &=& \sum_n \sum_i \Delta_{i,n} \sum_j w_{ij} f'\left(y_{j,n}\right) w_{jk} f'\left(y_{k,n}\right)\\ &=& \sum_n \sum_j \Delta_{j,n} w_{jk} f'\left(y_{k,n}\right)\\ &=& \sum_n f'\left(y_{k,n}\right) \sum_j w_{jk} \Delta_{j,n}\\ &\equiv& \sum_n \Delta_{k,n} \ea $$

forward, backward training passes
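The $\Delta$ recursion can be checked numerically; a minimal sketch with hypothetical random data, comparing the backpropagated gradient of one last-layer weight against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(1)

# one hidden layer, tanh activations, chi^2 loss, random data (hypothetical)
X = rng.standard_normal((5, 3))
d = rng.standard_normal((5, 2))
W1 = 0.5*rng.standard_normal((3, 4)); b1 = np.zeros(4)
W2 = 0.5*rng.standard_normal((4, 2)); b2 = np.zeros(2)

f = np.tanh
fprime = lambda y: 1 - np.tanh(y)**2

def chi2(W2):
    return np.sum((f(f(X @ W1 + b1) @ W2 + b2) - d)**2)

# forward pass, keeping the pre-activations y
y1 = X @ W1 + b1;  x1 = f(y1)
y2 = x1 @ W2 + b2; x2 = f(y2)

# backward pass: Delta of the last layer, then propagate it back
Delta2 = 2*(x2 - d)*fprime(y2)        # Delta_{i,n}
Delta1 = (Delta2 @ W2.T)*fprime(y1)   # Delta_{j,n} = f'(y_j) sum_i w_{ij} Delta_{i,n}
dW2 = x1.T @ Delta2                   # sum_n Delta_{i,n} x_{j,n}
dW1 = X.T @ Delta1                    # sum_n Delta_{j,n} x_{k,n}

# compare one last-layer weight against a finite difference
eps = 1e-6
W2p = W2.copy(); W2p[0, 0] += eps
fd = (chi2(W2p) - chi2(W2))/eps
print(dW2[0, 0], fd)
```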
stochastic gradient descent
Bottou, Léon. "Large-scale machine learning with stochastic gradient descent." In Proceedings of COMPSTAT'2010, pp. 177-186. Physica-Verlag HD, 2010.
incremental, batch updates for large data sets
learning rate
momentum, local minima
ADAM adaptive rate, momentum
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
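The Adam update keeps running averages of the gradient (momentum) and its square (per-parameter rate adaptation); a minimal sketch on a hypothetical one-parameter problem, not a library implementation:

```python
import numpy as np

def adam_step(w, g, m, v, t, rate=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1*m + (1 - beta1)*g       # running mean of the gradient
    v = beta2*v + (1 - beta2)*g**2    # running mean of its square
    mhat = m/(1 - beta1**t)           # correct the bias from the zero start
    vhat = v/(1 - beta2**t)
    return w - rate*mhat/(np.sqrt(vhat) + eps), m, v

# minimize (w - 3)^2; w should approach 3
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    g = 2*(w - 3)
    w, m, v = adam_step(w, g, m, v, t, rate=0.1)
print(w)
```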
Regularization¶
hyperparameters
validation vs testing data
overfitting
early stopping
Caruana, Rich, Steve Lawrence, and Lee Giles. "Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping." In NIPS, pp. 402-408. 2000.
penalize sum square weights
Krogh, A., and Hertz, J. (1991). A simple weight decay can improve generalization. Advances in neural information processing systems, 4.
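Penalizing the sum of squared weights adds a term proportional to each weight to its own gradient, so the update shrinks weights toward zero; a minimal sketch of that decay term in isolation (the constants are illustrative):

```python
import numpy as np

# adding lambda*sum(w^2) to chi^2 contributes 2*lambda*w to the gradient,
# so each gradient step also shrinks the weights ("weight decay")
lam, alpha = 0.01, 0.1

def step(w, grad_chi2):
    return w - alpha*(grad_chi2 + 2*lam*w)

# with no data gradient the weights decay geometrically
w = np.array([1.0, -2.0])
for _ in range(100):
    w = step(w, np.zeros_like(w))
print(w)   # w0 * (1 - 2*alpha*lam)**100
```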
dropout, randomly drop weights to prevent fine-tuning
Wager, S., Wang, S., & Liang, P. S. (2013). Dropout training as adaptive regularization. Advances in neural information processing systems, 26.
Examples¶
XOR¶
from sklearn.neural_network import MLPClassifier
import numpy as np
X = [[0,0],[0,1],[1,0],[1,1]]
y = [0,1,1,0]
classifier = MLPClassifier(solver='lbfgs',hidden_layer_sizes=(4,),activation='tanh',random_state=1)
classifier.fit(X,y)
print(f"score: {classifier.score(X,y)}")
print("Predictions:")
np.c_[X,classifier.predict(X)]
score: 1.0
Predictions:
array([[0, 0, 0],
[0, 1, 1],
[1, 0, 1],
[1, 1, 0]])
import jax
import jax.numpy as jnp
from jax import random,grad,jit
#
# init random key
#
key = random.PRNGKey(0)
#
# XOR training data
#
X = jnp.array([[0,0],[0,1],[1,0],[1,1]],dtype=jnp.int8)
y = jnp.array([0,1,1,0],dtype=jnp.int8).reshape(4,1)
#
# forward pass
#
@jit
def forward(params,layer_0):
Weight1,bias1,Weight2,bias2 = params
layer_1 = jnp.tanh(layer_0@Weight1+bias1)
layer_2 = jax.nn.sigmoid(layer_1@Weight2+bias2)
return layer_2
#
# loss function
#
@jit
def loss(params):
ypred = forward(params,X)
return jnp.mean((ypred-y)**2)
#
# gradient update step
#
@jit
def update(params,rate=1):
gradient = grad(loss)(params)
return jax.tree.map(lambda params,gradient:params-rate*gradient,params,gradient)
#
# parameter initialization
#
def init_params(key):
key1,key2 = random.split(key)
Weight1 = 0.5*random.normal(key1,(2,4))
bias1 = jnp.zeros(4)
Weight2 = 0.5*random.normal(key2,(4,1))
bias2 = jnp.zeros(1)
return (Weight1,bias1,Weight2,bias2)
#
# initialize parameters
#
params = init_params(key)
#
# training steps
#
for step in range(500):
params = update(params,rate=10)
if step%100 == 0:
print(f"step {step:4d} loss={loss(params):.4f}")
#
# evaluate fit
#
pred = forward(params,X)
jnp.set_printoptions(precision=2)
print("\nPredictions:")
print(jnp.c_[X,pred])
step    0 loss=0.3047
step  100 loss=0.0008
step  200 loss=0.0004
step  300 loss=0.0002
step  400 loss=0.0002

Predictions:
[[0.   0.   0.  ]
 [0.   1.   0.99]
 [1.   0.   0.99]
 [1.   1.   0.01]]
MNIST¶
from sklearn.neural_network import MLPClassifier
import matplotlib.pyplot as plt
import numpy as np
xtrain = np.load('datasets/MNIST/xtrain.npy')
ytrain = np.load('datasets/MNIST/ytrain.npy')
xtest = np.load('datasets/MNIST/xtest.npy')
ytest = np.load('datasets/MNIST/ytest.npy')
print(f"read {xtrain.shape[1]} byte data records, {xtrain.shape[0]} training examples, {xtest.shape[0]} testing examples\n")
classifier = MLPClassifier(solver='adam',hidden_layer_sizes=(100,),activation='relu',random_state=1,verbose=True,tol=0.05)
classifier.fit(xtrain,ytrain)
print(f"\ntest score: {classifier.score(xtest,ytest)}\n")
predictions = classifier.predict(xtest)
fig,axs = plt.subplots(1,5)
for i in range(5):
axs[i].imshow(np.reshape(xtest[i],(28,28)))
axs[i].axis('off')
axs[i].set_title(f"predict: {predictions[i]}")
plt.tight_layout()
plt.show()
read 784 byte data records, 60000 training examples, 10000 testing examples

Iteration 1, loss = 3.36992820
Iteration 2, loss = 1.13264743
Iteration 3, loss = 0.67881654
Iteration 4, loss = 0.44722898
Iteration 5, loss = 0.31655389
Iteration 6, loss = 0.23663579
Iteration 7, loss = 0.19165519
Iteration 8, loss = 0.15617156
Iteration 9, loss = 0.13629980
Iteration 10, loss = 0.11865439
Iteration 11, loss = 0.11459503
Iteration 12, loss = 0.10146799
Iteration 13, loss = 0.09842103
Iteration 14, loss = 0.09300270
Iteration 15, loss = 0.08931920
Iteration 16, loss = 0.08818319
Iteration 17, loss = 0.09585389
Training loss did not improve more than tol=0.050000 for 10 consecutive epochs. Stopping.

test score: 0.958
import jax
import jax.numpy as jnp
from jax import random,grad,jit
import matplotlib.pyplot as plt
#
# hyperparameters
#
data_size = 28*28
hidden_size = data_size//10
output_size = 10
batch_size = 5000
train_steps = 25
learning_rate = 0.5
#
# init random key
#
key = random.PRNGKey(0)
#
# load MNIST data
#
xtrain = jnp.load('datasets/MNIST/xtrain.npy')
ytrain = jnp.load('datasets/MNIST/ytrain.npy')
xtest = jnp.load('datasets/MNIST/xtest.npy')
ytest = jnp.load('datasets/MNIST/ytest.npy')
print(f"read {xtrain.shape[1]} byte data records, {xtrain.shape[0]} training examples, {xtest.shape[0]} testing examples\n")
#
# forward pass
#
@jit
def forward(params,layer_0):
Weight1,bias1,Weight2,bias2 = params
layer_1 = jnp.tanh(layer_0@Weight1+bias1)
layer_2 = layer_1@Weight2+bias2
return layer_2
#
# loss function
#
@jit
def loss(params,xtrain,ytrain):
logits = forward(params,xtrain)
probs = jnp.exp(logits)/jnp.sum(jnp.exp(logits),axis=1,keepdims=True)
error = 1-jnp.mean(probs[jnp.arange(len(ytrain)),ytrain])
return error
#
# gradient update step
#
@jit
def update(params,xtrain,ytrain,rate):
gradient = grad(loss)(params,xtrain,ytrain)
return jax.tree.map(lambda params,gradient:params-rate*gradient,params,gradient)
#
# parameter initialization
#
def init_params(key,xsize,hidden,output):
key1,key = random.split(key)
Weight1 = 0.01*random.normal(key1,(xsize,hidden))
bias1 = jnp.zeros(hidden)
key2,key = random.split(key)
Weight2 = 0.01*random.normal(key2,(hidden,output))
bias2 = jnp.zeros(output)
return (Weight1,bias1,Weight2,bias2)
#
# initialize parameters
#
params = init_params(key,data_size,hidden_size,output_size)
#
# train
#
print(f"starting loss: {loss(params,xtrain,ytrain):.3f}\n")
for batch in range(0,len(ytrain),batch_size):
xbatch = xtrain[batch:batch+batch_size]
ybatch = ytrain[batch:batch+batch_size]
print(f"batch {batch}: ",end='')
for step in range(train_steps):
params = update(params,xbatch,ybatch,rate=learning_rate)
print(f"loss {loss(params,xbatch,ybatch):.3f}")
#
# test
#
logits = forward(params,xtest)
probs = jnp.exp(logits)/jnp.sum(jnp.exp(logits),axis=1,keepdims=True)
error = 1-jnp.mean(probs[jnp.arange(len(ytest)),ytest])
print(f"\ntest loss: {error:.3f}\n")
#
# plot
#
fig,axs = plt.subplots(1,5)
for i in range(5):
axs[i].imshow(jnp.reshape(xtest[i],(28,28)))
axs[i].axis('off')
axs[i].set_title(f"predict: {jnp.argmax(probs[i])}")
plt.tight_layout()
plt.show()
read 784 byte data records, 60000 training examples, 10000 testing examples

starting loss: 0.899

batch 0: loss 0.381
batch 5000: loss 0.253
batch 10000: loss 0.198
batch 15000: loss 0.130
batch 20000: loss 0.114
batch 25000: loss 0.100
batch 30000: loss 0.097
batch 35000: loss 0.084
batch 40000: loss 0.082
batch 45000: loss 0.090
batch 50000: loss 0.077
batch 55000: loss 0.050

test loss: 0.085
Architectures¶
Convolutional¶
import matplotlib.pyplot as plt
import numpy as np
scale = .6
width = 7*scale
height = 7*scale
linewidth = 2.5
circlesize = 18
pointsize = 5
textsize = 14
fig,ax = plt.subplots(figsize=(width,height))
ax.axis([0,width,0,height])
ax.set_axis_off()
def wire(x0,y0,x1,y1,width):
x0 = scale*x0
x1 = scale*x1
y0 = scale*y0
y1 = scale*y1
plt.plot([x0,x1],[y0,y1],'-',linewidth=width,color=(0.6,0.6,0.6))
def point(x,y,size):
x = scale*x
y = scale*y
plt.plot(x,y,'ko',markersize=size)
def circle(x,y,size):
x = scale*x
y = scale*y
plt.plot(x,y,'ko',markersize=size,markeredgewidth=size/8,
markerfacecolor='white',markeredgecolor='black')
def text(x,y,text,size):
x = scale*x
y = scale*y
ax.text(x,y,text,
ha='center',va='center',
math_fontfamily='cm',
fontsize=size,color='black')
#
wire(.5,.5,1.5,2,linewidth)
wire(1.5,.5,1.5,2,linewidth)
wire(2.5,.5,1.5,2,linewidth)
wire(1.5,.5,2.5,2,linewidth)
wire(2.5,.5,2.5,2,linewidth)
wire(3.5,.5,2.5,2,linewidth)
wire(2.5,.5,3.5,2,linewidth)
wire(3.5,.5,3.5,2,linewidth)
wire(4.5,.5,3.5,2,linewidth)
wire(3.5,.5,4.5,2,linewidth)
wire(4.5,.5,4.5,2,linewidth)
wire(5.5,.5,4.5,2,linewidth)
#
wire(1.5,2,2,3.5,linewidth)
wire(2.5,2,2,3.5,linewidth)
wire(3.5,2,4,3.5,linewidth)
wire(4.5,2,4,3.5,linewidth)
#
circle(.5,.5,circlesize)
circle(1.5,.5,circlesize)
circle(2.5,.5,circlesize)
circle(3.5,.5,circlesize)
circle(4.5,.5,circlesize)
circle(5.5,.5,circlesize)
#
circle(1.5,2,circlesize)
circle(2.5,2,circlesize)
circle(3.5,2,circlesize)
circle(4.5,2,circlesize)
#
circle(2,3.5,circlesize)
circle(4,3.5,circlesize)
#
point(3,4,pointsize)
point(3,4.4,pointsize)
point(3,4.8,pointsize)
#
text(7.2,.5,'input layer',textsize)
text(7.4,1.4,'shared filter weights',textsize)
text(6.,2.8,'pooling layer',textsize)
#
plt.show()
Figure: A Convolutional Neural Network (CNN)
pattern recognition
want invariance to translation, rotation
huge number of inputs, e.g. pixels
find feature maps
Hubel, David H., and Torsten N. Wiesel. "Receptive fields and functional architecture of monkey striate cortex." The Journal of physiology 195, no. 1 (1968): 215-243.
LeCun, Yann, Koray Kavukcuoglu, and Clément Farabet. "Convolutional networks and applications in vision." In Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on, pp. 253-256. IEEE, 2010.
filter layers
pooling layers
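The two layer types can be sketched in one dimension: a filter layer slides the same few shared weights across the input, and a pooling layer keeps a summary (here the maximum) of each window. The input and filter here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.standard_normal(16)       # a 1D "image" (hypothetical data)
w = np.array([1.0, 0.0, -1.0])    # shared filter weights: an edge detector

# filter layer: the same three weights slide across the whole input
feature = np.array([x[i:i+3] @ w for i in range(len(x) - 2)])

# pooling layer: keep the maximum over non-overlapping windows of two
pooled = feature.reshape(-1, 2).max(axis=1)

print(feature.shape, pooled.shape)
```

The sliding dot product is just a correlation, equivalently a convolution with the flipped filter, which is what `np.convolve` computes.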
Recurrent¶
import matplotlib.pyplot as plt
import numpy as np
scale = .6
width = 7*scale
height = 7*scale
linewidth = 2.5
circlesize = 18
pointsize = 5
textsize = 14
fig,ax = plt.subplots(figsize=(width,height))
ax.axis([0,width,0,height])
ax.set_axis_off()
def wire(x0,y0,x1,y1,width):
x0 = scale*x0
x1 = scale*x1
y0 = scale*y0
y1 = scale*y1
plt.plot([x0,x1],[y0,y1],'-',linewidth=width,color=(0.6,0.6,0.6))
def arrow(x0,y0,x1,y1,width):
x0 = scale*x0
x1 = scale*x1
y0 = scale*y0
y1 = scale*y1
ax.annotate('',xy=(x1,y1),xytext=(x0,y0),
arrowprops=dict(color=(0.6,0.6,0.6),width=width,headwidth=3*width,headlength=3*width))
def point(x,y,size):
x = scale*x
y = scale*y
plt.plot(x,y,'ko',markersize=size)
def circle(x,y,size):
x = scale*x
y = scale*y
plt.plot(x,y,'ko',markersize=size,markeredgewidth=size/8,
markerfacecolor='white',markeredgecolor='black')
def text(x,y,text,size):
x = scale*x
y = scale*y
ax.text(x,y,text,
ha='center',va='center',
math_fontfamily='cm',
fontsize=size,color='black')
#
wire(.5,.5,1,2,linewidth)
wire(.5,.5,2,2,linewidth)
wire(.5,.5,3,2,linewidth)
wire(.5,.5,4,2,linewidth)
wire(1.5,.5,1,2,linewidth)
wire(1.5,.5,2,2,linewidth)
wire(1.5,.5,3,2,linewidth)
wire(1.5,.5,4,2,linewidth)
wire(2.5,.5,1,2,linewidth)
wire(2.5,.5,2,2,linewidth)
wire(2.5,.5,3,2,linewidth)
wire(2.5,.5,4,2,linewidth)
wire(3.5,.5,1,2,linewidth)
wire(3.5,.5,2,2,linewidth)
wire(3.5,.5,3,2,linewidth)
wire(3.5,.5,4,2,linewidth)
wire(4.5,.5,1,2,linewidth)
wire(4.5,.5,2,2,linewidth)
wire(4.5,.5,3,2,linewidth)
wire(4.5,.5,4,2,linewidth)
#
wire(1.5,4,2,5.5,linewidth)
wire(1.5,4,3,5.5,linewidth)
wire(2.5,4,2,5.5,linewidth)
wire(2.5,4,3,5.5,linewidth)
wire(3.5,4,2,5.5,linewidth)
wire(3.5,4,3,5.5,linewidth)
#
wire(3.5,4,5,4,linewidth)
wire(5,4,5,2,linewidth)
arrow(5,2,4.3,2,linewidth)
#
circle(.5,.5,circlesize)
circle(1.5,.5,circlesize)
circle(2.5,.5,circlesize)
circle(3.5,.5,circlesize)
circle(4.5,.5,circlesize)
#
circle(1,2,circlesize)
circle(2,2,circlesize)
circle(3,2,circlesize)
circle(4,2,circlesize)
#
point(2.5,2.6,pointsize)
point(2.5,3,pointsize)
point(2.5,3.4,pointsize)
#
circle(1.5,4,circlesize)
circle(2.5,4,circlesize)
circle(3.5,4,circlesize)
#
circle(2,5.5,circlesize)
circle(3,5.5,circlesize)
#
text(6.2,.5,'input layer',textsize)
text(5.9,1.25,'feed forward',textsize)
text(6.3,3,'feedback',textsize)
text(4.8,5.5,'output layer',textsize)
#
plt.show()
Figure: A Recurrent Neural Network (RNN)
introduces time, memory
MLP uses a fixed window, like an FIR filter
RNN is like an IIR filter
Pineda, Fernando J. "Generalization of back-propagation to recurrent neural networks." Physical review letters 59, no. 19 (1987): 2229.
unroll, do backprop through time
issue: vanishing, exploding gradients
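Unrolling makes the problem visible: the gradient back through $T$ steps is a product of $T$ Jacobians, so it scales like $|\lambda_{\max}|^T$ of the recurrent weights. A minimal sketch with a linear recurrence and a hypothetical random weight matrix rescaled to a chosen spectral radius:

```python
import numpy as np

rng = np.random.default_rng(0)

# unrolling h_t = W h_{t-1} for T steps multiplies the gradient by W at
# every step, so it scales as |lambda_max|^T: vanishing for |lambda_max| < 1,
# diverging for |lambda_max| > 1
n, T = 8, 50
A = rng.standard_normal((n, n))
A = A/np.max(np.abs(np.linalg.eigvals(A)))   # rescale to spectral radius 1

norms = {}
for lam in (0.8, 1.0, 1.2):
    J = np.linalg.matrix_power(lam*A, T)     # product of T Jacobians
    norms[lam] = np.linalg.norm(J)
    print(f"|lambda_max| = {lam}: |gradient| ~ {norms[lam]:.2e}")
```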
LSTM
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.
Autoencoder¶
import matplotlib.pyplot as plt
import numpy as np
scale = .6
width = 7*scale
height = 5.5*scale
linewidth = 2.5
circlesize = 18
pointsize = 5
textsize = 14
fig,ax = plt.subplots(figsize=(width,height))
ax.axis([0,width,0,height])
ax.set_axis_off()
def wire(x0,y0,x1,y1,width):
x0 = scale*x0
x1 = scale*x1
y0 = scale*y0
y1 = scale*y1
plt.plot([x0,x1],[y0,y1],'-',linewidth=width,color=(0.6,0.6,0.6))
def point(x,y,size):
x = scale*x
y = scale*y
plt.plot(x,y,'ko',markersize=size)
def circle(x,y,size):
x = scale*x
y = scale*y
plt.plot(x,y,'ko',markersize=size,markeredgewidth=size/8,
markerfacecolor='white',markeredgecolor='black')
def text(x,y,text,size):
x = scale*x
y = scale*y
ax.text(x,y,text,
ha='center',va='center',
math_fontfamily='cm',
fontsize=size,color='black')
#
#wire(3.5,4,3,5.5,linewidth)
#
circle(.5,.5,circlesize)
circle(1.5,.5,circlesize)
circle(2.5,.5,circlesize)
circle(3.5,.5,circlesize)
circle(4.5,.5,circlesize)
#
point(2.5,1.2,pointsize)
point(2.5,1.6,pointsize)
point(2.5,2,pointsize)
#
circle(2,2.6,circlesize)
circle(3,2.6,circlesize)
#
point(2.5,3.2,pointsize)
point(2.5,3.6,pointsize)
point(2.5,4,pointsize)
#
circle(.5,4.7,circlesize)
circle(1.5,4.7,circlesize)
circle(2.5,4.7,circlesize)
circle(3.5,4.7,circlesize)
circle(4.5,4.7,circlesize)
#
text(6.2,.5,'input layer',textsize)
text(4.8,2.6,'latent layer',textsize)
text(6.4,4.7,'output layer',textsize)
#
plt.show()
Figure: An Autoencoder
learn to predict the input through a bottleneck
finds a lower-dimensional representation
unsupervised
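A minimal sketch: training a network to reproduce its own input through a 2-unit bottleneck, on hypothetical data that lies near a 2D plane in 10 dimensions. With the identity activation this is a linear autoencoder, whose bottleneck recovers the same subspace as PCA:

```python
from sklearn.neural_network import MLPRegressor
import numpy as np

rng = np.random.default_rng(0)

# data near a 2D plane inside 10 dimensions (hypothetical)
latent = rng.standard_normal((500, 2))
X = latent @ rng.standard_normal((2, 10)) + 0.01*rng.standard_normal((500, 10))

# unsupervised: the input is its own training target,
# squeezed through a 2-unit latent layer
auto = MLPRegressor(hidden_layer_sizes=(2,), activation='identity',
                    solver='lbfgs', max_iter=2000, random_state=1)
auto.fit(X, X)
print(f"reconstruction R^2: {auto.score(X, X):.3f}")
```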
Packages¶
References¶
- [Ekman:21] Ekman, M. (2021). Learning deep learning: Theory and practice of neural networks, computer vision, NLP, and transformers using Tensorflow.
- A good balance between breadth and depth.
- [Fleuret:24] The Little Book of Deep Learning, François Fleuret (2024)
- https://fleuret.org/public/lbdl.pdf
- A concise (and freely available) survey
Problems¶
Train and test a neural network classifier on the data set you used for the PCA problem in Transforms.
Train a neural network autoencoder to recognize DTMF tones.
Train a recurrent neural network to predict the output of a Linear Feedback Shift Register, and verify its ability to continue a LFSR sequence.