Machine Learning

The Nature of Mathematical Modeling (draft)

$ \def\%#1{\mathbf{#1}} \def\mat#1{\mathbf{#1}} \def\*#1{\vec{#1}} \def\ve#1{\vec{#1}} \def\ds#1#2{\cfrac{d #1}{d #2}} \def\dd#1#2{\cfrac{d^2 #1}{d #2^2}} \def\ps#1#2{\cfrac{\partial #1}{\partial #2}} \def\pp#1#2{\cfrac{\p^2 #1}{\p #2^2}} \def\p{\partial} \def\ba{\begin{eqnarray*}} \def\ea{\end{eqnarray*}} \def\eps{\epsilon} \def\del{\nabla} \def\disp{\displaystyle} \def\la{\langle} \def\ra{\rangle} \def\unit#1{~({\rm #1)}} \def\units#1#2{~\left(\frac{\rm #1}{\rm #2}\right)} $

Taxonomy

  • AI: model human intelligence
  • ML: learn from data
    • supervised: data labels
    • unsupervised: no data labels
    • generative: synthesize data
    • reinforcement: maximize reward from interacting with an environment
  • hyperparameters
    • parameters that define the model architecture
    • important rule of thumb: results should not depend on them

Surrogate Models

[Alizadeh:20]

  • Given a function that is expensive to compute (FEA, CFD, ...), sample it at selected points and train a model to interpolate
  • Can uses DNNs, RBFs, ...

Sampling

  • random is uniform only in the asymptotic limit
  • want no gaps for any size (low discrepancy)
Latin Hypercube Sampling (LHS)
  • Random sampling can be non-uniform at low densities

  • Uniform sampling is infeasible in high dimensions

  • Latin Hypercube Sampling (LHS) is a compromise

  • A Latin Square is a $N\times N$ array with $N$ symbols that occur once in each row and once in each column

  • LHS is a higher-dimensional generalization

    • divide each variable into N intervals
    • randomly choose N points, one in each interval
    • like placing N rooks on a chessboard that can't capture each other
Halton sequence
  • can continuously sample

Physics-Informed Neural Networks (PINN)

  • Surrogate models based only on sampling don't take advantage of knowledge of governing equations

  • Instead can train a machine learning model using violation of the governing equations as a penalty term [Karniadakis:21]

  • Add a penalty term for agreement with boundary conditions

  • ODE with a differential operator: ${\cal D}(x(t)) = 0$

  • PDE: ${\cal D}(u(x,t)) = 0$

  • showing scalar, could be vector

  • penalty is $|{\cal D}(x(t))|^2$ or $|{\cal D}(u(x,t))|^2$

  • penalty can be calcuated by automatic differentiation of the network in the differential operator

  • trained network gives solution function $x(t)$, $u(x(t))$

  • can speed up solutions by orders of magnitude

  • can add a penalty term for agreement with sampled data to anchor the search

  • for time-dependent problems can do time marching to learn earlier times before learning later times

Example: linear differential equation

$\dd{y}{t}+\gamma\ds{y}{t}+ky=0$

PINN solution, $r^3$ RBFs

low-dimensional input and output, can use RBFs

RBF collocation method

$a+bt$ needed

import jax
import jax.numpy as jnp
from jax import grad,vmap,jit,hessian
import matplotlib.pyplot as plt
g = 0.1
k = 10.0
y0 = 1.0
dy0 = 0.0
tmin = 0.0
tmax = 10.0
nrbf = 100
ncoll = 1000
nplot = 1000
w_ic = 1.0e4
centers = jnp.linspace(tmin,tmax,nrbf)
tcoll = jnp.linspace(tmin,tmax,ncoll)
def y_fn(params,t):
    c = params[:nrbf]
    a,b = params[nrbf],params[nrbf+1]
    r = jnp.abs(t-centers)
    return jnp.sum(c*r**3)+a+b*t
def dy_fn(params,t): return grad(y_fn,argnums=1)(params,t)
def d2y_fn(params,t): return grad(dy_fn,argnums=1)(params,t)
def residual(params,t): return d2y_fn(params,t)+g*dy_fn(params,t)+k*y_fn(params,t)
@jit
def loss_fn(params):
    r = vmap(residual,in_axes=(None,0))(params,tcoll)
    physics = jnp.mean(r**2)
    ic_y = (y_fn(params,0.0)-y0)**2
    ic_dy = (dy_fn(params,0.0)-dy0)**2
    return physics+w_ic*(ic_y+ic_dy)
grad_fn = jit(grad(loss_fn))
hess_fn = jit(hessian(loss_fn))
params = jnp.zeros(nrbf+2)
for it in range(8):
    L = loss_fn(params)
    gv = grad_fn(params)
    H = hess_fn(params)
    dp = jnp.linalg.solve(H,-gv)
    params = params+dp
    print(f"Newton step {it}: loss = {float(L):.3e}")
omega = jnp.sqrt(k-(g/2)**2)
def y_exact(t):
    return jnp.exp(-g*t/2)*(jnp.cos(omega*t)+(g/(2*omega))*jnp.sin(omega*t))
tplot = jnp.linspace(tmin,tmax,nplot)
ypinn = vmap(y_fn,in_axes=(None, 0))(params,tplot)
yex = vmap(y_exact)(tplot)
plt.plot(tplot,yex,'k-',label='exact')
plt.plot(tplot,ypinn,'r--',label='$r^3$ RBF PINN')
plt.ylabel('y(t)')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
Newton step 0: loss = 1.000e+04
Newton step 1: loss = 4.717e-04
Newton step 2: loss = 4.710e-04
Newton step 3: loss = 4.710e-04
Newton step 4: loss = 4.710e-04
Newton step 5: loss = 4.710e-04
Newton step 6: loss = 4.710e-04
Newton step 7: loss = 4.710e-04
RK4 solution
from numba import njit
import numpy as np
import matplotlib.pyplot as plt
dt = 1e-3
tmax = 10
g = 0.1
k = 10
def f0(y1):
    return y1
def f1(y0,y1):
    return -g*y1-k*y0
def rk4():
    npts = int(tmax/dt)
    print(npts)
    y0 = 1
    y1 = 0
    t = 0
    error = 0
    yrk4 = []
    trk4 = []
    for i in range(npts):
        a0 = dt*f0(y1)
        a1 = dt*f1(y0,y1)
        b0 = dt*f0(y1+a1/2)
        b1 = dt*f1(y0+a0/2,y1+a1/2)
        c0 = dt*f0(y1+b1/2)
        c1 = dt*f1(y0+b0/2,y1+b1/2)
        d0 = dt*f0(y1+c1)
        d1 = dt*f1(y0+c0,y1+c1)
        y0 = y0+(a0+2*b0+2*c0+d0)/6
        y1 = y1+(a1+2*b1+2*c1+d1)/6
        t += dt
        yrk4.append(y0)
        trk4.append(t)
    return yrk4,trk4
print('solve RK4 ...')
yrk4,trk4 = rk4()
solve RK4 ...
10000
PINN solution, 6 layers
import jax
import jax.numpy as jnp
from jax import grad,vmap,jit,random
import optax
import matplotlib.pyplot as plt
g = 0.1
k = 10.0
y0 = 1.0
dy0 = 0.0
tmin = 0.0
tmax = 10.0
nsteps = 40_000
ntrain = 500
nplot = 500
layers = [1,64,64,64,64,1]
def init_mlp(layer_sizes,key):
    params = []
    keys = random.split(key,len(layer_sizes) - 1)
    for k,(din,dout) in zip(keys,zip(layer_sizes[:-1],layer_sizes[1:])):
        W = random.normal(k,(din,dout))*jnp.sqrt(2.0/din)
        b = jnp.zeros(dout)
        params.append((W, b))
    return params
def mlp_forward(params,t):
    x = jnp.atleast_1d(t)
    for W,b in params[:-1]:
        x = jnp.tanh(x@W+b)
    W,b = params[-1]
    return (x@W+b).squeeze()
def y_fn(params,t): return mlp_forward(params,t)
def dy_fn(params,t): return grad(y_fn,argnums=1)(params,t)
def d2y_fn(params,t): return grad(dy_fn,argnums=1)(params,t)
def residual(params,t): return d2y_fn(params,t)+g*dy_fn(params,t)+k*y_fn(params,t)
def loss_fn(params,tcolloc):
    r = vmap(residual,in_axes=(None,0))(params,tcolloc)
    physics_loss = jnp.mean(r**2)
    ic_y  = (y_fn(params,0.0)-y0)**2
    ic_dy = (dy_fn(params,0.0)-dy0)**2
    return physics_loss+100.0*(ic_y+ic_dy)
key = random.PRNGKey(0)
params = init_mlp(layers,key)
optimizer = optax.adam(1e-3)
opt_state = optimizer.init(params)
def train_step(params,opt_state,tcolloc):
    loss,grads = jax.value_and_grad(loss_fn)(params,tcolloc)
    updates,opt_state = optimizer.update(grads,opt_state,params)
    params = optax.apply_updates(params,updates)
    return params,opt_state,loss
tcolloc = jnp.linspace(tmin,tmax,ntrain)
for step in range(nsteps):
    params,opt_state,loss = train_step(params,opt_state,tcolloc)
    if step%2000 == 0:
        print(f"step {step:6d}, loss = {float(loss):.3e}")
tpinn = jnp.linspace(tmin,tmax,nplot)
ypinn6 = vmap(y_fn,in_axes=(None,0))(params,tpinn)
step      0, loss = 2.646e+03
step   2000, loss = 1.183e+00
step   4000, loss = 7.780e-01
step   6000, loss = 6.288e-01
step   8000, loss = 4.794e-01
step  10000, loss = 3.214e-01
step  12000, loss = 2.781e-01
step  14000, loss = 2.321e-01
step  16000, loss = 2.018e-01
step  18000, loss = 1.683e-01
step  20000, loss = 1.531e-01
step  22000, loss = 1.438e-01
step  24000, loss = 1.373e-01
step  26000, loss = 1.190e-01
step  28000, loss = 9.736e-02
step  30000, loss = 2.480e-02
step  32000, loss = 5.160e-03
step  34000, loss = 8.468e-02
step  36000, loss = 1.634e-02
step  38000, loss = 2.270e-02
PINN solution, 5 layers
layers = [1,64,64,64,1]
key = random.PRNGKey(0)
params = init_mlp(layers,key)
optimizer = optax.adam(1e-3)
opt_state = optimizer.init(params)
tcolloc = jnp.linspace(tmin,tmax,ntrain)
for step in range(nsteps):
    params,opt_state,loss = train_step(params,opt_state,tcolloc)
    if step%2000 == 0:
        print(f"step {step:6d}, loss = {float(loss):.3e}")
ypinn5 = vmap(y_fn,in_axes=(None,0))(params,tpinn)
step      0, loss = 3.913e+03
step   2000, loss = 1.285e+00
step   4000, loss = 9.700e-01
step   6000, loss = 7.768e-01
step   8000, loss = 6.286e-01
step  10000, loss = 5.360e-01
step  12000, loss = 4.307e-01
step  14000, loss = 3.781e-01
step  16000, loss = 3.514e-01
step  18000, loss = 3.152e-01
step  20000, loss = 2.511e-01
step  22000, loss = 2.087e-01
step  24000, loss = 1.939e-01
step  26000, loss = 1.728e-01
step  28000, loss = 1.740e-01
step  30000, loss = 1.379e-01
step  32000, loss = 1.259e-01
step  34000, loss = 9.927e-02
step  36000, loss = 1.059e-01
step  38000, loss = 5.093e-02
plot
plt.figure()
plt.plot(trk4,yrk4,'k--',label='RK4')
plt.plot(tpinn,ypinn6,'r-',label='PINN 6 layers')
plt.plot(tpinn,ypinn5,'g-',label='PINN 5 layers')
plt.xlabel('t')
plt.ylabel('y(t)')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

Generative Models

The goal of generative models is to synthesize new data (images, ...)

Variational Autoencoder (VAE)

  • autoencoders can be unreliable in parts of the latent space without training data

  • Variational Autoencoders (VAE) instead use the training data to fit a probability distribution in the latent space, which can then be sampled from

  • encoder outputs Gaussian mean and variance for each sample

  • reparameterization trick: add random Gaussian noise to encoder output, to allow backprop through sampling

  • decoder is more robust because it has to learn to reconstruct from stochastic input

  • loss balances reconstruction error with KL divergence from zero-mean unit-variance Gaussian

Example to come

Generative Adversarial Networks (GAN)

  • A Generative Adversarial Networks (GAN) combines two networks: a generator and a discriminator

  • The discriminator is trained to distinguish between real and synthetic data

  • The generator is trained to fool the discriminator

  • Training for both proceeds until the generator can reliably pass the discriminator

Diffusion

  • stable diffusion [Rombach:22]
  • example: images
  • VAE of image to latent space
  • train a network to remove increasing amounts of noise added in the latent space
  • use the VAE decoder to reconstruct the image after noise reduction
  • sample from the probability distribution in the latent space to generate new images
  • can train on tokens describing the image to steer the generation

Transformers

  • Attention Is All You Need [Vaswani:17]
  • sequences -> tokens (numbers) -> embeddings (vectors)
  • attention heads have weights for queries, keys, and values to find relevance
  • finds relationships between tokens in a window
  • simplifies networks by eliminating the need for memory to capture distant relationships
  • use the last token to predict the probability of the next token from the vocabulary

Reinforcement Learning

  • So far we've assumed that search spaces are static, but that's frequently not the case -- environments change in response to actions, and have histories. Reinforcement Learning (RL) represents this with:

    • an environment that has a state $s_t$ at time $t$, for example the position of pieces on a chess board
    • an agent that takes actions $a_t$ in response to observing the state, for example a chess move
    • the environment transitions to $s_{t+1}$ and emits a reward $r_{t+1}$, for example capturing a chess piece or winning the game
    • a trajectory $\tau$ is a sequence of states and actions
    • the goal of RL is to learn how to maximize the reward, using a strategy for exploration
  • RL can learn to play games [Silver:16, Silver:18], fold proteins [Jumper:21], control systems [Recht:19], ...

  • Types

    • environment model-free vs model-based
    • discrete vs continuous actions
    • fully vs partially observable
    • one vs multiple agents
    • dense vs sparse vs delayed rewards
    • fixed vs continuing time horizon
    • on- vs off-policy retaining old experience
    • deterministic vs stochastic actions
    • explicit vs learned rewards
    • exploration vs execution
  • Algorithms

    • Q-learning: store a table of expected returns [Watkins:92]
    • Deep Q-Network (DQN): train a network to predict expected returns [Mnih:15]
    • policy: uses a policy $p_\theta(a|s)$ to directly give actions, or the probability of actions, given states, using model parameters $\theta$
    • policy gradient: train a policy with the gradient of the expected reward
    • actor-critic: combine policy actor and value critic
    • Monte Carlo Tree Search (MCTS): stochastic sampling of large search spaces to evaluate actions [Świechowski:23]

REINFORCE

  • REINFORCE [Williams:92]
  • trajectory $\tau = (s_1,a_1,\ldots,s_T,a_T)$
  • horizon $T$ could range from one step to a complete game
  • reward $R(\tau) = \sum \gamma^t r_t$; $\gamma$ is a discount factor

We want to maximize:

$$ \max_\theta J(\theta) = \max_\theta \sum_\tau p_\theta(\tau) R(\tau) $$

which depends on the transition probabilites:

$$ p_\theta(\tau) = \prod_{t=0}^T p(s_{t+1}|s_t,a_t)p(a_t|s_t) $$

The policy gradient uses:

$$ \ba \del_\theta J(\theta) &=& \del_\theta \sum_\tau p_\theta(\tau) R(\tau)\\ &=& \sum_\tau \del_\theta p_\theta(\tau) R(\tau)\\ &=& \sum_\tau p_\theta(\tau) \frac{\del_\theta p_\theta(\tau)}{p_\theta(\tau)} R(\tau)\\ &=& \sum_\tau p_\theta(\tau) \del_\theta \log p_\theta(\tau) R(\tau)\\ &=& \left\langle \del_\theta \log p_\theta(\tau) R(\tau) \right\rangle \ea $$

This is exciting, because a sum over a probability distribution can be estimated by an expectation drawn from the distribution

Now expand $p_\theta(\tau)$:

$$ \ba \del_\theta \log p_\theta(\tau) &=& \del_\theta \log \prod_{t=0}^T p(s_{t+1}|s_t,a_t)p(a_t|s_t) \\ &=& \del_\theta \left[\sum_{t=0}^T \log p(s_{t+1}|s_t,a_t) + \sum_{t=0}^T\log p(a_t|s_t)\right] \\ &=& \del_\theta \sum_{t=0}^T\log p(a_t|s_t) \\ &=& \sum_{t=0}^T\del_\theta \log p(a_t|s_t) \\ \ea $$

This is exciting, because the unknown transition probability dropped out

Given the gradient, can then update $\theta$ with $\theta+\alpha\del_\theta J(\theta)$

The policy could continuous or discrete, linear or nonlinear, use Gaussians or a neural network, ...

Example to come

Other Models

AutoML

  • search over search: search for model architecture and hyperparameters [He:21]

Agents

  • singular to plural

  • Multi-Agent Reinforcement Learning (MARL): networks of specialized agents [Zhang:21]

  • Model Context Protocol (MCP) is a protocol for interfacing and interacting with machine-learning systems [Hou:25]

Edge

  • Models that run on processors with reduce resources: power, speed, memory, cost, size, ... [Ravindran:25]

  • There's an asymmetry in the effort required between training and inference

  • Neural network connections can frequently be pruned, and activations and weights can frequently be quantized to a few bits [Madnur:23]

References

  • [Alizadeh:20] Alizadeh, Reza, Janet K. Allen, and Farrokh Mistree. "Managing computational complexity using surrogate models: a critical review." Research in Engineering Design 31, no. 3 (2020): 275-298.
  • [He:21] He, Xin, Kaiyong Zhao, and Xiaowen Chu. "AutoML: A survey of the state-of-the-art." Knowledge-based systems 212 (2021): 106622.
  • [Hou:25] Hou, Xinyi, Yanjie Zhao, Shenao Wang, and Haoyu Wang. "Model context protocol (mcp): Landscape, security threats, and future research directions." ACM Transactions on Software Engineering and Methodology (2025).
  • [Jumper:21] Jumper, John, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool et al. "Highly accurate protein structure prediction with AlphaFold." nature 596, no. 7873 (2021): 583-589.
  • [Karniadakis:21] Karniadakis, George Em, Ioannis G. Kevrekidis, Lu Lu, Paris Perdikaris, Sifan Wang, and Liu Yang. "Physics-informed machine learning." Nature Reviews Physics 3, no. 6 (2021): 422-440.
  • [Madnur:23] Madnur, Pratham V., Shreyas H. Dabade, Ashlesha Khanapure, Sheldon Rodrigues, Shashank Hegde, and Uday Kulkarni. "Enhancing deep neural networks through pruning followed by quantization pipeline: a comprehensive review." In 2023 2nd international conference on futuristic technologies (INCOFT), pp. 1-8. IEEE, 2023
  • [Mnih:15] Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves et al. "Human-level control through deep reinforcement learning." nature 518, no. 7540 (2015): 529-533.
  • [Ravindran:25] Ravindran, Sandeep. "Cutting AI down to size." Science (New York, NY) 387, no. 6736 (2025): 818-821.
  • [Rombach:22] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10684-10695).
  • [Silver:16] Silver, David, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser et al. "Mastering the game of Go with deep neural networks and tree search." nature 529, no. 7587 (2016): 484-489.
  • [Silver:18] Silver, David, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot et al. "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play." Science 362, no. 6419 (2018): 1140-1144.
  • [Vaswani:17] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." Advances in neural information processing systems 30 (2017).
  • [Watkins:92] Watkins, Christopher JCH, and Peter Dayan. "Q-learning." Machine learning 8, no. 3 (1992): 279-292.
  • [Williams:92] Williams, Ronald J. "Simple statistical gradient-following algorithms for connectionist reinforcement learning." Machine learning 8, no. 3 (1992): 229-256.
  • [Zhang:21] Zhang, Kaiqing, Zhuoran Yang, and Tamer Başar. "Multi-agent reinforcement learning: A selective overview of theories and algorithms." Handbook of reinforcement learning and control (2021): 321-384.
  • [Świechowski:23] Świechowski, Maciej, Konrad Godlewski, Bartosz Sawicki, and Jacek Mańdziuk. "Monte Carlo tree search: A review of recent modifications and applications." Artificial Intelligence Review 56, no. 3 (2023): 2497-2562.
  • [Recht:19] Recht, Benjamin. "A tour of reinforcement learning: The view from continuous control." Annual Review of Control, Robotics, and Autonomous Systems 2, no. 1 (2019): 253-279.

Problems

  1. Train and evaluate a surrogate model on a simulation
  2. Use reinforcement learning to learn how to control a system or play a game

(c) Neil Gershenfeld 4/26/26