Constrained Optimization

The Nature of Mathematical Modeling (draft)

$ \def\%#1{\mathbf{#1}} \def\mat#1{\mathbf{#1}} \def\*#1{\vec{#1}} \def\ve#1{\vec{#1}} \def\ds#1#2{\cfrac{d #1}{d #2}} \def\dd#1#2{\cfrac{d^2 #1}{d #2^2}} \def\ps#1#2{\cfrac{\partial #1}{\partial #2}} \def\pp#1#2{\cfrac{\p^2 #1}{\p #2^2}} \def\p{\partial} \def\ba{\begin{eqnarray*}} \def\ea{\end{eqnarray*}} \def\eps{\epsilon} \def\del{\nabla} \def\disp{\displaystyle} \def\la{\langle} \def\ra{\rangle} \def\unit#1{~({\rm #1)}} \def\units#1#2{~\left(\frac{\rm #1}{\rm #2}\right)} $

Search finds best

best frequently has constraints

nutrition

groceries $\ve{g} \ge 0$

prices $\ve{p}$

price $\min_{\ve{g}} \ve{g}\cdot\ve{p}$

minimum health requirements $\ve{m}$

nutrition value $\mat{N}$

$\mat{N}\cdot\ve{g} \ge \ve{m}$

defines linear program, LP
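a minimal sketch of this diet LP with SciPy's `linprog` (the prices, nutrient values, and requirements below are made-up illustrative numbers, not real data):

```python
import numpy as np
from scipy.optimize import linprog

p = np.array([2.0, 3.5, 1.0])        # price per unit of each grocery
N = np.array([[10.0, 2.0, 1.0],      # nutrient content per unit of each grocery
              [1.0,  8.0, 2.0]])
m = np.array([20.0, 16.0])           # minimum nutritional requirements

# linprog minimizes p.g subject to A_ub.g <= b_ub; flip signs for N.g >= m
res = linprog(c=p, A_ub=-N, b_ub=-m, bounds=[(0, None)] * 3)
print(res.x, res.fun)                # cheapest basket meeting the requirements
```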

price may be a function of quantity, not linear

quadratic objective, quadratic program, QP

general case mathematical program

portfolios, routing airplanes, running a factory

program as plan, not computer program, but can be same

electrical networks {Dennis:58}

routing {Kelly:91,Papadimitriou:98}

flow control {Low:02}

layering {Chiang:07}

sorting (Problem 3)

variables $\ve{x}$, objective minimize $f(\ve{x})$, constraints $\ve{c}(\ve{x})$

max = -min

slack variables convert inequality to equality

$$ c(\ve{x}) \ge 0 $$

replace with

$$ \ba c(\ve{x}) - s &=& 0 \\ s &\ge& 0 \ea $$

combinatorial $x$ equals 1 or -1 can be relaxed as $(x^2-1)^2=0$

$x$ equals 0 or 1 can be relaxed as $x(x-1)=0$

L1 norm

$$ |\ve{x}|_1 = \sum_i | x_i | $$

used in compressed sensing, for sparsity

non-differentiable

{Schmidt:07}

can be relaxed

$$ \ba \tag{opt-approx} |x| &\approx& |x|_\alpha \\ &=& \frac{1}{\alpha} \left[ \log \left(1+e^{-\alpha x}\right) +\log \left(1+e^{\alpha x}\right) \right] \ea $$$$ \ds{|x|_\alpha}{x} = \frac{1}{1+e^{-\alpha x}} - \frac{1}{1+e^{\alpha x}} $$$$ \dd{|x|_\alpha}{x} = \frac{2\alpha e^{\alpha x}} {\left(1+e^{\alpha x}\right)^2} $$

minimize for increasing $\alpha$
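a numerical check that $|x|_\alpha$ approaches $|x|$ as $\alpha$ grows (the formula is rewritten in a numerically stable form to avoid overflow in $e^{\alpha x}$):

```python
import numpy as np

def abs_alpha(x, alpha):
    # stable rewrite of (1/alpha)[log(1+e^{-alpha x}) + log(1+e^{alpha x})]
    return np.abs(x) + (2.0 / alpha) * np.log1p(np.exp(-alpha * np.abs(x)))

x = np.linspace(-2.0, 2.0, 401)
errs = [float(np.max(abs_alpha(x, a) - np.abs(x))) for a in (1.0, 10.0, 100.0)]
print(errs)  # worst error is 2 log(2)/alpha, at x = 0
```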

Lagrange Multipliers

single equality constraint $c(\ve{x}) = 0$

step in direction $\ve{\delta}$ to minimize $f$ while satisfying the constraint

$$ \ba 0 &=& c(\ve{x}+\ve{\delta}) \\ &\approx& c(\ve{x}) + \del c \cdot \ve{\delta} \\ &=& \del c \cdot \ve{\delta} \ea $$

step also minimizes $f$

$$ \ba 0 &>& f(\ve{x}+\ve{\delta}) - f(\ve{x}) \\ &\approx& f(\ve{x}) + \del f \cdot \ve{\delta} - f(\ve{x}) \\ &=& \del f \cdot \ve{\delta} \ea $$

if $\del c(\ve{x})$ and $\del f(\ve{x})$ are aligned it's not possible to find such a direction, hence $\ve{x}$ is a local minimizer

define Lagrangian

$$ {\cal L} = f(\ve{x}) - \lambda c(\ve{x}) $$

solve for

$$ \ba 0 &=& \del {\cal L} \\ &=& \del f - \lambda \del c \ea $$

$\lambda$ is a degree of freedom that determines the value of $c$

there can be multiple constraints, e.g. grocery max weight as well as min nutrition

Lagrangian becomes a linear combination

$$ {\cal L} = f(\ve{x}) - \sum_i \lambda_i c_i(\ve{x}) $$$$ \del f(\ve{x}) = \sum_i \lambda_i \del c_i(\ve{x}) $$

solving gives $\ve{x}(\ve{\lambda})$, substitute into constraints to find $\ve{\lambda}$
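a worked toy example (chosen for illustration): minimize $x^2+y^2$ subject to $x+y=1$; stationarity of ${\cal L}$ plus the constraint is a linear system in $(x, y, \lambda)$:

```python
import numpy as np

A = np.array([[2.0, 0.0, -1.0],   # dL/dx = 2x - lambda = 0
              [0.0, 2.0, -1.0],   # dL/dy = 2y - lambda = 0
              [1.0, 1.0,  0.0]])  # c = x + y - 1 = 0
b = np.array([0.0, 0.0, 1.0])
x, y, lam = np.linalg.solve(A, b)
print(x, y, lam)  # solution: x = y = 1/2, lambda = 1
```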

inequality constraint

$$ \ba 0 &\le& c(\ve{x}+\ve{\delta}) \\ &\approx& c(\ve{x}) + \del c\cdot\ve{\delta} \ea $$

if constraint not active $(c > 0)$, can just do gradient descent $\ve{\delta} = -\alpha\del f$

for an active constraint $\del f\cdot\ve{\delta} < 0$ and $\del c \cdot \ve{\delta} \ge 0$

these conditions each define a half-space of allowed steps $\ve{\delta}$

the half-spaces don't intersect if the gradients point in the same direction, $\del f = \lambda \del c$

same condition as the equality case, but now $\lambda \ge 0$

Optimality

first-order condition

equality constraints $c_i(\ve{x}), i \in {\cal E}$

inequality constraints $c_i(\ve{x}), i \in {\cal I}$

inactive constraint $\lambda_i = 0$

complementarity: $\lambda_i c_i = 0$: Lagrange multiplier only non-zero when constraint is active, otherwise reduces to gradient descent

$$ \ba \del_{\ve{x}} {\cal L}(\ve{x},\ve{\lambda}) &=& 0 \\ c_i(\ve{x}) &=& 0 ~~~~ (i \in {\cal E}) \\ c_i(\ve{x}) &\ge& 0 ~~~~ (i \in {\cal I}) \\ \lambda_i &\ge& 0 ~~~~ (i \in {\cal I}) \\ \lambda_i c_i(x) &=& 0 \ea $$

Karush-Kuhn-Tucker (KKT) conditions

necessary, not sufficient

second-order condition: Lagrangian Hessian positive definite on the constraint tangent space

sensitivity

replace $c(x) = 0$ with $c(x) = \eps$

minimizer $\ve{x}$ goes to $\ve{x}_\eps$

$$ \ba f(\ve{x}_\eps) - f(\ve{x}) &\approx& \del f \cdot (\ve{x}_\eps - \ve{x}) \\ &=& \lambda \del c \cdot (\ve{x}_\eps - \ve{x}) \\ &\approx& \lambda \left( c(\ve{x}_\eps) - c(\ve{x}) \right) \\ &=& \lambda\eps \\ \ds{f}{\eps} &=& \lambda \ea $$

shadow prices: change in utility per change in constraint
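a shadow-price check on a toy problem (an illustrative assumption): $\min x^2+y^2$ subject to $x+y=1+\eps$ has minimizer $x=y=(1+\eps)/2$ and multiplier $\lambda = 1+\eps$, so $df/d\eps$ at $\eps=0$ should equal $\lambda = 1$:

```python
# optimal value as a function of the constraint perturbation eps
fmin = lambda e: (1.0 + e) ** 2 / 2.0

h = 1e-6
dfde = (fmin(h) - fmin(-h)) / (2 * h)   # central difference
print(dfde)  # ~1.0, equal to the Lagrange multiplier
```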

there can be multiple objectives, e.g. minimize grocery carbon footprint as well as cost

there's boundary where it's not possible to improve one objective without making others worse

that defines the Pareto frontier

can combine in a multi-objective function with relative weights

Solvers

assume equality constraint, or inequality constraint with slack variables

Penalty

$$ {\cal F} = f(\ve{x}) + \frac{\mu}{2}\sum_i c_i^2(\ve{x}) $$$$ \ps{\cal F}{x_j} = \ps{f}{x_j} + \mu\sum_i c_i \ps{c_i}{x_j} $$$$ {\cal L} = f(\ve{x}) - \sum_i \lambda_i c_i(\ve{x}) $$$$ \ps{\cal L}{x_j} = \ps{f}{x_j} - \sum_i \lambda_i \ps{c_i}{x_j} $$

effectively taking $c_i = -\lambda_i/\mu$

solving a different problem

constraint driven to 0 as $\mu \rightarrow \infty$

small $\mu$ may be unbounded

large $\mu$ may be ill-conditioned
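a quadratic-penalty sketch on a toy problem (illustrative): $\min x^2+y^2$ subject to $x+y=1$, exact answer $x=y=1/2$; the penalty minimizer has $c = -1/(1+\mu)$, going to 0 as $\mu$ grows:

```python
import numpy as np
from scipy.optimize import minimize

f = lambda v: v[0] ** 2 + v[1] ** 2
c = lambda v: v[0] + v[1] - 1.0

v = np.zeros(2)
for mu in (1.0, 10.0, 100.0, 1000.0):
    F = lambda v, mu=mu: f(v) + 0.5 * mu * c(v) ** 2
    v = minimize(F, v).x   # warm-start each solve from the previous one
    print(mu, v, c(v))     # constraint violation shrinks as mu grows
```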

nonsmooth penalty

$$ {\cal F} = f(\ve{x}) + \mu\sum_{i\in E} |c_i(\ve{x})| + \mu\sum_{i\in I} [c_i(\ve{x})]_- $$

can be exact for large $\mu$ {Nocedal:06}

non-differentiable, approximate ({opt-approx})

Augmented Lagrangian

augmented Lagrangian

$$ {\cal L} = f(\ve{x}) - \sum_i \lambda_i c_i(\ve{x}) + \frac{\mu}{2}\sum_i c_i^2(\ve{x}) $$$$ \ps{\cal L}{x_j} = \ps{f}{x_j} - \sum_i \lambda_i \ps{c_i}{x_j} + \mu \sum_i c_i \ps{c_i}{x_j} $$

$\lambda_i^* = \lambda_i - \mu c_i$

$c_i = (\lambda_i - \lambda_i^*)/\mu$

vanishes much faster, as Lagrange multiplier estimates converge

$\lambda_i^{(n+1)} = \lambda_i^{(n)} - \mu c_i$

minimize $\ve{x}$, update $\lambda$, increase $\mu$
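an augmented-Lagrangian sketch on a toy problem (illustrative): $\min x^2+y^2$ subject to $x+y=1$, exact answer $x=y=1/2$, $\lambda=1$; the loop is exactly minimize over $\ve{x}$, update $\lambda$, increase $\mu$:

```python
import numpy as np
from scipy.optimize import minimize

f = lambda v: v[0] ** 2 + v[1] ** 2
c = lambda v: v[0] + v[1] - 1.0

v, lam, mu = np.zeros(2), 0.0, 1.0
for _ in range(6):
    L = lambda v, lam=lam, mu=mu: f(v) - lam * c(v) + 0.5 * mu * c(v) ** 2
    v = minimize(L, v).x    # minimize over x
    lam = lam - mu * c(v)   # multiplier update
    mu *= 2.0               # modest penalty increase
print(v, lam, c(v))         # lambda -> 1, constraint violation -> 0
```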

Interior Point

interior point

basis of some of the largest, most efficient solvers

directly solve KKT system of equations

$$ \ba \min_{\ve{x}} f(\ve{x}) && \\ c_i(\ve{x}) = 0 && ~~ (i \in {\cal E})\\ c_i(\ve{x}) - s_i = 0 && ~~ (i \in {\cal I})\\ s_i \ge 0 && \ea $$

if only equality constraints, can use Newton's method on the Lagrangian

for inequality constraints, perturb from the boundary

$$ \ba \del f - \sum_i\lambda_i \del c_i(\ve{x}) &=& 0 \\ c_i(\ve{x}) &=& 0 ~~ (i \in {\cal E})\\ c_i(\ve{x}) - s_i &=& 0 ~~ (i \in {\cal I})\\ \lambda_i s_i &=& \mu ~~ (i \in {\cal I}) \ea $$

iterate Newton step on system, decrease $\mu$

same as barrier method

$$ \ba \min_{\ve{x},\ve{s}} f(x) - \mu \sum_i \log s_i && ~~ (i \in {\cal I})\\ c_i(\ve{x}) = 0 && ~~ (i \in {\cal E}) \\ c_i(\ve{x}) - s_i = 0 && ~~ (i \in {\cal I}) \ea $$

KKT condition for $s_i$

$$ \mu\frac{1}{s_i} - \lambda_i = 0 $$$$ \lambda_i s_i = \mu $$
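a log-barrier sketch on a toy problem (illustrative): $\min (x-2)^2$ subject to $c(x) = 1-x \ge 0$, constrained minimizer $x=1$; decreasing $\mu$ drives the iterate from the interior to the boundary:

```python
import numpy as np
from scipy.optimize import minimize_scalar

f = lambda x: (x - 2.0) ** 2
c = lambda x: 1.0 - x

for mu in (1.0, 0.1, 0.01, 1e-4):
    B = lambda x, mu=mu: f(x) - mu * np.log(c(x))   # barrier objective
    x = minimize_scalar(B, bounds=(-10.0, 1.0 - 1e-9), method='bounded').x
    print(mu, x)  # x -> 1 from inside the feasible region as mu -> 0
```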

Convexity

Convexity watershed {Rockafellar:93}

$$ f(\alpha x+(1-\alpha) y) \le \alpha f(x)+(1-\alpha)f(y) $$
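a numerical spot-check of the convexity inequality for $f(x)=x^2$ (an illustrative choice; any convex $f$ satisfies it for all $x$, $y$, $\alpha$):

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: x * x
x, y = rng.normal(size=1000), rng.normal(size=1000)
a = rng.uniform(size=1000)
lhs = f(a * x + (1 - a) * y)          # chord evaluated at the mix
rhs = a * f(x) + (1 - a) * f(y)       # mix of the endpoint values
print(bool(np.all(lhs <= rhs + 1e-12)))  # True
```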

"non-iterative" solvers {Vandenberghe:96,Boyd:04}

Support Vector Machines

{Vapnik:98,Burges:98}

important application of convexity

alternative to DNNs for high-dimensional feature vectors that would require training huge networks

Classification

classification

$\{\ve{x}_i,y_i\}$

linear classifier

$$ f(\ve{x}) = \ve{w}\cdot\ve{x} + b $$

$f(\ve{x})=0$ defines hyperplane

$f \ge 1$ for $y=1$, $f \le -1$ for $y=-1$

combine

$$ y_i \left( \ve{w}\cdot\ve{x}_i + b \right) \ge 1 $$

margin (distance from the hyperplane to $f = \pm 1$) is $|\ve{w}|^{-1}$

maximize margin

$$ \ba \min_{\ve{w}} \frac{1}{2} \ve{w}\cdot\ve{w} \\ {\rm subject\ to} ~~ y_i \left( \ve{w}\cdot\ve{x}_i + b \right) \ge 1 \ea $$

(1/2 for algebraic convenience)

to find $\ve{w}$, search in size of feature vector $D$

Lagrangian

$$ {\cal L} = \frac{1}{2} \ve{w}\cdot\ve{w} - \sum_i \lambda_i \left[ y_i \left( \ve{w}\cdot\ve{x}_i + b \right) - 1 \right] $$

set

$$ \ba 0 &=& \ps{\cal L}{\ve{w}} \\ &=& \ve{w} - \sum_i \lambda_i y_i \ve{x}_i \\ \Rightarrow \ve{w} &=& \sum_i \lambda_i y_i \ve{x}_i \ea $$$$ \ba 0 &=& \ps{\cal L}{b} \\ &=& \sum_i \lambda_i y_i \ea $$

substitute

$$ \ba {\cal L} &=& \frac{1}{2} \sum_{i,j} \lambda_i y_i \ve{x}_i\cdot\ve{x}_j y_j \lambda_j - \sum_{i,j} \lambda_i y_i \ve{x}_i\cdot\ve{x}_j y_j \lambda_j - b \sum_i \lambda_i y_i + \sum_i \lambda_i \\ &=& \sum_i \lambda_i - \frac{1}{2} \sum_{i,j} \lambda_i y_i \ve{x}_i\cdot\ve{x}_j y_j \lambda_j \\ &=& \sum_i \lambda_i - \frac{1}{2} \sum_{i,j} \lambda_i y_i G_{ij} y_j \lambda_j \ea $$

Gram matrix $G_{ij} = \ve{x}_i\cdot\ve{x}_j$

dual

$$ \ba \min_{\ve{\lambda}} \frac{1}{2} \sum_{i,j} \lambda_i y_i G_{ij} y_j \lambda_j - \sum_i \lambda_i \\ {\rm subject\ to} ~~~~~~~~ \lambda_i \ge 0 \\ \sum_i \lambda_i y_i = 0 \ea $$

to find $\ve{\lambda}$, search in number of points $N$

advantageous for huge feature vectors

QP

dual solution $\ve{\lambda}$ $\rightarrow$ primal solution $\ve{w}$ $\rightarrow$ KKT $b$

$$ \lambda_i \left[ y_i \left( \ve{w}\cdot\ve{x}_i + b \right) - 1 \right] = 0 $$$$ \ba f(\ve{x}) &=& \ve{w}\cdot\ve{x} + b \\ &=& \sum_i \lambda_i y_i \ve{x}_i\cdot\ve{x} + b \ea $$

$\lambda_i$ nonzero only for constraint equality: define support vectors at boundary

sparse sum over support vectors
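the dual above as a small QP, solved here with SciPy's SLSQP (a generic constrained solver standing in for a dedicated QP solver; the four 2D points are made up so the answer is known: $\ve{w}=(1,0)$, $b=0$):

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 0.0], [2.0, 1.0], [-1.0, 0.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
G = X @ X.T                        # Gram matrix
H = np.outer(y, y) * G             # y_i G_ij y_j

dual = lambda lam: 0.5 * lam @ H @ lam - lam.sum()
cons = {'type': 'eq', 'fun': lambda lam: lam @ y}
res = minimize(dual, np.ones(len(y)), method='SLSQP',
               constraints=[cons], bounds=[(0, None)] * len(y))
lam = res.x
w = (lam * y) @ X                  # primal weights from the dual solution
sv = int(np.argmax(lam))           # a support vector (largest multiplier)
b = y[sv] - w @ X[sv]              # KKT: y_i (w.x_i + b) = 1 on the margin
print(w, b, lam)                   # only the two margin points get lam > 0
```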

Kernel Trick

kernel $k$

terminology from functional analysis of integral equations

here symmetric positive definite functions

$\mat{K}$ kernel matrix $K_{ij} = k(\ve{x}_i,\ve{x}_j)$

e.g. Gaussian kernel

$$ K_{ij} = e^{-|\ve{x}_i-\ve{x}_j|^2/(2\sigma^2)} $$

$\mat{K}$ symmetric positive definite $\Rightarrow \mat{K} = \mat{V}\mat{\Lambda}\mat{V}^T$

$\mat{V}$ columns eigenvectors

$i$th row $\ve{v}_i$

$\mat{\Lambda}$ diagonal eigenvalues

$$ \ba k(\ve{x}_i,\ve{x}_j) &=& K_{ij} \\ &=& \left(\mat{V}\mat{\Lambda}\mat{V}^T\right)_{ij} \\ &=& \ve{v}_i\cdot\mat{\Lambda}\ve{v}_j \\ &=& \left(\sqrt{\mat{\Lambda}}~\ve{v}_i\right)\cdot \left(\sqrt{\mat{\Lambda}}~\ve{v}_j\right) \\ &\equiv& \ve{\phi}(\ve{x}_i)\cdot\ve{\phi}(\ve{x}_j) \ea $$

symmetric positive definite kernel implies dot product

$$ \ba \sum_{ij} \alpha_i k(\ve{x}_i,\ve{x}_j) \alpha_j &=& \sum_{ij} \alpha_i \ve{\phi}(\ve{x}_i)\cdot\ve{\phi}(\ve{x}_j) \alpha_j \\ &=& \left(\sum_i\alpha_i\ve{\phi}(\ve{x}_i)\right) \cdot\left(\sum_j\alpha_j\ve{\phi}(\ve{x}_j)\right) \\ &=& \left|\sum_i\alpha_i\ve{\phi}(\ve{x}_i)\right|^2 \\ &\ge& 0 \ea $$

dot product implies symmetric positive definite
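a numerical check on random points (an illustrative sample) that the Gaussian kernel matrix is symmetric positive (semi)definite, so it defines a valid dot product:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # pairwise |xi-xj|^2
K = np.exp(-d2 / 2.0)                                     # sigma = 1
print(bool(np.allclose(K, K.T)), float(np.linalg.eigvalsh(K).min()))
```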

Mercer's theorem for functions {Smola:02}

generalize dot product, possibly infinite dimensional

map feature vector to symmetric positive definite kernel $\Phi(\ve{x}) = k(\cdot,\ve{x})$

$[\Phi(\ve{x})](\ve{x}') = k(\ve{x}',\ve{x})$

define vector space

$$ f(\cdot) = \sum_i \alpha_i k(\cdot,x_i) $$

define inner product with $g(\cdot) = \sum_j \beta_j k(\cdot,x_j)$

$$ \ba \left\langle f(\cdot),g(\cdot) \right\rangle &=& \left\langle \sum_i \alpha_i k(\cdot,x_i), \sum_j \beta_j k(\cdot,x_j) \right\rangle \\ &=& \sum_{ij} \alpha_i \left\langle k(\cdot,x_i), k(\cdot,x_j) \right\rangle \beta_j \\ &=& \sum_{ij} \alpha_i k(x_i,x_j) \beta_j \ea $$$$ \ba \left\langle \Phi(\ve{x}),\Phi(\ve{x}') \right\rangle &=& \left\langle k(\cdot,\ve{x}),k(\cdot,\ve{x}') \right\rangle \\ &=& k(\ve{x},\ve{x}') \ea $$

giving dot product representation

$$ \ba \left\langle k(\cdot,\ve{x}),f(\cdot) \right\rangle &=& \left\langle k(\cdot,\ve{x}), \sum_i\alpha_i k(\cdot,\ve{x}_i) \right\rangle \\ &=& \sum_i\alpha_i \left\langle k(\cdot,\ve{x}), k(\cdot,\ve{x}_i) \right\rangle \\ &=& \sum_i\alpha_i k(\ve{x},\ve{x}_i) \\ &=& f(\ve{x}) \ea $$

reproducing property

$$ \left| f(\ve{x}) \right|^2 = \left| \left\langle k(\cdot,\ve{x}),f \right\rangle \right|^2 \le k(\ve{x},\ve{x})~\langle f(\cdot),f(\cdot) \rangle $$

$f$ vanishes if dot product vanishes

$$ \left\langle f(\cdot),f(\cdot) \right\rangle = \sum_{ij} \alpha_i k(x_i,x_j) \alpha_j \ge 0 $$

dot product non-negative

with norm $\sqrt{\langle f(\cdot),f(\cdot) \rangle}$ defines Reproducing Kernel Hilbert Space (RKHS) {Berlinet:04}

representer theorem

SDP hierarchy of convex relaxations {Parrilo:03}

SDP for combinatorial optimization {Goemans:95,Goemans:04}

SDP for global polynomial minimization {Lasserre:02,Lasserre:06}

nonlinear

$$ f(\ve{x}) = \ve{W}\cdot\ve{\phi}(\ve{x}) + b $$$$ {\cal L} = \sum_i \lambda_i - \frac{1}{2}\sum_{i,j} \lambda_i y_i \ve{\phi}(\ve{x}_i)\cdot\ve{\phi}(\ve{x}_j) y_j \lambda_j $$

nonlinear classifier with kernel trick

$$ {\cal L} = \sum_i \lambda_i - \frac{1}{2}\sum_{i,j} \lambda_i y_i K_{ij} y_j \lambda_j $$

soft margin

$$ y_i \left( \ve{W}\cdot\ve{\phi}(\ve{x}_i) + b \right) \ge 1 - s_i $$

slack $s_i$ allows violation

$$ \ba \min \frac{1}{2} \ve{W}\cdot\ve{W} +\lambda_s \sum_i s_i \\ {\rm subject\ to} ~~ y_i \left( \ve{W}\cdot\ve{\phi}(\ve{x}_i) + b \right) \ge 1-s_i \\ s_i \ge 0 \ea $$

Lagrangian

$$ {\cal L} = \frac{1}{2} \ve{W}\cdot\ve{W} + \lambda_s \sum_i s_i - \sum_i \lambda_i \left[ y_i \left( \ve{W}\cdot\ve{\phi}(\ve{x}_i) + b \right) - 1 + s_i\right] - \sum_i \lambda_{s,i} s_i $$$$ \ba 0 &=& \ps{\cal L}{s_i} \\ &=& \lambda_s - \lambda_i - \lambda_{s,i} \\ \Rightarrow \lambda_s &=& \lambda_i + \lambda_{s,i} \ea $$

$\lambda_{s,i} \ge 0$, $\lambda_i \ge 0$ therefore $0 \le \lambda_i \le \lambda_s$

same problem, now with upper bound on $\lambda_i$

KKT

$$ \lambda_i \left[ y_i \left( \ve{W}\cdot\ve{\phi}(\ve{x}_i) + b \right) - 1 + s_i \right] = 0 $$$$ \lambda_{s,i} s_i = 0 $$$$ \left( \lambda_s - \lambda_i \right) s_i = 0 $$

three cases

$\lambda_i = 0$ OK

$\lambda_i = \lambda_s$ allows $s_i > 0$: margin violation

$0 < \lambda_i < \lambda_s$ implies $s_i = 0$ support vector

M-M

$$ \ba \min \frac{1}{2} \left(\ve{W}\cdot\ve{W} + b^2\right) + \lambda_s \sum_i s_i \\ {\rm subject\ to} ~~ y_i \left( \ve{W}\cdot\ve{\phi}(\ve{x}_i) + b \right) \ge 1-s_i \\ s_i \ge 0 \ea $$

include $b$ in norm

{Mangasarian:99}

Lagrangian

$$ {\cal L} = \frac{1}{2} \left(\ve{W}\cdot\ve{W} + b^2\right) + \lambda_s \sum_i s_i - \sum_i \lambda_i \left[ y_i \left( \ve{W}\cdot\ve{\phi}(\ve{x}_i) + b \right) - 1 + s_i\right] - \sum_i \lambda_{s,i} s_i $$$$ \ps{\cal L}{b} = 0 ~\Rightarrow~ b = \sum_i \lambda_i y_i $$$$ {\cal L} = \sum_i \lambda_i - \frac{1}{2}\sum_{i,j} \lambda_i y_i (K_{ij}+1) y_j \lambda_j $$

no equality constraint

least-squares

{Suykens:02}

$$ \ba \min \frac{1}{2} \ve{W}\cdot\ve{W} + \lambda_s \frac{1}{2} \sum_i s_i^2 \\ {\rm subject\ to} ~~ y_i \left( \ve{W}\cdot\ve{\phi}(\ve{x}_i) + b \right) = 1-s_i \ea $$

error target rather than threshold

equality constraint, sign doesn't matter

$$ \ps{\cal L}{s_i} = 0 \Rightarrow \lambda_i = \lambda_s s_i $$

linear KKT

$$ \left[ \begin{array}{c c c c} 0 & y_1 & \cdots & y_N \\ y_1 & & \vdots & \\ \vdots & \cdots & y_i K_{ij} y_j + \delta_{ij}/\lambda_s & \cdots \\ y_N & & \vdots & \end{array} \right] \left[ \begin{array}{c} b \\ \lambda_1 \\ \vdots \\ \lambda_N \end{array} \right] = \left[ \begin{array}{c} 0 \\ 1 \\ \vdots \\ 1 \end{array} \right] $$

lose sparsity
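assembling and solving this linear KKT system for a tiny made-up 1D dataset with a Gaussian kernel (an illustrative sketch, not a tuned classifier):

```python
import numpy as np

x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([-1.0, -1.0, 1.0, 1.0])
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2.0)   # Gaussian kernel, sigma=1
lam_s, n = 10.0, len(x)

A = np.zeros((n + 1, n + 1))
A[0, 1:] = y                                        # sum_i lambda_i y_i = 0
A[1:, 0] = y
A[1:, 1:] = np.outer(y, y) * K + np.eye(n) / lam_s  # y_i K_ij y_j + d_ij/lam_s
rhs = np.concatenate(([0.0], np.ones(n)))
sol = np.linalg.solve(A, rhs)
b, lam = sol[0], sol[1:]

f = lambda t: (lam * y) @ np.exp(-(x - t) ** 2 / 2.0) + b
print(b, f(-1.5), f(1.5))   # classifier separates the two groups
```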

multiple classes {Vapnik:98}

class $k$

$$ f_k(\ve{x}) = \ve{W}_k\cdot\ve{\phi}(\ve{x}) + b_k $$$$ {\rm argmax} \{ \ve{W}_1\cdot\ve{\phi}(\ve{x}) + b_1 , \ldots , \ve{W}_k\cdot\ve{\phi}(\ve{x}) + b_k , \ldots \} $$

could construct each separately

add constraint and find jointly

$$ \ve{W}_k\cdot\ve{\phi}(\ve{x}_i) + b_k -\ve{W}_m\cdot\ve{\phi}(\ve{x}_i) - b_m \ge 1 - s_{i} $$

for $\ve{x}_i$ in class $k$, $m \ne k$

Regression

regression {Smola:04}

$$ f(\ve{x}) = \ve{W}\cdot\ve{\phi}(\ve{x}) + b $$

sparsity $\Rightarrow$ ignore errors $< \eps$

like the SVD's minimum-norm solution, minimize magnitude $\ve{W}$

primal

$$ \ba \min \frac{1}{2} \ve{W}\cdot\ve{W} + \lambda_s \sum_i (s_i^+ + s_i^-) \\ {\rm subject\ to} ~~ y_i-\ve{W}\cdot\ve{\phi}(\ve{x}_i)-b \le \eps+s_i^+ \\ \ve{W}\cdot\ve{\phi}(\ve{x}_i)+b - y_i \le \eps+s_i^- \\ s_i^+,s_i^- \ge 0 \ea $$

Lagrangian

$$ \ba {\cal L} &=& \frac{1}{2}\ve{W}\cdot\ve{W} + \lambda_s\sum_i\left(s_i^+ + s_i^-\right) - \sum_i\left(\lambda_{s,i}^+s_i^+ + \lambda_{s,i}^-s_i^-\right) \\ & & -\sum_i\lambda_i^+\left(\eps+s_i^+ -y_i+\ve{W}\cdot\ve{\phi}(\ve{x}_i)+b\right) \\ & & -\sum_i\lambda_i^-\left(\eps+s_i^- +y_i-\ve{W}\cdot\ve{\phi}(\ve{x}_i)-b\right) \ea $$$$ \ps{\cal L}{\ve{W}} = 0 ~\Rightarrow~ \ve{W} = \sum_i\left(\lambda_i^+-\lambda_i^-\right)\ve{\phi}(\ve{x}_i) $$$$ \ps{\cal L}{b} = 0 ~\Rightarrow~ \sum_i\left(\lambda_i^+-\lambda_i^-\right) = 0 $$$$ \ps{\cal L}{s_i^+} = 0 ~\Rightarrow~ \lambda_s = \lambda_{s,i}^+ + \lambda_i^+ $$$$ \ps{\cal L}{s_i^-} = 0 ~\Rightarrow~ \lambda_s = \lambda_{s,i}^- + \lambda_i^- $$

dual

$$ \ba \max ~ -\frac{1}{2} \sum_{i,j}\left(\lambda_i^+-\lambda_i^-\right) \left(\lambda_j^+-\lambda_j^-\right) K_{ij} \\ -\eps \sum_i \left(\lambda_i^+ + \lambda_i^-\right) + \sum_i y_i \left(\lambda_i^+ - \lambda_i^-\right) \\ {\rm subject\ to} ~~~ \sum_i \left(\lambda_i^+ - \lambda_i^-\right) = 0 \\ 0 \le \lambda_i^+,\lambda_i^- \le \lambda_s \ea $$$$ \ba f(\ve{x}) &=& \ve{W}\cdot\ve{\phi}(\ve{x}) + b \\ &=& \sum_i \left( \lambda_i^+ - \lambda_i^- \right) k(\ve{x},\ve{x}_i) + b \ea $$

Clustering

unsupervised learning

clustering {Vapnik:01}

$$ \left| \ve{\phi}(\ve{x}_i) - \ve{\phi}_0 \right|^2 \le R^2 $$

slack allows outliers

$$ \left| \ve{\phi}(\ve{x}_i) - \ve{\phi}_0 \right|^2 \le R^2 + s_i $$

want smallest enclosing hypersphere: minimize $R^2$

$$ {\cal L} = R^2 + \lambda_s \sum_i s_i - \sum_i \lambda_i \left[ R^2 + s_i - \left| \ve{\phi}(\ve{x}_i) - \ve{\phi}_0 \right|^2 \right] - \sum_i \lambda_{s,i} s_i $$$$ \ba \ps{\cal L}{R} &=& 2R - 2R\sum_i \lambda_i = 0 \\ &\Rightarrow& \sum_i \lambda_i = 1 \ea $$$$ \ba \ps{\cal L}{\ve{\phi}_0} &=& \ps{}{\ve{\phi}_0} \sum_i \lambda_i \left(\ve{\phi}(\ve{x}_i)-\ve{\phi}_0\right) \cdot \left(\ve{\phi}(\ve{x}_i)-\ve{\phi}_0\right) \\ &=& \sum_i \lambda_i \left( - 2\ve{\phi}(\ve{x}_i) + 2 \ve{\phi}_0 \right) \\ &=& 0 \\ \Rightarrow \ve{\phi}_0 &=& \sum_i \lambda_i \ve{\phi}(\ve{x}_i) \ea $$$$ \ba \ps{\cal L}{s_i} &=& \lambda_s - \lambda_i - \lambda_{s,i} = 0 \\ \Rightarrow \lambda_{s,i} &=& \lambda_s - \lambda_i \ea $$

KKT

$$ \lambda_{s,i} s_i = 0 $$$$ \lambda_i \left[ R^2 + s_i - \left| \ve{\phi}(\ve{x}_i) - \ve{\phi}_0 \right|^2 \right] = 0 $$

support vectors

substitute

$$ {\cal L} = \sum_i \lambda_i \ve{\phi}(\ve{x}_i) \cdot \ve{\phi}(\ve{x}_i) - \sum_{i,j} \lambda_i \lambda_j \ve{\phi}(\ve{x}_i)\cdot\ve{\phi}(\ve{x}_j) $$$$ {\cal L} = \sum_i \lambda_i K_{ii} - \sum_{i,j} \lambda_i \lambda_j K_{ij} $$$$ 0 \le \lambda_i \le \lambda_s $$

radius

$$ R^2(\ve{x}) = k(\ve{x},\ve{x}) - 2\sum_j\lambda_j k(\ve{x},\ve{x}_j) + \sum_{j,k}\lambda_j \lambda_k K_{jk} $$

boundary radius of support vectors

cluster assignment: points are adjacent if the radius stays $\le R$ along the line connecting them; connected components of the adjacency matrix define the clusters

Routines

References

  • Nocedal, Jorge, \& Wright, Stephen~J. (2006). Numerical Optimization. 2nd edn. New York: Springer.

    • Unusually clear coverage of a field full of unusually opaque books.
  • Boyd, Stephen, \& Vandenberghe, Lieven. (2004). Convex Optimization. Cambridge: Cambridge University Press.

    • Authoritative survey of convex problems and solutions.

Problems

  1. Given a point $(x_0,y_0)$, analytically find the closest point on the line $y=ax+b$ by minimizing the distance $d^2 = (x_0-x)^2+(y_0-y)^2$ subject to the constraint $y-ax-b=0$.

    1. Choose values for $x_0, y_0, a, b$, write constrained optimization solvers to find this numerically, and compare their performance.
  2. Consider a set of $N$ nodes that has each measured a quantity $x_i$. The goal is to find the best estimate $\bar{x}$ by minimizing $$ \min_{\bar{x}} \sum_{i=1}^N \left(\bar{x}-x_i\right)^2 ~~~~, $$ however each node $i$ can communicate only with nodes $j$ in its neighborhood $j \in {\cal N}(i)$. This can be handled by having each node obtain a local estimate $\bar{x}_i$, and introducing a consistency constraint $c_{ij} = \bar{x}_i - \bar{x}_j = 0 ~\forall~ j \in {\cal N}(i)$.

    1. What is the Lagrangian?
    2. Find an update rule for the estimates $\bar{x}_i$ by evaluating where the gradient of the Lagrangian vanishes.
    3. Find an update rule for the Lagrange multipliers by taking a Newton root-finding step on their associated constraints.
  3. Sorting can be written in terms of a permutation matrix $\mat{P}$ as $\ve{s} = \mat{P}\ve{u}$, where $\ve{u}$ is a vector of unsorted numbers, $\ve{s}$ are the sorted numbers, and each row and column of $\mat{P}$ has one 1 and the rest of the elements are 0. Defining the vector $\ve{n}$ to be $\{ 1,2,\ldots\}$, sorting can be done by minimizing $\ve{n}\cdot\ve{s}$. This is a combinatorial optimization; there are $N!$ possible $N\times N$ permutation matrices. It is an example of the linear sum assignment problem; use a constrained optimization solver to sort a vector of random numbers.


(c) Neil Gershenfeld 4/7/26