Linear prediction function $$f(\mathbf{x}) = \text{sign}(\langle \mathbf{w}, \mathbf{x}\rangle +b) $$
Defines a hyperplane
w = jnp.ones(2)                                    # normal vector of the hyperplane
b = -0.5                                           # bias
t = 40; tx = jnp.linspace(-1, 2, t); ty = jnp.linspace(-1, 2, t)
xv, yv = jnp.meshgrid(tx, ty, sparse=True); xv = xv.squeeze(); yv = yv.squeeze()
xx = jnp.array([[x1, x2] for x2 in yv for x1 in xv])            # grid of 2D points
levels = jnp.linspace(-1.5, 1.5, 10)
y_pred = (1.*(jnp.matmul(xx, w)+b > 0)).reshape(t, t)           # predicted class (1 if <w, x> + b > 0) on the grid
plt.contourf(xv, yv, -y_pred, levels=levels);
Hinge loss:
$$ l(y, f(\mathbf{x})) = \max(0, 1 - yf(\mathbf{x})) $$Given training set $\mathcal{A} = \{(\mathbf{x}, y)\}$, minimize the empirical risk: $$\min_{\mathbf{w}, b} \frac{1}{n}\sum_i \max(0, 1 - y_i (\langle\mathbf{w}, \mathbf{x}_i\rangle + b)) $$
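As a quick visual check (a small sketch, not one of the original cells), the hinge loss can be plotted as a function of the margin $t = y f(\mathbf{x})$: zero loss beyond margin 1, linear penalty otherwise.
import jax.numpy as jnp
import matplotlib.pyplot as plt
t = jnp.linspace(-2, 3, 100)
plt.plot(t, jnp.maximum(0., 1. - t))     # max(0, 1 - t)
plt.xlabel('y f(x)'); plt.ylabel('hinge loss');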
Convex problem (a sum of convex functions), so it is easily optimized by gradient descent
For large training sets, stochastic gradient descent works well (a minibatch sketch follows the example below)
X = data['X_train_bin']
y = data['y_train_bin']*2-1                        # map labels {0, 1} to {-1, +1}
def func(w, b, x):
    return jnp.matmul(x, w) + b                    # linear score <w, x> + b
def hinge(w, b, x, y):
    return jax.nn.relu(1 - y * func(w, b, x)).mean()   # empirical hinge risk
@jax.jit
def update(w, b, x, y):
    dw, db = jax.grad(hinge, argnums=(0,1))(w, b, x, y)
    return w - 0.01*dw, b - 0.01*db                # one gradient-descent step
w = np.random.randn(784)
b = 0.
loss = []
for t in range(500):
    loss.append(hinge(w, b, X, y))
    w, b = update(w, b, X, y)
plt.plot(loss)
def accuracy(y_pred, y_true):
    # mean of sign(y_true * y_pred): +1 per correct sign, -1 per error
    return jnp.sign(y_true*y_pred).mean()
y_pred = func(w, b, X)
print('accuracy: {}'.format(accuracy(y_pred, y)))
accuracy: 1.0
X_val = data['X_val_bin']
y_val = data['y_val_bin']*2-1
y_pred = func(w, b, X_val)
print('validation accuracy: {}'.format(accuracy(y_pred, y_val)))
validation accuracy: 0.9047619104385376
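As noted above, stochastic gradient descent is preferable for large training sets. A minimal minibatch variant of the loop above (a sketch reusing the same hinge and update functions; the batch size and seed are arbitrary):
import numpy as np
w = np.random.randn(784)
b = 0.
batch = 32                                          # arbitrary minibatch size
rng = np.random.default_rng(0)
for t in range(500):
    idx = rng.integers(0, X.shape[0], size=batch)   # random minibatch (with replacement)
    w, b = update(w, b, X[idx], y[idx])             # same jitted gradient step as above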
w = jnp.array([-1, 1]) + 0.1*np.random.randn(5, 2)   # 5 slightly different predictors
plt.scatter([0, 1], [0, 1], c=[0, 1])                # two training points of opposite classes
for i in range(5):
    plt.plot([0, 1], [w[i,1], w[i,0]+w[i,1]])        # each line separates the two points perfectly
The Structural Risk Minimization principle defines a trade-off between the quality of the approximation of the given data and the complexity of the approximating function (Vladimir N. Vapnik)
Given a family of functions that all have $R_e = 0$ and that can be split into subsets $S_k$ ordered by their complexity $h_k$
$$ S_0 \subset S_1 \subset \dots \subset S_N $$$$ h_0 \leq h_1 \leq \dots \leq h_N $$We choose a function from the subset with the lowest complexity
The VC Dimension of a set of indicator functions $Q(z, \alpha), \alpha \in \Lambda$, is the maximum number $h$ of vectors $z_1, \dots , z_h$ that can be separated into 2 classes in all $2^h$ possible ways using functions from the set.
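A standard illustration: linear classifiers in the plane ($d=2$) can shatter 3 points in general position, since all $2^3 = 8$ labelings are realizable by some line, but no set of 4 points can be shattered (an XOR-like labeling of 4 points always fails), so their VC dimension is $3 = d+1$.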
True risk is bounded by a combination of empirical risk and structural risk depending on $h$
$$ R(\alpha) \leq R_e(\alpha) + F(h) $$Let $\mathbf{w}$ be the separating hyperplane with margin $\Delta$ :
\begin{equation} y = \left\{ \begin{array}{rl} 1 & \text{if } \mathbf{w}^\top x -b \geq \Delta, \\ -1 & \text{if } \mathbf{w}^\top x -b \leq -\Delta. \end{array} \right. \end{equation}Theorem (Vapnik 1995): Given a training set $\mathcal{A} = \{(\mathbf{x}_i\in \mathbb{R}^d, y_i)\}$ such that $\|\mathbf{x}_i\|\leq R$ and that can be separated by an hyperplane with margin $\Delta$,
\begin{equation} h \leq \min\left(\frac{R^2}{\Delta^2}, d\right)+1 \end{equation}$\Rightarrow$ We have to maximize the margin
$\Rightarrow$ We have to minimize $\|\mathbf{w}\|^2$
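The link between the two implications (a standard computation): the distance from $\mathbf{x}_i$ to the hyperplane is $|\mathbf{w}^\top\mathbf{x}_i - b| / \|\mathbf{w}\|$, so the conditions above guarantee a geometric margin of at least $\Delta / \|\mathbf{w}\|$. Fixing the scale so that $\Delta = 1$ (canonical hyperplane), the margin becomes $1/\|\mathbf{w}\|$, and maximizing it is equivalent to minimizing $\|\mathbf{w}\|^2$.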
Of all the hyperplanes that perfectly classify the training samples, select the one with minimal norm
In practice, define a soft margin \begin{align*} \min_{\mathbf{w}, b, \zeta_i}&\quad \frac{\lambda}{2}\|\mathbf{w}\|^2 + \frac{1}{n}\sum_i \zeta_i \\ \text{s.t.} & \quad \forall i, y_i(\mathbf{w}^\top\mathbf{x}_i +b) \geq 1 - \zeta_i \end{align*}
Solve the equivalent problem using stochastic gradient descent
\begin{align*} \min_{\mathbf{w}, b}&\quad \frac{\lambda}{2}\|\mathbf{w}\|^2 + \frac{1}{n}\sum_i \max (0, 1 - y_i(\mathbf{w}^\top\mathbf{x} +b)) \end{align*}Strongly convex problem
def func(w, b, x):
return jnp.matmul(x, w) + b
def hinge(w, b, x, y):
return jax.nn.relu(1 - y * func(w, b, x)).mean()
def loss(w, b, x, y):
return 0.001*(w*w).sum() + hinge(w, b, x, y)
@jax.jit
def update(w, b, x, y):
dw, db = jax.grad(loss, argnums=(0,1))(w, b, x, y)
return w - 0.1*dw, b - 0.01*db
w = np.random.randn(784)
b = 0.
l = []
for t in range(500):
l.append(hinge(w, b, X, y))
w, b = update(w, b, X, y)
plt.plot(l)
def accuracy(y_pred, y_true):
return jnp.sign(y_true*y_pred).mean()
y_pred = func(w, b, X)
print('accuracy: {}'.format(accuracy(y_pred, y)))
accuracy: 1.0
y_pred = func(w, b, X_val)
print('validation accuracy: {}'.format(accuracy(y_pred, y_val)))
validation accuracy: 0.9047619104385376
Two types of approaches for handling $M$ classes:
One versus All: $M$ classifiers, take the $\text{argmax}$
One versus One: $M(M-1)/2$ classifiers, majority vote (a voting sketch follows the one-versus-all example below)
One versus all
X = data['X_train']
y = jax.nn.one_hot(data['y_train'], 10)*2 - 1
def func(w, b, x):
return jnp.matmul(x, w) + b
def hinge(w, b, x, y):
return jax.nn.relu(1 - y * func(w, b, x)).mean()
def loss(w, b, x, y):
return 0.01*(w*w).sum() + hinge(w, b, x, y)
@jax.jit
def update(w, b, x, y):
dw, db = jax.grad(loss, argnums=(0,1))(w, b, x, y)
return w - 0.1*dw, b - 0.1*db
w = np.random.randn(784, 10)
b = jnp.zeros(10)
l = []
for t in range(2500):
l.append(hinge(w, b, X, y))
w, b = update(w, b, X, y)
plt.plot(l)
def accuracy(y_pred, y_true):
return (1.*(jnp.argmax(y_true, axis=1) == jnp.argmax(y_pred, axis=1))).mean()
y_pred = func(w, b, X)
print('accuracy: {}'.format(accuracy(y_pred, y)))
accuracy: 1.0
X_val = data['X_val']
y_val = jax.nn.one_hot(data['y_val'], 10)*2 - 1
y_pred = func(w, b, X_val)
print('validation accuracy: {}'.format(accuracy(y_pred, y_val)))
validation accuracy: 0.6600000262260437
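For comparison, the one-versus-one decision is a majority vote over the $M(M-1)/2$ pairwise classifiers. A minimal sketch, where the classifiers dictionary is hypothetical (one trained binary predictor per pair of classes, with label $+1$ for the first class of the pair):
import numpy as np
import jax.numpy as jnp
def ovo_predict(classifiers, x, n_classes=10):
    # classifiers: hypothetical dict {(a, b): (w, bias)}, one binary SVM per pair of classes
    votes = np.zeros(n_classes)
    for (a, b), (w, bias) in classifiers.items():
        winner = a if jnp.dot(w, x) + bias > 0 else b   # pairwise decision
        votes[winner] += 1
    return int(np.argmax(votes))                        # class with the most pairwise wins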
Back to the hard margin: \begin{align*} \min_{\mathbf{w}, b}&\quad \frac{1}{2} \|\mathbf{w}\|^2 \\ \text{s.t.} &\quad \forall i, y_i(\langle\mathbf{w}, \mathbf{x}_i \rangle +b) \geq 1 \end{align*}
Compute the Lagrangian:
\begin{align*} \mathcal{L}(\mathbf{w}, b, \alpha)& \quad = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_i \alpha_i(y_i(\mathbf{w}^\top\mathbf{x_i}+b) - 1) \\ \text{s.t.} & \quad \forall i, \alpha_i \geq 0 \end{align*}$\alpha_i$ are the Lagrange multipliers for the augmented problem
Karush-Kuhn-Tucker optimal conditions (well known in optimization):
Stationarity: $\frac{\partial\mathcal{L}}{\partial\mathbf{w^\star}} = \mathbf{0}$
Primal feasibility: $\forall i, y_i(\mathbf{w}^\top\mathbf{x}_i +b) \geq 1$
Dual feasibility: $\forall i, \alpha_i \geq 0$
Complementary slackness: $\forall i, \alpha_i(y_i(\mathbf{w}^\top\mathbf{x_i}+b)-1)=0$
Stationarity gives $\mathbf{w}^\star - \sum_i \alpha_i y_i \mathbf{x}_i = \mathbf{0}$, so $\mathbf{w}^\star$ is a linear combination of the training samples
Theorem (Schölkopf et al.): Given a training set $\mathcal{A} = \{(\mathbf{x}_i, y_i)\}$, an arbitrary error measuring function $l(\cdot, \cdot)$ and a strictly increasing function $g$, any minimizer of the empirical risk
$$\mathbf{w}^\star = \text{argmin}_\mathbf{w} \frac{1}{n}\sum_i l(y_i, \langle \mathbf{w}, \mathbf{x}_i \rangle) + g(\|\mathbf{w}\|) $$has a decomposition of the form
$$\mathbf{w}^\star = \sum_i \alpha_i \mathbf{x}_i $$(The training set is a spanning set of the solution space)
Complementary slackness:
\begin{align*} \forall i, \alpha_i(y_i(\mathbf{w}^\top\mathbf{x_i}+b) - 1)=0 \end{align*}Which means \begin{align*} \mathbf{w}^\top\mathbf{x_i}+b \neq y_i \Rightarrow \alpha_i = 0 \end{align*}
$\mathbf{w}^\star$ is a combination of the samples that are on the margin
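This can be checked directly: only samples with $y_i f(\mathbf{x}_i) \leq 1$ (on the margin, or violating it in the soft-margin case) can carry a nonzero multiplier $\alpha_i$. A small sketch, assuming the binary $(\mathbf{w}, b)$, X, y from the soft-margin example above:
import jax.numpy as jnp
def support_vector_mask(w, b, X, y):
    margins = y * (jnp.matmul(X, w) + b)   # y_i * f(x_i)
    return margins <= 1.0                  # True for samples on or inside the margin
# support_vector_mask(w, b, X, y).sum() counts the support vectors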
Solving the dual problem: \begin{align*} \max_{\alpha} \inf_{\mathbf{w}, b} \mathcal{L}(\mathbf{w}, b, \alpha) \end{align*}
Since $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$
\begin{align*} \inf_{\mathbf{w}, b} \mathcal{L}(\mathbf{w}, b, \alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j}\alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top\mathbf{x}_j \end{align*}Replacing the dot product $\mathbf{x}_i^\top\mathbf{x}_j$ by a function $k(\mathbf{x}_i, \mathbf{x}_j)$, called a kernel, defines the similarity between $\mathbf{x}_i$ and $\mathbf{x}_j$
Dual problem: \begin{align*} \max_\alpha D(\alpha) = \sum_i\alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j) \end{align*}
Or in matrix form: \begin{align*} \max_\alpha D(\alpha) = \alpha^\top \mathbf{1} - \frac{1}{2} (\alpha \circ \mathbf{y})^\top \mathbf{K} (\alpha \circ \mathbf{y}) \end{align*}
With $\circ$ the Hadamard (element wise) product, and $\mathbf{K}$ the Gram matrix of the kernel
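The matrix form translates directly to code (a sketch; $\mathbf{K}$ is any Gram matrix, alpha and y are vectors of length $n$):
import jax.numpy as jnp
def dual_objective(alpha, y, K):
    v = alpha * y                                   # Hadamard product alpha ∘ y
    return alpha.sum() - 0.5 * jnp.dot(v, jnp.dot(K, v))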
Examples: the polynomial kernel $k(\mathbf{x}_i, \mathbf{x}_j) = \langle \mathbf{x}_i, \mathbf{x}_j \rangle^p$ and the Gaussian kernel $k(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2)$
Exercise: show that the polynomial kernel and the Gaussian kernel are indeed kernels
If the problem is not linearly separable in the mapped space, use a soft margin:
\begin{align*} \min_{\mathbf{w}, b}&\quad \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \zeta_i \\ \text{s.t.} & \quad \forall i, y_i(\mathbf{w}^\top\mathbf{x} +b) \geq 1 - \zeta_i \\ & \quad \forall i, \zeta_i \geq 0 \end{align*}Compute Lagrangian:
\begin{align*} \mathcal{L} =& \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \zeta_i - \sum_i \mu_i \zeta_i - \sum_i \alpha_i(y_i(\mathbf{w}^\top\mathbf{x}_i +b) -1 + \zeta_i )\\ \text{s.t.} & \forall i, \alpha_i \geq 0, \mu_i \geq 0 \end{align*}Stationarity: \begin{align*} \frac{\partial \mathcal{L}}{\partial\mathbf{w}} = \mathbf{w} - \sum_i \alpha_i y_i \mathbf{x}_i = \mathbf{0} \end{align*} $\Rightarrow \mathbf{w}^\star = \sum_i \alpha_i y_i \mathbf{x}_i$ \begin{align*} \frac{\partial \mathcal{L}}{\partial\zeta_i} = C - \mu_i - \alpha_i = 0 \end{align*} $\Rightarrow C = \alpha_i + \mu_i \Rightarrow 0 \leq \alpha_i \leq C$
\begin{align*} \frac{\partial \mathcal{L}}{\partial b} = \sum_i \alpha_i y_i \Rightarrow \sum_i\alpha_i y_i = 0 \end{align*}Similar to the hard margin problem but with upper bound on the Lagrange multipliers (a training sample can not contribute more than $C$)
Decision function: \begin{align*} f(\mathbf{x}) = \sum_i \alpha_i y_i k(\mathbf{x}, \mathbf{x_i}) + b \end{align*}
The bias $b$ is annoying because of the constraint $\sum_i \alpha_i y_i = 0$
Stochastic Dual Coordinate Ascent
Second order ascent using diagonal Hessian approximation
def GaussKernel(x1, x2, gamma=10.0):
    # exp(-gamma * ||x1 - x2||^2), expanded as ||x1||^2 + ||x2||^2 - 2 <x1, x2>
    return jnp.exp(-gamma*( jnp.linalg.norm(x1, axis=-1, keepdims=True)**2 + jnp.linalg.norm(x2, axis=-1, keepdims=True).T**2 - 2*jnp.dot(x1, x2.T)))
def SDCAupdate(i, alpha, x, y, K, C=1.0, eps=1e-7):
    y_pred = jnp.dot(K, alpha)                    # current predictions f(x) = K alpha
    err = 1 - y[i]*y_pred[i]                      # hinge residual for sample i
    if jnp.abs(err) < eps:
        return alpha[i]
    da = err/K[i,i]                               # second-order step: K[i,i] is the diagonal Hessian term
    ai = y[i] * jnp.maximum(0, jnp.minimum(C, da+y[i]*alpha[i]))   # clip the multiplier to the box [0, C]
    return ai
def f_pred(x, alpha, x_train, gamma=10.0):
K = GaussKernel(x, x_train, gamma)
return jnp.dot(K, alpha)
def f_true(x):
    # piecewise-constant target on [0, 1]: -1, +1, -1, +1 on consecutive quarters
    if x <= 0.25: return -1.
    if x > 0.25 and x <= 0.5: return 1.
    if x > 0.5 and x <= 0.75: return -1.
    return 1.
n = 20
gamma = 100.0
key = jax.random.PRNGKey(42)
x = jax.random.uniform(key, (n,1))
y = [f_true(xi) for xi in x]
alpha = jnp.zeros((n,1))
K = GaussKernel(x,x, gamma)
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
camera = Camera(fig)
t = jnp.arange(51)/50.
for e in range(10):                                # epochs
    r = jax.random.permutation(key, n)             # visit samples in random order
    for i in r:
        ai = SDCAupdate(i, alpha, x, y, K, C=100.)
        alpha = jax.ops.index_update(alpha, i, ai)     # alpha[i] <- ai
    ax1.plot(t, [f_true(i) for i in t], '-k')          # target function
    y_pred = f_pred(t[:,None], alpha, x, gamma)
    ax1.plot(t, y_pred, '-r')                          # current prediction
    ax2.stem(x, alpha, basefmt=" ", use_line_collection=True)   # dual variables
    camera.snap()
animation = camera.animate()
HTML(animation.to_html5_video())
Knowing that the VC dimension of a linear classifier in $\mathbb{R}^d$ is $d+1$,
What is the VC dimension of a kernel SVM using $k(\mathbf{x}_i, \mathbf{x}_j) = \langle \mathbf{x}_i, \mathbf{x}_j \rangle^2$?
What is the VC dimension of a kernel SVM using $k(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \| \mathbf{x}_i - \mathbf{x}_j \|^2)$?
A RKHS is a space $\mathcal{H}$ of functions $f: \mathcal{X} \rightarrow \mathbb{R}$ for which the pointwise evaluation corresponds to the dot product with specific functions
$$ f(x) = \langle f, \phi_x \rangle $$Since $\phi_x \in \mathcal{H}$, we have
$$ \phi_x(y) = \langle \phi_x, \phi_y \rangle = k(x, y)$$Theorem (Schölkopf et al.): Given a training set $\mathcal{A} = \{(\mathbf{x}_i, y_i)\}$, $\mathcal{H}$ a Hilbert space of functions associated with the reproducing kernel $k$, an arbitrary error measuring function $l(\cdot, \cdot)$ and a strictly increasing function $g$, any minimizer $f\in \mathcal{H}$ of the empirical risk
$$f^\star = \text{argmin}_{f\in \mathcal{H}} \frac{1}{n}\sum_i l(y_i, f(\mathbf{x}_i)) + g(\|f\|_\mathcal{H}) $$has a decomposition of the form
$$f^\star = \sum_i \alpha_i k(\mathbf{x}_i, \cdot) $$(The training set is a spanning set of the solution space)
Find an explicit mapping that approximates the kernel
$$ k(\mathbf{x}_i, \mathbf{x}_j) \approx \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle $$Map the training set
$$ \forall i, \bar{\mathbf{x}_i} = \phi(\mathbf{x_i}) $$Train a linear SVM on mapped samples
\begin{align*} \min_{\mathbf{w}, b}&\quad \frac{\lambda}{2}\|\mathbf{w}\|^2 + \frac{1}{n}\sum_i \max (0, 1 - y_i(\mathbf{w}^\top\bar{\mathbf{x}}_i +b)) \end{align*}Prediction function
$$ f(\mathbf{x}) = \langle \mathbf{w}, \phi(\mathbf{x})\rangle +b $$Kernel matrix of the training set
$$ \mathbf{K} = [k(\mathbf{x}_i, \mathbf{x}_j)]_{ij} $$Low rank approximation
$$ \mathbf{K} = \mathbf{ULU}^\top $$Non linear projection
$$ \phi(\mathbf{x}) = \mathbf{L}_m^{-1/2}\mathbf{U}_m^\top\mathbf{K}(x), \quad \mathbf{K}(x) = [k(\mathbf{x}_i, \mathbf{x})]_i $$Limits the VC dimension to $m$
def SquareKernel(X1, X2):
return jnp.matmul(X1, X2.T)**2
X = data['X_train']
X = X/jnp.linalg.norm(X, axis=1)[:,None]
y = jax.nn.one_hot(data['y_train'], 10)*2 - 1
K = SquareKernel(X, X)                    # kernel matrix of the training set
L, U = jnp.linalg.eigh(K)                 # eigendecomposition K = U diag(L) U^T
L = L[-64:]                               # keep the 64 largest eigenvalues
U = U[:, -64:]
P = jnp.sqrt(1./L)[:, None]*U.T           # projection L_m^{-1/2} U_m^T
X_bar = jnp.matmul(P, K).T                # mapped training samples
def func(w, b, x):
return jnp.matmul(x, w) + b
def hinge(w, b, x, y):
return jax.nn.relu(1 - y * func(w, b, x)).mean()
def loss(w, b, x, y):
return 0.1*(w*w).sum() + hinge(w, b, x, y)
@jax.jit
def update(w, b, x, y):
dw, db = jax.grad(loss, argnums=(0, 1))(w, b, x, y)
return w - 0.1*dw, b - 0.1*db
w = np.random.randn(64, 10)
b = np.random.randn(10)
l = []
for t in range(2500):
l.append(hinge(w, b, X_bar, y))
w, b = update(w, b, X_bar, y)
plt.plot(l)
def accuracy(y_pred, y_true):
return (1.*(jnp.argmax(y_true, axis=1) == jnp.argmax(y_pred, axis=1))).mean()
y_pred = func(w, b, X_bar)
print('accuracy: {}'.format(accuracy(y_pred, y)))
accuracy: 0.949999988079071
X_val = data['X_val']
X_val = X_val/jnp.linalg.norm(X_val, axis=1)[:,None]
y_val = jax.nn.one_hot(data['y_val'], 10)*2 - 1
K_val = SquareKernel(X, X_val)
X_valbar = jnp.matmul(P, K_val).T
y_pred = func(w, b, X_valbar)
print('validation accuracy: {}'.format(accuracy(y_pred, y_val)))
validation accuracy: 0.6299999952316284
Xt = X[jnp.argsort(data['y_train']), :]
Xt = Xt/jnp.linalg.norm(Xt, axis=1)[:,None]
Kt = SquareKernel(Xt, Xt)
plt.imshow(Kt)
Combination kernel \begin{align*} k(\mathbf{x}_i, \mathbf{x}_j) = \sum_m \beta_m k_m(\mathbf{x}_i, \mathbf{x}_j), \beta_m \geq 0 \end{align*} MKL problem: \begin{align*} \min_\beta\max_\alpha & \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \sum_m \beta_m k_m(\mathbf{x}_i, \mathbf{x_j}) \\ \text{s.t.} & \quad \forall i, 0 \leq \alpha_i \leq C \\ & \quad \forall m, \beta_m \geq 0\\ &\quad \Omega(\beta) = 1 \end{align*} Norm constraint $\Omega$ (e.g., an $\ell_1$ or $\ell_2$ norm on $\beta$)
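A sketch of the combined kernel (the normalization shown is the illustrative $\ell_1$ choice $\sum_m \beta_m = 1$, not the only option):
import jax.numpy as jnp
def combined_kernel(Ks, beta):
    # Ks: list of Gram matrices K_m, beta: nonnegative weights
    beta = beta / beta.sum()                      # enforce sum_m beta_m = 1
    return sum(b * K for b, K in zip(beta, Ks))   # K = sum_m beta_m K_m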
Kernel function $$k(\mathbf{x}_1, \mathbf{x}_2) = \langle \phi(\mathbf{x}_1), \phi(\mathbf{x}_2)\rangle $$
Ridge regression problem $$\min_\mathbf{w} \frac{\lambda}{2}\|\mathbf{w}\|^2 + \frac{1}{n} \sum_i (y_i - \langle \mathbf{w}, \phi(\mathbf{x}_i)\rangle)^2 $$
Representer theorem
$$\mathbf{w} = \sum_i\alpha_i\phi(\mathbf{x}_i) $$Pseudo-inverse solution
$$\mathbf{w} = (\phi(\mathbf{X})\phi(\mathbf{X})^\top + \lambda I)^{-1} \phi(\mathbf{X})\mathbf{y} $$Identity trick
$$(P^{-1}+B^\top R^{-1}B)^{-1}B^\top R^{-1}=PB^\top(BPB^\top+R)^{-1}$$with
$P^{-1} = \lambda I$
$R = I$
$B = \phi(\mathbf{X})^\top$
we get
$$ \mathbf{w} = \phi(\mathbf{X})(\mathbf{K}+\lambda I)^{-1}\mathbf{y} $$Thus
$$\alpha = (\mathbf{K} + \lambda I)^{-1}\mathbf{y} $$$$ f(\mathbf{x}) = \sum_i \alpha_i k(\mathbf{x_i}, \mathbf{x}) $$Linear binary classification:
hinge loss
Gradient descent or stochastic gradient descent
many equivalent predictors
Linear SVM
SRM: Structural Risk Minimization (Occam's razor)
VC Dimension: $d+1$ for linear predictors
$\ell_2$ regularization of the predictor
Kernel SVM
Multiclass
One versus all
One versus One with voting strategy
Kernel methods
Representer theorem: solution is in $\text{span}(\mathcal{A})$
The combination can involve as many terms as there are training samples
Support vectors for hinge loss
Approximate kernel methods