Image('Neuron.png', width=400)
Image('a_neuron.png', width=400)
Linear: $\varphi(x) = x$
Rectified Linear: $\varphi(x) = \max(0, x)$
Sigmoid: $\varphi(x) = \frac{1}{1+e^{-x}}$
Hyperbolic tangent: $\varphi(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
x = jnp.linspace(-3, 3, 100)
plt.plot(x, x, label='linear')
plt.plot(x, jax.nn.relu(x), label='relu')
plt.plot(x, jax.nn.sigmoid(x), label='sigmoid')
plt.plot(x, jnp.tanh(x), label='tanh')
plt.legend()
Stochastic gradient descent on mini-batches from $\mathcal{A} = \{(\mathbf{x}, y)\}$, with loss function $l$
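A generic mini-batch SGD loop could look like the sketch below (illustrative only; params, loss_fn and the data arrays are placeholders, not objects defined in this notebook):
def sgd(params, loss_fn, X, y, eta=0.01, batch_size=64, steps=1000):
    # loss_fn(params, x, y) must return a scalar loss on a batch
    grad_fn = jax.jit(jax.grad(loss_fn))
    for t in range(steps):
        # Monte-Carlo estimate: sample a random mini-batch
        ind = np.random.choice(len(X), size=batch_size, replace=False)
        g = grad_fn(params, X[ind], y[ind])
        # gradient step on every leaf of the parameter pytree
        params = jax.tree_util.tree_map(lambda p, gp: p - eta*gp, params, g)
    return params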
# Load the dataset
data = np.load('mnist.npz')
X = data['X_train']
y = data['y_train']
plt.imshow(X[0,:].reshape(28,28))
print(y[0])
5
Using sigmoid activation, $f(\mathbf{x})$ is a probability
Minimize the cross-entropy, which pushes the output towards the correct label (0 or 1) $$\mathcal{L} = - y \log f(\mathbf{x}) - (1-y) \log (1 - f(\mathbf{x})) $$
X = data['X_train_bin']
y = data['y_train_bin']
def func(w, b, x):
    # logistic regression: sigmoid of an affine map
    return jax.nn.sigmoid(jnp.matmul(x, w) + b)

def xe(w, b, x, y):
    # binary cross-entropy loss
    fx = func(w, b, x)
    return (-y*jnp.log(fx)-(1-y)*jnp.log(1-fx)).mean()

@jax.jit
def update(w, b, x, y):
    # one gradient-descent step on w and b
    dw, db = jax.grad(xe, argnums=(0,1))(w, b, x, y)
    return w - 0.01*dw, b - 0.01*db
w = np.random.randn(784)/784
b = 0.
loss = []
for t in range(500):
    loss.append(xe(w, b, X, y))
    w, b = update(w, b, X, y)
plt.plot(loss)
def accuracy(y_pred, y_true):
    return (y_true==y_pred).mean()
y_pred = (func(w, b, X)>0.5)*1.
print('accuracy: {}'.format(accuracy(y_pred, y)))
accuracy: 1.0
What about XOR?
Xor = jnp.sign(np.random.randn(200, 2)) + 0.1*np.random.randn(200,2)
yor = 1.*((Xor[:,0]*Xor[:,1])>0)
plt.scatter(Xor[:,0], Xor[:,1], c=yor)
Image('mlp.png', width=400)
Define layer $i$ as the function that maps to $\mathbb{R}^d$ obtained by stacking $d$ neurons
$$ f_i(\mathbf{x}) = [\sigma(\mathbf{W}_{ij}^\top\mathbf{x}+\theta_{ij})]_{j\leq d}$$
with $\mathbf{W}_{ij}$ and $\theta_{ij}$ the weights and bias of the $j$-th neuron of layer $i$, and $\sigma$ the activation function
Create a network by composing $L$ layers
$$ F(\mathbf{x}) = f_L \circ \cdots \circ f_1(\mathbf{x}) $$
Exercise: find all the weights of a network with 1 hidden layer of width 2 that solves XOR
Use ReLU activation for the hidden layer and $\text{sign}$ for the output layer (one hand-built solution is sketched after the figure below)
Image('2_xor.png', width=400)
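One possible hand-built solution (a sketch, one of many valid weight assignments), using the same $\pm 1$ encoding as Xor above; note the sign output lives in $\{\pm 1\}$ while yor uses $\{0, 1\}$ labels:
# width-2 ReLU hidden layer and sign output solving XOR by hand
W1_xor = jnp.array([[1., -1.],
                    [1., -1.]])      # hidden pre-activations: x1+x2 and -(x1+x2)
b1_xor = jnp.array([0., 0.])
w2_xor = jnp.array([1., 1.])
b2_xor = -1.

def xor_net(x):
    h = jax.nn.relu(jnp.matmul(x, W1_xor) + b1_xor)   # relu(x1+x2), relu(-x1-x2)
    return jnp.sign(jnp.matmul(h, w2_xor) + b2_xor)   # +1 iff x1 and x2 share a sign

xor_net(jnp.array([[1., 1.], [1., -1.], [-1., 1.], [-1., -1.]]))   # [1., -1., -1., 1.]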
ERM principle
$$ \min_{\{\mathbf{w}_i\}_i} \mathbb{E}_{(\mathbf{x},y)}\left[l(y, F(\mathbf{x}))\right] $$
Gradient descent
$$\forall i,\; \mathbf{w}_i \leftarrow \mathbf{w}_i - \eta\, \mathbb{E}_{(\mathbf{x},y)}\left[\frac{\partial l(y, F(\mathbf{x}))}{\partial\mathbf{w}_i}\right]$$
Monte-Carlo estimation with a mini-batch strategy
$$ \mathbb{E}_{(\mathbf{x},y)}\left[\frac{\partial l(y, F(\mathbf{x}))}{\partial\mathbf{w}_i}\right] \approx \frac{1}{N}\sum_n \frac{\partial l(y_n, F(\mathbf{x}_n))}{\partial \mathbf{w}_i}$$
Denote $\mathbf{x}_k = f_k \circ \cdots \circ f_1(\mathbf{x})$ the $k$-th intermediate output
Denote $g_k(\mathbf{x}_k) = f_L \circ \cdots \circ f_{k+1}(\mathbf{x}_k)$ the output computed from $\mathbf{x}_k$
Remark: $\forall k,\; F(\mathbf{x}) = g_k(\sigma(\mathbf{w}_k^\top\mathbf{x}_{k-1} + \theta_k))$
Chain rule (Leibniz notation)
$$\frac{\partial y}{\partial x} = \frac{\partial y}{\partial z}\frac{\partial z}{\partial x}$$
Single neuron chain
Image('neural_chain.png', width=600)
Recursion \begin{align} \frac{\partial\mathbf{x}_{k+t+1}}{\partial w_k} &= \sigma'(\mathbf{x}_{k+t+1})w_{k+t+1}\frac{\partial \mathbf{x}_{k+t}}{\partial w_k} \\ \frac{\partial\mathbf{x}_{k}}{\partial w_k} &= \sigma'(\mathbf{x}_k)\mathbf{x}_{k-1} \end{align}
Note:
if $w_t \ll 1$: vanishing gradients
if $w_t \gg 1$: exploding gradients
if $\sigma'(\cdot) \approx 0$: vanishing gradient (sigmoid, tanh, but not relu)
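A quick illustration (a sketch; the depths and the fixed unit weights are arbitrary choices): the gradient of a deep chain of sigmoid neurons with respect to its first weight shrinks with depth, since each layer multiplies it by $\sigma'(\cdot) \leq 1/4$, while the equivalent ReLU chain keeps a constant gradient here.
def deep_chain(w0, depth, act):
    # chain of scalar neurons; all weights after the first are fixed to 1
    x = act(w0 * 1.0)
    for _ in range(depth - 1):
        x = act(1.0 * x)
    return x

for depth in [2, 5, 10, 20]:
    g_sig = jax.grad(deep_chain)(1.0, depth, jax.nn.sigmoid)
    g_relu = jax.grad(deep_chain)(1.0, depth, jax.nn.relu)
    print(depth, g_sig, g_relu)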
Recursion \begin{align} \delta_{L} &= l'(y, F(\mathbf{x}))\sigma'(\mathbf{x}_L) \\ \delta_k &= \mathbf{w}_{k+1}(\sigma'(\mathbf{x}_{k+1}) \circ \delta_{k+1}) \\ \frac{\partial l(y, F(\mathbf{x}))}{\partial \mathbf{w}_k} &= \delta_k \circ \mathbf{x}_k \end{align}
Forward pass: compute and store the intermediate outputs $\mathbf{x}_k$
Backward pass:
Compute $l'(y, F(\mathbf{x}))$
$\forall k$, compute $\delta_k$
Update $\mathbf{w}_k$ using $l'(y, F(\mathbf{x})), \delta_k, \mathbf{x}_k$
In practice, ML libraries (PyTorch, TensorFlow, JAX, etc.) provide automatic differentiation for a set of basic operators that you compose to build $F$
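As a sanity check of the recursion above (a sketch on an assumed depth-3 chain of scalar tanh neurons; here $\sigma'$ is evaluated at the pre-activation $w_k \mathbf{x}_{k-1} + \theta_k$), the hand-coded backward recursion matches jax.grad:
ws_c = jnp.array([0.5, -1.2, 0.8])
ths_c = jnp.array([0.1, -0.3, 0.2])
x0 = 0.7

def chain(ws, ths, x0):
    x = x0
    for w, th in zip(ws, ths):
        x = jnp.tanh(w * x + th)
    return x

# forward pass: store pre-activations z_k and outputs x_k
xs, zs = [x0], []
for w, th in zip(ws_c, ths_c):
    zs.append(w * xs[-1] + th)
    xs.append(jnp.tanh(zs[-1]))

# backward recursion for the derivative of the output w.r.t. the first weight
g = (1 - jnp.tanh(zs[0])**2) * x0                  # base case: sigma'(z_1) * x_0
for k in range(1, len(ws_c)):
    g = (1 - jnp.tanh(zs[k])**2) * ws_c[k] * g     # sigma'(z_{k+1}) * w_{k+1} * previous

print(g, jax.grad(lambda w: chain(ws_c.at[0].set(w), ths_c, x0))(ws_c[0]))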
def l1(w1, b1, x):
    # hidden layer: ReLU of an affine map
    return jax.nn.relu(jnp.matmul(x, w1) + b1)

def func(w1, w2, b1, b2, x):
    # two-layer MLP: ReLU hidden layer followed by a sigmoid output
    x1 = l1(w1, b1, x)
    x2 = jax.nn.sigmoid(jnp.matmul(x1, w2) + b2)
    return x2

def xe(w1, w2, b1, b2, x, y):
    # binary cross-entropy loss of the MLP
    fx = func(w1, w2, b1, b2, x)
    return (-y*jnp.log(fx)-(1-y)*jnp.log(1-fx)).mean()

@jax.jit
def update(w1, w2, b1, b2, x, y, eta=0.1):
    # one gradient step on all parameters
    dw1, dw2, db1, db2 = jax.grad(xe, argnums=(0,1,2,3))(w1, w2, b1, b2, x, y)
    return w1 - eta*dw1, w2 - eta*dw2, b1 - eta*db1, b2 - eta*db2
w1 = np.random.randn(2, 4)/10
w2 = np.random.randn(4)/10
b1 = np.random.randn(4)/10
b2 = np.random.randn(1)/10
loss = []
for t in range(2500):
    ind = np.random.choice(len(Xor), size=64, replace=False)
    loss.append(xe(w1, w2, b1, b2, Xor[ind, :], yor[ind]))
    w1, w2, b1, b2 = update(w1, w2, b1, b2, Xor[ind, :], yor[ind], eta=0.1)
plt.plot(loss)
t = 50; tx = jnp.linspace(-1.5, 1.5, t)
xv, yv = jnp.meshgrid(tx, tx, sparse=True); xv = xv.squeeze(); yv = yv.squeeze()
xx = jnp.array([[xx, yy] for yy in yv for xx in xv])
y_pred = jnp.array(func(w1, w2, b1, b2, xx)).reshape(t, t)
cmap = plt.get_cmap('PiYG')
levels=jnp.linspace(-1.5, .5, 10)
norm = matplotlib.colors.BoundaryNorm(levels, ncolors=cmap.N, clip=True)
plt.pcolormesh(xv, yv, -y_pred, shading='nearest', norm=norm);
plt.scatter(Xor[:,0], Xor[:,1], c=yor)
t = 50; tx = jnp.linspace(-1.5, 1.5, t);
xv, yv = jnp.meshgrid(tx, tx, sparse=True); xv = xv.squeeze(); yv = yv.squeeze()
xx = jnp.array([[xx, yy] for yy in yv for xx in xv])
y_pred = jnp.array(l1(w1, b1, xx)).reshape(t, t, 4)
cmap = plt.get_cmap('PiYG')
levels=jnp.linspace(-4., 2., 100)
norm = matplotlib.colors.BoundaryNorm(levels, ncolors=cmap.N, clip=True)
plt.pcolormesh(xv, yv, -y_pred[:,:,0], shading='nearest', norm=norm);
plt.scatter(Xor[:,0], Xor[:,1], c=yor)
levels=jnp.linspace(-4., 2., 100)
norm = matplotlib.colors.BoundaryNorm(levels, ncolors=cmap.N, clip=True)
plt.pcolormesh(xv, yv, -y_pred[:,:,1], shading='nearest', norm=norm);
plt.scatter(Xor[:,0], Xor[:,1], c=yor)
levels=jnp.linspace(-4., 2., 100)
norm = matplotlib.colors.BoundaryNorm(levels, ncolors=cmap.N, clip=True)
plt.pcolormesh(xv, yv, -y_pred[:,:,2], shading='nearest', norm=norm);
plt.scatter(Xor[:,0], Xor[:,1], c=yor)
levels=jnp.linspace(-4., 2., 100)
norm = matplotlib.colors.BoundaryNorm(levels, ncolors=cmap.N, clip=True)
plt.pcolormesh(xv, yv, -y_pred[:,:,3], shading='nearest', norm=norm);
plt.scatter(Xor[:,0], Xor[:,1], c=yor)
Classification
Independent classes: binary cross-entropy with sigmoid activation
Exclusive classes: categorical cross-entropy with softmax activation (see the sketch after this list)
Regression
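For the exclusive-class case above, a minimal sketch of the categorical cross-entropy with a softmax output (illustrative logits and one-hot targets, not data from this notebook):
def softmax_xe(logits, y_onehot):
    # log_softmax is the numerically stable form of log(softmax(.))
    return -(y_onehot * jax.nn.log_softmax(logits)).sum(axis=-1).mean()

logits = jnp.array([[2.0, -1.0, 0.3],
                    [0.1,  1.5, -0.2]])
y_onehot = jnp.array([[1., 0., 0.],
                      [0., 1., 0.]])
softmax_xe(logits, y_onehot)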
Consider a feed-forward neural network as a directed acyclic graph $(V, E)$
Theorem: $\forall n$, there exists a graph $(V, E)$ of depth 2 such that $\mathcal{H}(V, E, \text{sign})$ contains all functions from $\{\pm 1\}^n$ to $\{\pm 1\}$.
A neural network with a single hidden layer and the sign activation function can represent any binary function over binary vectors
Theorem (Cybenko 1989): $\forall n$, let $s(n)$ be the minimal integer such that there exists a graph $(V, E)$ with $\vert V \vert = s(n)$ such that the hypothesis class $\mathcal{H}(V, E, \text{sign})$ contains all functions from $\{0, 1\}^n$ to $\{0, 1\}$. Then, $s(n)$ is exponential in $n$. Similar results hold for the sigmoid activation.
MLPs with a single hidden layer can approximate any function, but may require an exponential number of neurons.
Theorem: The VC dimension of $\mathcal{H}(V, E, \text{sign})$ is $\mathcal{O}(|E|\log |E|)$
The capacity of neural networks is determined more by their connectivity than by their number of neurons. Deeper networks have higher capacity.
def train_nn(n):
    # train a width-n MLP with full-batch gradient descent and record the loss
    w1 = np.random.randn(2, n)/(jnp.sqrt(n))
    w2 = np.random.randn(n)/(jnp.sqrt(n))
    b1 = np.random.randn(n)/(jnp.sqrt(n))
    b2 = np.random.randn(1)/(jnp.sqrt(n))
    loss = []
    for t in range(1000):
        loss.append(xe(w1, w2, b1, b2, Xor, yor))
        w1, w2, b1, b2 = update(w1, w2, b1, b2, Xor, yor, eta=0.1)
    return loss

for n in range(1, 8):
    plt.plot(train_nn(2**n), label='n={}'.format(2**n))
plt.legend()
Larger NN, easier to train?
def train_nn2(n):
    # train a width-n MLP and track the relative change of the hidden weights w1
    w1 = np.random.randn(2, n)/(jnp.sqrt(n))
    w2 = np.random.randn(n)/(jnp.sqrt(n))
    b1 = np.random.randn(n)/(jnp.sqrt(n))
    b2 = np.random.randn(1)/(jnp.sqrt(n))
    w = w1
    w_change = []
    for t in range(1000):
        ind = np.random.choice(len(Xor), size=128, replace=False)
        w1, w2, b1, b2 = update(w1, w2, b1, b2, Xor[ind, :], yor[ind], eta=0.1)
        w_change.append(jnp.linalg.norm(w1 - w)/jnp.linalg.norm(w))
    return w_change

for n in range(1, 8):
    plt.plot(train_nn2(2**n), label='n={}'.format(2**n))
plt.legend()
The weights of larger networks tend to change (relatively) less than those of smaller networks?
Consider the loss as a function of $\mathbf{w}$ instead of $\mathbf{x}$
$$ l_\mathbf{x}(\mathbf{w}) = l(y, F_\mathbf{w}(\mathbf{x})) $$
Approximate it with its first-order Taylor expansion around the initialization $\mathbf{w}_0$
$$l_\mathbf{x}(\mathbf{w}) \approx l_\mathbf{x}(\mathbf{w}_0) + \nabla l_\mathbf{x}(\mathbf{w}_0)^\top (\mathbf{w} - \mathbf{w}_0) $$
This corresponds to a linear model with the non-linear mapping
$$ \phi(\mathbf{x}) = \nabla l_\mathbf{x}(\mathbf{w}_0) $$
defining a kernel (the neural tangent kernel)
This approximation tends to be better when the width grows
Some weights are highly influential ( $| \sum_i \phi(\mathbf{x}_i)[j] | \gg 0$ )
Blind spots in the kernel ( $\sum_i \phi(\mathbf{x}_i)[j] \approx 0$ ) mean that some components are barely updated
Very large models are easier to train because only a few neurons need to be changed to (over?)fit the training data
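A small sketch of this linearized view (illustrative; it restricts $\phi(\mathbf{x}) = \nabla l_\mathbf{x}(\mathbf{w}_0)$ to the hidden-layer weights w1 and reuses the xe, Xor and yor objects defined above): components $j$ with $\vert\sum_i \phi(\mathbf{x}_i)[j]\vert$ close to 0 are barely moved by a full-batch gradient step.
n0 = 256
w1_0 = np.random.randn(2, n0)/jnp.sqrt(n0)
w2_0 = np.random.randn(n0)/jnp.sqrt(n0)
b1_0 = np.random.randn(n0)/jnp.sqrt(n0)
b2_0 = np.random.randn(1)/jnp.sqrt(n0)

# per-example gradient of the loss w.r.t. w1 at initialization: phi(x_i)
phi = jax.vmap(jax.grad(xe, argnums=0),
               in_axes=(None, None, None, None, 0, 0))(
    w1_0, w2_0, b1_0, b2_0, Xor[:, None, :], yor[:, None])

# distribution of |sum_i phi(x_i)[j]|: values near 0 are the "blind spots"
plt.hist(jnp.abs(phi.sum(axis=0)).ravel(), bins=50);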
The Lottery Ticket Hypothesis: "A randomly-initialized, dense neural network contains a subnetwork that is initialized such that, when trained in isolation, it can match the test accuracy of the original network after training for at most the same number of iterations."
Pruning a randomly initialized network can yield a good predictor (verified empirically).
In practice, pruning is difficult; it is usually better to train the full model.
Do larger networks systematically overfit?
Xor_tr = jnp.sign(np.random.randn(200, 2))
yor_tr = 1.*((Xor_tr[:,0]*Xor_tr[:,1])>0)
Xor_tr += 0.7*np.random.randn(200,2)
Xor_te = jnp.sign(np.random.randn(200, 2))
yor_te = 1.*((Xor_te[:,0]*Xor_te[:,1])>0)
Xor_te += 0.7*np.random.randn(200,2)
plt.scatter(Xor_tr[:,0], Xor_tr[:,1], c=yor_tr)
n = 10000
w1 = np.random.randn(2, n)/(jnp.sqrt(n))
w2 = np.random.randn(n)/(jnp.sqrt(n))
b1 = np.random.randn(n)/(jnp.sqrt(n))
b2 = np.random.randn(1)/(jnp.sqrt(n))
loss = 0
for t in range(1000):
    loss = xe(w1, w2, b1, b2, Xor_te, yor_te)
    w1, w2, b1, b2 = update(w1, w2, b1, b2, Xor_tr, yor_tr, eta=0.1)
t = 50; tx = jnp.linspace(-3.5, 3.5, t)
xv, yv = jnp.meshgrid(tx, tx, sparse=True); xv = xv.squeeze(); yv = yv.squeeze()
xx = jnp.array([[xx, yy] for yy in yv for xx in xv])
y_pred = jnp.array(func(w1, w2, b1, b2, xx)).reshape(t, t)
cmap = plt.get_cmap('PiYG')
levels=jnp.linspace(-1.5, .5, 100)
norm = matplotlib.colors.BoundaryNorm(levels, ncolors=cmap.N, clip=True)
plt.pcolormesh(xv, yv, -y_pred, shading='gouraud', norm=norm);
plt.scatter(Xor_tr[:,0], Xor_tr[:,1], c=yor_tr, marker='s')
plt.scatter(Xor_te[:,0], Xor_te[:,1], c=yor_te)
Double descent phenomenon
Overfitting solutions exist, but they are difficult to reach with SGD from a well-initialized network
Training
Backpropagation: automatic differentiation
Stochastic gradient descent with mini-batches
Vanishing/exploding gradients (saturating non-linearities)
Non-convex optimization problem, but very effective in practice
Capacity
Universal approximation theorem
Capacity depends on connectivity (rather than size)
NTK
Well initialized large networks tend to behave linearly during optimization
Some neurons will not be updated
$\exists$ a good subnetwork in the random initialization (lottery ticket hypothesis)
Double Descent
Extremely large models can avoid overfitting
Double descent phenomenon (overfitting difficult to reach with SGD from good init)