Natural neuron¶
Image('Neuron.png', width=400)
Artificial Neuron (McCulloch & Pitts)¶
f(x)=φ(⟨w,x⟩+θ)
Image('a_neuron.png', width=400)
Activation functions¶
Linear: φ(x)=x
Rectified Linear: φ(x)=max(0,x)
Sigmoid: φ(x) = 1/(1 + e^−x)
Hyperbolic tangent: φ(x) = (e^x − e^−x)/(e^x + e^−x)
x = jnp.linspace(-3, 3, 100)
plt.plot(x, x, label='linear')
plt.plot(x, jax.nn.relu(x), label='relu')
plt.plot(x, jax.nn.sigmoid(x), label='sigmoid')
plt.plot(x, jnp.tanh(x), label='tanh')
plt.legend()
Training¶
Stochastic gradient descent on mini-batches drawn from the training set A={(x,y)}, with loss function l
- Draw a random mini-batch B={(xi,yi)}⊂A
- Compute the gradient estimator δ = (1/|B|) ∑_{(xi,yi)∈B} ∂l(yi,f(xi))/∂w
- Apply a gradient descent step w ← w − ηδ
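A minimal sketch of this loop in JAX, assuming a generic differentiable loss and a linear model; the names (loss, sgd_step) and the toy data are illustrative, not part of the lecture code:
import jax
import jax.numpy as jnp
import numpy as np

def loss(w, x, y):
    # illustrative squared-error loss for a linear model f(x) = <w, x>
    return ((jnp.matmul(x, w) - y) ** 2).mean()

@jax.jit
def sgd_step(w, x_batch, y_batch, eta=0.01):
    # gradient estimator delta over the mini-batch, then one descent step w <- w - eta*delta
    delta = jax.grad(loss)(w, x_batch, y_batch)
    return w - eta * delta

# toy data: y = <w_true, x> + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1., -2., 0.5, 0., 3.]) + 0.1 * rng.normal(size=1000)

w = jnp.zeros(5)
for t in range(500):
    idx = rng.choice(len(X), size=64, replace=False)   # draw the mini-batch B ⊂ A
    w = sgd_step(w, X[idx], y[idx])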
Small example¶
# Load the dataset
data = np.load('mnist.npz')
X = data['X_train']
y = data['y_train']
plt.imshow(X[0,:].reshape(28,28))
print(y[0])
5
Binary cross-entropy loss¶
With a sigmoid activation, f(x) can be interpreted as a probability
Minimize the binary cross-entropy (which pushes the output towards either 0 or 1): L = −y log f(x) − (1−y) log(1−f(x))
X = data['X_train_bin']
y = data['y_train_bin']
def func(w, b, x):
return jax.nn.sigmoid(jnp.matmul(x, w) + b)
def xe(w, b, x, y):
fx = func(w, b, x)
return (-y*jnp.log(fx)-(1-y)*jnp.log(1-fx)).mean()
@jax.jit
def update(w, b, x, y):
dw, db = jax.grad(xe, argnums=(0,1))(w, b, x, y)
return w - 0.01*dw, b - 0.01*db
w = np.random.randn(784)/784
b = 0.
loss = []
for t in range(500):
loss.append(xe(w, b, X, y))
w, b = update(w, b, X, y)
plt.plot(loss)
def accuracy(y_pred, y_true):
return (y_true==y_pred).mean()
y_pred = (func(w, b, X)>0.5)*1.
print('accuracy: {}'.format(accuracy(y_pred, y)))
accuracy: 1.0
Non-linearly separable problems¶
What about XOR?
Xor = jnp.sign(np.random.randn(200, 2)) + 0.1*np.random.randn(200,2)
yor = 1.*((Xor[:,0]*Xor[:,1])>0)
plt.scatter(Xor[:,0], Xor[:,1], c=yor)
Multiple layers¶
Image('mlp.png', width=400)
Multi-Layer Perceptron¶
Define layer i as the function mapping to R^d obtained by stacking d neurons
fi(x) = [σ(Wij⊤x + θij)]_{j≤d}
With
- Wij,θij the weights and bias of neuron j at layer i
- σ the activation function
Create a network by composing L layers
F(x)=fL∘⋯∘f1(x)
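As a sketch, the composition F = fL ∘ ⋯ ∘ f1 can be written directly in JAX; the layer parameterization and sizes below are illustrative:
import jax
import jax.numpy as jnp

def layer(params, x):
    # one layer: d stacked neurons sharing the activation σ (here tanh)
    W, b = params
    return jnp.tanh(jnp.matmul(x, W) + b)

def mlp(all_params, x):
    # F(x) = f_L ∘ ... ∘ f_1(x): apply the layers in order
    for params in all_params:
        x = layer(params, x)
    return x

key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)
params = [(jax.random.normal(k1, (3, 4)), jnp.zeros(4)),    # f1: R^3 -> R^4
          (jax.random.normal(k2, (4, 1)), jnp.zeros(1))]    # f2: R^4 -> R^1
out = mlp(params, jnp.ones((5, 3)))   # batch of 5 inputs in R^3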
XOR - Exercise¶
Find all the weights of a network with 1 hidden layer of width 2
Use a ReLU activation for the hidden layer and a sign activation for the output layer
Image('2_xor.png', width=400)
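One possible answer, written as a quick numerical check (try the exercise first; these particular weight values are just one valid choice, not the official solution). Recall that in the data above the positive class is x1·x2 > 0:
import jax.numpy as jnp

# hidden layer: h1 = relu(x1 + x2), h2 = relu(-x1 - x2)
W1 = jnp.array([[1., -1.],
                [1., -1.]])
b1 = jnp.array([0., 0.])
# output: sign(h1 + h2 - 1) is +1 iff |x1 + x2| >= 2, i.e. x1 and x2 have the same sign
w2 = jnp.array([1., 1.])
b2 = -1.

def predict(x):
    h = jnp.maximum(jnp.matmul(x, W1) + b1, 0.)   # ReLU hidden layer
    return jnp.sign(jnp.matmul(h, w2) + b2)       # sign output layer

corners = jnp.array([[1., 1.], [1., -1.], [-1., 1.], [-1., -1.]])
print(predict(corners))   # expect [ 1. -1. -1.  1.]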
Monte-Carlo estimation of the gradient with the mini-batch strategy
E_{(x,y)}[∂l(y,F(x))/∂wi] ≈ (1/N) ∑_n ∂l(yn,F(xn))/∂wi
Backpropagation¶
Denote xk=fk∘⋯∘f1(x) the k-th intermediate output
Denote gk(xk)=fL∘⋯∘fk+1(xk) the output computed from xk
Remark: ∀k, F(x) = gk(σ(wk⊤xk−1 + θk))
Chain rule (Leibniz notation)
∂y/∂x = (∂y/∂z)(∂z/∂x)
Backpropagation¶
Single neuron chain
Image('neural_chain.png', width=600)
∂l(y,F(x))/∂wk = (∂l(y,F(x))/∂F(x)) · (∂F(x)/∂wk)
= l′(y,F(x)) · ∂σ(wL xL−1 + θL)/∂wk = l′(y,F(x)) σ′(xL) · ∂(wL xL−1 + θL)/∂wk = l′(y,F(x)) σ′(xL) wL · ∂xL−1/∂wk
∂xL−1/∂wk = ∂σ(wL−1 xL−2 + θL−1)/∂wk = σ′(xL−1) wL−1 · ∂xL−2/∂wk
Recursion: ∂xk+t+1/∂wk = σ′(xk+t+1) wk+t+1 · ∂xk+t/∂wk, with base case ∂xk/∂wk = σ′(xk) xk−1
∂l(y,F(x))/∂wk = l′(y,F(x)) [∏_{t=k+1}^{L} σ′(xt) wt] σ′(xk) xk−1
Note:
if wt≪1: vanishing gradients
if wt≫1: exploding gradients
if σ′(⋅)≈0: vanishing gradients (saturating activations such as sigmoid and tanh; ReLU does not saturate for positive inputs)
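A quick numerical illustration of the vanishing-gradient effect (a sketch, not part of the lecture code): differentiate a single-neuron chain of depth 30 with jax.grad, with all weights set to 1 so that only σ′ drives the result, and compare sigmoid against ReLU.
import jax
import jax.numpy as jnp

def chain(ws, x0, activation):
    # single-neuron chain x_k = σ(w_k * x_{k-1}), one scalar weight per layer
    x = x0
    for i in range(ws.shape[0]):
        x = activation(ws[i] * x)
    return x

ws = jnp.ones(30)

# gradient of the output with respect to the first weight: it traverses the whole chain
g_sigmoid = jax.grad(lambda w: chain(w, 1.0, jax.nn.sigmoid))(ws)[0]
g_relu = jax.grad(lambda w: chain(w, 1.0, jax.nn.relu))(ws)[0]
print(g_sigmoid, g_relu)   # sigmoid chain: roughly 1e-20 (vanishing); relu chain: 1.0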
Fully connected network¶
Recursion:
δL = l′(y,F(x)) σ′(xL)
δk = σ′(xk) ∘ (wk+1⊤ δk+1)
∂l(y,F(x))/∂wk = δk xk−1⊤
Algorithm¶
Forward pass
- Compute and store ∀k,xk
Backward pass
- Compute l′(y,F(x))
- ∀k, compute δk
- Update wk using δk and xk−1
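A minimal sketch of this algorithm for a small fully connected network with tanh activations and a squared-error loss, checked against jax.grad; the sizes and names are illustrative:
import jax
import jax.numpy as jnp

def forward(Ws, bs, x):
    # forward pass: compute and store every intermediate output x_k
    xs = [x]
    for W, b in zip(Ws, bs):
        xs.append(jnp.tanh(jnp.matmul(W, xs[-1]) + b))
    return xs

def backward(Ws, bs, xs, y):
    # backward pass: δ_L = l'(y,F(x)) ∘ σ'(x_L), then δ_k = σ'(x_k) ∘ (W_{k+1}^T δ_{k+1})
    grads = [None] * len(Ws)
    delta = (xs[-1] - y) * (1 - xs[-1] ** 2)      # l = 0.5 ||F(x) - y||^2, tanh' = 1 - tanh^2
    grads[-1] = jnp.outer(delta, xs[-2])
    for k in range(len(Ws) - 2, -1, -1):
        delta = (1 - xs[k + 1] ** 2) * (Ws[k + 1].T @ delta)
        grads[k] = jnp.outer(delta, xs[k])        # ∂l/∂W_k = δ_k x_{k-1}^T
    return grads

key = jax.random.PRNGKey(0)
keys = jax.random.split(key, 4)
Ws = [jax.random.normal(keys[0], (4, 3)), jax.random.normal(keys[1], (2, 4))]
bs = [jnp.zeros(4), jnp.zeros(2)]
x, y = jax.random.normal(keys[2], (3,)), jax.random.normal(keys[3], (2,))

manual = backward(Ws, bs, forward(Ws, bs, x), y)
auto = jax.grad(lambda params: 0.5 * jnp.sum((forward(params, bs, x)[-1] - y) ** 2))(Ws)
print([bool(jnp.allclose(m, a, atol=1e-5)) for m, a in zip(manual, auto)])   # expect [True, True]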
In practice, ML libraries (PyTorch, TensorFlow, JAX, etc.) provide automatic differentiation for a set of basic operators that you compose to build F
XOR with MLP¶
def l1(w1, b1, x):
return jax.nn.relu(jnp.matmul(x, w1) + b1)
def func(w1, w2, b1, b2, x):
x1 = l1(w1, b1, x)
x2 = jax.nn.sigmoid(jnp.matmul(x1, w2) + b2)
return x2
def xe(w1, w2, b1, b2, x, y):
fx = func(w1, w2, b1, b2, x)
return (-y*jnp.log(fx)-(1-y)*jnp.log(1-fx)).mean()
@jax.jit
def update(w1, w2, b1, b2, x, y, eta=0.1):
dw1, dw2, db1, db2 = jax.grad(xe, argnums=(0,1,2,3))(w1, w2, b1, b2, x, y)
return w1 - eta*dw1, w2 - eta*dw2, b1 - eta*db1, b2 - eta*db2
w1 = np.random.randn(2, 4)/10
w2 = np.random.randn(4)/10
b1 = np.random.randn(4)/10
b2 = np.random.randn(1)/10
loss = []
for t in range(2500):
ind = np.random.choice(len(Xor), size=64, replace=False)
loss.append(xe(w1, w2, b1, b2, Xor[ind, :], yor[ind]))
w1, w2, b1, b2 = update(w1, w2, b1, b2, Xor[ind, :], yor[ind], eta=0.1)
plt.plot(loss)
t = 50; tx = jnp.linspace(-1.5, 1.5, t)
xv, yv = jnp.meshgrid(tx, tx, sparse=True); xv = xv.squeeze(); yv = yv.squeeze()
xx = jnp.array([[xx, yy] for yy in yv for xx in xv])
y_pred = jnp.array(func(w1, w2, b1, b2, xx)).reshape(t, t)
cmap = plt.get_cmap('PiYG')
levels=jnp.linspace(-1.5, .5, 10)
norm = matplotlib.colors.BoundaryNorm(levels, ncolors=cmap.N, clip=True)
plt.pcolormesh(xv, yv, -y_pred, shading='nearest', norm=norm);
plt.scatter(Xor[:,0], Xor[:,1], c=yor)
Intermediate layers¶
t = 50; tx = jnp.linspace(-1.5, 1.5, t);
xv, yv = jnp.meshgrid(tx, tx, sparse=True); xv = xv.squeeze(); yv = yv.squeeze()
xx = jnp.array([[xx, yy] for yy in yv for xx in xv])
y_pred = jnp.array(l1(w1, b1, xx)).reshape(t, t, 4)
cmap = plt.get_cmap('PiYG')
levels=jnp.linspace(-4., 2., 100)
norm = matplotlib.colors.BoundaryNorm(levels, ncolors=cmap.N, clip=True)
plt.pcolormesh(xv, yv, -y_pred[:,:,0], shading='nearest', norm=norm);
plt.scatter(Xor[:,0], Xor[:,1], c=yor)
levels=jnp.linspace(-4., 2., 100)
norm = matplotlib.colors.BoundaryNorm(levels, ncolors=cmap.N, clip=True)
plt.pcolormesh(xv, yv, -y_pred[:,:,1], shading='nearest', norm=norm);
plt.scatter(Xor[:,0], Xor[:,1], c=yor)
levels=jnp.linspace(-4., 2., 100)
norm = matplotlib.colors.BoundaryNorm(levels, ncolors=cmap.N, clip=True)
plt.pcolormesh(xv, yv, -y_pred[:,:,2], shading='nearest', norm=norm);
plt.scatter(Xor[:,0], Xor[:,1], c=yor)
levels=jnp.linspace(-4., 2., 100)
norm = matplotlib.colors.BoundaryNorm(levels, ncolors=cmap.N, clip=True)
plt.pcolormesh(xv, yv, -y_pred[:,:,3], shading='nearest', norm=norm);
plt.scatter(Xor[:,0], Xor[:,1], c=yor)
Neural network losses¶
Classification
- Independent classes: binary cross-entropy with sigmoid activation
- Exclusive classes: categorical cross-entropy with softmax activation (see the sketch below)
σ(x)i = e^xi / ∑j e^xj,   l(y,F(x)) = −∑i yi log F(x)[i]
Regression
- Usual ℓ2, ℓ1 losses
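A small sketch of the categorical cross-entropy with a softmax output, using jax.nn.log_softmax for numerical stability (the values below are illustrative):
import jax
import jax.numpy as jnp

def categorical_xe(logits, y_onehot):
    # l(y, F(x)) = -sum_i y_i log softmax(logits)_i, averaged over the batch
    log_probs = jax.nn.log_softmax(logits, axis=-1)
    return -(y_onehot * log_probs).sum(axis=-1).mean()

logits = jnp.array([[2.0, 0.5, -1.0],
                    [0.1, 0.2, 3.0]])
y_onehot = jnp.array([[1., 0., 0.],
                      [0., 0., 1.]])
print(categorical_xe(logits, y_onehot))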
Neural network capacity¶
A feed-forward neural network can be viewed as a directed acyclic graph (V,E)
Theorem: ∀n, there exists a graph (V,E) of depth 2 such that H(V,E,sign) contains all functions from {±1}^n to {±1}.
A neural network with a single hidden layer and the sign activation function can represent any binary function over binary vectors
Proof¶
- Consider a network with a single hidden layer of 2^n neurons
- Let {ui}_{1≤i≤k} be the set of k input vectors that have label 1
- Remark that ∀i, ⟨ui,ui⟩ = n and ∀x,∀i, x≠ui ⇔ ⟨x,ui⟩ ≤ n−2 (at least 1 bit of difference)
- Set k neurons to hi(x) = sign(ui⊤x − n + 1); we have hi(x) = 1 iff x = ui, and −1 otherwise
- Set the output to F(x) = sign(∑i hi(x) + k − 1)
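A small numerical check of this construction (a sketch; the target labelling below is arbitrary):
import itertools
import numpy as np
import jax.numpy as jnp

n = 4
X = jnp.array(list(itertools.product([-1., 1.], repeat=n)))      # all 2^n inputs in {±1}^n
target = jnp.array(np.random.default_rng(0).choice([-1., 1.], size=len(X)))   # arbitrary labels

U = X[target == 1.]                        # the k inputs labelled +1
k = len(U)
hidden = jnp.sign(X @ U.T - n + 1)         # h_i(x) = sign(<u_i, x> - n + 1): +1 iff x == u_i
output = jnp.sign(hidden.sum(axis=1) + k - 1)
print(bool(jnp.all(output == target)))     # True: the depth-2 sign network realizes the target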
Neural network capacity¶
Theorem (see (Hornik et al, 1989) and (Cybenko, 1989)): ∀n, let s(n) be the minimal integer such that there exists a graph (V,E) with |V|=s(n) such that the hypothesis class H(V,E,sign) contains all the functions from {0,1}^n to {0,1}. Then s(n) is exponential in n. Similar results hold for the sigmoid activation function.
MLPs with a single hidden layer can approximate any function, but may require an exponential number of neurons.
VC dimension¶
Theorem: the VC dimension of H(V,E,sign) is O(|E| log |E|)
The capacity of a neural network is governed more by its connectivity than by its number of neurons. Deeper networks have higher capacity.
Effect of width¶
def train_nn(n):
w1 = np.random.randn(2, n)/(jnp.sqrt(n))
w2 = np.random.randn(n)/(jnp.sqrt(n))
b1 = np.random.randn(n)/(jnp.sqrt(n))
b2 = np.random.randn(1)/(jnp.sqrt(n))
loss = []
for t in range(1000):
loss.append(xe(w1, w2, b1, b2, Xor, yor))
w1, w2, b1, b2 = update(w1, w2, b1, b2, Xor, yor, eta=0.1)
return loss
for n in range(1, 8):
plt.plot(train_nn(2**n), label='n={}'.format(2**n))
plt.legend()
Larger NN, easier to train?
def train_nn2(n):
w1 = np.random.randn(2, n)/(jnp.sqrt(n))
w2 = np.random.randn(n)/(jnp.sqrt(n))
b1 = np.random.randn(n)/(jnp.sqrt(n))
b2 = np.random.randn(1)/(jnp.sqrt(n))
w = w1
w_change = []
for t in range(1000):
ind = np.random.choice(len(Xor), size=128, replace=False)
w1, w2, b1, b2 = update(w1, w2, b1, b2, Xor[ind, :], yor[ind], eta=0.1)
w_change.append(jnp.linalg.norm(w1 - w)/jnp.linalg.norm(w))
return w_change
for n in range(1, 8):
plt.plot(train_nn2(2**n), label='n={}'.format(2**n))
plt.legend()
The weights of larger networks change relatively less during training than those of smaller networks?
Neural Tangent Kernel (Jacot et al, 2018)¶
Consider the loss function as a function of w instead of x
lx(w)=l(y,Fw(x))
Approximate it using a first-order Taylor expansion
lx(w) ≈ lx(w0) + ∇lx(w0)⊤(w−w0)
This corresponds to a linear model with the non-linear mapping
ϕ(x) = ∇lx(w0)
which defines a kernel K(x,x′) = ⟨ϕ(x), ϕ(x′)⟩
This approximation tends to become better as the width grows
Some weights are highly influential ( |∑i ϕ(xi)[j]| ≫ 0 )
Blind spots in the kernel ( ∑i ϕ(xi)[j] ≈ 0 ) mean that some components are barely updated
Very large models are easier to train because only a few neurons need to change to (over?)fit the training data
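A sketch of the induced feature map and kernel in JAX: here the gradient is taken of the network output Fw(x) at initialization (which differs from ∇lx(w0) only by the scalar factor l′), and the two-layer model below is illustrative:
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def f(params, x):
    # small two-layer network with scalar output
    W1, b1, w2, b2 = params
    h = jax.nn.relu(jnp.matmul(x, W1) + b1)
    return jnp.dot(h, w2) + b2

n = 256
k1, k2 = jax.random.split(jax.random.PRNGKey(0))
params0 = (jax.random.normal(k1, (2, n)) / jnp.sqrt(n), jnp.zeros(n),
           jax.random.normal(k2, (n,)) / jnp.sqrt(n), jnp.array(0.))

def phi(x):
    # feature map: gradient of the output with respect to the initial weights, flattened
    return ravel_pytree(jax.grad(f)(params0, x))[0]

x1, x2 = jnp.array([1., -1.]), jnp.array([1., 1.])
print(jnp.dot(phi(x1), phi(x2)))   # kernel value K(x1, x2) = <phi(x1), phi(x2)>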
Lottery ticket hypothesis¶
The Lottery Ticket Hypothesis (Malach et al, 2020)
A randomly-initialized, dense neural network contains a subnetwork that is initialized such that—when trained in isolation—it can match the test accuracy of the original network after training for at most the same number of iterations.
Pruning a randomly initialized network can yield a good predictor (verified empirically).
In practice, finding such a subnetwork by pruning is difficult; it is usually better to train the full model.
Double Descent (Belkin et al, 2019)¶
Do larger networks systematically overfit?
Xor_tr = jnp.sign(np.random.randn(200, 2))
yor_tr = 1.*((Xor_tr[:,0]*Xor_tr[:,1])>0)
Xor_tr += 0.7*np.random.randn(200,2)
Xor_te = jnp.sign(np.random.randn(200, 2))
yor_te = 1.*((Xor_te[:,0]*Xor_te[:,1])>0)
Xor_te += 0.7*np.random.randn(200,2)
plt.scatter(Xor_tr[:,0], Xor_tr[:,1], c=yor_tr)
n = 10000
w1 = np.random.randn(2, n)/(jnp.sqrt(n))
w2 = np.random.randn(n)/(jnp.sqrt(n))
b1 = np.random.randn(n)/(jnp.sqrt(n))
b2 = np.random.randn(1)/(jnp.sqrt(n))
loss = 0
for t in range(1000):
loss = xe(w1, w2, b1, b2, Xor_te, yor_te)
w1, w2, b1, b2 = update(w1, w2, b1, b2, Xor_tr, yor_tr, eta=0.1)
t = 50; tx = jnp.linspace(-3.5, 3.5, t)
xv, yv = jnp.meshgrid(tx, tx, sparse=True); xv = xv.squeeze(); yv = yv.squeeze()
xx = jnp.array([[xx, yy] for yy in yv for xx in xv])
y_pred = jnp.array(func(w1, w2, b1, b2, xx)).reshape(t, t)
cmap = plt.get_cmap('PiYG')
levels=jnp.linspace(-1.5, .5, 100)
norm = matplotlib.colors.BoundaryNorm(levels, ncolors=cmap.N, clip=True)
plt.pcolormesh(xv, yv, -y_pred, shading='nearest', norm=norm);
plt.scatter(Xor_tr[:,0], Xor_tr[:,1], c=yor_tr, marker='s')
plt.scatter(Xor_te[:,0], Xor_te[:,1], c=yor_te)
Double descent phenomenon
Overfitting solutions exist, but they are difficult to reach with SGD from a well-initialized network
Metric Learning¶
So far, we have used either the natural distance on X or the one induced by the choice of a kernel
Can we just learn the distance?
Train a neural network ϕ(⋅) such that
- Related samples have short distances
- Unrelated samples have larger distances
Define positive and negative sets P(x), N(x) for each example x
min_ϕ ∑x [ ∑_{xp∈P(x)} ‖ϕ(x)−ϕ(xp)‖² − λ ∑_{xn∈N(x)} ‖ϕ(x)−ϕ(xn)‖² ]
In practice, we do not want to push negative examples to an infinite distance
min_ϕ ∑x [ ∑_{xp∈P(x)} ‖ϕ(x)−ϕ(xp)‖² + λ ∑_{xn∈N(x)} max(0, β − ‖ϕ(x)−ϕ(xn)‖²) ]
Push negative examples away until their distance exceeds the margin β
A similar hinge can be used for the positive set, with margin α:
max(0, ‖ϕ(x)−ϕ(xp)‖² − α)
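A sketch of this objective in JAX; the embedding network, margins and toy pairs below are illustrative:
import jax
import jax.numpy as jnp

def embed(w, x):
    # illustrative embedding φ: a single linear layer followed by tanh
    return jnp.tanh(jnp.matmul(x, w))

def contrastive_loss(w, x, x_pos, x_neg, alpha=0.1, beta=2.0, lam=1.0):
    # pull positives within margin alpha, push negatives beyond margin beta
    d_pos = jnp.sum((embed(w, x) - embed(w, x_pos)) ** 2, axis=-1)
    d_neg = jnp.sum((embed(w, x) - embed(w, x_neg)) ** 2, axis=-1)
    return (jnp.maximum(0., d_pos - alpha) + lam * jnp.maximum(0., beta - d_neg)).mean()

grad_fn = jax.jit(jax.grad(contrastive_loss))       # gradient w.r.t. the embedding weights
key = jax.random.PRNGKey(0)
w = 0.1 * jax.random.normal(key, (4, 2))
x = jax.random.normal(key, (8, 4))
dw = grad_fn(w, x, x + 0.01, -x)                    # toy positives/negatives just to exercise it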
Large Margin Nearest Neighbor (Weinberger et al, 2009)¶
Learn a distance that enhances a nearest neighbor classifier
- Define positives as the elements of the k nearest neighbors of x that have the same label: P(x) = {xc ∈ kNN(x) | yc = y}
- Define negatives as the elements of the k nearest neighbors of x that have a different label: N(x) = {xc ∈ kNN(x) | yc ≠ y}
Neural Networks, take home¶
- Artificial neuron: linear combination with pointwise non-linearity
- Layer: stacked neurons
- MLP: Layer composition
Training
- Backpropagation: automatic differentiation
- Stochastic gradient descent with mini-batches
- Vanishing/exploding gradients (saturating non-linearities)
- Non-convex optimization problem, but very effective in practice
Capacity
- Universal approximation theorem
- Capacity depends on connectivity (rather than size)
NTK
- Well-initialized large networks tend to behave linearly during optimization
- Some neurons are barely updated
- ∃ a good subnetwork in the random initialization (lottery ticket)
Double Descent
- Extremely large models can avoid overfitting
- Double descent phenomenon (overfitting is difficult to reach with SGD from a good initialization)