Unsupervised learning¶
What if we don't have labels?
- Training set $\mathcal{A} = \{\mathbf{x}\}$
What if we don't even have any idea about the structure of $\mathcal{Y}$?
- Density estimation (how likely is $\mathbf{x}$)
- Clustering (partition $\mathcal{X}$ into exclusive classes)
Density estimation¶
Given a training set of samples $\mathcal{A} = \{\mathbf{x}\}$, we want to model a probability density function $f(\mathbf{x})$ corresponding to the distribution $D$ such that $\mathbf{x} \sim D$
$f$ has to be a pdf:
$\forall \mathbf{x}\in\mathcal{X}, f(\mathbf{x})\geq 0$
$\int_\mathcal{X} f(\mathbf{x})d\mathbf{x} = 1$
Useful for anomaly detection, e.g., flag $\mathbf{x}$ as anomalous if $f(\mathbf{x}) \leq \theta$ (also called out-of-distribution detection)
- Simple pdf model, e.g., an isotropic Gaussian
$$ f(\mathbf{x}) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{\|\mathbf{x} - \mu \|^2}{2\sigma^2}} $$
The observations were drawn from $D$, so a good model $f$ should make them likely (maximum likelihood principle)
Maximizing the probability of $\mathcal{A}$ (assuming i.i.d. samples)
$$ \max_f Pr[\mathcal{A}] = \prod_{\mathbf{x}_i} f(\mathbf{x}_i) $$
Equivalently, maximizing the log probability
$$ \max_f \log Pr[\mathcal{A}] = \sum_{\mathbf{x}_i} \log f(\mathbf{x}_i) $$
For the Gaussian
$$ \sum_{\mathbf{x}_i} \log f(\mathbf{x}_i) = \sum_{\mathbf{x}_i} -\|\mathbf{x}_i - \mu \|^2 / 2\sigma^2 - n\log\sqrt{2\pi} \sigma$$
The maximization over $\mu$ is equivalent to
$$ \min_\mu \sum_{\mathbf{x}_i} \| \mathbf{x}_i - \mu \|^2 $$
Which is attained for
$$ \begin{aligned} \frac{\partial \sum_{\mathbf{x}_i} \|\mathbf{x}_i - \mu \|^2}{\partial \mu} &= 0\\ 2n\mu - 2\sum_{\mathbf{x}_i}\mathbf{x}_i &= 0 \\ \mu &= \frac{1}{n} \sum_{\mathbf{x}_i} \mathbf{x}_i \end{aligned} $$
For $\sigma$
$$\min_\sigma \sum_{\mathbf{x}_i} \|\mathbf{x}_i - \mu \|^2 / 2\sigma^2 + n\log\sqrt{2\pi} \sigma $$
$$ \begin{aligned} \frac{\partial \sum_{\mathbf{x}_i} \|\mathbf{x}_i - \mu \|^2 / 2\sigma^2 + n\log\sqrt{2\pi} \sigma }{\partial \sigma} &= 0 \\ - \frac{\sum_{\mathbf{x}_i} \|\mathbf{x}_i - \mu \|^2}{\sigma^3} + \frac{n}{\sigma} &=0 \\ \sigma^2 &= \frac{\sum_{\mathbf{x}_i} \|\mathbf{x}_i - \mu \|^2}{n} \end{aligned} $$
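As a quick illustration of these closed-form estimates, a minimal hedged sketch (the data, threshold and names below are placeholders, not part of the lecture code):
import numpy as np

X_demo = np.random.randn(200, 2) + 1.0                # any (n, d) sample

mu = X_demo.mean(axis=0)                              # mu = (1/n) sum_i x_i
sigma2 = ((X_demo - mu) ** 2).sum(axis=1).mean()      # sigma^2 = (1/n) sum_i ||x_i - mu||^2

def f(x):
    # density model above (the simplified 1D-style normalization constant is kept as-is)
    return np.exp(-((x - mu) ** 2).sum() / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

theta = 1e-3                                          # arbitrary anomaly threshold
print(f(mu), f(np.array([10., -10.])) <= theta)       # high density vs. flagged as out-of-distribution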
Expectation-Maximization¶
Some pdf models do not lead to a closed-form solution (e.g., mixture models)
Alternating optimization scheme:
Initialize model parameters $\Gamma_0$
Expectation step: Assuming parameters $\Gamma_t$, compute the expected likelihood (or log-likelihood) of $f$ w.r.t. $\mathcal{A}$
Maximization step: Maximize this expected likelihood of $f$ w.r.t. $\Gamma$
Gaussian Mixture model¶
$$ f(\mathbf{x}) = \sum_k \pi_k e^{ -(\mathbf{x}-\mu_k)^\top \Gamma_k (\mathbf{x}-\mu_k)}$$
$\pi_k$ is the weight (population), $\mu_k$ the mean and $\Gamma_k$ the inverse covariance matrix of component $k$ (normalization constants are omitted for simplicity)
An observation $\mathbf{x}$ belongs to component $k$ with likelihood $f_k(\mathbf{x}) = \pi_k e^{-(\mathbf{x}-\mu_k)^\top \Gamma_k (\mathbf{x}-\mu_k)}$
E step
$$ E[\log f(\mathbf{x}_i)] = \sum_k Pr[k|\mathbf{x}_i]\log f_k(\mathbf{x}_i) $$
With
$$ Pr[k|\mathbf{x}_i] = h_i^k = \frac{\pi_k e^{-(\mathbf{x}_i - \mu_k)^\top\Gamma_k(\mathbf{x}_i - \mu_k)}}{\sum_j\pi_j e^{-(\mathbf{x}_i - \mu_j)^\top\Gamma_j(\mathbf{x}_i - \mu_j)}} $$
M step:
$$ \max_{\pi, \mu, \Gamma} E[\log f(\mathbf{x})] = \sum_i \sum_k h_i^k \log f_k(\mathbf{x}_i) $$
$$ \begin{aligned} \pi_k &= \frac{\sum_i h_i^k}{n} \\ \mu_k &= \frac{\sum_i h_i^k \mathbf{x}_i}{\sum_i h_i^k}\\ \Gamma_k^{-1} &= \frac{\sum_i h_i^k (\mathbf{x}_i - \mu_k)(\mathbf{x}_i-\mu_k)^\top}{\sum_i h_i^k} \end{aligned} $$
# assumed from earlier notebook cells: np, jax, jnp, plt, make_moons,
# Camera (celluloid), HTML (IPython.display) and the plot_density helper
X, y = make_moons(100, noise=0.1)
plt.scatter(X[:,0], X[:,1])
def Estep(X, w, mu, s2): # n x 2, 1 x 5, 5 x 2, 5 x 2
    xmu = X[:, None, :] - mu[None, :, :] # n x 5 x 2
    s2xm = xmu/(1e-7+s2[None,:,:]) # n x 5 x 2
    xmsxm = w * jnp.exp(-(xmu * s2xm).sum(2)) # n x 5
    return xmsxm # n x 5
def Mstep(X, h): # n x 2, n x 5
    h = h / h.sum(axis=1, keepdims=True) # normalize into responsibilities h_i^k
    w = h.mean(axis=0, keepdims=True) # n x 5 -> 1 x 5
    m = (h[:,:,None]*X[:,None,:]).sum(axis=0) / h.sum(axis=0)[:,None] # 5 x 2
    xm = X[:,None,:] - m[None,:,:] # n x 5 x 2
    s = (h[:,:,None]*xm*xm).sum(axis=0) / h.sum(axis=0)[:,None] # 5 x 2
    return w, m, s
np.random.seed(4)
w = np.ones((1, 5))/5.
m = 0.5*np.random.randn(5, 2)+0.5
s = 0.5*jnp.ones((5,2))
fig = plt.figure(dpi=150)
plt.axis('off')
camera = Camera(fig)
for i in range(30):
    plot_density(X, w, m, s)
    h = Estep(X, w, m, s)
    w, m, s = Mstep(X, h)
    plt.axis([-2, 3, -2, 2])
    camera.snap()
animation = camera.animate()
HTML(animation.to_html5_video())
plot_density(X, w, m, s)
One class SVM¶
Given a training set $\mathcal{A} = \{\mathbf{x}_i\}$ and a kernel $k(\cdot, \cdot) = \langle \phi(\cdot), \phi(\cdot) \rangle$, we want to find a classifier of high density regions:
$$ \min_{\mathbf{w}, \xi, \rho} \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_i \xi_i - \rho$$ $$\text{ s.t. } \forall i, \langle \mathbf{w}, \phi(\mathbf{x}_i) \rangle \geq \rho - \xi_i$$ $$\forall i, \xi_i \geq 0 $$
Using KKT:
$$ \begin{aligned} \mathbf{w} &= \sum_i \alpha_i \phi(\mathbf{x}_i)\\ \sum_i\alpha_i &= 1\\ 0 &\leq \alpha_i \leq C \end{aligned} $$
Dual problem:¶
$$\min_{\alpha} \frac{1}{2} \sum_{ij} \alpha_i\alpha_j k(\mathbf{x}_i,\mathbf{x}_j)$$ $$\text{s.t. } \forall i, 0 \leq \alpha_i \leq C$$ $$\sum_i \alpha_i = 1 $$
Solved using any QP solver (or projected gradient descent, as in the demo below)
Recovering $\rho$
KKT, complementary slackness:
$$ \forall i,\ \alpha_i\left(\rho - \xi_i - \sum_{j}\alpha_j k(\mathbf{x}_i, \mathbf{x}_j)\right) = 0 \quad \text{and} \quad (C - \alpha_i)\,\xi_i = 0 $$
If $0 < \alpha_i < C$, the second condition gives
$$\xi_i = 0$$
and the first constraint is active,
thus $$\rho = \sum_j \alpha_j k(\mathbf{x}_i, \mathbf{x}_j) $$
def gauss_kernel(X1, X2, gamma=5.):
    D = (X1[:,None,:] - X2[None,:,:])**2
    return jnp.exp(-gamma*D.sum(axis=2))
def loss(alpha, X):
    K = gauss_kernel(X, X)
    return 0.5 * (alpha[None,:] @ (K @ alpha[:,None])).squeeze()
@jax.jit
def update(alpha, X, C = 1., eta=0.1):
    da = jax.grad(loss, argnums=0)(alpha, X)
    a = jnp.clip(alpha - eta*da, 0, C) # projected gradient descent step on the dual
    return a/a.sum()
fig = plt.figure(dpi=150); plt.axis('off'); camera = Camera(fig)
alpha = np.ones(100)/100; l = []
for i in range(200):
    y_pred = gauss_kernel(xx, X)@alpha[:,None] # xx, t: evaluation grid from earlier cells
    y_pred = jnp.array(y_pred).reshape(t, t)
    plot_density(X, xx, y_pred)
    camera.snap()
    alpha = update(alpha, X, C=0.02, eta=.04)
    l.append(loss(alpha, X))
animation = camera.animate()
HTML(animation.to_html5_video())
plt.plot(l)
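A hedged follow-up sketch (not in the original notebook): recover $\rho$ from the trained $\alpha$ using the complementary-slackness argument above, then threshold the scores. It reuses X, alpha and gauss_kernel from the previous cells; C matches the value passed to update.
C = 0.02
K = gauss_kernel(X, X)
scores = K @ alpha                               # f(x_i) = sum_j alpha_j k(x_i, x_j)
margin = (alpha > 1e-6) & (alpha < C - 1e-6)     # support vectors with 0 < alpha_i < C
rho = scores[margin].mean()                      # assumes at least one such point; averaged for stability
outliers = scores < rho                          # points outside the estimated high-density region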
Exercise¶
Code a GMM and run it on MNIST, plot the means as images.
Clustering¶
If we have a mixture model, could we use the likelihood of each component as a categorization prediction?
Yes, but clustering is not what the objective function optimizes (i.e., there is no competition between classes)
Better to use a dedicated algorithm
$k$-means¶
Categorization by approximation
Partition the training set $\mathcal{A}$ into $M$ clusters $C_k$, each with an associated predictor (centroid) $\mu_k$
$$\min_{C_k, \mu_k} \sum_k \sum_{\mathbf{x} \in C_k} \| \mathbf{x} - \mu_k \|^2 $$
Alternate steepest descent between $C_k$ and $\mu_k$
Steepest descent on $C_k$
$$ \min_{C_k, \cup C_k = \mathcal{A}} \sum_k \sum_{\mathbf{x} \in C_k} \| \mathbf{x} - \mu_k \|^2 $$
Attained with nearest-neighbor assignment
$$ C_k = \{ \mathbf{x} \in \mathcal{A} | \mu_k = \text{argmin}_c \|\mathbf{x} - \mu_c \|^2 \} $$
Corresponds to an E step in EM with
$$ h_i^k = \mathbb{1}_{[ \mu_k = \text{argmin}_c \|\mathbf{x}_i - \mu_c \|^2]} $$
Steepest descent on $\mu_k$
$$\text{min}_{\mu_k} \sum_{\mathbf{x}\in C_k} \|\mathbf{x} - \mu_k \|^2 $$
Attained for the barycenter
$$\mu_k = \frac{1}{|C_k|}\sum_{\mathbf{x} \in C_k} \mathbf{x} $$
Corresponds to an M step in EM
Kernel $k$-means¶
Using a kernel $k(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle$
$$\min_{C_k, \mu_k} \sum_k \sum_{\mathbf{x} \in C_k} \| \phi(\mathbf{x}) - \mu_k \|^2 $$
Representer theorem
$$ \mu_k = \sum_i \alpha_i \phi(\mathbf{x}_i) $$
$$ \| \phi(\mathbf{x}) - \mu_k \|^2 = k(\mathbf{x},\mathbf{x}) + \sum_{ij}\alpha_i\alpha_jk(\mathbf{x}_i, \mathbf{x}_j) - 2\sum_i\alpha_i k(\mathbf{x}, \mathbf{x_i}) $$
$C_k$ step unchanged
$\mu_k$ step, note that
$$\mu_k = \frac{1}{|C_k|}\sum_{\mathbf{x}\in C_k} \phi(\mathbf{x}) $$
Thus $$ \alpha_i = \begin{cases} 1/|C_k| & \text{if } \mathbf{x}_i \in C_k \\ 0 & \text{otherwise} \end{cases} $$
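The demo below runs plain $k$-means in input space; as a hedged sketch (not from the lecture code), the kernel $k$-means assignment step can be written purely in terms of a precomputed kernel matrix:
import jax
import jax.numpy as jnp

def kernel_kmeans_assign(K, assign, M):
    # K: (n, n) kernel matrix, assign: (n,) current cluster indices in [0, M)
    # ||phi(x) - mu_k||^2 = k(x,x) - (2/|C_k|) sum_{i in C_k} k(x, x_i)
    #                       + (1/|C_k|^2) sum_{i,j in C_k} k(x_i, x_j)
    H = jax.nn.one_hot(assign, M)                            # (n, M) membership indicators
    sizes = H.sum(axis=0) + 1e-9                             # |C_k|
    cross = (K @ H) / sizes                                  # (n, M)
    within = jnp.einsum('ij,ik,jk->k', K, H, H) / sizes**2   # (M,)
    dist = jnp.diag(K)[:, None] - 2 * cross + within[None, :]
    return dist.argmin(axis=1)
The $\mu_k$ update is implicit: each centroid only enters through the membership indicators, matching the $\alpha_i = 1/|C_k|$ weights above.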
def Estep(X, mu):
    D = ((X[:,None,:] - mu[None,:,:])**2).sum(axis=2) # n x M
    return D.argmin(axis=1)
def Mstep(X, h):
    h = jax.nn.one_hot(h, num_classes=25)
    mu = (X[:,None,:]*h[:,:,None]).sum(axis=0)/(1e-5+h[:,:,None].sum(axis=0))
    return mu
def plot_km(X, xv, yv, y_pred, mu):
    plt.pcolormesh(xv, yv, y_pred, shading='auto', aa=True)
    plt.scatter(mu[:,0], mu[:,1], c=np.arange(25), marker='^', s=150, edgecolors='k')
    plt.scatter(X[:,0], X[:,1], c=h, edgecolors='k') # h: current assignments (global, set in the loop below)
fig = plt.figure(dpi=150)
camera = Camera(fig)
mu = 2*np.random.random((25, 2))-0.5
for e in range(20):
    h = Estep(X, mu)
    y_pred = Estep(xx, mu)+1
    y_pred = jnp.array(y_pred).reshape(t, t)
    plot_km(X, xv, yv, y_pred, mu)
    camera.snap()
    mu = Mstep(X, h)
    plot_km(X, xv, yv, y_pred, mu)
    camera.snap()
animation = camera.animate(interval=700)
HTML(animation.to_html5_video())
Exercise¶
Code k-Means and run it on MNIST, plot the centroids as images
Probably Approximately Correct¶
Can we obtain formal guarantees in learning?
Formal model of learnability (Valiant, 1984)
- Domain set $\mathcal{X}$: observations with distribution $\mathcal{D}$
- Label set $\mathcal{Y}$: target of prediction
- Concept $f: \mathcal{X} \rightarrow \mathcal{Y}$, data generation process
- Hypothesis $h: \mathcal{X} \rightarrow \mathcal{Y}$, prediction function
- Hypothesis class $\mathcal{H} = \{ h\}$, set of hypotheses
- Error $L_{\mathcal{D}, f}(h) = Pr_{x\sim \mathcal{D}}[h(x) \neq f(x)]$
Empirical Risk Minimization¶
- $\mathcal{D}$ is unknown, and so is $L_{\mathcal{D}, f}(h)$
- Training set $\mathcal{A} = \{(x_1, y_1), \dots, (x_m, y_m)\}$ sampled i.i.d from $\mathcal{D}$ and labeled with $f$
- Empirical risk: $L_\mathcal{A}(h) = \frac{1}{m}\sum_i[h(x_i) \neq y_i]$
ERM principle:
$$ h_\mathcal{A} = \text{argmin}_{h\in\mathcal{H}}L_\mathcal{A}(h)$$
Realizability assumption: there exists $h^\star \in \mathcal{H}$ such that $L_{\mathcal{D}, f}(h^\star) = 0$
Remark that then, with probability 1 over a random set $\mathcal{A} = \{(x_i, y_i)\}, x_i \sim \mathcal{D}$ and $y_i = f(x_i)$, we have $L_\mathcal{A}(h^\star) = 0$, and thus $L_\mathcal{A}(h_\mathcal{A}) = 0$
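A toy, hedged illustration of the ERM principle on a finite class (the threshold classifiers, the grid and the names are illustrative, not from the lecture):
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(50)                          # i.i.d. sample from D = U[0, 1]
H = np.linspace(0, 1, 101)                  # finite class of thresholds h_t(x) = 1[x >= t]
f_star = H[30]                              # realizability: the concept is one of the hypotheses
y = (x >= f_star).astype(int)               # labels produced by the concept

emp_risk = [np.mean((x >= t).astype(int) != y) for t in H]
h_A = H[int(np.argmin(emp_risk))]           # ERM: argmin_h L_A(h)
print(h_A, min(emp_risk))                   # ERM achieves zero empirical risk here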
ERM Failures?¶
Given the realizability assumption, what is the probability that the ERM fails?
Bounding the generalization risk
$$ Pr_\mathcal{A}[L_{\mathcal{D}, f}(h_\mathcal{A}) > \epsilon] \leq \delta$$
Probably ($\delta$) Approximately ($\epsilon$) Correct
Generalization bound¶
- Bad hypotheses set
$$\mathcal{H}_B = \{h\in\mathcal{H} | L_{\mathcal{D},f}(h) > \epsilon\}$$
- Misleading training sets
$$M = \{\mathcal{A} | \exists h \in \mathcal{H}_B, L_\mathcal{A}(h) = 0 \} $$
Set of training sets that appear to be good ($L_{\mathcal{A}}(h_\mathcal{A}) = 0$) but are bad in reality ($L_{\mathcal{D}, f}(h_\mathcal{A}) > \epsilon$)
We want to know the probability that the ERM fails because we have sampled a misleading dataset
Let us construct $M$
- We follow the ERM
- We have the realizability assumption
- Misleading training sets achieve zero empirical risk for bad classifiers
$$M = \cup_{h\in\mathcal{H}_B}\{\mathcal{A}|L_\mathcal{A}(h) = 0\} $$
Remark that
$$ \{\mathcal{A}|L_{\mathcal{D}, f}(h_\mathcal{A}) > \epsilon\} \subseteq M $$
(a training set that leads to more than $\epsilon$ generalization error is necessarily a misleading training set because of realizability)
Combining the two using a union bound
$$Pr_\mathcal{A}[\mathcal{A}|L_{\mathcal{D}, f}(h_\mathcal{A}) > \epsilon] \leq \sum_{h\in\mathcal{H}_B} Pr_\mathcal{A}[\mathcal{A}|L_\mathcal{A}(h) = 0] $$
Elements of $\mathcal{A}$ are sampled i.i.d
$$Pr_\mathcal{A}[\mathcal{A}|h\in\mathcal{H}_B,L_\mathcal{A}(h) = 0] = \prod_{i=1}^m Pr[h(x_i) = f(x_i)] $$
Since $h\in\mathcal{H}_B$, $Pr[h(x_i) = f(x_i)] \leq 1 - \epsilon$, thus
$$Pr_\mathcal{A}[\mathcal{A}|h\in\mathcal{H}_B,L_\mathcal{A}(h) = 0] \leq (1-\epsilon)^m \leq e^{-\epsilon m} $$
$$Pr_\mathcal{A}[\mathcal{A}|L_{\mathcal{D},f}(h_\mathcal{A}) >\epsilon] \leq |\mathcal{H}_B|e^{-\epsilon m} \leq |\mathcal{H}|e^{-\epsilon m} $$
Theorem¶
Let $\mathcal{H}$ be a finite hypothesis class, $\delta \in [0, 1]$, $\epsilon > 0$ and $m$ such that
$$ m\geq \frac{\log (|\mathcal{H}|/\delta)}{\epsilon} $$
Then, for any labeling function $f$, and on any distribution $\mathcal{D}$ for which the realizability assumption holds, we have for every ERM hypothesis $h_\mathcal{A}$
$$Pr_\mathcal{A}[L_{\mathcal{D}, f}(h_\mathcal{A}) \leq \epsilon] \geq 1 - \delta $$
PAC Learning¶
Definition (PAC Learnability):¶
A hypothesis class $\mathcal{H}$ is PAC learnable if $\exists m_\mathcal{H}: [0,1]^2 \rightarrow \mathbb{N}$ and a learning algorithm such that $\forall \epsilon, \delta \in [0,1]^2, \forall \mathcal{D}$ over $\mathcal{X}, \forall f: \mathcal{X} \rightarrow \{0, 1\}$, if the realizability assumption holds w.r.t. $\mathcal{H, D}, f$, then running the algorithm on $m\geq m_\mathcal{H}(\epsilon, \delta)$ i.i.d. samples generated by $\mathcal{D}$ and labeled by $f$, the algorithm returns $h$ such that with probability at least $1-\delta$, $L_{\mathcal{D}, f}(h) \leq \epsilon$
Theorem¶
Every finite hypothesis class is PAC learnable with sample complexity
$$m_\mathcal{H}(\epsilon, \delta) \leq \left\lceil \frac{\log (|\mathcal{H}|/\delta)}{\epsilon} \right\rceil $$
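To get a feel for the magnitudes in this bound, a small hedged numeric illustration (the class sizes are arbitrary examples):
import numpy as np

eps, delta = 0.01, 0.01
m_small = int(np.ceil(np.log(2 ** 20 / delta) / eps))                  # |H| = 2^20 hypotheses -> m ≈ 1847
m_all = int(np.ceil((2 ** 20 * np.log(2) + np.log(1 / delta)) / eps))  # all boolean functions on 20 bits, |H| = 2^(2^20) -> m ≈ 7.3e7
print(m_small, m_all)
The class of all boolean functions over 20 bits already asks for roughly $7\times 10^7$ samples at $\epsilon = \delta = 0.01$, which is why such bounds are mostly qualitative (cf. the remark in the take-home section below).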
Fundamental theorem of PAC Learning¶
Theorem¶
Let $\mathcal{H}$ be a hypothesis class over $\mathcal{X} \rightarrow \{0, 1\}$ and using the 0-1 loss function, then the following are equivalent
- $\mathcal{H}$ is PAC learnable
- Any ERM rule is a successful PAC learner for $\mathcal{H}$
- $\mathcal{H}$ has finite VC dimension
No Free Lunch Theorem¶
Theorem¶
Let $A$ be any learning algorithm for the task of binary classification with the 0-1 loss over $\mathcal{X}$, and let $m < |\mathcal{X}|/2$ be a training set size. Then there exists a distribution $\mathcal{D}$ over $\mathcal{X}$ and a concept $f$ such that:
- $\exists h: \mathcal{X} \rightarrow \{0, 1\}, L_{\mathcal{D}, f}(h) = 0$ (there is a good hypothesis)
- With probability at least 1/7 over the choice of $\mathcal{A}\sim \mathcal{D}$, we have $L_{\mathcal{D}, f}(A(\mathcal{A})) \geq 1/8$ (the algorithm does not find it)
No universal learner
Clustering, Metric learning and PAC, take home¶
Unsupervised learning
Density estimation: how likely is an example
Clustering: partitioning $\mathcal{X}$ into arbitrarily chosen classes
EM is a powerful algorithm for DE and clustering, but sensitive to init
$k$-means shows up frequently as a basic tool (many improved versions: splitting init, codeword shifting, etc.)
$k$-means is an excellent init for GMM
Metric learning
ML algorithms depend on the natural distance on $\mathcal{X}$
can be improved by using a better kernel (metric in the induced space)
Learning the metric can turn hard learning problems into easy ones
$k$NN with metric learning often has a good complexity/accuracy trade-off
PAC
Formal study of learnability without specifying the data distribution or the type of hypothesis
Finite classes are learnable
But the bounds require an unrealistic number of samples
No universal learner: some algorithms are better at some problems than others