What if we don't have labels?
What if we don't even have any idea about the structure of $\mathcal{Y}$?
Given a training set of samples $\mathcal{A} = \{\mathbf{x}\}$, we want to model a probability density function $f(\mathbf{x})$ corresponding to the distribution $D$ such that $\mathbf{x} \sim D$
$f$ has to be a pdf:
$\forall \mathbf{x}\in\mathcal{X}, f(\mathbf{x})\geq 0$
$\int_\mathcal{X} f(\mathbf{x})d\mathbf{x} = 1$
Useful for anomaly detection, e.g., flag $\mathbf{x}$ as anomalous if $f(\mathbf{x}) \leq \theta$ (also called Out-of-Distribution detection)
The observations were sampled from $D$, so a good model should make them likely
Maximizing the probability of $\mathcal{A}$
$$ \max_f Pr[\mathcal{A}] = \prod_{\mathbf{x}_i} f(\mathbf{x}_i) $$Equivalently, maximizing the log probability
$$ \max_f \log Pr[\mathcal{A}] = \sum_{\mathbf{x}_i} \log f(\mathbf{x}_i) $$For the Gaussian
$$ \sum_{\mathbf{x}_i} \log f(\mathbf{x}_i) = -\sum_{\mathbf{x}_i} \frac{\|\mathbf{x}_i - \mu \|^2}{2\sigma^2} - n\log\left(\sqrt{2\pi}\,\sigma\right)$$The maximization is equivalent to
$$ \min_\mu \sum_{\mathbf{x}_i} \| \mathbf{x}_i - \mu \|^2 $$Which is attained for
\begin{align} \frac{\partial \sum_{\mathbf{x}_i} \|\mathbf{x}_i - \mu \|^2}{\partial \mu} &= 0\\ 2n\mu - 2\sum_{\mathbf{x}_i}\mathbf{x}_i &= 0 \\ \mu &= \frac{1}{n} \sum_{\mathbf{x}_i} \mathbf{x}_i \end{align}For $\sigma$
$$\min_\sigma \sum_{\mathbf{x}_i} \frac{\|\mathbf{x}_i - \mu \|^2}{2\sigma^2} + n\log\left(\sqrt{2\pi}\,\sigma\right) $$which is attained for $\sigma^2 = \frac{1}{n}\sum_{\mathbf{x}_i} \|\mathbf{x}_i - \mu\|^2$ (set the derivative with respect to $\sigma$ to zero). Some pdf models do not lead to a closed-form solution (e.g., mixture models)
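Before moving to mixtures, a quick numerical check of the closed-form Gaussian estimates (a minimal sketch; the 1-D toy sample, the 3-sigma threshold, and the variable names are illustrative, not from the lecture):
import jax, jax.numpy as jnp
import numpy as np

# 1-D toy sample (illustrative values)
x = jnp.asarray(np.random.default_rng(0).normal(loc=2.0, scale=0.5, size=1000))

def neg_log_likelihood(params, x):
    mu, sigma = params
    return jnp.sum((x - mu)**2 / (2*sigma**2) + jnp.log(jnp.sqrt(2*jnp.pi)*sigma))

# closed-form maximum likelihood estimates derived above
mu_hat = x.mean()
sigma_hat = jnp.sqrt(((x - mu_hat)**2).mean())

# the gradient of the negative log-likelihood vanishes at the closed-form solution
print(jax.grad(neg_log_likelihood)(jnp.array([mu_hat, sigma_hat]), x))

# anomaly / out-of-distribution flag: low density under the fitted model
log_f = lambda z: -(z - mu_hat)**2 / (2*sigma_hat**2) - jnp.log(jnp.sqrt(2*jnp.pi)*sigma_hat)
theta = log_f(mu_hat + 3*sigma_hat)   # e.g., threshold three standard deviations out
print(log_f(10.0) <= theta)           # a point far from the data is flagged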
Alternating optimization scheme: Expectation-Maximization (EM)
Initialize the model parameters (for a Gaussian mixture: the $\pi_k$, $\mu_k$, $\Gamma_k$ defined below)
$\pi_k$ is the weight (population), $\mu_k$ the mean, and $\Gamma_k$ the inverse covariance matrix of component $k$
An observation $\mathbf{x}$ belongs to component $k$ with likelihood $f_k(\mathbf{x})$
E step
$$ E[\log f(\mathbf{x}_i)] = \sum_k Pr[k|\mathbf{x}_i]\log f_k(\mathbf{x}_i) $$With
$$ Pr[k|\mathbf{x}_i] = h_i^k = \frac{\pi_k e^{-(\mathbf{x}_i - \mu_k)^\top\Gamma_k(\mathbf{x}_i - \mu_k)}}{\sum_j\pi_j e^{-(\mathbf{x}_i - \mu_j)^\top\Gamma_j(\mathbf{x}_i - \mu_j)}} $$M step:
$$ \max_{\pi, \mu, \Gamma} E[\log f(\mathbf{x})] = \sum_i \sum_k h_i^k \log f_k(\mathbf{x}_i) $$
import numpy as np
import jax, jax.numpy as jnp
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from celluloid import Camera
from IPython.display import HTML

X, y = make_moons(100, noise=0.1)
plt.scatter(X[:, 0], X[:, 1])
def Estep(X, w, mu, s):
    # E step: unnormalized responsibilities pi_k * exp(-(x_i - mu_k)^T Gamma_k (x_i - mu_k))
    # for a diagonal-covariance mixture (the Gaussian normalization constant is omitted)
    xmu = X[:, None, :] - mu[None, :, :]       # n x 5 x 2: x_i - mu_k
    sxm = xmu / (1e-7 + s[None, :, :])         # n x 5 x 2: divided by per-dimension variances
    xmsxm = w * jnp.exp(-(xmu * sxm).sum(2))   # n x 5
    return xmsxm
def Mstep(X, h):
    # M step: reweighted estimates from the responsibilities h (n x 5)
    w = h.mean(axis=0)                                                         # mixture weights pi_k
    m = (h[:, :, None] * X[:, None, :]).sum(axis=0) / h.sum(axis=0)[:, None]   # 5 x 2: weighted means
    xm = X[:, None, :] - m[None, :, :]                                         # n x 5 x 2
    s = (h[:, :, None] * xm * xm).sum(axis=0) / h.sum(axis=0)[:, None]         # 5 x 2: weighted variances
    return w, m, s
np.random.seed(4)
w = np.ones(5)/5.
m = 0.5*np.random.randn(5, 2)+0.5
s = 0.5*jnp.ones((5,2))
fig = plt.figure(dpi=150)
plt.axis('off')
camera = Camera(fig)
for i in range(50):
    plot_density(X, w, m, s)                       # plotting helper assumed defined elsewhere in the notebook
    h = Estep(X, w, m, s)
    h = h / (1e-4 + h.sum(axis=1, keepdims=True))  # normalize responsibilities over the components
    w, m, s = Mstep(X, h)
    plt.axis([-2, 3, -2, 2])
    camera.snap()
animation = camera.animate()
HTML(animation.to_html5_video())
Given a training set $\mathcal{A} = \{\mathbf{x}_i\}$ and a kernel $k(\cdot, \cdot) = \langle \phi(\cdot), \phi(\cdot) \rangle$, we want to find a classifier of high density regions:
$$ \min_{\mathbf{w}, \xi, \rho} \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_i \xi_i - \rho\\ \text{ s.t. } \forall i, \langle \mathbf{w}, \phi(\mathbf{x}_i) \rangle \geq \rho - \xi_i\\ \forall i, \xi_i \geq 0 $$Using KKT:
\begin{align} \mathbf{w} &= \sum_i \alpha_i \phi(\mathbf{x}_i)\\ \sum_i\alpha_i &= 1\\ 0 &\leq \alpha_i \leq C \end{align}Solved using any QP solver (or projected coordinate ascent)
Recovering $\rho$
KKT, complementary slackness:
$$ \forall i, \alpha_i\left(\rho - \xi_i - \sum_{j}\alpha_j k(\mathbf{x}_i, \mathbf{x}_j)\right) = 0 $$If $\alpha_i \neq 0$ and $\alpha_i \neq C$, then
$$\xi_i = 0$$and the factor in parentheses must vanish, thus
$$\rho = \sum_j \alpha_j k(\mathbf{x}_i, \mathbf{x}_j) $$
def gauss_kernel(X1, X2, gamma=5.):
    # Gaussian (RBF) kernel matrix between the rows of X1 and X2
    D = (X1[:, None, :] - X2[None, :, :])**2
    return jnp.exp(-gamma * D.sum(axis=2))

def loss(alpha, X):
    # quadratic form 1/2 alpha^T K alpha from the one-class SVM dual
    K = gauss_kernel(X, X)
    return 0.5 * (alpha[None, :] @ (K @ alpha[:, None])).squeeze()
@jax.jit
def update(alpha, X, C=1., eta=0.1):
    # gradient step on the quadratic objective, then clip to the box [0, C]
    # and renormalize so that sum_i alpha_i = 1 (approximate projection onto the constraints)
    da = jax.grad(loss, argnums=0)(alpha, X)
    a = jnp.clip(alpha + eta*da, 0, C)
    return a / a.sum()
fig = plt.figure(dpi=150); plt.axis('off'); camera = Camera(fig)
alpha = np.ones(100)/100; l = []
for i in range(200):
    y_pred = gauss_kernel(xx, X) @ alpha[:, None]  # xx: grid of evaluation points assumed defined elsewhere
    y_pred = jnp.array(y_pred).reshape(t, t)
    plot_density(X, xx, y_pred)
    camera.snap()
    alpha = update(alpha, X, C=0.02, eta=.04)
    l.append(loss(alpha, X))
animation = camera.animate()
HTML(animation.to_html5_video())
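Once the $\alpha$ above have stabilized, $\rho$ can be recovered as derived earlier, averaging $\sum_j \alpha_j k(\mathbf{x}_i, \mathbf{x}_j)$ over the points strictly inside the box (a sketch; the tolerance 1e-6 and the reuse of C=0.02 from the loop above are assumptions):
# recover rho = sum_j alpha_j k(x_i, x_j) for indices i with 0 < alpha_i < C
C = 0.02                                              # same box constraint as in the loop above
sv = np.asarray((alpha > 1e-6) & (alpha < C - 1e-6))  # margin support vectors
rho = (gauss_kernel(X[sv], X) @ alpha).mean()         # averaged for numerical stability
print(rho)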
If we have a mixture model, could we use the likelihood of each component as a categorization prediction?
Yes, but that is not what the objective function optimizes (there is no competition between components)
It is better to use a dedicated algorithm
Categorization by approximation
Partition the training set $\mathcal{A}$ into $M$ clusters $C_k$, each with an associated predictor $\mu_k$
$$\min_{C_k, \mu_k} \sum_k \sum_{\mathbf{x} \in C_k} \| \mathbf{x} - \mu_k \|^2 $$Alternating steepest descent between $C_k$ and $\mu_k$
Steepest descent on $C_k$
$$ \min_{C_k, \cup C_k = \mathcal{A}} \sum_k \sum_{\mathbf{x} \in C_k} \| \mathbf{x} - \mu_k \|^2 $$Attained with nearest-neighbor assignment
$$ C_k = \{ \mathbf{x} \in \mathcal{A} | \mu_k = \text{argmin}_c \|\mathbf{x} - \mu_c \|^2 \} $$Corresponds to an E step in EM with
$$ h_i^k = \mathbb{1}_{[ \mu_k = \text{argmin}_c \|\mathbf{x}_i - \mu_c \|^2]} $$Steepest descent on $\mu_k$
$$\text{min}_{\mu_k} \sum_{\mathbf{x}\in C_k} \|\mathbf{x} - \mu_k \|^2 $$Attained for the barycenter
$$\mu_k = \frac{1}{|C_k|}\sum_{\mathbf{x} \in C_k} \mathbf{x} $$Corresponds to an M step in EM
Using a kernel $k(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle$
$$\min_{C_k, \mu_k} \sum_k \sum_{\mathbf{x} \in C_k} \| \phi(\mathbf{x}) - \mu_k \|^2 $$Representer theorem
$$ \mu_k = \sum_i \alpha_i \phi(\mathbf{x}_i) $$$C_k$ step unchanged
$\mu_k$ step, note that
$$\mu_k = \frac{1}{|C_k|}\sum_{\mathbf{x}\in C_k} \phi(\mathbf{x}) $$Thus
\begin{align} \alpha_i = \begin{cases} 1/|C_k| & \text{if } \mathbf{x}_i \in C_k \\ 0 & \text{else} \end{cases} \end{align}
def Estep(X, mu):
    # assign each point to its nearest centroid
    D = ((X[:, None, :] - mu[None, :, :])**2).sum(axis=2)  # n x M squared distances
    return D.argmin(axis=1)
def Mstep(X, h):
    # recompute each centroid as the barycenter of its assigned points
    h = jax.nn.one_hot(h, num_classes=25)   # n x M one-hot assignments
    mu = (X[:, None, :] * h[:, :, None]).sum(axis=0) / (1e-5 + h[:, :, None].sum(axis=0))
    return mu
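For the kernel $k$-means variant above, the assignment step can be written with kernel evaluations only. A minimal sketch (not used in the demo below), reusing gauss_kernel from the one-class SVM part and assuming h holds the current hard assignments:
def kernel_Estep(X, h, num_classes=25):
    # ||phi(x_i) - mu_k||^2 = k(x_i,x_i) - 2/|C_k| sum_{j in C_k} k(x_i,x_j)
    #                         + 1/|C_k|^2 sum_{j,l in C_k} k(x_j,x_l)
    H = jax.nn.one_hot(h, num_classes=num_classes)   # n x M cluster indicators
    n_k = 1e-5 + H.sum(axis=0)                       # cluster sizes |C_k|
    K = gauss_kernel(X, X)                           # n x n kernel matrix
    cross = (K @ H) / n_k[None, :]                   # n x M: mean kernel value to each cluster
    intra = jnp.diagonal(H.T @ K @ H) / n_k**2       # M: mean intra-cluster kernel value
    D = jnp.diag(K)[:, None] - 2*cross + intra[None, :]  # n x M squared distances in feature space
    return D.argmin(axis=1)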
def plot_km(X, xv, yv, y_pred, mu, h):
    # decision regions, centroids (triangles) and points colored by their assignment h
    plt.pcolormesh(xv, yv, y_pred, shading='auto', aa=True)
    plt.scatter(mu[:, 0], mu[:, 1], c=np.arange(25), marker='^', s=150, edgecolors='k')
    plt.scatter(X[:, 0], X[:, 1], c=h, edgecolors='k')
fig = plt.figure(dpi=150)
camera = Camera(fig)
mu = 2*np.random.random((25, 2))-0.5
for e in range(20):
    h = Estep(X, mu)
    y_pred = Estep(xx, mu) + 1                 # predicted cluster on the evaluation grid xx
    y_pred = jnp.array(y_pred).reshape(t, t)
    plot_km(X, xv, yv, y_pred, mu, h)
    camera.snap()
    mu = Mstep(X, h)
    plot_km(X, xv, yv, y_pred, mu, h)          # same frame with the updated centroids
    camera.snap()
animation = camera.animate(interval=700)
HTML(animation.to_html5_video())
So far, we have used either the natural distance on $\mathcal{X}$ or the one induced by the choice of a kernel
Can we just learn the distance?
Find a transform $\phi(\cdot)$ such that examples that should be similar end up close, and examples that should be dissimilar end up far apart
Linear model
$$ \phi(\mathbf{x}) = \mathbf{Px} $$Define Positive and Negative sets $\mathcal{P}(\mathbf{x}), \mathcal{N}(\mathbf{x})$ for each example $\mathbf{x}$
In practice, we don't want to put negative examples at an infinite distance
$$ \min_\mathbf{P} \sum_\mathbf{x} \sum_{\mathbf{x}_p \in \mathcal{P}(\mathbf{x})} \|\mathbf{Px} - \mathbf{Px}_p\|^2 + \lambda \sum_{\mathbf{x}_n \in \mathcal{N}(\mathbf{x})} \max(0, \beta - \|\mathbf{Px} - \mathbf{Px}_n\|^2) $$Push negative examples away until they are beyond the margin $\beta$
A similar argument holds for the positive set, with margin $\alpha$:
$$ \max(0, \|\mathbf{Px} - \mathbf{Px}_p\|^2 - \alpha) $$This learns a distance that enhances a nearest-neighbor classifier
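A minimal JAX sketch of this objective, assuming the positive and negative sets are reduced to one index per example; the pairs below are built from the moons labels y purely for illustration, and alpha, beta, the learning rate, and the number of steps are arbitrary:
def metric_loss(P, X, pos_idx, neg_idx, alpha=0.1, beta=1.0, lam=1.0):
    # pos_idx[i] / neg_idx[i]: index of one positive / negative example for x_i
    Z = jnp.asarray(X) @ P.T                      # projected points P x
    d_pos = ((Z - Z[pos_idx])**2).sum(axis=1)     # ||P x - P x_p||^2
    d_neg = ((Z - Z[neg_idx])**2).sum(axis=1)     # ||P x - P x_n||^2
    pull = jnp.maximum(0., d_pos - alpha)         # positives pulled within margin alpha
    push = jnp.maximum(0., beta - d_neg)          # negatives pushed beyond margin beta
    return pull.sum() + lam * push.sum()

rng = np.random.default_rng(0)
pos_idx = np.array([rng.choice(np.flatnonzero(y == y[i])) for i in range(len(X))])
neg_idx = np.array([rng.choice(np.flatnonzero(y != y[i])) for i in range(len(X))])
P = jnp.eye(2)
for _ in range(100):
    P = P - 0.01 * jax.grad(metric_loss)(P, X, pos_idx, neg_idx)   # plain gradient descent on P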
Can we obtain formal guarantees in learning?
Formal model of learnability
ERM principle:
$$ h_\mathcal{A} = \text{argmin}_{h\in\mathcal{H}}L_\mathcal{A}(h)$$Remark that with probability 1 over a random set $\mathcal{A} = \{(x_i, y_i)\}, x_i \sim \mathcal{D}$ and $y_i = f(x_i)$, we have $L_\mathcal{A}(h^\star) = 0$, where $h^\star \in \mathcal{H}$ is a hypothesis with zero true risk (realizability assumption)
Given the realizability assumption, what is the probability that the ERM fails?
Bounding the generalization risk
$$ Pr_\mathcal{A}[L_{\mathcal{D}, f}(h_\mathcal{A}) > \epsilon] \leq \delta$$Probably ($\delta$) Approximately ($\epsilon$) Correct
Set of training sets that appear to be good ($L_{\mathcal{A}}(h_\mathcal{A}) = 0$) but are bad in reality ($L_{\mathcal{D}, f}(h_\mathcal{A}) > \epsilon$)
We want to know the probability that the ERM fails because we have sampled a misleading dataset
Let us construct $M$: define the set of bad hypotheses $\mathcal{H}_B = \{h \in \mathcal{H} : L_{\mathcal{D}, f}(h) > \epsilon\}$ and the set of misleading training sets $M = \bigcup_{h\in\mathcal{H}_B}\{\mathcal{A} : L_\mathcal{A}(h) = 0\}$
Remark that
$$ \{\mathcal{A}|L_{\mathcal{D}, f}(h_\mathcal{A}) > \epsilon\} \subseteq M $$(a training set on which the ERM has more than $\epsilon$ generalization error is necessarily a misleading training set because of realizability)
Combining the two using a union bound
$$Pr_\mathcal{A}[\mathcal{A}|L_{\mathcal{D}, f}(h_\mathcal{A}) > \epsilon] \leq \sum_{h\in\mathcal{H}_B} Pr_\mathcal{A}[\mathcal{A}|L_\mathcal{A}(h) = 0] $$Elements of $\mathcal{A}$ are sampled i.i.d.
$$Pr_\mathcal{A}[\mathcal{A}|h\in\mathcal{H}_B,L_\mathcal{A}(h) = 0] = \prod_{i=1}^m Pr[h(x_i) = f(x_i)] $$Since $h\in\mathcal{H}_B$, $Pr[h(x_i) = f(x_i)] \leq 1 - \epsilon$, thus
$$Pr_\mathcal{A}[\mathcal{A}|h\in\mathcal{H}_B,L_\mathcal{A}(h) = 0] \leq (1-\epsilon)^m \leq e^{-\epsilon m} $$Let $\mathcal{H}$ be a finite hypothesis class, $\delta \in [0, 1]$, $\epsilon > 0$ and $m$ such that
$$ m\geq \frac{\log (|\mathcal{H}|/\delta)}{\epsilon} $$Then, for any labeling function $f$, and on any distribution $\mathcal{D}$ for which the realizability assumption holds, we have for every ERM hypothesis $h_\mathcal{A}$
$$Pr_\mathcal{A}[L_{\mathcal{D}, f}(h_\mathcal{A}) \leq \epsilon] \geq 1 - \delta $$A hypothesis class $\mathcal{H}$ is PAC learnable if $\exists m_\mathcal{H}: [0,1]^2 \rightarrow \mathbb{N}$ and a learning algorithm such that $\forall \epsilon, \delta \in [0,1]^2, \forall \mathcal{D}$ over $\mathcal{X}, \forall f: \mathcal{X} \rightarrow \{0, 1\}$, if the realizability assumption holds w.r.t. $\mathcal{H, D}, f$, then running the algorithm on $m\geq m_\mathcal{H}(\epsilon, \delta)$ i.i.d. samples generated by $\mathcal{D}$ and labeled by $f$, the algorithm returns $h$ such that with probability at least $1-\delta$, $L_{\mathcal{D}, f}(h) \leq \epsilon$
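To get a feel for the bound, a quick numerical evaluation of $m \geq \log(|\mathcal{H}|/\delta)/\epsilon$ (the class sizes and the $\epsilon, \delta$ values below are arbitrary examples):
import math

def sample_complexity(log_H, eps, delta):
    # m >= (log|H| + log(1/delta)) / eps, natural logarithms as in the proof above
    return math.ceil((log_H + math.log(1/delta)) / eps)

print(sample_complexity(math.log(1e6), 0.01, 0.01))        # |H| = 10^6 hypotheses: ~1.8e3 samples
print(sample_complexity(2**20 * math.log(2), 0.01, 0.01))  # all boolean functions on 20 bits: ~7.3e7 samples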
Every finite hypothesis class is PAC learnable with sample complexity
$$m_\mathcal{H}(\epsilon, \delta) \leq \left\lceil \frac{\log (|\mathcal{H}|/\delta)}{\epsilon} \right\rceil $$Let $\mathcal{H}$ be a hypothesis class of functions from $\mathcal{X}$ to $\{0, 1\}$, with the 0-1 loss; then the following are equivalent:
Let $A$ be any learning algorithm for the task of binary classification with the 0-1 loss over $\mathcal{X}$, and let $m < |\mathcal{X}|/2$ be a training set size. Then there exists a distribution $\mathcal{D}$ over $\mathcal{X}$ and a concept $f$ such that $L_{\mathcal{D}, f}(f) = 0$, yet with probability at least $1/7$ over the choice of the training set $\mathcal{A}$, $L_{\mathcal{D}, f}(A(\mathcal{A})) \geq 1/8$
No universal learner
Unsupervised learning
Density estimation: how likely is an example
Clustering: partitioning $\mathcal{X}$ into arbitrarily chosen classes
EM is a powerful algorithm for DE and clustering, but sensitive to init
$k$-means shows up frequently as a basic tool (many improved versions: splitting init, codeword shifting, etc.)
$k$-means is an excellent init for GMM
Metric learning
Many ML algorithms depend on the natural distance on $\mathcal{X}$
They can be improved by using a better kernel (the metric in the induced space)
Learning the metric can turn hard learning problems into easy ones
$k$NN with metric learning often has a good complexity/accuracy trade-off
PAC
Formal study of learnability without specifying the data distribution or the type of hypothesis
Finite classes are learnable
But the bounds require an unrealistic number of samples
No universal learner: some algorithms are better at some problems than others