What if we don't have labels?
What if we don't even have any idea about the structure of $\mathcal{Y}$?
Given a training set of samples $\mathcal{A} = \{\mathbf{x}_i\}$, we want to model a probability density function $f(\mathbf{x})$ corresponding to the distribution $D$ such that $\mathbf{x}_i \sim D$.
$f$ has to be a pdf:
$\forall \mathbf{x}\in\mathcal{X}, f(\mathbf{x})\geq 0$
$\int_\mathcal{X} f(\mathbf{x})d\mathbf{x} = 1$
Useful for anomaly detection: flag $\mathbf{x}$ as anomalous if $f(\mathbf{x}) \leq \theta$ for some threshold $\theta$ (also called out-of-distribution detection)
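As an illustration, here is a minimal sketch of density thresholding for anomaly detection, assuming an isotropic Gaussian fitted by maximum likelihood (the threshold `theta` and all names are illustrative; in practice the threshold would be tuned on held-out data):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(1000, 2))    # in-distribution training samples

# Fit an isotropic Gaussian by maximum likelihood (derived below)
mu = X.mean(axis=0)
sigma2 = ((X - mu) ** 2).sum(axis=1).mean() / X.shape[1]

def density(x, mu, sigma2):
    # Isotropic Gaussian pdf in d dimensions
    d = x.shape[-1]
    sq = ((x - mu) ** 2).sum(axis=-1)
    return np.exp(-sq / (2 * sigma2)) / (2 * np.pi * sigma2) ** (d / 2)

theta = 1e-3                                # illustrative threshold
x_new = np.array([5.0, 5.0])
print(density(x_new, mu, sigma2) <= theta)  # True: flagged as out of distribution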
We have observations; by definition, they are likely under $D$
Maximizing the probability of $\mathcal{A}$ (assuming the samples are drawn i.i.d.)
$$ \max_f Pr[\mathcal{A}] = \prod_{\mathbf{x}_i} f(\mathbf{x}_i) $$
Equivalently, maximizing the log probability
$$ \max_f \log Pr[\mathcal{A}] = \sum_{\mathbf{x}_i} \log f(\mathbf{x}_i) $$
For the Gaussian $f(\mathbf{x}) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\|\mathbf{x} - \mu\|^2/(2\sigma^2)}$,
$$ \sum_{\mathbf{x}_i} \log f(\mathbf{x}_i) = -\sum_{\mathbf{x}_i} \frac{\|\mathbf{x}_i - \mu \|^2}{2\sigma^2} - n\log\sqrt{2\pi}\,\sigma $$
The maximization is equivalent to
$$ \min_\mu \sum_{\mathbf{x}_i} \| \mathbf{x}_i - \mu \|^2 $$
which is attained for
\begin{align} \frac{\partial \sum_{\mathbf{x}_i} \|\mathbf{x}_i - \mu \|^2}{\partial \mu} &= 0\\ 2n\mu - 2\sum_{\mathbf{x}_i}\mathbf{x}_i &= 0 \\ \mu &= \frac{1}{n} \sum_{\mathbf{x}_i} \mathbf{x}_i \end{align}
For $\sigma$,
$$ \min_\sigma \sum_{\mathbf{x}_i} \frac{\|\mathbf{x}_i - \mu \|^2}{2\sigma^2} + n\log\sqrt{2\pi}\,\sigma $$
Setting the derivative with respect to $\sigma$ to zero yields $\sigma^2 = \frac{1}{n}\sum_{\mathbf{x}_i} \|\mathbf{x}_i - \mu\|^2$.
Some pdf models do not lead to a closed-form solution (e.g., mixture models)
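As a quick numerical sanity check of these closed-form estimates, here is a minimal sketch on synthetic 1-D data (the variable names are illustrative):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10000)  # samples from N(2, 1.5^2)

mu_hat = x.mean()                        # closed-form MLE for the mean
sigma2_hat = ((x - mu_hat) ** 2).mean()  # closed-form MLE for the variance

print(mu_hat, np.sqrt(sigma2_hat))       # should be close to 2.0 and 1.5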
Alternating optimisation scheme (Expectation-Maximisation)
Initialize the model parameters $(\pi_k, \mu_k, \Gamma_k)$
$\pi_k$ is the mixing weight (population), $\mu_k$ the mean, and $\Gamma_k$ the inverse covariance matrix of component $k$
An observation $\mathbf{x}$ belongs to component $k$ with likelihood $f_k(\mathbf{x})$, so the mixture density is $f(\mathbf{x}) = \sum_k \pi_k f_k(\mathbf{x})$
E step
$$ E[\log f(\mathbf{x}_i)] = \sum_k Pr[k|\mathbf{x}_i]\log f_k(\mathbf{x}_i) $$
with
$$ Pr[k|\mathbf{x}_i] = h_i^k = \frac{\pi_k e^{-(\mathbf{x}_i - \mu_k)^\top\Gamma_k(\mathbf{x}_i - \mu_k)}}{\sum_j\pi_j e^{-(\mathbf{x}_i - \mu_j)^\top\Gamma_j(\mathbf{x}_i - \mu_j)}} $$
M step:
$$ \max_{\pi, \mu, \Gamma} E[\log f(\mathbf{x})] = \sum_i \sum_k h_i^k \log f_k(\mathbf{x}_i) $$

import numpy as np
import jax.numpy as jnp
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from celluloid import Camera
from IPython.display import HTML

X, y = make_moons(100, noise=0.1)
plt.scatter(X[:,0], X[:,1])
def Estep(X, w, mu, s):
    # E step: unnormalized responsibilities h_i^k proportional to
    # pi_k * exp(-(x_i - mu_k)^T diag(1/s_k) (x_i - mu_k))
    xmu = X[:, None, :] - mu[None, :, :]      # n x 5 x 2: x_i - mu_k
    sxm = xmu / (1e-7 + s[None, :, :])        # n x 5 x 2: scaled by the inverse diagonal covariance
    xmsxm = w * jnp.exp(-(xmu * sxm).sum(2))  # n x 5: weighted Gaussian likelihoods
    return xmsxm
def Mstep(X, h):
    # M step: re-estimate parameters from the normalized responsibilities h (n x 5)
    w = h.mean(axis=0)                                                        # 5: mixing weights
    m = (h[:, :, None] * X[:, None, :]).sum(axis=0) / h.sum(axis=0)[:, None]  # 5 x 2: weighted means
    xm = X[:, None, :] - m[None, :, :]                                        # n x 5 x 2
    s = (h[:, :, None] * xm * xm).sum(axis=0) / h.sum(axis=0)[:, None]        # 5 x 2: diagonal variances
    return w, m, s
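The animation loop below relies on a `plot_density` helper that is not defined in this section; here is a minimal sketch of what it could look like, assuming it draws the unnormalized mixture density on a fixed grid (grid bounds and resolution are arbitrary):

def plot_density(X, w, m, s):
    # Evaluate the unnormalized mixture density on a grid and overlay the data
    xs, ys = np.linspace(-2, 3, 100), np.linspace(-2, 2, 100)
    gx, gy = np.meshgrid(xs, ys)
    grid = np.stack([gx.ravel(), gy.ravel()], axis=1)  # 10000 x 2
    dens = Estep(grid, w, m, s).sum(axis=1)            # mixture density up to normalization
    plt.contourf(gx, gy, dens.reshape(gx.shape), levels=20)
    plt.scatter(X[:, 0], X[:, 1], c='white', s=5)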
np.random.seed(4)
w = np.ones(5)/5.                   # 5 components with uniform initial weights
m = 0.5*np.random.randn(5, 2)+0.5   # random initial means
s = 0.5*jnp.ones((5,2))             # initial diagonal variances
fig = plt.figure(dpi=150)
plt.axis('off')
camera = Camera(fig)
for i in range(50):
    plot_density(X, w, m, s)                       # draw the current density estimate
    h = Estep(X, w, m, s)                          # E step: unnormalized responsibilities
    h = h / (1e-4 + h.sum(axis=1, keepdims=True))  # normalize each row to sum to 1
    w, m, s = Mstep(X, h)                          # M step: update weights, means, variances
    plt.axis([-2, 3, -2, 2])
    camera.snap()                                  # capture the frame for the animation
animation = camera.animate()
HTML(animation.to_html5_video())
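To check convergence numerically, one could also track the average log of the unnormalized mixture density across iterations, as in this sketch (`avg_log_density` is an illustrative name; since `Estep` omits the Gaussian normalization constants, this is a rough diagnostic rather than the exact log-likelihood):

def avg_log_density(X, w, m, s):
    # Mean log of the unnormalized mixture density over the training set
    return jnp.log(1e-7 + Estep(X, w, m, s).sum(axis=1)).mean()

for i in range(50):
    h = Estep(X, w, m, s)
    h = h / (1e-4 + h.sum(axis=1, keepdims=True))
    w, m, s = Mstep(X, h)
    print(i, avg_log_density(X, w, m, s))          # should roughly increase, then plateau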