for region Ri in all regions:
if x ∈ Ri:
return yi
A decision tree is a hierarchical classifier in which each internal node partitions the feature space along a single feature component.
Leo Breiman (1928 - 2005)
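As a minimal illustration of the definition above, here is a hypothetical depth-2 tree on two features (the components and thresholds are invented for the example, not taken from the data used below):
def tiny_tree(x):
    # root node: split on component 0 at threshold 0.5
    if x[0] < 0.5:
        # internal node: split on component 1 at threshold 0.3
        return 0 if x[1] < 0.3 else 1
    else:
        return 1   # leaf: predict class 1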
Proportion of class $k$ in $\mathcal{S}$
$$ p_k(\mathcal{S}) = \frac{1}{\vert\mathcal{S}\vert} \sum_{y_i = k} 1 $$
Prediction for $\mathcal{S}$
$$ f(\mathcal{S}) = \text{argmax}_k\, p_k(\mathcal{S}) $$
0-1 loss
$$ C(\mathcal{S}) = \frac{1}{N}\sum_i \left(1 - \delta(y_i, f(\mathbf{x}_i))\right) = 1 - p_{f(\mathcal{S})}(\mathcal{S}) $$
How much does the error decrease with a split on component $d$ at threshold $\theta$ that produces subsets $\mathcal{S}_1$ and $\mathcal{S}_2$?
$$ \text{Gain}(\mathcal{S}, d, \theta) = C(\mathcal{S}) - \left[ \frac{N_1}{N}C(\mathcal{S}_1) + \frac{N_2}{N}C(\mathcal{S}_2) \right] $$
Choose the $d, \theta$ with maximal gain
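A minimal sketch of this misclassification gain for a single candidate split (plain NumPy-style code; the function and variable names are mine, not from the original):
def zeroOneCost(y):
    # 0-1 loss of the majority-vote prediction on a subset of labels: 1 - p_{f(S)}(S)
    if len(y) == 0:
        return 0.
    p1 = y.mean()                 # proportion of class 1
    return 1. - max(p1, 1. - p1)
def zeroOneGain(X, y, d, theta):
    # Gain(S, d, theta) = C(S) - [N1/N C(S1) + N2/N C(S2)]
    left = X[:, d] < theta
    N1, N2 = left.sum(), (~left).sum()
    return zeroOneCost(y) - (N1*zeroOneCost(y[left]) + N2*zeroOneCost(y[~left]))/len(y)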
Other popular gain measures: Gini impurity and entropy (information gain); the code below uses entropy.
import numpy as np
import matplotlib, matplotlib.pyplot as plt
import jax, jax.numpy as jnp
X = np.random.rand(75, 2)          # toy dataset: 75 points in the unit square
y = 1.*(X[:,1] > X[:,0])           # label 1 above the diagonal, 0 below
plt.scatter(X[:,0], X[:,1], c=y)
def entropyGain(X, y, d, theta):
    # Information gain of splitting component d at threshold theta, using the
    # binary entropy H(p) = -p log p - (1-p) log(1-p) = entr(p) + entr(1-p)
    if len(y) <= 1:
        return 0.
    p = y.mean()
    e = jax.scipy.special.entr(p) + jax.scipy.special.entr(1. - p)
    l = 1.*(X[:,d] < theta)                      # indicator of the left subset
    p1 = (y * l).sum()/(l.sum()+1e-12)
    e1 = jax.scipy.special.entr(p1) + jax.scipy.special.entr(1. - p1)
    r = 1. - l                                   # indicator of the right subset
    p2 = (y * r).sum()/(r.sum()+1e-12)
    e2 = jax.scipy.special.entr(p2) + jax.scipy.special.entr(1. - p2)
    # parent entropy minus the size-weighted entropy of the two children
    return e - (l.sum()*e1 + r.sum()*e2)/len(y)
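A quick sanity check of the gain on the toy data (a usage sketch; exact values depend on the random draw above):
# with the diagonal boundary, both components are informative, so both gains should be positive
print(entropyGain(X, y, d=0, theta=0.5))
print(entropyGain(X, y, d=1, theta=0.5))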
def findBestTheta(X, y, d, gain=entropyGain):
n = len(y)
best_g = -1.
theta = None
xx = jnp.sort(X[:,d])-1e-7
for t in xx:
g = gain(X, y, d, t)
if g > best_g:
best_g = g
theta = t
    if theta is None:
        print('theta failure!!')
return theta, best_g
def findBestDTheta(X, y, gain=entropyGain):
best_d = None
theta = None
best_g = -1
for d in range(X.shape[1]):
t, g = findBestTheta(X, y, d, gain)
if g > best_g:
best_d = d
theta = t
best_g = g
if best_d is None:
print('D failure!!')
return best_d, theta
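A quick usage check of the greedy split search (values depend on the random draw):
best_d, best_theta = findBestDTheta(X, y)
print('best split: component {} at threshold {:.3f}'.format(best_d, float(best_theta)))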
class BinaryClassificationTree():
def __init__(self, X, y, gain=entropyGain, min_size=1):
        p = y.mean()                                   # proportion of class 1 in the node
        # stop when the node is too small or pure (entr(p) == 0 iff p is 0 or 1)
        if len(y) <= min_size or jax.scipy.special.entr(p) == 0.:
            self.label = 1.*(p>=0.5)                   # leaf: predict the majority class
else:
self.label = None
self.d, self.theta = findBestDTheta(X, y, gain)
ind = 1.*(X[:,self.d] < self.theta)
if ind.sum() == 0 or ind.sum() == len(y):
print('single split !!! {} {} {}'.format(ind, y, X))
ind1 = ind.nonzero()
X1 = X[ind1]
y1 = y[ind1]
ind2 = (1-ind).nonzero()
X2 = X[ind2]
y2 = y[ind2]
self.T1 = BinaryClassificationTree(X1, y1, gain=gain, min_size=min_size)
self.T2 = BinaryClassificationTree(X2, y2, gain=gain, min_size=min_size)
def __call__(self, X):
if self.label is not None:
return self.label * jnp.ones(len(X))
return jnp.concatenate([ self.T1([x]) if x[self.d] < self.theta else self.T2([x]) for x in X])
T = BinaryClassificationTree(X, y)
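A quick check that the fully grown tree fits the training set (with min_size=1 and distinct feature values, the training accuracy should be 1.0):
print('training accuracy:', float((T(X) == y).mean()))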
t = 50; tx = jnp.linspace(0, 1, t); ty = jnp.linspace(0, 1, t)
xv, yv = jnp.meshgrid(tx, ty, sparse=True); xv = xv.squeeze(); yv = yv.squeeze()
xx = jnp.array([[xx, yy] for yy in yv for xx in xv])
y_pred = jnp.array(T(xx)).reshape(t, t)
cmap = plt.get_cmap('PiYG')
levels=jnp.linspace(-1.5, .5, 10)
norm = matplotlib.colors.BoundaryNorm(levels, ncolors=cmap.N, clip=True)
plt.pcolormesh(xv, yv, -y_pred, shading='nearest', norm=norm);
plt.scatter(X[:,0], X[:,1], c=y)
Interpretable
Fast
Handle categorical data
But
Poor accuracy
Unstable
Need a lot of examples
Finding the optimal tree is NP-hard, so trees are grown greedily
y[6] = 1 - y[6]   # flip a single label to illustrate the tree's instability
T = BinaryClassificationTree(X, y)
t = 50; tx = jnp.linspace(0, 1, t); ty = jnp.linspace(0, 1, t)
xv, yv = jnp.meshgrid(tx, ty, sparse=True); xv = xv.squeeze(); yv = yv.squeeze()
xx = jnp.array([[xx, yy] for yy in yv for xx in xv])
y_pred = jnp.array(T(xx)).reshape(t, t)
cmap = plt.get_cmap('PiYG')
levels=jnp.linspace(-1.5, .5, 10)
norm = matplotlib.colors.BoundaryNorm(levels, ncolors=cmap.N, clip=True)
plt.pcolormesh(xv, yv, -y_pred, shading='nearest', norm=norm);
plt.scatter(X[:,0], X[:,1], c=y)
Theorem: For a tree with $n$ nodes in dimension $d$ and $m$ samples, with probability at least $1 - \delta$,
$$ R \leq R_e + \sqrt{\frac{(n+1)\log_2(d+3) + \log_2(2/\delta)}{2m}} $$
Exercise: What is the VC dimension of a decision tree over $\{0, 1\}^d$?
Overcome decision tree instability by averaging $B$ randomized trees
Final decision by majority vote: $f(\mathbf{x}) = \text{argmax}_k \left[\sum_b f_b(\mathbf{x})\right]_k$
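For the binary 0/1 labels used in this notebook, the majority vote reduces to thresholding the average tree prediction at 1/2, e.g. (a sketch; preds stands for the stacked tree outputs):
def majority_vote(preds):
    # preds: array of shape (B, n) holding the 0/1 predictions of the B trees
    return 1.*(jnp.array(preds).mean(axis=0) >= 0.5)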
Ensemble of classifiers $h_1, \dots, h_K$; define the margin function
$$ mg(\mathbf{x}, y) = \text{avg}_k\, \mathbb{1}[h_k(\mathbf{x}) = y] - \max_{j\neq y}\text{avg}_k\, \mathbb{1}[h_k(\mathbf{x}) = j] $$
(difference between the vote for the true class and the largest vote for a wrong class)
Generalization error
$$ R = \mathbb{P}[ mg(\mathbf{x}, y) < 0 ] $$
Random forest: classifiers drawn i.i.d. from a distribution of parameters $\Theta$
Theorem (Breiman): As the number of trees increases, for almost every sequence $\Theta_1, \dots$, the generalization error $R$ converges to
$$ \mathbb{P} \left[ \mathbb{P}_\Theta[ h_\theta(\mathbf{x}) = y ] - \max_{j\neq y}\mathbb{P}_\Theta[h_\theta(\mathbf{x}) = j] < 0\right] $$
Hence $R$ does not increase as more trees are added, which limits overfitting
class RandomForest():
def __init__(self, X, y, nb_tree=25, p=0.5):
self.trees = []
n = len(y)
k = int(p*n)
for b in range(nb_tree):
i = np.random.permutation(n)
Xb = X[i[0:k], ...]
yb = y[i[0:k]]
DT = BinaryClassificationTree(Xb, yb)
self.trees.append(DT)
def __call__(self, X):
y = []
for DT in self.trees:
y.append(DT(X))
        return 1.*(jnp.array(y).mean(axis=0))   # average vote across trees (soft score in [0, 1])
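Note that the constructor above trains each tree on a random subset containing a fraction p of the samples, drawn without replacement; classical bagging would instead use a bootstrap sample drawn with replacement, e.g. (a sketch of the alternative, to be used inside the loop):
i = np.random.randint(0, n, size=n)   # n indices drawn with replacement
Xb = X[i, ...]
yb = y[i]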
T = RandomForest(X, y)
t = 20; tx = jnp.linspace(0, 1, t); ty = jnp.linspace(0, 1, t)
xv, yv = jnp.meshgrid(tx, ty, sparse=True); xv = xv.squeeze(); yv = yv.squeeze()
xx = jnp.array([[xx, yy] for yy in yv for xx in xv])
y_pred = jnp.array(T(xx)).reshape(t, t)
cmap = plt.get_cmap('PiYG')
levels=jnp.linspace(-1.5, .5, 10)
norm = matplotlib.colors.BoundaryNorm(levels, ncolors=cmap.N, clip=True)
plt.pcolormesh(xv, yv, -y_pred, shading='nearest', norm=norm);
plt.scatter(X[:,0], X[:,1], c=y)
t = 20; tx = jnp.linspace(0, 1, t); ty = jnp.linspace(0, 1, t)
xv, yv = jnp.meshgrid(tx, ty, sparse=True); xv = xv.squeeze(); yv = yv.squeeze()
xx = jnp.array([[xx, yy] for yy in yv for xx in xv])
y_pred = jnp.array(T(xx)).reshape(t, t)
cmap = plt.get_cmap('PiYG')
levels=jnp.linspace(-1.5, .5, 10)
norm = matplotlib.colors.BoundaryNorm(levels, ncolors=cmap.N, clip=True)
plt.pcolormesh(xv, yv, -1.*(y_pred>0.5), shading='nearest', norm=norm);
plt.scatter(X[:,0], X[:,1], c=y)
T = RandomForest(X, y, nb_tree=100, p=0.2)
t = 50; tx = jnp.linspace(0, 1, t); ty = jnp.linspace(0, 1, t)
xv, yv = jnp.meshgrid(tx, ty, sparse=True); xv = xv.squeeze(); yv = yv.squeeze()
xx = jnp.array([[xx, yy] for yy in yv for xx in xv])
y_pred = jnp.array(T(xx)).reshape(t, t)
cmap = plt.get_cmap('PiYG')
levels=jnp.linspace(-1.5, .5, 10)
norm = matplotlib.colors.BoundaryNorm(levels, ncolors=cmap.N, clip=True)
plt.pcolormesh(xv, yv, -y_pred, shading='nearest', norm=norm);
plt.scatter(X[:,0], X[:,1], c=y)
t = 50; tx = jnp.linspace(0, 1, t); ty = jnp.linspace(0, 1, t)
xv, yv = jnp.meshgrid(tx, ty, sparse=True); xv = xv.squeeze(); yv = yv.squeeze()
xx = jnp.array([[xx, yy] for yy in yv for xx in xv])
y_pred = jnp.array(T(xx)).reshape(t, t)
cmap = plt.get_cmap('PiYG')
levels=jnp.linspace(-1.5, .5, 10)
norm = matplotlib.colors.BoundaryNorm(levels, ncolors=cmap.N, clip=True)
plt.pcolormesh(xv, yv, -1.*(y_pred>0.5), shading='nearest', norm=norm);
plt.scatter(X[:,0], X[:,1], c=y)
What is the variance of the average of $B$ random variables, each of variance $\sigma^2$, that are pairwise correlated by $\rho$? (See the worked answer below.)
Correlation
$$\frac{1}{\sigma^2}\mathbb{E}\left[ (x_i - \mu)(x_j - \mu) \right] = \rho \geq 0 $$
The ERM principle is subject to the bias-variance trade-off
Ensemble idea: aggregate many simple models
Each model has low estimation error
Aggregation has low approximation error
Assume $M$ independent predictors $h_m(\mathbf{x})$
Trained on different features (e.g., colors and textures)
Trained on different samples (e.g., different images)
Bagging aggregate
\begin{align*} h(\mathbf{x}) = \sum_m h_m(\mathbf{x}) \end{align*}
which corresponds to a voting strategy
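Answering the variance question above (a standard computation, assuming every pair has the same correlation $\rho$): averaging divides the independent part of the variance by $B$ but leaves the correlated part untouched,
$$ \mathrm{Var}\left[\frac{1}{B}\sum_{b=1}^B x_b\right] = \frac{1}{B^2}\left[ B\sigma^2 + B(B-1)\rho\sigma^2 \right] = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2 $$
so decorrelating the individual predictors (randomization, subsampling) is what makes averaging effective.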
Random axis, select optimal threshold
class RandomAxisClassifier():
def __init__(self, X, y, d):
self.d = d
self.theta, _ = findBestTheta(X, y, d)
i = jnp.where(X[:,d]<self.theta)
self.y = (y[i].mean()>0.5)
def __call__(self, X):
        return 1.*(X[:,self.d]<self.theta) if self.y else 1.-(X[:,self.d]<self.theta)
class BaggingClassifier():
def __init__(self, X, y, nb_cls, p=0.5):
self.cls = []
n = len(y)
k = int(p*n)
for b in range(nb_cls):
i = np.random.permutation(n)
Xb = X[i[0:k], ...]
yb = y[i[0:k]]
self.cls.append(RandomAxisClassifier(Xb, yb, np.random.randint(2)))
def __call__(self, X):
y = []
for c in self.cls:
y.append(c(X))
return jnp.array(y).mean(axis=0)
T = BaggingClassifier(X, y, 1)
t = 20; tx = jnp.linspace(0, 1, t); ty = jnp.linspace(0, 1, t)
xv, yv = jnp.meshgrid(tx, ty, sparse=True); xv = xv.squeeze(); yv = yv.squeeze()
xx = jnp.array([[xx, yy] for yy in yv for xx in xv])
y_pred = jnp.array(T(xx)).reshape(t, t)
cmap = plt.get_cmap('PiYG')
levels=jnp.linspace(-1.5, .5, 10)
norm = matplotlib.colors.BoundaryNorm(levels, ncolors=cmap.N, clip=True)
plt.pcolormesh(xv, yv, -y_pred, shading='nearest', norm=norm);
plt.scatter(X[:,0], X[:,1], c=y)
T = BaggingClassifier(X, y, 10)
t = 20; tx = jnp.linspace(0, 1, t); ty = jnp.linspace(0, 1, t)
xv, yv = jnp.meshgrid(tx, ty, sparse=True); xv = xv.squeeze(); yv = yv.squeeze()
xx = jnp.array([[xx, yy] for yy in yv for xx in xv])
y_pred = jnp.array(T(xx)).reshape(t, t)
cmap = plt.get_cmap('PiYG')
levels=jnp.linspace(-1.5, .5, 10)
norm = matplotlib.colors.BoundaryNorm(levels, ncolors=cmap.N, clip=True)
plt.pcolormesh(xv, yv, -y_pred, shading='nearest', norm=norm);
plt.scatter(X[:,0], X[:,1], c=y)
T = BaggingClassifier(X, y, 100, p=0.2)
t = 20; tx = jnp.linspace(0, 1, t); ty = jnp.linspace(0, 1, t)
xv, yv = jnp.meshgrid(tx, ty, sparse=True); xv = xv.squeeze(); yv = yv.squeeze()
xx = jnp.array([[xx, yy] for yy in yv for xx in xv])
y_pred = jnp.array(T(xx)).reshape(t, t)
cmap = plt.get_cmap('PiYG')
levels=jnp.linspace(-1.5, .5, 10)
norm = matplotlib.colors.BoundaryNorm(levels, ncolors=cmap.N, clip=True)
plt.pcolormesh(xv, yv, -y_pred, shading='nearest', norm=norm);
plt.scatter(X[:,0], X[:,1], c=y)
In bagging, each predictor has the same weight
Some decisions are redundant
Some decisions are bad, and we know they are bad
Can we define weights that better reflect each classifier's strengths?
Key ideas:
Given a classifier $f_{m-1}$, we want to add a new classifier that reduces the error
We have to solve \begin{align*} \beta_m, G_m = \arg\min_{\beta, G} \sum_i\exp[-y_i (f_{m-1}(\mathbf{x}_i) + \beta G(\mathbf{x}_i))] \end{align*}
Can be rewritten as \begin{align*} \beta_m, G_m = \arg\min_{\beta, G} \sum_i w_i\exp[-y_i \beta G(\mathbf{x}_i)] \end{align*} with \begin{align*} w_i = \exp[-y_i f_{m-1}(\mathbf{x}_i)] \end{align*}
Remark that, for a given $\beta > 0$, \begin{align*} \arg\min_G\sum_i w_i\exp[-y_i \beta G(\mathbf{x}_i)] \end{align*} is the same as \begin{align*} \arg\min_G \sum_i w_i I(y_i \neq G(\mathbf{x}_i)) \end{align*} because \begin{align*} \sum_i w_i\exp[-y_i \beta G(\mathbf{x}_i)] = e^{-\beta} \sum_{y_i = G(\mathbf{x}_i)} w_i + e^{\beta} \sum_{y_i \neq G(\mathbf{x}_i)} w_i \end{align*}
Given $G$, we have to solve \begin{align*} \arg\min_\beta\sum_i w_i\exp[-y_i \beta G(\mathbf{x}_i)] \end{align*} Remark that \begin{align*} \sum_i w_i\exp[-y_i \beta G(\mathbf{x}_i)]&\\ \quad\quad = \left(e^\beta - e^{-\beta} \right)&\sum_i w_i I(y_i \neq G(\mathbf{x}_i)) + e^{-\beta}\sum_i w_i \end{align*} Thus \begin{align*} \beta_m = \frac{1}{2} \log \frac{1 - err_m}{err_m},\quad err_m = \frac{\sum_i w_i I(y_i \neq G_m(\mathbf{x}_i))}{\sum_i w_i} \end{align*}
Weight update for the next round (as implemented in the code below):
\begin{align*}
w_i \leftarrow w_i\, e^{\alpha_m I(y_i \neq G_m(\mathbf{x}_i))}, \quad \alpha_m = \log\frac{1 - err_m}{err_m}, \quad err_m = \frac{\sum_i w_i I(y_i \neq G_m(\mathbf{x}_i))}{\sum_i w_i}
\end{align*}
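A short justification of this update (a standard step, using $y_i, G_m(\mathbf{x}_i) \in \{-1, +1\}$ so that $-y_i G_m(\mathbf{x}_i) = 2 I(y_i \neq G_m(\mathbf{x}_i)) - 1$):
$$ w_i^{(m+1)} = w_i^{(m)}\, e^{-\beta_m y_i G_m(\mathbf{x}_i)} = w_i^{(m)}\, e^{\alpha_m I(y_i \neq G_m(\mathbf{x}_i))}\, e^{-\beta_m}, \qquad \alpha_m = 2\beta_m $$
The common factor $e^{-\beta_m}$ rescales all weights equally, so it can be dropped.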
def weightedError(w, y_pred, y_true):
return (w * (y_pred != y_true)).sum()/w.sum()
def weightedFindBestTheta(w, X, y, d):
n = len(y)
err = w.sum()+1
theta = None
p = None
xx = jnp.sort(X[:,d])-1e-7
for t in xx:
e = weightedError(w, 1.*(X[:,d] < t), y)
if e < err:
err = e
theta = t
p = True
e = weightedError(w, 1. - (X[:,d] < t), y)
if e < err:
err = e
theta = t
p = False
    if theta is None:
        print('theta failure!!')
return theta, p, err
class WeightedRandomAxisClassifier():
def __init__(self, w, X, y, d):
self.d = d
self.theta, self.y, self.err = weightedFindBestTheta(w, X, y, d)
def __call__(self, X):
return 1.*(X[:,self.d]<self.theta) if self.y else 1.-(X[:,self.d]<self.theta)
class AdaBoost():
def __init__(self, X, y, nb_cls=5):
n = len(y)
w = jnp.ones(n)/n
self.beta = []
self.cls = []
for b in range(nb_cls):
err_b = []
cls_b = []
for d in range(X.shape[1]):
c = WeightedRandomAxisClassifier(w, X, y, d)
cls_b.append(c)
err_b.append(c.err)
a = jnp.argmin(jnp.array(err_b))
c = cls_b[a]
e = c.err
            beta = jnp.log((1 - e)/e)                  # classifier weight alpha_m = log((1-err)/err)
            w = w * jnp.exp(beta * (c(X) != y))        # up-weight the misclassified samples
            self.beta.append(beta)
self.cls.append(c)
def __call__(self, X):
y = []
for i, c in enumerate(self.cls):
y.append(self.beta[i] * c(X))
return jnp.array(y).mean(axis=0)
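The call above returns a weighted soft score, the mean of $\beta_b\, c_b(\mathbf{x})$ with weak outputs $c_b \in \{0, 1\}$; a hard decision can be obtained by comparing it to half of the mean weight (a sketch, the name is mine):
def adaboost_predict(T, X):
    # class 1 when the weighted vote exceeds half of the total weight
    return 1.*(T(X) >= 0.5 * jnp.array(T.beta).mean())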
T = AdaBoost(X, y, 2)
t = 20; tx = jnp.linspace(0, 1, t); ty = jnp.linspace(0, 1, t)
xv, yv = jnp.meshgrid(tx, ty, sparse=True); xv = xv.squeeze(); yv = yv.squeeze()
xx = jnp.array([[xx, yy] for yy in yv for xx in xv])
y_pred = jnp.array(T(xx)).reshape(t, t)
cmap = plt.get_cmap('PiYG')
levels=jnp.linspace(-1.5, .5, 10)
norm = matplotlib.colors.BoundaryNorm(levels, ncolors=cmap.N, clip=True)
plt.pcolormesh(xv, yv, -y_pred, shading='nearest', norm=norm);
plt.scatter(X[:,0], X[:,1], c=y)
T = AdaBoost(X, y, 10)
t = 20; tx = jnp.linspace(0, 1, t); ty = jnp.linspace(0, 1, t)
xv, yv = jnp.meshgrid(tx, ty, sparse=True); xv = xv.squeeze(); yv = yv.squeeze()
xx = jnp.array([[xx, yy] for yy in yv for xx in xv])
y_pred = jnp.array(T(xx)).reshape(t, t)
cmap = plt.get_cmap('PiYG')
levels=jnp.linspace(-1.5, .5, 10)
norm = matplotlib.colors.BoundaryNorm(levels, ncolors=cmap.N, clip=True)
plt.pcolormesh(xv, yv, -y_pred, shading='nearest', norm=norm);
plt.scatter(X[:,0], X[:,1], c=y)
T = AdaBoost(X, y, 50)
t = 20; tx = jnp.linspace(0, 1, t); ty = jnp.linspace(0, 1, t)
xv, yv = jnp.meshgrid(tx, ty, sparse=True); xv = xv.squeeze(); yv = yv.squeeze()
xx = jnp.array([[xx, yy] for yy in yv for xx in xv])
y_pred = jnp.array(T(xx)).reshape(t, t)
cmap = plt.get_cmap('PiYG')
levels=jnp.linspace(-1.5, .5, 10)
norm = matplotlib.colors.BoundaryNorm(levels, ncolors=cmap.N, clip=True)
plt.pcolormesh(xv, yv, -y_pred, shading='nearest', norm=norm);
plt.scatter(X[:,0], X[:,1], c=y)
Classifiers have to be weak
A perfect classifier yields $\beta_m = \infty$, breaking the subsequent classifiers
Example: a kernel SVM with a sufficiently tight Gaussian kernel
Classifiers have to be slightly better than random
Worse than random gets a negative weight (the opposite classifier)
Example: a single-feature classifier
Boosted decision trees are an excellent first try in most cases
Exercise: Show that \begin{align*} f^\star(\mathbf{x}) = \arg\min_f \mathbb{E}_{(\mathbf{x},y)}\left[ e^{-yf(\mathbf{x})} \right] = \frac{1}{2}\log\frac{P[y = 1 \vert \mathbf{x}]}{P[y=-1 \vert \mathbf{x}]} \end{align*}
Decision Trees
Simple
Fast
Explainable (domain experts understand the decision process)
Handle categorical data (or even mixed)
But
Overfit, unstable
Require large amounts of data
Random forest
Simple, Fast, handle categorical data
Stable
No longer explainable
Ensemble
Bagging: a simple solution, effective at reducing variance
Boosting: optimized combination
Large literature and many libraries on boosting
Grinsztajn, Leo, Edouard Oyallon, and Gael Varoquaux. "Why do tree-based models still outperform deep learning on typical tabular data?." Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track. 2022.