Machine Learning and Applications - Decision Trees and Ensembling Methods¶

David Picard¶

École des Ponts ParisTech¶

david.picard@enpc.fr¶

Region based classification¶

[Figure: region-based partition of the feature space]

  1. Memorizing all regions is cumbersome
    • $R_1 = \{(a_1, b_1, c_1, d_1), y_1\}$
    • $R_2 = \{(a_2, b_2, c_2, d_2), y_2\}$
    • ...
  2. Classification requires checking all regions (sketched below)
for region Ri in all regions:
    if x ∈ Ri:
      return yi
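A minimal runnable sketch of this lookup, with hypothetical axis-aligned regions (the `regions` list and `predict` helper are illustrative, not part of the course material):

import numpy as np

# hypothetical memorized regions: each is (lower corner, upper corner, label)
regions = [
    (np.array([0.0, 0.0]), np.array([0.5, 0.5]), 0),
    (np.array([0.5, 0.0]), np.array([1.0, 1.0]), 1),
]

def predict(x):
    # scan every region until one contains x: O(#regions) per query
    for lo, hi, label in regions:
        if np.all(x >= lo) and np.all(x < hi):
            return label
    return None  # x falls outside all memorized regions

print(predict(np.array([0.7, 0.3])))  # 1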

Tree based equivalent¶

[Figures: the equivalent tree, built region by region]

Tree representation¶

[Figure: decision tree representation]

Decision Tree¶

A decision tree is a hierarchical classifier with a tree structure where each node partitions the feature space along a specified component (Breiman et al, 1984).

Leo Breiman (1928 - 2005)

[Photo: Leo Breiman]

Growing the tree¶

  1. Tree($\mathcal{S} = \{\mathbf{x}_i, y_i\}$):
  2. $\quad$if $\vert\mathcal{S}\vert < T$:
  3. $\quad\quad$return $\text{Leaf}(\text{argmax}_c\sum_i 1_{y_i=c})$
  4. $\quad d^\star, \theta^\star = \text{argmax}_{d,\theta} \text{Gain}(\mathcal{S}, d, \theta)$
  5. $\quad T_1 = \text{Tree}(\{(\mathbf{x}_i, y_i)\in\mathcal{S} \,\vert\, \mathbf{x}_i[d^\star] < \theta^\star\})$
  6. $\quad T_2 = \text{Tree}(\{(\mathbf{x}_i, y_i)\in\mathcal{S} \,\vert\, \mathbf{x}_i[d^\star] \geq \theta^\star\})$
  7. $\quad$return $\text{Node}(d^\star, \theta^\star, T_1, T_2)$

Gain measure¶

Proportion of class $k$ in $\mathcal{S}$

$$ p_k(\mathcal{S}) = \frac{1}{\vert\mathcal{S}\vert} \sum_{y_i = k} 1 $$

Prediction for $\mathcal{S}$

$$f(\mathcal{S}) = \text{argmax}_k p_k(\mathcal{S}) $$

0-1 loss

$$C(\mathcal{S}) = \frac{1}{N}\sum_i \left(1 - \delta(y_i, f(\mathbf{x}_i))\right) = 1 - p_{f(\mathcal{S})}(\mathcal{S})$$

The gain measures how much the error decreases when splitting on component $d$ at threshold $\theta$, producing subsets $\mathcal{S}_1$ and $\mathcal{S}_2$:

$$ \text{Gain}(\mathcal{S}, d, \theta) = C(\mathcal{S}) - \left[ \frac{N_1}{N}C(\mathcal{S}_1) + \frac{N_2}{N}C(\mathcal{S}_2) \right] $$

Choose $d, \theta$ with maximal gain
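A minimal sketch of this gain computed with the 0-1 loss, assuming binary labels in $\{0, 1\}$ (the helpers `misclass_cost` and `misclass_gain` are illustrative, not the notebook's code):

import numpy as np

def misclass_cost(y):
    # C(S) = 1 - p_{f(S)}(S): error of predicting the majority class
    if len(y) == 0:
        return 0.0
    p1 = y.mean()
    return 1.0 - max(p1, 1.0 - p1)

def misclass_gain(X, y, d, theta):
    # Gain(S, d, theta) = C(S) - [N1/N C(S1) + N2/N C(S2)]
    left = X[:, d] < theta
    n, n1 = len(y), left.sum()
    return misclass_cost(y) - (n1 * misclass_cost(y[left])
                               + (n - n1) * misclass_cost(y[~left])) / n

X_toy = np.random.rand(20, 2)
y_toy = 1.0 * (X_toy[:, 1] > X_toy[:, 0])
print(misclass_gain(X_toy, y_toy, 0, 0.5))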

Information Gain¶

Other popular gain measures:

  • Entropy

$$C(\mathcal{S}) = -\sum_k p_k(\mathcal{S})\log p_k(\mathcal{S}) $$

  • Gini index

$$C(\mathcal{S}) = \sum_k p_k(\mathcal{S})(1- p_k(\mathcal{S}))$$
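For reference, a small sketch of both impurity measures on a vector of class proportions (illustrative helpers, not the notebook's code); both vanish on pure nodes and are maximal for uniform proportions:

import numpy as np

def entropy(p):
    # -sum_k p_k log p_k, with the convention 0 log 0 = 0
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def gini(p):
    # sum_k p_k (1 - p_k) = 1 - sum_k p_k^2
    return (p * (1.0 - p)).sum()

p = np.array([0.7, 0.2, 0.1])   # class proportions of a node
print(entropy(p), gini(p))      # both are 0 for a pure node, e.g. p = [1, 0, 0]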

Small example¶

In [8]:
import numpy as np; import jax; import jax.numpy as jnp
import matplotlib; import matplotlib.pyplot as plt
X = np.random.rand(75, 2)
y = 1.*(X[:,1] > X[:,0])
In [9]:
plt.scatter(X[:,0], X[:,1], c=y)
[Figure: scatter plot of the training data, colored by class]
In [10]:
def entropyGain(X, y, d, theta):
    # impurity decrease when splitting on component d at threshold theta
    if len(y) <= 1:
        return 0.
    p = y.mean()
    e = jax.scipy.special.entr(p)        # parent impurity: -p log p (one term of the binary entropy)
    l = 1.*(X[:,d] < theta)              # indicator of samples going to the left child
    p1 = (y * l).sum()/(l.sum()+1e-12)   # class-1 proportion in the left child
    e1 = jax.scipy.special.entr(p1)
    r = 1-l                              # indicator of samples going to the right child
    p2 = (y * r).sum()/(r.sum()+1e-12)   # class-1 proportion in the right child
    e2 = jax.scipy.special.entr(p2)
    return e - (l.sum()*e1 + r.sum()*e2)/len(y)   # parent impurity minus weighted child impurities
In [11]:
def findBestTheta(X, y, d, gain=entropyGain):
    # exhaustive search for the threshold maximizing the gain along component d
    n = len(y)
    best_g = -1.
    theta = None
    xx = jnp.sort(X[:,d])-1e-7           # candidate thresholds: just below each sample value
    for t in xx:
        g = gain(X, y, d, t)
        if g > best_g:
            best_g = g
            theta = t
    if theta is None:
        print('theta failure!!')
    return theta, best_g
In [12]:
def findBestDTheta(X, y, gain=entropyGain):
    # search over all components for the best (d, theta) split
    best_d = None
    theta = None
    best_g = -1
    for d in range(X.shape[1]):
        t, g = findBestTheta(X, y, d, gain)
        if g > best_g:
            best_d = d
            theta = t
            best_g = g
    if best_d is None:
        print('D failure!!')
    return best_d, theta
In [13]:
class BinaryClassificationTree():
    def __init__(self, X, y, gain=entropyGain, min_size=1):
        p = y.mean()
        # stop if the node is too small or pure: create a leaf with the majority label
        if len(y) <= min_size or jax.scipy.special.entr(p) == 0.:
            self.label = 1.*(p>=0.5)
        else:
            self.label = None
            self.d, self.theta = findBestDTheta(X, y, gain)   # greedy best split
            ind = 1.*(X[:,self.d] < self.theta)
            if ind.sum() == 0 or ind.sum() == len(y):
                print('single split !!! {} {} {}'.format(ind, y, X))
            ind1 = ind.nonzero()
            X1 = X[ind1]
            y1 = y[ind1]
            ind2 = (1-ind).nonzero()
            X2 = X[ind2]
            y2 = y[ind2]
            # recursively grow the two children
            self.T1 = BinaryClassificationTree(X1, y1, gain=gain, min_size=min_size)
            self.T2 = BinaryClassificationTree(X2, y2, gain=gain, min_size=min_size)
    def __call__(self, X):
        # prediction: constant label at a leaf, otherwise route each sample to a child
        if self.label is not None:
            return self.label * jnp.ones(len(X))
        return jnp.concatenate([ self.T1([x]) if x[self.d] < self.theta else self.T2([x]) for x in X])
In [14]:
T = BinaryClassificationTree(X, y)
In [15]:
t = 50; tx = jnp.linspace(0, 1, t); ty = jnp.linspace(0, 1, t)
xv, yv = jnp.meshgrid(tx, ty, sparse=True); xv = xv.squeeze(); yv = yv.squeeze()
xx = jnp.array([[xx, yy] for yy in yv for xx in xv])
y_pred = jnp.array(T(xx)).reshape(t, t)
cmap = plt.get_cmap('PiYG')
levels=jnp.linspace(-1.5, .5, 10)
norm = matplotlib.colors.BoundaryNorm(levels, ncolors=cmap.N, clip=True)
plt.pcolormesh(xv, yv, -y_pred, shading='nearest', norm=norm);
plt.scatter(X[:,0], X[:,1], c=y)
[Figure: decision regions of the tree with the training points]

Decision Trees¶

  • Interpretable

  • Fast

  • Handle categorical data

But

  • Poor accuracy

  • Unstable

  • Need a lot of examples

  • Finding the optimal tree is hard, growing is greedy

Unstable¶

In [10]:
y[6] = 1 - y[6]
T = BinaryClassificationTree(X, y)
In [11]:
t = 50; tx = jnp.linspace(0, 1, t); ty = jnp.linspace(0, 1, t)
xv, yv = jnp.meshgrid(tx, ty, sparse=True); xv = xv.squeeze(); yv = yv.squeeze()
xx = jnp.array([[xx, yy] for yy in yv for xx in xv])
y_pred = jnp.array(T(xx)).reshape(t, t)
cmap = plt.get_cmap('PiYG')
levels=jnp.linspace(-1.5, .5, 10)
norm = matplotlib.colors.BoundaryNorm(levels, ncolors=cmap.N, clip=True)
plt.pcolormesh(xv, yv, -y_pred, shading='nearest', norm=norm);
plt.scatter(X[:,0], X[:,1], c=y)
[Figure: decision regions after flipping one training label]

Generalization¶

Theorem: For a tree of $n$ nodes in dimension $d$ and for $m$ samples, we have with probability at least $1-\delta$

$$ R \leq R_e + \sqrt{\frac{(n+1)\log_2(d+3) + \log_2(2/\delta)}{2m}}$$
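As a rough numerical illustration of the bound's complexity term (arbitrary values for $n$, $d$, $m$, $\delta$):

import numpy as np

def tree_bound_gap(n, d, m, delta=0.05):
    # sqrt( ((n+1) log2(d+3) + log2(2/delta)) / (2m) )
    return np.sqrt(((n + 1) * np.log2(d + 3) + np.log2(2.0 / delta)) / (2.0 * m))

# the complexity term grows with the number of nodes n
print(tree_bound_gap(n=5, d=10, m=1000), tree_bound_gap(n=50, d=10, m=1000))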

Exercise: What is the VC dimension of decision tree over $\{0, 1\}^d$ ?

Random Forest¶

Overcome DT instabilities by averaging $B$ randomized trees (Breiman, 2001)

  • Randomized training set $\mathcal{A}_b \subset \mathcal{A}$
  • Randomized components $\mathbf{x} \in \mathcal{X}_b \subset \mathcal{X}$

Final decision by majority vote: $f(\mathbf{x}) = \text{argmax}_c \left[\sum_b f_b(\mathbf{x})\right]_c$

  • Average value for regression

Limiting overfitting¶

Ensemble of classifiers $h_1, \dots, h_K$, define the margin function

$$ mg(\mathbf{x}, y) = \text{avg}_k \mathbb{1}[h_k(\mathbf{x}) = y] - \max_{j\neq y}\text{avg}_k \mathbb{1}[h_k(\mathbf{x}) = j] $$ (difference between true class vote and max false class vote)

Generalization error

$$ R = \mathbb{P}[ mg(\mathbf{x}, y) < 0 ] $$

Random forest: classifiers drawn i.i.d. from a distribution of parameters $\Theta$

Theorem (Breiman, 2001): As the number of trees increases, almost surely over the sequences $\Theta_1, \Theta_2, \dots$, the generalization error $R$ converges to

$$ \mathbb{P} \left[ \mathbb{P}_\Theta[ h_\theta(\mathbf{x}) = y ] - \max_{j\neq y}\mathbb{P}_\Theta[h_\theta(\mathbf{x}) = j] < 0\right] $$

R does not increase as the number of trees grows, limiting overfitting
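A small sketch of the margin function defined above, for the hard votes of $K$ classifiers on a single sample (toy values, illustrative helper):

import numpy as np

def margin(preds, y, n_classes):
    # mg(x, y) = avg_k 1[h_k(x) = y] - max_{j != y} avg_k 1[h_k(x) = j]
    votes = np.bincount(preds, minlength=n_classes) / len(preds)
    others = np.delete(votes, y)
    return votes[y] - others.max()

preds = np.array([2, 2, 1, 2, 0, 2])     # predictions of K = 6 classifiers
print(margin(preds, y=2, n_classes=3))   # positive margin: the ensemble is correct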

In [12]:
class RandomForest():
    def __init__(self, X, y, nb_tree=25, p=0.5):
        # train nb_tree trees, each on a random subset of a fraction p of the examples
        self.trees = []
        n = len(y)
        k = int(p*n)
        for b in range(nb_tree):
            i = np.random.permutation(n)
            Xb = X[i[0:k], ...]
            yb = y[i[0:k]]
            DT = BinaryClassificationTree(Xb, yb)
            self.trees.append(DT)
    def __call__(self, X):
        # average the votes of all trees (a score in [0, 1])
        y = []
        for DT in self.trees:
            y.append(DT(X))
        return 1.*(jnp.array(y).mean(axis=0))
In [13]:
T = RandomForest(X, y)
In [14]:
t = 20; tx = jnp.linspace(0, 1, t); ty = jnp.linspace(0, 1, t)
xv, yv = jnp.meshgrid(tx, ty, sparse=True); xv = xv.squeeze(); yv = yv.squeeze()
xx = jnp.array([[xx, yy] for yy in yv for xx in xv])
y_pred = jnp.array(T(xx)).reshape(t, t)
cmap = plt.get_cmap('PiYG')
levels=jnp.linspace(-1.5, .5, 10)
norm = matplotlib.colors.BoundaryNorm(levels, ncolors=cmap.N, clip=True)
plt.pcolormesh(xv, yv, -y_pred, shading='nearest', norm=norm);
plt.scatter(X[:,0], X[:,1], c=y)
[Figure: averaged forest prediction with the training points]
In [15]:
t = 20; tx = jnp.linspace(0, 1, t); ty = jnp.linspace(0, 1, t)
xv, yv = jnp.meshgrid(tx, ty, sparse=True); xv = xv.squeeze(); yv = yv.squeeze()
xx = jnp.array([[xx, yy] for yy in yv for xx in xv])
y_pred = jnp.array(T(xx)).reshape(t, t)
cmap = plt.get_cmap('PiYG')
levels=jnp.linspace(-1.5, .5, 10)
norm = matplotlib.colors.BoundaryNorm(levels, ncolors=cmap.N, clip=True)
plt.pcolormesh(xv, yv, -1.*(y_pred>0.5), shading='nearest', norm=norm);
plt.scatter(X[:,0], X[:,1], c=y)
[Figure: thresholded forest decision regions]
In [16]:
T = RandomForest(X, y, nb_tree=100, p=0.2)
In [17]:
t = 50; tx = jnp.linspace(0, 1, t); ty = jnp.linspace(0, 1, t)
xv, yv = jnp.meshgrid(tx, ty, sparse=True); xv = xv.squeeze(); yv = yv.squeeze()
xx = jnp.array([[xx, yy] for yy in yv for xx in xv])
y_pred = jnp.array(T(xx)).reshape(t, t)
cmap = plt.get_cmap('PiYG')
levels=jnp.linspace(-1.5, .5, 10)
norm = matplotlib.colors.BoundaryNorm(levels, ncolors=cmap.N, clip=True)
plt.pcolormesh(xv, yv, -y_pred, shading='nearest', norm=norm);
plt.scatter(X[:,0], X[:,1], c=y)
[Figure: averaged prediction with 100 trees and p=0.2]
In [18]:
t = 50; tx = jnp.linspace(0, 1, t); ty = jnp.linspace(0, 1, t)
xv, yv = jnp.meshgrid(tx, ty, sparse=True); xv = xv.squeeze(); yv = yv.squeeze()
xx = jnp.array([[xx, yy] for yy in yv for xx in xv])
y_pred = jnp.array(T(xx)).reshape(t, t)
cmap = plt.get_cmap('PiYG')
levels=jnp.linspace(-1.5, .5, 10)
norm = matplotlib.colors.BoundaryNorm(levels, ncolors=cmap.N, clip=True)
plt.pcolormesh(xv, yv, -1.*(y_pred>0.5), shading='nearest', norm=norm);
plt.scatter(X[:,0], X[:,1], c=y)
[Figure: thresholded decision regions with 100 trees and p=0.2]
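The RandomForest class above only randomizes the training samples; the feature randomization mentioned earlier can be sketched by also restricting each tree to a random subset of components. This RandomSubspaceForest is a possible variant reusing BinaryClassificationTree, not the notebook's implementation:

import numpy as np
import jax.numpy as jnp

class RandomSubspaceForest():
    # like RandomForest above, but each tree also sees a random subset of features
    def __init__(self, X, y, nb_tree=25, p=0.5, n_feat=1):
        self.trees, self.feats = [], []
        n = len(y)
        k = int(p * n)
        for b in range(nb_tree):
            i = np.random.permutation(n)[:k]                  # random subset of examples
            f = np.random.permutation(X.shape[1])[:n_feat]    # random subset of components
            self.trees.append(BinaryClassificationTree(X[i][:, f], y[i]))
            self.feats.append(f)
    def __call__(self, X):
        # average the votes of all trees, each restricted to its own features
        votes = [T(X[:, f]) for T, f in zip(self.trees, self.feats)]
        return jnp.array(votes).mean(axis=0)

# T = RandomSubspaceForest(X, y, nb_tree=50, n_feat=1)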

Reducing the variance¶

What is the variance of the average of $B$ random variables, each with mean $m$ and variance $\sigma^2$, and pairwise correlation $\rho$?

Correlation

$$\frac{1}{\sigma^2}\mathbb{E}\left[ (x_i - m)(x_j - m) \right] = \rho \geq 0 $$

$$ \begin{aligned} \mathbb{E}[x_i^2] &= \sigma^2 + m^2\\ \mathbb{E}[x_ix_j] &= \rho \sigma^2 + m^2 \end{aligned} $$

$$ Var\left(\frac{\sum_i x_i}{B}\right) = \frac{1}{B^2}Var\left(\sum_i x_i \right)$$

$$ = \frac{1}{B^2} \left( \mathbb{E}\left[ \left(\sum_ix_i \right)^2 \right] - \mathbb{E}\left[\sum_ix_i \right]^2\right)$$

$$=\frac{1}{B^2} \left( \sum_{i,j} \mathbb{E}\left[ x_i x_j \right] - \left( \sum_i \mathbb{E}\left[ x_i \right] \right)^2 \right)$$

$$=\frac{1}{B^2} \left(B(\sigma^2+m^2) + (B^2-B)(\rho\sigma^2 + m^2) - B^2m^2 \right) $$

$$Var\left(\frac{\sum_i x_i}{B}\right) = \frac{1}{B^2} \left(B(\sigma^2+m^2) + (B^2-B)(\rho\sigma^2 + m^2) - B^2m^2 \right)$$

$$=\frac{\sigma^2 + m^2}{B} + \rho\sigma^2 + m^2 - \frac{\rho\sigma^2 + m^2}{B} - m^2$$

$$=\frac{\sigma^2}{B} + \rho\sigma^2 - \frac{\rho\sigma^2}{B}$$

$$= \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$$

$$< \sigma^2 \text{ whenever } \rho< 1 \text{ and } B > 1$$
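A quick simulation with equicorrelated Gaussian variables (arbitrary $\rho$, $\sigma$, $B$) matches the formula $\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$:

import numpy as np

rho, sigma, B, n_trials = 0.3, 2.0, 10, 200000
rng = np.random.default_rng(0)

# build B equicorrelated variables: x_i = sigma * (sqrt(rho)*z + sqrt(1-rho)*e_i)
z = rng.standard_normal(n_trials)
e = rng.standard_normal((n_trials, B))
x = sigma * (np.sqrt(rho) * z[:, None] + np.sqrt(1 - rho) * e)

print(x.mean(axis=1).var())                       # empirical variance of the average
print(rho * sigma**2 + (1 - rho) * sigma**2 / B)  # predicted value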

Ensemble learning¶

ERM principle subject to bias-variance trade-off

  • Simple model: low estimation error, large approximation error

  • Complex model: high estimation error, low approximation error

Ensemble idea: aggregate many simple models

  • Each model has low estimation error

  • Aggregation has low approximation error

Bagging¶

Assume $M$ independent predictors $h_m(\mathbf{x})$

  • Trained on different features (e.g., colors and textures)

  • Trained on different samples (e.g., different images)

Bagging aggregates the predictors $$ h(\mathbf{x}) = \sum_m h_m(\mathbf{x}) $$ which corresponds to a voting strategy

Example¶

Random axis, select optimal threshold

In [19]:
class RandomAxisClassifier():
    # decision stump: threshold on a single component d
    def __init__(self, X, y, d):
        self.d = d
        self.theta, _ = findBestTheta(X, y, d)
        i = jnp.where(X[:,d]<self.theta)
        self.y = (y[i].mean()>0.5)          # majority label on the left side of the threshold
    def __call__(self, X):
        return (X[:,self.d]<self.theta) if self.y else 1-(X[:,self.d]<self.theta)
In [20]:
class BaggingClassifier():
    def __init__(self, X, y, nb_cls, p=0.5):
        self.cls = []
        n = len(y)
        k = int(p*n)
        for b in range(nb_cls):
            i = np.random.permutation(n)
            Xb = X[i[0:k], ...]
            yb = y[i[0:k]]
            self.cls.append(RandomAxisClassifier(Xb, yb, np.random.randint(2)))
    def __call__(self, X):
        y = []
        for c in self.cls:
            y.append(c(X))
        return jnp.array(y).mean(axis=0)
In [21]:
T = BaggingClassifier(X, y, 1)
In [22]:
t = 20; tx = jnp.linspace(0, 1, t); ty = jnp.linspace(0, 1, t)
xv, yv = jnp.meshgrid(tx, ty, sparse=True); xv = xv.squeeze(); yv = yv.squeeze()
xx = jnp.array([[xx, yy] for yy in yv for xx in xv])
y_pred = jnp.array(T(xx)).reshape(t, t)
cmap = plt.get_cmap('PiYG')
levels=jnp.linspace(-1.5, .5, 10)
norm = matplotlib.colors.BoundaryNorm(levels, ncolors=cmap.N, clip=True)
plt.pcolormesh(xv, yv, -y_pred, shading='nearest', norm=norm);
plt.scatter(X[:,0], X[:,1], c=y)
[Figure: bagging with 1 classifier]
In [23]:
T = BaggingClassifier(X, y, 10)
In [24]:
t = 20; tx = jnp.linspace(0, 1, t); ty = jnp.linspace(0, 1, t)
xv, yv = jnp.meshgrid(tx, ty, sparse=True); xv = xv.squeeze(); yv = yv.squeeze()
xx = jnp.array([[xx, yy] for yy in yv for xx in xv])
y_pred = jnp.array(T(xx)).reshape(t, t)
cmap = plt.get_cmap('PiYG')
levels=jnp.linspace(-1.5, .5, 10)
norm = matplotlib.colors.BoundaryNorm(levels, ncolors=cmap.N, clip=True)
plt.pcolormesh(xv, yv, -y_pred, shading='nearest', norm=norm);
plt.scatter(X[:,0], X[:,1], c=y)
[Figure: bagging with 10 classifiers]
In [25]:
T = BaggingClassifier(X, y, 100, p=0.2)
In [26]:
t = 20; tx = jnp.linspace(0, 1, t); ty = jnp.linspace(0, 1, t)
xv, yv = jnp.meshgrid(tx, ty, sparse=True); xv = xv.squeeze(); yv = yv.squeeze()
xx = jnp.array([[xx, yy] for yy in yv for xx in xv])
y_pred = jnp.array(T(xx)).reshape(t, t)
cmap = plt.get_cmap('PiYG')
levels=jnp.linspace(-1.5, .5, 10)
norm = matplotlib.colors.BoundaryNorm(levels, ncolors=cmap.N, clip=True)
plt.pcolormesh(xv, yv, -y_pred, shading='nearest', norm=norm);
plt.scatter(X[:,0], X[:,1], c=y)
[Figure: bagging with 100 classifiers, p=0.2]

Boosting¶

In bagging, each predictor has the same weight

  • Some decisions are redundant

  • Some decisions are bad and we know they are bad

Can we define weights that better reflect each classifier's strong points?

Adaboost (Freund et al, 1996)¶

Key ideas:

  • Weight each sample $\mathbf{x}_i$ with $w_i$
  • Train a classifier $g$ with weighted error $w_i I(y_i \neq g(\mathbf{x}_i))$
  • Update weights such that samples with high error have higher weights
  • Train new classifier $f_m$ with updated weighted error
  • Combine both classifiers: $g \leftarrow g + \beta f_m$
  • Iterate until combined classifier is good enough

Exponential loss function¶

$$ L(y, f(\mathbf{x})) = e^{-yf(\mathbf{x})} $$ Given a classifier $f_{m-1}$, we want to add a new classifier that reduces the error

We have to solve $$ \beta_m, G_m = \arg\min_{\beta, G} \sum_i\exp[-y_i (f_{m-1}(\mathbf{x}_i) + \beta G(\mathbf{x}_i))] $$

Independent updates¶

$$ \beta_m, G_m = \arg\min_{\beta, G} \sum_i\exp[-y_i (f_{m-1}(\mathbf{x}_i) + \beta G(\mathbf{x}_i))] $$ Can be rewritten as $$ \beta_m, G_m = \arg\min_{\beta, G} \sum_i w_i\exp[-y_i \beta G(\mathbf{x}_i)] $$ with $$ w_i = \exp[-y_i f_{m-1}(\mathbf{x}_i)] $$

Solving for $G$¶

Remark that given $\beta > 0$ $$ \arg\min_G\sum_i w_i\exp[-y_i \beta G(\mathbf{x}_i)] $$ is obtained by $$ \arg\min_G \sum_i w_i I(y_i \neq G(\mathbf{x}_i)) $$ because $$ \sum_i w_i\exp[-y_i \beta G(\mathbf{x}_i)] = e^{-\beta} \sum_{y_i = G(\mathbf{x}_i)} w_i + e^{\beta} \sum_{y_i \neq G(\mathbf{x}_i)} w_i $$

Solving for $\beta$¶

Given $G$, we have to solve $$ \arg\min_\beta\sum_i w_i\exp[-y_i \beta G(\mathbf{x}_i)] $$ Remark that $$ \sum_i w_i\exp[-y_i \beta G(\mathbf{x}_i)] = \left(e^\beta - e^{-\beta} \right)\sum_i w_i I(y_i \neq G(\mathbf{x}_i)) + e^{-\beta}\sum_i w_i $$ Thus $$ \beta = \frac{1}{2} \log \frac{1 - err_m}{err_m},\quad err_m = \frac{\sum_i w_i I(y_i \neq G(\mathbf{x}_i))}{\sum_i w_i} $$

Adaboost¶

  1. Initialize $\forall i, w_i = 1/N$
  2. For $m = 1 \ldots M$
  3. $\quad$Fit classifier $G_m(\mathbf{x})$ to training sample using $w_i$
  4. $\quad$ Compute \begin{align*} err_m = \frac{\sum_i w_i I(y_i \neq G_m(\mathbf{x}_i))}{\sum_i w_i} \end{align*}
  5. $\quad$ Compute $\beta_m = \log ((1-err_m)/err_m)$
  6. $\quad$ Set $\forall i, w_i \leftarrow w_i e^{\beta_m I(y_i \neq G_m(\mathbf{x}_i))}$
  7. Output $G(\mathbf{x}) = \sum_m \beta_m G_m(\mathbf{x})$
In [16]:
def weightedError(w, y_pred, y_true):
    return (w * (y_pred != y_true)).sum()/w.sum()
In [17]:
def weightedFindBestTheta(w, X, y, d):
    # find the threshold (and polarity) minimizing the weighted error along component d
    n = len(y)
    err = w.sum()+1
    theta = None
    p = None
    xx = jnp.sort(X[:,d])-1e-7
    for t in xx:
        e = weightedError(w, 1.*(X[:,d] < t), y)      # polarity: predict 1 on the left
        if e < err:
            err = e
            theta = t
            p = True
        e = weightedError(w, 1. - (X[:,d] < t), y)    # polarity: predict 1 on the right
        if e < err:
            err = e
            theta = t
            p = False
    if theta is None:
        print('theta failure!!')
    return theta, p, err
In [18]:
class WeightedRandomAxisClassifier():
    def __init__(self, w, X, y, d):
        self.d = d
        self.theta, self.y, self.err = weightedFindBestTheta(w, X, y, d)
    def __call__(self, X):
        return 1.*(X[:,self.d]<self.theta) if self.y else 1.-(X[:,self.d]<self.theta)
In [19]:
class AdaBoost():
    def __init__(self, X, y, nb_cls=5):
        n = len(y)
        w = jnp.ones(n)/n                     # uniform initial sample weights
        self.beta = []
        self.cls = []
        for b in range(nb_cls):
            # fit one weighted stump per component and keep the best one
            err_b = []
            cls_b = []
            for d in range(X.shape[1]):
                c = WeightedRandomAxisClassifier(w, X, y, d)
                cls_b.append(c)
                err_b.append(c.err)
            a = jnp.argmin(jnp.array(err_b))
            c = cls_b[a]
            e = c.err
            beta_m = jnp.log((1-e)/e)             # classifier weight
            w = w * jnp.exp(beta_m * (c(X)!=y))   # increase the weight of misclassified samples
            self.beta.append(beta_m)
            self.cls.append(c)
    def __call__(self, X):
        # weighted vote of the selected stumps
        y = []
        for i, c in enumerate(self.cls):
            y.append(self.beta[i] * c(X))
        return jnp.array(y).mean(axis=0)
In [20]:
T = AdaBoost(X, y, 2)
In [21]:
t = 20; tx = jnp.linspace(0, 1, t); ty = jnp.linspace(0, 1, t)
xv, yv = jnp.meshgrid(tx, ty, sparse=True); xv = xv.squeeze(); yv = yv.squeeze()
xx = jnp.array([[xx, yy] for yy in yv for xx in xv])
y_pred = jnp.array(T(xx)).reshape(t, t)
cmap = plt.get_cmap('PiYG')
levels=jnp.linspace(-1.5, .5, 10)
norm = matplotlib.colors.BoundaryNorm(levels, ncolors=cmap.N, clip=True)
plt.pcolormesh(xv, yv, -y_pred, shading='nearest', norm=norm);
plt.scatter(X[:,0], X[:,1], c=(y>0.5))
[Figure: AdaBoost with 2 classifiers]
In [33]:
T = AdaBoost(X, y, 10)
In [34]:
t = 20; tx = jnp.linspace(0, 1, t); ty = jnp.linspace(0, 1, t)
xv, yv = jnp.meshgrid(tx, ty, sparse=True); xv = xv.squeeze(); yv = yv.squeeze()
xx = jnp.array([[xx, yy] for yy in yv for xx in xv])
y_pred = jnp.array(T(xx)).reshape(t, t)
cmap = plt.get_cmap('PiYG')
levels=jnp.linspace(-1.5, .5, 10)
norm = matplotlib.colors.BoundaryNorm(levels, ncolors=cmap.N, clip=True)
plt.pcolormesh(xv, yv, -y_pred, shading='nearest', norm=norm);
plt.scatter(X[:,0], X[:,1], c=y)
[Figure: AdaBoost with 10 classifiers]
In [22]:
T = AdaBoost(X, y, 50)
In [24]:
t = 20; tx = jnp.linspace(0, 1, t); ty = jnp.linspace(0, 1, t)
xv, yv = jnp.meshgrid(tx, ty, sparse=True); xv = xv.squeeze(); yv = yv.squeeze()
xx = jnp.array([[xx, yy] for yy in yv for xx in xv])
y_pred = jnp.array(T(xx)).reshape(t, t)
cmap = plt.get_cmap('PiYG')
levels=jnp.linspace(-1.5, .5, 10)
norm = matplotlib.colors.BoundaryNorm(levels, ncolors=cmap.N, clip=True)
plt.pcolormesh(xv, yv, -y_pred, shading='nearest', norm=norm);
plt.scatter(X[:,0], X[:,1], c=(y>0.5))
[Figure: AdaBoost with 50 classifiers]

Gradient Tree Boosting¶

Given a classifier $\hat{y}_i = \sum_k f_k(x_i)$ built as a sum of $K$ trees, consider the objective

$$\mathcal{L} = \sum_i l(y_i, \hat{y}_i) + \sum_{k=1}^K\Omega(f_k), \quad \Omega(f) = \gamma T + \frac{1}{2}\lambda\|w\|^2 $$

where $T$ is the number of leaves of a tree and $w$ its vector of leaf weights.

Second order approximation when adding tree $f_t$

$$\mathcal{L}^t \approx \sum_i\left[l(y_i, \hat{y}_i) + g_i f_t(x_i) + \frac{1}{2}h_i f_t^2(x_i)\right] +\Omega(f_t)$$

$g_i = \partial_{\hat{y}_i}l(y_i, \hat{y}_i)$, $h_i = \partial_{\hat{y}_i}^2l(y_i, \hat{y}_i)$

Remove the constant term and define $I_j = \{i \,\vert\, q(x_i) = j\}$, the set of samples falling in leaf $j$ (where $q$ maps a sample to its leaf)

$$\tilde{\mathcal{L}} = \sum_j^T\left[w_j\sum_{i\in I_j}g_i +\frac{w_j^2}{2}(\lambda + \sum_{i\in I_j}h_i)\right] + \gamma T$$

Optimal weight $$w_j^\star = -\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j} h_i + \lambda}$$

Optimal objective value $$\mathcal{L}^\star = -\frac{1}{2}\sum_j^T\frac{(\sum_{i\in I_j}g_i)^2}{\sum_{i\in I_j}h_i + \lambda} +\gamma T $$

Provides a criterion for splitting nodes:

  • Letting $I = I_L \cup I_R$

$$ \mathcal{L}_\text{split} = \frac{1}{2}\left[\frac{(\sum_{i\in I_L}g_i)^2}{\sum_{i\in I_L} h_i + \lambda} + \frac{(\sum_{i\in I_R}g_i)^2}{\sum_{i\in I_R} h_i + \lambda} - \frac{(\sum_{i\in I}g_i)^2}{\sum_{i\in I} h_i + \lambda} \right] - \gamma$$

  • if $<0$, do not split
  • can be $<0$ because of $\gamma$ (regularisation)
  • In practice, randomizing features and samples allows very large trees
  • taking a sub-optimal weight $\eta w_j^\star$ (shrinkage with learning rate $\eta$) leaves more room for subsequent splits.

Very efficient implementations (e.g., XGBoost, (Chen et al, 2016))
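A minimal sketch of the optimal leaf weight and split criterion above, given per-sample gradients $g_i$ and Hessians $h_i$ (here for the squared loss, where $g_i = \hat{y}_i - y_i$ and $h_i = 1$); this is only an illustration, not XGBoost's implementation:

import numpy as np

def leaf_weight(g, h, lam=1.0):
    # w* = -sum(g) / (sum(h) + lambda)
    return -g.sum() / (h.sum() + lam)

def split_gain(g, h, left, lam=1.0, gamma=0.1):
    # gain of splitting leaf I into I_L (mask `left`) and I_R
    def score(gs, hs):
        return gs.sum()**2 / (hs.sum() + lam)
    return 0.5 * (score(g[left], h[left]) + score(g[~left], h[~left])
                  - score(g, h)) - gamma

# toy example with the squared loss l = (y - y_hat)^2 / 2: g = y_hat - y, h = 1
y = np.array([0., 0., 1., 1.])
y_hat = np.zeros(4)                  # current ensemble prediction
g, h = y_hat - y, np.ones(4)
left = np.array([True, True, False, False])
print(leaf_weight(g, h), split_gain(g, h, left))  # split only if the gain is positive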

Remarks¶

Classifiers have to be weak

  • A perfect classifier yields $\beta_m = \infty$ breaking subsequent classifiers

  • Example: Kernel SVM with sufficiently tight Gaussian kernel

Classifiers have to be slightly better than random

  • Worse than random gets a negative weight (it acts as the opposite classifier)

  • Example: single feature classifier

Boosted Decision Trees are an excellent first try in most cases

Exercise¶

Show that $$ f^\star(\mathbf{x}) = \arg\min_f \mathbb{E}_{\sim(\mathbf{x},y)}\left[ e^{-yf(\mathbf{x})} \right]= \frac{1}{2}\log\frac{P[y = 1 | \mathbf{x}]}{P[y=-1|\mathbf{x}]} $$

$$ \begin{aligned} \frac{\partial \mathbb{E}_{\sim(\mathbf{x},y)}\left[ e^{-yf(\mathbf{x})} \right]}{\partial f} &= \mathbb{E}_{\sim(\mathbf{x},y)}\left[ -y e^{-yf(\mathbf{x})} \right] = 0\\ \mathbb{E}\left[ -ye^{-yf(\mathbf{x})} \right] &= -e^{-f(\mathbf{x})}P[y=1|\mathbf{x}] + e^{f(\mathbf{x})}P[y=-1|\mathbf{x}]\\ e^{2f(\mathbf{x})} &= \frac{P[y=1|\mathbf{x}]}{P[y=-1|\mathbf{x}]} \end{aligned} $$

Decision Trees and Ensemble Learning, take home¶

Decision Trees

  • Simple

  • Fast

  • Explainable (domain experts understand the decision process)

  • Handle categorical data (or even mixed)

But

  • Overfit, unstable

  • Require massive amount of data

Random forest

  • Simple, Fast, handle categorical data

  • Stable

  • No longer explainable

Ensemble

  • Bagging: simple solution, good idea to reduce bias

  • Boosting: optimized combination

Large literature and many libraries on boosting

  • Boosted trees: a very good default classifier in many cases (see Grinsztajn et al, 2022)