Statistical Machine Learning
UoC Stats 37700, Winter quarter
Lecture 1: Introduction. Decision Trees.
What is machine learning?
Machine Learning is traditionally classified as part of computer science.
A short history note:
1940's, 1950's: first computers are created; early on, strong belief that one day computers will be intelligent (e.g.: Alan Turing's imitation game, a.k.a. Turing test, 1950). Mathematical formalism grounded in logic and symbolic calculus.
1960's, 1970's: development of symbolic-reasoning artificial intelligence based on formalism and rule inference. Rules are learned from data but the statistical analysis is almost nonexistent. The results obtained in practice fall short of the initial expectations, and progress stalls.
Introduction
Brief history
1980's: development of artificial neural networks: a clear departure from symbolic-based AI (early version: Rosenblatt's perceptron, 1957) that brings forth some successes. The ambitions are more modest.
1990's-2000's: development of statistical learning methods: decision trees, kernel methods... The mathematical formalism of these methods is now more firmly grounded in probability, statistics, information theory and analysis (e.g. optimization).
Now: a certain inching towards more ambitious goals...
A limited goal: classification
Typical machine learning problem: classification.
The task is to classify an unknown object x ∈ X into one category of a certain set Y = {1, 2, ..., c} (labels).
Examples:
(Handwritten) character recognition: x is a grayscale image, Y is the list of possible characters or digits.
Medical diagnosis: x is a set of medical observations (numerical or categorical), Y = {benign, malignant} (for example).
Recognition of coding/non-coding gene sequences.
Junk email automatic sorting.
Introduction
Some specific goals
Supervised and unsupervised learning
Supervised learning: the learning stage is to construct (in an automatic way) such a classification method from a set of examples whose class is already known: the training sample.
Unsupervised learning: no labels are available from the training sample. We want to extract some relevant information, for example a separation into clusters (a kind of classification without predefined classes).
Some formalization
A classifier is a function f : X → Y.
The training sample is S = ((X_1, Y_1), ..., (X_n, Y_n)).
A learning method is a mapping S → (f̂ : X → Y).
How can we (theoretically) assess if f̂ is a good classifier? Test it on new examples.
Formalization and first approaches
Some definitions
Probabilistic framework: the performance of the classifier is theoretically measured by the average percentage of classification errors committed on an unknown 'test' object (X, Y) drawn at random:
E(f̂) = E_{(X,Y)∼P}[1{f̂(X) ≠ Y}] = P(f̂(X) ≠ Y).
This is called the generalization error.
Learning from examples makes sense if we assume that the sample S contains some information on the test objects.
Simplest assumption: S = ((X_i, Y_i))_{1≤i≤n} are drawn from the same distribution P, independently.
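Under this i.i.d. assumption, the generalization error of a fixed classifier can be approximated by its empirical error on a fresh test sample. A minimal sketch in Python (the toy distribution P below, and the noise level 0.1, are invented purely for illustration):

```python
import random

def empirical_error(f, sample):
    """Fraction of misclassified points in sample = [(x, y), ...]."""
    return sum(1 for x, y in sample if f(x) != y) / len(sample)

# Toy P: X uniform on [0, 1], Y = 1{X > 0.5} flipped with probability 0.1.
random.seed(0)
def draw(n):
    out = []
    for _ in range(n):
        x = random.random()
        y = int(x > 0.5)
        if random.random() < 0.1:   # label noise
            y = 1 - y
        out.append((x, y))
    return out

f = lambda x: int(x > 0.5)   # the Bayes classifier for this toy P
test = draw(10000)
err = empirical_error(f, test)   # close to the Bayes error, 0.1
```

By the law of large numbers the empirical error converges to E(f) as the test sample grows, which is what justifies "test it on new examples" above.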
Machine learning = statistics?
Obviously, with this formalism machine learning is very close to traditional statistics:
Classification → Regression
Unsupervised learning → Density estimation
Emphasis of machine learning on:
complex data: high dimensional, non-numerical, structured;
very few modelling assumptions on the distribution of the data;
nonparametric methods coming from various sources of inspiration.
Formalization and first approaches
Machine learning and statistics
The Bayes classifier
Assuming the probabilistic framework: (X, Y) drawn according to a distribution P, what is the best possible classifier?
Represent P under the form
P(x, y) = P(X = x) P(Y = y | X = x) = µ(x) η(Y = y | X = x).
For any fixed x, the best possible deterministic classification for x is to output the class having the largest conditional probability given X = x:
f*(x) = Arg Max_{y∈Y} η(Y = y | X = x).
This is the Bayes classifier; the Bayes error is
L* := E(f*) = E[1{f*(X) ≠ Y}].
Important: the Bayes classifier f* is entirely determined by the function η only.
Formalization and first approaches
The classification problem
Plug-in rules
One way to construct a classifier is therefore to estimate the function η by some η̂.
Given an estimator η̂, it is natural to consider the classifier
f̂(x) = Arg Max_{y∈Y} η̂(Y = y | X = x).
This is called a plug-in classifier.
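The plug-in construction itself is one line: take the Arg Max of the estimated conditional probabilities. A sketch in Python, where `eta_hat` stands for any estimate of η (the sigmoid below is a hypothetical stand-in, not a fitted model):

```python
import math

def plugin_classifier(eta_hat, labels):
    """Build the plug-in rule f_hat(x) = ArgMax_y eta_hat(y, x)
    from an estimate eta_hat(y, x) of P(Y = y | X = x)."""
    return lambda x: max(labels, key=lambda y: eta_hat(y, x))

# Hypothetical binary estimate on X = R: eta_hat(1 | x) is a sigmoid in x.
sigmoid = lambda x: 1 / (1 + math.exp(-x))
eta_hat = lambda y, x: sigmoid(x) if y == 1 else 1 - sigmoid(x)
f = plugin_classifier(eta_hat, labels=[0, 1])
```

Any of the estimation strategies on the following slides (logistic regression, class-density estimation, decision trees) just supplies a different `eta_hat` to this same construction.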
Formalization and first approaches
Plug-in rules
Quality of plug-in rules
We can relate the performance of the estimator η̂ to the performance of the corresponding plug-in rule:
E(f̂) − E(f*) ≤ E[1{f̂(X) ≠ f*(X)} Σ_y |η̂(y|X) − η(y|X)|]
≤ E[Σ_y |η̂(y|X) − η(y|X)|].
In the binary classification case:
E(f̂) − E(f*) = E[1{f̂(X) ≠ f*(X)} |2η(1|X) − 1|]
≤ 2 ‖η̂ − η‖_1.
Logistic regression for η
If Y = {0, 1} (binary classification), there is one classical way to estimate η: logistic regression: estimate instead
γ(x) = logit(η(1|x)) = log [ η(1|x) / η(0|x) ],
the log-odds ratio.
Advantage: η ∈ [0, 1] but γ ∈ R, therefore you can apply your favorite regression method to estimate γ.
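As a sketch of the simplest case (not from the slides): take a linear model γ(x) = a·x + b for the log-odds on X = R and fit it by gradient descent on the negative log-likelihood; the plug-in rule then predicts 1 exactly when the estimated log-odds is positive, i.e. η̂(1|x) > 1/2. The data below is synthetic, for illustration only:

```python
import math, random

def fit_logistic(data, steps=2000, lr=0.1):
    """Fit gamma(x) = a*x + b by gradient descent on the logistic
    negative log-likelihood; returns (a, b)."""
    a, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        ga = gb = 0.0
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(a * x + b)))  # eta_hat(1 | x)
            ga += (p - y) * x / n                      # d(NLL)/da
            gb += (p - y) / n                          # d(NLL)/db
        a -= lr * ga
        b -= lr * gb
    return a, b

random.seed(1)
data = [(x, int(x > 0.0)) for x in (random.gauss(0, 1) for _ in range(200))]
a, b = fit_logistic(data)
# Plug-in rule: predict 1 when the estimated log-odds is positive.
f = lambda x: int(a * x + b > 0)
```

Note that thresholding γ̂ at 0 is the same as thresholding η̂(1|x) at 1/2, since logit is increasing.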
Class density estimation
Another classical way to go: estimate separately the density of each class,
g_y(x) dx = dP(X = x | Y = y).
Generally model-based; for example, each class is modelled by a mixture of Gaussians,
g_y(x) = Σ_{i=1}^{m_y} p_{y,i} φ_{µ_{y,i}, Σ_{y,i}}(x),
which can be estimated via the EM algorithm, for example.
Then estimate the marginal probabilities of each class:
c_y = P(Y = y);
if the above are estimated by ĝ and ĉ, the plug-in rule is by definition
f̂(x) = Arg Max_{y∈Y} ĉ_y ĝ_y(x).
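A minimal sketch of this recipe in Python, simplified to one Gaussian component per class (m_y = 1, so plain maximum likelihood suffices and no EM is needed); the two-cluster data is synthetic, for illustration:

```python
import math, random

def fit_gaussian(xs):
    """Maximum-likelihood mean and variance of a 1-D Gaussian."""
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / len(xs)
    return m, v

def gaussian_pdf(x, m, v):
    return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

def fit_class_density_plugin(data, labels):
    """Estimate g_hat_y (one Gaussian per class) and c_hat_y = P_hat(Y = y),
    then return the plug-in rule ArgMax_y c_hat_y * g_hat_y(x)."""
    params, priors = {}, {}
    for y in labels:
        xs = [x for x, yy in data if yy == y]
        params[y] = fit_gaussian(xs)
        priors[y] = len(xs) / len(data)
    return lambda x: max(labels,
                         key=lambda y: priors[y] * gaussian_pdf(x, *params[y]))

random.seed(2)
data = ([(random.gauss(-2, 1), 0) for _ in range(100)]
        + [(random.gauss(+2, 1), 1) for _ in range(100)])
f = fit_class_density_plugin(data, labels=[0, 1])
```

Replacing `fit_gaussian` by an EM fit of a Gaussian mixture gives the general m_y > 1 case of the slide.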
Decision trees
Decision trees are a way of defining classifiers that are somehow descendants of rule-based classification methods from symbolic approaches to ML.
Can be used for classification and for regression.
Different variants: CART (Breiman, Friedman, Olshen and Stone), C4.5, ID3 (Quinlan).
Not the best method available nowadays in terms of generalization error, but still very popular because it provides the user with an explanation of the decision rule.
Decision trees
Introduction
The shape of a decision tree: the 20-questions game.
Formally, a decision tree is:
a binary tree
whose internal nodes are labeled by questions q ∈ Q;
whose terminal nodes are labeled by a decision (e.g. for classification: some y ∈ Y).
Decision trees
Formulation
Formally, a question is a function q : X → {left, right}.
Note: so, questions can be identified with (binary) classifiers... A decision tree is a way to combine elementary classifiers.
Standard choice of questions: when x is a collection of numerical and/or categorical data,
x = (x_1, ..., x_k), with x_i ∈ R or x_i ∈ C_i = {C_{i,1}, ..., C_{i,n_i}},
consider questions of the form
q(x) = 1{x_i > t} if x_i is numerical
(where t is some real threshold) or
q(x) = 1{x_i ∈ C} if x_i is categorical
(where C is some subset of C_i).
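These two families of questions are easy to write down directly as functions; a sketch in Python, where `x` is a tuple of coordinates (the example coordinates and categories are made up):

```python
# Questions as binary classifiers mapping x to "left" or "right".
def numeric_question(i, t):
    """q(x) = 1{x_i > t} for a numerical coordinate i and threshold t."""
    return lambda x: "right" if x[i] > t else "left"

def categorical_question(i, subset):
    """q(x) = 1{x_i in C} for a categorical coordinate i and subset C of C_i."""
    return lambda x: "right" if x[i] in subset else "left"

# Toy usage on x = (a number, a color):
q1 = numeric_question(0, 5.0)
q2 = categorical_question(1, {"red", "blue"})
```

Growing a tree then amounts to picking one such question per internal node.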
Choosing the decision when the tree structure is fixed
Assume for now that the tree structure and the questions are fixed. What is the best decision to take on each of the leaves (based on the available data)?
Let the training data fall down the tree and, for each leaf, pick the majority class among those data points having reached that leaf.
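The majority-vote decision at a leaf is a one-liner; a sketch in Python:

```python
from collections import Counter

def leaf_decision(labels_at_leaf):
    """Majority class among the training labels that reached a leaf."""
    return Counter(labels_at_leaf).most_common(1)[0][0]
```

This is exactly the decision that minimizes the training error on the points in that leaf.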
Growing a decision tree
Now, how can we choose the structure of the tree itself and the questions?
Assume we want to build a decision tree of size k (the size is the number of leaves).
One standard way to choose a classifier belonging to a certain set F (here, trees of size k) is to find the minimizer of the training (or empirical) classification error:
f̂ = Arg Min_{f∈F} Ê(S, f) = Arg Min_{f∈F} (1/n) Σ_{i=1}^n 1{f(X_i) ≠ Y_i}.
This is known as Empirical Risk Minimization (ERM).
Unfortunately, in the case of trees this is intractable from a practical point of view.
Decision trees
Growing a tree
Greedy growing
An alternative to global optimization: greedy construction.
• Start with a tree reduced to its root (a constant classifier!).
• Choose the question that results in the largest reduction in the empirical risk when splitting the root into two subnodes.
• Iterate this procedure for splitting the subnodes in turn.
Unfortunately, in the case of classification a new problem arises: it can happen that no available question leads to any local improvement.
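One greedy step can be sketched in a few lines of Python: score each candidate question by the number of training errors left after splitting and labeling each side with its majority class, then keep the best one (the 1-D toy data below is invented for illustration):

```python
def split_error(question, data):
    """Training errors after splitting `data` with `question` and
    labeling each side by its majority class."""
    sides = {"left": [], "right": []}
    for x, y in data:
        sides[question(x)].append(y)
    errors = 0
    for labels in sides.values():
        if labels:
            majority = max(set(labels), key=labels.count)
            errors += sum(1 for y in labels if y != majority)
    return errors

def best_question(questions, data):
    """Greedy step: the question giving the largest empirical-risk reduction."""
    return min(questions, key=lambda q: split_error(q, data))

# Toy usage: 1-D points labeled by their sign, threshold questions.
data = [(-2, 0), (-1, 0), (1, 1), (2, 1)]
q_at = lambda t: (lambda x: "right" if x > t else "left")
best = best_question([q_at(-1.5), q_at(0.0), q_at(1.5)], data)
```

The problem mentioned above shows up when `split_error` of every candidate equals the error of the unsplit node, which motivates the impurity functions of the next slide.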
Impurity functions
As a function of the (estimated) class probabilities (p_i), the (training) error of a locally constant classifier is piecewise linear; this is the source of the latter problem.
Idea: replace it by a strictly concave impurity function I((p_i)).
Once an impurity function I has been chosen, the greedy criterion to choose a question is to find the minimum of
N_left I((p_{i,left})) + N_right I((p_{i,right}));
the strict concavity of I then implies that this is always strictly smaller than
N_tot I((p_{i,tot}))
whenever (p_{i,left}) ≠ (p_{i,right}).
Classical choices for I: (1) Entropy: H((p_i)) = −Σ_i p_i log p_i; (2) Gini criterion: G((p_i)) = −Σ_i p_i².
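Both impurity functions and the weighted split criterion above can be sketched directly in Python (the Gini function below is written as 1 − Σ p_i², which differs from the slide's −Σ p_i² only by the constant 1, so that pure leaves get impurity 0):

```python
import math

def entropy(ps):
    """H((p_i)) = -sum_i p_i log p_i; terms with p_i = 0 contribute 0."""
    return -sum(p * math.log(p) for p in ps if p > 0)

def gini(ps):
    """Gini impurity, shifted by the constant 1 so pure leaves score 0."""
    return 1 - sum(p * p for p in ps)

def split_score(I, n_left, ps_left, n_right, ps_right):
    """Greedy criterion N_left * I(p_left) + N_right * I(p_right)."""
    return n_left * I(ps_left) + n_right * I(ps_right)
```

With either choice, a split that separates the classes even slightly strictly lowers the score, which is what rescues the greedy procedure from the flat regions of the piecewise-linear training error.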
Note: a classification tree can be seen as a plug-in rule wherein the function η is estimated by a function that is constant on the leaves of the tree.
In this regard, the entropy criterion is the natural cost function when estimating η on such a model via Maximum Likelihood (greedy maximum likelihood).
Similarly, using the Gini criterion is equivalent to locally minimizing a kind of least-squares error:
ℓ(η, x, y) = Σ_{y'∈Y} (1{y = y'} − η(x, y'))².
We now have a reasonable way to construct a tree structure
recursively.
But when should we stop growing the tree?
We could keep growing the tree until each leaf only contains data
points of one single class...
Unfortunately this is not a very good idea:why?
Decision trees
Tree pruning
Overfitting
A classification problem (here totally random data).
Overfitting
Output of a maximally grown decision tree.
Underfitting and overfitting
Informally, there is a trade-off to be found between the complexity of a classifier and the amount of data available.
One first idea: stop if a split leads to a leaf containing fewer than some minimum number of data points. This might not be the best idea.
More interesting idea: complexity regularization: find a trade-off between the empirical risk Ê(S, f) and the complexity (tree size) of f.
Grow a tree T of maximal size using the greedy procedure, then select a subtree T' ⊂ T optimizing the following regularized error:
Arg Min_{T'⊂T} Ê(S, T') + λ|T'| =: R_λ(T');
this is called pruning.
Note: λ has to be chosen, too! But let us assume for now that it is fixed.
Interestingly, when the maximal tree T is fixed, the problem of finding the optimal subtree minimizing the previous regularized criterion is tractable.
If T̂_λ denotes the pruned tree for a fixed λ, we have
R_λ(T̂_λ) = min( R_λ(T_root), R_λ(T̂_left,λ) + R_λ(T̂_right,λ) ),
which can be applied as a recursive principle.
Furthermore,
λ_1 ≥ λ_2 ⇒ T̂_{λ_1} ⊂ T̂_{λ_2}.
Hence, as λ grows from 0 to +∞, we have a decreasing sequence of pruned trees
T = T̂_0 ⊃ T̂_{λ_1} ⊃ ... ⊃ T_root,
which is easily computable.
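The recursion above translates directly into a short dynamic program; a sketch in Python, under the simplifying assumption (mine, for illustration) that each node stores `err`, the number of training errors it would commit if collapsed to a leaf:

```python
class Tree:
    """A fitted binary tree: `err` = training errors if this node is made a
    leaf; `left`/`right` are subtrees, None for an actual leaf."""
    def __init__(self, err, left=None, right=None):
        self.err, self.left, self.right = err, left, right

def prune(tree, lam):
    """Return (pruned subtree, its R_lambda value), where
    R_lambda(T) = training errors of T + lam * (number of leaves of T)."""
    leaf_cost = tree.err + lam            # cost of collapsing to one leaf
    if tree.left is None:                 # already a leaf
        return tree, leaf_cost
    l, cl = prune(tree.left, lam)         # R_lambda(T_hat_left,lambda)
    r, cr = prune(tree.right, lam)        # R_lambda(T_hat_right,lambda)
    if cl + cr < leaf_cost:               # keep the split only if it pays off
        return Tree(tree.err, l, r), cl + cr
    return Tree(tree.err), leaf_cost      # otherwise prune to a leaf
```

Running `prune` once per node makes the whole pass linear in the size of the maximal tree, which is why the pruned sequence over all λ is cheap to compute.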