Machine Learning
Major Approaches
Joakim Nivre
Modified and extended by
Oscar Täckström
Uppsala University and Växjö University, Sweden
nivre@msi.vxu.se
oscar@sics.se
Machine Learning 1(33)
Approaches to Machine Learning
- Decision trees
- Artificial neural networks
- Bayesian learning
- Instance-based learning
- Genetic algorithms
Decision Tree Learning
- Decision trees classify instances by sorting them down the tree
  from the root to some leaf node, where:
  1. Each internal node specifies a test of some attribute.
  2. Each branch corresponds to a value for the tested attribute.
  3. Each leaf node provides a classification for the instance.
- Decision trees represent a disjunction of conjunctions of
  constraints on the attribute values of instances.
  1. Each path from root to leaf specifies a conjunction of tests.
  2. The tree itself represents the disjunction of all paths.
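The two views above (sorting an instance down the tree, and reading the tree as a disjunction of conjunctions) can be sketched as follows. This is a minimal illustration, not a reference implementation; the attributes `outlook` and `windy` and the labels are hypothetical.

```python
# A decision tree as nested dicts: an internal node names the attribute it
# tests and maps each attribute value to a subtree; a leaf is a bare label.
# The attributes and labels below are hypothetical.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "windy",
            "branches": {"yes": "stay_in", "no": "go_out"},
        },
        "rainy": "stay_in",
    },
}


def classify(node, instance):
    """Sort an instance down the tree from the root to a leaf."""
    while isinstance(node, dict):          # internal node: apply its test
        value = instance[node["attribute"]]
        node = node["branches"][value]     # follow the matching branch
    return node                            # leaf: the classification


def paths(node, tests=()):
    """Yield (conjunction_of_tests, label) for every root-to-leaf path."""
    if not isinstance(node, dict):
        yield tests, node
        return
    for value, child in node["branches"].items():
        yield from paths(child, tests + ((node["attribute"], value),))
```

Each tuple produced by `paths` is one conjunction of tests; the set of all paths that end in the same label is the disjunction the tree represents for that class.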
Example: Name Recognition
[Figure: decision tree with the tests "Capitalized?" and "Sentence-initial?", Yes/No branches, and leaves labeled 0 and 1]
Appropriate Problems for Decision Tree Learning
- Instances are represented by attribute-value pairs.
- The target function has discrete output values.
- Disjunctive descriptions may be required.
- The training data may contain errors.
- The training data may contain missing attribute values.
The ID3 Learning Algorithm
- ID3(X = instances, Y = classes, A = attributes):
  1. Create a root node R for the tree.
  2. If all instances in X are in class y, return R with label y.
  3. Else let the decision attribute for R be the attribute $a \in A$ that
     best classifies X, and for each value $v_i$ of a:
     3.1 Add a branch below R for the test $a = v_i$.
     3.2 Let $X_i$ be the subset of X that have $a = v_i$. If $X_i$ is empty,
         then add a leaf labeled with the most common class in X; else
         add the subtree ID3($X_i$, Y, $A \setminus \{a\}$).
  4. Return R.
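The recursion above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names, the tuple-based tree representation, and the `values` table of possible attribute values are illustrative choices, and the extra base case for an empty attribute list is an addition that the pseudocode does not spell out. The toy data loosely follows the name-recognition example and is hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of the class labels, estimated from frequency counts."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attr):
    """IG = H(X) - H(X | attr) for one attribute."""
    n = len(rows)
    remainder = 0.0
    for value in {row[attr] for row in rows}:
        subset = [y for row, y in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def id3(rows, labels, attributes, values):
    """Return a leaf label or an (attribute, {value: subtree}) node.

    `values` maps each attribute to all of its possible values, so that
    empty subsets X_i (step 3.2) can occur and get a majority-class leaf.
    """
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:          # step 2: all instances in one class
        return labels[0]
    if not attributes:                 # assumption: no attributes left
        return majority
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    branches = {}
    for v in values[best]:             # step 3.1: one branch per value
        idx = [i for i, row in enumerate(rows) if row[best] == v]
        if not idx:                    # step 3.2: empty subset
            branches[v] = majority
        else:
            branches[v] = id3(
                [rows[i] for i in idx],
                [labels[i] for i in idx],
                [a for a in attributes if a != best],
                values,
            )
    return (best, branches)


# Hypothetical toy data: is a token a name (1) or not (0)?
rows = [
    {"cap": "yes", "initial": "no"},
    {"cap": "yes", "initial": "yes"},
    {"cap": "no", "initial": "no"},
    {"cap": "no", "initial": "yes"},
]
labels = [1, 0, 0, 0]
values = {"cap": ["yes", "no"], "initial": ["yes", "no"]}
tree = id3(rows, labels, ["cap", "initial"], values)
```

On this data both attributes have the same information gain at the root, so `max` keeps the first one listed; a different tie-breaking rule would give a different but equally consistent tree.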
Selecting the Best Attribute (1)
- ID3 uses the measure Information Gain (IG) to decide which
  attribute A best classifies a set of examples X:
  \[ IG(X, A) = H(X) - H(X \mid A) \]
  The entropy and conditional entropy are defined as follows:
  \[ H(X) = -\sum_{y \in Y} P(Y = y) \log_2 P(Y = y) \]
  \[ H(X \mid A) = \sum_{a \in V_A} P(A = a) \, H(X \mid A = a) \]
  where $V_A$ is the set of possible values for attribute A and the
  specific conditional entropy is defined as:
  \[ H(X \mid A = a) = -\sum_{y \in Y} P(Y = y \mid A = a) \log_2 P(Y = y \mid A = a) \]
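As a worked illustration of these definitions, assume a hypothetical set X of 8 examples, 4 per class, and a binary attribute A whose value $a_1$ covers 4 examples (3 positive, 1 negative) and whose value $a_2$ covers the other 4 (1 positive, 3 negative). The numbers are made up for the example.

```latex
\begin{align*}
H(X) &= -\tfrac{4}{8}\log_2\tfrac{4}{8} - \tfrac{4}{8}\log_2\tfrac{4}{8} = 1 \\
H(X \mid A = a_1) &= -\tfrac{3}{4}\log_2\tfrac{3}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} \approx 0.811 \\
H(X \mid A = a_2) &= -\tfrac{1}{4}\log_2\tfrac{1}{4} - \tfrac{3}{4}\log_2\tfrac{3}{4} \approx 0.811 \\
H(X \mid A) &= \tfrac{4}{8}\cdot 0.811 + \tfrac{4}{8}\cdot 0.811 \approx 0.811 \\
IG(X, A) &= H(X) - H(X \mid A) \approx 1 - 0.811 = 0.189
\end{align*}
```

An attribute that split the classes perfectly would give $H(X \mid A) = 0$ and therefore $IG(X, A) = H(X) = 1$.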
Selecting the Best Attribute (2)
- An alternative measure is Gain Ratio (GR):
  \[ GR(X, A) = \frac{IG(X, A)}{H(A)}, \]
  where the attribute entropy is given by:
  \[ H(A) = -\sum_{a \in V_A} P(A = a) \log_2 P(A = a) \]
- The probabilities are usually estimated by simple frequency
  counts:
  \[ P(A = a) = \frac{|X_a|}{|X|}, \]
  where $X_a$ is the set of examples with attribute A = a.
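A minimal sketch of the gain ratio with frequency-count estimates, assuming each attribute is given as a column of values aligned with the class labels. The function names and the toy columns are hypothetical. The sketch illustrates the effect of dividing by H(A): an attribute with many distinct values scores lower than an equally informative binary attribute.

```python
import math
from collections import Counter


def entropy(items):
    """Entropy of a sequence, with probabilities from frequency counts."""
    n = len(items)
    return -sum(c / n * math.log2(c / n) for c in Counter(items).values())


def gain_ratio(attr_values, labels):
    """GR = IG / H(A) for one attribute column and the class labels."""
    n = len(labels)
    conditional = 0.0
    for a in set(attr_values):
        subset = [y for v, y in zip(attr_values, labels) if v == a]
        conditional += len(subset) / n * entropy(subset)   # P(A=a) H(X|A=a)
    info_gain = entropy(labels) - conditional
    attr_entropy = entropy(attr_values)                    # H(A)
    # Assumption: an attribute with a single value gets ratio 0.
    return info_gain / attr_entropy if attr_entropy > 0 else 0.0


labels = ["+", "+", "-", "-"]
binary = ["a", "a", "b", "b"]       # splits the classes perfectly
unique_id = ["1", "2", "3", "4"]    # also "perfect", but only by memorizing
```

Both columns have the same information gain (1 bit), but `unique_id` has attribute entropy 2 bits, so its gain ratio is half that of `binary`.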
Hypothesis Space Search and Inductive Bias
- Characteristics of ID3:
  1. Searches a complete hypothesis space of target functions
  2. Maintains a single current hypothesis throughout the search
  3. Performs a hill-climbing search (susceptible to local optima)
  4. Uses all training examples at each step in the search
- Inductive bias:
  1. Prefers shorter trees over longer ones
  2. Prefers trees with informative attributes close to the root
  3. Preference bias (incomplete search of complete space)
Overfitting
- A hypothesis h is overfitted to the training data if there exists
  an alternative hypothesis $h'$ with higher training error but
  lower test error.
- Caused by learning structure in the data that occurs by chance
- Less training data increases the problem of overfitting
- Noisy training data increases the problem of overfitting
- Complex hypotheses are more prone to overfitting
Regularization
- Overfitting can be prevented by limiting hypothesis complexity
- Generally a trade-off between fit to data and complexity
- Main methods for regularization:
  1. Early stopping: stop learning before overfitting
  2. Pruning: learn "everything in the data" and then simplify
  3. Averaging: average different hypotheses
  4. Regularization term: hypothesis score $\propto$ fitness score and
     inverse complexity score
  5. Prior distribution: a priori prefer simpler hypotheses
- Can be viewed as adding a preference bias to the algorithm
- Generally involves a hyperparameter controlling the trade-off
  and requires validation using held-out data.
- Decision tree learning is often regularized by early stopping or
  pruning
Artificial Neural Networks
- Learning methods based on artificial neural networks (ANN)
  are suitable for problems with the following characteristics:
  1. Instances are represented by many attribute-value pairs.
  2. The target function may be discrete-valued, real-valued, or a
     vector of real- or discrete-valued attributes.
  3. The training examples may contain errors.
  4. Long training times are acceptable.
  5. Fast evaluation of the learned target function may be required.
  6. The ability of humans to understand the learned target
     function is not important.
Perceptron Learning
- A perceptron takes a vector of real-valued inputs $x_1, \ldots, x_n$
  and outputs 1 or $-1$:
  \[ o(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n > 0 \\ -1 & \text{otherwise} \end{cases} \]
- Learning a perceptron involves learning the weights
  $w_0, \ldots, w_n$.
- Begin with random weights, iteratively apply the perceptron
  to each training instance with target output t, modifying the
  weights on misclassification (with learning rate $\eta$):
  \[ w_i \leftarrow w_i + \Delta w_i, \qquad \Delta w_i = \eta (t - o) x_i \]
- Provably converges when data is linearly separable, though
  not necessarily with good generalization ability
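The training rule above can be sketched as follows. This is a minimal illustration, not a tuned implementation: the function names, the initial weight range, the learning rate, the epoch limit, and the AND data set are all assumptions made for the example.

```python
import random


def predict(weights, x):
    """Threshold unit: 1 if w0 + sum(w_i * x_i) > 0, else -1."""
    total = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1 if total > 0 else -1


def train_perceptron(data, eta=0.1, max_epochs=1000, seed=0):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i."""
    rng = random.Random(seed)
    n = len(data[0][0])
    weights = [rng.uniform(-0.5, 0.5) for _ in range(n + 1)]
    for _ in range(max_epochs):
        errors = 0
        for x, t in data:
            o = predict(weights, x)
            if o != t:                                  # misclassification
                errors += 1
                weights[0] += eta * (t - o)             # bias input x_0 = 1
                for i, xi in enumerate(x, start=1):
                    weights[i] += eta * (t - o) * xi
        if errors == 0:      # a full pass with no mistakes: converged
            break
    return weights


# Logical AND with targets in {-1, 1}: a linearly separable problem.
and_data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w = train_perceptron(and_data)
```

Because AND is linearly separable, the loop stops once a full pass makes no mistakes. On a non-separable problem such as XOR the same loop would simply run until `max_epochs`.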
[Figures: Perceptron illustration (1) to (4)]
Linearly non-separable problems
- Always try linear models before non-linear ones
- Linear models are less prone to overfitting
- Linear models are typically much faster at training and prediction
Multi-Layer Networks and Backpropagation
- While the simple perceptron can only learn linearly separable
  functions, multi-layer networks can approximate any function.
- The most common learning algorithm is backpropagation,
  which learns the weights for a multi-layer network with a fixed
  set of units and interconnections, using gradient descent in an
  attempt to minimize the squared error loss.
- Backpropagation over multi-layer networks is only guaranteed
  to converge toward a local minimum.
- The inductive bias of backpropagation learning can be roughly
  characterized as smooth interpolation between data points.
- Regularization is usually done by early stopping or by pruning
  through weight decay.
Bayesian Learning
- Two reasons for studying Bayesian learning methods:
  1. Efficient learning algorithms for certain kinds of problems
  2. Analysis framework for other kinds of learning algorithms
- Features of Bayesian learning methods:
  1. Assign probabilities to hypotheses (not accept or reject)
  2. Combine prior knowledge with observed data
  3. Permit hypotheses that make probabilistic predictions
  4. Permit predictions based on multiple hypotheses, weighted by
     their probabilities
- Regularization can be performed by smoothing or, more
  generally, by assuming a suitable prior distribution over the
  parameters
Learning as Estimation
- Bayes' theorem (hypothesis space H, $h \in H$, training data D):
  \[ P(h \mid D) = \frac{P(D \mid h) P(h)}{P(D)} \]
- Maximum a posteriori hypothesis (MAP):
  \[ h_{MAP} \equiv \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h) P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h) P(h) \]
- Maximum likelihood hypothesis (ML) (MAP with uniform
  prior):
  \[ h_{ML} \equiv \arg\max_{h \in H} P(D \mid h) \]
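A small sketch of the difference between the MAP and ML hypotheses. Everything in it is a hypothetical setup chosen for illustration: three hypotheses about a coin's probability of heads, a prior that strongly favors the fair coin, and data D consisting of three heads in three tosses.

```python
# Hypothetical setup: three hypotheses about a coin's probability of heads,
# a prior that favors the fair coin, and data D = 3 heads in 3 tosses.
hypotheses = {"fair": 0.5, "biased": 0.8, "two_headed": 1.0}
prior = {"fair": 0.90, "biased": 0.09, "two_headed": 0.01}
heads, tails = 3, 0


def likelihood(p_heads):
    """P(D | h) for independent tosses."""
    return p_heads ** heads * (1 - p_heads) ** tails


# Unnormalized posterior P(D | h) P(h); P(D) is the same for every h,
# which is why it can be dropped from the arg max.
score = {h: likelihood(p) * prior[h] for h, p in hypotheses.items()}
evidence = sum(score.values())                       # P(D)
posterior = {h: s / evidence for h, s in score.items()}

h_map = max(score, key=score.get)
h_ml = max(hypotheses, key=lambda h: likelihood(hypotheses[h]))
```

Here the ML hypothesis is the two-headed coin, since it explains the data best, while the MAP hypothesis remains the fair coin because the prior outweighs three tosses of evidence. With a uniform prior the two would coincide.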
Bayes Optimal Classifier
- Let H be a hypothesis space defined over the instance space
  X, where the task is to learn a target function $f: X \rightarrow Y$ and
  Y is a finite set of classes used to classify instances in X.
- The Bayes optimal classification of a new instance is:
  \[ \arg\max_{y \in Y} \sum_{h \in H} P(y \mid h) P(h \mid D) \]
- No other classification method using the same hypothesis
  space and same prior knowledge can outperform this method
  on average.
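A minimal sketch of the Bayes optimal classification, with hypothetical posteriors and predictions for three hypotheses. It shows a case where the single most probable hypothesis and the posterior-weighted vote disagree.

```python
# Hypothetical posteriors P(h | D) and each hypothesis's predictive
# distribution P(y | h) for one new instance.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predict = {
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}
classes = ["+", "-"]


def bayes_optimal(classes, posterior, predict):
    """arg max over y of sum over h of P(y | h) P(h | D)."""
    def vote(y):
        return sum(predict[h][y] * posterior[h] for h in posterior)
    return max(classes, key=vote)


# For comparison: classify with the single MAP hypothesis only.
h_map = max(posterior, key=posterior.get)
map_label = max(classes, key=lambda y: predict[h_map][y])
optimal_label = bayes_optimal(classes, posterior, predict)
```

The MAP hypothesis `h1` predicts "+", but the combined weight of `h2` and `h3` (0.6) exceeds that of `h1` (0.4), so the Bayes optimal classification is "-".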
Naive Bayes Classifier
- Let H be a hypothesis space defined over the instance space
  X, where the task is to learn a target function $f: X \rightarrow Y$, Y
  is a finite set of classes used to classify instances in X, and
  $a_1, \ldots, a_n$ are the attributes used to represent an instance
  $x \in X$.
- The naive Bayes classification of a new instance is:
  \[ \arg\max_{y \in Y} P(y) \prod_i P(a_i \mid y) \]
- This coincides with $y_{MAP}$ under the assumption that the
  attribute values are conditionally independent given the target
  value.
Learning a Naive Bayes Classifier
- Estimate probabilities from training data using ML estimation:
  \[ \hat{P}_{ML}(y) = \frac{|\{x \in D \mid f(x) = y\}|}{|D|} \]
  \[ \hat{P}_{ML}(a_i \mid y) = \frac{|\{x \in D \mid f(x) = y, a_i \in x\}|}{|\{x \in D \mid f(x) = y\}|} \]
- Smooth probability estimates to compensate for sparse data,
  e.g. using an m-estimate:
  \[ \hat{P}(y) = \frac{|\{x \in D \mid f(x) = y\}| + mp}{|D| + m} \]
  \[ \hat{P}(a_i \mid y) = \frac{|\{x \in D \mid f(x) = y, a_i \in x\}| + mp}{|\{x \in D \mid f(x) = y\}| + m} \]
  where m is a constant (called the equivalent sample size) and
  p is a prior probability (usually assumed to be uniform).
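The m-estimate and the naive Bayes decision rule can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the `n_values` parameter (number of possible values per attribute, used for the uniform prior p), and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def train(data, m=1.0, n_values=2):
    """Return estimators for P(y) and P(a_i = v | y) from (attrs, y) pairs.

    Assumption: the prior p is uniform, 1 / n_values for attribute values
    and 1 / number_of_classes for classes.
    """
    class_count = Counter(y for _, y in data)
    value_count = defaultdict(int)          # (i, value, y) -> count
    for attrs, y in data:
        for i, v in enumerate(attrs):
            value_count[i, v, y] += 1

    def p_class(y):
        return (class_count[y] + m / len(class_count)) / (len(data) + m)

    def p_value(i, v, y):
        return (value_count[i, v, y] + m / n_values) / (class_count[y] + m)

    return list(class_count), p_class, p_value


def classify(attrs, classes, p_class, p_value):
    """arg max over y of P(y) * product over i of P(a_i | y)."""
    def score(y):
        s = p_class(y)
        for i, v in enumerate(attrs):
            s *= p_value(i, v, y)
        return s
    return max(classes, key=score)


# Hypothetical tokens described by (capitalization, position in sentence).
data = [
    (("cap", "mid"), "name"),
    (("cap", "mid"), "name"),
    (("cap", "initial"), "other"),
    (("lower", "mid"), "other"),
    (("lower", "mid"), "other"),
]
classes, p_class, p_value = train(data)
```

With m = 0 the estimate for an attribute value never seen with a class would be zero and would wipe out the whole product; with m = 1 and p = 1/2 the unseen pair ("lower", "name") gets probability 0.5 / 3 instead.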
Bayesian Networks (a.k.a. Graphical Models)
- A Bayesian network represents the joint probability distribution
  for a set of variables given a set of conditional independence
  assumptions, represented by a directed acyclic graph:
  1. Each node X represents a random variable.
  2. Each arc $X \rightarrow Y$ represents the assertion that Y is
     conditionally independent of its non-descendants, given its
     predecessors (i.e. nodes Z such that $Z \rightarrow Y$).
  3. Each node is associated with a table defining its conditional
     probability distribution given its predecessors.
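Example: Hidden Markov Model
[Figure: hidden Markov model drawn as a Bayesian network. A chain of five hidden states $s_1, \ldots, s_5$ is linked by transition arcs $P(s_t \mid s_{t-1})$, and each state emits one observation $o_1, \ldots, o_5$ through an arc $P(o_t \mid s_t)$.]
The transition table has one row and one column per state value (entries $p_{11}, \ldots, p_{kk}$); the emission table has one row per state value and one column per observation symbol (entries $q_{11}, \ldots, q_{km}$):
\[ p_{ij} = P(s_t = s_j \mid s_{t-1} = s_i) \]
\[ q_{ij} = P(o_t = o_j \mid s_t = s_i) \]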
Inference and Learning
- Inference in a Bayesian network:
  \[ P(X_1 = x_1, \ldots, X_n = x_n) = \prod_{i=1}^{n} P(X_i = x_i \mid \mathrm{predecessors}(X_i)) \]
- Learning a Bayesian network:
  1. Known structure, full observability: Estimation (ML, MAP)
  2. Known structure, partial observability: Iterative approximation
     algorithms, e.g. Expectation-Maximization
  3. Unknown structure, full observability: Greedy approximations
     of MAP (NP-hard problem)
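The factorization of the joint distribution can be sketched for a tiny network. The two-node network Rain -> WetGrass, its probabilities, and the data layout (`parents`, `cpt`) are hypothetical choices for illustration.

```python
# Hypothetical two-node network Rain -> WetGrass with binary variables.
parents = {"Rain": (), "WetGrass": ("Rain",)}

# Conditional probability tables: cpt[node][parent_values][value].
cpt = {
    "Rain": {(): {True: 0.2, False: 0.8}},
    "WetGrass": {
        (True,): {True: 0.9, False: 0.1},
        (False,): {True: 0.1, False: 0.9},
    },
}


def joint(assignment):
    """P(X_1 = x_1, ..., X_n = x_n) as a product of local conditionals."""
    p = 1.0
    for node, value in assignment.items():
        parent_values = tuple(assignment[q] for q in parents[node])
        p *= cpt[node][parent_values][value]
    return p


def marginal_wet():
    """P(WetGrass = True), by summing the joint over Rain."""
    return sum(joint({"Rain": r, "WetGrass": True}) for r in (True, False))
```

For example, `joint({"Rain": True, "WetGrass": True})` multiplies P(Rain) = 0.2 by P(WetGrass | Rain) = 0.9. Summing the joint over all values of the unobserved variables, as `marginal_wet` does, is the brute-force form of inference and scales exponentially with the number of variables summed out.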
Instance-Based Learning
- Let an instance x be described by a feature vector $(x_1, \ldots, x_p)$
  and let $D(x_i, x_j)$ be the distance between instances $x_i$ and $x_j$:
  \[ D(x_i, x_j) = \sum_{m=1}^{p} d(x_{im}, x_{jm}) \]
  where $d(x_{im}, x_{jm})$ is the distance between $x_{im}$ and $x_{jm}$,
  the m-th feature values of $x_i$ and $x_j$.
- Given a new instance x, let $x_1, \ldots, x_k$ denote the k training
  instances nearest to x. The k-nearest neighbor classification
  is:
  \[ \hat{f}(x) \leftarrow \arg\max_{y \in Y} \sum_{i=1}^{k} \delta(y, f(x_i)) \]
  where $\delta(a, b) = 1$ if $a = b$ and $\delta(a, b) = 0$ otherwise.
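A minimal sketch of k-nearest neighbor classification with a summed per-feature distance. The choice of d (absolute difference for numbers, 0/1 overlap otherwise), the function names, and the toy training set are assumptions made for the example.

```python
from collections import Counter


def distance(xi, xj):
    """D(x_i, x_j) as a sum of per-feature distances d.

    Assumption: d is the absolute difference for numeric features and
    0/1 overlap for everything else.
    """
    total = 0.0
    for a, b in zip(xi, xj):
        if isinstance(a, (int, float)) and isinstance(b, (int, float)):
            total += abs(a - b)
        else:
            total += 0.0 if a == b else 1.0
    return total


def knn_classify(x, training, k=3):
    """Majority class among the k training instances nearest to x."""
    nearest = sorted(training, key=lambda pair: distance(x, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)   # sum of delta(y, f(x_i))
    return votes.most_common(1)[0][0]


# Hypothetical training set: (feature vector, class).
training = [
    ((1.0, "a"), "pos"),
    ((1.2, "a"), "pos"),
    ((3.0, "b"), "neg"),
    ((3.1, "b"), "neg"),
    ((2.9, "a"), "neg"),
]
```

Sorting the whole training set is the simplest way to find the k nearest instances; for large data sets an index structure would usually replace it.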
Variations on k-Nearest Neighbor
- Algorithm parameters:
  1. k ($1 \le k \le n$)
  2. Distance functions (D, d)
  3. Feature weighting (IG, GR):
     \[ D(x_i, x_j) = \sum_{m=1}^{p} w_m \, d(x_{im}, x_{jm}) \]
  4. Distance weighting:
     \[ \hat{f}(x) \leftarrow \arg\max_{y \in Y} \sum_{i=1}^{k} w(D(x, x_i)) \, \delta(y, f(x_i)) \]
- kNN is a non-parametric model, which makes regularization
  not straightforward. One way to regularize is to learn a new
  distance function that varies smoothly with respect to the
  training data.
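Feature weighting and distance weighting can be sketched together as follows. The weight function w(D) = 1 / (D + eps), the absolute-difference d, and all names and data are assumptions for illustration; the slides do not fix a particular choice.

```python
from collections import defaultdict


def weighted_distance(xi, xj, feature_weights):
    """D(x_i, x_j) = sum over m of w_m * d(x_im, x_jm), d = absolute difference."""
    return sum(w * abs(a - b) for w, a, b in zip(feature_weights, xi, xj))


def weighted_knn(x, training, k, feature_weights):
    """Each of the k nearest neighbors votes with weight w(D) = 1 / (D + eps)."""
    eps = 1e-9   # assumption: avoids division by zero on exact matches
    scored = sorted(
        (weighted_distance(x, xi, feature_weights), label)
        for xi, label in training
    )[:k]
    votes = defaultdict(float)
    for dist, label in scored:
        votes[label] += 1.0 / (dist + eps)
    return max(votes, key=votes.get)


# Hypothetical data: one close neighbor of class "a", two distant ones of "b".
line = [((0.0, 0.0), "a"), ((4.0, 0.0), "b"), ((5.0, 0.0), "b")]

# Hypothetical data where the second feature is noise.
noisy = [((0.0, 9.0), "a"), ((1.0, 0.0), "b")]
```

With `line` and k = 3, an unweighted vote at x = (1, 0) would return "b" (two votes against one), while the distance-weighted vote returns "a" because the single close neighbor outweighs the two distant ones. With `noisy`, setting the second feature weight to 0 changes the nearest neighbor of (0.1, 0) from "b" to "a".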
Lazy and Eager Learning
- Instance-based learning is an example of lazy learning:
  1. Learning simply consists in storing training instances (no
     explicit general hypothesis is constructed).
  2. Classification is based on retrieval of instances and
     similarity-based reasoning
- Comparison with eager learning:
  1. Lazy learners can construct many local approximations of the
     target function, which may reduce complexity.
  2. Lazy learners perform nearly all computation at classification
     time, which may compromise efficiency.
Genetic Algorithms
- Learning as survival of the fittest
- "Biologically inspired" stochastic search heuristic
- A prototypical genetic algorithm has the following form, with
  each hypothesis h represented by a bit string:
  1. Initialize population: $P \leftarrow$ Generate p hypotheses at random.
  2. Evaluate: For each $h \in P$, compute Fitness(h).
  3. While $[\max_h \mathrm{Fitness}(h)] <$ Threshold:
     - Update: $P \leftarrow (\mathrm{Select}(P) \cup \mathrm{Crossover}(P) \cup \mathrm{Mutate}(P))$
     - Evaluate: For each $h \in P$, compute Fitness(h).
- A regularization term may be added to the fitness function
Genetic Operators
- Select(P): Probabilistically select $(1 - r)p$ members of P
  given the model:
  \[ P(h_i) = \frac{\mathrm{Fitness}(h_i)}{\sum_{j=1}^{p} \mathrm{Fitness}(h_j)} \]
- Crossover(P): Probabilistically select $\frac{r \cdot p}{2}$ pairs of hypotheses
  from P according to the model above. For each pair $(h_1, h_2)$,
  produce two offspring by applying a Crossover operator.
- Mutate(P): Choose m percent of the members of P, with
  uniform probability. For each, invert one randomly selected bit
  in its representation.
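The three operators and the surrounding loop can be sketched as follows. This is a minimal illustration under several assumptions: single-point crossover is one possible Crossover operator, m is given as a fraction rather than a percentage, mutation is applied to the new population after selection and crossover, a generation limit is added so the loop always ends, and the fitness function `one_max` is a hypothetical toy objective kept strictly positive so that fitness-proportionate selection is well defined.

```python
import random


def select(population, fitness, n, rng):
    """Fitness-proportionate selection of n members (with replacement)."""
    weights = [fitness(h) for h in population]
    return rng.choices(population, weights=weights, k=n)


def crossover(h1, h2, rng):
    """Single-point crossover: swap the tails of two bit strings."""
    point = rng.randrange(1, len(h1))
    return h1[:point] + h2[point:], h2[:point] + h1[point:]


def mutate(h, rng):
    """Invert one randomly selected bit."""
    i = rng.randrange(len(h))
    return h[:i] + ("1" if h[i] == "0" else "0") + h[i + 1:]


def genetic_algorithm(fitness, length, threshold, p=20, r=0.6, m=0.1,
                      max_generations=200, seed=0):
    rng = random.Random(seed)
    population = ["".join(rng.choice("01") for _ in range(length))
                  for _ in range(p)]
    for _ in range(max_generations):
        if max(fitness(h) for h in population) >= threshold:
            break
        survivors = select(population, fitness, round((1 - r) * p), rng)
        offspring = []
        for _ in range(round(r * p / 2)):
            h1, h2 = select(population, fitness, 2, rng)
            offspring.extend(crossover(h1, h2, rng))
        population = survivors + offspring
        chosen = rng.sample(range(len(population)), round(m * len(population)))
        for i in chosen:
            population[i] = mutate(population[i], rng)
    return max(population, key=fitness)


def one_max(h):
    """Hypothetical fitness: number of 1 bits, plus 1 to stay positive."""
    return h.count("1") + 1


best = genetic_algorithm(one_max, length=8, threshold=9)
```

Because the search is stochastic and the sketch keeps no elite member, the returned string is the best of the final population, not necessarily the best ever seen.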
Tools
- Weka
  - Java-based
  - Most functions accessible from GUI
  - Vast assortment of learning algorithms
  - http://www.cs.waikato.ac.nz/ml/weka
- Mallet
  - Java-based
  - Limited CLI and somewhat lacking API documentation
  - Algorithms for Graphical Models (Conditional Random Fields)
  - http://mallet.cs.umass.edu
- Elefant
  - Python-based
  - Limited GUI and CLI
  - Rather unstable beta version (as of August 2009)
  - Well suited for implementing your own algorithms
  - http://elefant.developer.nicta.com.au