# Machine Learning - SICS

AI and Robotics

Oct 14, 2013

## Machine Learning: Major Approaches

Joakim Nivre
Modified and extended by Oscar Täckström
Uppsala University and Växjö University, Sweden
nivre@msi.vxu.se
oscar@sics.se

Machine Learning 1(33)
## Approaches to Machine Learning

- Decision trees
- Artificial neural networks
- Bayesian learning
- Instance-based learning
- Genetic algorithms
## Decision Tree Learning

- Decision trees classify instances by sorting them down the tree from the root to some leaf node, where:
  1. Each internal node specifies a test of some attribute.
  2. Each branch corresponds to a value for the tested attribute.
  3. Each leaf node provides a classification for the instance.
- Decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances.
  1. Each path from root to leaf specifies a conjunction of tests.
  2. The tree itself represents the disjunction of all paths.
## Example: Name Recognition

A decision tree for name recognition (figure): the root tests Capitalized?, with the No branch leading directly to a leaf; the Yes branch leads to a second test, Sentence-initial?, whose branches lead to the leaves 0 and 1 (1 = name, 0 = not a name).
## Appropriate Problems for Decision Tree Learning

- Instances are represented by attribute-value pairs.
- The target function has discrete output values.
- Disjunctive descriptions may be required.
- The training data may contain errors.
- The training data may contain missing attribute values.
## The ID3 Learning Algorithm

- ID3(X = instances, Y = classes, A = attributes):
  1. Create a root node R for the tree.
  2. If all instances in X are in class y, return R with label y.
  3. Else let the decision attribute for R be the attribute a ∈ A that best classifies X, and for each value $v_i$ of a:
     1. Add a branch below R for the test $a = v_i$.
     2. Let $X_i$ be the subset of X that have $a = v_i$. If $X_i$ is empty, then add a leaf labeled with the most common class in X; else add the subtree ID3($X_i$, Y, A − {a}).
  4. Return R.
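The recursion above can be sketched in Python, using Information Gain (defined on the next slide) as the attribute-selection measure. The data representation (a list of (attribute-dict, class) pairs), all helper names, and the dictionary encoding of the tree are illustrative assumptions, not from the slides:

```python
import math
from collections import Counter

def entropy(labels):
    """H = -sum_y P(y) log2 P(y), with probabilities from frequency counts."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute):
    """Entropy of the labels minus the expected entropy after splitting."""
    labels = [y for _, y in examples]
    n = len(examples)
    split = {}
    for x, y in examples:
        split.setdefault(x[attribute], []).append(y)
    remainder = sum((len(ys) / n) * entropy(ys) for ys in split.values())
    return entropy(labels) - remainder

def id3(examples, attributes):
    """Return a class label (leaf) or a node {'attribute': a, 'branches': {v: subtree}}."""
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:           # step 2: all instances in one class
        return labels[0]
    if not attributes:                  # no tests left: most common class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a))
    node = {'attribute': best, 'branches': {}}
    for v in {x[best] for x, _ in examples}:    # step 3.1: one branch per value
        subset = [(x, y) for x, y in examples if x[best] == v]
        # Step 3.2: recurse with the used attribute removed. Subsets built from
        # observed values are never empty, so the empty-subset case stays implicit.
        node['branches'][v] = id3(subset, [a for a in attributes if a != best])
    return node
```

On the name-recognition example, a perfectly predictive attribute ends up at the root with one leaf per value.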
## Selecting the Best Attribute (1)

- ID3 uses the measure Information Gain (IG) to decide which attribute A best classifies a set of examples X:

  $$IG(X, A) = H(X) - H(X \mid A)$$

  The entropy and conditional entropy are defined as follows:

  $$H(X) = -\sum_{y \in Y} P(Y = y) \log_2 P(Y = y)$$

  $$H(X \mid A) = \sum_{a \in V_A} P(A = a)\, H(X \mid A = a)$$

  where $V_A$ is the set of possible values for attribute A and the specific conditional entropy is defined as:

  $$H(X \mid A = a) = -\sum_{y \in Y} P(Y = y \mid A = a) \log_2 P(Y = y \mid A = a)$$
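As a small worked example (numbers invented for illustration): suppose X contains four examples, two per class, and attribute A splits them perfectly, value $a_1$ covering both examples of one class and $a_2$ both of the other. Then

$$H(X) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1$$

$$H(X \mid A) = \tfrac{1}{2}\cdot 0 + \tfrac{1}{2}\cdot 0 = 0 \quad\Rightarrow\quad IG(X, A) = 1 - 0 = 1$$

so a perfect split attains the maximum possible gain, the full entropy of the class distribution.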
## Selecting the Best Attribute (2)

- An alternative measure is Gain Ratio (GR):

  $$GR(X, A) = \frac{IG(X, A)}{H(A)}$$

  where the attribute entropy is given by:

  $$H(A) = -\sum_{a \in V_A} P(A = a) \log_2 P(A = a)$$

- The probabilities are usually estimated by simple frequency counts:

  $$P(A = a) = \frac{|X_a|}{|X|}$$

  where $X_a$ is the set of examples with attribute A = a.
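Gain Ratio penalizes attributes with many values. Continuing the invented four-example illustration: the perfect binary split has $H(A) = 1$, while an attribute with a unique value per example also has $IG = 1$ but $H(A) = \log_2 4 = 2$, so

$$GR_{\text{binary}} = \frac{1}{1} = 1 \qquad \text{vs.} \qquad GR_{\text{unique}} = \frac{1}{\log_2 4} = \frac{1}{2}$$

and the maximally fragmenting attribute is no longer preferred.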
## Hypothesis Space Search and Inductive Bias

- Characteristics of ID3:
  1. Searches a complete hypothesis space of target functions
  2. Maintains a single current hypothesis throughout the search
  3. Performs a hill-climbing search (susceptible to local minima)
  4. Uses all training examples at each step in the search
- Inductive bias:
  1. Prefers shorter trees over longer ones
  2. Prefers trees with informative attributes close to the root
  3. Preference bias (incomplete search of a complete space)
Overtting
I
A hypothesis h is overtted to the training data if there exists
an alternative hypothesis h
0
with higher training error but
lower test error.
I
Caused by learning structure in the data that occur by chance
I
Less training data increases the problem of overtting
I
Noisy training data increases the problem of overtting
I
Complex hypotheses more prone to overtting
Machine Learning 10(33)
## Regularization

- Overfitting can be prevented by limiting hypothesis complexity
- Generally a trade-off between fit to data and complexity
- Main methods for regularization:
  1. Early stopping: stop learning before overfitting sets in
  2. Pruning: learn "everything in the data" and then simplify
  3. Averaging: average different hypotheses
  4. Regularization term: score hypotheses by fitness combined with an inverse complexity score
  5. Prior distribution: a priori prefer simpler hypotheses
- Can be viewed as adding a preference bias to the algorithm
- Generally involves a hyperparameter controlling the trade-off, and requires validation using held-out data
- Decision tree learning is often regularized by early stopping or pruning
Articial Neural Networks
I
Learning methods based on articial neural networks (ANN)
are suitable for problems with the following characteristics:
1.Instances are represented by many attribute-value pairs.
2.The target function may be discrete-valued,real-valued,or a
vector of real- or discrete-valued attributes.
3.The training examples may contain errors.
4.Long training times are acceptable.
5.Fast evaluation of the learned target function may be required.
6.The ability of humans to understand the learned target
function is not important.
Machine Learning 12(33)
## Perceptron Learning

- A perceptron takes a vector of real-valued inputs $x_1, \ldots, x_n$ and outputs 1 or −1:

  $$o(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n > 0 \\ -1 & \text{otherwise} \end{cases}$$

- Learning a perceptron involves learning the weights $w_0, \ldots, w_n$.
- Begin with random weights, then iteratively apply the perceptron to each training instance with target output t, modifying the weights on misclassification (with learning rate η):

  $$w_i \leftarrow w_i + \Delta w_i \qquad \Delta w_i = \eta\,(t - o)\,x_i$$

- Provably converges when the data is linearly separable, though not necessarily with good generalization ability
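A minimal runnable sketch of the update rule above. The toy data set, the zero initialization (the slides say random; zeros keep the demo deterministic), and the specific η and epoch count are illustrative choices:

```python
def predict(w, x):
    """Output 1 if w0 + w1*x1 + ... + wn*xn > 0, else -1 (w[0] is the bias w0)."""
    s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if s > 0 else -1

def train_perceptron(data, eta=0.1, epochs=100):
    """data: list of (x, t) pairs with target t in {-1, +1}; returns the weights."""
    n = len(data[0][0])
    w = [0.0] * (n + 1)
    for _ in range(epochs):
        for x, t in data:
            o = predict(w, x)
            if o != t:                       # update only on misclassification
                w[0] += eta * (t - o)        # bias acts as a weight on constant input 1
                for i in range(n):
                    w[i + 1] += eta * (t - o) * x[i]
    return w
```

On a linearly separable set such as the AND function (with targets in {−1, +1}) the loop reaches zero training errors, in line with the convergence claim above.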
## Perceptron Illustration (1–4)

(figures)
## Linearly Non-Separable Problems

- Always try linear models before non-linear ones
- Linear models are less prone to overfitting
- Linear models are typically much faster at training and prediction
## Multi-Layer Networks and Backpropagation

- While the simple perceptron can only learn linearly separable functions, multi-layer networks can approximate any function.
- The most common learning algorithm is backpropagation, which learns the weights for a multilayer network with a fixed set of units and interconnections, using gradient descent in an attempt to minimize the squared error loss.
- Backpropagation over multilayer networks is only guaranteed to converge toward a local minimum.
- The inductive bias of backpropagation learning can be roughly characterized as smooth interpolation between data points.
- Regularization is usually done by early stopping or by pruning through weight decay.
## Bayesian Learning

- Two reasons for studying Bayesian learning methods:
  1. Efficient learning algorithms for certain kinds of problems
  2. An analysis framework for other kinds of learning algorithms
- Features of Bayesian learning methods:
  1. Assign probabilities to hypotheses (rather than accept or reject them)
  2. Combine prior knowledge with observed data
  3. Permit hypotheses that make probabilistic predictions
  4. Permit predictions based on multiple hypotheses, weighted by their probabilities
- Regularization can be performed by smoothing or, more generally, by assuming a suitable prior distribution over the parameters
## Learning as Estimation

- Bayes' theorem (hypothesis space H, h ∈ H, training data D):

  $$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$

- Maximum a posteriori hypothesis (MAP):

  $$h_{MAP} \equiv \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\,P(h)$$

- Maximum likelihood hypothesis (ML) (MAP with a uniform prior):

  $$h_{ML} \equiv \arg\max_{h \in H} P(D \mid h)$$
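A toy numeric illustration of how the two estimates can disagree; the two-hypothesis space and all probabilities are invented for the example:

```python
# Invented two-hypothesis space: the prior favors h1, the data favor h2.
priors = {'h1': 0.8, 'h2': 0.2}          # P(h)
likelihoods = {'h1': 0.3, 'h2': 0.6}     # P(D | h)

# h_ML maximizes P(D | h); h_MAP maximizes P(D | h) * P(h).
h_ml = max(likelihoods, key=likelihoods.get)
h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])
```

Here the prior is strong enough to overturn the likelihood: `h_ml` is `'h2'` (0.6 > 0.3) but `h_map` is `'h1'` (0.3 · 0.8 = 0.24 > 0.6 · 0.2 = 0.12). With a uniform prior the two coincide, as the slide states.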
## Bayes Optimal Classifier

- Let H be a hypothesis space defined over the instance space X, where the task is to learn a target function f : X → Y and Y is a finite set of classes used to classify instances in X.
- The Bayes optimal classification of a new instance is:

  $$\arg\max_{y \in Y} \sum_{h \in H} P(y \mid h)\,P(h \mid D)$$

- No other classification method using the same hypothesis space and the same prior knowledge can outperform this method on average.
## Naive Bayes Classifier

- Let H be a hypothesis space defined over the instance space X, where the task is to learn a target function f : X → Y, Y is a finite set of classes used to classify instances in X, and $a_1, \ldots, a_n$ are the attributes used to represent an instance x ∈ X.
- The naive Bayes classification of a new instance is:

  $$\arg\max_{y \in Y} P(y) \prod_i P(a_i \mid y)$$

- This coincides with $y_{MAP}$ under the assumption that the attribute values are conditionally independent given the target value.
## Learning a Naive Bayes Classifier

- Estimate probabilities from training data using ML estimation:

  $$\hat{P}_{ML}(y) = \frac{|\{x \in D \mid f(x) = y\}|}{|D|} \qquad \hat{P}_{ML}(a_i \mid y) = \frac{|\{x \in D \mid f(x) = y,\, a_i \in x\}|}{|\{x \in D \mid f(x) = y\}|}$$

- Smooth the probability estimates to compensate for sparse data, e.g. using an m-estimate:

  $$\hat{P}(y) = \frac{|\{x \in D \mid f(x) = y\}| + mp}{|D| + m} \qquad \hat{P}(a_i \mid y) = \frac{|\{x \in D \mid f(x) = y,\, a_i \in x\}| + mp}{|\{x \in D \mid f(x) = y\}| + m}$$

  where m is a constant (called the equivalent sample size) and p is a prior probability (usually assumed to be uniform).
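A sketch of naive Bayes training with the m-estimates above, taking p uniform over classes (for the prior) and over attribute values (for the conditionals). The dict-of-attributes instance representation and all names are illustrative assumptions:

```python
import math
from collections import Counter

def train_nb(data, m=1.0):
    """data: list of (attribute_dict, label). Returns m-estimate-smoothed tables."""
    classes = sorted({y for _, y in data})
    attrs = sorted(data[0][0])
    values = {a: sorted({x[a] for x, _ in data}) for a in attrs}
    n_y = Counter(y for _, y in data)
    # P_hat(y) with uniform prior p = 1/|Y|
    prior = {y: (n_y[y] + m / len(classes)) / (len(data) + m) for y in classes}
    # P_hat(a = v | y) with uniform prior p = 1/|V_a|
    cond = {}
    for y in classes:
        for a in attrs:
            counts = Counter(x[a] for x, label in data if label == y)
            for v in values[a]:
                cond[(a, v, y)] = (counts[v] + m / len(values[a])) / (n_y[y] + m)
    return prior, cond

def classify_nb(model, x):
    """argmax_y P(y) * prod_i P(a_i | y), computed in log space for stability."""
    prior, cond = model
    return max(prior, key=lambda y: math.log(prior[y])
               + sum(math.log(cond[(a, v, y)]) for a, v in x.items()))
```

The smoothing keeps every conditional strictly positive, so an attribute value unseen with some class cannot zero out that class's score.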
## Bayesian Networks (a.k.a. Graphical Models)

- A Bayesian network represents the joint probability distribution for a set of variables given a set of conditional independence assumptions, represented by a directed acyclic graph:
  1. Each node X represents a random variable.
  2. Each arc X → Y represents the assertion that Y is conditionally independent of its nondescendants, given its predecessors (i.e. nodes Z such that Z → Y).
  3. Each node is associated with a table defining its conditional probability distribution given its predecessors.
## Example: Hidden Markov Model

A hidden Markov model drawn as a Bayesian network (figure): a chain of hidden states $\sigma_1 \rightarrow \sigma_2 \rightarrow \cdots \rightarrow \sigma_5$, each emitting an observation $o_1, \ldots, o_5$. The transition table $P(\sigma_t \mid \sigma_{t-1})$ and emission table $P(o_t \mid \sigma_t)$, over states $\sigma_1, \ldots, \sigma_k$ and observations $o_1, \ldots, o_m$, have entries

$$p_{ij} = P(\sigma_t = \sigma_j \mid \sigma_{t-1} = \sigma_i) \qquad q_{ij} = P(o_t = o_j \mid \sigma_t = \sigma_i)$$
## Inference and Learning

- Inference in a Bayesian network:

  $$P(X_1 = x_1, \ldots, X_n = x_n) = \prod_{i=1}^{n} P(X_i = x_i \mid \mathit{predecessors}(X_i))$$

- Learning a Bayesian network:
  1. Known structure, full observability: estimation (ML, MAP)
  2. Known structure, partial observability: iterative approximation algorithms, e.g. Expectation-Maximization
  3. Unknown structure, full observability: greedy approximations of MAP (an NP-hard problem)
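The factored joint can be sketched for a tiny hand-specified network with a single arc Rain → WetGrass; the structure and all numbers are invented for illustration:

```python
# Conditional probability tables: P(Rain) and P(WetGrass | Rain).
p_rain = {True: 0.2, False: 0.8}
p_wet_given_rain = {True:  {True: 0.9, False: 0.1},
                    False: {True: 0.1, False: 0.9}}

def joint(rain, wet):
    """P(Rain = rain, WetGrass = wet) = P(rain) * P(wet | rain):
    each variable is conditioned only on its predecessors in the graph."""
    return p_rain[rain] * p_wet_given_rain[rain][wet]
```

Summing `joint` over all four value combinations gives 1, as a joint distribution must.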
## Instance-Based Learning

- Let an instance x be described by a feature vector $(x^1, \ldots, x^p)$ and let $D(x_i, x_j)$ be the distance between instances $x_i$ and $x_j$:

  $$D(x_i, x_j) = \sum_{m=1}^{p} d(x_i^m, x_j^m)$$

  where $d(x_i^m, x_j^m)$ is the distance between $x_i^m$ and $x_j^m$.
- Given a new instance x, let $x_1, \ldots, x_k$ denote the k training instances nearest to x. The k-nearest-neighbor classification is:

  $$\hat{f}(x) = \arg\max_{y \in Y} \sum_{i=1}^{k} \delta(y, f(x_i))$$

  where $\delta(a, b) = 1$ if a = b and $\delta(a, b) = 0$ otherwise.
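A minimal k-nearest-neighbor sketch, assuming the 0/1 overlap metric for the per-feature distance d; the tuple representation and the default k are illustrative:

```python
from collections import Counter

def distance(xi, xj):
    """D(x_i, x_j) = sum_m d(x_i^m, x_j^m), with d(a, b) = 0 if a == b else 1."""
    return sum(a != b for a, b in zip(xi, xj))

def knn_classify(train, x, k=3):
    """train: list of (feature_tuple, label); majority vote among the k nearest."""
    nearest = sorted(train, key=lambda pair: distance(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Note that all the work happens at classification time; "training" is just keeping the list, which is the lazy-learning point made two slides below.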
## Variations on k-Nearest Neighbor

- Algorithm parameters:
  1. k (1 ≤ k ≤ n)
  2. Distance functions (D, d)
  3. Feature weighting (e.g. by IG or GR):

     $$D(x_i, x_j) = \sum_{m=1}^{p} w_m\, d(x_i^m, x_j^m)$$

  4. Distance weighting:

     $$\hat{f}(x) = \arg\max_{y \in Y} \sum_{i=1}^{k} w(D(x, x_i))\, \delta(y, f(x_i))$$

- k-NN is a non-parametric model, which makes regularization less straightforward. One way to regularize is to learn a new distance function that varies smoothly with respect to the training data.
## Lazy and Eager Learning

- Instance-based learning is an example of lazy learning:
  1. Learning simply consists in storing training instances (no explicit general hypothesis is constructed).
  2. Classification is based on retrieval of instances and similarity-based reasoning.
- Comparison with eager learning:
  1. Lazy learners can construct many local approximations of the target function, which may reduce complexity.
  2. Lazy learners perform nearly all computation at classification time, which may compromise efficiency.
## Genetic Algorithms

- Learning as survival of the fittest
- A "biologically inspired" stochastic search heuristic
- A prototypical genetic algorithm has the following form, with each hypothesis h represented by a bit string:
  1. Initialize population: $P \leftarrow$ generate p hypotheses at random.
  2. Evaluate: for each h ∈ P, compute Fitness(h).
  3. While $[\max_h \mathit{Fitness}(h)] <$ Threshold:
     - Update: $P \leftarrow \mathit{Select}(P) \cup \mathit{Crossover}(P) \cup \mathit{Mutate}(P)$
     - Evaluate: for each h ∈ P, compute Fitness(h).
- A regularization term may be added to the fitness function
## Genetic Operators

- Select(P): probabilistically select $(1 - r)p$ members of P given the model:

  $$P(h_i) = \frac{\mathit{Fitness}(h_i)}{\sum_{j=1}^{p} \mathit{Fitness}(h_j)}$$

- Crossover(P): probabilistically select $\frac{r \cdot p}{2}$ pairs of hypotheses from P according to the model above. For each pair $(h_1, h_2)$, produce two offspring by applying a Crossover operator.
- Mutate(P): choose m percent of the members of P with uniform probability. For each, invert one randomly selected bit in its representation.
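Putting the operators together, a compact runnable sketch of the prototypical algorithm. The fitness function (count of 1-bits), the population size, the rates, the single-point crossover, and the added generation cap (a practical safeguard not in the slides) are all illustrative assumptions:

```python
import random

def fitness(h):
    """Illustrative fitness: number of 1-bits in the string."""
    return h.count('1')

def select(pop, n, rng):
    """Fitness-proportional selection of n members of the population."""
    weights = [fitness(h) + 1e-9 for h in pop]   # avoid an all-zero weight vector
    return rng.choices(pop, weights=weights, k=n)

def crossover(h1, h2, rng):
    """Single-point crossover producing two offspring."""
    cut = rng.randrange(1, len(h1))
    return h1[:cut] + h2[cut:], h2[:cut] + h1[cut:]

def mutate(h, rng):
    """Invert one randomly selected bit."""
    i = rng.randrange(len(h))
    return h[:i] + ('0' if h[i] == '1' else '1') + h[i + 1:]

def genetic_algorithm(p=20, bits=10, r=0.5, m=0.1, threshold=10,
                      max_generations=200, seed=0):
    rng = random.Random(seed)
    pop = [''.join(rng.choice('01') for _ in range(bits)) for _ in range(p)]
    for _ in range(max_generations):
        if max(fitness(h) for h in pop) >= threshold:
            break
        # Update: P <- Select(P) u Crossover(P) u Mutate(P)
        new_pop = select(pop, int((1 - r) * p), rng)       # (1 - r)p survivors
        for _ in range(int(r * p / 2)):                    # r*p/2 crossover pairs
            h1, h2 = select(pop, 2, rng)
            new_pop.extend(crossover(h1, h2, rng))
        pop = new_pop
        for i in rng.sample(range(len(pop)), int(m * len(pop))):
            pop[i] = mutate(pop[i], rng)
    return max(pop, key=fitness)
```

With these settings the population size stays constant at p per generation; swapping in a hypothesis-complexity penalty for `fitness` would give the regularized variant mentioned above.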
## Tools

- Weka
  - Java based
  - Most functions accessible from a GUI
  - Vast assortment of learning algorithms
  - http://www.cs.waikato.ac.nz/ml/weka
- Mallet
  - Java based
  - Limited CLI and somewhat lacking API documentation
  - Algorithms for graphical models (Conditional Random Fields)
  - http://mallet.cs.umass.edu
- Elefant
  - Python based
  - Limited GUI and CLI
  - Rather unstable beta version (as of August 2009)
  - Well suited for implementing your own algorithms
  - http://elefant.developer.nicta.com.au