Machine Learning – A Biased
and Incomplete Overview
Justus H. Piater
University of Liège
Department of Electrical Engineering and Computer Science
INTELSIG – Signal and Image Exploitation
Belgium
Why learn?
2 / 48
[from YouTube]
What is Machine Learning?
3 / 48
• Programs that improve their performance with experience
• Programs that automatically choose parameters that perform well:
  • optimization
  • function approximation
• Programs that automatically choose rules that perform well:
  • logic
  • search
Example: ALVINN
4 / 48
Neural network that learns
to steer a car on public
highways [Pomerleau 1995]
[Figure: ALVINN architecture — the sensor image feeds an input retina; the output units encode the person's steering direction]
Example: TD-Gammon
5 / 48
Reinforcement-learning system that plays Backgammon at grandmaster level [Tesauro 1995]
Program   Training Games   Results
TDG 1.0     300,000        −13 pts / 51 games (−0.25 ppg)
TDG 2.0     800,000        −7 pts / 38 games (−0.18 ppg)
TDG 2.1   1,500,000        −1 pt / 40 games (−0.02 ppg)
Real-World Applications
6 / 48
• Speech recognition
• Financial fraud detection
• Targeted advertisement
• Spam filtering
All of these are typically based on statistical analyses.
Outline
7 / 48
• Inductive Learning
• Fundamental Concepts
• Selected Methods
• Analytical Learning
• Reinforcement Learning
Inductive Learning
Learning a function
Inductive Learning 9 / 48
Given: A training set D containing N training examples (x_i, y_i).
Objective: Given a new input x, predict the corresponding output y.
Note
Learning is about generalization.
The bias/variance dilemma
Inductive Learning 10 / 48
• A model with too few degrees of freedom will make
large errors due to its large bias.
• A model with too many degrees of freedom will make large errors due to the large variance of its predictions (for a given input x).
In other words:
• A model with a low bias has a large variance.
• A model with a low variance has a large bias.
Note
• For a given model, we will have to choose our
bias/variance tradeoff.
• Models differ in their ability to keep both low.
Over- and underfitting
Inductive Learning 11 / 48
Overfitting. Variance too large ⇒ excellent
performance on training data, but poor generalization
to test data.
Underfitting. Bias too strong ⇒ poor performance
on both training and test data.
Note
• With noise-free training data, we may usefully reduce the bias as more training data come in.
• The noisier the training data, the more learning
bias we must impose.
• If we know the model, we can tolerate a large
amount of noise.
The Curse of Dimensionality
Inductive Learning 12 / 48
How much training data do we need to fit a model
reliably?
For a given bias, we need a certain data density to fit a
model, along all dimensions.
Thus, the required amount of training data is generally
exponential in the dimensionality of the model.
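This exponential growth is easy to make concrete. As a back-of-the-envelope sketch (the function name and the fixed per-axis density are invented for illustration):

```python
# Sketch: if we need roughly k sample points per axis to reach a given
# data density, a d-dimensional input space needs about k**d points.
def samples_needed(points_per_axis, dimensions):
    """Training-set size grows exponentially with dimensionality."""
    return points_per_axis ** dimensions

# 10 points per axis: feasible in 2-D, hopeless in 10-D.
print(samples_needed(10, 2))   # 100
print(samples_needed(10, 10))  # 10,000,000,000
```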
A Taxonomy of Inductive Learning Problems
Inductive Learning 13 / 48
Inductive Learning Tasks
• Numerical attributes → regression problems
• Categorical attributes → classification problems
Example: A Classification Task
Inductive Learning 14 / 48
Suppose we have two normally-distributed populations C1 and C2, and we would like to guess which of these a random observation x was drawn from.
Generative Models
Inductive Learning 15 / 48
Maximum-Likelihood estimation: Choose the class C that maximizes the likelihood p(x | C) of the observation.
Maximum A-Posteriori estimation: Choose the most probable class C, i.e. the one that maximizes p(C | x) ∝ p(x | C) P(C).
Discriminative Models
Inductive Learning 16 / 48
Rather than likelihoods and probabilities, compute a
decision surface.
Fisher’s Linear Discriminant: We guess class C1 iff w^T x > b, where w ∝ (Σ1 + Σ2)^(−1) (μ1 − μ2).
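The two generative estimation rules above can be sketched for two hypothetical one-dimensional Gaussian classes (the class parameters and priors below are invented for illustration, chosen so that the two rules disagree):

```python
import math

def gauss_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Two hypothetical 1-D Gaussian classes with strongly unequal priors.
classes = {
    "C1": {"mu": 0.0, "sigma": 1.0, "prior": 0.9},
    "C2": {"mu": 3.0, "sigma": 1.0, "prior": 0.1},
}

def ml_class(x):
    # Maximum likelihood: argmax_C p(x | C); priors are ignored.
    return max(classes, key=lambda c: gauss_pdf(x, classes[c]["mu"], classes[c]["sigma"]))

def map_class(x):
    # Maximum a posteriori: argmax_C p(x | C) P(C).
    return max(classes, key=lambda c: gauss_pdf(x, classes[c]["mu"], classes[c]["sigma"]) * classes[c]["prior"])

# At x = 1.6, the likelihood already favours C2 (past the midpoint 1.5),
# but the strong prior on C1 flips the MAP decision.
print(ml_class(1.6), map_class(1.6))
```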
Perceptrons
Inductive Learning 17 / 48
[Figure: perceptron — inputs x_1, …, x_n with weights w_1, …, w_n, plus a bias input x_0 = 1 with weight w_0, feeding a weighted sum ∑ followed by a threshold unit]
This implements the same type of decision rule as above, with w_0 = −b.
The Perceptron Training Rule
Inductive Learning 18 / 48
After each presentation of a training example (x, t):
  w_i ← w_i + η (t − o) x_i
where t is the desired output and o the perceptron's output.
Note
If the training data are linearly separable and if a sufficiently small learning rate η is used, this will converge within a finite number of steps. (Otherwise there are no guarantees.)
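The training rule above can be sketched in a few lines (the data set and the {−1, +1} output coding are chosen for illustration; logical AND is linearly separable, so convergence is guaranteed):

```python
def perceptron_train(examples, eta=0.1, epochs=100):
    """Perceptron rule: w_i <- w_i + eta * (t - o) * x_i.
    examples: list of (x, t) with x a tuple and t in {-1, +1}."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)            # w[0] is the bias weight (x_0 = 1)
    for _ in range(epochs):
        errors = 0
        for x, t in examples:
            o = 1 if w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) > 0 else -1
            if o != t:
                errors += 1
                w[0] += eta * (t - o)
                for i, xi in enumerate(x):
                    w[i + 1] += eta * (t - o) * xi
        if errors == 0:            # converged: training data separated
            break
    return w

# Linearly separable data (logical AND).
data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w = perceptron_train(data)
```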
Gradient Descent
Inductive Learning 19 / 48
Let’s use a linear unit (a perceptron without the threshold function) and a squared error function:
  E(w) = ½ ∑_d (t_d − o_d)²
Then, the error function is a parabola, and we can find the global minimum by gradient descent using Δw = −η ∇E(w).
After each round of presenting all training examples (x_d, t_d):
  w_i ← w_i + η ∑_d (t_d − o_d) x_{i,d}
Gradient Descent (Continued)
Inductive Learning 20 / 48
Note
Given a sufficiently small learning rate η, this will converge to a minimum-error solution, even for data that are not linearly separable.
The Delta Rule
Inductive Learning 21 / 48
There are good reasons to approximate gradient
descent by updating the weight vector after each
individual training example (cf. The Perceptron Training
Rule) [Widrow and Hoff 1960].
After each presentation of a training example (x_d, t_d):
  w_i ← w_i + η (t_d − o_d) x_{i,d}
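The incremental update can be sketched on a linear unit with one input (the noise-free data set, generated from t = 2·x − 1, is invented for illustration; the learned weights approach the generating ones):

```python
def delta_rule(examples, eta=0.05, epochs=500):
    """Delta rule: after each example, w_i <- w_i + eta * (t - o) * x_i,
    with the linear-unit output o = w . x (no threshold)."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)                    # w[0]: bias weight, x_0 = 1
    for _ in range(epochs):
        for x, t in examples:
            o = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            w[0] += eta * (t - o)
            for i, xi in enumerate(x):
                w[i + 1] += eta * (t - o) * xi
    return w

# Noise-free data from t = 2*x - 1; weights approach (bias -1, slope 2).
data = [((0.0,), -1.0), ((0.5,), 0.0), ((1.0,), 1.0), ((2.0,), 3.0)]
w = delta_rule(data)
```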
Artificial Neural Networks
Inductive Learning 22 / 48
[Figure: feed-forward neural network — inputs, hidden units, output units]
• Units often use sigmoid squashing functions,
enabling the learning of almost arbitrary nonlinear
functions.
• Weights are popularly trained using the error
backpropagation algorithm, like the delta rule based
on gradient descent.
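A minimal sketch of one backpropagation step, assuming the smallest possible network (one input, one sigmoid hidden unit, one linear output; the initial weights and the training example are invented). For a small enough learning rate, a single step must reduce the squared error on the presented example:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_h, w_o):
    """One sigmoid hidden unit feeding one linear output unit."""
    h = sigmoid(w_h * x)
    return h, w_o * h

def backprop_step(x, t, w_h, w_o, eta=0.1):
    h, o = forward(x, w_h, w_o)
    delta_o = t - o                           # error at the output
    delta_h = delta_o * w_o * h * (1 - h)     # error backpropagated through the sigmoid
    return w_h + eta * delta_h * x, w_o + eta * delta_o * h

x, t = 1.0, 0.5
w_h, w_o = 0.2, 0.3
_, o_before = forward(x, w_h, w_o)
w_h, w_o = backprop_step(x, t, w_h, w_o)
_, o_after = forward(x, w_h, w_o)
```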
Caveats of Artificial Neural Networks
Inductive Learning 23 / 48
Neural networks with 2 hidden layers can in principle
represent arbitrary functions to arbitrary accuracy, but:
• Many parameter and design choices must be made to avoid over- and underfitting.
• Learning is often quite slow.
• Learning can easily get stuck in local minima of the
error function.
• The resulting model is not easily interpretable.
Nevertheless, neural networks have been used
successfully in many applications.
Maximum-Margin Hyperplanes
Inductive Learning 24 / 48
[Figure: maximum-margin hyperplane in the (x_1, x_2) plane — decision boundary w^T x − b = 0, flanked by the margin hyperplanes w^T x − b = 1 and w^T x − b = −1; the margin between them has width 2 / ‖w‖]
(Linear) Support Vector Machine
Inductive Learning 25 / 48
A two-class (y = +1 or −1) maximum-margin classifier that optionally uses slack variables to allow for non-separable data.
Nonlinear Support Vector Machine
Inductive Learning 26 / 48
SVMs generally use the kernel trick (in the dual form of the optimization problem) to project the data into a higher-dimensional feature space, where they are more easily separable.
Only a few of the resulting Lagrange multipliers α_i are nonzero, and the corresponding training examples x_i are called the support vectors.
The Kernel Trick
Inductive Learning 27 / 48
[from YouTube]
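For a concrete instance of the trick, the quadratic kernel k(x, z) = (xᵀz)² on 2-D inputs equals an ordinary dot product in an explicit three-dimensional feature space, computed without ever constructing that space. The sketch below verifies this numerically:

```python
import math

def kernel(x, z):
    """Quadratic kernel k(x, z) = (x . z)^2 on 2-D inputs."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """Explicit feature map such that k(x, z) = phi(x) . phi(z)."""
    return (x[0] ** 2, x[1] ** 2, math.sqrt(2) * x[0] * x[1])

x, z = (1.0, 2.0), (3.0, 0.5)
lhs = kernel(x, z)                                  # computed in input space
rhs = sum(a * b for a, b in zip(phi(x), phi(z)))    # computed in feature space
```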
Remarks about SVMs
Inductive Learning 28 / 48
• Popular kernels include polynomials, radial basis
functions, and others.
• Training an SVM amounts to solving a standard
quadratic programming problem.
• SVMs are well regularized and work well with high-dimensional data (in fact, the feature space is often infinite-dimensional).
• The maximum-margin property tends to yield good generalization from relatively small training sets.
• SVMs are more easily set up than, say, neural networks.
See a demo (by Guillaume Caron).
Popular Techniques
Inductive Learning 29 / 48
• Support Vector Machines [Burges 1998]
Kernel-based methods are currently hot (Support Vector Regression [Smola and Schölkopf 2004], Kernel PCA, …).
• (Randomized) Decision Tree Ensembles [Breiman 2001, Geurts et
al. 2006]
• Probabilistic models
Analytical Learning
What Are We Missing?
Analytical Learning 31 / 48
Humans often learn from a single example, but inductive learning is in bad shape (the bias/variance dilemma, over- and underfitting, the curse of dimensionality).
(Pure) Explanation-based Learning
Analytical Learning 32 / 48
EBL uses a domain theory, e.g. in the form of logical assertions.
What is there left to learn then? (Consider chess…)
An Example
Analytical Learning 33 / 48
Target concept:
A positive training example:
An Example (Continued)
Analytical Learning 34 / 48
Domain theory:
Learning
Analytical Learning 35 / 48
1. Explain:
2. Generalize:
Remarks
Analytical Learning 36 / 48
• Thanks to the domain theory, the relevant features
were identified.
• We just learned a new feature not present in the
domain theory nor in the training example: The
product of volume and density is less than five.
• We don’t really need any training examples at all –
but they guide the rule generation process towards
cases that arise in practice (if test data resemble
training data – this is an inductive hypothesis!).
Extensions
Analytical Learning 37 / 48
We would like to
• learn from incomplete and imperfect domain
theories,
• learn domain theories!
Examples of both exist [Thrun 1996, Kimmig et al. 2007], but they have not yet led to a unified theory of learning.
Reinforcement Learning
Learning With Less Knowledge
Reinforcement Learning 39 / 48
All of the above methods are supervised.
Q: What if we cannot provide training data in the
form of desired outputs?
A: Let the learner explore by trial and error!
Reinforcement Learning: Try something, and if it is
rewarded, do it again.
Perception-Action Loop
Reinforcement Learning 40 / 48
[Figure: perception–action loop — the agent's perception-action mapping acts on the environment; an evaluation signal drives learning of the mapping]
Reinforcement Learning
Reinforcement Learning 41 / 48
Scenario: A (finite or infinite) sequence of states, actions, rewards:
  s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, r_3, …
Objective: Learn a policy π such that, at each time t, taking action a_t = π(s_t) maximizes the expected return
  R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ⋯
[Sutton and Barto 1998]
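The discounted return is simple to compute for a finite reward sequence (the reward sequence below is invented for illustration):

```python
def discounted_return(rewards, gamma=0.9):
    """R_t = r_{t+1} + gamma * r_{t+2} + gamma**2 * r_{t+3} + ..."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# A sparse reward of 1 arriving after three steps is discounted to gamma**3.
print(discounted_return([0, 0, 0, 1]))   # gamma**3 ≈ 0.729
```

Discounting (γ < 1) keeps the return finite on infinite sequences and makes earlier rewards worth more than later ones.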
Temporal-Difference Learning
Reinforcement Learning 42 / 48
Maintain a state value function V^π(s) that estimates, for each state, the expected return under policy π:
  V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)]
This is an on-policy method. Moreover, deriving actions from V requires a world model P(s′ | s, a).
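The state-value estimation described here can be sketched as TD(0) prediction under a fixed policy (the 4-state chain task, its reward, and the "always move right" policy are invented for illustration; the estimates approach γ² = 0.81, γ = 0.9, and 1):

```python
# TD(0) sketch: estimate V under a fixed "always right" policy on a
# 4-state chain; entering terminal state 3 yields reward 1.
def td0(episodes=200, alpha=0.1, gamma=0.9):
    V = [0.0] * 4                         # V[3] stays 0 (terminal state)
    for _ in range(episodes):
        s = 0
        while s != 3:
            s2 = s + 1                    # the fixed policy moves right
            r = 1.0 if s2 == 3 else 0.0
            V[s] += alpha * (r + gamma * V[s2] - V[s])   # TD(0) update
            s = s2
    return V

V = td0()
```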
Q-Learning
Reinforcement Learning 43 / 48
Maintain a state-action value function Q(s, a) that estimates, for each state-action pair, the expected return in the case that all subsequent actions are chosen optimally:
  Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)]
This is an off-policy method.
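The state-action update described here is the standard Q-learning rule; a minimal sketch on an invented 4-state chain task (states, rewards, and hyperparameters are illustrative). With ε-greedy exploration the learned greedy policy moves right everywhere:

```python
import random

# 4-state chain: states 0..3, actions 0 = left, 1 = right;
# entering state 3 yields reward 1 and ends the episode.
def step(s, a):
    s2 = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == 3 else 0.0), s2 == 3

def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.3, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(4) for a in (0, 1)}
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy behaviour policy: mostly greedy, sometimes random
            a = rng.choice((0, 1)) if rng.random() < eps else max((0, 1), key=lambda a: Q[(s, a)])
            s2, r, done = step(s, a)
            best_next = 0.0 if done else max(Q[(s2, 0)], Q[(s2, 1)])
            # Off-policy update: bootstrap from the greedy action in s2,
            # regardless of which action the behaviour policy takes next.
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q

Q = q_learning()
policy = [max((0, 1), key=lambda a: Q[(s, a)]) for s in range(3)]
```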
Wrap-Up
Painfully Omitted
Wrap-Up 45 / 48
Unsupervised Learning:
• clustering
• dimensionality reduction
• data mining
Evolutionary Learning:
• randomized search guided by genetically-inspired heuristics
Conclusions
Wrap-Up 46 / 48
Current Successes:
• Machine Learning is everywhere (mostly
classification, regression, data mining).
• Reinforcement Learning shines on highly stochastic
tasks where training data are easily synthesized.
• Analytical learning is used for search control.
Open Challenges:
• Incremental learning
• Learning by physical systems (robots)
• Unifying empirical and analytical learning
Other types of learning exist (such as correlation-based learning), but are currently less important in the computational camp.
Purely associative learning…
Wrap-Up 47 / 48
… can be disastrous. Use your domain knowledge!
[from YouTube]
References
Wrap-Up 48 / 48
L. Breiman, “Random Forests”. Machine Learning 45(1), pp. 5–32, 2001.
C. Burges, “A Tutorial on Support Vector Machines for Pattern Recognition”. Data Mining and Knowledge
Discovery 2(2), pp. 121–167, 1998.
P. Geurts, D. Ernst, L. Wehenkel, “Extremely Randomized Trees”. Machine Learning 36(1), pp. 3–42, 2006.
A. Kimmig, L. De Raedt, H. Toivonen, “Probabilistic Explanation Based Learning”. 18th European Conference on
Machine Learning, pp. 176–187, 2007.
D. Pomerleau, “Neural Network Vision for Robot Driving”. The Handbook of Brain Theory and Neural Networks,
1995.
A. Smola, B. Schölkopf, “A Tutorial on Support Vector Regression”. Statistics and Computing 14, pp. 199–222,
2004.
R. Sutton, A. Barto, Reinforcement Learning: An Introduction, MIT Press 1998.
G. Tesauro, “Temporal Difference Learning and TD-Gammon”. Communications of the ACM 38(3), pp. 58–68, 1995.
S. Thrun, Explanation-Based Neural Network Learning: A Lifelong Learning Approach, Kluwer Academic Publishers 1996.
B. Widrow, M. Hoff, “Adaptive switching circuits”. IRE WESCON Convention Record 4, pp. 96–104, 1960.
These notes are online at http://www.montefiore.ulg.ac.be/~piater/presentations/MLPACO.php.