Machine Learning – A Biased and Incomplete Overview

Justus H. Piater

University of Liège

Department of Electrical Engineering and Computer Science

INTELSIG – Signal and Image Exploitation

Belgium

Why learn?


[from YouTube]

What is Machine Learning?


• Programs that improve their performance with experience
• Programs that automatically choose parameters that perform well:
  • optimization
  • function approximation
• Programs that automatically choose rules that perform well:
  • logic
  • search

Example: ALVINN


Neural network that learns to steer a car on public highways [Pomerleau 1995]

[Figure: the ALVINN architecture – a sensor image feeds an input retina, which is mapped through the network to output units encoding the steering direction; the training signal is the person's steering direction.]

Example: TD-Gammon


Reinforcement-learning system that plays Backgammon at grandmaster level [Tesauro 1995]

Program | Training Games | Results
TDG 1.0 | 300,000 | −13 pts / 51 games (−0.25 ppg)
TDG 2.0 | 800,000 | −7 pts / 38 games (−0.18 ppg)
TDG 2.1 | 1,500,000 | −1 pt / 40 games (−0.02 ppg)

Real-World Applications


• Speech recognition

• Financial fraud detection

• Targeted advertisement

• Spam filtering

All of these are typically based on statistical analyses.

Outline


• Inductive Learning

• Fundamental Concepts

• Selected Methods

• Analytical Learning

• Reinforcement Learning

Inductive Learning

Learning a function

Given: A training set $\{(x_i, y_i)\}_{i=1}^{N}$ containing $N$ training examples.

Objective: Given a new input $x$, predict the corresponding output $y$.

Note

Learning is about generalization.

The bias/variance dilemma

• A model with too few degrees of freedom will make large errors due to its large bias.
• A model with too many degrees of freedom will make large errors due to the large variance of its predictions (for a given input $x$).

In other words:

• A model with a low bias has a large variance.

• A model with a low variance has a large bias.

Note

• For a given model, we will have to choose our bias/variance trade-off.
• Models differ in their ability to keep both low.

Over- and underfitting

Overfitting: variance too large ⇒ excellent performance on training data, but poor generalization to test data.

Underfitting: bias too strong ⇒ poor performance on both training and test data.

Note

• With noise-free training data, we may usefully reduce the bias as more training data come in.
• The noisier the training data, the more learning bias we must impose.
• If we know the model, we can tolerate a large amount of noise.

The Curse of Dimensionality

How much training data do we need to fit a model reliably?

For a given bias, we need a certain data density along all dimensions to fit a model.

Thus, the required amount of training data is generally exponential in the dimensionality of the model.
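A quick back-of-the-envelope sketch (the numbers are hypothetical): if we want a fixed resolution of, say, 10 sample points per axis, the total number of samples needed grows as $10^d$ with the dimensionality $d$:

```python
# How much data does a fixed density demand? With a given number of
# sample points per axis, the total grows exponentially in the dimension d.
def samples_needed(points_per_axis, d):
    return points_per_axis ** d

for d in (1, 2, 3, 10):
    print(d, samples_needed(10, d))
```

With 10 points per axis, one dimension needs 10 samples, but ten dimensions already demand 10^10 – the curse in action.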

A Taxonomy of Inductive Learning Problems

Inductive Learning Tasks:

• Numerical attributes → regression problems
• Categorical attributes → classification problems

Example: A Classification Task

Suppose we have two normally-distributed populations $\omega_1$ and $\omega_2$, and we would like to guess which of these a random observation $x$ was drawn from.

Generative Models

Maximum-Likelihood estimation: Choose the class $\omega_i$ that maximizes the likelihood $p(x \mid \omega_i)$ of the observation.

Maximum A-Posteriori estimation: Choose the most probable class $\omega_i$, i.e. the one with maximal $P(\omega_i \mid x) \propto p(x \mid \omega_i)\,P(\omega_i)$.
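As a minimal illustration (the classes, means, and priors below are all hypothetical), consider two 1-D Gaussian populations: ML picks the class with the highest likelihood $p(x \mid \omega)$, while MAP additionally weighs in the prior $P(\omega)$, which can flip the decision:

```python
import math

def gauss(x, mu, sigma):
    """Gaussian likelihood p(x | mu, sigma)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Two hypothetical classes: omega_1 ~ N(0, 1), omega_2 ~ N(2, 1),
# with priors P(omega_1) = 0.9 and P(omega_2) = 0.1.
classes = {"omega_1": (0.0, 1.0, 0.9), "omega_2": (2.0, 1.0, 0.1)}

def ml_class(x):
    return max(classes, key=lambda c: gauss(x, classes[c][0], classes[c][1]))

def map_class(x):
    return max(classes, key=lambda c: gauss(x, classes[c][0], classes[c][1]) * classes[c][2])

x = 1.2  # slightly closer to the mean of omega_2
print(ml_class(x))   # likelihood alone favors omega_2
print(map_class(x))  # the strong prior tips the decision to omega_1
```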

Discriminative Models

Rather than likelihoods and probabilities, compute a decision surface.

Fisher’s Linear Discriminant: We guess $\omega_1$ iff $w^T x > c$ for a suitable threshold $c$, where $w = (\Sigma_1 + \Sigma_2)^{-1}(\mu_1 - \mu_2)$.

Perceptrons

[Figure: a perceptron – inputs $x_1, \dots, x_n$ plus a constant input $x_0 = 1$, weighted by $w_0, w_1, \dots, w_n$, feed a summation unit followed by a threshold.]

This implements the same type of decision rule as above, with $o(\mathbf{x}) = \operatorname{sgn}\left(\sum_{i=0}^{n} w_i x_i\right)$.

The Perceptron Training Rule

After each presentation of a training example $(\mathbf{x}, t)$:

$w_i \leftarrow w_i + \eta\,(t - o)\,x_i$

where $t$ is the desired output and $o$ the perceptron’s output.

Note

If the training data are linearly separable and if a sufficiently small learning rate $\eta$ is used, this will converge within a finite number of steps. (Otherwise there are no guarantees.)
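A minimal sketch of the perceptron training rule in code – the AND data, learning rate, and epoch count are hypothetical choices; on this linearly separable problem the rule converges after a handful of epochs:

```python
# Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i,
# with a constant input x_0 = 1 folded in to serve as the bias.
def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

def train(examples, eta=1, epochs=20):
    w = [0] * len(examples[0][0])
    for _ in range(epochs):
        for x, t in examples:
            o = predict(w, x)
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w

# The AND function, with the constant 1 prepended as x_0.
data = [([1, 0, 0], 0), ([1, 0, 1], 0), ([1, 1, 0], 0), ([1, 1, 1], 1)]
w = train(data)
print([predict(w, x) for x, _ in data])  # [0, 0, 0, 1]
```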

Gradient Descent

Let’s use a linear unit (a perceptron without the threshold function) and a squared error function:

$E(\mathbf{w}) = \frac{1}{2} \sum_d (t_d - o_d)^2$

Then, the error function is a parabola, and we can find the global minimum by gradient descent using $\Delta w_i = -\eta\,\partial E / \partial w_i$.

After each round of presenting all training examples:

$w_i \leftarrow w_i + \eta \sum_d (t_d - o_d)\,x_{id}$

Gradient Descent (Continued)

Note

Given a sufficiently small $\eta$, this will converge to a minimum-error solution, even for data that are not linearly separable.

The Delta Rule

There are good reasons to approximate gradient descent by updating the weight vector after each individual training example (cf. the perceptron training rule) [Widrow and Hoff 1960].

After each presentation of a training example $(\mathbf{x}, t)$:

$w_i \leftarrow w_i + \eta\,(t - o)\,x_i$
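A sketch of the delta rule on a linear unit – the target function, learning rate, and epoch count are hypothetical; although updates are made one example at a time, the weights approach the true coefficients:

```python
# Delta rule: w_i <- w_i + eta * (t - o) * x_i, applied after each example,
# where o = w . x is the output of a linear unit (no threshold).
def linear_out(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def delta_rule(examples, eta=0.05, epochs=200):
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for x, t in examples:
            o = linear_out(w, x)
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w

# Hypothetical noise-free linear target: t = 2*x1 - 1 (with x0 = 1 as bias).
data = [([1.0, xv], 2 * xv - 1) for xv in (-1.0, -0.5, 0.0, 0.5, 1.0)]
w = delta_rule(data)
print(w)  # close to [-1.0, 2.0]
```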

Artificial Neural Networks

[Figure: a feed-forward network with input units, hidden units, and output units.]

• Units often use sigmoid squashing functions, enabling the learning of almost arbitrary nonlinear functions.
• Weights are popularly trained using the error backpropagation algorithm, which, like the delta rule, is based on gradient descent.

Caveats of Artificial Neural Networks

Neural networks with 2 hidden layers can in principle represent arbitrary functions to arbitrary accuracy, but:

• Many parameter and design choices must be made to avoid over- and underfitting.
• Learning is often quite slow.
• Learning can easily get stuck in local minima of the error function.
• The resulting model is not easily interpretable.

Nevertheless, neural networks have been used successfully in many applications.

Maximum-Margin Hyperplanes

[Figure: a maximum-margin hyperplane in two dimensions – the decision surface $\mathbf{w}^T \mathbf{x} - b = 0$ lies midway between the margin hyperplanes $\mathbf{w}^T \mathbf{x} - b = 1$ and $\mathbf{w}^T \mathbf{x} - b = -1$; the margin width is $2/\|\mathbf{w}\|$.]

(Linear) Support Vector Machine

A two-class (labels $y_i = +1$ or $-1$) maximum-margin classifier that optionally uses slack variables to allow for non-separable data.

Nonlinear Support Vector Machine

SVMs generally use the kernel trick (in the dual form of the optimization problem) to project the data into a higher-dimensional feature space, where they are more easily separable:

$f(\mathbf{x}) = \operatorname{sgn}\left(\sum_i \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) - b\right)$

Only a few of the resulting $\alpha_i$ are nonzero, and their corresponding $\mathbf{x}_i$ are called the support vectors.

The Kernel Trick


[from YouTube]
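The trick can also be verified numerically: for the polynomial kernel $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x} \cdot \mathbf{z})^2$ on 2-D inputs, the kernel value equals a dot product in the 3-D feature space $\phi(\mathbf{x}) = (x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2)$, without ever constructing $\phi$ explicitly (example values hypothetical):

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def poly_kernel(x, z):
    """K(x, z) = (x . z)^2, computed directly in input space."""
    return dot(x, z) ** 2

def phi(x):
    """Explicit feature map for which K(x, z) = phi(x) . phi(z)."""
    return [x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2]

x, z = [1.0, 2.0], [3.0, -1.0]
print(poly_kernel(x, z), dot(phi(x), phi(z)))  # both ≈ (3 - 2)^2 = 1
```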

Remarks about SVMs

• Popular kernels include polynomials, radial basis functions, and others.
• Training an SVM amounts to solving a standard quadratic programming problem.
• SVMs are well regularized and work well with high-dimensional data (in fact, the feature space is often infinite-dimensional).
• The maximum-margin property tends to yield good generalization from relatively small training sets.
• SVMs are more easily set up than, say, neural networks.

See a demo (by Guillaume Caron).

Popular Techniques

• Support Vector Machines [Burges 1998]
  Kernel-based methods are currently hot (Support Vector Regression [Smola and Schölkopf 2004], Kernel PCA, …).
• (Randomized) Decision Tree Ensembles [Breiman 2001, Geurts et al. 2006]
• Probabilistic models

Analytical Learning

What Are We Missing?

Humans often learn from a single example, but inductive learning is in bad shape (the bias/variance dilemma, over- and underfitting, the curse of dimensionality).

(Pure) Explanation-based Learning

EBL uses a domain theory, e.g. in the form of logical assertions.

What is there left to learn then? (Consider chess…)

An Example

Target concept: SafeToStack(x, y)

A positive training example:

On(Obj1, Obj2) ∧ Type(Obj1, Box) ∧ Type(Obj2, Endtable) ∧ Volume(Obj1, 2) ∧ Density(Obj1, 0.3) ∧ …

An Example (Continued)

Domain theory:

SafeToStack(x, y) ← ¬Fragile(y)
SafeToStack(x, y) ← Lighter(x, y)
Lighter(x, y) ← Weight(x, wx) ∧ Weight(y, wy) ∧ LessThan(wx, wy)
Weight(x, w) ← Volume(x, v) ∧ Density(x, d) ∧ Equal(w, v · d)
Weight(x, 5) ← Type(x, Endtable)

1. Explain: construct a proof, from the domain theory, that the training example satisfies the target concept.
2. Generalize: regress the target concept through the explanation, retaining the weakest preconditions under which the proof still holds.

Remarks

• Thanks to the domain theory, the relevant features were identified.
• We just learned a new feature present neither in the domain theory nor in the training example: the product of volume and density is less than five.
• We don’t really need any training examples at all – but they guide the rule-generation process towards cases that arise in practice (if test data resemble training data – this is an inductive hypothesis!).

Extensions

We would like to

• learn from incomplete and imperfect domain theories,
• learn domain theories!

Examples of both exist [Thrun 1996, Kimmig et al. 2007], but they have not yet led to a unified theory of learning.

Reinforcement Learning

Learning With Less Knowledge

All of the above methods are supervised.

Q: What if we cannot provide training data in the form of desired outputs?

A: Let the learner explore by trial and error!

Reinforcement Learning: try something, and if it is rewarded, do it again.

Perception-Action-Loop

[Figure: the perception-action loop – the agent’s perception-action mapping selects actions in the environment; an evaluation (reward) signal feeds back to the agent, which learns from it.]

Reinforcement Learning

Scenario: A (finite or infinite) sequence of states, actions, and rewards:

$s_0, a_0, r_1, s_1, a_1, r_2, s_2, \dots$

Objective: Learn a policy $\pi$ such that, at each time $t$, taking action $a_t = \pi(s_t)$ maximizes the expected return $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$.

[Sutton and Barto 1998]

Temporal-Difference Learning

Maintain a state value function $V^\pi(s)$ that estimates, for each state, the expected return under policy $\pi$:

$V(s_t) \leftarrow V(s_t) + \alpha\,[\,r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\,]$

This is an on-policy method. Moreover, it requires a world model $P(s' \mid s, a)$.
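A minimal TD(0) sketch on a hypothetical two-state chain under a fixed policy (the dynamics, rewards, and parameters are all invented for illustration): each update nudges $V(s)$ toward $r + \gamma V(s')$.

```python
# Hypothetical chain: state 0 --r=0--> state 1 --r=1--> terminal.
# With gamma = 0.9 the true values are V(1) = 1 and V(0) = 0.9.
def td0(episodes=2000, alpha=0.1, gamma=0.9):
    V = [0.0, 0.0]
    for _ in range(episodes):
        # TD(0) update V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)),
        # with V(terminal) = 0.
        V[0] += alpha * (0.0 + gamma * V[1] - V[0])
        V[1] += alpha * (1.0 + gamma * 0.0 - V[1])
    return V

V = td0()
print(V)  # approximately [0.9, 1.0]
```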

Q-Learning

Maintain a state-action value function $Q(s, a)$ that estimates, for each state-action pair, the expected return in the case that all subsequent actions are chosen optimally:

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,[\,r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)\,]$

This is an off-policy method.
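A compact tabular Q-learning sketch on a hypothetical 1-D corridor (states 0–3; reward only upon reaching state 3). Everything here – the environment, rates, exploration scheme, and episode count – is invented for illustration:

```python
import random

# Corridor: states 0..3, actions {-1, +1}; reaching state 3 yields reward 1.
N, GOAL = 4, 3

def step(s, a):
    s2 = min(max(s + a, 0), GOAL)
    r = 1.0 if s2 == GOAL else 0.0
    return s2, r, s2 == GOAL

def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    random.seed(seed)
    Q = {(s, a): 0.0 for s in range(N) for a in (-1, 1)}
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy behavior policy
            a = random.choice((-1, 1)) if random.random() < eps \
                else max((-1, 1), key=lambda a: Q[(s, a)])
            s2, r, done = step(s, a)
            # Off-policy update: bootstrap from the greedy action in s2.
            best_next = 0.0 if done else max(Q[(s2, -1)], Q[(s2, 1)])
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q

Q = q_learning()
policy = [max((-1, 1), key=lambda a: Q[(s, a)]) for s in range(GOAL)]
print(policy)  # the learned greedy policy moves right: [1, 1, 1]
```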

Wrap-Up

Painfully Omitted


Unsupervised Learning:

• clustering

• dimensionality reduction

• data mining

Evolutionary Learning:

• randomized search guided by genetically-inspired heuristics

Conclusions

Current Successes:

• Machine Learning is everywhere (mostly classification, regression, data mining).
• Reinforcement Learning shines on highly stochastic tasks where training data are easily synthesized.
• Analytical learning is used for search control.

Open Challenges:

• Incremental learning
• Learning by physical systems (robots)
• Unifying empirical and analytical learning

Other types of learning exist (such as correlation-based learning), but are currently less important in the computational camp.

Purely associative learning…


… can be disastrous. Use your domain knowledge!

[from YouTube]

References

L. Breiman, “Random Forests”. Machine Learning 45(1), pp. 5–32, 2001.
C. Burges, “A Tutorial on Support Vector Machines for Pattern Recognition”. Data Mining and Knowledge Discovery 2(2), pp. 121–167, 1998.
P. Geurts, D. Ernst, L. Wehenkel, “Extremely Randomized Trees”. Machine Learning 63(1), pp. 3–42, 2006.
A. Kimmig, L. De Raedt, H. Toivonen, “Probabilistic Explanation Based Learning”. 18th European Conference on Machine Learning, pp. 176–187, 2007.
D. Pomerleau, “Neural Network Vision for Robot Driving”. The Handbook of Brain Theory and Neural Networks, 1995.
A. Smola, B. Schölkopf, “A Tutorial on Support Vector Regression”. Statistics and Computing 14, pp. 199–222, 2004.
R. Sutton, A. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.
G. Tesauro, “Temporal Difference Learning and TD-Gammon”. Communications of the ACM 38(3), pp. 58–68, 1995.
S. Thrun, Explanation-Based Neural Network Learning: A Lifelong Learning Approach, Kluwer Academic Publishers, 1996.
B. Widrow, M. Hoff, “Adaptive switching circuits”. IRE WESCON Convention Record 4, pp. 96–104, 1960.

These notes are online at http://www.montefiore.ulg.ac.be/~piater/presentations/ML-PACO.php.
