CS 446: Machine Learning
Gerald DeJong
mrebl@.uiuc.edu
3-0491
3320 SC
Recent approval for a TA, to be named later
INTRODUCTION
CS446, Fall 06
Office hours: after most classes and Thur @ 3
Text: Mitchell’s Machine Learning
Grading (each a third):
  Midterm: Oct. 4
  Final: Dec. 12
  Homeworks / projects
    Submit at the beginning of class
    Late penalty: 20% / day up to 3 days
    Programming, some in-class assignments
Class web site soon
Cheating: none allowed! We adopt dept. policy
Please answer these and hand in now
Name
Department
Where (If?*) you had Intro AI course
Who taught it (esp. if not here)
1) Why interested in Machine Learning?
2) Any topics you would like to see covered?
* may require significant additional effort
Approx. Course Overview / Topics
Introduction: Basic problems and questions
A detailed example: Linear threshold units
Basic Paradigms:
  PAC (Risk Minimization); Bayesian Theory; SRM (Structural Risk Minimization); Compression; Maximum Entropy; …
  Generative/Discriminative; Classification/Skill; …
Learning Protocols:
  Online/Batch; Supervised/Unsupervised/Semi-supervised; Delayed supervision
Algorithms:
  Decision Trees (C4.5)
  [Rules and ILP (Ripper, Foil)]
  Linear Threshold Units (Winnow, Perceptron; Boosting; SVMs; Kernels)
  Probabilistic Representations (naïve Bayes, Bayesian trees; density estimation)
  Delayed supervision: RL
  Unsupervised/Semi-supervised: EM
Clustering, Dimensionality Reduction, or others of student interest
What to Learn
Classifiers:
Learn a hidden function
Concept Learning: chair ? face ? game ?
Diagnosis: medical; risk assessment
Models:
Learn a map (and use it to navigate)
Learn a distribution (and use it to answer queries)
Learn a language model; Learn an Automaton
Skills:
Learn to play games; Learn a Plan / Policy
Learn to Reason; Learn to Plan
Clusterings:
Shapes of objects; Functionality; Segmentation
Abstraction
Focus on classification (importance, theoretical richness, generality, …)
What to Learn?
Direct Learning: (discriminative, model-free [bad name])
  Learn a function that maps an input instance to the sought-after property.
Model Learning: (indirect, generative)
  Learn a model of the domain; then use it to answer various questions about the domain
In both cases, several protocols can be used:
  Supervised: learner is given examples and answers
  Unsupervised: examples, but no answers
  Semi-supervised: some examples w/ answers, others w/o
  Delayed supervision
Supervised Learning
Given: Examples $(x, f(x))$ of some unknown function $f$
Find: A good approximation to $f$
$x$ provides some representation of the input
  The process of mapping a domain element into a representation is called Feature Extraction. (Hard; ill-understood; important)
  $x \in \{0,1\}^n$ or $x \in \mathbb{R}^n$
The target function (label):
  $f(x) \in \{-1,+1\}$: Binary Classification
  $f(x) \in \{1,2,3,\ldots,k-1\}$: Multi-class classification
  $f(x) \in \mathbb{R}$: Regression
Example and Hypothesis Spaces
[Figure: the example space X, containing points marked + and -, shown next to the hypothesis space H]
X: Example Space: set of all well-formed inputs [w/ a distribution]
H: Hypothesis Space: set of all well-formed outputs
Supervised Learning: Examples
Disease diagnosis
  x: Properties of patient (symptoms, lab tests)
  f: Disease (or maybe: recommended therapy)
Part-of-Speech tagging
  x: An English sentence (e.g., The can will rust)
  f: The part of speech of a word in the sentence
Face recognition
  x: Bitmap picture of person’s face
  f: Name the person (or maybe: a property of)
Automatic Steering
  x: Bitmap picture of road surface in front of car
  f: Degrees to turn the steering wheel
A Learning Problem
$y = f(x_1, x_2, x_3, x_4)$
[Figure: inputs x1, x2, x3, x4 feed a box labeled "Unknown function" whose output is y; X and H are marked with question marks]
(Boolean: x1, x2, x3, x4, f)
Training Set
$y = f(x_1, x_2, x_3, x_4)$, where $f$ is the unknown function

Example  x1 x2 x3 x4  y
1        0  0  1  0   0
2        0  1  0  0   0
3        0  0  1  1   1
4        1  0  0  1   1
5        0  1  1  0   0
6        1  1  0  0   0
7        0  1  0  1   0
Hypothesis Space
Complete Ignorance:
How many possible functions? $2^{16} = 65536$ over four input features.
After seven examples how many possibilities for f? $2^9$ possibilities remain for f
How many examples until we figure out which is correct? We need to see labels for all 16 examples!
Is Learning Possible?
x1 x2 x3 x4 y
1 1 1 1 ?
0 0 0 0 ?
1 0 0 0 ?
1 0 1 1 ?
1 1 0 0 0
1 1 0 1 ?
1 0 1 0 ?
1 0 0 1 1
0 1 0 0 0
0 1 0 1 0
0 1 1 0 0
0 1 1 1 ?
0 0 1 1 1
0 0 1 0 0
0 0 0 1 ?
1 1 1 0 ?
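The counts on this slide can be checked by brute force. The Python sketch below is not part of the original slides; it enumerates every Boolean function of four inputs as a truth table and counts those that agree with the seven training examples. The names `TRAIN`, `INPUTS`, and `count_consistent` are illustrative.

```python
from itertools import product

# Training set from the slides: (x1, x2, x3, x4) -> y
TRAIN = {
    (0, 0, 1, 0): 0,
    (0, 1, 0, 0): 0,
    (0, 0, 1, 1): 1,
    (1, 0, 0, 1): 1,
    (0, 1, 1, 0): 0,
    (1, 1, 0, 0): 0,
    (0, 1, 0, 1): 0,
}

INPUTS = list(product([0, 1], repeat=4))  # all 16 possible inputs


def count_consistent(train):
    """Count truth tables over INPUTS that agree with every training label."""
    count = 0
    for labels in product([0, 1], repeat=len(INPUTS)):  # 2**16 truth tables
        table = dict(zip(INPUTS, labels))
        if all(table[x] == y for x, y in train.items()):
            count += 1
    return count


print(2 ** len(INPUTS))         # 65536 Boolean functions in total
print(count_consistent(TRAIN))  # 512 = 2**9 remain after seven examples
```

Each unlabeled input can still take either value, which is why nine unseen inputs leave $2^9$ candidates.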
Another Hypothesis Space
Simple Rules: There are only 16 simple conjunctive rules of the form $y = x_i \wedge x_j \wedge x_k \ldots$
No simple rule explains the data. The same is true for simple clauses.
Training set:
Example  x1 x2 x3 x4  y
1        0  0  1  0   0
2        0  1  0  0   0
3        0  0  1  1   1
4        1  0  0  1   1
5        0  1  1  0   0
6        1  1  0  0   0
7        0  1  0  1   0

Rule                   Counterexample (x1x2x3x4  y)
y = c
x1                     1100  0
x2                     0100  0
x3                     0110  0
x4                     0101  0
x1 ∧ x2                1100  0
x1 ∧ x3                0011  1
x1 ∧ x4                0011  1
x2 ∧ x3                0011  1
x2 ∧ x4                0011  1
x3 ∧ x4                1001  1
x1 ∧ x2 ∧ x3           0011  1
x1 ∧ x2 ∧ x4           0011  1
x1 ∧ x3 ∧ x4           0011  1
x2 ∧ x3 ∧ x4           0011  1
x1 ∧ x2 ∧ x3 ∧ x4      0011  1
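The claim that no simple conjunction explains the data can be verified mechanically. The following Python sketch (an illustration, not from the slides) tries all 16 conjunctions of un-negated variables, treating the empty conjunction as the constant rule, and reports which ones fit the training set. The helper names are assumptions.

```python
from itertools import combinations

# Training set from the slides: ((x1, x2, x3, x4), y)
TRAIN = [
    ((0, 0, 1, 0), 0),
    ((0, 1, 0, 0), 0),
    ((0, 0, 1, 1), 1),
    ((1, 0, 0, 1), 1),
    ((0, 1, 1, 0), 0),
    ((1, 1, 0, 0), 0),
    ((0, 1, 0, 1), 0),
]


def conjunction(subset):
    """Return h(x) = AND of x[i] for i in subset (empty subset is constant 1)."""
    return lambda x: int(all(x[i] for i in subset))


def consistent_conjunctions(train, num_vars=4):
    """List every variable subset whose conjunction matches all labels."""
    found = []
    for size in range(num_vars + 1):
        for subset in combinations(range(num_vars), size):
            h = conjunction(subset)
            if all(h(x) == y for x, y in train):
                found.append(subset)
    return found


print(consistent_conjunctions(TRAIN))  # [] : no conjunction fits
```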
Third Hypothesis Space
m-of-n rules: There are 32 possible rules of the form ”y = 1 if and only if at least m of the following n variables are 1”
Found a consistent hypothesis.
Training set:
Example  x1 x2 x3 x4  y
1        0  0  1  0   0
2        0  1  0  0   0
3        0  0  1  1   1
4        1  0  0  1   1
5        0  1  1  0   0
6        1  1  0  0   0
7        0  1  0  1   0

Each entry is the number of a training example that contradicts the rule; "none" marks a rule with no counterexample.
variables           1-of  2-of  3-of  4-of
{x1}                3
{x2}                2
{x3}                1
{x4}                7
{x1, x2}            2     3
{x1, x3}            1     3
{x1, x4}            6     3
{x2, x3}            2     3
{x2, x4}            2     3
{x3, x4}            1     4
{x1, x2, x3}        1     3     3
{x1, x2, x4}        2     3     3
{x1, x3, x4}        1     none  3
{x2, x3, x4}        1     5     3
{x1, x2, x3, x4}    1     5     3     3
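The m-of-n search can be reproduced with a short enumeration. This Python sketch (illustrative; the function names are assumptions) generates every pair (m, subset) with 1 ≤ m ≤ n over the four variables and keeps the rules that match all seven training labels.

```python
from itertools import combinations

# Training set from the slides: ((x1, x2, x3, x4), y)
TRAIN = [
    ((0, 0, 1, 0), 0),
    ((0, 1, 0, 0), 0),
    ((0, 0, 1, 1), 1),
    ((1, 0, 0, 1), 1),
    ((0, 1, 1, 0), 0),
    ((1, 1, 0, 0), 0),
    ((0, 1, 0, 1), 0),
]


def m_of_n(m, subset):
    """Return h(x) = 1 iff at least m of the variables in subset are 1."""
    return lambda x: int(sum(x[i] for i in subset) >= m)


def search_m_of_n(train, num_vars=4):
    """Return (all rules, consistent rules) as lists of (m, subset) pairs."""
    rules, consistent = [], []
    for n in range(1, num_vars + 1):
        for subset in combinations(range(num_vars), n):
            for m in range(1, n + 1):
                rules.append((m, subset))
                h = m_of_n(m, subset)
                if all(h(x) == y for x, y in train):
                    consistent.append((m, subset))
    return rules, consistent


rules, consistent = search_m_of_n(TRAIN)
print(len(rules))   # 32
print(consistent)   # [(2, (0, 2, 3))], i.e. at least 2 of {x1, x3, x4}
```

Indices are zero-based, so `(0, 2, 3)` stands for the variables x1, x3, x4.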
Views of Learning
Learning is the removal of our remaining uncertainty:
  Suppose we knew that the unknown function was an m-of-n Boolean function, then we could use the training data to infer which function it is.
Learning requires guessing a good, small hypothesis class:
  We can start with a very small class and enlarge it until it contains a hypothesis that fits the data.
We could be wrong!
  Our prior knowledge might be wrong: y = x4 ∧ one-of (x1, x3) is also consistent
  Our guess of the hypothesis class could be wrong
If this is the unknown function, then we will make errors when we are given new examples, and are asked to predict the value of the function
General strategy for Machine Learning
H should respect our prior understanding:
Excess expressivity makes learning difficult
Expressivity of H should match our ignorance
Understand flexibility of std. hypothesis spaces:
Decision trees, neural networks, rule grammars, stochastic models
Hypothesis spaces of flexible size;
Nested collections of hypotheses.
ML succeeds when these interrelate
Develop algorithms for finding a hypothesis h that fits the data
h will likely perform well when the richness of H is less than the
information in the training set
Terminology
Training example: A pair of the form (x, f(x))
Target function (concept): The true function f (?)
Hypothesis: A proposed function h, believed to be similar to f.
Concept: Boolean function. Examples for which f(x) = 1 are positive examples; those for which f(x) = 0 are negative examples (instances). (Sometimes used interchangeably w/ “Hypothesis”)
Classifier: A discrete-valued function. The possible values of f, {1, 2, …, K}, are the classes or class labels.
Hypothesis space: The space of all hypotheses that can, in principle, be output by the learning algorithm.
Version Space: The space of all hypotheses in the hypothesis space that have not yet been ruled out.
Key Issues in Machine Learning
Modeling
How to formulate application problems as machine learning
problems ?
Learning Protocols (where is the data coming from, how?)
Project examples:
[complete products]
EMAIL
Given a seminar announcement, place the relevant information in my
outlook
Given a message, place it in the appropriate folder
Image processing:
Given a folder with pictures; automatically rotate all those that need it.
My office:
have my office greet me in the morning and unlock the door (but do it
only for me!)
Context Sensitive Spelling:
Incorporate into Word
Key Issues in Machine Learning
Modeling
How to formulate application problems as machine learning
problems ?
Learning Protocols (where is the data coming from, how?)
Representation:
What are good hypothesis spaces ?
Any rigorous way to find these? Any general approach?
Algorithms:
What are good algorithms?
How do we define success?
Generalization vs. overfitting
The computational problem
Example: Generalization vs Overfitting
What is a Tree?
A botanist: A tree is something with leaves I’ve seen before
Her brother: A tree is a green thing
Neither will generalize well
Self-organize into Groups of 4 or 5
Assignment 1: The Badges Game
……
Prediction or Modeling?
Representation
Background Knowledge
When did learning take place?
Learning Protocol?
What is the problem?
Algorithms
Linear Discriminators
I don’t know {whether, weather} to laugh or cry
How can we make this a learning problem?
We will look for a function F: Sentences → {whether, weather}
We need to define the domain of this function better.
An option: For each word w in English define a Boolean feature $x_w$: [$x_w = 1$] iff w is in the sentence
This maps a sentence to a point in $\{0,1\}^{50,000}$
In this space: some points are whether points, some are weather points
Learning Protocol? Supervised? Unsupervised?
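The word-presence encoding described above can be sketched in a few lines of Python. This is an illustration only: the five-word vocabulary and the function name are hypothetical, and the target word is replaced by a blank so that it does not leak into the features.

```python
def sentence_to_features(sentence, vocabulary):
    """Map a sentence to a 0/1 vector with one coordinate per vocabulary word."""
    present = set(sentence.lower().split())
    return [1 if word in present else 0 for word in vocabulary]


# Hypothetical five-word vocabulary; the slide imagines roughly 50,000 words.
VOCAB = ["cry", "know", "laugh", "rain", "sunny"]

x = sentence_to_features("I don't know ___ to laugh or cry", VOCAB)
print(x)  # [1, 1, 1, 0, 0]
```

With a full English vocabulary the same function would produce the point in $\{0,1\}^{50,000}$ that the slide refers to.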
What’s Good?
Learning problem: Find a function that best separates the data
What function?
What’s best?
How to find it?
A possibility: Define the learning problem to be:
Find a (linear) function that best separates the data
Exclusive-OR (XOR)
$(x_1 \wedge x_2) \vee (\neg x_1 \wedge \neg x_2)$
In general: a parity function.
$x_i \in \{0,1\}$
$f(x_1, x_2, \ldots, x_n) = 1$ iff $\sum_i x_i$ is even
This function is not linearly separable.
[Figure: the four input points plotted on axes $x_1$ and $x_2$]
Sometimes Functions Can be Made Linear
$x_1 x_2 x_4 \vee x_2 x_4 x_5 \vee x_1 x_3 x_7$
Space: $X = x_1, x_2, \ldots, x_n$
Input transformation
New Space: $Y = \{y_1, y_2, \ldots\} = \{x_i,\ x_i x_j,\ x_i x_j x_k\}$
$y_3 \vee y_4 \vee y_7$
New discriminator is functionally simpler
[Figure: Weather and Whether points before and after the transformation]
Feature Space
Data are not separable in one dimension
Not separable if you insist on using a specific class of functions
[Figure: data points along the x axis]
Blown Up Feature Space
Data are separable in $\langle x, x^2 \rangle$ space
[Figure: the same data plotted on axes $x$ and $x^2$]
Key issue: what features to use.
Computationally, can be done implicitly (kernels)
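A small numeric illustration of the blown-up feature space, as a Python sketch. The seven data points, the weights, and the threshold are made up for this example; they are not taken from the slides. No single cut point on the line separates the labels, but a linear threshold on $(x, x^2)$ does.

```python
def ltu(w, theta, features):
    """Linear threshold unit: 1 if w . features - theta >= 0, else 0."""
    return 1 if sum(wi * fi for wi, fi in zip(w, features)) - theta >= 0 else 0


def separable_by_threshold(data):
    """True if some single cut point on the line separates the two labels."""
    xs = sorted(x for x, _ in data)
    cuts = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    for cut in cuts:
        for sign in (1, -1):
            if all((1 if sign * (x - cut) >= 0 else 0) == y for x, y in data):
                return True
    return False


# Hypothetical 1-D data: positives on the outside, negatives in the middle.
DATA = [(-3, 1), (-2, 1), (-1, 0), (0, 0), (1, 0), (2, 1), (3, 1)]

print(separable_by_threshold(DATA))  # False

# In <x, x^2> space the rule "x^2 >= 2.5" is a linear threshold unit.
lifted_ok = all(ltu((0, 1), 2.5, (x, x * x)) == y for x, y in DATA)
print(lifted_ok)  # True
```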
A General Framework for Learning
Goal: predict an unobserved output value $y \in Y$ based on an observed input vector $x \in X$
Estimate a functional relationship $y \sim f(x)$ from a set $\{(x,y)_i\}_{i=1,n}$
Most relevant is Classification: $y \in \{0,1\}$ (or $y \in \{1,2,\ldots,k\}$)
(But, within the same framework can also talk about Regression, $y \in \mathbb{R}$)
What do we want f(x) to satisfy?
We want to minimize the Loss (Risk): $L(f()) = E_{X,Y}([f(x) \neq y])$
Where: $E_{X,Y}$ denotes the expectation with respect to the true distribution.
Simply: # of mistakes
[…] is an indicator function
A General Framework for Learning (II)
We want to minimize the Loss: $L(f()) = E_{X,Y}([f(X) \neq Y])$
Where: $E_{X,Y}$ denotes the expectation with respect to the true distribution.
We cannot do that. Why not?
Instead, we try to minimize the empirical classification error.
For a set of training examples $\{(X_i, Y_i)\}_{i=1,n}$
Try to minimize the observed loss
  (Issue I: when is this good enough? Not now)
This minimization problem is typically NP hard.
To alleviate this computational problem, minimize a new function: a convex upper bound of the classification error function
$I(f(x), y) = [f(x) \neq y] = \{1 \text{ when } f(x) \neq y;\ 0 \text{ otherwise}\}$
Learning as an Optimization Problem
A Loss Function $L(f(x), y)$ measures the penalty incurred by a classifier $f$ on example $(x, y)$.
There are many different loss functions one could define:
  Misclassification Error: $L(f(x), y) = 0$ if $f(x) = y$; 1 otherwise
  Squared Loss: $L(f(x), y) = (f(x) - y)^2$
  Input dependent loss: $L(f(x), y) = 0$ if $f(x) = y$; $c(x)$ otherwise.
A continuous convex loss function also allows a conceptually simple optimization algorithm.
[Figure: loss plotted against $f(x) - y$]
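The loss functions listed above, and the empirical (observed) loss over a training set, might be written as follows. This is a sketch under stated assumptions: the hypothesis `h`, which predicts y = x4, and the averaging convention in `empirical_risk` are chosen here for illustration.

```python
def zero_one_loss(prediction, y):
    """Misclassification error: 0 if the prediction is right, 1 otherwise."""
    return 0 if prediction == y else 1


def squared_loss(prediction, y):
    """Squared loss (f(x) - y)^2."""
    return (prediction - y) ** 2


def empirical_risk(f, examples, loss):
    """Average loss of f over a list of (x, y) training examples."""
    return sum(loss(f(x), y) for x, y in examples) / len(examples)


# Training set from the earlier slides: ((x1, x2, x3, x4), y)
TRAIN = [
    ((0, 0, 1, 0), 0),
    ((0, 1, 0, 0), 0),
    ((0, 0, 1, 1), 1),
    ((1, 0, 0, 1), 1),
    ((0, 1, 1, 0), 0),
    ((1, 1, 0, 0), 0),
    ((0, 1, 0, 1), 0),
]


def h(x):
    """Illustrative hypothesis: predict y = x4."""
    return x[3]


risk = empirical_risk(h, TRAIN, zero_one_loss)
print(risk)  # 1/7, about 0.143: only example 0101 is misclassified
```

For 0/1 predictions and labels the squared loss gives the same value, since each mistake contributes exactly 1.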
How to Learn?
Local search:
  Start with a linear threshold function.
  See how well you are doing.
  Correct
  Repeat until you converge.
There are other ways that do not search directly in the hypothesis space
  Directly compute the hypothesis?
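Learning Linear Separators (LTU)
$f(x) = \mathrm{sgn}\{x \cdot w - \theta\} = \mathrm{sgn}\{\sum_{i=1}^{n} w_i x_i - \theta\}$
$x = (x_1, x_2, \ldots, x_n) \in \{0,1\}^n$ is the feature based encoding of the data point
$w = (w_1, w_2, \ldots, w_n) \in \mathbb{R}^n$ is the target function.
$\theta$ determines the shift with respect to the origin
[Figure: a separating hyperplane with its weight vector $w$]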
Expressivity
$f(x) = \mathrm{sgn}\{x \cdot w - \theta\} = \mathrm{sgn}\{\sum_{i=1}^{n} w_i x_i - \theta\}$
Many functions are Linear
  Conjunctions:
    $y = x_1 \wedge x_3 \wedge x_5$
    $y = \mathrm{sgn}\{1 \cdot x_1 + 1 \cdot x_3 + 1 \cdot x_5 - 3\}$
  At least m of n:
    $y$ = at least 2 of $\{x_1, x_3, x_5\}$
    $y = \mathrm{sgn}\{1 \cdot x_1 + 1 \cdot x_3 + 1 \cdot x_5 - 2\}$
Many functions are not
  Xor: $y = x_1 \wedge x_2 \vee \neg x_1 \wedge \neg x_2$
  Non trivial DNF: $y = x_1 \wedge x_2 \vee x_3 \wedge x_4$
But some can be made linear
Probabilistic Classifiers as well
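The two linear examples can be checked against their truth tables, and a small grid search illustrates (but does not prove) that the two-bit parity function has no linear threshold representation. The Python sketch below assumes the convention that sgn{z} outputs 1 when z ≥ 0 and 0 otherwise; the grid of integer weights is an arbitrary choice.

```python
from itertools import product


def ltu(w, theta, x):
    """Linear threshold unit: 1 if w . x - theta >= 0, else 0 (assumed convention)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) - theta >= 0 else 0


def agrees(w, theta, target, n):
    """True if the LTU matches target on every point of {0,1}^n."""
    return all(ltu(w, theta, x) == target(x) for x in product([0, 1], repeat=n))


# x1, x3, x5 are indices 0, 2, 4; x2 and x4 get weight 0.
W = (1, 0, 1, 0, 1)


def conj(x):
    return int(x[0] and x[2] and x[4])


def two_of(x):
    return int(x[0] + x[2] + x[4] >= 2)


print(agrees(W, 3, conj, 5))    # True
print(agrees(W, 2, two_of, 5))  # True


def parity(x):
    """1 iff x1 + x2 is even, matching the formula on the XOR slide."""
    return int((x[0] + x[1]) % 2 == 0)


grid = range(-3, 4)
found = any(agrees((w1, w2), t, parity, 2) for w1, w2, t in product(grid, repeat=3))
print(found)  # False: no LTU on this grid computes parity
```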
Canonical Representation
$f(x) = \mathrm{sgn}\{x \cdot w - \theta\} = \mathrm{sgn}\{\sum_{i=1}^{n} w_i x_i - \theta\}$
$\mathrm{sgn}\{x \cdot w - \theta\} \equiv \mathrm{sgn}\{x' \cdot w'\}$
Where: $x' = (x, -\theta)$ and $w' = (w, 1)$
Moved from an $n$ dimensional representation to an $(n+1)$ dimensional representation, but now can look for hyperplanes that go through the origin.
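The equivalence between the thresholded form and the form through the origin can be confirmed on a small case. In this Python sketch the weights `(1, 0, 1)` and threshold `2` are arbitrary illustrative values, and sgn is again assumed to output 1 for non-negative arguments.

```python
from itertools import product


def ltu(w, theta, x):
    """Thresholded unit: 1 if w . x - theta >= 0, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) - theta >= 0 else 0


def to_canonical(w, theta, x):
    """Return (w', x') with w' = (w, 1) and x' = (x, -theta)."""
    return tuple(w) + (1,), tuple(x) + (-theta,)


def ltu_through_origin(w_aug, x_aug):
    """Unit with no separate threshold: 1 if w' . x' >= 0, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w_aug, x_aug)) >= 0 else 0


w, theta = (1, 0, 1), 2  # illustrative weights and threshold
same = all(
    ltu(w, theta, x) == ltu_through_origin(*to_canonical(w, theta, x))
    for x in product([0, 1], repeat=3)
)
print(same)  # True
```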
LMS: An online, local search algorithm
A local search learning algorithm requires:
  Hypothesis Space: Linear Threshold Units
  Loss function: Squared loss; LMS (Least Mean Square, $L_2$)
  Search procedure: Gradient Descent
LMS: An online, local search algorithm
(i (subscript): vector component; j (superscript): time; d: example #)
Assumption: $x \in R^n$; $u \in R^n$ is the target weight vector; the target (label) is $t_d = u \cdot x$. Noise has been added; so, possibly, no weight vector is consistent with the data.
• Let $w^{(j)}$ be our current weight vector
• Our prediction on the d-th example $x$ is therefore: $o_d = \sum_i w_i^{(j)} x_i = w^{(j)} \cdot x$
• Let $t_d$ be the target value for this example (real value; represents $u \cdot x$)
• A convenient error function of the data set is: $Err(w^{(j)}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$
Gradient Descent
We use gradient descent to determine the weight vector that minimizes $Err(w)$;
Fixing the set D of examples, E is a function of $w^{(j)}$
At each step, the weight vector is modified in the direction that produces the steepest descent along the error surface.
[Figure: error surface E(w) plotted over w, with successive weight vectors w1, w2, w3, w4 marked]
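Gradient Descent
• To find the best direction in the weight space we compute the gradient of $E$ with respect to each of the components of $w$: $\nabla E(w) = \left[\frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \ldots, \frac{\partial E}{\partial w_n}\right]$
• This vector specifies the direction that produces the steepest increase in E;
• We want to modify $w$ in the direction of $-\nabla E(w)$: $w = w + \Delta w$
• Where: $\Delta w = -R \, \nabla E(w)$, with $R$ the learning rate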
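Gradient Descent: LMS
• We have: $Err(w^{(j)}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$, with $o_d = w^{(j)} \cdot x_d$
• Therefore: $\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 = \sum_{d \in D} (t_d - o_d)(-x_{id})$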
Gradient Descent: LMS
• Weight update rule: $\Delta w_i = R \sum_{d \in D} (t_d - o_d) x_{id}$
• Gradient descent algorithm for training linear units:
  Start with an initial random weight vector
  For every example d with target value $t_d$:
    Evaluate the linear unit $o_d = \sum_i w_i x_{id}$
  Update $w$ by adding $\Delta w_i$ to each component
  Continue until E below some threshold
Because the surface contains only a single global minimum the algorithm will converge to a weight vector with minimum error, regardless of whether the examples are linearly separable
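The batch procedure above might look like this in Python. It is a minimal sketch under assumptions: the weights start at zero rather than at random so the run is reproducible, the four data points are generated from a hypothetical target vector u = (2, -1) with no noise, and the learning rate 0.05 was chosen by hand to be small enough for this data.

```python
def predict(w, x):
    """Output of the linear unit: o = w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def error(w, data):
    """Err(w) = 1/2 * sum over examples of (t_d - o_d)^2."""
    return 0.5 * sum((t - predict(w, x)) ** 2 for x, t in data)


def batch_lms(data, rate=0.05, max_epochs=2000, tol=1e-10):
    """Batch gradient descent: sum the update over all examples, then apply it."""
    n = len(data[0][0])
    w = [0.0] * n  # assumption: zero start instead of a random start
    for _ in range(max_epochs):
        if error(w, data) < tol:
            break
        delta = [0.0] * n
        for x, t in data:
            o = predict(w, x)
            for i in range(n):
                delta[i] += rate * (t - o) * x[i]  # R * (t_d - o_d) * x_id
        w = [wi + di for wi, di in zip(w, delta)]
    return w


# Hypothetical noise-free data with t = u . x for u = (2, -1).
DATA = [((1, 0), 2), ((0, 1), -1), ((1, 1), 1), ((2, 1), 3)]

w = batch_lms(DATA)
print([round(v, 3) for v in w])  # close to [2.0, -1.0]
```

A rate that is too large makes the iteration diverge, so the value used here should be read as specific to this small data set.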
Incremental Gradient Descent: LMS
• Weight update rule: $\Delta w_i = R \, (t_d - o_d) \, x_{id}$
• Gradient descent algorithm for training linear units:
  Start with an initial random weight vector
  For every example d with target value $t_d$:
    Evaluate the linear unit $o_d = \sum_i w_i x_{id}$
    Update $w$ by incrementally adding $\Delta w_i$ to each component
  Continue until E below some threshold
In general, does not converge to global minimum
Decreasing R with time guarantees convergence
Incremental algorithms are sometimes advantageous …
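The incremental variant updates the weights after every example and lets R shrink over time. The Python sketch below is one possible reading: the schedule R = rate0 / (1 + decay * step), the zero initial weights, and the noise-free data from a hypothetical target u = (2, -1) are all assumptions made for illustration.

```python
def predict(w, x):
    """Output of the linear unit: o = w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def incremental_lms(data, rate0=0.1, decay=0.01, epochs=500):
    """Incremental LMS: apply R * (t_d - o_d) * x_id after each example."""
    n = len(data[0][0])
    w = [0.0] * n  # assumption: zero start instead of a random start
    step = 0
    for _ in range(epochs):
        for x, t in data:
            rate = rate0 / (1.0 + decay * step)  # R decreases with time
            o = predict(w, x)
            w = [wi + rate * (t - o) * xi for wi, xi in zip(w, x)]
            step += 1
    return w


# Hypothetical noise-free data with t = u . x for u = (2, -1).
DATA = [((1, 0), 2), ((0, 1), -1), ((1, 1), 1), ((2, 1), 3)]

w = incremental_lms(DATA)
print(w)  # approaches the target vector (2, -1)
```

Only one example is needed in memory at a time, which is one reason incremental algorithms can be attractive.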
Learning Rates and Convergence
• In the general (non-separable) case the learning rate R must decrease to zero to guarantee convergence. It cannot decrease too quickly nor too slowly.
• The learning rate is called the step size. There are more sophisticated algorithms (Conjugate Gradient) that choose the step size automatically and converge faster.
• There is only one “basin” for linear threshold units, so a local minimum is the global minimum. However, choosing a good starting point can make the algorithm converge much faster.
Computational Issues
Assume the data is linearly separable.
Sample complexity:
  Suppose we want to ensure that our LTU has an error rate (on new examples) of less than $\epsilon$ with high probability (at least $1 - \delta$)
  How large must m (the number of examples) be in order to achieve this? It can be shown that for $n$ dimensional problems
  $m = O\left(\frac{1}{\epsilon}\left[\ln\frac{1}{\delta} + (n+1)\ln\frac{1}{\epsilon}\right]\right)$.
Computational complexity: What can be said?
  It can be shown that there exists a polynomial time algorithm for finding a consistent LTU (by reduction to linear programming).
  (On-line algorithms have inverse quadratic dependence on the margin)
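To get a feel for how the sample complexity expression scales, the bracketed quantity can be evaluated directly. In this Python sketch the constant hidden by the O(.) is taken to be 1, which is an assumption made only for illustration, so the numbers indicate growth rates rather than actual sample sizes.

```python
import math


def sample_bound(n, epsilon, delta):
    """Evaluate (1/epsilon) * [ln(1/delta) + (n + 1) * ln(1/epsilon)].

    The constant factor hidden by O(.) is assumed to be 1 here.
    """
    return (math.log(1 / delta) + (n + 1) * math.log(1 / epsilon)) / epsilon


for n in (10, 100, 1000):
    print(n, round(sample_bound(n, epsilon=0.1, delta=0.05)))
```

The expression grows linearly in the dimension n and slightly faster than linearly in 1/ε, while the dependence on δ is only logarithmic.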
Other methods for LTUs
• Direct Computation: Set $\nabla J(w) = 0$ and solve for $w$. Can be accomplished using SVD methods.
• Fisher Linear Discriminant: A direct computation method.
• Probabilistic methods (naive Bayes): Produces a stochastic classifier that can be viewed as a linear threshold unit.
• Winnow: A multiplicative update algorithm with the property that it can handle large numbers of irrelevant attributes.
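As an illustration of the direct computation bullet, the Python sketch below takes J to be the squared error and solves the resulting normal equations $(X^t X) w = X^t t$ by Gauss-Jordan elimination. This is a swapped-in technique: the slide mentions SVD methods, which are more robust when $X^t X$ is close to singular. The data come from a hypothetical target u = (2, -1).

```python
def solve_normal_equations(data):
    """Solve (X^t X) w = X^t t, assuming X^t X is nonsingular."""
    n = len(data[0][0])
    a = [[sum(x[i] * x[j] for x, _ in data) for j in range(n)] for i in range(n)]
    b = [sum(x[i] * t for x, t in data) for i in range(n)]
    m = [row + [bi] for row, bi in zip(a, b)]  # augmented matrix [A | b]
    for col in range(n):
        # Partial pivoting: bring the largest entry of this column into place.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


# Hypothetical noise-free data with t = u . x for u = (2, -1).
DATA = [((1, 0), 2), ((0, 1), -1), ((1, 1), 1), ((2, 1), 3)]

print(solve_normal_equations(DATA))  # [2.0, -1.0]
```

Unlike gradient descent, this reaches the minimizer in one step and needs no learning rate.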
Summary of LMS algorithms for LTUs
Local search: Begins with initial weight vector. Modifies iteratively to minimize an error function. The error function is loosely related to the goal of minimizing the number of classification errors.
Memory: The classifier is constructed from the training examples. The examples can then be discarded.
Online or Batch: Both online and batch variants of the algorithms can be used.
Fisher Linear Discriminant
This is a classical method for discriminant analysis.
It is based on dimensionality reduction: finding a better representation for the data.
Notice that just finding good representations for the data may not always be good for discrimination. [E.g., O, Q]
Intuition:
  Consider projecting data from d dimensions to the line.
  Likely results in a mixed set of points and poor separation.
  However, by moving the line around we might be able to find an orientation for which the projected samples are well separated.
Fisher Linear Discriminant
(all vectors are column vectors)
Sample $S = \{x_1, x_2, \ldots, x_n\} \in \mathbb{R}^d$
P, N are the positive, negative examples, resp.
Let $w \in \mathbb{R}^d$. And assume $\|w\| = 1$. Then:
The projection of a vector $x$ on a line in the direction $w$ is $w^t \cdot x$.
If the data is linearly separable, there exists a good direction $w$.
Finding a Good Direction
Sample mean (positive, P; Negative, N): $M_P = \frac{1}{|P|} \sum_{x_i \in P} x_i$
The mean of the projected (positive, negative) points
$m_P = \frac{1}{|P|} \sum_{x_i \in P} w^t \cdot x_i = \frac{1}{|P|} \sum_{y_i \in P} y_i = w^t \cdot M_P$
is simply the projection of the sample mean.
Therefore, the distance between the projected means is: $|m_P - m_N| = |w^t \cdot (M_P - M_N)|$
Want large difference
Finding a Good Direction (2)
Scaling $w$ isn’t the solution. We want the difference to be large relative to some measure of standard deviation for each class.
$s_P^2 = \sum_{y \in P} (y - m_P)^2 \qquad s_N^2 = \sum_{y \in N} (y - m_N)^2$
$1 / (s_P^2 + s_N^2)$: within class scatter: estimates the variances of the sample.
The Fisher linear discriminant employs the linear function $w^t \cdot x$ for which
$J(w) = \frac{|m_P - m_N|^2}{s_P^2 + s_N^2}$
is maximized.
How to make this a classifier?
How to find the optimal w?
Some Algebra
J as an explicit function of w (1)
Compute the scatter matrices:
$S_P = \sum_{x \in P} (x - M_P)(x - M_P)^t \qquad S_N = \sum_{x \in N} (x - M_N)(x - M_N)^t$
and $S_W = S_P + S_N$
We can write:
$s_P^2 = \sum_{y \in P} (y - m_P)^2 = \sum_{x \in P} (w^t x - w^t M_P)^2 = \sum_{x \in P} w^t (x - M_P)(x - M_P)^t w = w^t S_P w$
Therefore: $s_P^2 + s_N^2 = w^t S_W w$
$S_W$ is the within-class scatter matrix. It is proportional to the sample covariance matrix for the d-dimensional sample.
J as an explicit function of w (2)
We can do a similar computation for the means:
$S_B = (M_P - M_N)(M_P - M_N)^t$
and we can write:
$(m_P - m_N)^2 = (w^t M_P - w^t M_N)^2 = w^t (M_P - M_N)(M_P - M_N)^t w = w^t S_B w$
Therefore: $S_B$ is the between-class scatter matrix. It is the outer product of two vectors and therefore its rank is at most 1.
$S_B w$ is always in the direction of $(M_P - M_N)$
J as an explicit function of w (3)
Now we can compute explicitly:
$J(w) = \frac{|m_P - m_N|^2}{s_P^2 + s_N^2} = \frac{w^t S_B w}{w^t S_W w}$
We are looking for the value of $w$ that maximizes this expression.
This is a generalized eigenvalue problem; when $S_W$ is nonsingular, it is just an eigenvalue problem. The solution can be written without solving the problem, as:
$w = S_W^{-1} (M_P - M_N)$
This is the Fisher Linear Discriminant.
1: We converted a d-dimensional problem to a 1-dimensional problem and suggested a solution that makes some sense.
2: We have a solution that makes sense; how to make it a classifier? And, how good is it?
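The closed-form direction $w = S_W^{-1}(M_P - M_N)$ can be computed by hand for two-dimensional data. The Python sketch below uses a made-up sample of four positive and four negative points and the explicit inverse of a 2x2 matrix; the function names are illustrative and $S_W$ is assumed to be nonsingular.

```python
def mean(points):
    """Component-wise sample mean of a list of equal-length tuples."""
    d = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(d)]


def scatter(points, m):
    """Scatter matrix: sum over points of (x - m)(x - m)^t."""
    d = len(m)
    return [[sum((p[i] - m[i]) * (p[j] - m[j]) for p in points)
             for j in range(d)] for i in range(d)]


def fisher_direction_2d(pos, neg):
    """Return w = S_W^{-1} (M_P - M_N) for 2-D data (S_W assumed nonsingular)."""
    m_p, m_n = mean(pos), mean(neg)
    s_p, s_n = scatter(pos, m_p), scatter(neg, m_n)
    s_w = [[s_p[i][j] + s_n[i][j] for j in range(2)] for i in range(2)]
    det = s_w[0][0] * s_w[1][1] - s_w[0][1] * s_w[1][0]
    diff = [m_p[0] - m_n[0], m_p[1] - m_n[1]]
    # Closed-form inverse of a 2x2 matrix applied to diff.
    return [(s_w[1][1] * diff[0] - s_w[0][1] * diff[1]) / det,
            (-s_w[1][0] * diff[0] + s_w[0][0] * diff[1]) / det]


# Hypothetical sample: the classes overlap along x1 but not along x2.
POS = [(2, 2), (3, 3), (3, 2), (4, 3)]
NEG = [(0, 0), (1, 1), (1, 0), (2, 1)]

w = fisher_direction_2d(POS, NEG)


def project(x):
    """Projection w^t . x of a point onto the Fisher direction."""
    return w[0] * x[0] + w[1] * x[1]


print(w)                                                # [0.0, 1.0]
print(min(map(project, POS)) > max(map(project, NEG)))  # True
```

Turning the projection into a classifier still requires choosing a threshold between the projected classes, which is the open question raised on the slide.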
Fisher Linear Discriminant: Summary
It turns out that both problems can be solved if we make
assumptions. E.g., if the data consists of two classes of points,
generated according to a normal distribution, with the same
covariance. Then:
The solution is optimal.
Classification can be done by choosing a threshold, which can be
computed.
Is this satisfactory?
Introduction: Summary
We introduced the technical part of the class by giving two examples of (very different) approaches to linear discrimination.
There are many other solutions.
Question 1: But this assumes that we are linear. Can we learn a function that is more flexible in terms of what it does with the feature space?
Question 2: Can we say something about the quality of what we learn (sample complexity, time complexity; quality)