INTRODUCTION
CS446 Fall ’12

CS 446: Machine Learning
Dan Roth
University of Illinois, Urbana-Champaign
danr@illinois.edu
http://L2R.cs.uiuc.edu/~danr
3322 SC
Announcements
Class Registration: Still closed; stay tuned.
My office hours: Tuesday, Thursday 10:45-11:30
E-mail; Piazza; Follow the Web site
Homework: No need to submit Hw0; later: submit electronically.
Announcements
Sections:
RM 3405 - Monday at 5:00 [A-F] (not on Sept. 3rd)
RM 3401 - Tuesday at 5:00 [G-L]
RM 3405 - Wednesday at 5:30 [M-S]
RM 3403 - Thursday at 5:00 [T-Z]
Next week, in class: Hands-on classification. Follow the web site; install Java; bring your laptop if possible.
Course Overview
Introduction: Basic problems and questions
A detailed example: Linear threshold units
Hands-on classification
Two Basic Paradigms: PAC (Risk Minimization); Bayesian Theory
Learning Protocols: Online/Batch; Supervised/Unsupervised/Semi-supervised
Algorithms:
  Decision Trees (C4.5)
  [Rules and ILP (Ripper, Foil)]
  Linear Threshold Units (Winnow, Perceptron; Boosting; SVMs; Kernels)
  Probabilistic Representations (naïve Bayes, Bayesian trees; density estimation)
  Unsupervised/Semi-supervised: EM
  Clustering, Dimensionality Reduction
What is Learning?
The Badges Game ...
This is an example of the key learning protocol: supervised learning.
Prediction or Modeling?
Representation
Problem setting
Background Knowledge
When did learning take place?
Algorithm
Are you sure you got it right?
Supervised Learning
Given: examples (x, f(x)) of some unknown function f
Find: a good approximation of f

x provides some representation of the input
  The process of mapping a domain element into a representation is called Feature Extraction. (Hard; ill-understood; important.)
  x ∈ {0,1}^n or x ∈ ℝ^n
The target function (label):
  f(x) ∈ {-1,+1}: Binary Classification
  f(x) ∈ {1,2,3,...,k-1}: Multi-class Classification
  f(x) ∈ ℝ: Regression
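A minimal sketch of this setting in code (the data and the hypothesis h are made up for illustration): each training example pairs a feature vector x ∈ {0,1}^n with a binary label f(x) ∈ {-1,+1}, and the learner must propose an approximation h of f.

```python
# Illustrative sketch of the supervised-learning setting: examples (x, f(x)) with
# x in {0,1}^n and a binary label in {-1,+1}; h is one candidate approximation of f.
from typing import List, Tuple

Example = Tuple[Tuple[int, ...], int]          # (x, f(x))

training_data: List[Example] = [               # toy data, invented for illustration
    ((0, 0, 1, 1), +1),
    ((1, 0, 0, 1), +1),
    ((0, 1, 1, 0), -1),
]

def h(x: Tuple[int, ...]) -> int:
    """A candidate hypothesis: predict from the last feature only."""
    return +1 if x[3] == 1 else -1

mistakes = sum(1 for x, y in training_data if h(x) != y)
print(f"training mistakes: {mistakes}/{len(training_data)}")
```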
Supervised Learning: Examples
Disease diagnosis
  x: Properties of patient (symptoms, lab tests)
  f: Disease (or maybe: recommended therapy)
Part-of-Speech tagging
  x: An English sentence (e.g., "The can will rust")
  f: The part of speech of a word in the sentence
Face recognition
  x: Bitmap picture of person's face
  f: Name of the person (or maybe: a property of the person)
Automatic Steering
  x: Bitmap picture of road surface in front of car
  f: Degrees to turn the steering wheel
Many problems that do not seem like classification problems can be decomposed into classification problems, e.g., Semantic Role Labeling.
A Learning Problem
y = f(x1, x2, x3, x4)    (unknown function)

Example  x1 x2 x3 x4  y
   1      0  0  1  0  0
   2      0  1  0  0  0
   3      0  0  1  1  1
   4      1  0  0  1  1
   5      0  1  1  0  0
   6      1  1  0  0  0
   7      0  1  0  1  0

Can you learn this function? What is it?
Hypothesis Space
Complete Ignorance: There are 2^16 = 65536 possible functions over four input features. We can't figure out which one is correct until we've seen every possible input-output pair. After seven examples we still have 2^9 possibilities for f.
Is Learning Possible?

x1 x2 x3 x4  y
 1  1  1  1  ?
 0  0  0  0  ?
 1  0  0  0  ?
 1  0  1  1  ?
 1  1  0  0  0
 1  1  0  1  ?
 1  0  1  0  ?
 1  0  0  1  1
 0  1  0  0  0
 0  1  0  1  0
 0  1  1  0  0
 0  1  1  1  ?
 0  0  1  1  1
 0  0  1  0  0
 0  0  0  1  ?
 1  1  1  0  ?
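A small brute-force sketch of the counting argument on this slide: enumerating all 2^16 Boolean functions on four inputs and keeping those that agree with the seven training examples leaves exactly 2^9 = 512 candidates.

```python
# Brute-force check (sketch): of the 2^16 Boolean functions on four bits,
# exactly 2^9 agree with the seven training examples.
from itertools import product

examples = {          # x -> y, from the table above
    (0, 0, 1, 0): 0, (0, 1, 0, 0): 0, (0, 0, 1, 1): 1, (1, 0, 0, 1): 1,
    (0, 1, 1, 0): 0, (1, 1, 0, 0): 0, (0, 1, 0, 1): 0,
}
inputs = list(product([0, 1], repeat=4))          # all 16 possible inputs

consistent = 0
for outputs in product([0, 1], repeat=16):        # each choice of outputs = one Boolean function
    f = dict(zip(inputs, outputs))
    if all(f[x] == y for x, y in examples.items()):
        consistent += 1
print(consistent)   # 512 = 2^9: the nine unseen inputs can be labeled arbitrarily
```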
Hypothesis Space (2)
Simple Rules: There are only 16 simple conjunctive rules of the form y = xi ∧ xj ∧ xk.
No simple rule explains the data. The same is true for simple clauses.

[Table: the 16 conjunctive rules (y=c, x1, x2, x3, x4, x1 ∧ x2, ..., x1 ∧ x2 ∧ x3 ∧ x4), each listed with a counterexample from the seven training examples.]
Hypothesis Space (3)
m-of-n rules: There are 20 possible rules of the form "y = 1 if and only if at least m of the following n variables are 1".
Found a consistent hypothesis.

[Table: the candidate variable subsets ({x1}, {x2}, ..., {x1, x2, x3, x4}) with, for each of the 1-of, 2-of, 3-of, 4-of rules, either a counterexample from the training data or a mark that the rule is consistent.]
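A brute-force sketch of the search this slide describes: enumerate "at least m of the variables in S are 1" rules over subsets S of {x1, x2, x3, x4} and print the ones consistent with the seven training examples.

```python
# Sketch: enumerate "at least m of these variables are 1" rules and report the ones
# consistent with the seven training examples from the table above.
from itertools import combinations

examples = [            # (x1, x2, x3, x4), y
    ((0, 0, 1, 0), 0), ((0, 1, 0, 0), 0), ((0, 0, 1, 1), 1), ((1, 0, 0, 1), 1),
    ((0, 1, 1, 0), 0), ((1, 1, 0, 0), 0), ((0, 1, 0, 1), 0),
]

for n in range(1, 5):
    for subset in combinations(range(4), n):            # which variables the rule looks at
        for m in range(1, n + 1):                       # "at least m of them are 1"
            rule = lambda x: int(sum(x[i] for i in subset) >= m)
            if all(rule(x) == y for x, y in examples):
                names = ", ".join(f"x{i + 1}" for i in subset)
                print(f"consistent: at least {m} of {{{names}}}")
```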
Views of Learning
Learning is the removal of our remaining uncertainty:
  Suppose we knew that the unknown function was an m-of-n Boolean function; then we could use the training data to infer which function it is.
Learning requires guessing a good, small hypothesis class:
  We can start with a very small class and enlarge it until it contains a hypothesis that fits the data.
We could be wrong!
  Our prior knowledge might be wrong: y = x4 ∧ one-of(x1, x3) is also consistent.
  Our guess of the hypothesis class could be wrong.
If this is the unknown function, then we will make errors when we are given new examples and are asked to predict the value of the function.
General strategies for Machine Learning
Develop representation languages for expressing concepts:
  Serve to limit the expressivity of the target models.
  E.g., functional representations (n-of-m); grammars; stochastic models.
Develop flexible hypothesis spaces:
  Nested collections of hypotheses; decision trees, neural networks.
  Hypothesis spaces of flexible size.
In either case:
  Develop algorithms for finding a hypothesis in our hypothesis space that fits the data,
  and hope that they will generalize well.
Terminology
Target function (concept): The true function f: X → {...Labels...}
Concept: A Boolean function. Examples for which f(x) = 1 are positive examples; those for which f(x) = 0 are negative examples (instances).
Hypothesis: A proposed function h, believed to be similar to f. The output of our learning algorithm.
Hypothesis space: The space of all hypotheses that can, in principle, be output by the learning algorithm.
Classifier: A discrete-valued function produced by the learning algorithm. The possible values of f, {1, 2, ..., K}, are the classes or class labels. (In most algorithms the classifier will actually return a real-valued function that we'll have to interpret.)
Training examples: A set of examples of the form {(x, f(x))}
Key Issues in Machine Learning
Modeling:
  How to formulate application problems as machine learning problems? How to represent the data?
  Learning protocols (where are the data and labels coming from?)
Representation:
  What are good hypothesis spaces?
  Any rigorous way to find these? Any general approach?
Algorithms:
  What are good algorithms? (The Bio example)
  How do we define success?
  Generalization vs. overfitting
  The computational problem
Example: Generalization vs. Overfitting
What is a tree?
  A botanist: a tree is something with leaves I've seen before.
  Her brother: a tree is a green thing.
Neither will generalize well.
Announcements
Class Registration: Almost all the waiting list is in.
My office hours: Tuesday 10:45-11:30; Thursday 1:00-1:45 PM
E-mail; Piazza; Follow the Web site
Homework: Hw1 will be made available today. Start today.
Key Issues in Machine Learning
Modeling:
  How to formulate application problems as machine learning problems? How to represent the data?
  Learning protocols (where are the data and labels coming from?)
Representation:
  What are good hypothesis spaces?
  Any rigorous way to find these? Any general approach?
Algorithms:
  What are good algorithms?
  How do we define success?
  Generalization vs. overfitting
  The computational problem
An Example
"I don't know {whether, weather} to laugh or cry."
How can we make this a learning problem?
We will look for a function F: Sentences → {whether, weather}
We need to define the domain of this function better.
An option: For each word w in English, define a Boolean feature x_w: [x_w = 1] iff w is in the sentence.
This maps a sentence to a point in {0,1}^50,000.
In this space: some points are "whether" points and some are "weather" points.
Learning Protocol? Supervised? Unsupervised?
This is the Modeling Step.
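A minimal sketch of this modeling step, with a hypothetical ten-word vocabulary standing in for the ~50,000-word one:

```python
# Sketch of the modeling step above: map a sentence to a point in {0,1}^|V|
# by turning on one Boolean feature x_w per vocabulary word w that occurs in it.
# The tiny vocabulary here is hypothetical; the real one would have ~50,000 words.
vocabulary = ["i", "don't", "know", "whether", "weather", "to", "laugh", "or", "cry", "rain"]

def to_features(sentence: str) -> list:
    words = set(sentence.lower().split())
    return [1 if w in words else 0 for w in vocabulary]   # x_w = 1 iff w is in the sentence

x = to_features("I don't know ____ to laugh or cry")      # the blank is the word to predict
print(x)   # a point in {0,1}^10; the label would be 'whether' or 'weather'
```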
Representation Step: What's Good?
Learning problem: Find a function that best separates the data.
What function? What's best? (How to find it?)
A possibility: Define the learning problem to be:
  Find a (linear) function that best separates the data.
Linear = linear in the feature space
x = data representation; w = the classifier
y = sgn{w^T x}
• Memorizing vs. Learning
• How well will you do?
• Doing well on what?
Expressivity
f(x) = sgn{x · w − θ} = sgn{Σ_{i=1..n} w_i x_i − θ}
Many functions are linear:
  Conjunctions:
    y = x1 ∧ x3 ∧ x5
    y = sgn{1·x1 + 1·x3 + 1·x5 − 3};  w = (1, 0, 1, 0, 1), θ = 3
  At least m of n:
    y = at least 2 of {x1, x3, x5}
    y = sgn{1·x1 + 1·x3 + 1·x5 − 2};  w = (1, 0, 1, 0, 1), θ = 2
Many functions are not:
  Xor: y = (x1 ∧ x2) ∨ (¬x1 ∧ ¬x2)
  Non-trivial DNF: y = (x1 ∧ x2) ∨ (x3 ∧ x4)
But they can be made linear.
Probabilistic classifiers as well.
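A quick sketch that checks the two linear representations claimed above over all 2^5 inputs:

```python
# Verify: with w = (1, 0, 1, 0, 1), threshold theta = 3 realizes x1 AND x3 AND x5,
# and theta = 2 realizes "at least 2 of {x1, x3, x5}" (predict 1 when w.x - theta >= 0).
from itertools import product

w = [1, 0, 1, 0, 1]

def ltu(x, theta):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) - theta >= 0 else 0

for x in product([0, 1], repeat=5):
    conj = int(x[0] and x[2] and x[4])
    at_least_2 = int(x[0] + x[2] + x[4] >= 2)
    assert ltu(x, 3) == conj and ltu(x, 2) == at_least_2
print("both Boolean functions are realized by a linear threshold unit")
```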
Exclusive-OR (XOR)
(x1 ∧ x2) ∨ (¬x1 ∧ ¬x2)
In general: a parity function.
x_i ∈ {0,1}
f(x1, x2, ..., xn) = 1 iff Σ x_i is even
This function is not linearly separable.

[Figure: the four points in the (x1, x2) plane; no line separates the two classes.]
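A short supporting argument (not on the original slide) for the non-separability claim, written for the two-variable case using the sgn{w · x − θ} form from the previous slide:

```latex
% Suppose some w_1, w_2, \theta realized f(x_1, x_2) = 1 iff x_1 + x_2 is even,
% predicting 1 exactly when w_1 x_1 + w_2 x_2 \ge \theta. The four points require:
\begin{aligned}
(0,0) \mapsto 1 &: \quad 0 \ge \theta \\
(1,1) \mapsto 1 &: \quad w_1 + w_2 \ge \theta \\
(1,0) \mapsto 0 &: \quad w_1 < \theta \\
(0,1) \mapsto 0 &: \quad w_2 < \theta
\end{aligned}
% Adding the last two inequalities gives w_1 + w_2 < 2\theta, and the first gives
% \theta \le 0, hence 2\theta \le \theta; so w_1 + w_2 < \theta, contradicting the
% second inequality. No weight vector and threshold work.
```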
Functions Can be Made Linear
Data are not separable in one dimension.
Not separable if you insist on using a specific class of functions.

[Figure: one-dimensional data along the x axis that no single threshold separates.]
Blown Up Feature Space
Data are separable in the <x, x^2> space.

Key issue: Representation: what features to use.
Computationally, this can be done implicitly (kernels), but there are warnings.

[Figure: the same data plotted in the (x, x^2) plane, where a line separates the classes.]
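A small sketch of the blow-up on this slide; the interval and the separating line are illustrative assumptions:

```python
# One-dimensional points labeled + inside an interval are not separable by a single
# threshold on x, but become linearly separable in the <x, x^2> space.
points = [-3.0, -2.0, -0.5, 0.0, 0.7, 2.5, 3.0]
label = lambda x: +1 if -1.0 <= x <= 1.0 else -1       # + in the middle, - on both sides

# In (x, x^2) space the classifier sgn(theta - x^2), i.e. w = (0, -1), separates the classes.
theta = 1.0
for x in points:
    predicted = +1 if theta - x * x >= 0 else -1
    assert predicted == label(x)
print("separable with a linear function of (x, x^2)")
```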
Functions Can be Made Linear
A real Weather/Whether example.
Original space: X = x1, x2, ..., xn; the target is x1 x2 x4 ∨ x2 x4 x5 ∨ x1 x3 x7.
Input transformation. New space: Y = {y1, y2, ...} = {x_i, x_i x_j, x_i x_j x_k}.
In the new space the discriminator is y3 ∨ y4 ∨ y7: functionally simpler.
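A sketch of this transformation in code, checking over all inputs that the DNF above becomes a linear threshold function once one feature per size-3 monomial is added (the 0-indexed positions of the three terms are spelled out explicitly):

```python
# Add one feature per size-3 monomial; the DNF x1x2x4 OR x2x4x5 OR x1x3x7 then becomes
# a disjunction of three of the new features, which a linear threshold unit expresses.
from itertools import combinations, product

n = 7
monomials = list(combinations(range(n), 3))                 # the y-features: all x_i x_j x_k
target_terms = [(0, 1, 3), (1, 3, 4), (0, 2, 6)]            # x1x2x4, x2x4x5, x1x3x7 (0-indexed)

def transform(x):
    return [int(all(x[i] for i in m)) for m in monomials]   # y_m = x_i * x_j * x_k

w = [1 if m in target_terms else 0 for m in monomials]      # weight 1 on the three terms

for x in product([0, 1], repeat=n):
    dnf = int(any(all(x[i] for i in t) for t in target_terms))
    linear = int(sum(wi * yi for wi, yi in zip(w, transform(x))) >= 1)
    assert dnf == linear
print("the DNF is linear over the monomial features")
```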
Third Step: How to Learn?
A possibility: Local search
  Start with a linear threshold function.
  See how well you are doing.
  Correct.
  Repeat until you converge.
There are other ways that do not search directly in the hypothesis space but directly compute the hypothesis.
A General Framework for Learning
Goal: predict an unobserved output value y ∈ Y based on an observed input vector x ∈ X.
Estimate a functional relationship y ~ f(x) from a set {(x, y)_i}, i = 1, ..., n.
Most relevant - Classification: y ∈ {0,1} (or y ∈ {1,2,...,k}).
(But within the same framework we can also talk about Regression, y ∈ ℝ.)
What do we want f(x) to satisfy?
  We want to minimize the Loss (Risk): L(f()) = E_{X,Y}([f(x) ≠ y]),
  where E_{X,Y} denotes the expectation with respect to the true distribution.
Simply: the number of mistakes. [...] is an indicator function.
A General Framework for Learning (II)
We want to minimize the Loss: L(f()) = E_{X,Y}([f(X) ≠ Y]),
where E_{X,Y} denotes the expectation with respect to the true distribution.
We cannot do that. Instead, we try to minimize the empirical classification error.
For a set of training examples {(X_i, Y_i)}, i = 1, ..., n, try to minimize: L'(f()) = 1/n Σ_i [f(X_i) ≠ Y_i]
(Issue I: why/when is this good enough? Not now.)
This minimization problem is typically NP-hard. To alleviate this computational problem, minimize a new function: a convex upper bound of the classification error function
  I(f(x), y) = [f(x) ≠ y] = {1 when f(x) ≠ y; 0 otherwise}
Side note: If the distribution over X × Y is known, predict y = argmax_y P(y|x).
This produces the optimal Bayes error.
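A minimal sketch of the empirical error L'(f) defined above; the data and predictor are toy placeholders:

```python
# Empirical classification error: L'(f) = (1/n) * sum_i [f(X_i) != Y_i].
def empirical_error(f, data):
    """data is a list of (x, y) pairs; [..] is the 0-1 indicator."""
    return sum(1 for x, y in data if f(x) != y) / len(data)

# hypothetical usage with a toy predictor on toy data
data = [((0, 1), 1), ((1, 1), 1), ((1, 0), 0)]
f = lambda x: x[1]                       # predict the second coordinate
print(empirical_error(f, data))          # 0.0 here; minimizing this over f is the hard part
```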
Algorithmic View of Learning: an Optimization Problem
A loss function L(f(x), y) measures the penalty incurred by a classifier f on example (x, y).
There are many different loss functions one could define:
  Misclassification error: L(f(x), y) = 0 if f(x) = y; 1 otherwise
  Squared loss: L(f(x), y) = (f(x) − y)^2
  Input-dependent loss: L(f(x), y) = 0 if f(x) = y; c(x) otherwise
A continuous convex loss function allows a simpler optimization algorithm.

[Figure: a loss curve L plotted against f(x) − y.]
Loss
Here f(x) ∈ ℝ is the prediction and y ∈ {−1, 1} is the correct value.
0-1 Loss: L(y, f(x)) = ½ (1 − sgn(y f(x)))
Log Loss: L(y, f(x)) = (1/ln 2) log(1 + exp{−y f(x)})
Hinge Loss: L(y, f(x)) = max(0, 1 − y f(x))
Square Loss: L(y, f(x)) = (y − f(x))^2

[Plot: the four losses; the x axis is y f(x) for the 0-1, log, and hinge losses, and (y − f(x)) + 1 for the square loss.]
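A sketch of these losses in code, written as functions of the margin z = y f(x) (the square loss keeps its own arguments):

```python
# The losses above. The first three are functions of the margin z = y * f(x);
# the square loss is written in its own (y, f(x)) form.
import math

def zero_one(z):   return 0.5 * (1 - math.copysign(1, z))            # 1/2 (1 - sgn(y f(x)))
def log_loss(z):   return (1 / math.log(2)) * math.log(1 + math.exp(-z))
def hinge(z):      return max(0.0, 1 - z)
def square(y, fx): return (y - fx) ** 2

for z in (-2.0, -0.5, 0.5, 2.0):
    print(f"z={z:+.1f}  0-1={zero_one(z):.0f}  log={log_loss(z):.3f}  hinge={hinge(z):.2f}")
print("square loss at y=+1, f(x)=0.5:", square(+1, 0.5))
```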
Example
Putting it all together:
A Learning Algorithm
Third Step: How to Learn?
A possibility: Local search
  Start with a linear threshold function.
  See how well you are doing.
  Correct.
  Repeat until you converge.
There are other ways that do not search directly in the hypothesis space but directly compute the hypothesis.
Learning Linear Separators (LTU)
f(x) = sgn{x^T · w − θ} = sgn{Σ_{i=1..n} w_i x_i − θ}
x^T = (x1, x2, ..., xn) ∈ {0,1}^n is the feature-based encoding of the data point.
w^T = (w1, w2, ..., wn) ∈ ℝ^n is the target function.
θ determines the shift with respect to the origin.

[Figure: a separating hyperplane with weight vector w.]
Canonical Representation
f(x) = sgn{w^T · x − θ} = sgn{Σ_{i=1..n} w_i x_i − θ}
sgn{w^T · x − θ} ≡ sgn{(w')^T · x'}
where x' = (x, −1) and w' = (w, θ).
We moved from an n-dimensional representation to an (n+1)-dimensional representation, but now we can look for hyperplanes that go through the origin.
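A quick numerical sketch of this identity on random data:

```python
# Check on random data that sgn(w.x - theta) == sgn(w'.x') with x' = (x, -1) and
# w' = (w, theta), so the threshold becomes one more weight of a hyperplane through the origin.
import random

def sgn(v):
    return 1 if v >= 0 else -1

random.seed(0)
n = 5
w = [random.uniform(-1, 1) for _ in range(n)]
theta = random.uniform(-1, 1)
w_prime = w + [theta]

for _ in range(1000):
    x = [random.choice([0, 1]) for _ in range(n)]
    x_prime = x + [-1]
    lhs = sgn(sum(wi * xi for wi, xi in zip(w, x)) - theta)
    rhs = sgn(sum(wi * xi for wi, xi in zip(w_prime, x_prime)))
    assert lhs == rhs
print("the (n+1)-dimensional hyperplane through the origin agrees everywhere")
```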
LMS: An Optimization Algorithm
A local search learning algorithm requires:
  Hypothesis space: Linear Threshold Units
  Loss function: Squared loss; LMS (Least Mean Square, L2)
  Search procedure: Gradient Descent

[Figure: a separating hyperplane with weight vector w.]
A real Weather/Whether example
LMS: An Optimization Algorithm
(i (subscript): vector component; j (superscript): time; d: example #)
Let w^(j) be the current weight vector we have.
Our prediction on the d-th example x is: o_d = w^(j) · x_d
Let t_d be the target value for this example (a real value; it represents u · x).
The error the current hypothesis makes on the data set is: Err(w^(j)) = ½ Σ_{d ∈ D} (t_d − o_d)^2
Assumption: x ∈ ℝ^n; u ∈ ℝ^n is the target weight vector; the target (label) is t_d = u · x.
Noise has been added, so, possibly, no weight vector is consistent with the data.
Gradient Descent
We use gradient descent to determine the weight vector that minimizes Err(w).
Fixing the set D of examples, E is a function of w^(j).
At each step, the weight vector is modified in the direction that produces the steepest descent along the error surface.

[Figure: the error surface E(w) over w, with successive weight vectors w1, w2, w3, w4 descending toward the minimum.]
Gradient Descent
To find the best direction in the weight space we compute the gradient of E with respect to each of the components of w:
  ∇E(w) = [∂E/∂w_1, ∂E/∂w_2, ..., ∂E/∂w_n]
This vector specifies the direction that produces the steepest increase in E; we want to modify w in the direction of −∇E(w):
  Δw = −R ∇E(w), where R is the learning rate.
Gradient Descent: LMS
We have: E(w) = ½ Σ_{d ∈ D} (t_d − o_d)^2
Therefore: ∂E/∂w_i = ½ Σ_d ∂/∂w_i (t_d − w · x_d)^2 = −Σ_d (t_d − o_d) x_{i,d}
Gradient Descent: LMS
Weight update rule: Δw_i = R Σ_d (t_d − o_d) x_{i,d}
Gradient Descent: LMS
Weight update rule: Δw_i = R Σ_d (t_d − o_d) x_{i,d}
Gradient descent algorithm for training linear units:
  Start with an initial random weight vector.
  For every example d with target value t_d:
    Evaluate the linear unit: o_d = w · x_d
    Update w by adding Δw_i to each component.
  Continue until E is below some threshold.
Because the surface contains only a single global minimum, the algorithm will converge to a weight vector with minimum error, regardless of whether the examples are linearly separable. (This is true for the case of LMS for linear regression; the surface may have local minima if the loss function is different or when the regression isn't linear.)
A real Weather/Whether example
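A runnable sketch of the batch procedure above, assuming the standard LMS objective E(w) = ½ Σ_d (t_d − w · x_d)^2; the data, learning rate R, and stopping threshold are illustrative:

```python
# Batch gradient descent for LMS (sketch). Each epoch accumulates the full-batch update
# delta_w_i = R * sum_d (t_d - o_d) * x_{i,d} and applies it once.
import random

def lms_batch(data, R=0.05, threshold=1e-3, max_epochs=10_000):
    n = len(data[0][0])
    w = [random.uniform(-0.1, 0.1) for _ in range(n)]          # initial random weight vector
    for _ in range(max_epochs):
        delta, error = [0.0] * n, 0.0
        for x, t in data:
            o = sum(wi * xi for wi, xi in zip(w, x))            # evaluate the linear unit
            error += 0.5 * (t - o) ** 2
            for i in range(n):
                delta[i] += R * (t - o) * x[i]                  # sum over all examples
        w = [wi + di for wi, di in zip(w, delta)]               # one batch update per epoch
        if error < threshold:                                   # continue until E is small
            break
    return w

# illustrative data generated from an assumed target u = (2, -3), plus a little noise
random.seed(1)
u = (2.0, -3.0)
data = [((x1, x2), u[0] * x1 + u[1] * x2 + random.gauss(0, 0.01))
        for x1 in (0.0, 1.0) for x2 in (0.0, 1.0)]
print(lms_batch(data))    # approximately (2, -3)
```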
Algorithm II: Incremental (Stochastic) Gradient Descent
Weight update rule (per example): Δw_i = R (t_d − o_d) x_{i,d}
Incremental (Stochastic) Gradient Descent: LMS
Weight update rule: Δw_i = R (t_d − o_d) x_{i,d}
Gradient descent algorithm for training linear units:
  Start with an initial random weight vector.
  For every example d with target value t_d:
    Evaluate the linear unit.
    Update w by incrementally adding Δw_i to each component (update without summing over all the data).
  Continue until E is below some threshold.
Incremental (Stochastic) Gradient Descent: LMS
Weight update rule: Δw_i = R (t_d − o_d) x_{i,d}
Gradient descent algorithm for training linear units:
  Start with an initial random weight vector.
  For every example d with target value t_d:
    Evaluate the linear unit.
    Update w by incrementally adding Δw_i to each component (update without summing over all the data).
  Continue until E is below some threshold.
In general this does not converge to the global minimum.
Decreasing R with time guarantees convergence.
But on-line algorithms are sometimes advantageous...
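A sketch of the incremental variant, with a per-example update and a decreasing learning rate; the schedule and data are illustrative:

```python
# Incremental (stochastic) LMS (sketch): update after every single example,
# delta_w_i = R * (t_d - o_d) * x_{i,d}, with the learning rate R decreasing over time.
import random

def lms_incremental(data, R0=0.1, epochs=200):
    n = len(data[0][0])
    w = [random.uniform(-0.1, 0.1) for _ in range(n)]
    step = 0
    for _ in range(epochs):
        for x, t in data:
            step += 1
            R = R0 / (1 + 0.01 * step)                        # decreasing R helps convergence
            o = sum(wi * xi for wi, xi in zip(w, x))          # evaluate the linear unit
            w = [wi + R * (t - o) * xi for wi, xi in zip(w, x)]   # per-example update
    return w

random.seed(2)
u = (2.0, -3.0)                                                # illustrative target weights
data = [((x1, x2), u[0] * x1 + u[1] * x2 + random.gauss(0, 0.01))
        for x1 in (0.0, 1.0) for x2 in (0.0, 1.0)]
print(lms_incremental(data))   # roughly (2, -3)
```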
Learning Rates and Convergence
In the general (non-separable) case the learning rate R must decrease to zero to guarantee convergence.
The learning rate is called the step size. There are more sophisticated algorithms (Conjugate Gradient) that choose the step size automatically and converge faster.
There is only one "basin" for linear threshold units, so a local minimum is the global minimum. However, choosing a starting point can make the algorithm converge much faster.
Computational Issues
Assume the data is linearly separable.
Sample complexity:
  Suppose we want to ensure that our LTU has an error rate (on new examples) of less than ε with high probability (at least 1 − δ).
  How large must m (the number of examples) be in order to achieve this? It can be shown that for n-dimensional problems m = O(1/ε [ln(1/δ) + (n+1) ln(1/ε)]).
Computational complexity:
  What can be said? It can be shown that there exists a polynomial-time algorithm for finding a consistent LTU (by reduction to linear programming).
  [Contrast with the NP-hardness of 0-1 loss optimization.]
  (On-line algorithms have inverse quadratic dependence on the margin.)
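A rough plug-in sketch of the sample-complexity bound above; since the constant hidden by the O(·) is not given, c = 1 below is purely illustrative:

```python
# Rough sketch of the sample-complexity bound m = O(1/eps [ln(1/delta) + (n+1) ln(1/eps)]).
# The leading constant is unknown, so c = 1 here is an assumption for illustration only.
import math

def ltu_sample_bound(eps, delta, n, c=1.0):
    return c / eps * (math.log(1 / delta) + (n + 1) * math.log(1 / eps))

print(round(ltu_sample_bound(eps=0.1, delta=0.05, n=100)))   # on the order of a few thousand
```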
Other Methods for LTUs
Fisher Linear Discriminant: A direct computation method.
Probabilistic methods (naïve Bayes): Produce a stochastic classifier that can be viewed as a linear threshold unit.
Winnow/Perceptron: Multiplicative/additive update algorithms with some sparsity properties in the function space (a large number of irrelevant attributes) or feature space (sparse examples).
Logistic Regression, SVM... many other algorithms.