# CS 446: Machine Learning

Artificial Intelligence and Robotics

Oct 16, 2013


INTRODUCTION

CS446 Fall ’12

CS 446: Machine Learning

Dan Roth

University of Illinois, Urbana-Champaign

danr@illinois.edu

http://L2R.cs.uiuc.edu/~danr

3322 SC

Announcements

Class Registration: Still closed; stay tuned.

My office hours: Tuesday and Thursday, 10:45-11:30

E-site

Homework: No need to submit Hw0; later assignments are submitted electronically.

Announcements

Sections:

RM 3405, Monday at 5:00 [A-F] (not on Sept. 3rd)

RM 3401, Tuesday at 5:00 [G-L]

RM 3405, Wednesday at 5:30 [M-S]

RM 3403, Thursday at 5:00 [T-Z]

Next week, in class: hands-on classification; install Java.

Course Overview

Introduction: Basic problems and questions

A detailed example: Linear threshold units

Hands-on classification

PAC (Risk Minimization); Bayesian Theory

Learning Protocols

Online/Batch; Supervised/Unsupervised/Semi-supervised

Algorithms:

Decision Trees (C4.5)

[Rules and ILP (Ripper, Foil)]

Linear Threshold Units (Winnow, Perceptron; Boosting; SVMs; Kernels)

Probabilistic Representations (naïve Bayes, Bayesian trees; density estimation)

Unsupervised/Semi-supervised: EM

Clustering, Dimensionality Reduction

What is Learning

……

This is an example of the key learning protocol: supervised learning

Prediction or Modeling?

Representation

Problem setting

Background Knowledge

When did learning take place?

Algorithm

Are you sure you got it right?

Supervised Learning

Given: Examples $(x, f(x))$ of some unknown function $f$

Find: A good approximation of $f$

$x$ provides some representation of the input

The process of mapping a domain element into a representation is called Feature Extraction. (Hard; ill-understood; important)

$x \in \{0,1\}^n$ or $x \in \mathbb{R}^n$

The target function (label)

$f(x) \in \{-1,+1\}$: Binary Classification

$f(x) \in \{1,2,3,\ldots,k-1\}$: Multi-class classification

$f(x) \in \mathbb{R}$: Regression

Supervised Learning: Examples

Disease diagnosis

x: Properties of patient (symptoms, lab tests)

f: Disease (or maybe: recommended therapy)

Part-of-Speech tagging

x: An English sentence (e.g., The can will rust)

f: The part of speech of a word in the sentence

Face recognition

x: Bitmap picture of person’s face

f: Name of the person (or maybe: a property of the person)

Automatic Steering

x: Bitmap picture of road surface in front of car

f: Degrees to turn the steering wheel

Many problems that do not seem like classification problems can be decomposed into classification problems, e.g., Semantic Role Labeling.

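A Learning Problem

$y = f(x_1, x_2, x_3, x_4)$

[Figure: unknown function with inputs $x_1, x_2, x_3, x_4$ and output $y$]

| Example | x1 | x2 | x3 | x4 | y |
|---|---|---|---|---|---|
| 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 0 | 1 | 1 | 1 |
| 4 | 1 | 0 | 0 | 1 | 1 |
| 5 | 0 | 1 | 1 | 0 | 0 |
| 6 | 1 | 1 | 0 | 0 | 0 |
| 7 | 0 | 1 | 0 | 1 | 0 |

Can you learn this function? What is it?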

Hypothesis Space

Complete Ignorance: There are $2^{16} = 65536$ possible functions over four input features.

We can’t figure out which one is correct until we’ve seen every possible input-output pair.

After seven examples we still have $2^9$ possibilities for $f$.

Is Learning Possible?

| x1 | x2 | x3 | x4 | y |
|---|---|---|---|---|
| 1 | 1 | 1 | 1 | ? |
| 0 | 0 | 0 | 0 | ? |
| 1 | 0 | 0 | 0 | ? |
| 1 | 0 | 1 | 1 | ? |
| 1 | 1 | 0 | 0 | 0 |
| 1 | 1 | 0 | 1 | ? |
| 1 | 0 | 1 | 0 | ? |
| 1 | 0 | 0 | 1 | 1 |
| 0 | 1 | 0 | 0 | 0 |
| 0 | 1 | 0 | 1 | 0 |
| 0 | 1 | 1 | 0 | 0 |
| 0 | 1 | 1 | 1 | ? |
| 0 | 0 | 1 | 1 | 1 |
| 0 | 0 | 1 | 0 | 0 |
| 0 | 0 | 0 | 1 | ? |
| 1 | 1 | 1 | 0 | ? |
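The counting argument can be checked by brute force. The following is a minimal sketch in Python (a language chosen for illustration only), with made-up names `TRAIN`, `INPUTS` and `count_consistent`: it enumerates every truth table over four Boolean inputs and counts those that agree with the seven labeled examples.

```python
from itertools import product

# The seven labeled examples from the slides: (x1, x2, x3, x4) -> y
TRAIN = {
    (0, 0, 1, 0): 0,
    (0, 1, 0, 0): 0,
    (0, 0, 1, 1): 1,
    (1, 0, 0, 1): 1,
    (0, 1, 1, 0): 0,
    (1, 1, 0, 0): 0,
    (0, 1, 0, 1): 0,
}

INPUTS = list(product([0, 1], repeat=4))  # all 16 possible inputs


def count_consistent(train):
    """Count truth tables over INPUTS that agree with every training label."""
    count = 0
    for table in product([0, 1], repeat=len(INPUTS)):
        f = dict(zip(INPUTS, table))
        if all(f[x] == y for x, y in train.items()):
            count += 1
    return count


total_functions = 2 ** len(INPUTS)
consistent = count_consistent(TRAIN)
print(total_functions, consistent)
```

Seven of the sixteen rows are pinned down by the data, so the nine remaining rows can each be 0 or 1 independently, which is where the $2^9$ comes from.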

Hypothesis Space (2)

Simple Rules: There are only 16 simple conjunctive rules of the form $y = x_i \wedge x_j \wedge x_k$

No simple rule explains the data. The same is true for simple clauses.

Training examples: the seven labeled examples from “A Learning Problem”.

| Rule | Counterexample (x1 x2 x3 x4, y) |
|---|---|
| $y = c$ | |
| $x_1$ | 1100 0 |
| $x_2$ | 0100 0 |
| $x_3$ | 0110 0 |
| $x_4$ | 0101 0 |
| $x_1 \wedge x_2$ | 1100 0 |
| $x_1 \wedge x_3$ | 0011 1 |
| $x_1 \wedge x_4$ | 0011 1 |
| $x_2 \wedge x_3$ | 0011 1 |
| $x_2 \wedge x_4$ | 0011 1 |
| $x_3 \wedge x_4$ | 1001 1 |
| $x_1 \wedge x_2 \wedge x_3$ | 0011 1 |
| $x_1 \wedge x_2 \wedge x_4$ | 0011 1 |
| $x_1 \wedge x_3 \wedge x_4$ | 0011 1 |
| $x_2 \wedge x_3 \wedge x_4$ | 0011 1 |
| $x_1 \wedge x_2 \wedge x_3 \wedge x_4$ | 0011 1 |
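The claim that no simple conjunction fits the data can be verified mechanically. The sketch below is an illustration in Python, not course code; it treats the 16 rules as the 16 subsets of $\{x_1, x_2, x_3, x_4\}$ (the empty subset being the constant rule $y = 1$, which is an interpretive assumption) and tests each against the seven training examples.

```python
from itertools import combinations

# The seven labeled examples from the slides: ((x1, x2, x3, x4), y)
TRAIN = [
    ((0, 0, 1, 0), 0),
    ((0, 1, 0, 0), 0),
    ((0, 0, 1, 1), 1),
    ((1, 0, 0, 1), 1),
    ((0, 1, 1, 0), 0),
    ((1, 1, 0, 0), 0),
    ((0, 1, 0, 1), 0),
]


def conjunction(subset):
    """Rule y = AND of the variables whose 0-based indices are in subset."""
    return lambda x: int(all(x[i] for i in subset))


# One rule per subset of the four variables: 2**4 = 16 rules in total.
conj_rules = [s for k in range(5) for s in combinations(range(4), k)]

consistent_conj = [
    s for s in conj_rules
    if all(conjunction(s)(x) == y for x, y in TRAIN)
]
print(len(conj_rules), consistent_conj)
```

An empty result list means every conjunction is contradicted by at least one training example.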

Hypothesis Space (3)

m-of-n rules: There are 32 possible rules of the form “y = 1 if and only if at least m of the following n variables are 1”

Found a consistent hypothesis.

Each entry is the number of a training example that contradicts the rule; “-” marks combinations where m exceeds the number of variables.

| variables | 1-of | 2-of | 3-of | 4-of |
|---|---|---|---|---|
| {x1} | 3 | - | - | - |
| {x2} | 2 | - | - | - |
| {x3} | 1 | - | - | - |
| {x4} | 7 | - | - | - |
| {x1, x2} | 2 | 3 | - | - |
| {x1, x3} | 1 | 3 | - | - |
| {x1, x4} | 6 | 3 | - | - |
| {x2, x3} | 2 | 3 | - | - |
| {x2, x4} | 2 | 3 | - | - |
| {x3, x4} | 1 | 4 | - | - |
| {x1, x2, x3} | 1 | 3 | 3 | - |
| {x1, x2, x4} | 2 | 3 | 3 | - |
| {x1, x3, x4} | 1 | consistent | 3 | - |
| {x2, x3, x4} | 1 | 5 | 3 | - |
| {x1, x2, x3, x4} | 1 | 5 | 3 | 3 |
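The m-of-n search is small enough to enumerate. The following Python sketch (illustrative only; `m_of_n`, `mn_rules` and `consistent_mn` are names introduced here) builds every rule "at least m of the variables in this subset are 1" over four variables and keeps those that fit all seven training examples.

```python
from itertools import combinations

# The seven labeled examples from the slides: ((x1, x2, x3, x4), y)
TRAIN = [
    ((0, 0, 1, 0), 0),
    ((0, 1, 0, 0), 0),
    ((0, 0, 1, 1), 1),
    ((1, 0, 0, 1), 1),
    ((0, 1, 1, 0), 0),
    ((1, 1, 0, 0), 0),
    ((0, 1, 0, 1), 0),
]


def m_of_n(m, subset):
    """Rule: y = 1 iff at least m of the variables in subset (0-based) are 1."""
    return lambda x: int(sum(x[i] for i in subset) >= m)


# For every non-empty subset of size n, m ranges over 1..n.
mn_rules = [
    (m, subset)
    for n in range(1, 5)
    for subset in combinations(range(4), n)
    for m in range(1, n + 1)
]

consistent_mn = [
    (m, subset) for m, subset in mn_rules
    if all(m_of_n(m, subset)(x) == y for x, y in TRAIN)
]
print(len(mn_rules), consistent_mn)
```

The count is 4·1 + 6·2 + 4·3 + 1·4 = 32 rules, and the single survivor is "at least 2 of {x1, x3, x4}" (indices 0, 2, 3 in the code).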

Views of Learning

Learning is the removal of our remaining uncertainty:

Suppose we knew that the unknown function was an m-of-n Boolean function; then we could use the training data to infer which function it is.

Learning requires guessing a good, small hypothesis class:

We can start with a very small class and enlarge it until it contains a hypothesis that fits the data.

We could be wrong!

Our prior knowledge might be wrong: $y = x_4 \wedge \text{one-of}(x_1, x_3)$ is also consistent

Our guess of the hypothesis class could be wrong

If this is the unknown function, then we will make errors when we are given new examples and are asked to predict the value of the function.

General strategies for Machine Learning

Develop representation languages for expressing concepts

Serve to limit the expressivity of the target models

E.g., Functional representation (m-of-n); Grammars; stochastic models

Develop flexible hypothesis spaces:

Nested collections of hypotheses. Decision trees, neural networks

Hypothesis spaces of flexible size

In either case:

Develop algorithms for finding a hypothesis in our hypothesis space that fits the data

And hope that they will generalize well

Terminology

Target function (concept): The true function $f : X \to \{\ldots\text{Labels}\ldots\}$

Concept: Boolean function. Examples for which $f(x) = 1$ are positive examples; those for which $f(x) = 0$ are negative examples (instances)

Hypothesis: A proposed function h, believed to be similar to f. The output of our learning algorithm.

Hypothesis space: The space of all hypotheses that can, in principle, be output by the learning algorithm.

Classifier: A discrete valued function produced by the learning algorithm. The possible values of f, $\{1, 2, \ldots, K\}$, are the classes or class labels. (In most algorithms the classifier will actually return a real valued function that we’ll have to interpret.)

Training examples: A set of examples of the form $\{(x, f(x))\}$

Key Issues in Machine Learning

Modeling

How to formulate application problems as machine learning problems? How to represent the data?

Learning Protocols (where are the data and labels coming from?)

Representation:

What are good hypothesis spaces?

Any rigorous way to find these? Any general approach?

Algorithms:

What are good algorithms?

(The Bio Example)

How do we define success?

Generalization vs. overfitting

The computational problem

Example: Generalization vs. Overfitting

What is a Tree?

A botanist: A tree is something with leaves I’ve seen before

Her brother: A tree is a green thing

Neither will generalize well

Announcements

Class Registration: Almost all the waiting list is in

My office hours: Tuesday 10:45-11:30; Thursday 1-1:45 PM

E-site

Homework: Hw1 will be made available today. Start today.

Key Issues in Machine Learning

Modeling

How to formulate application problems as machine learning problems? How to represent the data?

Learning Protocols (where are the data and labels coming from?)

Representation

What are good hypothesis spaces?

Any rigorous way to find these? Any general approach?

Algorithms

What are good algorithms?

How do we define success?

Generalization vs. overfitting

The computational problem

An Example

I don’t know {whether, weather} to laugh or cry

How can we make this a learning problem?

We will look for a function $F: \text{Sentences} \to \{\text{whether}, \text{weather}\}$

We need to define the domain of this function better.

An option: For each word $w$ in English define a Boolean feature $x_w$: $[x_w = 1]$ iff $w$ is in the sentence

This maps a sentence to a point in $\{0,1\}^{50,000}$

In this space: some points are whether points, some are weather points

Learning Protocol? Supervised? Unsupervised?

This is the Modeling Step
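The Boolean word features described here can be sketched in a few lines of Python. This is only an illustration: the six-word `VOCAB` is a hypothetical stand-in for the roughly 50,000-word English vocabulary, and whitespace splitting is a simplifying assumption about tokenization.

```python
# Hypothetical tiny vocabulary; a real system would use tens of thousands of words.
VOCAB = ["know", "laugh", "cry", "rain", "forecast", "to"]


def extract_features(sentence, vocabulary):
    """Map a sentence to a 0/1 vector: component w is 1 iff word w occurs in it."""
    words = set(sentence.lower().split())
    return [1 if w in words else 0 for w in vocabulary]


features = extract_features("I don't know ___ to laugh or cry", VOCAB)
print(features)
```

Each sentence becomes one point in $\{0,1\}^{|\text{VOCAB}|}$; the label (whether or weather) is what the learner has to predict from that point.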

Representation Step: What’s Good?

Learning problem: Find a function that best separates the data

What function?

What’s best?

(How to find it?)

A possibility: Define the learning problem to be: Find a (linear) function that best separates the data

Linear = linear in the feature space

$x$ = data representation; $w$ = the classifier

$y = \mathrm{sgn}\{w^T x\}$

Memorizing vs. Learning

How well will you do?

Doing well on what?

Expressivity

$f(x) = \mathrm{sgn}\{x \cdot w - \theta\} = \mathrm{sgn}\{\sum_{i=1}^{n} w_i x_i - \theta\}$

Many functions are Linear

Conjunctions:

$y = x_1 \wedge x_3 \wedge x_5$

$y = \mathrm{sgn}\{1 \cdot x_1 + 1 \cdot x_3 + 1 \cdot x_5 - 3\}$; $w = (1, 0, 1, 0, 1)$, $\theta = 3$

At least m of n:

$y$ = at least 2 of $\{x_1, x_3, x_5\}$

$y = \mathrm{sgn}\{1 \cdot x_1 + 1 \cdot x_3 + 1 \cdot x_5 - 2\}$; $w = (1, 0, 1, 0, 1)$, $\theta = 2$

Many functions are not

Xor: $y = (x_1 \wedge x_2) \vee (\neg x_1 \wedge \neg x_2)$

Non trivial DNF: $y = (x_1 \wedge x_2) \vee (x_3 \wedge x_4)$

Probabilistic Classifiers as well
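The two linear encodings can be checked exhaustively over all 32 Boolean inputs. The Python sketch below is an illustration; it assumes the convention that the unit outputs 1 when $w \cdot x - \theta \ge 0$ (that is, sgn(0) counts as positive), which is what makes the thresholds 3 and 2 work.

```python
from itertools import product


def ltu(w, theta, x):
    """Linear threshold unit: 1 if w.x - theta >= 0, else 0 (sgn(0) taken as positive)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) - theta >= 0 else 0


W = (1, 0, 1, 0, 1)  # weights on x1..x5; only x1, x3, x5 matter


def conj_135(x):
    """x1 AND x3 AND x5 (0-based indices 0, 2, 4)."""
    return int(x[0] and x[2] and x[4])


def at_least_two_135(x):
    """At least 2 of {x1, x3, x5} are 1."""
    return int(x[0] + x[2] + x[4] >= 2)


all_inputs = list(product([0, 1], repeat=5))
conj_ok = all(ltu(W, 3, x) == conj_135(x) for x in all_inputs)
m_of_n_ok = all(ltu(W, 2, x) == at_least_two_135(x) for x in all_inputs)
print(conj_ok, m_of_n_ok)
```

The same weight vector serves both functions; only the threshold changes.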

Exclusive-OR (XOR)

$(x_1 \wedge x_2) \vee (\neg x_1 \wedge \neg x_2)$

In general: a parity function.

$x_i \in \{0,1\}$

$f(x_1, x_2, \ldots, x_n) = 1$ iff $\sum_i x_i$ is even

This function is not linearly separable.

[Figure; axes: $x_1$, $x_2$]
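A small search makes the contrast with linearly separable functions concrete. The Python sketch below tries every integer weight pair and threshold in a small grid (the range -3..3 is an arbitrary choice) for both AND and the two-variable even-parity function. A finite grid search is an illustration, not a proof of non-separability.

```python
from itertools import product

POINTS = list(product([0, 1], repeat=2))


def find_separator(target, grid=range(-3, 4)):
    """Return some (w1, w2, theta) on the grid that realizes target, or None."""
    for w1, w2, theta in product(grid, repeat=3):
        if all((1 if w1 * x1 + w2 * x2 - theta >= 0 else 0) == target(x1, x2)
               for x1, x2 in POINTS):
            return (w1, w2, theta)
    return None


def and_fn(x1, x2):
    return int(x1 and x2)


def even_parity(x1, x2):
    # 1 iff x1 + x2 is even, i.e. (x1 AND x2) OR (NOT x1 AND NOT x2)
    return int((x1 + x2) % 2 == 0)


and_sep = find_separator(and_fn)
parity_sep = find_separator(even_parity)
print(and_sep, parity_sep)
```

The search finds weights for AND and none for parity, matching the statement that parity cannot be written as a single linear threshold.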

Data are not separable in one dimension

Not separable if you insist on using a specific class of functions

[Figure; axis: $x$]

Blown Up Feature Space

Data are separable in $\langle x, x^2 \rangle$ space

[Figure; axes: $x$, $x^2$]

Key issue: Representation: what features to use.

Computationally, can be done implicitly (kernels)

But there are warnings.
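The effect of the blown-up space can be shown on a toy data set. In the Python sketch below, the points in `POS` and `NEG` and the weights `W2`, `THETA2` are made-up values chosen for illustration: no single threshold on $x$ separates the classes, but a linear threshold on $(x, x^2)$ does.

```python
# Hypothetical 1-D data: positives lie on both sides of the negatives.
POS = [-3, -2, 2, 3]
NEG = [-1, 0, 1]


def threshold_separable(pos, neg):
    """True if a single cut on the line puts all positives on one side of all negatives."""
    return max(neg) < min(pos) or max(pos) < min(neg)


def lift(x):
    """Map x to the blown-up representation (x, x^2)."""
    return (x, x * x)


def ltu2(w, theta, z):
    """Linear threshold unit on a 2-D point z."""
    return 1 if w[0] * z[0] + w[1] * z[1] - theta >= 0 else 0


W2, THETA2 = (0, 1), 2.5  # hand-picked: predict positive iff x^2 >= 2.5

separable_1d = threshold_separable(POS, NEG)
separable_lifted = (all(ltu2(W2, THETA2, lift(x)) == 1 for x in POS)
                    and all(ltu2(W2, THETA2, lift(x)) == 0 for x in NEG))
print(separable_1d, separable_lifted)
```

The classifier is still linear; only the representation changed.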

A real Weather/Whether example

[Figure: Weather and Whether points]

Space: $X = x_1, x_2, \ldots, x_n$

Discriminator in the original space: $x_1 x_2 x_4 \vee x_2 x_4 x_5 \vee x_1 x_3 x_7$

Input Transformation

New Space: $Y = \{y_1, y_2, \ldots\} = \{x_i, x_i x_j, x_i x_j x_k\}$

Discriminator in the new space: $y_3 \vee y_4 \vee y_7$

New discriminator is functionally simpler

Third Step: How to Learn?

A possibility: Local search

See how well you are doing.

Correct

Repeat until you converge.

There are other ways that do not search directly in the hypotheses space

Directly compute the hypothesis

A General Framework for Learning

Goal: predict an unobserved output value $y \in Y$ based on an observed input vector $x \in X$

Estimate a functional relationship $y \sim f(x)$ from a set $\{(x,y)_i\}_{i=1,n}$

Most relevant - Classification: $y \in \{0,1\}$ (or $y \in \{1,2,\ldots,k\}$)

(But, within the same framework can also talk about Regression, $y \in \mathbb{R}$)

What do we want $f(x)$ to satisfy?

We want to minimize the Loss (Risk): $L(f()) = E_{X,Y}([f(x) \neq y])$

Where: $E_{X,Y}$ denotes the expectation with respect to the true distribution.

Simply: # of mistakes

$[\ldots]$ is an indicator function

A General Framework for Learning (II)

We want to minimize the Loss: $L(f()) = E_{X,Y}([f(X) \neq Y])$

Where: $E_{X,Y}$ denotes the expectation with respect to the true distribution.

We cannot do that.

Instead, we try to minimize the empirical classification error.

For a set of training examples $\{(X_i, Y_i)\}_{i=1,n}$

Try to minimize: $L'(f()) = \frac{1}{n} \sum_i [f(X_i) \neq Y_i]$

(Issue I: why/when is this good enough? Not now)

This minimization problem is typically NP hard.

To alleviate this computational problem, minimize a new function: a convex upper bound of the classification error function

$I(f(x), y) = [f(x) \neq y] = \{1 \text{ when } f(x) \neq y;\ 0 \text{ otherwise}\}$

Side note: If the distribution over $X \times Y$ is known, predict: $y = \arg\max_y P(y|x)$

This produces the optimal Bayes' error.
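The empirical classification error is simply the fraction of training examples a hypothesis gets wrong. A minimal Python sketch follows, using the seven examples from the earlier learning problem and, purely as an illustration, the hypothesis $f(x) = x_4$.

```python
# The seven labeled examples from the slides: ((x1, x2, x3, x4), y)
TRAIN = [
    ((0, 0, 1, 0), 0),
    ((0, 1, 0, 0), 0),
    ((0, 0, 1, 1), 1),
    ((1, 0, 0, 1), 1),
    ((0, 1, 1, 0), 0),
    ((1, 1, 0, 0), 0),
    ((0, 1, 0, 1), 0),
]


def empirical_error(f, examples):
    """L'(f) = (1/n) * sum over examples of the indicator [f(X_i) != Y_i]."""
    return sum(1 for x, y in examples if f(x) != y) / len(examples)


def predict_x4(x):
    """Illustrative hypothesis: output the fourth feature."""
    return x[3]


err = empirical_error(predict_x4, TRAIN)
print(err)
```

This hypothesis misclassifies only the example 0101, so its empirical error is 1/7; nothing here says how it behaves on the nine unseen inputs.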

Algorithmic View of Learning: an Optimization Problem

A Loss Function $L(f(x), y)$ measures the penalty incurred by a classifier $f$ on example $(x, y)$.

There are many different loss functions one could define:

Misclassification Error: $L(f(x), y) = 0$ if $f(x) = y$; 1 otherwise

Squared Loss: $L(f(x), y) = (f(x) - y)^2$

Input dependent loss: $L(f(x), y) = 0$ if $f(x) = y$; $c(x)$ otherwise.

A continuous convex loss function allows a simpler optimization algorithm.

[Figure; axes: $f(x) - y$, $L$]

Loss

Here $f(x) \in \mathbb{R}$ is the prediction; $y \in \{-1, 1\}$ is the correct value

0-1 Loss: $L(y, f(x)) = \frac{1}{2}(1 - \mathrm{sgn}(y f(x)))$

Log Loss: $L(y, f(x)) = \frac{1}{\ln 2} \log(1 + \exp\{-y f(x)\})$

Hinge Loss: $L(y, f(x)) = \max(0, 1 - y f(x))$

Square Loss: $L(y, f(x)) = (y - f(x))^2$

[Figure: four loss plots. 0-1 Loss, Log Loss and Hinge Loss: x axis = $y f(x)$. Square Loss: x axis = $(y - f(x) + 1)$]
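The four losses are easy to compare numerically. The Python sketch below implements them as written and checks, on an arbitrary grid of predictions, that the log, hinge and square losses never fall below the 0-1 loss. Taking sgn(0) = 0 is an assumption of this sketch.

```python
import math


def sgn(z):
    """Sign function with sgn(0) = 0 (an assumption of this sketch)."""
    return (z > 0) - (z < 0)


def zero_one_loss(y, fx):
    return 0.5 * (1 - sgn(y * fx))


def log_loss(y, fx):
    # (1 / ln 2) * ln(1 + exp(-y f(x))), i.e. a base-2 logarithm
    return math.log(1 + math.exp(-y * fx)) / math.log(2)


def hinge_loss(y, fx):
    return max(0.0, 1 - y * fx)


def square_loss(y, fx):
    return (y - fx) ** 2


# Labels in {-1, +1}, predictions from -5 to 5 in steps of 0.25.
grid = [(y, k / 4) for y in (-1, 1) for k in range(-20, 21)]

log_bounds = all(log_loss(y, fx) >= zero_one_loss(y, fx) for y, fx in grid)
hinge_bounds = all(hinge_loss(y, fx) >= zero_one_loss(y, fx) for y, fx in grid)
square_bounds = all(square_loss(y, fx) >= zero_one_loss(y, fx) for y, fx in grid)
print(log_bounds, hinge_bounds, square_bounds)
```

The $1/\ln 2$ factor scales the log loss so that it equals 1 at $y f(x) = 0$, which is what lets it sit above the 0-1 loss everywhere.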


Example

Putting it all together:

A Learning Algorithm


Learning Linear Separators (LTU)

$f(x) = \mathrm{sgn}\{x^T \cdot w - \theta\} = \mathrm{sgn}\{\sum_{i=1}^{n} w_i x_i - \theta\}$

$x^T = (x_1, x_2, \ldots, x_n) \in \{0,1\}^n$ is the feature based encoding of the data point

$w^T = (w_1, w_2, \ldots, w_n) \in \mathbb{R}^n$ is the target function.

$\theta$ determines the shift with respect to the origin

[Figure; label: $w$]

Canonical Representation

$f(x) = \mathrm{sgn}\{w^T \cdot x - \theta\} = \mathrm{sgn}\{\sum_{i=1}^{n} w_i x_i - \theta\}$

$\mathrm{sgn}\{w^T \cdot x - \theta\} \equiv \mathrm{sgn}\{(w')^T \cdot x'\}$

Where: $x' = (x, -1)$ and $w' = (w, \theta)$

Moved from an $n$ dimensional representation to an $(n+1)$ dimensional representation, but now can look for hyperplanes that go through the origin.
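The equivalence is a one-line identity, which the Python sketch below checks on arbitrary example values (the particular numbers for `w_vec`, `theta_val` and `x_vec` are made up).

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def augment_example(x):
    """x' = (x, -1): append a constant -1 feature."""
    return tuple(x) + (-1,)


def augment_weights(w, theta):
    """w' = (w, theta): append the threshold as an extra weight."""
    return tuple(w) + (theta,)


w_vec, theta_val = (2, -1, 3), 4   # arbitrary example values
x_vec = (1, 1, 0)

original = dot(w_vec, x_vec) - theta_val
canonical = dot(augment_weights(w_vec, theta_val), augment_example(x_vec))
print(original, canonical)
```

Because the two quantities are equal for every input, their signs agree, so a learner only needs to handle hyperplanes through the origin in the augmented space.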

LMS: An Optimization Algorithm

A local search learning algorithm requires:

Hypothesis Space: Linear Threshold Units

Loss function: Squared loss; LMS (Least Mean Square, $L_2$)

Search procedure: Gradient Descent

LMS: An Optimization Algorithm

(i (subscript): vector component; j (superscript): time; d: example #)

Let $w^{(j)}$ be the current weight vector we have

Our prediction on the d-th example $x_d$ is: $o_d = w^{(j)} \cdot x_d = \sum_i w_i^{(j)} x_{id}$

Let $t_d$ be the target value for this example (real value; represents $u \cdot x$)

The error the current hypothesis makes on the data set $D$ is: $\mathrm{Err}(w^{(j)}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$

Assumption: $x \in \mathbb{R}^n$; $u \in \mathbb{R}^n$ is the target weight vector; the target (label) is $t_d = u \cdot x$. Possibly, no weight vector is consistent with the data.

Gradient Descent

We use gradient descent to determine the weight vector that minimizes $\mathrm{Err}(w)$;

Fixing the set $D$ of examples, $E$ is a function of $w^{(j)}$

At each step, the weight vector is modified in the direction that produces the steepest descent along the error surface.

[Figure: error surface $E(w)$ plotted against $w$, with iterates $w^4, w^3, w^2, w^1$]

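To find the best direction in the weight space we compute the gradient of $E$ with respect to each of the components of $w$:

$\nabla E(w) = \left[\frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \ldots, \frac{\partial E}{\partial w_n}\right]$

This vector specifies the direction that produces the steepest increase in $E$;

We want to modify $w$ in the direction of $-\nabla E(w)$: $w \leftarrow w + \Delta w$

Where: $\Delta w = -R \, \nabla E(w)$, with $R$ the learning rate

We have: $E(w) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$, where $o_d = w \cdot x_d$ is the prediction on example $d$

Therefore: $\frac{\partial E}{\partial w_i} = \sum_{d \in D} (t_d - o_d) \frac{\partial}{\partial w_i}(t_d - w \cdot x_d) = \sum_{d \in D} (t_d - o_d)(-x_{id})$

Weight update rule: $\Delta w_i = R \sum_{d \in D} (t_d - o_d) \, x_{id}$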

Weight update rule: $\Delta w_i = R \sum_{d \in D} (t_d - o_d) \, x_{id}$

Gradient descent algorithm for training linear units:

For every example $d$ with target value $t_d$ do:

Evaluate the linear unit: $o_d = w \cdot x_d$

Update $w$ by adding $\Delta w_i$ to each component

Continue until $E$ is below some threshold

Because the surface contains only a single global minimum, the algorithm will converge to a weight vector with minimum error, regardless of whether the examples are linearly separable. (This is true for the case of LMS for linear regression; the surface may have local minima if the loss function is different or when the regression isn’t linear.)
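The batch procedure can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the course's implementation: the four data points, the target vector u = (2, -1), the learning rate 0.05, the zero starting point and the fixed epoch count (instead of an error threshold) are all choices made for the example.

```python
def predict(w, x):
    """Linear unit output o = w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def batch_lms(data, rate=0.05, epochs=500):
    """Batch gradient descent on E(w) = 1/2 * sum_d (t_d - o_d)^2."""
    n = len(data[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        delta = [0.0] * n
        for x, t in data:                      # sum the gradient over all examples
            err = t - predict(w, x)
            for i in range(n):
                delta[i] += rate * err * x[i]  # R * (t_d - o_d) * x_id
        w = [wi + di for wi, di in zip(w, delta)]
    return w


# Hypothetical noise-free data generated from u = (2, -1), so t_d = u . x_d.
BATCH_DATA = [((1, 0), 2), ((0, 1), -1), ((1, 1), 1), ((2, 1), 3)]

w_batch = batch_lms(BATCH_DATA)
print(w_batch)
```

The weights are changed once per pass, after the contributions of all examples have been summed; with this data the result approaches (2, -1).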

Algorithm II: Incremental (Stochastic) Gradient Descent

Weight update rule: $\Delta w_i = R \, (t_d - o_d) \, x_{id}$

Gradient descent algorithm for training linear units:

Start with an initial random weight vector

For every example $d$ with target value $t_d$ do:

Evaluate the linear unit: $o_d = w \cdot x_d$

Update $w$ incrementally by adding $\Delta w_i$ to each component (update without summing over all data)

Continue until $E$ is below some threshold

In general, does not converge to global minimum

Decreasing $R$ with time guarantees convergence

But, on-line algorithms are
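For comparison with the batch version, here is a minimal Python sketch of the incremental update, in which the weights change after every single example. As before, the data, the target vector u = (2, -1), the constant rate 0.05, the zero starting point and the fixed number of passes are assumptions made for the illustration.

```python
def predict(w, x):
    """Linear unit output o = w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def incremental_lms(data, rate=0.05, epochs=500):
    """Incremental (stochastic) LMS: update w right after each example."""
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, t in data:
            err = t - predict(w, x)
            # R * (t_d - o_d) * x_id, applied without summing over the data set
            w = [wi + rate * err * xi for wi, xi in zip(w, x)]
    return w


# Hypothetical noise-free data generated from u = (2, -1), so t_d = u . x_d.
ONLINE_DATA = [((1, 0), 2), ((0, 1), -1), ((1, 1), 1), ((2, 1), 3)]

w_online = incremental_lms(ONLINE_DATA)
print(w_online)
```

This toy data is exactly consistent with a single weight vector, so a constant rate is enough here; with noisy or inconsistent targets the iterates would keep moving unless the rate is decreased over time.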

Learning Rates and Convergence

In the general (non-separable) case the learning rate $R$ must decrease to zero to guarantee convergence.

The learning rate is called the step size. There are algorithms that choose the step size automatically and converge faster.

There is only one “basin” for linear threshold units, so a local minimum is the global minimum. However, choosing a good starting point can make the algorithm converge much faster.

Computational Issues

Assume the data is linearly separable.

Sample complexity:

Suppose we want to ensure that our LTU has an error rate (on new examples) of less than $\epsilon$ with high probability (at least $1 - \delta$)

How large must $m$ (the number of examples) be in order to achieve this? It can be shown that for $n$ dimensional problems

$m = O\left(\frac{1}{\epsilon}\left[\ln\frac{1}{\delta} + (n+1)\ln\frac{1}{\epsilon}\right]\right)$.

Computational complexity: What can be said?

It can be shown that there exists a polynomial time algorithm for finding a consistent LTU (by reduction to linear programming).

[Contrast with the NP hardness for 0-1 loss optimization]

(On-line algorithms have inverse quadratic dependence on the margin)
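To get a feel for how the bound scales, the expression inside the $O(\cdot)$ can be evaluated directly. In the Python sketch below the hidden constant is an explicit parameter `c`, set to 1 purely as a placeholder; the true constant is not given in the slides, so the numbers indicate growth rates rather than actual sample sizes.

```python
import math


def sample_bound(n, epsilon, delta, c=1.0):
    """Evaluate c * (1/eps) * [ln(1/delta) + (n + 1) * ln(1/eps)], rounded up.

    c stands in for the unspecified constant hidden by the O(.) notation.
    """
    value = c / epsilon * (math.log(1 / delta) + (n + 1) * math.log(1 / epsilon))
    return math.ceil(value)


m_small = sample_bound(n=10, epsilon=0.1, delta=0.05)
m_tight = sample_bound(n=10, epsilon=0.01, delta=0.05)
print(m_small, m_tight)
```

Tightening the error target from 0.1 to 0.01 raises the bound by more than a factor of ten, since $\epsilon$ appears both in the leading $1/\epsilon$ and inside a logarithm, while the confidence parameter $\delta$ enters only logarithmically.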

Other Methods for LTUs

Fisher Linear Discriminant:

A direct computation method

Probabilistic methods (naïve Bayes):

Produces a stochastic classifier that can be viewed as a linear threshold unit.

Winnow/Perceptron

A multiplicative/additive update algorithm with some sparsity properties in the function space (a large number of irrelevant attributes) or features space (sparse examples)

Logistic Regression, SVM…many other algorithms