CS 446: Machine Learning

INTRODUCTION
CS446, Fall 2012

Dan Roth
University of Illinois, Urbana-Champaign
danr@illinois.edu
http://L2R.cs.uiuc.edu/~danr
3322 SC

Announcements

- Class registration: still closed; stay tuned.
- My office hours: Tuesday and Thursday, 10:45-11:30.
- E-mail; Piazza; follow the web site.
- Homework: no need to submit Hw0; later homework will be submitted electronically.

Announcements

Sections:
- RM 3405, Monday at 5:00 [A-F] (not on Sept. 3rd)
- RM 3401, Tuesday at 5:00 [G-L]
- RM 3405, Wednesday at 5:30 [M-S]
- RM 3403, Thursday at 5:00 [T-Z]

Next week, in class: hands-on classification. Follow the web site: install Java; bring your laptop if possible.


Course Overview

- Introduction: Basic problems and questions
- A detailed example: Linear threshold units; hands-on classification
- Two Basic Paradigms: PAC (Risk Minimization); Bayesian Theory
- Learning Protocols: Online/Batch; Supervised/Unsupervised/Semi-supervised
- Algorithms:
  - Decision Trees (C4.5)
  - [Rules and ILP (Ripper, Foil)]
  - Linear Threshold Units (Winnow, Perceptron; Boosting; SVMs; Kernels)
  - Probabilistic Representations (naïve Bayes, Bayesian trees; density estimation)
  - Unsupervised/Semi-supervised: EM
  - Clustering, Dimensionality Reduction


What is Learning

The Badges Game... This is an example of the key learning protocol: supervised learning.

- Prediction or modeling?
- Representation
- Problem setting
- Background knowledge
- When did learning take place?
- Algorithm
- Are you sure you got it right?


Supervised Learning

Given: examples (x, f(x)) of some unknown function f.
Find: a good approximation of f.

x provides some representation of the input:
- The process of mapping a domain element into a representation is called feature extraction. (Hard; ill-understood; important.)
- x ∈ {0,1}^n or x ∈ R^n

The target function (label):
- f(x) ∈ {-1, +1}: binary classification
- f(x) ∈ {1, 2, 3, ..., k-1}: multi-class classification
- f(x) ∈ R: regression
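To make the (x, f(x)) notation concrete, here is a minimal Python sketch (not part of the course materials); the feature vectors, labels, and the x4-based hypothesis are illustrative only.

```python
# A minimal sketch: supervised examples (x, f(x)) with x in {0,1}^n
# and a binary label in {-1, +1}.
from typing import List, Tuple

Example = Tuple[Tuple[int, ...], int]   # (feature vector x, label f(x))

training_set: List[Example] = [
    ((0, 0, 1, 0), -1),
    ((0, 0, 1, 1), +1),
    ((1, 0, 0, 1), +1),
]

def accuracy(hypothesis, data: List[Example]) -> float:
    """Fraction of examples on which a proposed hypothesis h agrees with f."""
    return sum(hypothesis(x) == y for x, y in data) / len(data)

# Example usage with a trivial hypothesis that predicts the value of x4:
print(accuracy(lambda x: +1 if x[3] == 1 else -1, training_set))
```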


Supervised Learning: Examples

Disease diagnosis:
- x: properties of the patient (symptoms, lab tests)
- f: disease (or maybe: recommended therapy)

Part-of-speech tagging:
- x: an English sentence (e.g., "The can will rust")
- f: the part of speech of a word in the sentence

Face recognition:
- x: bitmap picture of a person's face
- f: name of the person (or maybe: a property of the person)

Automatic steering:
- x: bitmap picture of the road surface in front of the car
- f: degrees to turn the steering wheel

Many problems that do not seem like classification problems can be decomposed into classification problems, e.g., Semantic Role Labeling.


A Learning Problem

y = f(x1, x2, x3, x4), where f is an unknown function of four Boolean inputs.

Example  x1 x2 x3 x4  y
   1      0  0  1  0  0
   2      0  1  0  0  0
   3      0  0  1  1  1
   4      1  0  0  1  1
   5      0  1  1  0  0
   6      1  1  0  0  0
   7      0  1  0  1  0

Can you learn this function? What is it?


Hypothesis Space

Complete ignorance: there are 2^16 = 65536 possible Boolean functions over four input features. We cannot figure out which one is correct until we have seen every possible input-output pair. After seven examples we still have 2^9 possibilities for f.

Is learning possible?

x1 x2 x3 x4  y
 0  0  0  0  ?
 0  0  0  1  ?
 0  0  1  0  0
 0  0  1  1  1
 0  1  0  0  0
 0  1  0  1  0
 0  1  1  0  0
 0  1  1  1  ?
 1  0  0  0  ?
 1  0  0  1  1
 1  0  1  0  ?
 1  0  1  1  ?
 1  1  0  0  0
 1  1  0  1  ?
 1  1  1  0  ?
 1  1  1  1  ?
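A quick way to see the 2^16 and 2^9 numbers is to enumerate every Boolean function on four inputs as a 16-bit truth table and count the ones consistent with the seven observations. A small sketch (illustrative, not course code):

```python
from itertools import product

inputs = list(product([0, 1], repeat=4))            # all 16 assignments
observed = {(0,0,1,0): 0, (0,0,1,1): 1, (1,0,0,1): 1, (0,1,1,0): 0,
            (1,1,0,0): 0, (0,1,0,1): 0, (0,1,0,0): 0}

total = 0
consistent = 0
for table in product([0, 1], repeat=16):            # one output bit per assignment
    f = dict(zip(inputs, table))                    # a candidate function
    total += 1
    if all(f[x] == y for x, y in observed.items()):
        consistent += 1

print(total, consistent)                            # 65536, 512 (= 2^9)
```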


Hypothesis Space (2)

Simple rules: there are only 16 simple conjunctive rules of the form y = xi ∧ xj ∧ xk.

No simple conjunctive rule explains the data; each one has a counterexample among the seven training examples (shown below as the input bits and the true label). The same is true for simple clauses (disjunctions).

Rule                     Counterexample
y = c                    (no constant fits all examples)
y = x1                   1100 -> 0
y = x2                   0100 -> 0
y = x3                   0110 -> 0
y = x4                   0101 -> 0
y = x1 ∧ x2              1100 -> 0
y = x1 ∧ x3              0011 -> 1
y = x1 ∧ x4              0011 -> 1
y = x2 ∧ x3              0011 -> 1
y = x2 ∧ x4              0011 -> 1
y = x3 ∧ x4              1001 -> 1
y = x1 ∧ x2 ∧ x3         0011 -> 1
y = x1 ∧ x2 ∧ x4         0011 -> 1
y = x1 ∧ x3 ∧ x4         0011 -> 1
y = x2 ∧ x3 ∧ x4         0011 -> 1
y = x1 ∧ x2 ∧ x3 ∧ x4    0011 -> 1


Hypothesis Space (3)

m-of-n rules: there are 20 possible rules of the form "y = 1 if and only if at least m of the following n variables are 1."

The table gives, for each set of variables and each value of m, the index of a training example that contradicts the rule (a dash means m exceeds the number of variables in the set). Exactly one rule is consistent with all seven examples.

variables         1-of  2-of  3-of  4-of
x1                 3     -     -     -
x2                 2     -     -     -
x3                 1     -     -     -
x4                 7     -     -     -
x1, x2             2     3     -     -
x1, x3             1     3     -     -
x1, x4             6     3     -     -
x2, x3             2     3     -     -
x2, x4             2     3     -     -
x3, x4             1     4     -     -
x1, x2, x3         1     3     3     -
x1, x2, x4         2     3     3     -
x1, x3, x4         1    ***    3     -
x2, x3, x4         1     5     3     -
x1, x2, x3, x4     1     5     3     3

Found a consistent hypothesis: y = 1 iff at least 2 of {x1, x3, x4} are 1.
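Under the assumptions above (the seven training examples from the earlier slide), a brute-force search like the following sketch recovers the consistent m-of-n rule; it is illustrative, not the course's code.

```python
from itertools import combinations

# Training data from the "A Learning Problem" slide: (x1, x2, x3, x4) -> y
data = [((0,0,1,0), 0), ((0,1,0,0), 0), ((0,0,1,1), 1), ((1,0,0,1), 1),
        ((0,1,1,0), 0), ((1,1,0,0), 0), ((0,1,0,1), 0)]

def m_of_n(subset, m, x):
    """Predict 1 iff at least m of the chosen variables are 1."""
    return 1 if sum(x[i] for i in subset) >= m else 0

# Enumerate every subset of variables and every threshold m; keep consistent rules.
consistent = []
for k in range(1, 5):
    for subset in combinations(range(4), k):
        for m in range(1, k + 1):
            if all(m_of_n(subset, m, x) == y for x, y in data):
                consistent.append((subset, m))

print(consistent)   # [((0, 2, 3), 2)] -> at least 2 of {x1, x3, x4}
```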


Views of Learning

Learning is the removal of our remaining uncertainty:
- Suppose we knew that the unknown function was an m-of-n Boolean function; then we could use the training data to infer which function it is.

Learning requires guessing a good, small hypothesis class:
- We can start with a very small class and enlarge it until it contains a hypothesis that fits the data.

We could be wrong!
- Our prior knowledge might be wrong: y = x4 ∧ one-of(x1, x3) is also consistent with the data.
- Our guess of the hypothesis class could be wrong.
- If this is the unknown function, then we will make errors when we are given new examples and are asked to predict the value of the function.


General Strategies for Machine Learning

Develop representation languages for expressing concepts:
- These serve to limit the expressivity of the target models.
- E.g., functional representations (m-of-n rules); grammars; stochastic models.

Develop flexible hypothesis spaces:
- Nested collections of hypotheses: decision trees, neural networks.
- Hypothesis spaces of flexible size.

In either case: develop algorithms for finding a hypothesis in our hypothesis space that fits the data, and hope that it will generalize well.



Terminology

- Target function (concept): the true function f : X -> {...labels...}.
- Concept: a Boolean function. Examples for which f(x) = 1 are positive examples; those for which f(x) = 0 are negative examples (instances).
- Hypothesis: a proposed function h, believed to be similar to f; the output of our learning algorithm.
- Hypothesis space: the space of all hypotheses that can, in principle, be output by the learning algorithm.
- Classifier: a discrete-valued function produced by the learning algorithm. The possible values of f, {1, 2, ..., K}, are the classes or class labels. (In most algorithms the classifier will actually return a real-valued function that we have to interpret.)
- Training examples: a set of examples of the form {(x, f(x))}.



Key Issues in Machine Learning

Modeling:
- How to formulate application problems as machine learning problems? How to represent the data?
- Learning protocols (where are the data and labels coming from?)

Representation:
- What are good hypothesis spaces?
- Any rigorous way to find these? Any general approach?

Algorithms:
- What are good algorithms? (The bio example.)
- How do we define success?
- Generalization vs. overfitting
- The computational problem



Example: Generalization vs. Overfitting

What is a tree?
- A botanist: "A tree is something with leaves I've seen before."
- Her brother: "A tree is a green thing."

Neither will generalize well.



Announcements

- Class registration: almost all of the waiting list is in.
- My office hours: Tuesday 10:45-11:30; Thursday 1:00-1:45 PM.
- E-mail; Piazza; follow the web site.
- Homework: Hw1 will be made available today. Start today.



Key Issues in Machine Learning (revisited)

Modeling:
- How to formulate application problems as machine learning problems? How to represent the data?
- Learning protocols (where are the data and labels coming from?)

Representation:
- What are good hypothesis spaces?
- Any rigorous way to find these? Any general approach?

Algorithms:
- What are good algorithms?
- How do we define success?
- Generalization vs. overfitting
- The computational problem



An Example

"I don't know {whether, weather} to laugh or cry."

How can we make this a learning problem?

- We will look for a function F: Sentences -> {whether, weather}.
- We need to define the domain of this function better.
- One option: for each word w in English, define a Boolean feature x_w with [x_w = 1] iff w is in the sentence.
- This maps a sentence to a point in {0,1}^50,000.
- In this space, some points are "whether" points and some are "weather" points.

Learning protocol? Supervised? Unsupervised?

This is the Modeling Step.
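A minimal sketch of that feature map (illustrative only; a real vocabulary would have roughly 50,000 English words, and the words and sentence used here are made up): each sentence becomes a point in {0,1}^|V| with a 1 for every vocabulary word it contains.

```python
# Hypothetical, tiny vocabulary; a real one would be much larger.
vocabulary = ["i", "don't", "know", "to", "laugh", "or", "cry", "the", "forecast", "is"]

def features(sentence: str) -> list[int]:
    """Map a sentence to a point in {0,1}^|V| (bag-of-words indicator features)."""
    words = set(sentence.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

# One labeled example: the sentence with the blanked word, and the correct word.
x = features("i don't know ____ to laugh or cry")
y = "whether"
print(x, y)
```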


Representation Step: What's Good?

Learning problem: find a function that best separates the data.
- What function?
- What's "best"?
- (How do we find it?)

A possibility: define the learning problem to be: find a (linear) function that best separates the data. Linear = linear in the feature space; x is the data representation and w is the classifier:

    y = sgn(w^T x)

Memorizing vs. learning: How well will you do? Doing well on what?


Expressivity

    f(x) = sgn(x · w - θ) = sgn(Σ_{i=1..n} w_i x_i - θ)

Many functions are linear:
- Conjunctions: y = x1 ∧ x3 ∧ x5 is y = sgn(1·x1 + 1·x3 + 1·x5 - 3), i.e., w = (1, 0, 1, 0, 1), θ = 3.
- At least m of n: y = "at least 2 of {x1, x3, x5}" is y = sgn(1·x1 + 1·x3 + 1·x5 - 2), i.e., w = (1, 0, 1, 0, 1), θ = 2.

Many functions are not linear:
- XOR: y = (x1 ∧ x2) ∨ (¬x1 ∧ ¬x2)
- Non-trivial DNF: y = (x1 ∧ x2) ∨ (x3 ∧ x4)

But they can be made linear. Probabilistic classifiers as well.
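A small sketch (illustrative, not course code) checking that the two linear threshold units above compute the conjunction and the at-least-2-of-3 function on all Boolean inputs:

```python
from itertools import product

def ltu(w, theta, x):
    """Linear threshold unit: 1 if w . x >= theta, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0

w = (1, 0, 1, 0, 1)
for x in product([0, 1], repeat=5):
    conj = int(x[0] and x[2] and x[4])           # y = x1 AND x3 AND x5
    atleast2 = int(x[0] + x[2] + x[4] >= 2)      # y = at least 2 of {x1, x3, x5}
    assert ltu(w, 3, x) == conj
    assert ltu(w, 2, x) == atleast2
print("the same w realizes both functions, with different thresholds")
```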


Exclusive-OR (XOR)

    (x1 ∧ x2) ∨ (¬x1 ∧ ¬x2)

In general, this is a parity function: with x_i ∈ {0,1},

    f(x1, x2, ..., xn) = 1  iff  Σ x_i is even.

This function is not linearly separable.

(Figure: the four points in the (x1, x2) plane; no line separates the two classes.)


Functions Can be Made Linear

Data are not separable in one dimension. They are not separable if you insist on using a specific class of functions.

(Figure: one-dimensional data, plotted along x, that no single threshold separates.)


Blown Up Feature Space

The same data are separable in the <x, x^2> space.

Key issue: representation, i.e., what features to use. Computationally, this can be done implicitly (kernels), but there are warnings.

(Figure: the data re-plotted with axes x and x^2, where a line separates the classes.)
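A sketch of the idea on made-up one-dimensional data (illustrative only): points that are not separable by a threshold on x become linearly separable once each x is mapped to the two features (x, x^2).

```python
# Made-up 1-D data: positives lie between the negatives, so no single
# threshold on x separates them.
xs = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
ys = [-1, -1, +1, +1, +1, -1, -1]

# Blow up the feature space: x -> (x, x^2).
phi = [(x, x * x) for x in xs]

# In the new space the separator w . (x, x^2) - theta with w = (0, -1) and
# theta = -2 (i.e., "x^2 <= 2") classifies every point correctly.
w, theta = (0.0, -1.0), -2.0
pred = [+1 if w[0] * a + w[1] * b - theta >= 0 else -1 for a, b in phi]
print(pred == ys)   # True
```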


Functions Can be Made Linear

A real weather/whether example. Suppose the target concept over the original space X = (x1, x2, ..., xn) is the DNF

    x1 x2 x4 ∨ x2 x4 x5 ∨ x1 x3 x7

Apply an input transformation to the new space Y = {y1, y2, ...} = {x_i, x_i x_j, x_i x_j x_k} of monomials. Over Y the same concept becomes the disjunction

    y3 ∨ y4 ∨ y7

so the new discriminator is functionally simpler (linear over Y).


Third Step: How to Learn?

A possibility: local search.
- Start with a linear threshold function.
- See how well you are doing.
- Correct it.
- Repeat until you converge.

There are other ways that do not search directly in the hypothesis space; they directly compute the hypothesis.


A General Framework for Learning

Goal: predict an unobserved output value y ∈ Y based on an observed input vector x ∈ X.

Estimate a functional relationship y ~ f(x) from a set {(x, y)_i}, i = 1, ..., n.

Most relevant here: classification, y ∈ {0,1} (or y ∈ {1, 2, ..., k}). Within the same framework we can also talk about regression, y ∈ R.

What do we want f(x) to satisfy? We want to minimize the loss (risk):

    L(f()) = E_{X,Y}( [f(x) ≠ y] )

where E_{X,Y} denotes the expectation with respect to the true distribution and [...] is an indicator function. Simply: the number of mistakes.


A General Framework for Learning (II)

We want to minimize the loss

    L(f()) = E_{X,Y}( [f(X) ≠ Y] )

where E_{X,Y} denotes the expectation with respect to the true distribution. We cannot do that, since the true distribution is unknown. Instead, we try to minimize the empirical classification error: for a set of training examples {(X_i, Y_i)}, i = 1, ..., n, we try to minimize

    L'(f()) = (1/n) Σ_i [f(X_i) ≠ Y_i]

(Issue I: why/when is this good enough? Not now.)

This minimization problem is typically NP hard. To alleviate this computational problem, minimize a new function: a convex upper bound of the classification error function

    I(f(x), y) = [f(x) ≠ y] = 1 when f(x) ≠ y, 0 otherwise.

Side note: if the distribution over X × Y is known, predict y = argmax_y P(y | x); this produces the optimal Bayes error.
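A minimal sketch of the empirical error L'(f()) above (illustrative; the classifier and the data are placeholders):

```python
def empirical_error(f, examples):
    """L'(f) = (1/n) * sum of [f(x_i) != y_i] over the training set."""
    return sum(f(x) != y for x, y in examples) / len(examples)

# Placeholder data and classifier, just to show the call:
examples = [((0, 1), 1), ((1, 0), 0), ((1, 1), 1)]
f = lambda x: int(x[0] == 1 and x[1] == 1)
print(empirical_error(f, examples))   # 1/3: the first example is misclassified
```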


Algorithmic View of Learning: an Optimization Problem

A loss function L(f(x), y) measures the penalty incurred by a classifier f on example (x, y). There are many different loss functions one could define:

- Misclassification error: L(f(x), y) = 0 if f(x) = y; 1 otherwise.
- Squared loss: L(f(x), y) = (f(x) - y)^2
- Input-dependent loss: L(f(x), y) = 0 if f(x) = y; c(x) otherwise.

A continuous, convex loss function allows a simpler optimization algorithm.

(Figure: the loss L plotted as a function of f(x) for a fixed y.)


Loss

Here f(x) ∈ R is the prediction and y ∈ {-1, 1} is the correct value.

- 0-1 loss:     L(y, f(x)) = 1/2 (1 - sgn(y f(x)))
- Log loss:     L(y, f(x)) = (1/ln 2) log(1 + exp{-y f(x)})
- Hinge loss:   L(y, f(x)) = max(0, 1 - y f(x))
- Square loss:  L(y, f(x)) = (y - f(x))^2

(Figure: the four losses plotted against y f(x); for the square loss the x axis is (y - f(x) + 1).)
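A small sketch of the four losses above (illustrative only), written as functions of the prediction f(x) and the label y. Each of the last three is a convex upper bound of the 0-1 loss.

```python
import math

def zero_one_loss(y, fx):
    return 0.5 * (1 - math.copysign(1, y * fx))        # 1/2 (1 - sgn(y f(x)))

def log_loss(y, fx):
    return (1 / math.log(2)) * math.log(1 + math.exp(-y * fx))

def hinge_loss(y, fx):
    return max(0.0, 1 - y * fx)

def square_loss(y, fx):
    return (y - fx) ** 2

# Compare the losses at a few prediction values for a positive example (y = +1):
for fx in (-2.0, -0.5, 0.5, 2.0):
    print(fx, zero_one_loss(1, fx), log_loss(1, fx), hinge_loss(1, fx), square_loss(1, fx))
```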


Example

Putting it all together: a learning algorithm.


Third Step: How to Learn? (recap)

A possibility: local search.
- Start with a linear threshold function.
- See how well you are doing.
- Correct it.
- Repeat until you converge.

There are other ways that do not search directly in the hypothesis space; they directly compute the hypothesis.


Learning Linear Separators (LTU)

    f(x) = sgn(x^T · w - θ) = sgn(Σ_{i=1..n} w_i x_i - θ)

- x^T = (x_1, x_2, ..., x_n) ∈ {0,1}^n is the feature-based encoding of the data point.
- w^T = (w_1, w_2, ..., w_n) ∈ R^n is the target function.
- θ determines the shift with respect to the origin.

(Figure: a separating hyperplane with normal vector w.)




Canonical Representation

    f(x) = sgn(w^T · x - θ) = sgn(Σ_{i=1..n} w_i x_i - θ)

Note that sgn(w^T · x - θ) ≡ sgn((w')^T · x'), where x' = (x, -1) and w' = (w, θ).

We moved from an n-dimensional representation to an (n+1)-dimensional representation, but now we can look for hyperplanes that go through the origin.
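A two-line sketch of that augmentation (illustrative): fold the threshold into the weight vector by appending a constant -1 feature.

```python
def augment(x):
    """x -> x' = (x, -1): the threshold becomes just another weight."""
    return tuple(x) + (-1.0,)

x, w, theta = (1.0, 0.0, 1.0), (0.5, -0.2, 0.7), 0.9
x_prime, w_prime = augment(x), tuple(w) + (theta,)
lhs = sum(wi * xi for wi, xi in zip(w, x)) - theta       # w.x - theta
rhs = sum(wi * xi for wi, xi in zip(w_prime, x_prime))   # w'.x'
print(abs(lhs - rhs) < 1e-12)   # True: the two forms agree
```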





LMS: An Optimization Algorithm

A local search learning algorithm requires:
- Hypothesis space: linear threshold units
- Loss function: squared loss; LMS (Least Mean Square, L2)
- Search procedure: gradient descent

(A real weather/whether example.)


LMS: An Optimization Algorithm

(Notation: i (subscript) indexes a vector component; j (superscript) indexes time; d indexes the example.)

Let w^(j) be the current weight vector. Our prediction on the d-th example x_d is

    o_d = w^(j) · x_d = Σ_i w_i x_{i,d}

Let t_d be the target value for this example (a real value; it represents u · x_d). The error the current hypothesis makes on the data set is

    Err(w^(j)) = 1/2 Σ_{d ∈ D} (t_d - o_d)^2

Assumption: x ∈ R^n; u ∈ R^n is the target weight vector; the target (label) is t_d = u · x_d. Noise has been added, so possibly no weight vector is consistent with the data.



Gradient Descent

We use gradient descent to determine the weight vector that minimizes Err(w). Fixing the set D of examples, E is a function of w^(j). At each step, the weight vector is modified in the direction that produces the steepest descent along the error surface.

(Figure: the error surface E(w) over weight space, with successive iterates w1, w2, w3, w4 moving toward the minimum.)


Gradient Descent

To find the best direction in the weight space we compute the gradient of E with respect to each of the components of w:

    ∇E(w) = [ ∂E/∂w_1, ∂E/∂w_2, ..., ∂E/∂w_n ]

This vector specifies the direction that produces the steepest increase in E; we want to modify w in the opposite direction:

    Δw = -R ∇E(w)

where R is the learning rate.


Gradient Descent: LMS

We have:

    E(w) = 1/2 Σ_{d ∈ D} (t_d - o_d)^2

Therefore:

    ∂E/∂w_i = ∂/∂w_i [ 1/2 Σ_d (t_d - o_d)^2 ] = Σ_d (t_d - o_d) ∂/∂w_i (t_d - w · x_d) = Σ_d (t_d - o_d)(-x_{i,d})


Gradient Descent: LMS

Weight update rule:

    Δw_i = R Σ_{d ∈ D} (t_d - o_d) x_{i,d}


Gradient Descent: LMS

Weight update rule:

    Δw_i = R Σ_{d ∈ D} (t_d - o_d) x_{i,d}

Gradient descent algorithm for training linear units:
- Start with an initial random weight vector.
- For every example d with target value t_d: evaluate the linear unit o_d = w · x_d, then update w by adding Δw_i to each component.
- Continue until E falls below some threshold.

Because the surface contains only a single global minimum, the algorithm will converge to a weight vector with minimum error, regardless of whether the examples are linearly separable. (This is true for the case of LMS for linear regression; the surface may have local minima if the loss function is different or when the regression isn't linear.)

(A real weather/whether example.)
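A compact sketch of batch LMS gradient descent as described above (illustrative, not course code; the toy data, learning rate R, and stopping threshold are made up):

```python
import random

def lms_batch(data, R=0.05, threshold=1e-4, max_iters=10_000):
    """Batch gradient descent on E(w) = 1/2 * sum_d (t_d - w.x_d)^2."""
    n = len(data[0][0])
    w = [random.uniform(-0.1, 0.1) for _ in range(n)]     # initial random weights
    for _ in range(max_iters):
        preds = [sum(wi * xi for wi, xi in zip(w, x)) for x, _ in data]
        E = 0.5 * sum((t - o) ** 2 for (_, t), o in zip(data, preds))
        if E < threshold:
            break
        # delta_w_i = R * sum over d of (t_d - o_d) * x_{i,d}
        for i in range(n):
            w[i] += R * sum((t - o) * x[i] for (x, t), o in zip(data, preds))
    return w

# Toy data generated from a known target u (with the -1 "threshold" feature appended):
u = [0.7, -0.4, 0.2]
data = [([x1, x2, -1.0], 0.7 * x1 - 0.4 * x2 - 0.2)
        for x1 in (0.0, 0.5, 1.0) for x2 in (0.0, 1.0)]
print(lms_batch(data))   # should approach u
```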


Algorithm II: Incremental (Stochastic) Gradient Descent

Weight update rule (applied after each single example d):

    Δw_i = R (t_d - o_d) x_{i,d}


Incremental (Stochastic) Gradient Descent: LMS

Weight update rule:

    Δw_i = R (t_d - o_d) x_{i,d}

Gradient descent algorithm for training linear units:
- Start with an initial random weight vector.
- For every example d with target value t_d: evaluate the linear unit o_d = w · x_d, then update w by incrementally adding Δw_i to each component (update without summing over all the data).
- Continue until E falls below some threshold.


Incremental (Stochastic) Gradient Descent: LMS

Weight update rule:

    Δw_i = R (t_d - o_d) x_{i,d}

Gradient descent algorithm for training linear units:
- Start with an initial random weight vector.
- For every example d with target value t_d: evaluate the linear unit, then update w by incrementally adding Δw_i to each component (update without summing over all the data).
- Continue until E falls below some threshold.

In general this does not converge to the global minimum; decreasing R with time guarantees convergence. But on-line algorithms are sometimes advantageous...
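A sketch of the incremental (stochastic) variant, reusing the same toy setup as the batch sketch above (illustrative; the decaying learning-rate schedule is one common choice, not something the slide prescribes):

```python
import random

def lms_incremental(data, R0=0.1, epochs=200):
    """Stochastic LMS: after each example, w_i += R (t_d - o_d) x_{i,d}."""
    n = len(data[0][0])
    w = [random.uniform(-0.1, 0.1) for _ in range(n)]
    step = 0
    for _ in range(epochs):
        for x, t in random.sample(data, len(data)):   # visit examples in random order
            step += 1
            R = R0 / (1 + 0.01 * step)                # decreasing R helps convergence
            o = sum(wi * xi for wi, xi in zip(w, x))
            for i in range(n):
                w[i] += R * (t - o) * x[i]            # per-example update, no sum over D
    return w

u = [0.7, -0.4, 0.2]
data = [([x1, x2, -1.0], 0.7 * x1 - 0.4 * x2 - 0.2)
        for x1 in (0.0, 0.5, 1.0) for x2 in (0.0, 1.0)]
print(lms_incremental(data))   # should end up close to u
```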





Learning Rates and Convergence

In the general (non-separable) case the learning rate R must decrease to zero to guarantee convergence. The learning rate is called the step size. There are more sophisticated algorithms (e.g., Conjugate Gradient) that choose the step size automatically and converge faster.

There is only one "basin" for linear threshold units, so a local minimum is the global minimum. However, choosing a good starting point can make the algorithm converge much faster.


Computational Issues

Assume the data is linearly separable.

Sample complexity:
- Suppose we want to ensure that our LTU has an error rate (on new examples) of less than ε, with high probability (at least 1 - δ).
- How large must m (the number of examples) be in order to achieve this? It can be shown that for n-dimensional problems

      m = O( (1/ε) [ ln(1/δ) + (n+1) ln(1/ε) ] ).

Computational complexity: what can be said?
- It can be shown that there exists a polynomial-time algorithm for finding a consistent LTU (by reduction to linear programming).
- [Contrast this with the NP-hardness of 0-1 loss optimization.]
- (On-line algorithms have an inverse quadratic dependence on the margin.)
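The O() bound above hides an unspecified constant, so the sketch below only evaluates the bracketed quantity (1/ε)[ln(1/δ) + (n+1) ln(1/ε)] for a few illustrative settings, to show how the required number of examples scales with the dimension n:

```python
import math

def ltu_sample_bound(eps, delta, n):
    """(1/eps) * (ln(1/delta) + (n+1) * ln(1/eps)); the true bound is this times a constant."""
    return (1 / eps) * (math.log(1 / delta) + (n + 1) * math.log(1 / eps))

for n in (10, 100, 1000):
    print(n, round(ltu_sample_bound(eps=0.1, delta=0.05, n=n)))
# The count grows roughly linearly with the dimension n for fixed eps and delta.
```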




Other Methods for LTUs

- Fisher Linear Discriminant: a direct computation method.
- Probabilistic methods (naïve Bayes): produce a stochastic classifier that can be viewed as a linear threshold unit.
- Winnow/Perceptron: multiplicative/additive update algorithms with some sparsity properties in the function space (a large number of irrelevant attributes) or the feature space (sparse examples).
- Logistic Regression, SVMs... many other algorithms.

