# Why does it work?

Oct 19, 2013

Bayesian Learning, CS446, Fall ‘12

We have not addressed the question of why this classifier performs well, given that the assumptions are unlikely to be satisfied.

The linear form of the classifiers provides some hints.

Why does it work?

Projects

Presentation on 12/15 9am

Final Exam 12/11, in class

No class on Thursday

Happy Thanksgiving


In the case of two classes, write A = P(y=1|x) and B = P(y=0|x). We have that:

(1) A + B = 1

but since the classifier is linear, the log-odds satisfy

(2) log(B/A) = -C, with C linear in the features.

We get (plug (2) into (1); some algebra):

A = 1/(1 + exp(-C))

which is simply the logistic (sigmoid) function used in the neural network representation.

Naïve Bayes: Two Classes

We have: A = 1 - B and log(B/A) = -C. Then:

exp(-C) = B/A = (1 - A)/A = 1/A - 1

1 + exp(-C) = 1/A

A = 1/(1 + exp(-C))
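A quick numeric check of this algebra (any B in (0, 1) works):

```python
import math

# Pick any B in (0, 1); then A = 1 - B and C = -log(B/A), as in the derivation.
B = 0.3
A = 1 - B
C = -math.log(B / A)

# The algebra above says A must equal the sigmoid of C.
sigmoid = 1 / (1 + math.exp(-C))
print(abs(A - sigmoid) < 1e-12)  # True
```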


Another look at Naive Bayes

Graphical model. It encodes the NB independence assumption in the edge structure (siblings are independent given parents).

Note this is a bit different from the previous linearization: rather than a single function, here we have an argmax over several different functions.

Linear Statistical Queries Model


Hidden Markov Model (HMM)

A probabilistic generative model: it models the generation of an observed sequence.

At each time step there are two variables: the current state (hidden) and the observation.

Elements:

Initial state probability P(s_1) (|S| parameters)

Transition probability P(s_t | s_{t-1}) (|S|^2 parameters)

Observation probability P(o_t | s_t) (|S| x |O| parameters)

As before, the graphical model is an encoding of the independence
assumptions:

P(s_t | s_{t-1}, s_{t-2}, ..., s_1) = P(s_t | s_{t-1})

P(o_t | s_T, ..., s_t, ..., s_1, o_T, ..., o_t, ..., o_1) = P(o_t | s_t)

Examples: POS tagging, Sequential Segmentation

s_1 -> s_2 -> s_3 -> s_4 -> s_5 -> s_6, with each state s_t emitting an observation o_t.
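The generative story can be sketched directly in code. The states, observations, and probability values below are invented purely for illustration:

```python
import random

random.seed(0)

# A toy HMM, invented for illustration: two states, three possible observations.
init = {'B': 0.6, 'I': 0.4}                      # P(s_1)
trans = {'B': {'B': 0.3, 'I': 0.7},              # P(s_t | s_{t-1})
         'I': {'B': 0.5, 'I': 0.5}}
emit = {'B': {'a': 0.5, 'c': 0.25, 'd': 0.25},   # P(o_t | s_t)
        'I': {'a': 0.2, 'c': 0.4, 'd': 0.4}}

def draw(dist):
    """Sample a key from a {outcome: probability} dict."""
    r, acc = random.random(), 0.0
    for k, p in dist.items():
        acc += p
        if r < acc:
            return k
    return k  # guard against floating-point round-off

def generate(T):
    """Walk the chain for T steps, emitting one observation per state."""
    states, obs = [], []
    s = draw(init)
    for _ in range(T):
        states.append(s)
        obs.append(draw(emit[s]))
        s = draw(trans[s])
    return states, obs

states, obs = generate(6)
print(states, obs)
```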


HMM for Shallow Parsing

States: {B, I, O}

Observations: actual words and/or part-of-speech tags

States:       s_1=B  s_2=I  s_3=O   s_4=B  s_5=I  s_6=O
Observations: Mr.    Brown  blamed  Mr.    Bob    for


HMM for Shallow Parsing

Given a sentence, we can ask what the most likely state sequence is.

Initial state probabilities:

P(s_1=B), P(s_1=I), P(s_1=O)

Transition probabilities:

P(s_t=B | s_{t-1}=B), P(s_t=I | s_{t-1}=B), P(s_t=O | s_{t-1}=B),
P(s_t=B | s_{t-1}=I), P(s_t=I | s_{t-1}=I), P(s_t=O | s_{t-1}=I), ...

Observation probabilities:

P(o_t=‘Mr.’ | s_t=B), P(o_t=‘Brown’ | s_t=B), ...,
P(o_t=‘Mr.’ | s_t=I), P(o_t=‘Brown’ | s_t=I), ...

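To make this concrete, the joint probability of the tagged "Mr. Brown blamed Mr. Bob for" example can be computed directly from these three tables. All parameter values below are hypothetical, chosen only so the computation runs:

```python
import math

# Hypothetical parameter values, chosen only to make the computation concrete.
init = {'B': 0.5, 'I': 0.1, 'O': 0.4}
trans = {'B': {'B': 0.1, 'I': 0.8, 'O': 0.1},
         'I': {'B': 0.2, 'I': 0.3, 'O': 0.5},
         'O': {'B': 0.4, 'I': 0.1, 'O': 0.5}}
emit = {'B': {'Mr.': 0.3, 'Brown': 0.1, 'blamed': 0.05, 'Bob': 0.1, 'for': 0.01},
        'I': {'Mr.': 0.05, 'Brown': 0.3, 'blamed': 0.05, 'Bob': 0.3, 'for': 0.01},
        'O': {'Mr.': 0.01, 'Brown': 0.01, 'blamed': 0.2, 'Bob': 0.01, 'for': 0.2}}

def joint_log_prob(states, obs):
    """log P(s_1..s_T, o_1..o_T) under the HMM factorization."""
    lp = math.log(init[states[0]])
    for t in range(len(states)):
        lp += math.log(emit[states[t]][obs[t]])
        if t > 0:
            lp += math.log(trans[states[t - 1]][states[t]])
    return lp

states = ['B', 'I', 'O', 'B', 'I', 'O']
obs = ['Mr.', 'Brown', 'blamed', 'Mr.', 'Bob', 'for']
print(round(joint_log_prob(states, obs), 2))  # -11.48
```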


Three Computational Problems

Decoding: finding the most likely path.
Have: model, parameters, observations (data). Want: the most likely state sequence.

Evaluation: computing the observation likelihood.
Have: model, parameters, observations (data). Want: the likelihood of generating the observed data.

In both cases, a simple-minded solution takes |S|^T steps.

Training: estimating the parameters.
Supervised: Have: model, annotated data (data + state sequences).
Unsupervised: Have: model, data.
Want: parameters.

[Figure: a small example HMM with states B, I, observations a, c, d, and illustrative transition/emission probabilities such as 0.5, 0.4, 0.25.]
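The |S|^T cost of the simple-minded solution to Evaluation is easy to see in code: it sums the joint probability over every possible state sequence. The toy parameters below are invented for illustration:

```python
import itertools

# A toy two-state model (numbers invented for illustration).
init = {'B': 0.6, 'I': 0.4}
trans = {'B': {'B': 0.3, 'I': 0.7}, 'I': {'B': 0.5, 'I': 0.5}}
emit = {'B': {'a': 0.5, 'c': 0.25, 'd': 0.25},
        'I': {'a': 0.2, 'c': 0.4, 'd': 0.4}}

def seq_prob(states, obs):
    """P(s_1..s_T, o_1..o_T) for one fixed state sequence."""
    p = init[states[0]] * emit[states[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= trans[states[t - 1]][states[t]] * emit[states[t]][obs[t]]
    return p

obs = ['a', 'd', 'c']

# Evaluation, the simple-minded way: sum over all |S|^T state sequences.
# Here that is 2^3 = 8 terms; for T = 100 it would be 2^100.
likelihood = sum(seq_prob(list(s), obs)
                 for s in itertools.product('BI', repeat=len(obs)))
print(likelihood)
```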


Finding most likely state sequence in HMM (1)

P(s_k, s_{k-1}, ..., s_1, o_k, o_{k-1}, ..., o_1)
  = P(o_k | o_{k-1}, ..., o_1, s_k, s_{k-1}, ..., s_1) · P(o_{k-1}, ..., o_1, s_k, s_{k-1}, ..., s_1)
  = P(o_k | s_k) · P(o_{k-1}, ..., o_1, s_k, s_{k-1}, ..., s_1)
  = P(o_k | s_k) · P(s_k | s_{k-1}, ..., s_1, o_{k-1}, ..., o_1) · P(s_{k-1}, ..., s_1, o_{k-1}, ..., o_1)
  = P(o_k | s_k) · P(s_k | s_{k-1}) · P(s_{k-1}, ..., s_1, o_{k-1}, ..., o_1)
  = P(o_k | s_k) · [prod_{t=1}^{k-1} P(s_{t+1} | s_t) · P(o_t | s_t)] · P(s_1)


Finding most likely state sequence in HMM (2)

argmax_{s_k, ..., s_1} P(s_k, s_{k-1}, ..., s_1 | o_k, o_{k-1}, ..., o_1)
  = argmax_{s_k, ..., s_1} P(s_k, ..., s_1, o_k, ..., o_1) / P(o_k, o_{k-1}, ..., o_1)
  = argmax_{s_k, ..., s_1} P(s_k, ..., s_1, o_k, ..., o_1)
  = argmax_{s_k, ..., s_1} P(o_k | s_k) · [prod_{t=1}^{k-1} P(s_{t+1} | s_t) · P(o_t | s_t)] · P(s_1)


Finding most likely state sequence in HMM (3)

We want to compute

max_{s_k, ..., s_1} P(o_k | s_k) · [prod_{t=1}^{k-1} P(s_{t+1} | s_t) · P(o_t | s_t)] · P(s_1)

Pushing each max in as far as it will go (the inner max is a function of s_k):

  = max_{s_k} P(o_k | s_k) · max_{s_{k-1}, ..., s_1} [prod_{t=1}^{k-1} P(s_{t+1} | s_t) · P(o_t | s_t)] · P(s_1)
  = max_{s_k} P(o_k | s_k) · max_{s_{k-1}} [P(s_k | s_{k-1}) · P(o_{k-1} | s_{k-1})] · max_{s_{k-2}, ..., s_1} [prod_{t=1}^{k-2} P(s_{t+1} | s_t) · P(o_t | s_t)] · P(s_1)
  = max_{s_k} P(o_k | s_k) · max_{s_{k-1}} [P(s_k | s_{k-1}) · P(o_{k-1} | s_{k-1})] · max_{s_{k-2}} [P(s_{k-1} | s_{k-2}) · P(o_{k-2} | s_{k-2})] · ... · max_{s_1} [P(s_2 | s_1) · P(o_1 | s_1)] · P(s_1)


Finding most likely state sequence in HMM (4)

Viterbi’s Algorithm (Dynamic Programming)

max_{s_k} P(o_k | s_k) · max_{s_{k-1}} [P(s_k | s_{k-1}) · P(o_{k-1} | s_{k-1})] · max_{s_{k-2}} [P(s_{k-1} | s_{k-2}) · P(o_{k-2} | s_{k-2})] · ... · max_{s_2} [P(s_3 | s_2) · P(o_2 | s_2)] · max_{s_1} [P(s_2 | s_1) · P(o_1 | s_1)] · P(s_1)
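A minimal Viterbi decoder following this recurrence: at every step it keeps, for each state s, the best probability of any state sequence ending in s, plus a back-pointer. The B/I/O parameter values below are made up for illustration:

```python
# Minimal Viterbi decoder: dynamic programming over the recurrence above.
def viterbi(obs, states, init, trans, emit):
    delta = {s: init[s] * emit[s][obs[0]] for s in states}  # best score ending in s
    back = []                                               # back-pointers per step
    for o in obs[1:]:
        prev, delta, ptr = delta, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] * trans[r][s])
            delta[s] = prev[best] * trans[best][s] * emit[s][o]
            ptr[s] = best
        back.append(ptr)
    last = max(states, key=lambda s: delta[s])              # best final state
    path = [last]
    for ptr in reversed(back):                              # follow pointers backwards
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Illustrative B/I/O parameters (not from the slides).
init = {'B': 0.5, 'I': 0.1, 'O': 0.4}
trans = {'B': {'B': 0.1, 'I': 0.8, 'O': 0.1},
         'I': {'B': 0.2, 'I': 0.3, 'O': 0.5},
         'O': {'B': 0.4, 'I': 0.1, 'O': 0.5}}
emit = {'B': {'Mr.': 0.3, 'Brown': 0.1, 'blamed': 0.05},
        'I': {'Mr.': 0.05, 'Brown': 0.3, 'blamed': 0.05},
        'O': {'Mr.': 0.01, 'Brown': 0.01, 'blamed': 0.2}}
print(viterbi(['Mr.', 'Brown', 'blamed'], ['B', 'I', 'O'], init, trans, emit))
# ['B', 'I', 'O']
```

The cost is O(T · |S|^2) instead of |S|^T, because at each step only the best predecessor per state needs to be remembered.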


Learning the Model

Estimate:

Initial state probability P(s_1)

Transition probability P(s_t | s_{t-1})

Observation probability P(o_t | s_t)

Unsupervised Learning (states are not observed)

EM Algorithm

Supervised Learning (states are observed; more common)

ML Estimate of above terms directly from data

Notice that this is completely analogous to the case of naive Bayes, and essentially all other models.
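In the supervised case the ML estimates are just normalized counts, exactly as in naive Bayes. A sketch, using one made-up tagged sentence as the annotated data:

```python
from collections import Counter, defaultdict

def ml_estimate(tagged_sequences):
    """Supervised ML estimates for an HMM from (state, observation) sequences:
    simple normalized counts of starts, transitions, and emissions."""
    init, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for seq in tagged_sequences:
        init[seq[0][0]] += 1
        for (s_prev, _), (s, _) in zip(seq, seq[1:]):
            trans[s_prev][s] += 1
        for s, o in seq:
            emit[s][o] += 1
    norm = lambda c: {k: v / sum(c.values()) for k, v in c.items()}
    return (norm(init),
            {s: norm(c) for s, c in trans.items()},
            {s: norm(c) for s, c in emit.items()})

# One made-up annotated sentence (state, word) -- illustration only.
data = [[('B', 'Mr.'), ('I', 'Brown'), ('O', 'blamed'),
         ('B', 'Mr.'), ('I', 'Bob'), ('O', 'for')]]
init, trans, emit = ml_estimate(data)
print(trans['B'])  # {'I': 1.0}: B was always followed by I in this sample
```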


Another view of Markov Models

Input: states T, observations W.

Assumptions: (the HMM independence assumptions above).

Prediction: predict the t in T that maximizes the model's score.


Another View of Markov Models

As for NB: features are pairs and singletons of t's and w's; only 3 features are active per decision.

Input: states T, observations W.

This can be extended to an argmax that maximizes the prediction of the whole state sequence, computed, as before, via Viterbi.

HMM is a linear model (over pairs of states and state/observation pairs).


Learning with Probabilistic Classifiers

Learning Theory

We showed that probabilistic predictions can be viewed as predictions via Linear Statistical Queries models (Roth ‘99). The low expressivity explains generalization and robustness.

Is that all? It does not explain why it is possible to (approximately) fit the data with these models. Namely, is there a reason to believe that these hypotheses minimize the empirical error on the sample?

In general, no. (Unless the data corresponds to probabilistic assumptions that hold.)


Example: probabilistic classifiers

Features are pairs and singletons of t's and w's.

States: T. Observations: W.

If the hypothesis does not fit the training data, augment the set of features (forget the assumptions).


Learning Protocol: Practice

LSQ hypotheses are computed directly:

Choose features.

Compute coefficients.

If the hypothesis does not fit the training data, augment the set of features (the assumptions will not be satisfied). But now you actually follow the learning-theory protocol:

Try to learn a hypothesis that is consistent with the data.

Generalization will be a function of the low expressivity.


Robustness of Probabilistic Predictors

Remaining question: while low expressivity explains generalization, why is it relatively easy to fit the data?

Consider all distributions with the same marginals. (That is, a naïve Bayes classifier will predict the same regardless of which of these distributions really generated the data.)

(Garg & Roth, ECML ‘01): Product distributions are “dense” in the space of all distributions. Consequently, for most generating distributions the resulting predictor's error is close to that of the optimal classifier (that is, the one given the correct distribution).


Summary: Probabilistic Modeling

Classifiers derived from probability density estimation models were viewed as LSQ hypotheses.

Probabilistic assumptions:
+ guide feature selection, but also
- do not allow the use of more general features.

A unified approach: many classifiers, probabilistic and otherwise, can be viewed as linear classifiers over an appropriate feature space.


What’s Next?

(1) If probabilistic hypotheses are actually like other linear functions, can we interpret the outcome of other linear learning algorithms probabilistically? Yes.

(2) If probabilistic hypotheses are actually like other linear functions, can we actually train them similarly (that is, discriminatively)? Yes.

Classification: logistic regression / Max Entropy.

HMM: can be learned as a linear model, e.g., with a version of Perceptron (Structured Models; Spring 2013).


Recall: Naïve Bayes, Two Classes

In the case of two classes, write A = P(y=1|x) and B = P(y=0|x). We have (1) A + B = 1, but since (2) log(B/A) = -C, with C linear in the features, we get (plug (2) into (1); some algebra):

A = 1/(1 + exp(-C))

which is simply the logistic (sigmoid) function used in the neural network representation.


Conditional Probabilities

Data: two classes (an Open/NotOpen classifier).

The plot shows a (normalized) histogram of examples as a function of the dot product act = w^T x + b, and a couple of other functions of it.

In particular, we plot the positive sigmoid:

P(y = +1 | x, w) = [1 + exp(-(w^T x + b))]^{-1}

Is this really a probability distribution?


Plotting: for each example z, y = Prob(label = 1 | f(z) = x). (Histogram: for 0.8, the number of examples with f(z) < 0.8.)

Claim: yes. If Prob(label = 1 | f(z) = x) = x, then f(z) is a probability distribution. That is, yes, if the graph is linear.

Theorem: Let X be a RV with distribution F.
(1) F(X) is uniformly distributed in (0,1).
(2) If U is uniform(0,1), then F^{-1}(U) is distributed as F, where F^{-1}(x) is the value y s.t. F(y) = x.

Alternatively: f(z) is a probability if Prob_U { z : Prob[f(z) = 1] <= y } = y.
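Part (1) of the theorem can be checked empirically. Here X is taken to be Exponential(1), whose CDF is F(x) = 1 - e^{-x} (the choice of distribution is arbitrary):

```python
import math
import random

random.seed(1)

# Empirical check of part (1): take X ~ Exponential(1), whose CDF is
# F(x) = 1 - exp(-x); then F(X) should be (approximately) uniform on (0, 1).
F = lambda x: 1 - math.exp(-x)
u = [F(random.expovariate(1.0)) for _ in range(100000)]

mean = sum(u) / len(u)
below_half = sum(v < 0.5 for v in u) / len(u)
print(round(mean, 2), round(below_half, 2))  # both close to 0.5
```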

Conditional Probabilities

Plotted for SNoW (Winnow). Similarly for perceptron; more tuning is required for SVMs.


Conditional Probabilities

(1)

If probabilistic hypotheses are actually like other linear
functions, can we interpret the outcome of other linear
learning algorithms probabilistically?

Yes

General recipe:

Train a classifier f using your favorite algorithm (Perceptron, SVM, Winnow, etc.). Then:

Use the sigmoid 1/(1 + exp{-(A·w^T x + B)}) to get an estimate for P(y | x).

A and B can be tuned using a held-out set that was not used for training.

(Done in LBJ, for example.)
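A sketch of this recipe. The held-out scores below are simulated, and A, B are fit by plain gradient descent on the log loss; this mirrors Platt-style calibration in spirit, not any specific LBJ implementation:

```python
import math
import random

random.seed(0)

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Hypothetical held-out scores w^T x from some trained classifier:
# positives shifted up, negatives shifted down, as a real classifier would give.
scores = ([random.gauss(1.0, 1.5) for _ in range(200)] +
          [random.gauss(-1.0, 1.5) for _ in range(200)])
labels = [1] * 200 + [-1] * 200

# Fit A, B by gradient descent on the log loss of P(y=+1|x) = sigmoid(A*score + B).
A, B, lr = 1.0, 0.0, 0.01
for _ in range(2000):
    gA = gB = 0.0
    for s, y in zip(scores, labels):
        err = sigmoid(A * s + B) - (1 if y == 1 else 0)  # dloss/dz
        gA += err * s
        gB += err
    A -= lr * gA / len(scores)
    B -= lr * gB / len(scores)

# A confidently positive raw score now maps to a probability near 1.
print(round(sigmoid(A * 3.0 + B), 2))
```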


(2)

If probabilistic hypotheses are actually like other linear
functions, can you actually train them similarly (that is,
discriminatively)?

The logistic regression model assumes:

P(y = ±1 | x, w) = [1 + exp(-y(w^T x + b))]^{-1}

This is the same model we derived for naïve Bayes, only now we will not make any independence assumptions: we will directly find the best w.

Therefore training will be more difficult. However, the weight vector derived will be more expressive.

It can be shown that the naïve Bayes algorithm cannot represent all linear threshold functions.

On the other hand, NB converges to its best performance faster.

Logistic Regression


How?


Logistic Regression (2)

Given the model:

P(y = ±1 | x, w) = [1 + exp(-y(w^T x + b))]^{-1}

the goal is to find the (w, b) that maximizes the log likelihood of the data {x_1, x_2, ..., x_m}. Equivalently, we look for the (w, b) that minimizes the negative log-likelihood:

min_{w,b} (1/m) sum_i -log P(y_i = ±1 | x_i, w) = min_{w,b} (1/m) sum_i log[1 + exp(-y_i(w^T x_i + b))]

This optimization problem is called logistic regression.

Logistic regression is sometimes called the Maximum Entropy model in the NLP community (since the resulting distribution is the one that has the largest entropy among all those that activate the same features).


Logistic Regression (3)

Using the standard mapping to linear separators through the origin, we would like to minimize:

min_w (1/m) sum_i -log P(y_i = ±1 | x_i, w) = min_w (1/m) sum_i log[1 + exp(-y_i w^T x_i)]

To get good generalization, it is common to add a regularization term, and the regularized logistic regression objective then becomes:

min_w f(w) = ½ w^T w + C (1/m) sum_i log[1 + exp(-y_i w^T x_i)]

where the first term is the regularization term, the second is the empirical loss, and C is a user-selected parameter that balances the two.
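The regularized objective is straightforward to write down as code. A minimal sketch, with a made-up two-point data set:

```python
import math

def f(w, X, y, C):
    """0.5 * w.w  +  C * (1/m) * sum_i log(1 + exp(-y_i * w.x_i))"""
    dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
    reg = 0.5 * dot(w, w)                       # regularization term
    loss = sum(math.log(1 + math.exp(-yi * dot(w, xi)))
               for xi, yi in zip(X, y)) / len(y)  # empirical loss
    return reg + C * loss

# Two made-up points, one per class.
X = [[1.0, 0.0], [-1.0, 0.0]]
y = [1, -1]
print(round(f([0.0, 0.0], X, y, C=1.0), 4))  # 0.6931 = log(2), the loss at w = 0
```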


min_w f(w) = ½ w^T w + C (1/m) sum_i log[1 + exp(-y_i w^T x_i)]

where the first term is the regularization term, the second is the empirical loss, and C is a user-selected parameter that balances the two.

Since the second term can be considered the loss function, regularized logistic regression can be related to other learning methods, e.g., SVMs.

L1-SVM solves the following optimization problem:

min_w f_1(w) = ½ w^T w + C (1/m) sum_i max(0, 1 - y_i w^T x_i)

L2-SVM solves the following optimization problem:

min_w f_2(w) = ½ w^T w + C (1/m) sum_i (max(0, 1 - y_i w^T x_i))^2


Optimization: How to Solve


All methods are iterative: they generate a sequence w_k that converges to the optimal solution of the optimization problem above. There are many options within this category:

Iterative scaling: low cost per iteration, updating one component of w at a time.

Newton methods: high cost per iteration, faster convergence. Also non-linear quasi-Newton methods; truncated Newton methods; trust-region Newton methods.

Currently, limited-memory BFGS is very popular. The runtime does not depend on n, which matters when n is very large.

Stopping criteria are a problem: these methods tend to be too aggressive at the beginning and reach moderate accuracy quite fast, but their convergence becomes slow if we are interested in more accurate solutions.
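A sketch of the simplest member of this iterative family, plain gradient descent rather than L-BFGS, on a made-up four-point data set; it only illustrates the sequence w_k converging on the regularized objective:

```python
import math

# Plain gradient descent on f(w) = 0.5*w.w + C*(1/m)*sum_i log(1 + exp(-y_i w.x_i)).
# (In practice one would use L-BFGS; this only illustrates the iteration w_k -> w_{k+1}.)
X = [[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -0.5]]  # made-up data
y = [1, 1, -1, -1]
C, lr = 1.0, 0.5
w = [0.0, 0.0]

for k in range(200):
    # gradient: w + C*(1/m) * sum_i -y_i * x_i * sigmoid(-y_i * w.x_i)
    grad = list(w)
    for xi, yi in zip(X, y):
        margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
        coef = -C * yi / ((1 + math.exp(margin)) * len(y))
        grad = [g + coef * xj for g, xj in zip(grad, xi)]
    w = [wj - lr * g for wj, g in zip(w, grad)]

# After enough iterations, w separates the training points.
print(all(yi * sum(wj * xj for wj, xj in zip(w, xi)) > 0
          for xi, yi in zip(X, y)))  # True
```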


Summary

(1) If probabilistic hypotheses are actually like other linear functions, can we interpret the outcome of other linear learning algorithms probabilistically? Yes.

(2) If probabilistic hypotheses are actually like other linear functions, can we actually train them similarly (that is, discriminatively)? Yes.

Classification: logistic regression / Max Entropy.

HMM: can be trained via Perceptron (Spring 2013).