Why does it work?





We have not addressed the question of why this classifier performs well, given that the assumptions are unlikely to be satisfied.

The linear form of the classifiers provides some hints.

Projects


Presentation on 12/15 9am


Updates (see web site)

Final Exam 12/11, in class

No class on Thursday

Happy Thanksgiving



Naïve Bayes: Two Classes

In the case of two classes, write A = P(v = 1 | x) and B = P(v = 0 | x). Naïve Bayes gives the log-odds as a linear function of the features:

(1)  log(B/A) = -C,  with C linear in the features,

but since

(2)  A = 1 - B,

we get (plug (2) into (1); some algebra):

  A = P(v = 1 | x) = 1 / (1 + exp(-C)),

which is simply the logistic (sigmoid) function used in the neural network representation.

The algebra: from A = 1 - B and log(B/A) = -C,
  exp(-C) = B/A = (1 - A)/A = 1/A - 1,  so  1 + exp(-C) = 1/A,  hence  A = 1/(1 + exp(-C)).
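To make the algebra concrete, here is a tiny numeric sketch (the two-class model and its feature probabilities are made up for illustration, not taken from the slides): the naïve Bayes posterior A = P(v = 1 | x) computed directly coincides with the sigmoid of the log-odds C.

import math

# Hypothetical two-class naive Bayes model over 3 binary features (made-up numbers).
prior = {1: 0.6, 0: 0.4}
p_feat = {1: [0.8, 0.3, 0.5],   # P(x_i = 1 | v = 1)
          0: [0.2, 0.6, 0.5]}   # P(x_i = 1 | v = 0)

x = [1, 0, 1]  # an example

def joint(v):
    p = prior[v]
    for xi, pi in zip(x, p_feat[v]):
        p *= pi if xi == 1 else (1 - pi)
    return p

A = joint(1) / (joint(1) + joint(0))   # P(v=1 | x), computed directly
C = math.log(joint(1) / joint(0))      # log-odds, a linear function of the features
print(A, 1.0 / (1.0 + math.exp(-C)))   # the two numbers agree: A = 1/(1+exp(-C))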


Another look at Naïve Bayes

Graphical model: it encodes the NB independence assumption in the edge structure (siblings are independent given their parent).

Note that this is a bit different from the previous linearization: rather than a single function, here we have an argmax over several different functions.

(Linear Statistical Queries model)


Hidden Markov Model (HMM)

A probabilistic generative model: it models the generation of an observed sequence.

At each time step there are two variables: the current state (hidden) and the observation.

Elements
  Initial state probability P(s_1)          (|S| parameters)
  Transition probability P(s_t | s_{t-1})   (|S|^2 parameters)
  Observation probability P(o_t | s_t)      (|S| x |O| parameters)

As before, the graphical model is an encoding of the independence assumptions:
  P(s_t | s_{t-1}, s_{t-2}, ..., s_1) = P(s_t | s_{t-1})
  P(o_t | s_T, ..., s_t, ..., s_1, o_T, ..., o_t, ..., o_1) = P(o_t | s_t)

Examples: POS tagging, Sequential Segmentation

[Figure: the HMM chain s_1, s_2, ..., s_6, with observation o_t emitted from state s_t at each step.]
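As a minimal sketch of this generative story (all probabilities below are invented for illustration, not taken from the slides), the following samples a state/observation sequence from an HMM given the three sets of parameters listed above.

import random

# Hypothetical parameters for a 2-state, 3-observation HMM (made-up numbers).
init = {'B': 0.7, 'I': 0.3}                                     # P(s_1)
trans = {'B': {'B': 0.4, 'I': 0.6}, 'I': {'B': 0.5, 'I': 0.5}}  # P(s_t | s_{t-1})
emit = {'B': {'a': 0.6, 'c': 0.2, 'd': 0.2},
        'I': {'a': 0.1, 'c': 0.4, 'd': 0.5}}                    # P(o_t | s_t)

def draw(dist):
    """Sample a key from a {value: probability} dictionary."""
    r, acc = random.random(), 0.0
    for k, p in dist.items():
        acc += p
        if r <= acc:
            return k
    return k

def generate(T):
    states, obs = [], []
    s = draw(init)
    for _ in range(T):
        states.append(s)
        obs.append(draw(emit[s]))
        s = draw(trans[s])   # the next state depends only on the current one
    return states, obs

print(generate(6))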


HMM for Shallow Parsing

States: {B, I, O}
Observations: actual words and/or part-of-speech tags

Example sequence (states above their observations):
  s_1=B  s_2=I   s_3=O    s_4=B  s_5=I  s_6=O
  Mr.    Brown   blamed   Mr.    Bob    for


HMM for Shallow Parsing

Given a sentence, we can ask what the most likely state sequence is.

Initial state probability:
  P(s_1=B), P(s_1=I), P(s_1=O)

Transition probability:
  P(s_t=B | s_{t-1}=B), P(s_t=I | s_{t-1}=B), P(s_t=O | s_{t-1}=B),
  P(s_t=B | s_{t-1}=I), P(s_t=I | s_{t-1}=I), P(s_t=O | s_{t-1}=I), ...

Observation probability:
  P(o_t='Mr.' | s_t=B), P(o_t='Brown' | s_t=B), ...,
  P(o_t='Mr.' | s_t=I), P(o_t='Brown' | s_t=I), ...



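As a small numeric sketch of how these parameters combine (the probability values below are placeholders, not estimates from any corpus), the joint probability of the tagged example sentence above factors as P(s_1) times the transition and observation probabilities along the chain.

# Hypothetical B/I/O shallow-parsing HMM; all numbers are made up for illustration.
init = {'B': 0.5, 'I': 0.1, 'O': 0.4}
trans = {'B': {'B': 0.1, 'I': 0.6, 'O': 0.3},
         'I': {'B': 0.2, 'I': 0.3, 'O': 0.5},
         'O': {'B': 0.4, 'I': 0.1, 'O': 0.5}}
emit = {'B': {'Mr.': 0.3, 'Brown': 0.1, 'Bob': 0.1, 'blamed': 0.0, 'for': 0.0},
        'I': {'Mr.': 0.1, 'Brown': 0.3, 'Bob': 0.3, 'blamed': 0.0, 'for': 0.0},
        'O': {'Mr.': 0.0, 'Brown': 0.0, 'Bob': 0.0, 'blamed': 0.2, 'for': 0.3}}

states = ['B', 'I', 'O', 'B', 'I', 'O']
words = ['Mr.', 'Brown', 'blamed', 'Mr.', 'Bob', 'for']

p = init[states[0]]
for t in range(len(words)):
    if t > 0:
        p *= trans[states[t - 1]][states[t]]   # P(s_t | s_{t-1})
    p *= emit[states[t]][words[t]]             # P(o_t | s_t)
print(p)  # joint probability of the tagged sentence under this toy model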


Three Computational Problems

Decoding: finding the most likely path
  Have: model, parameters, observations (data)
  Want: the most likely state sequence

Evaluation: computing the observation likelihood
  Have: model, parameters, observations (data)
  Want: the likelihood of generating the observed data

In both cases, a simple-minded solution takes on the order of |S|^T steps.

Training: estimating the parameters
  Supervised: Have: model, annotated data (data + state sequences)
  Unsupervised: Have: model, data
  Want: parameters

[Figure: a small worked example with states B and I, observations a, d, d, c, a, and sample transition/emission probabilities (0.5, 0.2, 0.4, 0.25, ...).]


Finding most likely state sequence in HMM (1)

P(s_k, s_{k-1}, ..., s_1, o_k, o_{k-1}, ..., o_1)
  = P(o_k | o_{k-1}, o_{k-2}, ..., o_1, s_k, s_{k-1}, ..., s_1) · P(o_{k-1}, o_{k-2}, ..., o_1, s_k, s_{k-1}, ..., s_1)
  = P(o_k | s_k) · P(o_{k-1}, o_{k-2}, ..., o_1, s_k, s_{k-1}, ..., s_1)
  = P(o_k | s_k) · P(s_k | s_{k-1}, s_{k-2}, ..., s_1, o_{k-1}, o_{k-2}, ..., o_1) · P(s_{k-1}, s_{k-2}, ..., s_1, o_{k-1}, o_{k-2}, ..., o_1)
  = P(o_k | s_k) · P(s_k | s_{k-1}) · P(s_{k-1}, s_{k-2}, ..., s_1, o_{k-1}, o_{k-2}, ..., o_1)
  = P(o_k | s_k) · [ ∏_{t=1}^{k-1} P(s_{t+1} | s_t) · P(o_t | s_t) ] · P(s_1)


Finding most likely state sequence in HMM (2)

argmax_{s_k, s_{k-1}, ..., s_1} P(s_k, s_{k-1}, ..., s_1 | o_k, o_{k-1}, ..., o_1)
  = argmax_{s_k, s_{k-1}, ..., s_1} P(s_k, s_{k-1}, ..., s_1, o_k, o_{k-1}, ..., o_1) / P(o_k, o_{k-1}, ..., o_1)
  = argmax_{s_k, s_{k-1}, ..., s_1} P(s_k, s_{k-1}, ..., s_1, o_k, o_{k-1}, ..., o_1)
  = argmax_{s_k, s_{k-1}, ..., s_1} P(o_k | s_k) · [ ∏_{t=1}^{k-1} P(s_{t+1} | s_t) · P(o_t | s_t) ] · P(s_1)


Finding most likely state sequence in HMM (3)

max_{s_k, s_{k-1}, ..., s_1} P(o_k | s_k) · [ ∏_{t=1}^{k-1} P(s_{t+1} | s_t) · P(o_t | s_t) ] · P(s_1)
  = max_{s_k} P(o_k | s_k) · max_{s_{k-1}, ..., s_1} [ ∏_{t=1}^{k-1} P(s_{t+1} | s_t) · P(o_t | s_t) ] · P(s_1)
  = max_{s_k} P(o_k | s_k) · max_{s_{k-1}} [ P(s_k | s_{k-1}) · P(o_{k-1} | s_{k-1}) ] · max_{s_{k-2}, ..., s_1} [ ∏_{t=1}^{k-2} P(s_{t+1} | s_t) · P(o_t | s_t) ] · P(s_1)
  = max_{s_k} P(o_k | s_k) · max_{s_{k-1}} [ P(s_k | s_{k-1}) · P(o_{k-1} | s_{k-1}) ] · max_{s_{k-2}} [ P(s_{k-1} | s_{k-2}) · P(o_{k-2} | s_{k-2}) ] · ... · max_{s_1} [ P(s_2 | s_1) · P(o_1 | s_1) ] · P(s_1)

(Note: max_{s_{k-1}} [ P(s_k | s_{k-1}) · P(o_{k-1} | s_{k-1}) ] is a function of s_k.)


Finding most likely state sequence in HMM (4)







Viterbi's Algorithm: Dynamic Programming

max_{s_k} P(o_k | s_k) · max_{s_{k-1}} [ P(s_k | s_{k-1}) · P(o_{k-1} | s_{k-1}) ] · max_{s_{k-2}} [ P(s_{k-1} | s_{k-2}) · P(o_{k-2} | s_{k-2}) ] · ... · max_{s_2} [ P(s_3 | s_2) · P(o_2 | s_2) ] · max_{s_1} [ P(s_2 | s_1) · P(o_1 | s_1) ] · P(s_1)
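Viterbi's algorithm evaluates this nested-max expression from the inside out, caching each inner maximum. Below is a minimal dictionary-based sketch (not the course's code; the parameter layout mirrors the earlier parameter sketches, and log-space arithmetic is omitted for brevity).

def viterbi(obs, states, init, trans, emit):
    """Return the most likely state sequence for the observation list `obs`.

    init[s]      = P(s_1 = s)
    trans[s][s2] = P(s_t = s2 | s_{t-1} = s)
    emit[s][o]   = P(o_t = o | s_t = s)
    """
    # delta[s] = probability of the best path ending in state s; psi stores back-pointers.
    delta = {s: init[s] * emit[s][obs[0]] for s in states}
    psi = []
    for o in obs[1:]:
        new_delta, back = {}, {}
        for s in states:
            # maximize over the previous state, exactly as in the recursion above
            prev, score = max(((sp, delta[sp] * trans[sp][s]) for sp in states),
                              key=lambda kv: kv[1])
            new_delta[s] = score * emit[s][o]
            back[s] = prev
        delta, psi = new_delta, psi + [back]
    # trace the best path backwards through the stored back-pointers
    last = max(delta, key=delta.get)
    path = [last]
    for back in reversed(psi):
        path.append(back[path[-1]])
    return list(reversed(path))

With the made-up B/I/O tables from the earlier sketch, for example, viterbi(['Mr.', 'Brown', 'blamed', 'Mr.', 'Bob', 'for'], ['B', 'I', 'O'], init, trans, emit) returns a tag sequence for the example sentence.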


Learning the Model

Estimate:
  Initial state probability P(s_1)
  Transition probability P(s_t | s_{t-1})
  Observation probability P(o_t | s_t)

Unsupervised learning (states are not observed): EM algorithm

Supervised learning (states are observed; more common): ML estimates of the above terms directly from the data

Notice that this is completely analogous to the case of naïve Bayes, and essentially to all other models.
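For the supervised case, the ML estimates are just normalized counts, exactly as in naïve Bayes. A minimal sketch follows (smoothing is omitted, and the tagged-corpus format is an assumption made for illustration).

from collections import Counter, defaultdict

def ml_estimate(tagged_corpus):
    """tagged_corpus: list of sentences, each a list of (word, state) pairs."""
    init, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_corpus:
        states = [s for _, s in sent]
        init[states[0]] += 1
        for (w, s) in sent:
            emit[s][w] += 1                 # counts for P(o_t | s_t)
        for prev, cur in zip(states, states[1:]):
            trans[prev][cur] += 1           # counts for P(s_t | s_{t-1})

    def normalize(counter):
        total = sum(counter.values())
        return {k: v / total for k, v in counter.items()}

    return (normalize(init),
            {s: normalize(c) for s, c in trans.items()},
            {s: normalize(c) for s, c in emit.items()})

corpus = [[('Mr.', 'B'), ('Brown', 'I'), ('blamed', 'O'),
           ('Mr.', 'B'), ('Bob', 'I'), ('for', 'O')]]
print(ml_estimate(corpus))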


Another view of Markov Models

Input: a sequence of observations (W) with corresponding states (T).

Assumptions: those encoded by the HMM graphical model.

Prediction: predict the t in T that maximizes the probability the model assigns, given the input.


Another View of Markov Models

As for NB: the features are pairs and singletons of t's and w's, and only 3 features are active for any given prediction.

This can be extended to an argmax that maximizes the prediction of the whole state sequence, computed, as before, via Viterbi.

HMM is a linear model (over pairs of states and state/observation pairs).


Learning with Probabilistic Classifiers

Learning Theory: we showed that probabilistic predictions can be viewed as predictions via Linear Statistical Queries models (Roth'99). The low expressivity explains generalization and robustness.

Is that all? It does not explain why it is possible to (approximately) fit the data with these models. Namely, is there a reason to believe that these hypotheses minimize the empirical error on the sample?

In general, no. (Unless it corresponds to some probabilistic assumptions that hold.)



Example: probabilistic classifiers

The features are pairs and singletons of t's and w's; additional features can be included.

If the hypothesis does not fit the training data, augment the set of features (forget the assumptions).


Learning Protocol: Practice

LSQ hypotheses are computed directly:
  Choose features
  Compute coefficients
  If the hypothesis does not fit the training data, augment the set of features
  (The assumptions will not be satisfied)

But now you actually follow the Learning Theory protocol:
  Try to learn a hypothesis that is consistent with the data
  Generalization will be a function of the low expressivity





Robustness of Probabilistic Predictors

Remaining question: while low expressivity explains generalization, why is it relatively easy to fit the data?

Consider all distributions with the same marginals. (That is, a naïve Bayes classifier will predict the same regardless of which of these distributions really generated the data.)

Garg & Roth (ECML'01): product distributions are "dense" in the space of all distributions. Consequently, for most generating distributions the resulting predictor's error is close to that of the optimal classifier (that is, the one given the correct distribution).


Summary: Probabilistic Modeling

Classifiers derived from probability density estimation models were viewed as LSQ hypotheses.

Probabilistic assumptions:
  + guide feature selection, but also
  - do not allow the use of more general features.

A unified approach: many classifiers, probabilistic and otherwise, can be viewed as linear classifiers over an appropriate feature space.



What's Next?

(1) If probabilistic hypotheses are actually like other linear functions, can we interpret the outcome of other linear learning algorithms probabilistically?
  Yes.

(2) If probabilistic hypotheses are actually like other linear functions, can you actually train them similarly (that is, discriminatively)?
  Yes.
  Classification: Logistic regression / Max Entropy
  HMM: can be learned as a linear model, e.g., with a version of Perceptron (Structured Models; Spring 2013)



Recall: Naïve Bayes, Two Classes

In the case of two classes, the log-odds are linear in the features: (1) log(B/A) = -C, but since (2) A = 1 - B, we get (plug (2) into (1); some algebra) A = 1/(1 + exp(-C)), which is simply the logistic (sigmoid) function used in the neural network representation.


Conditional Probabilities

Data: two classes (an Open/NotOpen classifier).

The plot shows a (normalized) histogram of examples as a function of the dot product act = (w^T x + b), and a couple of other functions of it. In particular, we plot the positive sigmoid:

  P(y = +1 | x, w) = [1 + exp(-(w^T x + b))]^{-1}

Is this really a probability distribution?


Plotting: for an example z, y = Prob(label = 1 | f(z) = x). (Histogram: for 0.8, the number of examples with f(z) < 0.8.)

Claim: Yes. If Prob(label = 1 | f(z) = x) = x, then f(z) is a probability distribution. That is, yes, if the graph is linear.

Theorem: Let X be a random variable with distribution F.
  (1) F(X) is uniformly distributed in (0,1).
  (2) If U is uniform(0,1), then F^{-1}(U) is distributed according to F, where F^{-1}(x) is the value y s.t. F(y) = x.

Alternatively: f(z) is a probability if Prob_U { z | Prob[f(z) = 1] ≤ y } = y.

Plotted for SNoW (Winnow). Similarly for perceptron; more tuning is required for SVMs.


Conditional Probabilities

(1) If probabilistic hypotheses are actually like other linear functions, can we interpret the outcome of other linear learning algorithms probabilistically?
  Yes.

General recipe:
  Train a classifier f using your favorite algorithm (Perceptron, SVM, Winnow, etc.). Then:
  Use the sigmoid 1/(1 + exp(-(A w^T x + B))) to get an estimate for P(y | x).
  A and B can be tuned using a held-out set that was not used for training.
  Done in LBJ, for example.
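Below is a rough sketch of this recipe, with A and B fit by a simple grid search on held-out data; a real implementation would use Platt's procedure or any numerical optimizer, and the function and variable names here are assumptions made for illustration, not the LBJ API.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_AB(scores, labels, grid=None):
    """Pick (A, B) minimizing held-out negative log-likelihood of
    P(y = +1 | x) = sigmoid(A * f(x) + B), where `scores` are raw f(x) values."""
    if grid is None:
        grid = [i / 10.0 for i in range(-30, 31)]
    best, best_nll = (1.0, 0.0), float('inf')
    for A in grid:
        for B in grid:
            nll = 0.0
            for s, y in zip(scores, labels):   # y in {+1, -1}
                nll += math.log(1.0 + math.exp(-y * (A * s + B)))
            if nll < best_nll:
                best, best_nll = (A, B), nll
    return best

# Held-out raw margins w^T x + b from some trained linear classifier (made-up numbers).
scores = [2.1, 0.3, -1.5, 0.9, -2.7, 1.8]
labels = [+1, +1, -1, +1, -1, +1]
A, B = fit_AB(scores, labels)
print(A, B, [round(sigmoid(A * s + B), 2) for s in scores])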


Logistic Regression

(2) If probabilistic hypotheses are actually like other linear functions, can you actually train them similarly (that is, discriminatively)? How?

The logistic regression model assumes:

  P(y = ±1 | x, w) = [1 + exp(-y(w^T x + b))]^{-1}

This is the same model we derived for naïve Bayes, only that now we will not make any independence assumptions: we will directly find the best w. Therefore training will be more difficult; however, the weight vector derived will be more expressive.
  It can be shown that the naïve Bayes algorithm cannot represent all linear threshold functions.
  On the other hand, NB converges to its performance faster.


Logistic Regression (2)

Given the model

  P(y = ±1 | x, w) = [1 + exp(-y(w^T x + b))]^{-1},

the goal is to find the (w, b) that maximizes the log likelihood of the data {x_1, x_2, ..., x_m}. Equivalently, we are looking for the (w, b) that minimizes the negative log-likelihood:

  min_{w,b} Σ_{i=1}^{m} -log P(y_i | x_i, w) = min_{w,b} Σ_{i=1}^{m} log[1 + exp(-y_i(w^T x_i + b))]

This optimization problem is called Logistic Regression.

Logistic Regression is sometimes called the Maximum Entropy model in the NLP community (since the resulting distribution is the one that has the largest entropy among all those that activate the same features).
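Here is a small sketch of this objective in plain Python (no optimizer yet; the data points are made up for illustration).

import math

def neg_log_likelihood(w, b, X, y):
    """Sum over examples of log(1 + exp(-y_i (w^T x_i + b)))."""
    total = 0.0
    for xi, yi in zip(X, y):
        margin = sum(wj * xj for wj, xj in zip(w, xi)) + b
        total += math.log(1.0 + math.exp(-yi * margin))
    return total

X = [[1.0, 2.0], [2.0, 0.5], [-1.0, -1.5], [-2.0, 0.0]]
y = [+1, +1, -1, -1]
print(neg_log_likelihood([0.5, 0.5], 0.0, X, y))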


Logistic Regression (3)

Using the standard mapping to linear separators through the origin, we would like to minimize:

  min_w Σ_{i=1}^{m} -log P(y_i | x_i, w) = min_w Σ_{i=1}^{m} log[1 + exp(-y_i(w^T x_i))]

To get good generalization, it is common to add a regularization term, and the regularized logistic regression objective then becomes:

  min_w f(w) = ½ w^T w + C Σ_{i=1}^{m} log[1 + exp(-y_i(w^T x_i))]

where the first term is the regularization term, the second is the empirical loss, and C is a user-selected parameter that balances the two.


Comments on Discriminative Learning

  min_w f(w) = ½ w^T w + C Σ_{i=1}^{m} log[1 + exp(-y_i w^T x_i)]

where the first term is the regularization term, the second is the empirical loss, and C is a user-selected parameter that balances the two.

Since the second term can be considered a loss function, regularized logistic regression can be related to other learning methods, e.g., SVMs.

L1-SVM solves the following optimization problem:
  min_w f_1(w) = ½ w^T w + C Σ_{i=1}^{m} max(0, 1 - y_i(w^T x_i))

L2-SVM solves the following optimization problem:
  min_w f_2(w) = ½ w^T w + C Σ_{i=1}^{m} (max(0, 1 - y_i w^T x_i))^2
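To see the relation concretely, here is a small sketch comparing the three per-example losses as a function of the margin y_i w^T x_i (the margin values are made up).

import math

def logistic_loss(m):       # empirical term of regularized logistic regression
    return math.log(1.0 + math.exp(-m))

def hinge_loss(m):          # L1-SVM
    return max(0.0, 1.0 - m)

def squared_hinge_loss(m):  # L2-SVM
    return max(0.0, 1.0 - m) ** 2

for m in [-2.0, -0.5, 0.0, 0.5, 1.0, 2.0]:   # margin y_i * w^T x_i
    print(m, round(logistic_loss(m), 3), hinge_loss(m), squared_hinge_loss(m))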


Optimization: How to Solve


All methods are iterative: they generate a sequence w_k that converges to the optimal solution of the optimization problem above. There are many options within this category:

  Iterative scaling: low cost per iteration, slow convergence; updates one component of w at a time.

  Newton methods: high cost per iteration, faster convergence.
    Nonlinear conjugate gradient, quasi-Newton methods, truncated Newton methods, trust-region Newton methods.
    Currently, limited-memory BFGS is very popular.

  Stochastic gradient descent methods:
    The runtime does not depend on n = #(examples); an advantage when n is very large.
    The stopping criterion is a problem: the method tends to be aggressive at the beginning and reaches a moderate accuracy quite fast, but its convergence becomes slow if we are interested in more accurate solutions.
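As a minimal sketch of the stochastic gradient descent option for the regularized logistic-regression objective above (constant step size, a fixed number of epochs instead of a real stopping criterion, and made-up data; all settings are illustrative assumptions):

import math
import random

def sgd_logreg(X, y, C=1.0, lr=0.1, epochs=100):
    """Minimize 0.5*w^T w + C * sum_i log(1 + exp(-y_i w^T x_i)) by SGD."""
    d = len(X[0])
    w = [0.0] * d
    idx = list(range(len(X)))
    for _ in range(epochs):
        random.shuffle(idx)
        for i in idx:
            margin = sum(wj * xj for wj, xj in zip(w, X[i]))
            # gradient of the i-th loss term is -C * y_i * x_i / (1 + exp(y_i * margin));
            # the regularizer's gradient w is spread over the n per-example updates.
            g_scale = -y[i] * C / (1.0 + math.exp(y[i] * margin))
            for j in range(d):
                w[j] -= lr * (w[j] / len(X) + g_scale * X[i][j])
    return w

X = [[1.0, 2.0], [2.0, 0.5], [-1.0, -1.5], [-2.0, 0.0]]
y = [+1, +1, -1, -1]
print(sgd_logreg(X, y))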



Summary

(1) If probabilistic hypotheses are actually like other linear functions, can we interpret the outcome of other linear learning algorithms probabilistically?
  Yes.

(2) If probabilistic hypotheses are actually like other linear functions, can you actually train them similarly (that is, discriminatively)?
  Yes.
  Classification: Logistic regression / Max Entropy
  HMM: can be trained via Perceptron (Spring 2013)