Bayesian Learning
CS446
FALL ‘12

Why does it work?
• We have not yet addressed the question of why this classifier performs
  well, given that its assumptions are unlikely to be satisfied.
• The linear form of the classifiers provides some hints.

Projects: presentations on 12/15, 9am; updates (see web site).
Final Exam: 12/11, in class.
No class on Thursday. Happy Thanksgiving!
Naïve Bayes: Two Classes
• In the case of two classes we have that:
• but since
• We get (plug (2) into (1); some algebra):
• which is simply the logistic (sigmoid) function used in the
  neural network representation.

We have:  A = 1 - B;  log(B/A) = -C.
Then:     exp(-C) = B/A = (1 - A)/A = 1/A - 1
      =>  1 + exp(-C) = 1/A
      =>  A = 1/(1 + exp(-C))
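The algebra above can be sanity-checked numerically; a minimal sketch (the function name and the concrete value of A are illustrative):

```python
import math

def sigmoid_from_log_odds(C):
    """Given C with log(B/A) = -C and B = 1 - A, recover A = 1/(1 + exp(-C))."""
    return 1.0 / (1.0 + math.exp(-C))

# Pick a concrete A, derive C from the log-odds, and recover A again.
A = 0.7
B = 1 - A
C = -math.log(B / A)          # log(B/A) = -C
assert abs(sigmoid_from_log_odds(C) - A) < 1e-12
```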
Another Look at Naive Bayes
Graphical model. It encodes the NB independence assumption in the
edge structure (siblings are independent given parents).

Note this is a bit different from the previous linearization. Rather
than a single function, here we have an argmax over several different
functions.

(Linear Statistical Queries Model)
Hidden Markov Model (HMM)
A probabilistic generative model: it models the generation of an
observed sequence.
At each time step, there are two variables: the current state (hidden)
and the observation.

Elements:
• Initial state probability P(s_1)            (S parameters)
• Transition probability P(s_t | s_{t-1})     (S^2 parameters)
• Observation probability P(o_t | s_t)        (S x O parameters)

As before, the graphical model is an encoding of the independence
assumptions:
  P(s_t | s_{t-1}, s_{t-2}, ..., s_1) = P(s_t | s_{t-1})
  P(o_t | s_T, ..., s_t, ..., s_1, o_T, ..., o_t, ..., o_1) = P(o_t | s_t)

Examples: POS tagging, sequential segmentation.

[Figure: chain graphical model s_1 -> s_2 -> ... -> s_6, each state s_t
emitting an observation o_t.]
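The generative story above can be sketched as a small sampler; all numbers and state names below are invented for illustration:

```python
import random

random.seed(0)

# A toy HMM over states {B, I, O}; parameter values are illustrative.
init = {"B": 0.5, "I": 0.0, "O": 0.5}
trans = {"B": {"B": 0.1, "I": 0.8, "O": 0.1},
         "I": {"B": 0.1, "I": 0.4, "O": 0.5},
         "O": {"B": 0.5, "I": 0.0, "O": 0.5}}
emit = {"B": {"Mr.": 0.6, "Brown": 0.2, "blamed": 0.1, "for": 0.1},
        "I": {"Mr.": 0.1, "Brown": 0.6, "blamed": 0.1, "for": 0.2},
        "O": {"Mr.": 0.1, "Brown": 0.1, "blamed": 0.4, "for": 0.4}}

def draw(dist):
    """Sample a key from a {key: probability} dict."""
    r, acc = random.random(), 0.0
    for k, p in dist.items():
        acc += p
        if r < acc:
            return k
    return k  # guard against floating-point rounding

def generate(T):
    """Generate a length-T (state, observation) sequence from the HMM."""
    states, obs = [], []
    s = draw(init)
    for _ in range(T):
        states.append(s)
        obs.append(draw(emit[s]))   # emit from the current state
        s = draw(trans[s])          # then transition
    return states, obs

states, obs = generate(6)
```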
HMM for Shallow Parsing
States: {B, I, O}
Observations: actual words and/or part-of-speech tags

[Figure: the tagged sentence
  s:  B    I      O       B    I    O
  o:  Mr.  Brown  blamed  Mr.  Bob  for ]
HMM for Shallow Parsing
Given a sentence, we can ask what the most likely state sequence is.

Initial state probabilities:
  P(s_1=B), P(s_1=I), P(s_1=O)
Transition probabilities:
  P(s_t=B | s_{t-1}=B), P(s_t=I | s_{t-1}=B), P(s_t=O | s_{t-1}=B),
  P(s_t=B | s_{t-1}=I), P(s_t=I | s_{t-1}=I), P(s_t=O | s_{t-1}=I),
  ...
Observation probabilities:
  P(o_t='Mr.' | s_t=B), P(o_t='Brown' | s_t=B), ...,
  P(o_t='Mr.' | s_t=I), P(o_t='Brown' | s_t=I), ...,
  ...

[Figure: the tagged sentence
  s:  B    I      O       B    I    O
  o:  Mr.  Brown  blamed  Mr.  Bob  for ]
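Under this factorization, the probability of a fully tagged sentence is a product of one initial term, T-1 transition terms, and T observation terms. A minimal sketch; the parameter values are made up for illustration:

```python
# Joint probability P(s_1..s_T, o_1..o_T) =
#   P(s_1) * prod_t P(s_t | s_{t-1}) * prod_t P(o_t | s_t).
init = {"B": 0.8, "I": 0.0, "O": 0.2}
trans = {"B": {"B": 0.0, "I": 0.9, "O": 0.1},
         "I": {"B": 0.2, "I": 0.3, "O": 0.5},
         "O": {"B": 0.4, "I": 0.0, "O": 0.6}}
emit = {"B": {"Mr.": 0.5, "Bob": 0.2, "Brown": 0.1, "blamed": 0.1, "for": 0.1},
        "I": {"Mr.": 0.1, "Bob": 0.3, "Brown": 0.4, "blamed": 0.1, "for": 0.1},
        "O": {"Mr.": 0.1, "Bob": 0.1, "Brown": 0.1, "blamed": 0.3, "for": 0.4}}

def joint_prob(states, words):
    """P(s_1..s_T, o_1..o_T) for a fully observed tagged sentence."""
    p = init[states[0]] * emit[states[0]][words[0]]
    for t in range(1, len(states)):
        p *= trans[states[t - 1]][states[t]] * emit[states[t]][words[t]]
    return p

p = joint_prob(["B", "I", "O", "B", "I", "O"],
               ["Mr.", "Brown", "blamed", "Mr.", "Bob", "for"])
```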
Three Computational Problems
Decoding – finding the most likely path
  Have: model, parameters, observations (data)
  Want: the most likely state sequence
Evaluation – computing the observation likelihood
  Have: model, parameters, observations (data)
  Want: the likelihood of generating the observed data
In both cases, a simple-minded solution requires S^T steps.
Training – estimating parameters
  Supervised – Have: model, annotated data (data + state sequences)
  Unsupervised – Have: model, data
  Want: parameters

[Figure: a small two-state (B, I) HMM over observations {a, c, d} with
illustrative transition and emission probabilities.]
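The S^T cost of the simple-minded solution is easy to see in code: decoding by enumerating every state sequence. A toy sketch (the parameters are illustrative):

```python
from itertools import product

# Brute-force decoding: enumerate all S^T state sequences and keep the
# most probable one. Feasible only for tiny T.
init = {"B": 0.6, "I": 0.4}
trans = {"B": {"B": 0.3, "I": 0.7}, "I": {"B": 0.4, "I": 0.6}}
emit = {"B": {"a": 0.7, "b": 0.3}, "I": {"a": 0.2, "b": 0.8}}

def brute_force_decode(obs):
    """Try every state sequence: O(S^T) work."""
    best_seq, best_p = None, -1.0
    for seq in product(init, repeat=len(obs)):          # S^T sequences
        p = init[seq[0]] * emit[seq[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans[seq[t - 1]][seq[t]] * emit[seq[t]][obs[t]]
        if p > best_p:
            best_seq, best_p = seq, p
    return list(best_seq), best_p

path, p = brute_force_decode(["a", "b", "b"])
```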
Finding most likely state sequence in HMM (1)

P(s_k, s_{k-1}, ..., s_1, o_k, o_{k-1}, ..., o_1)
= P(o_k | o_{k-1}, o_{k-2}, ..., o_1, s_k, s_{k-1}, ..., s_1)
    · P(o_{k-1}, o_{k-2}, ..., o_1, s_k, s_{k-1}, ..., s_1)
= P(o_k | s_k) · P(o_{k-1}, o_{k-2}, ..., o_1, s_k, s_{k-1}, ..., s_1)
= P(o_k | s_k) · P(s_k | s_{k-1}, s_{k-2}, ..., s_1, o_{k-1}, o_{k-2}, ..., o_1)
    · P(s_{k-1}, s_{k-2}, ..., s_1, o_{k-1}, o_{k-2}, ..., o_1)
= P(o_k | s_k) · P(s_k | s_{k-1}) · P(s_{k-1}, s_{k-2}, ..., s_1, o_{k-1}, o_{k-2}, ..., o_1)
= P(o_k | s_k) · [ ∏_{t=1}^{k-1} P(s_{t+1} | s_t) · P(o_t | s_t) ] · P(s_1)
Finding most likely state sequence in HMM (2)

argmax_{s_k, s_{k-1}, ..., s_1} P(s_k, s_{k-1}, ..., s_1 | o_k, o_{k-1}, ..., o_1)
= argmax_{s_k, s_{k-1}, ..., s_1} P(s_k, s_{k-1}, ..., s_1, o_k, o_{k-1}, ..., o_1)
    / P(o_k, o_{k-1}, ..., o_1)
= argmax_{s_k, s_{k-1}, ..., s_1} P(s_k, s_{k-1}, ..., s_1, o_k, o_{k-1}, ..., o_1)
= argmax_{s_k, s_{k-1}, ..., s_1} P(o_k | s_k) · [ ∏_{t=1}^{k-1} P(s_{t+1} | s_t) · P(o_t | s_t) ] · P(s_1)
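The second equality (dropping the denominator) can be checked numerically: dividing by P(o_k, ..., o_1), which does not depend on the states, cannot change the argmax. A toy sketch with invented parameters:

```python
from itertools import product

# Toy HMM parameters, for illustration only.
init = {"B": 0.6, "I": 0.4}
trans = {"B": {"B": 0.3, "I": 0.7}, "I": {"B": 0.4, "I": 0.6}}
emit = {"B": {"a": 0.7, "b": 0.3}, "I": {"a": 0.2, "b": 0.8}}

def joint(seq, obs):
    """P(s_1..s_k, o_1..o_k) under the chain factorization."""
    p = init[seq[0]] * emit[seq[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= trans[seq[t - 1]][seq[t]] * emit[seq[t]][obs[t]]
    return p

obs = ["a", "b", "a"]
seqs = list(product(init, repeat=len(obs)))
evidence = sum(joint(s, obs) for s in seqs)      # P(o), constant in s
best_joint = max(seqs, key=lambda s: joint(s, obs))
best_post = max(seqs, key=lambda s: joint(s, obs) / evidence)
assert best_joint == best_post                   # same maximizer
```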
Finding most likely state sequence in HMM (3)

max_{s_k, s_{k-1}, ..., s_1} P(o_k | s_k) · [ ∏_{t=1}^{k-1} P(s_{t+1} | s_t) · P(o_t | s_t) ] · P(s_1)
= max_{s_k} P(o_k | s_k) · max_{s_{k-1}, ..., s_1} [ ∏_{t=1}^{k-1} P(s_{t+1} | s_t) · P(o_t | s_t) ] · P(s_1)
    (the inner max is a function of s_k)
= max_{s_k} P(o_k | s_k) · max_{s_{k-1}} [ P(s_k | s_{k-1}) · P(o_{k-1} | s_{k-1}) ]
    · max_{s_{k-2}, ..., s_1} [ ∏_{t=1}^{k-2} P(s_{t+1} | s_t) · P(o_t | s_t) ] · P(s_1)
= max_{s_k} P(o_k | s_k) · max_{s_{k-1}} [ P(s_k | s_{k-1}) · P(o_{k-1} | s_{k-1}) ]
    · max_{s_{k-2}} [ P(s_{k-1} | s_{k-2}) · P(o_{k-2} | s_{k-2}) ]
    · ... · max_{s_1} [ P(s_2 | s_1) · P(o_1 | s_1) ] · P(s_1)
Finding most likely state sequence in HMM (4)
Viterbi’s Algorithm – Dynamic Programming

max_{s_k} P(o_k | s_k) · max_{s_{k-1}} [ P(s_k | s_{k-1}) · P(o_{k-1} | s_{k-1}) ]
  · max_{s_{k-2}} [ P(s_{k-1} | s_{k-2}) · P(o_{k-2} | s_{k-2}) ]
  · ... · max_{s_2} [ P(s_3 | s_2) · P(o_2 | s_2) ]
  · max_{s_1} [ P(s_2 | s_1) · P(o_1 | s_1) ] · P(s_1)
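The nested-max recurrence is exactly what Viterbi's algorithm evaluates by dynamic programming. A minimal sketch with toy parameters, checked against brute-force enumeration:

```python
from itertools import product

# Toy HMM parameters, illustrative only.
init = {"B": 0.6, "I": 0.4}
trans = {"B": {"B": 0.3, "I": 0.7}, "I": {"B": 0.4, "I": 0.6}}
emit = {"B": {"a": 0.7, "b": 0.3}, "I": {"a": 0.2, "b": 0.8}}

def viterbi(obs):
    """Most likely state sequence in O(T * S^2) time."""
    # delta[s] = probability of the best path ending in state s
    delta = {s: init[s] * emit[s][obs[0]] for s in init}
    back = []                      # back-pointers for path recovery
    for o in obs[1:]:
        prev, delta, ptr = delta, {}, {}
        for s in init:
            best_prev = max(prev, key=lambda r: prev[r] * trans[r][s])
            delta[s] = prev[best_prev] * trans[best_prev][s] * emit[s][o]
            ptr[s] = best_prev
        back.append(ptr)
    # Follow back-pointers from the best final state.
    last = max(delta, key=delta.get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path)), delta[last]

def brute_force(obs):
    def joint(seq):
        p = init[seq[0]] * emit[seq[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans[seq[t - 1]][seq[t]] * emit[seq[t]][obs[t]]
        return p
    best = max(product(init, repeat=len(obs)), key=joint)
    return list(best), joint(best)

obs = ["a", "b", "b", "a"]
v_path, v_p = viterbi(obs)
b_path, b_p = brute_force(obs)
assert v_path == b_path and abs(v_p - b_p) < 1e-12
```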
Learning the Model
Estimate:
• Initial state probability P(s_1)
• Transition probability P(s_t | s_{t-1})
• Observation probability P(o_t | s_t)

Unsupervised learning (states are not observed): EM algorithm.
Supervised learning (states are observed; more common): ML estimates
of the above terms directly from the data.
Notice that this is completely analogous to the case of naive Bayes,
and essentially all other models.
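In the supervised case, the ML estimates are just normalized counts, exactly as for naïve Bayes. A minimal sketch over an invented annotated corpus:

```python
from collections import Counter

# Toy annotated data: (state sequence, word sequence) pairs.
data = [
    (["B", "I", "O"], ["Mr.", "Brown", "blamed"]),
    (["B", "I", "O"], ["Mr.", "Bob", "for"]),
    (["O", "B", "I"], ["for", "Mr.", "Brown"]),
]

init_c, trans_c, emit_c = Counter(), Counter(), Counter()
for states, words in data:
    init_c[states[0]] += 1
    for t, (s, w) in enumerate(zip(states, words)):
        emit_c[(s, w)] += 1
        if t > 0:
            trans_c[(states[t - 1], s)] += 1

def p_init(s):
    """ML estimate of P(s_1 = s)."""
    return init_c[s] / sum(init_c.values())

def p_trans(s, prev):
    """ML estimate of P(s_t = s | s_{t-1} = prev)."""
    total = sum(c for (r, _), c in trans_c.items() if r == prev)
    return trans_c[(prev, s)] / total

def p_emit(w, s):
    """ML estimate of P(o_t = w | s_t = s)."""
    total = sum(c for (r, _), c in emit_c.items() if r == s)
    return emit_c[(s, w)] / total
```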
Another view of Markov Models
Input – States: T; Observations: W
Assumptions: [equations shown on the original slide]
Prediction: predict the t ∈ T that maximizes [the score shown on the
original slide]
Another View of Markov Models
Input – States: T; Observations: W
As for NB: the features are pairs and singletons of t‘s and w’s.
Only 3 active features.
This can be extended to an argmax that maximizes the prediction of
the whole state sequence and computed, as before, via Viterbi.
HMM is a linear model (over pairs of states and state/observation
pairs).
Learning with Probabilistic Classifiers
Learning Theory
We showed that probabilistic predictions can be viewed as predictions
via Linear Statistical Queries models (Roth ’99).
The low expressivity explains generalization and robustness.
Is that all?
It does not explain why it is possible to (approximately) fit the data
with these models. Namely, is there a reason to believe that these
hypotheses minimize the empirical error on the sample?
In general, no. (Unless it corresponds to some probabilistic
assumptions that hold.)
Example: probabilistic classifiers
Input – States: T; Observations: W
Features are pairs and singletons of t‘s and w’s; additional features
are included.
If the hypothesis does not fit the training data, augment the set of
features (forget the assumptions).
Learning Protocol: Practice
LSQ hypotheses are computed directly:
• Choose features
• Compute coefficients
If the hypothesis does not fit the training data, augment the set of
features (the assumptions will not be satisfied).
But now you actually follow the learning-theory protocol:
• Try to learn a hypothesis that is consistent with the data
• Generalization will be a function of the low expressivity
Robustness of Probabilistic Predictors
Remaining question: while low expressivity explains generalization,
why is it relatively easy to fit the data?
Consider all distributions with the same marginals. (That is, a naïve
Bayes classifier will predict the same regardless of which of these
distributions really generated the data.)
(Garg & Roth, ECML’01):
Product distributions are “dense” in the space of all distributions.
Consequently, for most generating distributions the resulting
predictor’s error is close to that of the optimal classifier (that is,
the one given the correct distribution).
Summary: Probabilistic Modeling
Classifiers derived from probability density estimation models were
viewed as LSQ hypotheses.
Probabilistic assumptions:
+ guide feature selection, but
- do not allow the use of more general features.
A unified approach: many classifiers, probabilistic and others, can be
viewed as linear classifiers over an appropriate feature space.
What’s Next?
(1) If probabilistic hypotheses are actually like other linear
functions, can we interpret the outcome of other linear learning
algorithms probabilistically?
Yes.
(2) If probabilistic hypotheses are actually like other linear
functions, can we actually train them similarly (that is,
discriminatively)?
Yes.
• Classification: logistic regression / max entropy
• HMM: can be learned as a linear model, e.g., with a version of
  Perceptron (Structured Models; Spring 2013)
Recall: Naïve Bayes, Two Classes
• In the case of two classes we have:
• but since
• We get (plug (2) into (1); some algebra):
• which is simply the logistic (sigmoid) function used in the
  neural network representation.
Conditional Probabilities
Data: two classes (an Open/NotOpen classifier).
The plot shows a (normalized) histogram of examples as a function of
the dot product act = w^T x + b, and a couple of other functions of it.
In particular, we plot the positive sigmoid:
  P(y = +1 | x, w) = [1 + exp(-(w^T x + b))]^(-1)
Is this really a probability distribution?
Plotting: for an example z, y = Prob(label = 1 | f(z) = x).
(Histogram: for 0.8, the number of examples with f(z) < 0.8.)

Claim: Yes. If Prob(label = 1 | f(z) = x) = x, then f(z) is a
probability distribution. That is, yes, if the graph is linear.

Theorem: Let X be a random variable with distribution F.
(1) F(X) is uniformly distributed in (0,1).
(2) If U is uniform(0,1), then F^(-1)(U) is distributed as F, where
F^(-1)(x) is the value y such that F(y) = x.

Alternatively: f(z) is a probability if
  Prob_U { z : Prob[f(z) = 1] ≤ y } = y

Conditional Probabilities
Plotted for SNoW (Winnow). Similarly for Perceptron; more tuning
is required for SVMs.
Conditional Probabilities
(1) If probabilistic hypotheses are actually like other linear
functions, can we interpret the outcome of other linear learning
algorithms probabilistically?
Yes.
General recipe:
• Train a classifier f using your favorite algorithm (Perceptron,
  SVM, Winnow, etc.). Then:
• Use the sigmoid 1/(1 + exp{-(A·w^T x + B)}) to get an estimate
  for P(y | x).
• A and B can be tuned using a held-out set that was not used for
  training.
• Done in LBJ, for example.
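The recipe above amounts to a Platt-style calibration: fit A and B on held-out (score, label) pairs by minimizing the log loss. The slide does not specify the fitting procedure; plain gradient descent is one simple choice, sketched here on invented held-out data:

```python
import math

# Held-out (raw score, label) pairs; values are invented and deliberately
# include some overlap so the optimum is finite.
held_out = [(-2.0, -1), (-1.0, -1), (-0.5, -1), (0.2, 1),
            (0.8, 1), (1.5, 1), (-0.2, 1), (0.1, -1)]

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

A, B = 1.0, 0.0
lr = 0.1
for _ in range(2000):
    gA = gB = 0.0
    for score, y in held_out:
        # d/dt log(1 + exp(-y t)) = -y * sigmoid(-y t), with t = A*score + B
        g = -y * sigmoid(-y * (A * score + B))
        gA += g * score
        gB += g
    A -= lr * gA / len(held_out)
    B -= lr * gB / len(held_out)

def prob_positive(score):
    """Calibrated estimate of P(y = +1 | x) from the raw score."""
    return sigmoid(A * score + B)
```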
Logistic Regression
(2) If probabilistic hypotheses are actually like other linear
functions, can we actually train them similarly (that is,
discriminatively)? How?
The logistic regression model assumes:
  P(y = ±1 | x, w) = [1 + exp(-y(w^T x + b))]^(-1)
This is the same model we derived for naïve Bayes, only that now we
will not make any independence assumptions; we will directly find the
best w.
Training will therefore be more difficult. However, the weight vector
derived will be more expressive.
• It can be shown that the naïve Bayes algorithm cannot represent all
  linear threshold functions.
• On the other hand, NB converges to its best performance faster.
Logistic Regression (2)
Given the model:
  P(y = ±1 | x, w) = [1 + exp(-y(w^T x + b))]^(-1)
the goal is to find the (w, b) that maximizes the log likelihood of the
data {x_1, x_2, ..., x_m}. Equivalently, we look for the (w, b) that
minimizes the negative log-likelihood:
  min_{w,b} Σ_{i=1}^{m} -log P(y_i | x_i, w)
  = min_{w,b} Σ_{i=1}^{m} log[1 + exp(-y_i(w^T x_i + b))]
This optimization problem is called Logistic Regression.
Logistic regression is sometimes called the Maximum Entropy model in
the NLP community (since the resulting distribution is the one that has
the largest entropy among all those that activate the same features).
Logistic Regression (3)
Using the standard mapping to linear separators through the origin, we
would like to minimize:
  min_w Σ_{i=1}^{m} log[1 + exp(-y_i w^T x_i)]
To get good generalization, it is common to add a regularization term,
and the regularized logistic regression then becomes:
  min_w f(w) = ½ w^T w + C Σ_{i=1}^{m} log[1 + exp(-y_i w^T x_i)]
  (first term: regularization; second term: empirical loss)
where C is a user-selected parameter that balances the two terms.
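The regularized objective can be minimized by plain gradient descent; a minimal sketch on an invented 2-D dataset (step size and C chosen for illustration):

```python
import math

# f(w) = 1/2 w·w + C * sum_i log(1 + exp(-y_i w·x_i))
data = [([1.0, 2.0], 1), ([2.0, 1.0], 1), ([-1.0, -1.5], -1), ([-2.0, -0.5], -1)]
C, lr = 1.0, 0.1

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def objective(w):
    reg = 0.5 * sum(wj * wj for wj in w)
    loss = sum(math.log(1.0 + math.exp(-y * sum(wj * xj for wj, xj in zip(w, x))))
               for x, y in data)
    return reg + C * loss

w = [0.0, 0.0]
for _ in range(500):
    grad = list(w)                       # gradient of the regularizer is w
    for x, y in data:
        margin = y * sum(wj * xj for wj, xj in zip(w, x))
        g = -y * sigmoid(-margin)        # d/d(w·x) of log(1 + exp(-y w·x))
        for j, xj in enumerate(x):
            grad[j] += C * g * xj
    w = [wj - lr * gj for wj, gj in zip(w, grad)]
```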
Comments on Discriminative Learning
  min_w f(w) = ½ w^T w + C Σ_{i=1}^{m} log[1 + exp(-y_i w^T x_i)]
  (first term: regularization; second term: empirical loss)
where C is a user-selected parameter that balances the two terms.
Since the second term can be considered the loss function, regularized
logistic regression can be related to other learning methods, e.g.,
SVMs.
L1-SVM solves the following optimization problem:
  min_w f_1(w) = ½ w^T w + C Σ_{i=1}^{m} max(0, 1 - y_i w^T x_i)
L2-SVM solves the following optimization problem:
  min_w f_2(w) = ½ w^T w + C Σ_{i=1}^{m} (max(0, 1 - y_i w^T x_i))^2
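The three objectives share the regularizer and differ only in the per-example loss applied to the margin y_i·w^T x_i. A side-by-side sketch:

```python
import math

def logistic_loss(margin):
    """Logistic regression loss: log(1 + exp(-margin))."""
    return math.log(1.0 + math.exp(-margin))

def hinge_loss(margin):
    """L1-SVM loss: max(0, 1 - margin)."""
    return max(0.0, 1.0 - margin)

def squared_hinge_loss(margin):
    """L2-SVM loss: max(0, 1 - margin)^2."""
    return max(0.0, 1.0 - margin) ** 2

# All three are small for confidently correct points (large positive
# margin) and grow as the margin becomes negative.
losses_at_violation = (logistic_loss(-1.0), hinge_loss(-1.0),
                       squared_hinge_loss(-1.0))
```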
Optimization: How to Solve
All methods are iterative: they generate a sequence w_k that converges
to the optimal solution of the optimization problem above.
Many options within this category:
• Iterative scaling: low cost per iteration, slow convergence; updates
  one component of w at a time.
• Newton methods: high cost per iteration, faster convergence.
  Nonlinear conjugate gradient; quasi-Newton methods; truncated Newton
  methods; trust-region Newton methods.
  Currently, limited-memory BFGS is very popular.
• Stochastic gradient descent methods:
  The runtime does not depend on n = #(examples); an advantage when n
  is very large.
  The stopping criterion is a problem: the method tends to be too
  aggressive at the beginning and reaches a moderate accuracy quite
  fast, but its convergence becomes slow if we are interested in more
  accurate solutions.
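A minimal sketch of stochastic gradient descent for the (unregularized) logistic loss, updating on one randomly drawn example per step so that the cost per step does not depend on n; the data and learning rate are invented:

```python
import math
import random

random.seed(1)
data = [([1.0, 1.5], 1), ([2.0, 0.5], 1), ([-1.0, -1.0], -1), ([-0.5, -2.0], -1)]

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

w = [0.0, 0.0]
lr = 0.1
for step in range(1000):
    x, y = random.choice(data)               # one example per update
    margin = y * sum(wj * xj for wj, xj in zip(w, x))
    g = -y * sigmoid(-margin)                # gradient of log(1 + exp(-y w·x))
    w = [wj - lr * g * xj for wj, xj in zip(w, x)]

# After training, every example should be on the correct side.
correct = all(y * sum(wj * xj for wj, xj in zip(w, x)) > 0 for x, y in data)
```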
Summary
(1) If probabilistic hypotheses are actually like other linear
functions, can we interpret the outcome of other linear learning
algorithms probabilistically?
Yes.
(2) If probabilistic hypotheses are actually like other linear
functions, can we actually train them similarly (that is,
discriminatively)?
Yes.
• Classification: logistic regression / max entropy
• HMM: can be trained via Perceptron (Spring 2013)