Author: David Heckerman
Presented by: Yan Zhang (2006), Jeremy Gould (2013)
Outline
- Bayesian Approach
  - Bayesian vs. classical probability methods
  - Examples
- Bayesian Network
  - Structure
  - Inference
  - Learning Probabilities
  - Learning the Network Structure
  - Two coin toss – an example
- Conclusions
- Exam Questions
Bayesian vs. the Classical Approach
The Bayesian probability of an event x represents a person's degree of belief or confidence in that event's occurrence, based on prior knowledge and observed facts.
Classical probability refers to the true or actual probability of the event and is not concerned with an observer's beliefs.
Example – Is this Man a Martian Spy?
Example
We start with two concepts:
1. Hypothesis (H) – He either is or is not a Martian spy.
2. Data (D) – Some set of information about the subject: perhaps financial data, phone records, maybe we bugged his office…
Example
Frequentist says: Given a hypothesis (he IS a Martian), there is a probability P of seeing this data: P(D|H). (Considers absolute ground truth; the uncertainty/noise is in the data.)
Bayesian says: Given this data, there is a probability P of this hypothesis being true: P(H|D). (This probability indicates our level of belief in the hypothesis.)
Bayesian vs. the Classical Approach
The Bayesian approach restricts its prediction to the next (N+1) occurrence of an event, given the N previously observed events.
The classical approach predicts the likelihood of any given event regardless of the number of occurrences.
NOTE: The Bayesian approach can be updated as new data are observed.
Bayes Theorem

P(θ|D,ξ) = P(θ|ξ) P(D|θ,ξ) / P(D|ξ)

where, in the discrete realm:

P(D|ξ) = Σ_θ P(D|θ,ξ) P(θ|ξ)

and in the continuous realm:

P(D|ξ) = ∫ P(D|θ,ξ) P(θ|ξ) dθ

For the continuous case, imagine an infinite number of infinitesimally small partitions.
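As a minimal sketch (not from the slides), Bayes' theorem in the discrete realm can be computed directly: multiply the prior by the likelihood for each candidate θ, then normalize by P(D|ξ). The grid of θ values and the data (7 heads, 3 tails) are assumptions for illustration.

```python
# Discrete Bayes update over a grid of candidate theta values.
thetas = [0.1, 0.3, 0.5, 0.7, 0.9]                 # candidate values of theta
prior = [0.2] * 5                                  # uniform prior P(theta | xi)
likelihood = [t**7 * (1 - t)**3 for t in thetas]   # P(D | theta): 7 heads, 3 tails

evidence = sum(p * l for p, l in zip(prior, likelihood))        # P(D | xi)
posterior = [p * l / evidence for p, l in zip(prior, likelihood)]

# The posterior concentrates on theta = 0.7, the grid value closest to 7/10 heads.
print(max(zip(posterior, thetas)))
```

The normalizing sum plays exactly the role of the discrete-realm P(D|ξ) above.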
Example – Coin Toss
I want to toss a coin n = 100 times. Let's denote the random variable X as the outcome of one flip:
p(X = heads) = θ
p(X = tails) = 1 − θ
Before doing this experiment we have some belief in our mind, given our background knowledge ξ: the prior probability P(θ|ξ). Let's assume that it has a Beta distribution (a common assumption).
(Figure: sample Beta distributions.)
Example – Coin Toss

P(θ|ξ) = Beta(θ|α,β) = Γ(α+β) / (Γ(α) Γ(β)) · θ^(α−1) (1−θ)^(β−1)

E[θ] = α / (α+β)        Var[θ] = αβ / ((α+β)² (α+β+1))

If we assume a 50–50 coin we can use α = β = 5, which gives:

E[θ] = α / (α+β) = 5/10 = 0.5

(Hopefully, what you were expecting!)
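A quick sanity check of those closed-form expressions, using the slides' α = β = 5:

```python
# Mean and variance of a Beta(alpha, beta) prior, from the closed forms above.
def beta_mean(a, b):
    return a / (a + b)

def beta_var(a, b):
    return a * b / ((a + b) ** 2 * (a + b + 1))

print(beta_mean(5, 5))   # 0.5 -- the "fair coin" prior belief
print(beta_var(5, 5))    # ~0.0227 -- the spread of that belief
```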
Example – Coin Toss
Now I can run my experiment. As I go, I can update my beliefs based on the observed heads (h) and tails (t) by applying Bayes' law to the Beta distribution:

P(θ|D,ξ) = P(θ|ξ) P(D|θ,ξ) / P(D|ξ) = P(θ|ξ) θ^h (1−θ)^t / P(D|ξ)
Example – Coin Toss
Since we're assuming a Beta distribution, this becomes:

P(θ|D,ξ) = Beta(θ|α+h, β+t) = Γ(α+β+h+t) / (Γ(α+h) Γ(β+t)) · θ^(α+h−1) (1−θ)^(β+t−1)

…our posterior probability. Supposing that we observed h = 65, t = 35, we would get:

E[θ|D,ξ] = (α+h) / (α+β+h+t) = (5+65) / (5+65+5+35) = 70/110 ≈ 0.64
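The conjugate Beta update is short enough to sketch in full: observing h heads and t tails turns a Beta(α, β) prior into a Beta(α+h, β+t) posterior, whose mean is the updated estimate of θ.

```python
# Posterior mean of theta after a conjugate Beta update.
def posterior_mean(alpha, beta, h, t):
    return (alpha + h) / (alpha + beta + h + t)

# Prior alpha = beta = 5, then 65 heads and 35 tails observed:
print(round(posterior_mean(5, 5, 65, 35), 2))  # 0.64
```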
Integration
To find the probability that X_{n+1} = heads, we could also integrate over all possible values of θ to find the average value of θ, which yields:

P(X_{n+1} = heads | D, ξ) = ∫ P(X_{n+1} = heads | θ, ξ) P(θ|D,ξ) dθ
                          = ∫ θ P(θ|D,ξ) dθ
                          = E[θ|D,ξ]

This might be necessary if we were working with a distribution with a less obvious expected value.
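A numeric sketch of that integral (the grid-integration approach is an assumption, not from the slides): integrating θ against the Beta(70, 40) posterior from the coin example recovers its mean, 70/110.

```python
# Midpoint-rule integration of  integral of theta * P(theta | D, xi) d theta.
from math import lgamma, exp

def beta_pdf(theta, a, b):
    # Beta density computed via log-gamma for numerical stability.
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm) * theta ** (a - 1) * (1 - theta) ** (b - 1)

n = 100_000
dx = 1.0 / n
mean = sum((i + 0.5) * dx * beta_pdf((i + 0.5) * dx, 70, 40) * dx
           for i in range(n))
print(round(mean, 3))  # ~0.636 = 70/110
```

For a distribution with no closed-form mean, this brute-force integral is exactly the fallback the slide describes.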
More than Two Outcomes
In the previous example, we used a Beta distribution to encode the states of the random variable. This was possible because there were only 2 states/outcomes of the variable X.
In general, if the observed variable X is discrete, having r possible states {x¹,…,x^r}, the likelihood function is given by:

P(X = x^k | θ, ξ) = θ_k,   k = 1, 2, …, r

where θ = {θ_1, …, θ_r} and Σ_k θ_k = 1.

In this general case we can use a Dirichlet distribution instead:

Prior:     P(θ|ξ) = Dir(θ | α_1, α_2, …, α_r) = Γ(α) / (∏_{k=1}^r Γ(α_k)) · ∏_{k=1}^r θ_k^(α_k−1),   where α = Σ_{k=1}^r α_k

Posterior: P(θ|D,ξ) = Dir(θ | α_1+n_1, α_2+n_2, …, α_r+n_r)

where n_k is the number of times state x^k was observed in D.
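The Dirichlet update mirrors the Beta case: add each state's observed count to its prior pseudo-count. The prior and count values below are assumptions for illustration.

```python
# Conjugate Dirichlet update for a discrete variable with r states.
def dirichlet_update(alphas, counts):
    return [a + n for a, n in zip(alphas, counts)]

def dirichlet_mean(alphas):
    total = sum(alphas)
    return [a / total for a in alphas]

prior = [2, 2, 2]        # symmetric prior over a 3-state variable
counts = [10, 30, 60]    # observed occurrences of each state
posterior = dirichlet_update(prior, counts)
print(posterior)         # [12, 32, 62]
print(dirichlet_mean(posterior))
```

With r = 2 this reduces exactly to the Beta update from the coin-toss example.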
Vocabulary Review
Prior Probability, P(θ|ξ): probability of a particular value of θ given no observed data (our previous "belief").
Posterior Probability, P(θ|D,ξ): probability of a particular value of θ given that D has been observed (our final value of θ).
Observed Probability or "Likelihood", P(D|θ,ξ): likelihood of the sequence of coin tosses D being observed given that θ is a particular value.
P(D|ξ): raw probability of D.
Outline
- Bayesian Approach
  - Bayes Theorem
  - Bayesian vs. classical probability methods
  - Coin toss – an example
- Bayesian Network
  - Structure
  - Inference
  - Learning Probabilities
  - Learning the Network Structure
  - Two coin toss – an example
- Conclusions
- Exam Questions
OK, But So What?
That's great, but this is Data Mining, not Philosophy of Mathematics.
Why should we care about all of this ugly math?
Bayesian Advantages
It turns out that the Bayesian technique permits us to do some very useful things from a mining perspective!
1. We can use the Chain Rule with Bayesian probabilities:

P(X_1, …, X_n) = ∏_{i=1}^n P(X_i | X_1, …, X_{i−1})

Ex. P(A, B, C) = P(C|A,B) P(B|A) P(A)

This isn't something we can easily do with classical probability!
2. As we've already seen, the Bayesian model permits us to update our beliefs based on new data.
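The chain-rule factorization above can be sketched with a toy calculation; the conditional probability values are made up for illustration.

```python
# Chain rule: P(A, B, C) = P(C | A, B) * P(B | A) * P(A).
p_a = 0.3            # P(A = true)
p_b_given_a = 0.6    # P(B = true | A = true)
p_c_given_ab = 0.9   # P(C = true | A = true, B = true)

p_abc = p_c_given_ab * p_b_given_a * p_a
print(round(p_abc, 3))  # 0.162
```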
Example Network
To create a Bayesian network we will ultimately need 3 things:
1. A set of variables X = {X_1, …, X_n}
2. A network structure
3. A conditional probability table (CPT)
Note that when we start we may not have any of these things, or a given element may be incomplete!
Let's start with a simple case where we are given all three things: a credit fraud network designed to determine the probability of credit fraud.
Set of Variables
Each node represents a random variable. (Let's assume discrete for now.)
Network Structure
Each edge represents a conditional dependence between variables.
Conditional Probability Table
Each entry represents the quantification of a conditional dependency.
Since we've been given the network structure, we can easily see the conditional dependencies:

P(A|F) = P(A)
P(S|F,A) = P(S)
P(G|F,A,S) = P(G|F)
P(J|F,A,S,G) = P(J|F,A,S)
Note that the absence of an edge indicates conditional independence:

P(A|G) = P(A)
Important Note: The presence of a cycle will render one or more of the relationships intractable!
Inference
Now suppose we want to calculate (infer) our confidence level in a hypothesis on the fraud variable f, given some knowledge about the other variables. This can be directly calculated via:

P(f | a, s, g, j) = P(f, a, s, g, j) / P(a, s, g, j) = P(f, a, s, g, j) / Σ_{f′} P(f′, a, s, g, j)

(Kind of messy…)
Inference
Fortunately, we can use the Chain Rule to simplify!

P(f | a, s, g, j) = P(f) P(a) P(s) P(g|f) P(j|f,a,s) / Σ_{f′} P(f′) P(a) P(s) P(g|f′) P(j|f′,a,s)
                  = P(f) P(g|f) P(j|f,a,s) / Σ_{f′} P(f′) P(g|f′) P(j|f′,a,s)

This simplification is especially powerful when the network is sparse, which is frequently the case in real-world problems.
This shows how we can use a Bayesian network to infer a probability not stored directly in the model.
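A sketch of that inference by enumeration over the fraud variable. All CPT numbers are assumptions (the slides give none), and P(j|f,a,s) is collapsed to P(j|f) for brevity.

```python
# Infer P(fraud | gas purchased, jewelry purchased) by summing over fraud values.
p_f = {True: 0.01, False: 0.99}           # assumed P(fraud)
p_g_given_f = {True: 0.2, False: 0.01}    # assumed P(gas | fraud)
p_j_given_f = {True: 0.05, False: 0.001}  # assumed P(jewelry | fraud), a, s dropped

def joint(f):
    # Numerator terms of the simplified chain-rule expression.
    return p_f[f] * p_g_given_f[f] * p_j_given_f[f]

posterior_fraud = joint(True) / (joint(True) + joint(False))
print(round(posterior_fraud, 3))  # 0.91
```

Note how P(a) and P(s) never appear: just as in the simplified formula, factors that do not depend on f cancel out.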
Now for the Data Mining!
So far we haven't added much value to the data. So let's take advantage of the Bayesian model's ability to update our beliefs and learn from new data.
First we'll rewrite our joint probability distribution in a more compact form:

P(x | θ_s, S^h) = ∏_{i=1}^n P(x_i | pa_i, θ_i, S^h)

where S^h is the structure hypothesis, θ_s = (θ_1, θ_2, …, θ_n) collects the parameter vectors of the local distribution functions P(x_i | pa_i, θ_i, S^h) > 0, pa_i denotes the configuration of the parents of X_i, and everything remains conditioned on our background knowledge ξ.
Learning Probabilities in a Bayesian Network
First we need to make two assumptions:
1. There is no missing data (i.e., the data accurately describe the distribution).
2. The parameter vectors are independent (generally a good assumption, at least locally).
Learning Probabilities in a Bayesian Network
If these assumptions hold, we can express the probabilities as:

Prior:     P(θ_s | S^h) = ∏_{i=1}^n ∏_{j=1}^{q_i} P(θ_ij | S^h)
Posterior: P(θ_s | D, S^h) = ∏_{i=1}^n ∏_{j=1}^{q_i} P(θ_ij | D, S^h)

This means we can update each vector of parameters θ_ij independently, just as in the one-variable case!
• If each vector θ_ij has the prior distribution Dir(θ_ij | α_ij1, …, α_ijr_i)…
• …then the posterior distribution is:

P(θ_ij | D, S^h) = Dir(θ_ij | α_ij1 + n_ij1, …, α_ijr_i + n_ijr_i)

where n_ijk is the number of cases in D in which X_i = x_i^k and Pa_i = pa_i^j.
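Counting the n_ijk is a plain tally over the data. The dataset below is hypothetical: one variable X_i with two values, one parent set with two configurations.

```python
# Tally n_ijk: cases where X_i takes its k-th value while its parents take
# their j-th configuration, then apply the Dirichlet update for one j.
from collections import Counter

# Each case is (parent_configuration, value_of_X_i); made-up binary example.
cases = [("pa1", "x1"), ("pa1", "x2"), ("pa1", "x1"),
         ("pa2", "x2"), ("pa2", "x2")]

n = Counter(cases)  # n[(pa_j, x_k)] plays the role of n_ijk
print(n[("pa1", "x1")])  # 2

# Dirichlet update for parent configuration "pa1", prior pseudo-counts of 1:
alpha = {"x1": 1, "x2": 1}
posterior = {k: alpha[k] + n[("pa1", k)] for k in alpha}
print(posterior)  # {'x1': 3, 'x2': 2}
```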
Dealing with Unknowns
Whew! Now we know how to use our network to infer conditional relationships and how to update our network with new data. But what if we aren't given a well-defined network? We could start with a missing or incomplete:
1. Set of variables
2. Conditional relationship data
3. Network structure
Unknown Variable Set
Our goal when choosing variables is to "organize…into variables having mutually exclusive and collectively exhaustive states."
This is a problem shared by all data mining algorithms: what should we measure, and why? There is not, and probably cannot be, an algorithmic solution to this problem, as arriving at any solution requires intelligent and creative thought.
Unknown Conditional Relationships
This can be easy. So long as we can generate a plausible initial belief about a conditional relationship, we can simply start with our assumption and let our data refine our model via the mechanism shown in the Learning Probabilities in a Bayesian Network slide.
Unknown Conditional Relationships
However, when our ignorance becomes serious enough that we no longer even know what depends on what, we segue into the Unknown Structure scenario.
Learning the Network Structure
Sometimes the conditional relationships are not obvious. In this case we are uncertain about the network structure: we don't know where the edges should be.
Learning the Network Structure
Theoretically, we can use a Bayesian approach to get the posterior distribution of the network structure:

P(S^h | D) = P(D | S^h) P(S^h) / P(D),   where P(D) = Σ_i P(D | S_i^h) P(S_i^h)

Unfortunately, the number of possible network structures increases exponentially with n, the number of nodes. We're basically asking ourselves to consider every possible graph with n nodes!
Learning the Network Structure
Heckerman describes two main methods for shortening the search for a network model:
Model Selection: select a "good" model (i.e., network structure) from all possible models, and use it as if it were the correct model.
Selective Model Averaging: select a manageable number of good models from among all possible models and pretend that these models are exhaustive.
The math behind both techniques is quite involved, so I'm afraid we'll have to content ourselves with a toy example today.
Two Coin Toss Example
Experiment: flip two coins and observe the outcome.
Propose two network structures, S_1^h and S_2^h:

S_1^h: X_1 and X_2 are independent, with p(H) = p(T) = 0.5 for each coin.
S_2^h: X_2 depends on X_1, with p(H) = p(T) = 0.5 for X_1 and the conditional table P(X_2|X_1):

  P(X_2=H | X_1=H) = 0.1    P(X_2=T | X_1=H) = 0.9
  P(X_2=H | X_1=T) = 0.9    P(X_2=T | X_1=T) = 0.1

Assume P(S_1^h) = P(S_2^h) = 0.5.
After observing some data, which model is more accurate for this collection of data?
Two Coin Toss Example
Observed data (10 tosses of each coin):

   d    1   2   3   4   5   6   7   8   9   10
  X_1   T   T   H   H   T   H   T   T   H   H
  X_2   T   H   T   T   H   T   H   H   T   T

For each structure, the posterior is:

P(S_1^h | D) = P(D | S_1^h) P(S_1^h) / P(D)
P(S_2^h | D) = P(D | S_2^h) P(S_2^h) / P(D)

where P(D) = P(D | S_1^h) P(S_1^h) + P(D | S_2^h) P(S_2^h), and

P(D | S^h) = ∏_{d=1}^{10} P(X_1^d, X_2^d | S^h)
Two Coin Toss Example
For structure 1 (independent coins), using the same 10 observations:

P(D | S_1^h) = ∏_{d=1}^{10} P(X_1^d) P(X_2^d)
             = P(X_1=T) P(X_2=T) · P(X_1=T) P(X_2=H) · …
             = (0.5 · 0.5) · (0.5 · 0.5) · …
             = (0.5²)^10
Two Coin Toss Example
For structure 2 (X_2 depends on X_1), using the same 10 observations:

P(D | S_2^h) = ∏_{d=1}^{10} P(X_1^d) P(X_2^d | X_1^d)

  case 1: P(X_1=T) P(X_2=T | X_1=T) = 0.5 · 0.1
  case 2: P(X_1=T) P(X_2=H | X_1=T) = 0.5 · 0.9
  case 3: P(X_1=H) P(X_2=T | X_1=H) = 0.5 · 0.9
  …

P(D | S_2^h) = 0.5^10 · 0.1^1 · 0.9^9
Two Coin Toss Example

P(S_1^h | D) = P(D | S_1^h) P(S_1^h) / [P(D | S_1^h) P(S_1^h) + P(D | S_2^h) P(S_2^h)]
             = (0.5²)^10 · 0.5 / [(0.5²)^10 · 0.5 + 0.5^10 · 0.1^1 · 0.9^9 · 0.5]
             = 0.5^10 / (0.5^10 + 0.1^1 · 0.9^9)

P(S_1^h | D) ≈ 2.5%
Two Coin Toss Example

P(S_2^h | D) = P(D | S_2^h) P(S_2^h) / [P(D | S_1^h) P(S_1^h) + P(D | S_2^h) P(S_2^h)]
             = 0.5^10 · 0.1^1 · 0.9^9 · 0.5 / [(0.5²)^10 · 0.5 + 0.5^10 · 0.1^1 · 0.9^9 · 0.5]
             = 0.1^1 · 0.9^9 / (0.5^10 + 0.1^1 · 0.9^9)

P(S_2^h | D) ≈ 97.5%
Two Coin Toss Example

P(S_2^h | D) ≈ 97.5% > 2.5% ≈ P(S_1^h | D)

The data clearly favor the dependent structure S_2^h.
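The whole two-coin structure comparison fits in a short sketch: compute the data likelihood under each structure, then apply Bayes' theorem with the equal structure priors from the slides.

```python
# Structure comparison for the two-coin example, using the slides' data.
data = [("T", "T"), ("T", "H"), ("H", "T"), ("H", "T"), ("T", "H"),
        ("H", "T"), ("T", "H"), ("T", "H"), ("H", "T"), ("H", "T")]

# Structure 1: two independent fair coins.
lik_s1 = 0.5 ** (2 * len(data))

# Structure 2: fair X1, with P(X2 | X1) favoring mismatched outcomes.
p_x2_given_x1 = {("H", "H"): 0.1, ("H", "T"): 0.9,   # keyed (x1, x2)
                 ("T", "H"): 0.9, ("T", "T"): 0.1}
lik_s2 = 1.0
for x1, x2 in data:
    lik_s2 *= 0.5 * p_x2_given_x1[(x1, x2)]

# Equal structure priors of 0.5 cancel in the ratio.
post_s1 = lik_s1 / (lik_s1 + lik_s2)
print(round(post_s1, 3))      # ~0.025
print(round(1 - post_s1, 3))  # ~0.975
```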
Outline
- Bayesian Approach
  - Bayes Theorem
  - Bayesian vs. classical probability methods
  - Coin toss – an example
- Bayesian Network
  - Structure
  - Inference
  - Learning Probabilities
  - Learning the Network Structure
  - Two coin toss – an example
- Conclusions
- Exam Questions
Conclusions
- Bayesian method
- Bayesian network
  - Structure
  - Inference
  - Learning parameters and structure
- Advantages
Question 1: What is Bayesian Probability?
- A person's degree of belief in a certain event
- Your own degree of certainty that a tossed coin will land "heads"
- A degree of confidence in an outcome given some data
Question 2: Compare the Bayesian and classical approaches to probability (any one point).

Bayesian Approach (wants P(H|D)):
+ Reflects an expert's knowledge
+ The belief keeps updating as new data items arrive
− Arbitrary (more subjective)

Classical Probability (wants P(D|H)):
+ Objective and unbiased
− Generally not available; it takes a long time to measure the object's physical characteristics
Question 3: Mention at least 1 advantage of Bayesian analysis.
- Handles incomplete data sets
- Learns about causal relationships
- Combines domain knowledge and data
- Avoids overfitting
The End
Any Questions?