Bayesian Probabilities
• Recall that Bayes' theorem gives us a simple way to compute likelihoods for hypotheses and thus is useful for problems like diagnosis, recognition, interpretation, etc.
  – P(h | E) = P(E | h) * P(h) / P(E)
• As we discussed, there are several problems when applying Bayes
  – Bayes' theorem works only when we have independent events
  – We need too many probabilities to account for all of the combinations of evidence in E
  – Probabilities are derived from statistics, which might include some bias
• We can get around the first two problems if we assume independence (sometimes known as naïve Bayesian probabilities) or if we can construct a suitable Bayesian network
• We can also employ a form of learning so that the probabilities better suit the problem at hand
Co-dependent Events
• Recall our sidewalk-being-wet example
  – The two causes were that it rained or that we ran the sprinkler
  – While the two events do not seem to be dependent, in fact we might not run the sprinkler if it is supposed to rain, or we might run the sprinkler if we postponed running it because it might rain and it didn't
  – Therefore, we cannot treat rain and sprinkler as independent events, and we need to resolve this by using what is known as the probability chain rule
  – P(A & B & C) = P(A) * P(B | A) * P(C | A & B)
• the & means "each item in the list happens"
  – if we have 10 co-dependent events, say A1, …, A10, then the last probability in the list is P(A10 | A1 & A2 & … & A9)
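The chain rule above can be checked directly: if we tabulate a full joint distribution over three co-dependent events, the product of the chained conditionals reproduces the joint probability. A minimal sketch, with made-up numbers for the joint:

```python
from fractions import Fraction

# Hypothetical joint distribution P(A=a, B=b, C=c) over all eight outcomes.
joint = {
    (True, True, True): Fraction(12, 100),
    (True, True, False): Fraction(8, 100),
    (True, False, True): Fraction(10, 100),
    (True, False, False): Fraction(20, 100),
    (False, True, True): Fraction(5, 100),
    (False, True, False): Fraction(15, 100),
    (False, False, True): Fraction(10, 100),
    (False, False, False): Fraction(20, 100),
}

def marginal(fixed):
    """Sum the joint over all outcomes consistent with the fixed values."""
    return sum(p for outcome, p in joint.items()
               if all(outcome[i] == v for i, v in fixed.items()))

# Chain rule: P(A & B & C) = P(A) * P(B | A) * P(C | A & B)
p_a = marginal({0: True})
p_b_given_a = marginal({0: True, 1: True}) / p_a
p_c_given_ab = marginal({0: True, 1: True, 2: True}) / marginal({0: True, 1: True})
chained = p_a * p_b_given_a * p_c_given_ab

assert chained == joint[(True, True, True)]  # chain rule recovers the joint
```

The identity holds for any joint distribution; exact fractions are used so the equality is not blurred by floating-point rounding.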
More on the Chain Rule
• In order to apply the chain rule, I will need a large number of probabilities
  – Assume that I have events A, B, C, D, E
  – P(A & B & C & D & E) = P(A) * P(B | A) * P(C | A & B) * P(D | A & B & C) * P(E | A & B & C & D)
  – P(A & B & D & E) = P(A) * P(B | A) * P(D | A & B) * P(E | A & B & D)
  – P(A & C & D) = P(A) * P(C | A) * P(D | A & C)
• etc.
• So I will need P(event i | some combination of other events j) for all i and j
  – if I have 5 co-dependent events, I need 2^5 = 32 conditional probabilities (along with 5 prior probabilities)
Independence
• Because of the problem of co-dependence, we might want to see if two events are independent
• Two events are independent if the occurrence of one does not impact the probability of the other
  – Consider for instance if you roll a 6 on a die: the probability of rolling a 6 on the same or another die should not change from 1 in 6
  – Drawing a red card from a deck of cards and replacing it (putting it back into the deck) will not change the probability that the next card drawn is also red
  – However, after drawing the first red card, if you do not replace it, then the probability of drawing a second red card will change
• We say that two events, A and B, are independent if and only if P(A ∩ B) = P(A)P(B)
Continued
• If we have independent events, then it simplifies what we need in order to compute our probabilities
  – When reasoning about events A and B, if they are independent, we do not need the probabilities P(A), P(B) and P(A ∩ B), just P(A) and P(B)
• Similarly, if events A and B are independent, then
  – P(A | B) = P(A) because B being present or absent has no impact on A
• With the assumption of independence in place, our previous equation of the form:
  – P(H | e1 & e2 & e3 & …) = P(H) * P(e1 & e2 & e3 & … | H) = P(H) * P(e1 | H) * P(e2 | e1 & H) * P(e3 | e1 & e2 & H) * …
• can be rewritten as
  – P(H | e1 & e2 & e3 & …) = P(H) * P(e1 | H) * P(e2 | H) * P(e3 | H) * …
  – Thus, independence gets past the problem of needing an exponential number of probabilities
Naïve Bayesian Classifier
• Assume we have q data sets where each data set comprises some elements of the set {d1, d2, …, dn} and each set qi has been classified into one of m categories {c1, c2, …, cm}
  – Given a new case = {dj, dk, …}, we compute the likelihood that the new case is in each of the categories as follows:
• P(ci | case) = P(ci) * P(dj | ci) * P(dk | ci) * … for each i from 1 to m
  – Where P(ci) is simply the number of times ci occurs out of q
  – And where P(dj | ci) is the number of times datum dj occurred in a data set that was classified as category ci
• P(ci | case) is a Naïve Bayesian Classifier (NBC) for category i
  – This only works if the data making up any one case are independent
  – this assumption is not necessarily true, hence the word "naïve"
  – The good news? This is easy!
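The counting scheme above can be sketched in a few lines. This is a minimal illustration with made-up training cases (the element names d1…d3 and categories c1, c2 follow the notation on the slide, not any real data set):

```python
from collections import Counter

# Hypothetical training data: each case is a set of data elements
# plus the category it was classified into.
training = [
    ({"d1", "d2"}, "c1"),
    ({"d1", "d3"}, "c1"),
    ({"d2", "d3"}, "c2"),
    ({"d3"}, "c2"),
]

q = len(training)
cat_count = Counter(cat for _, cat in training)
# number of times datum d occurred in a case classified as cat
datum_count = Counter((d, cat) for data, cat in training for d in data)

def nbc_score(case, cat):
    """P(cat | case) ~ P(cat) * product of P(d | cat) over d in case."""
    score = cat_count[cat] / q
    for d in case:
        score *= datum_count[(d, cat)] / cat_count[cat]
    return score

best = max(cat_count, key=lambda c: nbc_score({"d1", "d2"}, c))
```

For the case {d1, d2}, c1 scores 2/4 * 2/2 * 1/2 = 0.25 while c2 scores 0 (d1 never occurred in a c2 case), so the classifier picks c1.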
Chain Rule vs. Naïve Bayes
• Let's consider an example
  – We want to determine the probability that a person will have spoken or typed a particular phrase, say "the man bit a dog"
  – We will compute the probability of this by examining several hundred or thousand training sentences
• Using Naïve Bayes:
  – P("the man bit a dog") = P("the") * P("man") * P("bit") * P("a") * P("dog")
• We compute this just by counting the number of times each word occurs in the training sentences
• Using the chain rule:
  – P("the man bit a dog") = P("the") * P("man" | "the") * P("bit" | "the man") * P("a" | "the man bit") * P("dog" | "the man bit a")
• P("bit" | "the man") is the number of times the word "bit" followed "the man" in all of our training sentences
• The probability computed by the chain rule is far smaller but also much more realistic, so Naïve Bayes should be used only with caution and the foreknowledge that we can make an independence assumption
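The two estimates can be contrasted on a toy corpus. A sketch, with invented sentences, where the chain-rule conditionals are approximated by bigram counts (an assumption: conditioning only on the previous word, since full histories would be far too sparse in so small a corpus):

```python
# Toy corpus (made-up sentences)
corpus = ["the man bit a dog", "the dog bit a man", "a man saw the dog"]
words = [w for s in corpus for w in s.split()]
n = len(words)

def unigram(w):
    # fraction of all word occurrences that are w
    return words.count(w) / n

phrase = "the man bit a dog".split()

# Naive Bayes style: product of independent word probabilities
naive = 1.0
for w in phrase:
    naive *= unigram(w)

def bigram(w, prev):
    # fraction of occurrences of prev that are immediately followed by w
    pairs = sum(1 for s in corpus
                for a, b in zip(s.split(), s.split()[1:])
                if a == prev and b == w)
    return pairs / words.count(prev)

# Chain-rule style, truncated to bigram conditioning
chain = unigram(phrase[0])
for prev, w in zip(phrase, phrase[1:]):
    chain *= bigram(w, prev)
```

With a real corpus the chain-rule conditionals would be estimated from longer histories; the snippet only shows the mechanical difference between multiplying unconditional counts and multiplying conditional ones.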
Spam Filters
• One of the most common uses of an NBC is to construct a spam filter
  – the spam filter works by learning a "bag of words", that is, the words that are typically associated with spam
  – we take all of the words of every email message and discard any common words (I, of, the, is, etc.)
• Now we "train" our spam filter by computing these probabilities:
  – P(spam) – the number of emails that were spam out of the training set
  – P(!spam) – the number of emails that were not spam out of the training set
  – P(word1 | spam) – the number of times word1 appeared in spam emails
  – P(word1 | !spam) – the number of times word1 appears in non-spam emails
  – And so forth for every non-common word
Using Our Spam Filter
• A new email comes in
• Discard all common words
• Compute P(spam | words) and P(!spam | words)
  – P(spam | word1 & word2 & … & wordn) = P(spam) * P(word1 | spam) * P(word2 | spam) * … * P(wordn | spam)
  – P(!spam | word1 & word2 & … & wordn) = P(!spam) * P(word1 | !spam) * P(word2 | !spam) * … * P(wordn | !spam)
• Which probability is higher? That gives you your answer
  – Without the naïve assumption, our computation becomes:
• P(spam | word1, word2, …, wordn) = P(spam) * P(word1, word2, …, wordn | spam) = P(spam) * P(word1 | spam) * P(word2 | word1 & spam) * … * P(wordn | word1 & … & wordn-1 & spam)
  – English has well over 100,000 words but many are common, so a spam filter may only deal with, say, 5,000 words – but that would require 2^5000 probabilities if we did not use Naïve Bayes!
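The classification step above can be sketched as follows. The email counts and word counts are invented for illustration; log probabilities are used because multiplying thousands of small per-word probabilities would underflow a float:

```python
import math

# Hypothetical learned counts from a training set of 100 emails (40 spam)
n_spam, n_ham = 40, 60
spam_word = {"free": 30, "winner": 20, "meeting": 2}   # word counts in spam
ham_word = {"free": 5, "winner": 1, "meeting": 30}     # word counts in non-spam

def log_score(words, n_class, word_counts):
    # log P(class) + sum of log P(word | class)
    s = math.log(n_class / (n_spam + n_ham))
    for w in words:
        s += math.log(word_counts[w] / n_class)
    return s

email = ["free", "winner"]  # the email's non-common words
is_spam = log_score(email, n_spam, spam_word) > log_score(email, n_ham, ham_word)
```

Comparing the two log scores is equivalent to comparing the two products on the slide, since log is monotonic. A real filter would also smooth the counts so that an unseen word does not contribute log(0).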
Classifier Example: Clustering
• Clustering is used in data mining to infer boundaries for classes
  – Here, we assume that we have already clustered a set of data into two classes
  – We want to use an NBC to determine whether a new datum, which lies in between the two clusters, is of one class or the other
• We have two categories, which we will call red and green
  – P(red) = # of red entries / # of total entries = 20 / 60
  – P(green) = # of green entries / # of total entries = 40 / 60
• We add a new datum (shown in white in the figure) and identify which class it is most likely a part of
  – P(x | green) = # of green entries nearby / # of green entries = 1 / 40
  – P(x | red) = # of red entries nearby / # of red entries = 3 / 20
• P(x is green | green entries) = P(green) * P(x | green) = 40 / 60 * 1 / 40 = 1 / 60
• P(x is red | red entries) = P(red) * P(x | red) = 20 / 60 * 3 / 20 = 1 / 20
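The arithmetic above, using the slide's counts, in exact fractions:

```python
from fractions import Fraction

# Counts from the slide: 20 red and 40 green entries out of 60;
# of the entries near the new datum x, 3 are red and 1 is green
total = 60
p_red, p_green = Fraction(20, total), Fraction(40, total)
p_x_given_red = Fraction(3, 20)
p_x_given_green = Fraction(1, 40)

score_red = p_red * p_x_given_red        # 20/60 * 3/20 = 1/20
score_green = p_green * p_x_given_green  # 40/60 * 1/40 = 1/60
# score_red > score_green, so x is classified as red
```

Even though green has twice the prior probability, the greater density of red entries near x outweighs it.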
Bayesian Network
• Thus far, our computations have been based on a single mapping from input to output
  – What if our reasoning process is more involved?
• for instance, we have symptoms that map to intermediate conclusions that map to disease hypotheses?
• or we have an explicit causal chain?
  – We can form a network of the values that make up the chain of reasoning; these values are our random variables, and each variable will be represented as a node
• the network will be a directed acyclic graph – nodes will point from cause to effect, and we hope that the resulting directed graph is acyclic (although, as we will see, this may not be the case)
• we use prior probabilities for the initial nodes and conditional probabilities to link those nodes to other nodes
Simple Example
• Below we have a simple Bayesian network arranged as a DAG
  – Our causes are construction and/or accident, and our effects are orange barrels, bad traffic and/or flashing lights
  – Each variable (node) will either be true or false
  – Given the input values of whether we see B, T or L, we want to compute the likelihood that the cause was either C or A
• notice that T can be caused by either C or A or both – while this makes our graph somewhat more complicated than a linear graph, it does not contain a cycle
Computing the Cause
  – We use the chain rule to compute the probability of a chain of states being true (C, A, B, T, L) (we address in a bit what we really want to compute, not this particular chain)
• p(C, A, B, T, L) = p(C) * p(A | C) * p(B | C, A) * p(T | B, C, A) * p(L | C, A, B, T), where p(C) is a prior probability and the others are conditional probabilities
  – with 5 items, we need 2^5 = 32 conditional probabilities
• We can reduce some of the above terms
  – B has nothing to do with A, so p(B | C, A) becomes p(B | C)
  – T has nothing to do with B, so p(T | B, C, A) becomes p(T | C, A)
  – L has nothing to do with C, B or T, so p(L | C, A, B, T) becomes p(L | A)
• We can reduce p(C, A, B, T, L) to p(C) * p(A) * p(B | C) * p(T | C, A) * p(L | A)
  – If we could assume independence, then even p(T | C, A) could be simplified; however, since two causes feed into the same effect, this may not be wise
  – If we did choose to implement this as p(T | C) * p(T | A), then we are assuming independence and we solve this with Naïve Bayes
A Naïve Bayes Example Network
• Assume we have one cause and m effects as shown in the figure below
  – Compute p(Y) and p(!Y)
• Number of times Y occurs in the data and the number of times Y does not occur, out of n data
  – Compute p(Xi | Y)
• Number of times Xi occurs when Y is true, for each i from 1 to m
  – Given some collection V, a subset of {X1, X2, …, Xm}, then p(Y | V) = p(Y) * the product of p(vi | Y) for each vi in V
• Let's examine an example
  – Y – student is a grad student
  – X1 – student taking CSC course
  – X2 – student works
  – X3 – student is serious (dedicated)
  – X4 – student calls prof by first name
Example Continued
• The University has 15,000 students of which 1,500 are graduate students
  – p(Y) = 1500 / 15000 = .10
• Out of 1500 graduate students, 60 are taking CSC courses
  – P(CSC | grad student) = 60 / 1500 = .04
  – P(!CSC | grad student) = 1440 / 1500 = .96
• Out of 1500 graduate students, 1250 work full time
  – P(work | grad student) = 1250 / 1500 = .83
  – P(!work | grad student) = 250 / 1500 = .17
• Out of 1500 graduate students, 1400 are serious about their studies
  – P(serious | grad student) = 1400 / 1500 = .93
  – P(!serious | grad student) = 100 / 1500 = .07
• Out of 1500 graduate students, 750 call their profs by their first names
  – P(first name | grad student) = 750 / 1500 = .5
  – P(!first name | grad student) = 750 / 1500 = .5
Example Continued
• NOTE: we will similarly have statistics for the non-graduate students (see below)
• A given student works, is in CSC, is serious, but does not call his prof by the first name
  – p(CSC | !grad student) = 250 / 13500 = .02
  – p(work fulltime | !grad student) = 2500 / 13500 = .19
  – p(serious | !grad student) = 5000 / 13500 = .37
  – p(!first name | !grad student) = 12000 / 13500 = .89
• What is the probability that the student is a graduate student?
  – p(grad student | works & CSC & serious & !first name) = p(grad student) * p(works | grad student) * p(CSC | grad student) * p(serious | grad student) * p(!first name | grad student) = .1 * .83 * .04 * .93 * .5 = .0015
  – p(!grad student | works & CSC & serious & !first name) = p(!grad student) * p(works | !grad student) * p(CSC | !grad student) * p(serious | !grad student) * p(!first name | !grad student) = .9 * .19 * .02 * .37 * .89 = .0011
• Therefore we conclude the student is a grad student
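The two competing scores can be computed directly from the slide's rounded probabilities:

```python
# Prior and conditional probabilities from the slides (rounded to two decimals)
p = {"grad": 0.10, "not": 0.90}
given_grad = {"work": 0.83, "CSC": 0.04, "serious": 0.93, "!first": 0.50}
given_not = {"work": 0.19, "CSC": 0.02, "serious": 0.37, "!first": 0.89}

evidence = ["work", "CSC", "serious", "!first"]

score_grad = p["grad"]
score_not = p["not"]
for e in evidence:
    score_grad *= given_grad[e]  # naive assumption: multiply per-feature terms
    score_not *= given_not[e]

# score_grad ≈ .0015 > score_not ≈ .0011, so grad student wins
```

Note that neither score is a probability on its own; only their comparison (or their normalized ratio) matters, since the shared denominator P(evidence) was dropped.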
A Lengthier Example
• The network below shows the causes and effects of two independent situations
  – an earthquake (which can cause a radio announcement and/or your alarm to go off)
  – and a burglary (which can cause your alarm to go off)
  – if your alarm goes off, your neighbor may call you to find out why
• The joint probability of alarm, earthquake, radio, burglary = p(A, R, E, B) = p(A | R, E, B) * p(R | E, B) * p(E | B) * p(B)
• But because A & R and E & B are independent pairs of events, we can reduce p(A | R, E, B) to p(A | E, B) and p(R | E, B) to p(R | E)
• What about the neighbor call? Notice that it is dependent on alarm, which is dependent on earthquake and burglary
Conditional Independence
• In our previous example, we saw that the neighbor calling was not independent of burglary or earthquake, and therefore a joint probability p(C, A, R, E, B) will be far more complicated
  – However, in such a case, we can make the node (neighbor calling) conditionally independent of burglary and earthquake if we are given either alarm or !alarm
On the left, A and B are independent of each other unless we instantiate C, so that A is dependent on B given C is true (or false)
On the right, A and B are dependent unless we instantiate C, so A and B are independent given C is true (or false)
Example
• Here, age and gender are independent, and smoking and exposure to toxins are independent if we are given age
• Next, smoking is dependent on both age and gender, and cancer is dependent on both exposure and smoking, and there's nothing we can do about that
• But serum calcium and lung tumor are independent given cancer
• So, given age and cancer
  – p(A, G, E, S, C, L, SC) = p(A) * p(G) * p(E | A) * p(S | A, G) * p(C | E, S) * p(SC | C) * p(L | C)
Instantiating Nodes
• What does it mean in our previous example to say "given age" and "given cancer"?
  – For given cancer, that just means that we are assuming cancer = true
• We will in fact also instantiate cancer to false and compute the chain again to see which is more likely; this tells us the probability that the patient has cancer and the probability that the patient does not have cancer
• In this way, we can use the Bayesian network to compute, for a given node or set of nodes, which values are the most likely
  – Since age is not a boolean variable, we will have a series of probabilities for different categories of age
• E.g., p(age < 18), p(18 <= age < 30), p(30 <= age < 55), along with conditional probabilities for age, e.g., p(smoking | age < 18), p(smoking | 18 <= age < 30), …
What Happens With Dependencies?
• In our previous example, what if we do not know the age or whether the patient has cancer? If we cannot instantiate those nodes, we have cycles
• Let's do an easy example first
  – Recall our "grass is wet" example: we said that running the sprinkler was not independent of rain
• The Bayesian network that represents this domain is shown to the right, but it has a cycle, so there is a dependence, and if we do not know whether we ran the sprinkler or not, we cannot remove that dependence
  – What then should we do?
Solution
• We must find some way of removing the cycle from the previous graph
  – We will take one of the nodes that is causing the cycle and instantiate it to both true and false
• Let's compute p(rain | grass) by instantiating sprinkler to both true and false
  – p(rain | grass wet) = [p(r, g, s) + p(r, g, !s)] / [p(r, g, s) + p(r, g, !s) + p(!r, g, s) + p(!r, g, !s)] = (.2 * .01 * .99 + .2 * .99 * .8) / (.2 * .01 * .99 + .2 * .99 * .8 + .8 * .4 * .9 + .8 * .6 * 0) = .36
  – So there is a 36% chance that the grass is wet because it rained
• grass remains true throughout the denominator because we know that the grass was wet
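The marginalization above can be sketched in code. The conditional probabilities are read off the slide's arithmetic (an assumption about the network's tables: p(rain) = .2, p(sprinkler | rain) = .01, p(sprinkler | !rain) = .4, and the four p(grass wet | rain, sprinkler) values):

```python
# Probabilities inferred from the slide's computation
p_r = 0.2                      # p(rain)
p_s_r, p_s_nr = 0.01, 0.4      # p(sprinkler | rain), p(sprinkler | !rain)
p_g = {                        # p(grass wet | rain, sprinkler)
    (True, True): 0.99, (True, False): 0.8,
    (False, True): 0.9, (False, False): 0.0,
}

def joint(r, s):
    """p(rain=r, grass wet, sprinkler=s) via the chain rule."""
    pr = p_r if r else 1 - p_r
    p_s = p_s_r if r else p_s_nr
    ps = p_s if s else 1 - p_s
    return pr * ps * p_g[(r, s)]

# instantiate sprinkler to both values and sum it out
num = joint(True, True) + joint(True, False)
den = num + joint(False, True) + joint(False, False)
p_rain_given_grass = num / den  # ≈ 0.36
```

Summing over both values of the instantiated node is exactly the marginalization written out on the slide.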
More Complex Example
• Let's return to our cancer example and see how to resolve it
  – The cycle occurs because age points at two nodes and those two nodes point at cancer
  – In order to remove the cycle, we must collapse the Exposure to Toxins and Smoking nodes into one
  – We will also collapse the Age and Gender nodes into one, so that the combined node points at the single collapsed node below it
• How do we use a collapsed node? We must instantiate all possible combinations of the values in the node and use these to compute the probability for cancer
  – Exposure = t, Smoking = t
  – Exposure = t, Smoking = f
  – Exposure = f, Smoking = t
  – Exposure = f, Smoking = f
• This becomes far more computationally complex if our node has 10 variables in it
  – 2^10 combinations instead of 2^2
Junction Trees
• With more forethought in our design, we may be able to avoid the problem of collapsing a large number of nodes into a group node by creating what is known as a junction tree
• Any Bayesian network can be transformed by
  – adding links between the parent nodes of any given node
  – adding links to any cycle of length more than three so that cycles are all of length three or shorter (this helps complete the graph)
• Each cycle is a clique of no more than 3 nodes
  – each of which forms a junction, resulting in dependencies of no more than 3 nodes to restrict the number of probabilities needed
Propagation Algorithm
• To this point, we have assumed that our knowledge is static, but what if we are dealing with either incomplete knowledge or changing evidence?
  – Now we need to not only perform a chain of computations, we may have to feed back into the network based on new posterior knowledge
  – This requires a bi-directional propagation algorithm
• Judea Pearl came up with an approach that, somewhat like a neural network, involves forward and backward passes until probabilities at each node converge (do not change between iterations)
  – The idea is initially the same: introduce the prior probabilities and compute intermediate node conditional probabilities
  – But when we reach our conclusion (the last or bottom nodes), we can take our result and combine it with new evidence (posterior probabilities) and pass it backward through the network, recomputing each node's probability as before but backwards
Pearl's Algorithm
• The algorithm is complex and so is only introduced here with a brief example
  – Bel(B) = p(B | e+, e−) = α * π(B)^T ∘ λ(B)
• π(B) is computed by starting with the prior probability, and λ(B) is computed by starting with the posterior probability
• Think of the values of π(B) and λ(B) as being rows and columns of a matrix; T means transpose, ∘ is the inner product of the two, and α is a normalizing constant
What Pearl's algorithm offers is the ability to change our evidence over time and be able to update probabilities (beliefs) without having to start from scratch in our computations
Example
• We have a murder investigation under way with three suspects and the murder weapon, a knife, with fingerprints
  – X: identity of the last user of the weapon (and therefore the most likely person to be the murderer)
  – Y: the last holder of the weapon
  – Z: the identity of the person whose fingerprints were found on the weapon
• our three murder suspects are a, b, and c
• Our Bayesian network is merely X → Y → Z
  – Based on the fingerprint evidence, we have a prior probability of e+ =
• p(X = a) = .8, p(X = b) = .1, p(X = c) = .1
  – And the following conditional probabilities
• p(Y = q | X = q) = .8 and p(Y = q' | X = q) = .1 for q' ≠ q
  – that is, there is an 80% chance that the fingerprint indicates the murderer and a 10% chance that the fingerprint indicates one of the non-murderers
Belief Computations
• We start with our formula Bel(B) = p(B | e+, e−) = α * π(B)^T ∘ λ(B), where α = [p(e)]^-1
  – π(Y) is π(X) = (.8, .1, .1) multiplied by the matrix of conditionals p(Y | X):
        [ .8  .1  .1 ]
        [ .1  .8  .1 ]
        [ .1  .1  .8 ]
    giving π(Y) = (.66, .17, .17)
• A lab report provides us with λ(Y) = (.8, .6, .5), giving more support that either b or c is the murderer, so now we have to update given e− (a posterior probability)
• We can now compute
  – Bel(Y) = α * (.66, .17, .17) ∘ (.8, .6, .5) = (.74, .14, .12)
• At this point, we compute λ(X) by multiplying the same matrix by λ(Y) = (.8, .6, .5), giving λ(X) = (.75, .61, .54)
  – and now we update Bel(X) = α * (.75, .61, .54) ∘ (.8, .1, .1) ≈ (.84, .09, .08)
Continued
• Now suspect a produces a strong alibi, reducing the probability that he was the murderer from .8 to .28
  – so that π(X) = (.28, .36, .36)
• We repeat our belief computations, first by computing π(Y) = (.28, .36, .36) multiplied by the matrix of conditionals:
        [ .8  .1  .1 ]
        [ .1  .8  .1 ]
        [ .1  .1  .8 ]
    giving π(Y) = (.3, .35, .35)
  – Bel(X) = α * (.28, .36, .36) ∘ (.75, .61, .54) = (.34, .35, .31)
  – Bel(Y) = α * (.3, .35, .35) ∘ (.8, .6, .5) = (.38, .34, .28)
• At this point, the probabilities tell us that the most likely suspect in the murder is b, because p(X = b) > p(X = a) by a very slight margin
  – although our belief in Y says that the fingerprints are still more likely to be a's than b's, since p(Y = a) > p(Y = b)
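These π/λ updates can be sketched directly. A minimal version of the slide's computation after the alibi, using the same 3×3 matrix of conditionals p(Y | X):

```python
# Matrix of conditionals: M[i][j] = p(Y = j | X = i); .8 diagonal, .1 elsewhere
M = [[.8, .1, .1], [.1, .8, .1], [.1, .1, .8]]
pi_X = [.28, .36, .36]     # pi(X): prior on X after the alibi
lam_Y = [.8, .6, .5]       # lambda(Y): lab-report likelihoods on Y

# forward (causal) pass: pi(Y) = pi(X) * M
pi_Y = [sum(pi_X[i] * M[i][j] for i in range(3)) for j in range(3)]

# backward (evidential) pass: lambda(X) = M * lambda(Y)
lam_X = [sum(M[i][j] * lam_Y[j] for j in range(3)) for i in range(3)]

def belief(pi, lam):
    """Bel = alpha * pi (elementwise) lambda, normalized to sum to 1."""
    raw = [p * l for p, l in zip(pi, lam)]
    alpha = 1 / sum(raw)
    return [alpha * x for x in raw]

bel_X = belief(pi_X, lam_X)   # belief over who the murderer is
bel_Y = belief(pi_Y, lam_Y)   # belief over whose fingerprints these are
```

Because π flows down the chain and λ flows up, new evidence at either end updates every node's belief without recomputing the whole network from scratch.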
Tree-based Propagation
• Our murder investigation had a single, linear causality, based on fingerprints
  – What if we had other pieces of evidence aside from fingerprints? Then it would be likely that our chain of causality would not be merely linear, but perhaps tree-shaped
  – We would have to enhance the propagation algorithm to include data fusion, with top-down and bottom-up propagations where nodes could be anticipatory, evidence, judgment, or a root node
• trees may not have single root nodes, but if they are still acyclic, they use a similar top-down, bottom-up propagation
Graph-based Propagation
• Unfortunately, it is unlikely that a real-world problem would result in an acyclic tree
  – Below is a more common form of causal network with multiple possible conclusions, where D stands for various diseases and O for various observations (symptoms)
  – When our graph has cycles, we must again resort to either collapsing nodes or instantiating nodes
• We would have to instantiate all combinations of 3 nodes below in order to remove all cycles, e.g., for D2, D3, D4 try TTT, TTF, TFT, TFF, FTT, FTF, FFT, FFF
Or, we might try to collapse nodes such as D2, D3 and D4 into a single node
Or we might add links between the disease nodes and observation nodes to create smaller cliques
Approximate Algorithms
• Bayesian networks are inherently cyclical
  – That is, most domains of interest have causes and effects that are co-dependent, leading to graphs with cycles
  – if we assume independence between nodes, we do not accurately represent the domain
• The result is the need to deal with these cycles by instantiating nodes to all possible combinations, which is of course intractable
  – Junction trees can reduce the amount of intractability but do not remove it (and there is no single strategy that will minimize a Bayes network in general)
  – Pearl's algorithm introduces even greater complexity when the graphs have cycles
• There are a number of approximation algorithms available, but each is applicable only to a particularly structured network
  – that is, there is no single approximation algorithm that will either reduce the complexity or provide an accurate result in all cases
Dynamic Bayesian Networks
• Cause-effect situations may also be temporal
  – at time i, an event arises and causes an event at time i+1
  – the Bayesian belief network is static; it captures a situation at a single point in time
  – we need a dynamic network instead
• The dynamic Bayesian network is similar to our previous networks except that each edge represents not merely a dependency but a temporal change
  – when you take the branch from state i to state i+1, you are not only indicating that state i can cause i+1 but that i was at a time prior to i+1
Each node represents a sound at a particular time interval
NOTE: the DBN is really a form of hidden Markov model, so we defer discussion of this until later
Bayesian Forms of Learning
• There are three forms of Bayesian learning
  – learning probabilities
  – learning structure
  – supervised learning of probabilities
• In the first form, we merely want to learn the probabilities needed for Bayesian reasoning
  – this can be done merely by counting occurrences
• take all the training data and compute every necessary probability
  – we might adopt the naïve stance that data are conditionally independent
• P(d | h) = P(a1, a2, a3, …, an | h) = P(a1 | h) * P(a2 | h) * … * P(an | h)
  – this assumption is used for Naïve Bayesian Classifiers, and this is how our spam filters can learn over time, by recounting the probabilities from time to time
Another Example of Naïve Bayesian Learning
• We want to learn, given some conditions, whether to play tennis or not
  – see the table on the next page
• The available data tell us, from previous occurrences, what the conditions were and whether we played tennis during those conditions
  – there are 14 previous days' worth of data
• To compute our prior probabilities, we just do
  – P(tennis) = days we played tennis / total days = 9 / 14
  – P(!tennis) = days we didn't play tennis / total days = 5 / 14
• The evidential probabilities are computed by adding up the number of Tennis = yes and Tennis = no days for that evidence, for instance
  – P(wind = strong | tennis) = 3 / 9 = .33 and P(wind = strong | !tennis) = 3 / 5 = .60
Continued
• We have a problem in computing our evidential probabilities
  – we do not have enough data to tell us if we played in some of the various combinations of conditions
  – did we play when it was overcast, mild, normal humidity and weak winds? No, so we have no probability for that
  – do we use 0% if we have no probability?
• We must rely on the Naïve Bayesian assumption of conditional independence to get around this problem
Day    Outlook   Temperature  Humidity  Wind    Play Tennis
Day1   Sunny     Hot          High      Weak    No
Day2   Sunny     Hot          High      Strong  No
Day3   Overcast  Hot          High      Weak    Yes
Day4   Rain      Mild         High      Weak    Yes
Day5   Rain      Cool         Normal    Weak    Yes
Day6   Rain      Cool         Normal    Strong  No
Day7   Overcast  Cool         Normal    Strong  Yes
Day8   Sunny     Mild         High      Weak    No
Day9   Sunny     Cool         Normal    Weak    Yes
Day10  Rain      Mild         Normal    Weak    Yes
Day11  Sunny     Mild         Normal    Strong  Yes
Day12  Overcast  Mild         High      Strong  Yes
Day13  Overcast  Hot          Normal    Weak    Yes
Day14  Rain      Mild         High      Strong  No
p(tennis & sunny & hot & weak) = p(tennis) * p(sunny | tennis) * p(hot | tennis) * p(weak | tennis) = 9/14 * 2/9 * 2/9 * 6/9 = 12/567
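That product can be verified against the table's counts in exact fractions:

```python
from fractions import Fraction

# Counts from the 14-day PlayTennis table
n_yes, n_days = 9, 14
# occurrences of each attribute value among the 9 "Yes" days
yes_counts = {"Sunny": 2, "Hot": 2, "Weak": 6}

score = Fraction(n_yes, n_days)
for attr in ["Sunny", "Hot", "Weak"]:
    score *= Fraction(yes_counts[attr], n_yes)

assert score == Fraction(12, 567)  # i.e., 4/189 in lowest terms
```

The same computation with the "No" counts would give the competing score, and the larger of the two decides the classification.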
Learning Structure
• For a Bayesian network, how do we know what states should exist in our structure? How do we know what links should exist between states?
• There are two forms of learning here
  – to learn the states that should exist
  – to learn which transitions should exist between states
• Learning states is less common, as we usually have a model in mind before we get started
  – we already know the causes we want to model, and the data tell us what effects we have probabilities for
  – what the data might not tell us is all of the possible links between the nodes, so we might have to learn the transitions
• Learning transitions is more common and more interesting
Learning Transitions
• One approach is to start with a fully connected graph
  – we learn the transition probabilities using the Baum-Welch algorithm and remove any links whose probabilities are 0 (or negligible)
• but this approach will be impractical
• we will discuss Baum-Welch with HMMs
• Another approach is to create a model using neighbor-merging
  – start with each observation of each test case representing its own node
  – as each new test case is introduced, merge nodes that have the same observation at time t
  – the network begins to collapse
• Another approach is to use V-merging
  – here, we not only collapse states that are the same, but also states that share the same transitions
  – for instance, if in case j, s(i-1) goes to s(i) goes to s(i+1), and we match that in case k, then we collapse that entire set of transitions into a single set of transitions
  – notice there is nothing probabilistic about learning the structure
Example
• Given a collection of research articles, learn the structure of a paper's header
  – that is, the fields that go into a paper
• Data came in three forms: labeled (by a human), unlabeled, and distantly labeled (data came from bibtex entries, which contain all of the relevant data but had extra fields that were to be discarded), from approximately 5700 papers
  – the transition probabilities were learned by simple counting
Applications for Bayes
• As we saw, spam filters are commonly implemented using naïve Bayes classifiers, but what other applications use Bayes?
  – Inference/diagnosis – as we saw in many of the examples presented here, we might use Bayes to identify the most likely cause of various effects that we see; the cause might be a diagnostic conclusion, a classification, or more generally, assigning credit (blame)
• As an example, Pathfinder is a medical diagnostic system for lymph node disease which has 60 diseases and over 130 features, where features are not always binary (many are real valued)
  – Prediction – given our Bayesian network, we might want to predict what the most likely result will be given starting conditions
Continued
  – Design – nodes represent the components that might go into the designed product, and the features that each component provides or goals that it fulfills
• Design testing can also be solved
• Similarly, decision making can easily be modeled, with the intent being to maximize the expected utility
  – Story understanding – words and concepts are stored in a network, a new story is introduced, and the most likely concept is recognized probabilistically
• Word sense disambiguation (what meaning a given word has) is supported as intermediate nodes in the network