# Bayesian Probabilities

Artificial Intelligence and Robotics

7 Nov 2013

Recall that Bayes' theorem gives us a simple way to compute likelihoods for hypotheses and thus is useful for problems like diagnosis, recognition, interpretation, etc.

P(h | E) = P(E | h) * P(h) / P(E), where the denominator is often dropped when we only want to compare hypotheses

As we discussed, there are several problems when applying Bayes' theorem

Bayes' theorem works only when we have independent events

We need too many probabilities to account for all of the combinations of evidence in E

Probabilities are derived from statistics, which might include some bias

We can get around the first two problems if we assume independence (sometimes known as Naïve Bayesian probabilities) or if we can construct a suitable Bayesian network

We can also employ a form of learning so that the probabilities better suit the problem at hand

Co-dependent Events

Recall our sidewalk-being-wet example

The two causes were that it rained or that we ran the sprinkler

While the two events do not seem to be dependent, in fact we might not run the sprinkler if it is supposed to rain, or we might run the sprinkler because we postponed running it when rain was expected and the rain never came

Therefore, we cannot treat rain and sprinkler as independent events, and we need to resolve this by using what is known as the probability chain rule

P(A & B & C) = P(A) * P(B | A) * P(C | A & B)

the & means “each item in the list happens”

if we have 10 co-dependent events, say A1, …, A10, then our last probability in the list is P(A10 | A1 & A2 & … & A9)

More on the Chain Rule

In order to apply the chain rule, I will need a large
number of probabilities

Assume that I have events A, B, C, D, E

P(A & B & C & D & E) = P(A) * P(B | A) * P(C | A &
B) * P(D | A & B & C) * P(E | A & B & C & D)

P(A & B & D & E) = P(A) * P(B | A) * P(D | A & B) *
P(E | A & B & D)

P(A & C & D) = P(A) * P(C | A) * P(D | A & C)

etc

So I will need P(event i | some combination of other events j) for all i and j

if I have 5 co-dependent events, I need 2^5 = 32 conditional probabilities (along with 5 prior probabilities)
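The chain rule is an identity that holds for any joint distribution, independent or not. A small sketch (using a made-up random joint distribution over five binary events, purely for illustration) confirms that the telescoping product of conditionals recovers the joint probability:

```python
import itertools
import random

# Hypothetical joint distribution over five binary events A..E,
# built from random weights purely to illustrate the identity.
random.seed(1)
events = "ABCDE"
joint = {bits: random.random() for bits in itertools.product([0, 1], repeat=5)}
total = sum(joint.values())
joint = {bits: p / total for bits, p in joint.items()}

def marginal(assignment):
    """P(the events in `assignment` take the given values), summing out the rest."""
    return sum(p for bits, p in joint.items()
               if all(bits[events.index(e)] == v for e, v in assignment.items()))

# Chain rule: P(A,B,C,D,E) = P(A) * P(B|A) * P(C|A,B) * P(D|A,B,C) * P(E|A,B,C,D)
target = {e: 1 for e in events}          # all five events true
product = 1.0
seen = {}
for e in events:
    num = marginal({**seen, e: 1})       # P(e and everything conditioned on)
    den = marginal(seen) if seen else 1.0
    product *= num / den                 # P(e | events seen so far)
    seen[e] = 1

assert abs(product - marginal(target)) < 1e-12
```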

Independence

Because of the problem of co-dependence, we might want to see if two events are independent

For two events to be independent, the occurrence of one event must not impact the probability of the other event

Consider for instance if you roll a 6 on a die, the
probability of rolling a 6 on the same or another die
should not change from 1 in 6

Drawing a red card from a deck of cards and replacing it (putting it back into the deck) will not change the probability that the next card drawn is also red

However, after drawing the first red card, if you do not
replace it, then the probability of drawing a second red
card will change

We say that two events, A and B, are independent if and only if P(A ∩ B) = P(A)P(B)
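The card-drawing contrast can be checked directly. This sketch uses exact fractions to show the product test P(A ∩ B) = P(A)P(B) passing with replacement and failing without:

```python
from fractions import Fraction

# Two draws of a red card from a standard 52-card deck (26 red).
p_red = Fraction(26, 52)

# With replacement the draws are independent, so P(red & red) = P(red) * P(red)
with_replacement = p_red * p_red

# Without replacement, the second draw is conditioned on the first:
# P(red2 | red1) = 25/51, so P(red & red) = P(red1) * P(red2 | red1)
without_replacement = p_red * Fraction(25, 51)

# The independence test P(A & B) == P(A) * P(B) fails without replacement
assert with_replacement == Fraction(1, 4)
assert without_replacement == Fraction(25, 102)
assert without_replacement != p_red * p_red
```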

Continued

If we have independent events, then it simplifies what we
need in order to compute our probabilities

When reasoning about events A and B, if they are
independent, we do not need probabilities of P(A), P(B) and
P(A ∩ B), just P(A) and P(B)

Similarly, if events A and B are independent, then

P(A | B) = P(A) because B being present or absent will have
no impact on A

With the assumption that events are independent in place,
our previous equation of the form:

P(H | e1 & e2 & e3 & …) = P(H) * P(e1 & e2 & e3 & … | H)
= P(H) * P(e1 | H) * P(e2 | e1 & H) * P(e3 | e1 & e2 & H) * …

can be rewritten as

P(H | e1 & e2 & e3 & …) = P(H) * P(e1 | H) * P(e2 | H) * P(e3 | H) * …

Thus, independence gets past the problem of needing an
exponential number of probabilities

Naïve Bayesian Classifier

Assume we have q data sets where each data set comprises some elements of the set {d1, d2, …, dn} and each set qi has been classified into one of m categories {c1, c2, …, cm}

Given a new case = {dj, dk, …} we compute the likelihood that the new case is in each of the categories as follows:

P(ci | case) = P(ci) * P(dj | ci) * P(dk | ci) * … for each i from 1 to m

Where P(ci) is simply the number of times ci occurs out of q

And where P(dj | ci) is the number of times datum dj occurred in a data set that was classified as category ci

P(ci | case) is a Naïve Bayesian classifier (NBC) for category i

This only works if the data making up any one case are independent

this assumption is not necessarily true, thus the word “naïve”

The good news? This is easy!
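To show how easy it is, here is a minimal counting-based NBC sketch over a made-up training set (the symptom and category names are invented for illustration):

```python
from collections import Counter, defaultdict

# Hypothetical training cases: each is a set of data elements plus a category.
training = [
    ({"fever", "cough"}, "flu"),
    ({"fever", "rash"}, "measles"),
    ({"cough", "congestion"}, "cold"),
    ({"fever", "cough", "ache"}, "flu"),
]

q = len(training)
category_counts = Counter(label for _, label in training)
datum_counts = defaultdict(Counter)
for data, label in training:
    for d in data:
        datum_counts[label][d] += 1

def score(case, category):
    # P(ci | case) ~ P(ci) * product of P(dj | ci), each estimated by counting
    p = category_counts[category] / q
    for d in case:
        p *= datum_counts[category][d] / category_counts[category]
    return p

case = {"fever", "cough"}
best = max(category_counts, key=lambda c: score(case, c))
assert best == "flu"
```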

Chain Rule vs. Naïve Bayes

Let’s consider an example

We want to determine the probability that a person will have spoken
or typed a particular phrase, say “the man bit a dog”

We will compute the probability of this by examining several
hundred or thousand training sentences

Using Naïve Bayes:

P(“the man bit a dog”) = P(“the”) * P(“man”) * P(“bit”) * P(“a”) * P(“dog”)

We compute this just by counting the number of times each word occurs in
the training sentences

Using the chain rule:

P(“the man bit a dog”) = P(“the”) * P(“man” | “the”) * P(“bit” | “the man”) * P(“a” | “the man bit”) * P(“dog” | “the man bit a”)

P(“bit” | “the man”) is the number of times the word “bit” followed “the
man” in all of our training sentences

The probability computed by the chain rule is far smaller but also much more realistic, so using Naïve Bayes should be done only with caution and the foreknowledge that we can make an independence assumption
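The two computations can be contrasted on a toy corpus. This sketch (corpus and counts invented for illustration) estimates both the unigram product and the full-history chain-rule product by counting:

```python
# Toy corpus standing in for the "several hundred training sentences".
corpus = "the man bit a dog . the dog bit the man . a dog ran".split()

def p_unigram(word):
    return corpus.count(word) / len(corpus)

def p_given_history(word, history):
    # How often `word` immediately follows the exact sequence `history`.
    n = len(history)
    follows = sum(1 for i in range(len(corpus) - n)
                  if corpus[i:i + n] == history and corpus[i + n] == word)
    occurs = sum(1 for i in range(len(corpus) - n + 1) if corpus[i:i + n] == history)
    return follows / occurs if occurs else 0.0

phrase = "the man bit a dog".split()

# Naive Bayes: product of unigram probabilities
naive = 1.0
for w in phrase:
    naive *= p_unigram(w)

# Chain rule: each word conditioned on the full preceding history
chain = p_unigram(phrase[0])
for i in range(1, len(phrase)):
    chain *= p_given_history(phrase[i], phrase[:i])
```

On a real corpus the counts for long histories become very sparse, which is exactly why full chain-rule estimates are so hard to obtain.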

Spam Filters

One of the most common uses of an NBC is to construct a spam filter

the spam filter works by learning a “bag of words”

that is, the words that are typically associated with spam

we take all of the words of every email message and discard any common words (I, of, the, is, etc.)

Now we “train” our spam filter by computing these
probabilities:

P(spam): the number of emails that were spam out of the training set

P(!spam): the number of emails that were not spam out of the training set

P(word1 | spam): the number of times word1 appeared in spam emails

P(word1 | !spam): the number of times word1 appeared in non-spam emails

And so forth for every non-common word

Using Our Spam Filter

A new email comes in

Discard all common words

Compute P(spam | words) and P(!spam | words)

P(spam | word1 & word2 & … & wordn) = P(spam) * P(word1 | spam) * P(word2 | spam) * … * P(wordn | spam)

P(!spam | word1 & word2 & … & wordn) = P(!spam) * P(word1 | !spam) * P(word2 | !spam) * … * P(wordn | !spam)

Which probability is higher? That gives you your answer
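A minimal sketch of such a filter, with an invented training set. Note that it adds add-one smoothing, which the slides do not mention, so that a word unseen in one class does not zero out the whole product:

```python
# Toy spam filter over a hypothetical labeled training set.
train = [
    (["viagra", "winner", "money"], True),
    (["meeting", "report"], False),
    (["winner", "prize", "money"], True),
    (["lunch", "report", "meeting"], False),
]

n = len(train)
spam_msgs = [words for words, is_spam in train if is_spam]
ham_msgs = [words for words, is_spam in train if not is_spam]

def word_prob(word, msgs):
    # Fraction of messages in this class containing the word, with add-one
    # smoothing (an assumption beyond the slides) to avoid zero products.
    return (sum(word in m for m in msgs) + 1) / (len(msgs) + 2)

def classify(words):
    p_spam = len(spam_msgs) / n      # P(spam)
    p_ham = len(ham_msgs) / n        # P(!spam)
    for w in words:
        p_spam *= word_prob(w, spam_msgs)
        p_ham *= word_prob(w, ham_msgs)
    return p_spam > p_ham            # whichever probability is higher wins

assert classify(["winner", "money"]) is True
assert classify(["meeting", "lunch"]) is False
```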

Without the naïve assumption, our computation becomes:

P(spam | word1, word2, …, wordn) = P(spam) * P(word1, word2, …, wordn | spam) = P(spam) * P(word1 | spam) * P(word2, …, wordn | word1 & spam) = … = P(spam) * P(word1 | spam) * P(word2 | word1 & spam) * … * P(wordn | word1 & … & wordn-1 & spam)

English has well over 100,000 words but many are common, so a spam filter may only deal with, say, 5,000 words, but even that would require on the order of 2^5000 probabilities if we did not use naïve Bayes!

Classifier Example: Clustering

Clustering is used in data mining to infer boundaries for
classes

Here, we assume that we have already clustered a set of data into two
classes

We want to use NBC to determine if a new datum, which lies in
between the two clusters, is of one class or another

We have two categories, we will call them red and green

P(red) = # of red entries / # of total entries = 20 / 60

P(green) = # of green entries / # of total entries = 40 / 60

We add a new datum (shown in white in the figure) and
identify which class it is most likely a part of

P(x | green) = # of green entries nearby / # of green entries = 1 / 40

P(x | red) = # of red entries nearby / # of red entries = 3 / 20

P(x is green | green entries) = p(green) * P(x | green) = 40 / 60 * 1 / 40 = 1 / 60

P(x is red | red entries) = p(red) * P(x | red) = 20 / 60 * 3 / 20 = 1 / 20
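The arithmetic in this clustering example can be verified with exact fractions:

```python
from fractions import Fraction

# The example's counts: 20 red entries, 40 green entries out of 60 total,
# and of the entries near the new datum x, 3 are red and 1 is green.
p_red = Fraction(20, 60)
p_green = Fraction(40, 60)
p_x_given_green = Fraction(1, 40)
p_x_given_red = Fraction(3, 20)

score_green = p_green * p_x_given_green
score_red = p_red * p_x_given_red

assert score_green == Fraction(1, 60)
assert score_red == Fraction(1, 20)
assert score_red > score_green      # x is more likely to belong to the red class
```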

Bayesian Network

Thus far, our computations have been based on a
single mapping from input to output

What if our reasoning process is more involved?

for instance, we have symptoms that map to intermediate
conclusions that map to disease hypotheses?

or we have an explicit causal chain?

We can form a network of the values that make up the chain of reasoning; these values are our random variables, and each variable will be represented as a node

the network will be a directed acyclic graph

edges will point from cause to effect, and we hope that the resulting directed graph is acyclic (although as we will see this may not be the case)

we use prior probabilities for the initial nodes and conditional probabilities to link those nodes to other nodes

Simple Example

Below we have a simple Bayesian network
arranged as a DAG

Our causes are construction and/or accident and our
effects are orange barrels, bad traffic and/or flashing
lights

Each variable (node) will either be true or false

Given the input values of whether we see B, T or L, we
want to compute the likelihood that the cause was
either C or A

notice that T can be caused by either C or A or both

while this makes our graph somewhat more complicated than a linear graph, it does not contain a cycle

Computing the Cause

We use the chain rule to compute the probability of a chain of
states being true (C, A, B, T, L) (we address in a bit what we
really want to compute, not this particular chain)

p(C, A, B, T, L) = p(C) * p(A | C) * p(B | C, A) * p(T | B, C, A)
* p(L | C, A, B, T) where p(C) is a prior probability and the
others are conditional probabilities

with 5 items, we need 2^5 = 32 conditional probabilities

We can reduce some of the above terms

A has nothing to do with C, so p(A | C) becomes p(A)

B has nothing to do with A, so p(B | C, A) becomes p(B | C)

T has nothing to do with B, so p(T | B, C, A) becomes p(T | C, A)

L has nothing to do with B, C or T, so p(L | C, A, B, T) becomes p(L | A)

We can reduce p(C, A, B, T, L) to p(C) * p(A) * p(B | C) * p(T | C, A) * p(L | A)

If we could assume independence, then even p(T | C, A) could be
simplified, however since two causes feed into the same effect, this
may not be wise

If we did choose to implement this as p(T | C) * p(T | A), then we are assuming independence and we solve this with Naïve Bayes

A Naïve Bayes Example Network

Assume we have one cause and m effects as shown in the figure below

Compute p(Y) and p(!Y)

Number of times Y occurs in the data and the number of times Y does not occur, out of n data

Compute p(Xi | Y)

Number of times Xi occurs when Y is true, for each i from 1 to m

Given some collection V, a subset of {X1, X2, …, Xm}, then p(Y | V) = p(Y) * the product of p(vi | Y) over each vi in V

Let's examine an example

Y: student is a grad student

X1: student taking a CSC course

X2: student works

X3: student is serious (dedicated)

X4: student calls prof by first name

Example Continued

The University has 15,000 students of which 1,500 are
graduate students

p(Y) = 1500/15000 = .10

Out of 1500 graduate students, 60 are taking CSC courses

P(CSC | grad student) = 60 / 1500 = .04

P(!CSC | grad student) = 1440 / 1500 = .96

Out of 1500 graduate students, 1250 work full time

P(work | grad student) = 1250 / 1500 = .83

P(!work | grad student) = 250 / 1500 = .17

Out of 1500 graduate students, 1400 are serious about their
studies

P(serious | grad student) = 1400 / 1500 = .93

P(!serious | grad student) = 100 / 1500 = .07

Out of 1500 graduate students, 750 call their profs by their
first names

P(first name | grad student) = 750 / 1500 = .5

P(!first name | grad student) = 750 / 1500 = .5

Example Continued

NOTE: we will similarly have statistics for the non-graduate students (see below)

A given student works, is in CSC, is serious, but does not call his prof by the first name

p(CSC | !grad student) = 250 / 13500 = .02

p(work fulltime | !grad student) = 2500 / 13500 = .19

p(serious | !grad student) = 5000 / 13500 = .37

p(!first name | !grad student) = 12000 / 13500 = .89

What is the probability that the student is a graduate student?

p(grad student | works & CSC & serious & !first name) = p(grad student) * p(works | grad student) * p(CSC | grad student) * p(serious | grad student) * p(!first name | grad student) = .1 * .83 * .04 * .93 * .5 = .0015

p(!grad student | works & CSC & serious & !first name) = p(!grad student) * p(works | !grad student) * p(CSC | !grad student) * p(serious | !grad student) * p(!first name | !grad student) = .9 * .19 * .02 * .37 * .89 = .0011

Therefore we can conclude the student is a grad student
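The two scores can be checked directly from the slide's probabilities:

```python
# Priors and conditional probabilities from the example.
p_grad = 0.10
p_not_grad = 0.90

# P(evidence | grad) and P(evidence | !grad) for: works full time, in CSC,
# serious, and does not call the prof by first name
given_grad = [0.83, 0.04, 0.93, 0.5]
given_not = [0.19, 0.02, 0.37, 0.89]

score_grad = p_grad
for p in given_grad:
    score_grad *= p
score_not = p_not_grad
for p in given_not:
    score_not *= p

assert round(score_grad, 4) == 0.0015
assert round(score_not, 4) == 0.0011
assert score_grad > score_not       # conclude: the student is a grad student
```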

A Lengthier Example

The network below shows the cause and effects of two
independent situations

an earthquake (which can cause a radio announcement and/or your
alarm to go off)

and a burglary (which can cause your alarm to go off)

if your alarm goes off, your neighbor may call you to find out why

The joint probability of alarm, earthquake, radio, burglary =
p(A, R, E, B) = p(A | R, E, B) * p(R | E, B) * p(E | B) * p(B)

But because A is independent of R given its causes, and E and B are independent, we can reduce p(A | R, E, B) to p(A | E, B), p(R | E, B) to p(R | E), and p(E | B) to p(E)

What about neighbor call? Notice that it is dependent on alarm, which is dependent on earthquake and burglary

Conditional Independence

In our previous example, we saw that neighbor
calling was not independent of burglary or
earthquake and therefore, a joint probability
p(C, A, R, E, B) will be far more complicated

However, in such a case, we can make the node
(neighbor calling) conditionally independent of
burglary and earthquake if we are given either
alarm or !alarm

On the left, A and B are independent of each other unless we instantiate C, so that A is dependent on B given C is true (or false)

On the right, A and B are dependent unless we instantiate C, so A and B are independent given C is true (or false)

Example

Here, age and gender are independent
and smoking and exposure to toxins
are independent if we are given age

Next, smoking is dependent on both
age and gender and cancer is
dependent on both exposure and
smoking and there’s nothing we can
do about that

But, serum calcium and lung tumor
are independent given cancer

So, given age and cancer

p(A, G, E, S, C, L, SC) = p(A) * p(G)
* p(E | A) * p(S | A, G) * p(C | E, S) *
p(SC | C) * p(L | C)

Instantiating Nodes

What does it mean in our previous example to say “given age” and “given cancer”?

For given cancer, that just means that we are assuming cancer = true

We in fact will instantiate cancer to false and compute the chain
again to see which is more likely, this will tell us the probability
that the patient has cancer and the probability that the patient
does not have cancer

In this way, we can use the Bayesian network to compute for a
given node or set of nodes, which values are the most likely

Since age is not a boolean variable, we will have a series of probabilities for different categories of age

E.g., p(age < 18), p(18 <= age < 30), p(30 <= age < 55), along with conditional probabilities for age, e.g., p(smoking | age < 18), p(smoking | 18 <= age < 30), …

What Happens With Dependencies?

In our previous example, what if we do not know the age or if
the patient has cancer? If we cannot instantiate those nodes,
we have cycles

Let’s do an easy example first

Recall our “grass is wet” example, we said that running the sprinkler
was not independent of rain

The Bayesian network that represents this domain is shown to the right, but it has a cycle, so there is a dependence, and if we do not know whether we ran the sprinkler or not, we cannot remove that dependence

What then should we do?

Solution

We must find some way of removing the cycle
from the previous graph

We will take one of the nodes that is causing the cycle
and instantiate it to both true and false

Let’s compute p(rain | grass) by instantiating
sprinkler to both true and false

p(rain | grass wet) = [p(r, g, s) + p(r, g, !s)] / [p(r, g, s) + p(r, g, !s) + p(!r, g, s) + p(!r, g, !s)] = (.2 * .01 * .99 + .2 * .99 * .8) / (.2 * .01 * .99 + .2 * .99 * .8 + .8 * .4 * .9 + .8 * .6 * 0) = .36

So there is a 36% chance that the grass is wet because it rained

grass remains true throughout the denominator because we know that the grass was wet
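The instantiate-and-sum computation, reproduced with the same numbers (the conditional probabilities are read off the slide's arithmetic: p(rain) = .2, p(sprinkler | rain) = .01, p(sprinkler | !rain) = .4, and the four p(wet | rain?, sprinkler?) values):

```python
p_rain = 0.2
p_sprinkler = {True: 0.01, False: 0.4}             # keyed by rain
p_wet = {(True, True): 0.99, (True, False): 0.8,   # keyed by (rain, sprinkler)
         (False, True): 0.9, (False, False): 0.0}

def joint(rain, sprinkler):
    """p(rain, grass wet, sprinkler) via the chain rule."""
    pr = p_rain if rain else 1 - p_rain
    ps = p_sprinkler[rain] if sprinkler else 1 - p_sprinkler[rain]
    return pr * ps * p_wet[(rain, sprinkler)]

# Instantiate sprinkler to both true and false, summing over both
numerator = joint(True, True) + joint(True, False)
denominator = numerator + joint(False, True) + joint(False, False)
p_rain_given_wet = numerator / denominator
assert round(p_rain_given_wet, 2) == 0.36
```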

More Complex Example

Let’s return to our cancer example and see how to resolve it

The cycle occurs because of age pointing at two nodes and those two
nodes pointing at cancer

In order to remove the cycle, we must collapse the Exposure to Toxins
and Smoking nodes into one

We will also collapse the Age and Gender nodes into one to simplify
that Age points to the two separate nodes (which are now collapsed
into one)

How do we use a collapsed node? We must instantiate all
possible combinations of the values in the node and use these
to compute the probability for cancer

Exposure = t, Smoking = t

Exposure = t, Smoking = f

Exposure = f, Smoking = t

Exposure = f, Smoking = f

This becomes far more computationally complex if our node has 10 variables in it

2^10 combinations instead of 2^2

Junction Trees

With more forethought in our design, we may be able to avoid such a problem of collapsing a large number of nodes into a group node by creating what is known as a junction tree

Any Bayesian network can be transformed by

adding links between the parent nodes of any given node

adding links to any cycle of length more than three so that cycles are all of length three or shorter (this helps complete the graph)

Each cycle is a clique of no more than 3 nodes

each of which forms a junction, resulting in dependencies of no more than 3 nodes to restrict the number of probabilities needed

Propagation Algorithm

To this point, we have assumed that our knowledge is
static but what if we are dealing with either incomplete
knowledge or changing evidence?

Now we need to not only perform a chain of computations,
we may have to feed back into the network based on new
posterior knowledge

This requires a bi-directional propagation algorithm

Judea Pearl came up with an approach that, somewhat
like a neural network, involves forward and backward
passes until probabilities at each node converge (do not
change between iterations)

The idea is, initially the same: introduce the prior
probabilities and compute intermediate node conditional
probabilities

But when we reach our conclusion (the last or bottom nodes), we can take our result and combine it with new evidence (posterior probabilities) and pass them backward through the network, recomputing each node’s probability as before but backwards

Pearl’s Algorithm

The algorithm is complex and so is only introduced here with a
brief example

Bel(B) = p(B | e⁺, e⁻) = α * π(B)ᵀ ∘ λ(B)

π(B) is computed by starting with the prior probability and λ(B) is computed by starting with the posterior probability

Think of the values of π(B) and λ(B) as being rows and columns of a matrix; T means transpose, ∘ is the inner (dot) product of the two, and α is a normalizing constant

What Pearl’s algorithm offers is the ability to change our evidence over time and be able to update probabilities (beliefs) without having to start from scratch in our computations

Example

We have a murder investigation under way with
three suspects and the murder weapon, a knife, with
finger prints

X: identity of the last user of the weapon (and therefore
the most likely person to be the murderer)

Y: the last holder of the weapon

Z: the identity of the person whose fingerprints were found on the weapon
our three murder suspects are a, b, and c

Our Bayesian network is simply X → Y → Z

Based on the fingerprint evidence, we have a prior probability of e⁺ =

p(X = a) = .8, p(X = b) = .1, p(X = c) = .1

And the following conditional probabilities

p(Y = q | X = q) = .8, p(Y = !q | X = q) = .1

that is, there is an 80% chance that the fingerprint indicates the murderer and a 10% chance that the fingerprint indicates one of the non-murderers

Belief Computations

We start with our formula Bel(B) = p(B | e⁺, e⁻) = α * π(B)ᵀ ∘ λ(B) where α = [p(e)]⁻¹

π(Y) = (.8, .1, .1) multiplied through the matrix of conditional probabilities p(Y | X), giving (.66, .17, .17)

A lab report provides us with λ(Y) = (.8, .6, .5), giving more support that either b or c is the murderer, so now we have to update given e⁻ (a posterior probability)
(a posterior probability)

We can now compute

Bel(Y) = α * (.8, .6, .5) ∘ (.66, .17, .17) = (.74, .14, .12)

At this point, we compute λ(X) = (.75, .61, .54)

and now we update Bel(X) = α * (.75, .61, .54) ∘ (.8, .1, .1) = (.84, .09, .08)

π(Y) = (.8, .1, .1) ×
[ .8  .1  .1 ]
[ .1  .8  .1 ]
[ .1  .1  .8 ]
= (.66, .17, .17)

λ(X) =
[ .8  .1  .1 ]   [ .8 ]   [ .75 ]
[ .1  .8  .1 ] × [ .6 ] = [ .61 ]
[ .1  .1  .8 ]   [ .5 ]   [ .54 ]
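The forward and backward passes can be reproduced numerically. Below, M is the 3×3 matrix of conditionals p(Y | X), with .8 on the diagonal and .1 elsewhere (note that normalizing π(X) ∘ λ(X) gives approximately (.84, .09, .08)):

```python
# Fingerprint prior over suspects a, b, c, the conditional matrix p(Y | X),
# and the lab report's likelihood vector.
pi_x = [0.8, 0.1, 0.1]
M = [[0.8, 0.1, 0.1],
     [0.1, 0.8, 0.1],
     [0.1, 0.1, 0.8]]
lam_y = [0.8, 0.6, 0.5]

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

# Forward pass: pi(Y) = pi(X) . M
pi_y = [sum(pi_x[i] * M[i][j] for i in range(3)) for j in range(3)]
assert [round(x, 2) for x in pi_y] == [0.66, 0.17, 0.17]

# Backward pass: lambda(X) = M . lambda(Y)
lam_x = [sum(M[i][j] * lam_y[j] for j in range(3)) for i in range(3)]
assert [round(x, 2) for x in lam_x] == [0.75, 0.61, 0.54]

# Beliefs: Bel(V) = alpha * pi(V) o lambda(V), with alpha normalizing
bel_y = normalize([p * l for p, l in zip(pi_y, lam_y)])
bel_x = normalize([p * l for p, l in zip(pi_x, lam_x)])
assert [round(x, 2) for x in bel_y] == [0.74, 0.14, 0.12]
assert [round(x, 2) for x in bel_x] == [0.84, 0.09, 0.08]
```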
Continued

Now suspect 1 produces a strong alibi reducing the
probability that he was the murderer from .8 to .28

so that π(X) = (.28, .36, .36)

We repeat our belief computations, first by computing π(Y) = (.3, .35, .35)

Bel(X) = α * (.28, .36, .36) ∘ (.75, .61, .54) = (.34, .35, .31)

Bel(Y) = α * (.3, .35, .35) ∘ (.8, .6, .5) = (.38, .34, .28)

At this point, probabilities tell us that the most likely
suspect in the murder is b because p(X = b) > p(X = a)
by a very slight margin

although our belief in Y says that the fingerprints are still more likely to be a’s than b’s since p(Y = a) > p(Y = b)

π(Y) = (.28, .36, .36) ×
[ .8  .1  .1 ]
[ .1  .8  .1 ]
[ .1  .1  .8 ]
= (.3, .35, .35)

Tree-based Propagation

Our murder investigation had a single, linear
causality, based on fingerprints

What if we had other pieces of evidence aside from
finger prints? Then it would be likely that our chain
of causality would not merely be linear, but perhaps
a tree shape

We would have to enhance the propagation algorithm to include data fusion, which includes top-down and bottom-up propagations where nodes could be anticipatory, evidence, judgment, or a root node

trees may not have single root nodes, but if they are still acyclic, they use a similar top-down, bottom-up propagation

Graph-based Propagation

Unfortunately, it is unlikely that a real-world problem would result in an acyclic tree

Below is a more common form of causal network with
multiple possible conclusions where D stands for
various diseases and O for various observations
(symptoms)

When our graph has cycles, we must again resort to
either collapsing nodes or instantiating nodes

We would have to instantiate all combinations of 3 nodes
below in order to remove all cycles, e.g., for D2, D3, D4 try
TTT, TTF, TFT, TFF, FTT, FTF, FFT, FFF

Or, we might try to collapse nodes such as D2, D3 and D4 into a single node

Or we might add links between the disease nodes and observation nodes to create smaller cliques

Approximate Algorithms

Bayesian networks are inherently cyclical

That is, most domains of interest have causes and effects that are co-dependent, leading to graphs with cycles

if we assume independence between nodes, we do not accurately represent the domain

The result is the need to deal with these cycles by
instantiating nodes to all possible combinations, which is
of course intractable

Junction trees can reduce the amount of intractability but do not remove it (and there is no single strategy that will minimize a Bayes network in general)

Pearl’s algorithm introduces even greater complexity when the graphs have cycles

There are a number of approximation algorithms available, but each is applicable only to a particularly structured network

that is, there is no single approximation algorithm that will either reduce the complexity or provide an accurate result in all cases

Dynamic Bayesian Networks

Cause-effect situations may also be temporal

at time i, an event arises and causes an event at time i+1

the Bayesian belief network is static; it captures a situation at a singular point in time

we need a dynamic network instead

The dynamic Bayesian network is similar to our previous networks except that each edge represents not merely a dependency, but a temporal change

when you take the branch from state i to state i+1, you are not only indicating that state i can cause i+1 but that i was at a time prior to i+1

Each node represents a sound at a particular time interval

NOTE: the DBN is really a form of hidden Markov model, so we defer discussion of this until later

Bayesian Forms of Learning

There are three forms of Bayesian learning

learning probabilities

learning structure

supervised learning of probabilities

In the first form, we merely want to learn the
probabilities needed for Bayesian reasoning

this can be done merely by counting occurrences

take all the training data and compute every necessary probability

we might adopt the naïve stance that data are
conditionally independent

P(d | h) = P(a1, a2, a3, …, an | h) = P(a1 | h) * P(a2 | h) * … *
P(an | h)

this assumption is used for Naïve Bayesian Classifiers and
this is how our spam filters can learn over time, by
recounting the probabilities from time to time

Another Example of Naïve Bayesian Learning

We want to learn, given some conditions, whether
to play tennis or not

see the table on the next page

The available data tell us from previous occurrences what the conditions were and whether we played tennis or not during those conditions

there are 14 previous days’ worth of data

To compute our prior probabilities, we just do

P(tennis) = days we played tennis / total days = 9 / 14

P(!tennis) = days we didn’t play tennis / total days = 5 / 14

The evidential probabilities are computed by adding
up the number of Tennis = yes and Tennis = no for
that evidence, for instance

P(wind = strong | tennis) = 3 / 9 = .33 and P(wind =
strong | !tennis) = 3 / 5 = .60

Continued

We have a problem in
computing our evidential
probabilities

we do not have enough data
to tell us if we played in
some of the various
combinations of conditions

did we play when it was
overcast, mild, normal
humidity and weak winds?
No, so we have no
probability for that

do we use 0% if we have no
probability?

We must rely on the Naïve
Bayesian assumption of
conditional independence to
get around this problem

| Day | Outlook | Temperature | Humidity | Wind | Play Tennis |
|-------|----------|------|--------|--------|-----|
| Day1 | Sunny | Hot | High | Weak | No |
| Day2 | Sunny | Hot | High | Strong | No |
| Day3 | Overcast | Hot | High | Weak | Yes |
| Day4 | Rain | Mild | High | Weak | Yes |
| Day5 | Rain | Cool | Normal | Weak | Yes |
| Day6 | Rain | Cool | Normal | Strong | No |
| Day7 | Overcast | Cool | Normal | Strong | Yes |
| Day8 | Sunny | Mild | High | Weak | No |
| Day9 | Sunny | Cool | Normal | Weak | Yes |
| Day10 | Rain | Mild | Normal | Weak | Yes |
| Day11 | Sunny | Mild | Normal | Strong | Yes |
| Day12 | Overcast | Mild | High | Strong | Yes |
| Day13 | Overcast | Hot | Normal | Weak | Yes |
| Day14 | Rain | Mild | High | Strong | No |

p(Yes | Sunny & Hot & Weak) = p(tennis) * p(sunny | tennis) * p(hot | tennis) * p(weak | tennis) = 9 / 14 * 2 / 9 * 2 / 9 * 6 / 9 = 12 / 567
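This final computation can be checked by counting over the 14 training days:

```python
from fractions import Fraction

# The 14 training days: (outlook, temperature, humidity, wind, play tennis)
days = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

yes_days = [d for d in days if d[4] == "Yes"]
p_tennis = Fraction(len(yes_days), len(days))
assert p_tennis == Fraction(9, 14)

def p_given_yes(index, value):
    # Evidential probability, estimated by counting over the Yes days
    return Fraction(sum(d[index] == value for d in yes_days), len(yes_days))

score = (p_tennis * p_given_yes(0, "Sunny")
         * p_given_yes(1, "Hot") * p_given_yes(3, "Weak"))
assert score == Fraction(12, 567)
```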

Learning Structure

For a Bayesian network, how do we know what
states should exist in our structure? How do we
know what links should exist between states?

There are two forms of learning here

to learn the states that should exist

to learn which transitions should exist between states

Learning states is less common as we usually have a model in mind before we get started

we already know the causes we want to model and the data tell us what effects we have probabilities for

what the data might not tell us is all of the possible links between the nodes, so we might have to learn the transitions

Learning transitions is more common and more
interesting

Learning Transitions

One approach is to start with a fully connected graph

we learn the transition probabilities using the Baum-Welch algorithm and remove any links whose probabilities are 0 (or negligible)

but this approach will be impractical

we will discuss Baum-Welch with HMMs

Another approach is to create a model using neighbor-merging

start with each observation of each test case representing its own node

as each new test case is introduced, merge nodes that have the same observation at time t

the network begins to collapse

Another approach is to use V-merging

here, we not only collapse states that are the same, but also states that share the same transitions

for instance, if in case j s(i-1) goes to s(i) goes to s(i+1) and we match that in case k, then we collapse that entire set of transitions into a single set of transitions

notice there is nothing probabilistic about learning the structure

Example

Given a collection of research articles, learn the structure of
a paper’s header

that is, the fields that go into a paper

Data came in three forms: labeled (by a human), unlabeled, and distantly labeled (from bibtex entries, which contain all of the relevant data but have extra fields that were to be discarded), across approximately 5700 papers

the transition probabilities were learned by simple counting

Applications for Bayes

As we saw, spam filters are commonly implemented using naïve Bayes classifiers, but what other applications use Bayes?

Inference/diagnosis: as we saw in many of the examples presented here, we might use Bayes to identify the most likely cause of various effects that we see; the cause might be a diagnostic conclusion, a classification, or more generally, assigning credit (blame)

As an example, Pathfinder is a medical diagnostic system for
lymph node disease which has 60 diseases and over 130 features
where features are not always binary (many are real valued)

Prediction: given our Bayesian network, we might want to predict what the most likely result will be given starting conditions

Continued

Design: nodes represent the components that might go into the designed product, and the features that each component provides or goals that it fulfills
Design testing can also be solved

Similarly, decision making can easily be modeled with
the intent being to maximize the expected utility

Story understanding: words and concepts are stored in a network, a new story is introduced, and the most likely concept is recognized probabilistically

Word sense disambiguation (what meaning a given word has) is supported as intermediate nodes in the network