Bayesian Probabilities

Artificial Intelligence and Robotics

7 Nov 2013

Bayesian Probabilities


Recall that Bayes' theorem gives us a simple way to compute posterior probabilities for hypotheses and thus is useful for problems like diagnosis, recognition, interpretation, etc.


P(h | E) = P(E | h) * P(h) / P(E)


As we discussed, there are several problems when applying Bayes


Bayes' theorem works only when we have independent events


We need too many probabilities to account for all of the combinations of evidence in E


Probabilities are derived from statistics, which might include some bias


We can get around the first two problems if we assume independence (sometimes known as Naïve Bayesian probabilities) or if we can construct a suitable Bayesian network


We can also employ a form of learning so that the probabilities better suit the problem at hand

Co-dependent Events


Recall from our sidewalk being wet example


The two causes were that it rained or that we ran the
sprinkler


While the two events do not seem to be dependent, in fact they might be: we might not run the sprinkler if it is supposed to rain, or we might run the sprinkler because we postponed running it expecting rain that never came


Therefore, we cannot treat rain and sprinkler as
independent events and we need to resolve this by using
what is known as the probability chain rule


P(A & B & C) = P(A) * P(B | A) * P(C | A & B)


the & means “each item in the list happens”


if we have 10 co-dependent events, say A1, …, A10, then our last probability in the list is P(A10 | A1 & A2 & … & A9)
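As a quick sketch, the chain rule is just a running product of conditional probabilities; the probability values below are invented for illustration, not taken from the slides:

```python
# Chain rule: P(A & B & C) = P(A) * P(B | A) * P(C | A & B)
# All probability values here are invented for illustration.
p_a = 0.30           # P(A)
p_b_given_a = 0.50   # P(B | A)
p_c_given_ab = 0.20  # P(C | A & B)

p_abc = p_a * p_b_given_a * p_c_given_ab
print(round(p_abc, 2))  # 0.03
```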

More on the Chain Rule


In order to apply the chain rule, I will need a large
number of probabilities


Assume that I have events A, B, C, D, E


P(A & B & C & D & E) = P(A) * P(B | A) * P(C | A &
B) * P(D | A & B & C) * P(E | A & B & C & D)


P(A & B & D & E) = P(A) * P(B | A) * P(D | A & B) *
P(E | A & B & D)


P(A & C & D) = P(A) * P(C | A) * P(D | A & C)


etc


So I will need P(event i | some combination of other events j) for all i and j


if I have 5 co-dependent events, I need 32 conditional probabilities (along with 5 prior probabilities)

Independence


Because of the problem of co-dependence, we might want to see if two events are independent


For two events to be independent, the occurrence of one must not change the probability of the other


Consider for instance if you roll a 6 on a die, the
probability of rolling a 6 on the same or another die
should not change from 1 in 6


Drawing a red card from a deck of cards and replacing it (putting it back into the deck) will not change the probability that the next card drawn is also red


However, after drawing the first red card, if you do not
replace it, then the probability of drawing a second red
card will change


We say that two events, A and B, are independent if and only if P(A ∩ B) = P(A)P(B)

Continued


If we have independent events, then it simplifies what we
need in order to compute our probabilities


When reasoning about events A and B, if they are
independent, we do not need probabilities of P(A), P(B) and
P(A ∩ B), just P(A) and P(B)


Similarly, if events A and B are independent, then


P(A | B) = P(A) because B being present or absent will have
no impact on A


With the assumption that events are independent in place,
our previous equation of the form:


P(H | e1 & e2 & e3 & …) = P(H) * P(e1 & e2 & e3 & … | H) = P(H) * P(e1 | H) * P(e2 | e1 & H) * P(e3 | e1 & e2 & H) * …


we can rewrite the statement as


P(H | e1 & e2 & e3 & …) = P(H) * P(e1 | H) * P(e2 | H) * P(e3 | H) * …


Thus, independence gets around the problem of needing an exponential number of probabilities


Naïve Bayesian Classifier


Assume we have q data sets where each data set comprises some elements of the set {d1, d2, …, dn} and each set qi has been classified into one of m categories {c1, c2, …, cm}

Given a new case = {dj, dk, …} we compute the likelihood that the new case is in each of the categories as follows:


P(ci | case) = P(ci) * P(dj | ci) * P(dk | ci) * … for each i from 1 to m


Where P(ci) is simply the number of times ci occurs out of q


And where P(dj | ci) is the number of times datum dj occurred in a data set that was classified as category ci


P(ci | case) is a Naïve Bayesian classifier (NBC) for category i


This only works if the data making up any one case are
independent


this assumption is not necessarily true, thus the
word “naïve”


The good news? This is easy!
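The counting scheme above really is easy to implement; a minimal sketch, in which the training cases and category names are invented for illustration:

```python
from collections import Counter

# Minimal Naive Bayesian classifier. Each training example is a set of
# data elements plus its category label (all invented for illustration).
training = [
    ({"d1", "d2"}, "c1"),
    ({"d1", "d3"}, "c1"),
    ({"d2", "d3"}, "c2"),
    ({"d3"},       "c2"),
]

q = len(training)
cat_counts = Counter(cat for _, cat in training)

def p_cat(c):
    # P(ci): fraction of the q training sets labeled ci
    return cat_counts[c] / q

def p_datum_given_cat(d, c):
    # P(dj | ci): fraction of ci-labeled sets in which dj occurred
    return sum(1 for data, cat in training if cat == c and d in data) / cat_counts[c]

def classify(case):
    scores = {}
    for c in cat_counts:
        score = p_cat(c)
        for d in case:
            score *= p_datum_given_cat(d, c)
        scores[c] = score
    return max(scores, key=scores.get)

print(classify({"d1", "d2"}))  # c1 on this toy data
```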

Chain Rule vs. Naïve Bayes


Let’s consider an example


We want to determine the probability that a person will have spoken
or typed a particular phrase, say “the man bit a dog”


We will compute the probability of this by examining several
hundred or thousand training sentences


Using Naïve Bayes:


P(“the man bit a dog”) = P(“the”) * P(“man”) * P(“bit”) * P(“a”) *
P(“dog”)


We compute this just by counting the number of times each word occurs in
the training sentences


Using the chain rule


P(“the man bit a dog”) = P(“the”) * P(“man” | “the”) * P(“bit” | “the man”) * P(“a” | “the man bit”) * P(“dog” | “the man bit a”)


P(“bit” | “the man”) is the number of times the word “bit” followed “the
man” in all of our training sentences


The probability computed by the chain rule is far smaller but also much more realistic, so Naïve Bayes should be used only with caution and with the foreknowledge that we can make an independence assumption
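A toy comparison of the two estimates. The corpus below is invented; the chain-rule side is approximated with bigrams (conditioning only on the previous word rather than the whole prefix), and a corpus this tiny will not show the realistic magnitudes described above:

```python
from collections import Counter

# Invented toy corpus
sentences = [
    "the dog bit the man",
    "the man bit a dog",
    "a dog ran",
]
words = " ".join(sentences).split()
unigrams = Counter(words)
total = len(words)

def p_naive(phrase):
    # Naive Bayes style: product of independent word frequencies
    p = 1.0
    for w in phrase.split():
        p *= unigrams[w] / total
    return p

# bigram counts for a first-order approximation of the chain rule
bigrams = Counter()
for s in sentences:
    toks = s.split()
    for a, b in zip(toks, toks[1:]):
        bigrams[(a, b)] += 1

def p_chain(phrase):
    toks = phrase.split()
    p = unigrams[toks[0]] / total
    for a, b in zip(toks, toks[1:]):
        p *= bigrams[(a, b)] / unigrams[a]
    return p

print(p_naive("the man bit a dog"))
print(p_chain("the man bit a dog"))
```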

Spam Filters


One of the most common uses of a NBC is to construct a
spam filter


the spam filter works by learning a “bag of words”


that is, the words that are typically associated with spam


we take all of the words of every email message and discard
any common words (I, of, the, is, etc)


Now we “train” our spam filter by computing these
probabilities:


P(spam)


the number of emails that were spam out of the
training set


P(!spam)


the number of emails that were not spam out of the
training set


P(word1 | spam)


the number of times word1 appeared in spam emails


P(word1 | !spam)


the number of times word1 appears in non-spam emails


And so forth for every non-common word

Using Our Spam Filter


A new email comes in


Discard all common words


Compute P(spam | words) and P(!spam | words)


P(spam | word1 & word2 & … & wordn) = P(spam) * P(word1 | spam) * P(word2 | spam) * … * P(wordn | spam)


P(!spam | word1 & word2 & … & wordn) = P(!spam) * P(word1 | !spam) * P(word2 | !spam) * … * P(wordn | !spam)


Which probability is higher? That gives you your answer


Without the naïve assumption, our computation becomes:


P(spam | word1, word2, …, wordn) = P(spam) * P(word1, word2, …, wordn | spam) = P(spam) * P(word1 | spam) * P(word2, …, wordn | word1 & spam) = P(spam) * P(word1 | spam) * P(word2 | word1 & spam) * P(word3, …, wordn | word1 & word2 & spam) = …


English has well over 100,000 words but many are common, so a spam filter may only deal with, say, 5,000 words, but that would require 2^5000 probabilities if we did not use Naïve Bayes!
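A sketch of the filter described above, on an invented four-email training set. The add-one smoothing is an addition beyond the slides, used so a single unseen word does not zero out a whole product:

```python
from collections import Counter

# Invented stop-word list and training set, for illustration only
COMMON = {"i", "of", "the", "is", "a", "to"}

training = [
    ("win free money now", True),
    ("free offer click now", True),
    ("meeting agenda for the week", False),
    ("lunch at noon", False),
]

def tokenize(text):
    # discard common words, as the slides describe
    return [w for w in text.lower().split() if w not in COMMON]

spam_words, ham_words = Counter(), Counter()
n_spam = sum(1 for _, s in training if s)
n_ham = len(training) - n_spam
for text, spam in training:
    (spam_words if spam else ham_words).update(tokenize(text))

def score(words, prior, counts, total):
    p = prior
    for w in words:
        # add-one smoothing (an assumption beyond the slides)
        p *= (counts[w] + 1) / (total + 1)
    return p

def is_spam(text):
    words = tokenize(text)
    p_spam = score(words, n_spam / len(training), spam_words, sum(spam_words.values()))
    p_ham = score(words, n_ham / len(training), ham_words, sum(ham_words.values()))
    return p_spam > p_ham   # whichever probability is higher wins

print(is_spam("free money offer"))  # True on this toy data
```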

Classifier Example: Clustering


Clustering is used in data mining to infer boundaries for
classes


Here, we assume that we have already clustered a set of data into two
classes


We want to use NBC to determine if a new datum, which lies in
between the two clusters, is of one class or another


We have two categories, we will call them red and green


P(red) = # of red entries / # of total entries = 20 / 60


P(green) = # of green entries / # of total entries = 40 / 60


We add a new datum (shown in white in the figure) and
identify which class it is most likely a part of


P(x | green) = # of green entries nearby / # of green entries = 1 / 40


P(x | red) = # of red entries nearby / # of red entries = 3 / 20


P(x is green | green entries) = p(green) * P(x | green) = 40 / 60 * 1 / 40 = 1 / 60


P(x is red | red entries) = p(red) * P(x | red) = 20 / 60 * 3 / 20 = 1 / 20
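The clustering arithmetic above can be checked directly:

```python
# Numbers from the clustering example
p_red, p_green = 20/60, 40/60
p_x_given_green = 1/40    # green entries near x / all green entries
p_x_given_red = 3/20      # red entries near x / all red entries

score_green = p_green * p_x_given_green   # = 1/60
score_red = p_red * p_x_given_red         # = 1/20
print(score_red > score_green)            # x is more likely red
```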

Bayesian Network


Thus far, our computations have been based on a
single mapping from input to output


What if our reasoning process is more involved?


for instance, we have symptoms that map to intermediate
conclusions that map to disease hypotheses?


or we have an explicit causal chain?


We can form a network of the values that make up the
chain of reasoning, these values are our random variables
and each variable will be represented as a node


the network will be a directed acyclic graph


edges will point from cause to effect, and we hope that the resulting directed graph is acyclic (although as we will see this may not be the case)


we use prior probabilities for the initial nodes and conditional
probabilities to link those nodes to other nodes

Simple Example


Below we have a simple Bayesian network
arranged as a DAG


Our causes are construction and/or accident and our
effects are orange barrels, bad traffic and/or flashing
lights


Each variable (node) will either be true or false


Given the input values of whether we see B, T or L, we
want to compute the likelihood that the cause was
either C or A


notice that T can be caused by either C or A or both


while this makes our graph somewhat more complicated than a linear graph, it does not contain a cycle

Computing the Cause


We use the chain rule to compute the probability of a chain of
states being true (C, A, B, T, L) (we address in a bit what we
really want to compute, not this particular chain)


p(C, A, B, T, L) = p(C) * p(A | C) * p(B | C, A) * p(T | B, C, A)
* p(L | C, A, B, T) where p(C) is a prior probability and the
others are conditional probabilities


with 5 items, we need 2^5 = 32 conditional probabilities


We can reduce some of the above terms


B has nothing to do with A so that p(B | C, A) becomes p(B | C)


T has nothing to do with B so p(T | B, C, A) becomes p(T | C, A)


L has nothing to do with C, B, or T, so p(L | C, A, B, T) becomes p(L | A)


We can reduce p(C, A, B, T, L) to p(C) * p(A) * p(B | C) * p(T
| C, A) * p(L | A)


If we could assume independence, then even p(T | C, A) could be
simplified, however since two causes feed into the same effect, this
may not be wise


If we did choose to implement this as p(T | C) * p(T | A) then we are assuming independence and we solve this with Naïve Bayes
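A sketch of evaluating the reduced factorization p(C) * p(A) * p(B | C) * p(T | C, A) * p(L | A); the probability numbers are invented, since the slides give only the structure of the construction/accident network:

```python
from itertools import product

# Hypothetical CPT values (invented for illustration)
p_C = 0.3                                        # p(construction)
p_A = 0.1                                        # p(accident)
p_B_given_C = {True: 0.9, False: 0.05}           # p(barrels | construction)
p_L_given_A = {True: 0.8, False: 0.1}            # p(lights | accident)
p_T_given_CA = {(True, True): 0.95, (True, False): 0.6,
                (False, True): 0.7, (False, False): 0.1}

def joint(c, a, b, t, l):
    """p(C,A,B,T,L) = p(C) * p(A) * p(B|C) * p(T|C,A) * p(L|A)"""
    pc = p_C if c else 1 - p_C
    pa = p_A if a else 1 - p_A
    pb = p_B_given_C[c] if b else 1 - p_B_given_C[c]
    pt = p_T_given_CA[(c, a)] if t else 1 - p_T_given_CA[(c, a)]
    pl = p_L_given_A[a] if l else 1 - p_L_given_A[a]
    return pc * pa * pb * pt * pl

# sanity check: the joint sums to 1 over all 2^5 assignments
total = sum(joint(*bits) for bits in product([True, False], repeat=5))
print(round(total, 10))  # 1.0
```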

A Naïve Bayes Example Network


Assume we have one cause and m effects as
shown in the figure below


Compute p(Y) and p(!Y)


Number of times Y occurs in the data and the number of
times Y does not occur out of n data


Compute p(Xi | Y)


Number of times Xi occurs when Y is true, for each i from 1 to m


Given some collection V, a subset of {X1, X2, …, Xm}, then p(Y | V) = p(Y) * p(v1 | Y) * p(v2 | Y) * … for each vi in V


Let’s examine an example


Y: student is a grad student


X1: student taking CSC course


X2: student works


X3: student is serious (dedicated)


X4: student calls prof by first name

Example Continued


The University has 15,000 students of which 1,500 are
graduate students


p(Y) = 1500/15000 = .10


Out of 1500 graduate students, 60 are taking CSC courses


P(CSC | grad student) = 60 / 1500 = .04


P(!CSC | grad student) = 1440 / 1500 = .96


Out of 1500 graduate students, 1250 work full time


P(work | grad student) = 1250 / 1500 = .83


P(!work | grad student) = 250 / 1500 = .17


Out of 1500 graduate students, 1400 are serious about their
studies


P(serious | grad student) = 1400 / 1500 = .93


P(!serious | grad student) = 100 / 1500 = .07


Out of 1500 graduate students, 750 call their profs by their
first names


P(first name | grad student) = 750 / 1500 = .5


P(!first name | grad student) = 750 / 1500 = .5

Example Continued


NOTE: we will similarly have statistics for the non-graduate students (see below)


A given student works, is in CSC, is serious, but does not call his prof by the first name


p(CSC | !grad student) = 250 / 13500 = .02


p(work fulltime | !grad student) = 2500 / 13500 = .19


p(serious | !grad student) = 5000 / 13500 = .37


p(!first name | !grad student) = 12000 / 13500 = .89


What is the probability that the student is a graduate student?


p(grad student | works & CSC & serious & !first name) = p(grad
student) * p(works | grad student) * p(CSC | grad student) *
p(serious | grad student) * p(!first name | grad student) = .1 * .04 *
.83 * .93 * .5 = .0015


p(!grad student | works & CSC & serious & !first name) = p(!grad
student) * p(works | !grad student) * p(CSC | !grad student) *
p(serious | !grad student) * p(!first name | !grad student) = .9 * .02 *
.19 * .37 * .89 = .0011


Therefore we can conclude the student is a grad student
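The two (unnormalized) products above can be verified directly, using the rounded probabilities from the slides:

```python
# works, CSC, serious, !first name, for each class
p_grad = 0.10 * 0.83 * 0.04 * 0.93 * 0.50
p_not  = 0.90 * 0.19 * 0.02 * 0.37 * 0.89

print(round(p_grad, 4))  # 0.0015
print(round(p_not, 4))   # 0.0011 -> grad student wins
```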

A Lengthier Example


The network below shows the cause and effects of two
independent situations


an earthquake (which can cause a radio announcement and/or your
alarm to go off)


and a burglary (which can cause your alarm to go off)


if your alarm goes off, your neighbor may call you to find out why


The joint probability of alarm, earthquake, radio, burglary =
p(A, R, E, B) = p(A | R, E, B) * p(R | E, B) * p(E | B) * p(B)


But because A & R and E & B are independent pairs of events, we can reduce p(A | R, E, B) to p(A | E, B) and p(R | E, B) to p(R | E)


What about neighbor call? Notice that it is dependent on alarm, which is dependent on earthquake and burglary


Conditional Independence


In our previous example, we saw that neighbor
calling was not independent of burglary or
earthquake and therefore, a joint probability
p(C, A, R, E, B) will be far more complicated


However, in such a case, we can make the node
(neighbor calling) conditionally independent of
burglary and earthquake if we are given either
alarm or !alarm

On the left, A and B are independent of each other unless we instantiate C, so that A is dependent on B given C is true (or false)


On the right, A and B are dependent unless we instantiate C, so A and B are independent given C is true (or false)

Example


Here, age and gender are independent
and smoking and exposure to toxins
are independent if we are given age


Next, smoking is dependent on both
age and gender and cancer is
dependent on both exposure and
smoking and there’s nothing we can
do about that


But, serum calcium and lung tumor
are independent given cancer


So, given age and cancer


p(A, G, E, S, C, L, SC) = p(A) * p(G)
* p(E | A) * p(S | A, G) * p(C | E, S) *
p(SC | C) * p(L | C)

Instantiating Nodes


What does it mean in our previous example that
“given age” and “given cancer”?


For given cancer, that just means that we are assuming
cancer = true


We in fact will instantiate cancer to false and compute the chain
again to see which is more likely, this will tell us the probability
that the patient has cancer and the probability that the patient
does not have cancer


In this way, we can use the Bayesian network to compute for a
given node or set of nodes, which values are the most likely


Since age is not a boolean variable, we will have a series of probabilities for different categories of age


E.g., p(age < 18), p(18 <= age < 30), p(30 <= age < 55), along
with conditional probabilities for age, e.g., p(smoking | age <
18), p(smoking | 18 <= age < 30), …

What Happens With Dependencies?


In our previous example, what if we do not know the age or if
the patient has cancer? If we cannot instantiate those nodes,
we have cycles


Let’s do an easy example first


Recall our “grass is wet” example, we said that running the sprinkler
was not independent of rain


The Bayesian network
that represents this
domain is shown to the
right, but it has a cycle,
so there is a
dependence, and if we
do not know if we ran
the sprinkler or not, we
cannot remove that
dependence


What then should
we do?

Solution


We must find some way of removing the cycle
from the previous graph



We will take one of the nodes that is causing the cycle
and instantiate it to both true and false


Let’s compute p(rain | grass) by instantiating
sprinkler to both true and false


p(rain | grass wet) = [p(r, g, s) + p(r, g, !s)] / [p(r, g, s)
+ p(r, g, !s) + p(!r, g, s) + p(!r, g, !s)] = (.2 * .01 * .99
+ .2 * .99 * .8) / (.2 * .01 * .99 + .2 * .99 * .8 + .8 * .4
* .9 + .8 * .6 * 0) = .36


So there is a 36% chance that the grass is wet because
it rained


in the denominator, grass remains true throughout because we know that the grass was wet
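The computation above, written out:

```python
# p(rain | grass wet): marginalize the sprinkler out of the numerator,
# and sum over rain and sprinkler in the denominator (grass stays true)
num = 0.2*0.01*0.99 + 0.2*0.99*0.8       # p(r,g,s) + p(r,g,!s)
den = num + 0.8*0.4*0.9 + 0.8*0.6*0.0    # + p(!r,g,s) + p(!r,g,!s)
p_rain_given_wet = num / den
print(round(p_rain_given_wet, 2))  # 0.36
```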

More Complex Example


Let’s return to our cancer example and see how to resolve it


The cycle occurs because of age pointing at two nodes and those two
nodes pointing at cancer


In order to remove the cycle, we must collapse the Exposure to Toxins
and Smoking nodes into one


We will also collapse the Age and Gender nodes into one to simplify
that Age points to the two separate nodes (which are now collapsed
into one)


How do we use a collapsed node? We must instantiate all
possible combinations of the values in the node and use these
to compute the probability for cancer


Exposure = t, Smoking = t


Exposure = t, Smoking = f


Exposure = f, Smoking = t


Exposure = f, Smoking = f


This becomes far more computationally complex if our node has 10 variables in it


2^10 combinations instead of 2^2

Junction Trees


With more forethought in our design, we may be
able to avoid such a problem of collapsing a large
number of nodes into a group node by creating
what is known as a
junction tree


Any Bayesian network can be transformed by


adding links between the parent nodes of any given
node


adding links to any cycle of length more than three so
that cycles are all of length three or shorter (this helps
complete the graph)


Each cycle is a clique of no more than 3 nodes


each of which forms a junction, resulting in dependencies of no more than 3 nodes to restrict the number of probabilities needed

Propagation Algorithm


To this point, we have assumed that our knowledge is
static but what if we are dealing with either incomplete
knowledge or changing evidence?


Now we need to not only perform a chain of computations,
we may have to feed back into the network based on new
posterior knowledge


This requires a bi-directional propagation algorithm


Judea Pearl came up with an approach that, somewhat
like a neural network, involves forward and backward
passes until probabilities at each node converge (do not
change between iterations)


The idea is, initially the same: introduce the prior
probabilities and compute intermediate node conditional
probabilities


But when we reach our conclusion (the last or bottom
nodes), we can take our result and combine it with new
evidence (posterior probabilities) and pass them backward
through the network, recomputing each node's probability as before, but backwards

Pearl’s Algorithm


The algorithm is complex and so is only introduced here with a
brief example


Bel(B) = p(B | e+, e-) = α * π(B)^T ∘ λ(B)


π(B) is computed by starting with the prior probability and λ(B) is computed by starting with the posterior probability


Think of the values of π(B) and λ(B) as being rows and columns of a matrix, T means transpose, and ∘ is the (dot) inner product of the two matrices; α is a normalizing constant

What Pearl's algorithm offers is the ability to change our evidence over time and be able to update probabilities (beliefs) without having to start from scratch in our computations

Example


We have a murder investigation under way with three suspects and the murder weapon, a knife, with fingerprints


X: identity of the last user of the weapon (and therefore
the most likely person to be the murderer)


Y: the last holder of the weapon


Z: the identity of the person whose fingerprints were found on the weapon


our three murder suspects are a, b, and c


Our Bayesian network is merely X → Y → Z


Based on the fingerprint evidence, we have a prior probability of e+ =


p(X = a) = .8, p(X = b) = .1, p(X = c) = .1


And the following conditional probabilities


p(Y = q | X = q) = .8, p(Y = !q | X = q) = .1


that is, there is an 80% chance that the fingerprint indicates the murderer and a 10% chance that the fingerprint indicates one of the non-murderers


Belief Computations


We start with our formula Bel(B) = p(B | e+, e-) = α * π(B)^T ∘ λ(B) where α = [p(e)]^-1


π(Y) = (.8, .1, .1) applied to the conditional matrix

    [.8  .1  .1]
    [.1  .8  .1]
    [.1  .1  .8]

giving π(Y) = (.66, .17, .17)




A lab report provides us with λ(Y) = (.8, .6, .5), giving more support that either b or c are the murderers, so now we have to update given e- (a posterior probability)

We can now compute


Bel(Y) = α * (.8, .6, .5) ∘ (.66, .17, .17) = (.74, .14, .12)


At this point, we compute λ(X) = (.75, .61, .54) (the conditional matrix applied to λ(Y))


and now we update Bel(X) = α * (.75, .61, .54) ∘ (.8, .1, .1) = (.84, .09, .08)





Continued


Now suspect 1 produces a strong alibi reducing the
probability that he was the murderer from .8 to .28


so that π(X) = (.28, .36, .36)


We repeat our belief computations, first by computing π(Y) = (.28, .36, .36) applied to the conditional matrix = (.3, .35, .35)


Bel(X) = α * (.28, .36, .36) ∘ (.75, .61, .54) = (.34, .35, .31)


Bel(Y) = α * (.3, .35, .35) ∘ (.8, .6, .5) = (.38, .34, .28)


At this point, probabilities tell us that the most likely
suspect in the murder is b because p(X = b) > p(X = a)
by a very slight margin


although our belief in Y says that the fingerprints are still more likely to be a's than b's, since p(Y = a) > p(Y = b)
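The belief computations in this example can be sketched as code: a minimal sketch of Pearl's π/λ message passing for this one-arc chain (not the general algorithm), where Bel is the normalized elementwise product of π and λ, with π pushed down through the conditional matrix and λ pulled back up through it:

```python
# M is the conditional matrix p(Y | X) built from
# p(Y = q | X = q) = .8 and p(Y = !q | X = q) = .1
M = [[0.8, 0.1, 0.1],
     [0.1, 0.8, 0.1],
     [0.1, 0.1, 0.8]]

def pi_down(m, v):
    # propagate pi down the arc: row vector times M
    return [sum(v[i] * m[i][j] for i in range(3)) for j in range(3)]

def lam_up(m, v):
    # propagate lambda back up the arc: M times column vector
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def bel(pi, lam):
    # Bel = alpha * pi o lambda, normalized to sum to 1
    raw = [p * l for p, l in zip(pi, lam)]
    alpha = 1 / sum(raw)
    return [round(alpha * r, 2) for r in raw]

pi_X = [0.8, 0.1, 0.1]     # fingerprint prior, e+
lam_Y = [0.8, 0.6, 0.5]    # lab report, e-

pi_Y = pi_down(M, pi_X)    # ~ (.66, .17, .17)
lam_X = lam_up(M, lam_Y)   # ~ (.75, .61, .54)

print(bel(pi_Y, lam_Y))    # Bel(Y) ~ [0.74, 0.14, 0.12]
print(bel(pi_X, lam_X))    # Bel(X)
```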

Tree-based Propagation


Our murder investigation had a single, linear
causality, based on fingerprints


What if we had other pieces of evidence aside from fingerprints? Then it would be likely that our chain of causality would not merely be linear, but perhaps a tree shape


We would have to enhance the propagation algorithm to include data fusion, which includes top-down and bottom-up propagations where nodes could be anticipatory, evidence, judgment, or a root node


trees may not have single root nodes, but if they are still acyclic, they use a similar top-down, bottom-up propagation

Graph-based Propagation


Unfortunately, it is unlikely that a real-world problem would result in an acyclic tree


Below is a more common form of causal network with
multiple possible conclusions where D stands for
various diseases and O for various observations
(symptoms)


When our graph has cycles, we must again resort to
either collapsing nodes or instantiating nodes


We would have to instantiate all combinations of 3 nodes
below in order to remove all cycles, e.g., for D2, D3, D4 try
TTT, TTF, TFT, TFF, FTT, FTF, FFT, FFF

Or, we might try to collapse

nodes such as D2, D3 and D4 into

a single node

Or we might add links between the

disease nodes and observation

nodes to create smaller cliques

Approximate Algorithms


Bayesian networks are inherently cyclical


That is, most domains of interest have causes and effects that are co-dependent, leading to graphs with cycles


if we assume independence between nodes, we do not accurately
represent the domain


The result is the need to deal with these cycles by
instantiating nodes to all possible combinations, which is
of course intractable


Junction trees can reduce the amount of intractability but do not remove it (and there is no single strategy that will minimize a Bayes network in general)


Pearl’s algorithm introduces even greater complexity when the
graphs have cycles


There are a number of approximation algorithms
available, but each is applicable only to a particular
structured network


that is, there is no single
approximation algorithm that will either reduce the
complexity or provide an accurate result in all cases

Dynamic Bayesian Networks


Cause-effect situations may also be temporal


at time i, an event arises and causes an event at time i+1


the Bayesian belief network is static: it captures a situation at a single point in time


we need a dynamic network instead


The dynamic Bayesian network is similar to our previous networks except that each edge represents not merely a dependency, but a temporal change


when you take the branch from state i to state i+1, you are not only indicating that state i can cause i+1 but that i was at a time prior to i+1

Each node represents a sound at a particular time interval


NOTE: the DBN is really a form of hidden Markov model, so we defer discussion of this until later

Bayesian Forms of Learning


There are three forms of Bayesian learning


learning probabilities


learning structure


supervised learning of probabilities


In the first form, we merely want to learn the
probabilities needed for Bayesian reasoning


this can be done merely by counting occurrences


take all the training data and compute every necessary probability


we might adopt the naïve stance that data are
conditionally independent


P(d | h) = P(a1, a2, a3, …, an | h) = P(a1 | h) * P(a2 | h) * … *
P(an | h)


this assumption is used for Naïve Bayesian Classifiers and
this is how our spam filters can learn over time, by
recounting the probabilities from time to time

Another Example of Naïve Bayesian Learning


We want to learn, given some conditions, whether
to play tennis or not


see the table on the next page


The data available generated tells us from previous
occurrences what the conditions were and whether
we played tennis or not during those conditions


there are 14 previous days’ worth of data


To compute our prior probabilities, we just do


P(tennis) = days we played tennis / total days = 9 / 14


P(!tennis) = days we didn't play tennis / total days = 5 / 14


The evidential probabilities are computed by adding
up the number of Tennis = yes and Tennis = no for
that evidence, for instance


P(wind = strong | tennis) = 3 / 9 = .33 and P(wind =
strong | !tennis) = 3 / 5 = .60

Continued


We have a problem in
computing our evidential
probabilities


we do not have enough data
to tell us if we played in
some of the various
combinations of conditions


did we play when it was
overcast, mild, normal
humidity and weak winds?
No, so we have no
probability for that


do we use 0% if we have no
probability?


We must rely on the Naïve
Bayesian assumption of
conditional independence to
get around this problem


Day     Outlook    Temperature  Humidity  Wind    Play Tennis
Day1    Sunny      Hot          High      Weak    No
Day2    Sunny      Hot          High      Strong  No
Day3    Overcast   Hot          High      Weak    Yes
Day4    Rain       Mild         High      Weak    Yes
Day5    Rain       Cool         Normal    Weak    Yes
Day6    Rain       Cool         Normal    Strong  No
Day7    Overcast   Cool         Normal    Strong  Yes
Day8    Sunny      Mild         High      Weak    No
Day9    Sunny      Cool         Normal    Weak    Yes
Day10   Rain       Mild         Normal    Weak    Yes
Day11   Sunny      Mild         Normal    Strong  Yes
Day12   Overcast   Mild         High      Strong  Yes
Day13   Overcast   Hot          Normal    Weak    Yes
Day14   Rain       Mild         High      Strong  No

p(Yes | Sunny & Hot & Weak) ∝ p(tennis) * p(sunny | tennis) * p(hot | tennis) * p(weak | tennis) = 9 / 14 * 2 / 9 * 2 / 9 * 6 / 9 = 12 / 567
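The table's computation can be reproduced with exact fractions:

```python
from fractions import Fraction as F

# The play-tennis table: (outlook, temperature, humidity, wind, play)
data = [
    ("sunny", "hot", "high", "weak", "no"),
    ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"),
    ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"),
    ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"),
    ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"),
    ("rain", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"),
    ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"),
    ("rain", "mild", "high", "strong", "no"),
]

def score(label, outlook, temp, wind):
    # p(label) * p(outlook|label) * p(temp|label) * p(wind|label)
    rows = [r for r in data if r[4] == label]
    n = len(rows)
    p = F(n, len(data))
    p *= F(sum(1 for r in rows if r[0] == outlook), n)
    p *= F(sum(1 for r in rows if r[1] == temp), n)
    p *= F(sum(1 for r in rows if r[3] == wind), n)
    return p

print(score("yes", "sunny", "hot", "weak"))  # 12/567 in lowest terms
```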

Learning Structure


For a Bayesian network, how do we know what
states should exist in our structure? How do we
know what links should exist between states?


There are two forms of learning here


to learn the states that should exist


to learn which transitions should exist between states


Learning states is less common as we usually have a
model in mind before we get started


we already know the causes we want to model and the
data tell us what effects we have probabilities for


what the data might not tell us is all of the possible links between the nodes, so we might have to learn the transitions


Learning transitions is more common and more
interesting

Learning Transitions


One approach is to start with a fully connected graph


we learn the transition probabilities using the Baum-Welch algorithm and remove any links whose probabilities are 0 (or negligible)


but this approach will be impractical


we will discuss Baum-Welch with HMMs


Another approach is to create a model using neighbor-merging


start with each observation of each test case representing its own
node


as each new test case is introduced, merge nodes that have the same
observation at time t


the network begins to collapse


Another approach is to use V-merging


here, we not only collapse states that are the same, but also states
that share the same transitions


for instance, if in case j si-1 goes to si goes to si+1, and we match that in case k, then we collapse that entire set of transitions into a single set of transitions


notice there is nothing probabilistic about learning the structure

Example


Given a collection of research articles, learn the structure of
a paper’s header


that is, the fields that go into a paper


Data came in three forms: labeled (by human), unlabeled, and distantly labeled (data came from bibtex entries, which contain all of the relevant data but had extra fields that were to be discarded), from approximately 5700 papers


the transition probabilities were learned by simple counting

Applications for Bayes


As we saw, spam filters are commonly implemented using naïve Bayes classifiers, but what other applications use Bayes?


Inference/diagnosis


as we saw in many of the examples presented here, we might use Bayes to identify the most likely cause of various effects that we see; the cause might be a diagnostic conclusion, a classification, or more generally, assigning credit (blame)


As an example, Pathfinder is a medical diagnostic system for
lymph node disease which has 60 diseases and over 130 features
where features are not always binary (many are real valued)


Prediction


given our Bayesian network, we might want
to predict what the most likely result will be given
starting conditions

Continued


Design


nodes represent the components that
might go into the design product, and the features
that each component provides or goals that it
fulfills


Design testing can also be solved


Similarly, decision making can easily be modeled with
the intent being to maximize the expected utility


Story understanding


words and concepts are
stored in a network and a new story is introduced
and the most likely concept is recognized
probabilistically


Word sense disambiguation (what meaning does a given word have) is supported as intermediate nodes in the network