Nov 7, 2013
The University of Texas at Arlington
Lecture-5
Bayesian Networks
CSE 5301 – Data Modeling and Analysis
Techniques
Dr. Gergely Záruba
Modeling Complexity and Dependence
• Modeling distributions and inferring probability
values in multi-variable systems is
computationally complex
• Model size and probabilistic inference are exponential
in the number of random variables
• Independence relations can reduce this
complexity
• Model size is exponential only in the number of
mutually dependent variables
• Conditional independence limits the number of random
variables that have to be considered when making
inferences
Graphical Models
• Graphical models provide an efficient structure to represent dependencies
in probabilistic systems.
• There are two main types of graphical models for probabilistic systems:
• Bayesian networks are directed graphical models
• Markov networks (Markov random fields) are undirected graphical models
(they can model dependencies Bayes nets cannot; they will not be discussed here)
• Each type can represent dependencies that the other cannot
• Graphical models in probabilistic systems allow the representation
of the interdependencies of random variables
• Structure shows dependency relations
• Inference can use the structure to control the computations
• Graphical models provide a basis for a number of efficient problem
solutions
• Inference of prior and conditional probabilities
• Learning of network structure
BAYESIAN NETWORKS
Bayesian Networks
• Bayesian networks are a graphical representation of
conditional independence, providing a compact
specification of joint probability distributions
• Bayesian networks are directed, acyclic graphs G(V, E)
• Vertices represent random variables: V = {X_i | 1 ≤ i ≤ n}
• Edges represent “direct influences”: E = {(X_i, X_j) | X_i directly influences X_j}
• Each node is annotated with the conditional probability distribution of
the node given its parents: P(X_i | Parents(X_i))
• The probabilities in the network represent the joint distribution
A Simple BayesNet Example
[figure: the five-variable example network over SP, HP, BD, RS, FL with its CPTs]
Joint Distribution
• Remember, a Bayesian network should be a simple
representation of a system with a large number of probabilistic
variables with some independencies.
• The joint distribution can be calculated as:
P(x_1, …, x_n) = Π_{i=1}^{n} P(x_i | Parents(X_i))
• E.g., the probability of:
P(sp,!hp,bd,!rs,!fl)=
P(!fl|bd)*P(!rs|bd)*P(bd|sp,!hp)*P(!hp)*P(sp)=0.7*0.8*0.5*0.99*0.1=0.02772
• If all variables are Boolean, then keeping the full joint probability table
would mean maintaining 2^n values. If we can limit the parents of
each node to be no more than k, then with a Bayesian network we
can reduce that to O(n·2^k) values.
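The numeric example above can be re-checked directly with the chain rule; each factor below is one CPT entry as given on the slide:

```python
# P(sp, !hp, bd, !rs, !fl) as a product of CPT entries (values from the slide).
factors = {
    "P(!fl | bd)":     0.7,
    "P(!rs | bd)":     0.8,
    "P(bd | sp, !hp)": 0.5,
    "P(!hp)":          0.99,
    "P(sp)":           0.1,
}

joint = 1.0
for value in factors.values():
    joint *= value   # chain rule: multiply one factor per node
```

Multiplying the five factors reproduces the slide's result of 0.02772.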
7
Conditional Independence
• Bayesian networks are powerful as they
capture conditional independence
between random variables.
• Both of the following statements are true:
• A node is conditionally independent of all of
its non-descendants given its parents
• A node is conditionally independent of all
other nodes in the network given its parents,
its children and its children’s parents. This is
what we call the Markov blanket.
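A minimal sketch of reading a Markov blanket off a DAG given as a child-to-parents map (the network shape mirrors the five-variable example; the helper names are our own):

```python
# Hypothetical parent map for the example network.
PARENTS = {
    "SP": [], "HP": [],
    "BD": ["SP", "HP"],
    "RS": ["BD"], "FL": ["BD"],
}

def children(node):
    """All nodes that list `node` as a parent."""
    return [c for c, ps in PARENTS.items() if node in ps]

def markov_blanket(node):
    """Parents, children, and the children's other parents."""
    blanket = set(PARENTS[node]) | set(children(node))
    for child in children(node):
        blanket |= set(PARENTS[child])
    blanket.discard(node)
    return blanket
```

For BD this yields its parents {SP, HP} plus its children {RS, FL}, which have no other parents.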
Node Ordering
• Note that node ordering matters. The best
node ordering is usually “causal”.
• Add the root causes first, then add the variables
they influence, from top to bottom, until you reach
the leaves.
• Fortunately, in most situations this causal
relationship is what the researcher requires.
• Note that any ordering is possible, but the
number of links may grow significantly if the
ordering is not causal.
Discrete and Continuous Variables
• Obviously, any discrete distribution is
representable in a node of a Bayesian network.
• An arbitrary continuous distribution cannot be
easily represented.
• One trick is to discretize the distribution, where the
precision (the size of the bins) can be
balanced against the size of the network.
• Distributions that can be given by a formula and
parameters (e.g., exponential, Gaussian) can be used
if attention is paid to their meaning.
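The discretization trick can be sketched as follows, here binning a standard normal into equal-width bins (the bin count and range are arbitrary choices for the sketch):

```python
import math

def normal_cdf(x):
    """CDF of the standard normal via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def discretize(n_bins=16, lo=-4.0, hi=4.0):
    """Probability mass of each equal-width bin over [lo, hi]."""
    w = (hi - lo) / n_bins
    edges = [lo + k * w for k in range(n_bins + 1)]
    probs = [normal_cdf(edges[k + 1]) - normal_cdf(edges[k])
             for k in range(n_bins)]
    # Fold the tiny tail mass into the end bins so the table sums to 1.
    probs[0] += normal_cdf(lo)
    probs[-1] += 1.0 - normal_cdf(hi)
    return probs

table = discretize()
```

A finer grid (larger n_bins) trades a bigger CPT for better precision, which is exactly the balance the bullet describes.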
Child continuous, Parent Discrete
• It is common to change the parameter of
the distribution in a continuous child node
based on the discrete parent node’s
probabilities.
• E.g., it is common to use a Gaussian
distribution with a fixed variance but a mean
that is influenced by the parent node.
Hybrid Example
Example taken from RN2003
P(C | H, !s) = N(a_f·h + b_f, σ_f²)(c)
P(C | H, s) = N(a_t·h + b_t, σ_t²)(c)
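This hybrid node can be sketched as a conditional density function: one (a, b, σ) triple per value of the discrete parent, with the mean linear in the continuous parent h (the numeric parameters below are invented for illustration):

```python
import math

# (a, b, sigma) per value of the discrete parent Subsidy (illustrative numbers).
PARAMS = {True:  (-0.5, 10.0, 1.0),    # (a_t, b_t, sigma_t) with subsidy
          False: (-1.0, 12.0, 1.5)}    # (a_f, b_f, sigma_f) without subsidy

def cost_density(c, h, s):
    """Gaussian density of Cost = c given Harvest = h and Subsidy = s."""
    a, b, sigma = PARAMS[s]
    mu = a * h + b                     # mean is linear in the parent h
    z = (c - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Evaluated at its mean, the density peaks at 1 / (sigma * sqrt(2*pi)).
peak = cost_density(c=-0.5 * 4.0 + 10.0, h=4.0, s=True)
```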
Linear Gaussian Distribution
Discrete Variable with Continuous Parent
• Common to use soft threshold functions.
• probit: with Φ(x) = ∫_{−∞}^{x} N(0,1)(t) dt,
P(Buy=true | Cost=c) = Φ((μ − c)/σ)
• sigmoid (logit): P(Buy=true | Cost=c) = 1 / (1 + e^(−2(μ−c)/σ))
INFERENCE IN BAYESIAN NETS
So, why did we do all of this?
Inference
• With a well-constructed Bayesian network we can
make diagnoses and thus make good, informed
decisions.
• We can fix the value of one or more nodes in the
network (to a precise value) and see how that changes
the probabilities of the distributions.
• Thus we can observe evidence variables and see what
their impact is on some other variables, the query
variables. Variables in the network used neither for
evidence nor for query are called hidden variables.
• So, what is the posterior distribution P(x_q | x_{e1}, …, x_{en})?
Inference by Enumeration
• Conditional probabilities can be computed
from the joint probabilities.
• A query can be answered as the
normalized sum of joint probabilities, thus
as the normalized sum of products of
conditional probabilities found in the
network.
• Recall: P(X | e) = α·P(X, e) = α·Σ_y P(X, e, y)
Inference by Enumeration – Example
• What is P(SP | rs, fl)?
• P(SP | rs, fl) = α·P(SP, rs, fl) = α·Σ_{HP,BD} P(SP, HP, BD, rs, fl)
• This requires a sum of 4 terms, each a product of n conditional probabilities
• The worst-case complexity is O(n·2^n)
• The example shows “variable elimination,” with which the real complexity can be
reduced. Complexity also depends on the sparsity of the network and on which
variables are used for evidence and query and which are hidden.
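A sketch of this query by enumeration on the five-variable example. Only P(sp)=0.1, P(hp)=0.01, P(bd|sp,!hp)=0.5, P(rs|bd)=0.2, and P(fl|bd)=0.3 appear on the slides; the remaining CPT entries below are invented to make the example runnable:

```python
from itertools import product

P_SP = 0.1
P_HP = 0.01
P_BD = {(True, True): 0.8, (True, False): 0.5,    # P(bd | sp, hp); entries
        (False, True): 0.6, (False, False): 0.05} # other than (T,F) are assumed
P_RS = {True: 0.2, False: 0.01}                   # P(rs | bd); False entry assumed
P_FL = {True: 0.3, False: 0.05}                   # P(fl | bd); False entry assumed

def bern(p, value):
    """P(X = value) for a Boolean variable with P(X = True) = p."""
    return p if value else 1.0 - p

def joint(sp, hp, bd, rs, fl):
    """Chain-rule joint: product of one CPT entry per node."""
    return (bern(P_SP, sp) * bern(P_HP, hp) *
            bern(P_BD[(sp, hp)], bd) *
            bern(P_RS[bd], rs) * bern(P_FL[bd], fl))

def query_sp(rs, fl):
    """P(SP | rs, fl): enumerate the hidden variables HP and BD, normalize."""
    scores = {}
    for sp in (True, False):
        scores[sp] = sum(joint(sp, hp, bd, rs, fl)
                         for hp, bd in product((True, False), repeat=2))
    alpha = 1.0 / (scores[True] + scores[False])
    return {sp: alpha * s for sp, s in scores.items()}

posterior = query_sp(rs=True, fl=True)
```

The inner sum is exactly the 4-term sum over HP × BD from the slide; α is the normalization constant.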
Approximate Inference
• So, large multiply-connected Bayesian networks
can pose a problem for exact inference.
• We can use Monte Carlo methods to determine
interesting conditional probabilities (i.e., to do
inference).
• The four basic methods are:
• Direct sampling
• Sampling from an empty network (for joint probabilities)
• Rejection sampling in Bayesian networks (for inference)
• Likelihood weighting
• Markov chain Monte Carlo
Sampling from an Empty Network
• Simplest method.
• Forget any evidence you may have for nodes.
• Sample each variable in topological order based on the
outcomes of the previous samples. Do this many times
(let’s say M times).
• This results in M N-tuples, e.g.,
{⟨FL_i, RS_i, BD_i, HP_i, SP_i⟩ | 1 ≤ i ≤ M}
• Individual and joint probabilities can now be estimated by
counting how many times out of the M samples something
happened.
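The procedure can be sketched on a toy two-node network BD → FL (the numbers are illustrative, not the slides' CPTs):

```python
import random

random.seed(0)

P_BD = 0.4                          # P(bd); illustrative
P_FL = {True: 0.3, False: 0.05}     # P(fl | bd), P(fl | !bd); illustrative

def sample_once():
    """Sample variables in topological order: parent first, then child."""
    bd = random.random() < P_BD
    fl = random.random() < P_FL[bd]
    return bd, fl

M = 100_000
samples = [sample_once() for _ in range(M)]

# Estimate a joint probability by counting, e.g. P(bd, !fl):
est = sum(1 for bd, fl in samples if bd and not fl) / M
exact = P_BD * (1 - P_FL[True])     # 0.4 * 0.7
```

With M large, the counting estimate converges to the exact chain-rule value.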
Rejection Sampling in Bayesian Networks
• Recall: rejection sampling was used to sample from a
hard to sample distribution given an easy one.
• Used in this context to add evidence and thus to
determine conditional probabilities.
• Having the M N-tuples, we can count how many times
the evidence happened and, out of those times, how
many times the query happened (for Boolean variables). The
conditional probability is the ratio of these two counts.
• The problem is that some probabilities may become very
low and thus using them as evidence variables will
require huge sets of samples.
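Rejection sampling on the same toy BD → FL network (illustrative numbers), with FL = true as evidence:

```python
import random

random.seed(1)

P_BD = 0.4                          # illustrative CPTs, as before
P_FL = {True: 0.3, False: 0.05}

M = 200_000
kept = match = 0
for _ in range(M):
    bd = random.random() < P_BD
    fl = random.random() < P_FL[bd]
    if not fl:
        continue                    # reject samples contradicting the evidence
    kept += 1
    if bd:
        match += 1

estimate = match / kept             # estimates P(bd | fl)
exact = (P_BD * 0.3) / (P_BD * 0.3 + (1 - P_BD) * 0.05)
```

Note the inefficiency the slide warns about: only about 15% of the samples survive here, and rarer evidence would waste even more of them.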
Likelihood Weighting
• Avoids the previous inefficiency by only generating
samples that conform to the evidence.
• We sample all variables in order as before. However, when we
reach an evidence variable we do not sample it: we fix its
value and instead update a weight for this n-tuple. For each
tuple the weight starts at 1 and gets multiplied by
P(e|Parentoutcomes(e)) for each evidence variable.
• Thus each n-tuple has the correct evidence and an
additional weight capturing the likelihood of such n-tuple.
• Conditional probabilities are then sums of the weight for
the various outcomes of the query variable normalized
over the total of the weights.
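Likelihood weighting on the same toy BD → FL network (illustrative numbers), again with FL = true as evidence:

```python
import random

random.seed(3)

P_BD = 0.4                          # illustrative CPTs, as before
P_FL = {True: 0.3, False: 0.05}

M = 100_000
w_total = 0.0
w_bd = 0.0
for _ in range(M):
    bd = random.random() < P_BD     # non-evidence variable: sample as usual
    w = P_FL[bd]                    # evidence FL=true: fix it, weight by its CPT
    w_total += w
    if bd:
        w_bd += w

estimate = w_bd / w_total           # estimates P(bd | fl)
```

Every generated tuple is used, unlike rejection sampling; the weights account for how likely each tuple was to produce the evidence.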
Markov Chain Monte Carlo
• Think of an n-tuple as the state of a process.
• Evidence variables reduce the size n (as they
are fixed and will never change).
• Initialize non-evidence variables randomly.
• The next state is determined by resampling exactly
one non-evidence variable from its distribution given
the current state of its Markov blanket.
• The conditional probability is estimated from the
normalized counts of the query variable over the visited states.
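A Gibbs-style sketch of this on a toy chain A → B → C with evidence C = true (all numbers are illustrative); each step resamples one non-evidence variable from its Markov-blanket distribution:

```python
import random

random.seed(2)

P_A = 0.5                          # P(a); illustrative
P_B = {True: 0.8, False: 0.2}      # P(b | a)
P_C = {True: 0.9, False: 0.1}      # P(c | b)

def bern(p, v):
    return p if v else 1.0 - p

a, b = True, True                  # initialize non-evidence variables
count_b = 0
BURN, STEPS = 1_000, 50_000
for t in range(BURN + STEPS):
    # Resample A given its Markov blanket (its child B): P(a | b) ∝ P(a)·P(b|a)
    wa = {v: bern(P_A, v) * bern(P_B[v], b) for v in (True, False)}
    a = random.random() < wa[True] / (wa[True] + wa[False])
    # Resample B given its blanket (parent A, child C=c): P(b | a, c) ∝ P(b|a)·P(c|b)
    wb = {v: bern(P_B[a], v) * P_C[v] for v in (True, False)}
    b = random.random() < wb[True] / (wb[True] + wb[False])
    if t >= BURN and b:
        count_b += 1

estimate = count_b / STEPS         # estimates P(B = true | c); exactly 0.9 here
```

The evidence variable C never changes, so the chain walks only over (A, B) states, and the normalized visit counts of B estimate the posterior.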
References
• [RN2003] S. Russell, P. Norvig, “Artificial
Intelligence: A Modern Approach,” Second
edition, Prentice Hall, 2003 (Chapter 14)
• Eugene Charniak, “Bayesian Networks
Without Tears,” AI Magazine, 12(4), 1991,
http://www.aaai.org/ojs/index.php/aimagazine/article/view/918