Learning Bayesian Networks

AI and Robotics

Nov 7, 2013

Bayesian Networks

4th December 2009

Presented by Kwak, Nam-ju

The slides are based on Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., written by Ian H. Witten & Eibe Frank.

Images and materials are from the official lecture slides for the book.

Probability Estimate vs. Prediction

What is Bayesian Network?

A Simple Example

A Complex One

Why does it work?

Learning Bayesian Networks

Overfitting

Searching for a Good Network Structure

K2 Algorithm

Other Algorithms

Conditional Likelihood

Data Structures for Fast Learning

Probability Estimate vs. Prediction

Naïve Bayes classifiers and logistic regression models give probability estimates.

For each class, they estimate the probability that a given instance belongs to that class.

Probability Estimate vs. Prediction

Why are probability estimates useful?

They allow predictions to be ranked.

Treat classification learning as the task of
learning class probability estimates from the data.

What is being estimated is

The conditional probability distribution of the
values of the class attribute given the values of
the other attributes.

Probability Estimate vs. Prediction

In this way, Naïve Bayes classifiers, logistic regression models, and decision trees are all ways of representing a conditional probability distribution.

What is Bayesian Network?

A theoretically well-founded way of representing probability distributions concisely and comprehensibly in a graphical manner.

Bayesian networks are drawn as a network of nodes, one for each attribute, connected by directed edges in such a way that there are no cycles: a directed acyclic graph (DAG).

A Simple Example

Each node holds a table of conditional probabilities such as Pr[outlook=rainy | play=no]; the probabilities in each row of a table sum to 1.

A Complex One

When outlook=rainy,
temperature=cool,
humidity=high, and
windy=true…

Let’s call E the situation
given above.

A Complex One

E: rainy, cool, high, and true

Pr[play=no, E] = 0.0025

Pr[play=yes, E] = 0.0077

Each joint probability is obtained by multiplying together all the corresponding entries from the conditional probability tables.

A Complex One

E: rainy, cool, high, and true

Normalizing: Pr[play=no | E] = 0.0025/0.0102 ≈ 0.245 and Pr[play=yes | E] = 0.0077/0.0102 ≈ 0.755, which sum to 1.
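As a quick numerical check, the normalization can be sketched in a few lines, using only the two joint probabilities quoted above:

```python
# Normalizing the two joint probabilities from the example gives the
# class probabilities for the situation E = (rainy, cool, high, true).
joint = {'no': 0.0025, 'yes': 0.0077}   # Pr[play=c, E], as on the slide
total = sum(joint.values())             # Pr[E] = 0.0102
posterior = {c: p / total for c, p in joint.items()}
print(round(posterior['yes'], 3))  # 0.755
print(round(posterior['no'], 3))   # 0.245
```
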

Why does it work?

Terminology

T: all the nodes; P: a node's parents; D: its descendants

Non-descendants: T - D

Why does it work?

Assumption (conditional independence):

Pr[node | parents plus any other set of non-descendants] = Pr[node | parents]

Chain rule:

The nodes are ordered so that all ancestors of a node ai have indices smaller than i. This is possible since the network is acyclic.
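Putting the chain rule and the independence assumption together gives the factorization that makes the network's tables sufficient (standard notation, written out here since the slide only describes it in words):

```latex
\Pr[a_1, a_2, \dots, a_n]
  = \prod_{i=1}^{n} \Pr[a_i \mid a_{i-1}, \dots, a_1]
  = \prod_{i=1}^{n} \Pr[a_i \mid a_i\text{'s parents}]
```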

Why does it work?

Ok, that’s what I’m talking about!!!

Learning Bayesian Networks

Basic components of algorithms for learning
Bayesian networks:

Methods for evaluating the goodness of a given
network

Methods for searching through space of possible
networks

Learning Bayesian Networks

Methods for evaluating the goodness of a given
network

Calculate the probability that the network assigns to each instance and multiply these probabilities together.

Alternatively, use the sum of the logarithms: the log-likelihood.

Methods for searching through space of possible
networks

Search through the space of possible sets of
edges.
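The evaluation component can be sketched in a few lines; the dict-based network representation here (each attribute mapped to its parent tuple and conditional probability table) is a hypothetical encoding, not something from the slides:

```python
import math

def log_likelihood(network, data):
    """Sum of log-probabilities the network assigns to each instance.
    network: {attr: (parents, cpt)} where cpt maps
             (value, parent_values) -> probability.
    data:    list of {attr: value} instances."""
    ll = 0.0
    for instance in data:
        for attr, (parents, cpt) in network.items():
            parent_vals = tuple(instance[p] for p in parents)
            ll += math.log(cpt[(instance[attr], parent_vals)])
    return ll
```

Summing logarithms avoids the numerical underflow that multiplying many small probabilities would cause.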

Overfitting

While maximizing the log-likelihood of the training data, the resulting network may overfit. What are the solutions?

Cross-validation: split the data into training instances and validation instances (similar to 'early stopping' in the training of neural networks)

A penalty for the complexity of the network

Assign a prior distribution over network structures and find the most likely network given the data.

Overfitting

Penalty for the complexity of the network

Based on the total # of independent estimates in
all the probability tables, which is called the # of
parameters

Overfitting

Penalty for the complexity of the network

K: the # of parameters

LL: the log-likelihood

N: the # of instances in the training data

AIC score = -LL + K (Akaike Information Criterion)

MDL score = -LL + (K/2) log N (Minimum Description Length)

Both scores are to be minimized.
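The two penalized scores translate directly into code (a sketch; the slide does not fix the base of the logarithm, so the natural log is assumed here):

```python
import math

def aic_score(ll, k):
    """AIC score = -LL + K; smaller is better."""
    return -ll + k

def mdl_score(ll, k, n):
    """MDL score = -LL + (K/2) * log N; smaller is better."""
    return -ll + (k / 2) * math.log(n)
```
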

Overfitting

Assign a prior distribution over network
structures and find the most likely network by
combining its prior probability with the
probability accorded to the network by the
data.

Searching for a Good Network Structure

The probability of a single instance is the
product of all the individual probabilities from
the various conditional probability tables.

The product can be rewritten to group
together all factors relating to the same table.

The log-likelihood can also be grouped in the same way.

Searching for a Good Network Structure

Therefore the log-likelihood can be optimized separately for each node.

This is done by adding or removing edges from other nodes to the node being optimized (without creating cycles).

Which one is the best?

Searching for a Good Network Structure

AIC and MDL can be dealt with in a similar
way since they can be split into several
components, one for each node.

K2 Algorithm

Starts with a given ordering of the nodes (attributes)

Processes each node in turn

Greedily tries adding edges from previously processed nodes to the current node

Moves on to the next node when the current node's score can't be improved further

The result depends on the initial order
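A minimal sketch of the K2 loop, assuming a caller-supplied local score function `score(node, parents)` to maximize (for instance, the node's log-likelihood or AIC/MDL component); the `max_parents` cap is a common implementation detail, not part of the slide:

```python
def k2(order, score, max_parents=3):
    """Greedy K2 structure search (a sketch).
    order: the given node ordering.
    score(node, parents): local score for one node, to be maximized.
    Returns {node: set of parents}."""
    parents = {node: set() for node in order}
    for i, node in enumerate(order):
        best = score(node, parents[node])
        improved = True
        while improved and len(parents[node]) < max_parents:
            improved = False
            # Only nodes earlier in the ordering may become parents.
            candidates = [p for p in order[:i] if p not in parents[node]]
            scored = [(score(node, parents[node] | {c}), c) for c in candidates]
            if scored:
                s, c = max(scored)
                if s > best:
                    best = s
                    parents[node].add(c)
                    improved = True
    return parents
```

Because edges only ever point from earlier to later nodes in the ordering, no explicit cycle check is needed.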

K2 Algorithm

Some tricks:

Use the Naïve Bayes classifier as a starting point.

Ensure that every node is in the Markov blanket of the class node. (Markov blanket: the node's parents, its children, and its children's parents)

Naïve Bayesian classifier

Markov blanket

(Pictures from Wikipedia)
Other Algorithms

Extended K2: sophisticated but slow

Does not require an ordering of the nodes.

Greedily adds or deletes edges between arbitrary pairs of nodes.

Tree Augmented Naïve Bayes (TAN)

Other Algorithms

Tree Augmented Naïve Bayes (TAN)

Augments a Naïve Bayes classifier with a tree.

When the class node and its outgoing edges are eliminated, the remaining edges must form a tree.

Naïve Bayes classifier and tree (pictures from http://www.usenix.org/events/osdi04/tech/full_papers/cohen/cohen_html/index.html)

Other Algorithms

Tree Augmented Naïve Bayes (TAN)

A maximum-weight spanning tree (MST) over the attributes gives the tree that maximizes the likelihood.
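One standard way to obtain those edge weights (the Chow-Liu style construction; the specific formula is not stated on the slide) is the conditional mutual information of each attribute pair given the class:

```latex
I(X_i; X_j \mid C) = \sum_{x_i, x_j, c} \Pr[x_i, x_j, c]\,
  \log \frac{\Pr[x_i, x_j \mid c]}{\Pr[x_i \mid c]\,\Pr[x_j \mid c]}
```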

Conditional Likelihood

What we actually need to know is the
conditional likelihood, which is the
conditional probability of the class given the
other attributes.

However, what we have actually tried to maximize so far is just the (joint) likelihood.

Conditional Likelihood

Computing the conditional likelihood for a given network and dataset is straightforward; maximizing it is what logistic regression does.

Data Structures for Fast Learning

Learning Bayesian networks involves a lot of
counting.

For each network structure to be searched,
the data must be scanned to get the
conditional probability tables.

(Since a node's set of parents, the 'given' part of its table, changes frequently during the search, the data must be rescanned many times to obtain up-to-date conditional probabilities.)

Data Structures for Fast Learning

Use a general hash table.

Assume there are 5 attributes: 2 with 3 values and 3 with 2 values.

There are 4*4*3*3*3 = 432 possible categories.

This calculation includes a category for missing values (i.e. null) in each attribute.

This can cause memory problems.
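The arithmetic behind the 432 figure, with one extra "missing" category per attribute:

```python
# 2 attributes with 3 values and 3 attributes with 2 values; each
# attribute also gets a "missing" (null) category, hence 4*4*3*3*3.
categories = (3 + 1) ** 2 * (2 + 1) ** 3
print(categories)  # 432
```
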

Data Structures for Fast Learning

AD (all-dimensions) tree

Using a general hash table, there will be
3*3*3=27 categories, even though only 8
categories are actually used.

Data Structures for Fast Learning

AD (all-dimensions) tree

Only 8 categories are required, compared to 27.

Data Structures for Fast Learning

AD (all-dimensions) tree: construction

Assume each attribute in the data has been assigned an index; the root node is given index zero.

Then expand a node for attribute i with the values of all attributes j > i.

Two important restrictions:

The most populous expansion for each attribute is omitted (breaking ties arbitrarily).

Expansions with counts that are zero are also omitted.
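The construction rules above can be sketched as a recursive build; the tuple-of-values record format and the dict node layout are assumptions for illustration:

```python
from collections import Counter

def build_adtree(records, start_attr=0):
    """Build an AD-tree node over `records` (a list of value tuples).
    A node stores its count and children keyed by (attr_index, value).
    Per the slide's two restrictions: the most populous value of each
    attribute is omitted (ties broken arbitrarily), and zero-count
    expansions never appear (Counter only yields observed values)."""
    node = {'count': len(records), 'children': {}}
    n_attrs = len(records[0]) if records else 0
    for j in range(start_attr, n_attrs):
        counts = Counter(r[j] for r in records)
        most_populous = counts.most_common(1)[0][0]
        for value in counts:
            if value == most_populous:
                continue  # omitted; recoverable later by subtraction
            subset = [r for r in records if r[j] == value]
            node['children'][(j, value)] = build_adtree(subset, j + 1)
    return node
```

Expanding only attributes j > i ensures each attribute combination is stored at most once.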

Data Structures for Fast Learning

AD (all-dimensions) tree

Data Structures for Fast Learning

AD (all-dimensions) tree

Q. # of (humidity=normal, windy=true, play=no)?

Data Structures for Fast Learning

AD (all-dimensions) tree

Q. # of (humidity=normal, windy=false, play=no)?

Data Structures for Fast Learning

AD (all-dimensions) tree

Q. # of (humidity=normal, windy=false, play=no)?

#(humidity=normal, play=no) - #(humidity=normal, windy=true, play=no) = 1 - 1 = 0
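The same subtraction, written out with the counts the tree actually stores:

```python
# Counts stored in the AD tree (from the slide's example); the
# windy=false expansion was omitted as the most populous one, so its
# count is recovered by subtracting from the parent count.
stored = {
    ('humidity=normal', 'play=no'): 1,
    ('humidity=normal', 'windy=true', 'play=no'): 1,
}
derived = stored[('humidity=normal', 'play=no')] \
    - stored[('humidity=normal', 'windy=true', 'play=no')]
print(derived)  # 0
```
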

Data Structures for Fast Learning

AD trees only pay off if the data contains many thousands of instances.