Learning Bayesian Networks


Bayesian Networks


4th December 2009

Presented by Kwak, Nam-ju

The slides are based on "Data Mining: Practical Machine Learning Tools and Techniques", 2nd ed., by Ian H. Witten & Eibe Frank.

Images and materials are from the official lecture slides of the book.


Table of Contents


Probability Estimate vs. Prediction


What is a Bayesian Network?


A Simple Example


A Complex One


Why does it work?


Learning Bayesian Networks


Overfitting


Searching for a Good Network Structure


K2 Algorithm


Other Algorithms


Conditional Likelihood


Data Structures for Fast Learning


Probability Estimate vs. Prediction


Naïve Bayes classifiers and logistic regression models produce probability estimates.

For each class, they estimate the probability that a given instance belongs to that class.



Probability Estimate vs. Prediction


Why are probability estimates useful?

They allow predictions to be ranked.

Treat classification learning as the task of learning class probability estimates from the data.

What is being estimated is the conditional probability distribution of the values of the class attribute given the values of the other attributes.

Probability Estimate vs. Prediction


In this way, Naïve Bayes classifiers, logistic regression models, and decision trees are all ways of representing a conditional probability distribution.

What is a Bayesian Network?

A theoretically well-founded way of representing probability distributions concisely and comprehensibly in a graphical manner.

They are drawn as a network of nodes, one for each attribute, connected by directed edges in such a way that there are no cycles.

A directed acyclic graph (DAG).
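As a concrete, purely hypothetical illustration (not code from the book), such a network might be held in memory as a dictionary mapping each node to its parent list and conditional probability table; the probabilities below are illustrative only:

    # A hypothetical two-node fragment of the weather network.
    # Each node maps to (parent list, conditional probability table);
    # a table maps (tuple of parent values, node value) -> probability.
    # The numbers are made up for illustration, not taken from the book.
    network = {
        "play": ([], {((), "yes"): 0.64,
                      ((), "no"):  0.36}),
        "outlook": (["play"], {(("yes",), "sunny"):    0.25,
                               (("yes",), "overcast"): 0.42,
                               (("yes",), "rainy"):    0.33,
                               (("no",),  "sunny"):    0.40,
                               (("no",),  "overcast"): 0.20,
                               (("no",),  "rainy"):    0.40}),
    }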


A Simple Example


[Figure: the weather data network, with a conditional probability table at each node.]

Each table entry is a conditional probability such as Pr[outlook=rainy | play=no]; the entries in each row of a table sum to 1.


A Complex One


When outlook=rainy, temperature=cool, humidity=high, and windy=true…

Let's call this situation E.

A Complex One


E: outlook=rainy, temperature=cool, humidity=high, windy=true

Pr[play=no, E] = 0.0025
Pr[play=yes, E] = 0.0077

Multiply together the relevant entry from every node's table to obtain each of these joint probabilities.

(An additional worked example of the calculation is on the next slide.)

A Complex One


E: rainy, cool, high, and true


Pr[play=no, E] = 0.0025


Pr[play=yes, E] = 0.0077


A Complex One


Dividing each joint probability by their sum normalizes them so that they sum to 1.
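Concretely, normalizing the two numbers above:

Pr[play=no | E]  = 0.0025 / (0.0025 + 0.0077) ≈ 0.245
Pr[play=yes | E] = 0.0077 / (0.0025 + 0.0077) ≈ 0.755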

Why does it work?


Terminology

T: all the nodes; P: parents; D: descendants

Non-descendants: T − D

[Figure: a node's parents, descendants, and non-descendants.]

Why does it work?


Assumption (conditional independence):

Pr[node | parents plus any other set of non-descendants] = Pr[node | parents]

Chain rule:

Pr[a1, a2, …, an] = Pr[a1] × Pr[a2 | a1] × … × Pr[an | an−1, …, a1]

The nodes are ordered to give all ancestors of a node ai indices smaller than i; this is possible since the network is acyclic. Then ai−1, …, a1 contain all of ai's parents and otherwise only non-descendants, so by the assumption each factor reduces to Pr[ai | ai's parents]:

Pr[a1, a2, …, an] = the product over i of Pr[ai | ai's parents]
Why does it work?





The factors Pr[ai | ai's parents] are exactly the entries stored in the network's conditional probability tables, which is why multiplying those entries together gives the joint probability.

OK, that's what I'm talking about!

Learning Bayesian Networks


Basic components of algorithms for learning
Bayesian networks:


Methods for evaluating the goodness of a given network

Methods for searching through the space of possible networks


Learning Bayesian Networks


Methods for evaluating the goodness of a given network:

Calculate the probability that the network accords to each instance and multiply these probabilities together over all instances.

Alternatively, use the sum of the logarithms of these probabilities — the log-likelihood (a small sketch follows this slide).

Methods for searching through the space of possible networks:

Search through the space of possible sets of edges.
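A minimal sketch of the log-likelihood evaluation, assuming a simple dictionary representation (each node maps to its parent list and conditional probability table); this is my own illustration, not the book's implementation:

    import math

    def log_likelihood(network, data):
        # network: {node: (parents, cpt)}, where cpt maps
        #   (tuple of parent values, node value) -> probability
        # data: list of dicts {attribute: value}
        ll = 0.0
        for row in data:
            # the product over nodes is the probability the network
            # accords to this instance; taking logs turns it into a sum
            for node, (parents, cpt) in network.items():
                parent_values = tuple(row[p] for p in parents)
                ll += math.log(cpt[(parent_values, row[node])])
        return ll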

Overfitting


While maximizing the log-likelihood based on the training data, the resulting network may overfit. What are the solutions?

Cross-validation: split the data into training instances and validation instances (similar to 'early stopping' when training neural networks).

Add a penalty for the complexity of the network.

Assign a prior distribution over network structures and find the most likely network by combining its prior probability with the probability accorded to the network by the data.


Overfitting


Penalty for the complexity of the network

The penalty is based on the total number of independent estimates in all the probability tables, which is called the number of parameters.

Overfitting


Penalty for the complexity of the network

K: the number of parameters
LL: the log-likelihood
N: the number of instances in the training data

AIC score = −LL + K            (Akaike Information Criterion)
MDL score = −LL + (K/2) log N  (Minimum Description Length)

Both scores are to be minimized.
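The two scores transcribe directly into code; a trivial sketch (the function names are mine, not from the book):

    import math

    def aic_score(ll, k):
        # LL: log-likelihood, K: number of parameters; smaller is better
        return -ll + k

    def mdl_score(ll, k, n):
        # N: number of training instances; smaller is better
        return -ll + 0.5 * k * math.log(n)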

Overfitting


Assign a prior distribution over network
structures and find the most likely network by
combining its prior probability with the
probability accorded to the network by the
data.

Searching for

a Good Network Structure


The probability of a single instance is the
product of all the individual probabilities from
the various conditional probability tables.


The product can be rewritten to group
together all factors relating to the same table.


The log-likelihood can be grouped in the same way.

Searching for

a Good Network Structure


Therefore the log-likelihood can be optimized separately for each node.

This is done by adding or removing edges from other nodes to the node being optimized (without creating cycles).

Which structure is the best?

Searching for

a Good Network Structure


AIC and MDL can be dealt with in a similar
way since they can be split into several
components, one for each node.

K2 Algorithm


Starts with a given ordering of the nodes (attributes).

Processes each node in turn.

Greedily tries adding edges from previously processed nodes to the current node.

Moves on to the next node when the current node cannot be optimized further.

The result depends on the initial order. (A sketch of the greedy loop follows.)
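A rough sketch of the K2 greedy loop, using the node-wise MDL contribution from the previous slides as the score (maximized here, so the penalty is subtracted); this is a simplification of my own, not the original K2 algorithm's Bayesian score:

    import math
    from collections import Counter

    def node_score(data, node, parents):
        # Log-likelihood contribution of one node minus its MDL penalty.
        # data: list of dicts {attribute: value}; higher is better.
        n = len(data)
        joint = Counter((tuple(row[p] for p in parents), row[node]) for row in data)
        marginal = Counter(tuple(row[p] for p in parents) for row in data)
        ll = sum(c * math.log(c / marginal[pv]) for (pv, _), c in joint.items())
        node_values = len({row[node] for row in data})
        parent_configs = 1
        for p in parents:
            parent_configs *= len({row[p] for row in data})
        k = (node_values - 1) * parent_configs     # independent estimates
        return ll - 0.5 * k * math.log(n)          # per-node MDL score

    def k2(data, order):
        # order: attribute names, fixed in advance (ancestors before descendants)
        parents = {node: [] for node in order}
        for i, node in enumerate(order):
            best = node_score(data, node, parents[node])
            improved = True
            while improved:
                improved = False
                # greedily try adding one edge from an earlier node
                for candidate in order[:i]:
                    if candidate in parents[node]:
                        continue
                    s = node_score(data, node, parents[node] + [candidate])
                    if s > best:
                        best, best_candidate, improved = s, candidate, True
                if improved:
                    parents[node].append(best_candidate)
        return parents        # the result depends on the initial order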

K2 Algorithm


Some tricks:

Use the Naïve Bayes classifier as a starting point.

Ensure that every node is in the Markov blanket of the class node. (Markov blanket: the node's parents, children, and children's parents.)

[Figures: a Naïve Bayes classifier and a Markov blanket.]

Pictures from Wikipedia and
http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html

Other Algorithms


Extended K2 (more sophisticated but slower):

Does not order the nodes.

Greedily adds or deletes edges between arbitrary pairs of nodes (while keeping the graph acyclic).

Tree Augmented Naïve Bayes (TAN)


Other Algorithms


Tree Augmented Naïve Bayes (TAN):

Augments a Naïve Bayes classifier with a tree over the attributes.

When the class node and its outgoing edges are removed, the remaining edges must form a tree.

[Figures: a Naïve Bayes classifier and a tree.]

Pictures from
http://www.usenix.org/events/osdi04/tech/full_papers/cohen/cohen_html/index.html

Other Algorithms


Tree Augmented Naïve Bayes (TAN):

Finding a maximum weighted spanning tree over the attributes is the key to maximizing the likelihood.
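In the standard TAN construction (Friedman, Geiger & Goldszmidt, 1997), the spanning tree's edge weights are the conditional mutual information between pairs of attributes given the class:

I(Ai; Aj | C) = the sum over ai, aj, c of Pr[ai, aj, c] log ( Pr[ai, aj | c] / (Pr[ai | c] Pr[aj | c]) )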

Conditional Likelihood


What we actually need to know is the conditional likelihood: the conditional probability of the class given the other attributes. (O)

However, what we have been trying to maximize is, in fact, just the (joint) likelihood. (X)

Conditional Likelihood


Computing the conditional likelihood for a
given network and dataset is straightforward.


This is what logistic regression does.
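For a single instance with attribute values E, the conditional probability of each class is obtained from the joint probabilities by normalization, exactly as in the earlier example:

Pr[class | E] = Pr[class, E] / (the sum over all classes c of Pr[c, E])

The conditional likelihood of the whole dataset is the product of these values over all training instances.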

Data Structures for Fast Learning


Learning Bayesian networks involves a lot of counting.

For each network structure considered during the search, the data must be scanned afresh to compute the conditional probability tables. (Whenever the conditioning set of a node's table changes, the data has to be rescanned to obtain the new conditional probabilities.)

Data Structures for Fast Learning


Use a general hash table to store the counts.

Suppose there are 5 attributes, two with 3 values and three with 2 values.

There are 4 × 4 × 3 × 3 × 3 = 432 possible categories.

This calculation includes the cases where an attribute's value is missing (i.e. null), which is why each attribute contributes one extra possibility.

This can cause memory problems.
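A minimal sketch of this hash-table counting (my own illustration, not the book's code); one reading of the "missing (null)" case above is that the attribute is simply left out of the combination, so every partial combination of attribute-value pairs gets its own counter:

    from collections import Counter
    from itertools import combinations

    def count_all_combinations(data):
        # data: list of tuples of attribute values (one tuple per instance)
        counts = Counter()
        for row in data:
            items = list(enumerate(row))          # (attribute index, value) pairs
            for r in range(len(items) + 1):
                for subset in combinations(items, r):
                    counts[subset] += 1           # one hash-table entry per category
        return counts

    # e.g. counts[((1, "normal"), (4, "no"))] would be the number of instances
    # with humidity=normal and play=no, if those are attributes 1 and 4.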

Data Structures for Fast Learning


AD (all-dimensions) tree

[Figure: an example AD tree.]

Using a general hash table, there would be 3 × 3 × 3 = 27 categories, even though only 8 categories are actually used.


Data Structures for Fast Learning


AD (all-dimensions) tree

Only 8 categories are required, compared to 27.

Data Structures for Fast Learning


AD (all-dimensions) tree — construction

Assume each attribute in the data has been assigned an index.

Then, expand the node for attribute i with the values of all attributes j > i.

Two important restrictions:

The most populous expansion for each attribute is omitted (breaking ties arbitrarily).

Expansions with counts that are zero are also omitted.

The root node is given index zero. (A construction sketch follows.)
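A minimal construction sketch under the two restrictions above; the node layout and names are my own simplification, not the book's code:

    def build_adtree(instances, attr_index=0):
        # instances: list of tuples of attribute values
        # Returns (count, children), where children[j] maps a value of
        # attribute j (for j >= attr_index) to a child AD node.
        n_attrs = len(instances[0]) if instances else 0
        children = {}
        for j in range(attr_index, n_attrs):
            groups = {}
            for inst in instances:
                groups.setdefault(inst[j], []).append(inst)
            # Restriction 1: omit the most populous expansion
            # (ties broken arbitrarily by max()).
            most_populous = max(groups, key=lambda v: len(groups[v]))
            # Restriction 2: zero-count expansions never appear, because
            # 'groups' only contains values actually observed in 'instances'.
            children[j] = {value: build_adtree(sub, j + 1)
                           for value, sub in groups.items()
                           if value != most_populous}
        return (len(instances), children)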

Data Structures for Fast Learning


AD (all-dimensions) tree

Data Structures for Fast Learning


AD (all-dimensions) tree

Q. # of (humidity=normal, windy=true, play=no)?

Data Structures for Fast Learning


AD (all-dimensions) tree

Q. # of (humidity=normal, windy=false, play=no)?

? (this combination is not stored directly in the tree — see the next slide)

Data Structures for Fast Learning


AD (all-dimensions) tree

Q. # of (humidity=normal, windy=false, play=no)?

#(humidity=normal, windy=false, play=no)
  = #(humidity=normal, play=no) − #(humidity=normal, windy=true, play=no)
  = 1 − 1 = 0
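A query sketch matching the subtraction above, for the tree structure from the earlier construction sketch (again my own illustration; it assumes that a value not stored at a node is the omitted most-populous branch):

    def adtree_count(node, query, attr_index=0):
        # node: (count, children) as returned by build_adtree
        # query: dict {attribute index: required value}
        count, children = node
        pending = sorted(j for j in query if j >= attr_index)
        if not pending:
            return count
        j = pending[0]
        value = query[j]
        rest = {k: v for k, v in query.items() if k != j}
        if j in children and value in children[j]:
            return adtree_count(children[j][value], rest, j + 1)
        # The value's branch was omitted as the most populous expansion:
        # recover its count by subtraction, exactly as on this slide.
        total = adtree_count(node, rest, attr_index)
        others = sum(adtree_count(child, rest, j + 1)
                     for child in children.get(j, {}).values())
        return total - others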

Data Structures for Fast Learning


AD trees only pay off if the data contains many thousands of instances.

Questions and Answers


Any questions?

Pictures from
http://news.ninemsn.com.au/article.aspx?id=805150