Dynamic Bayesian Networks:
Class Discriminative Structure Search
with an Application to Functional Magnetic
Resonance Imaging Data







John Burge
PhD Thesis Proposal
Table of Contents
1 Introduction
2 Background
2.1 Bayesian Networks
2.2 Notation
2.3 Bayesian Multinets
2.4 Functional Magnetic Resonance Imaging
2.5 Talairach Database
2.6 Related Neuroimaging Analysis Work
3 Current Work – Analyzing Neuroanatomical Relationship Dynamics
3.1 Experimental Design for fMRI Analysis
3.2 DBN Analysis
3.3 Results
3.4 Confidence Measure
3.5 Experimental Conclusions
4 Bayesian Networks – Research Questions
4.1 Feature Extraction
4.2 Structure Search
4.3 Parameter Identification
4.4 Structure Scoring
4.5 Analyzing Network Structures
4.6 Phase Transitions
5 Bayesian Networks – Proposed Issues to be Addressed
5.1 Comparative Scoring Methods and ACL
5.2 Maximum ACL parameters
5.3 Hierarchical Information
5.4 Shrinkage
6 Research Contributions
7 Acknowledgements
8 References
9 Appendices
9.1 Bayesian Networks
9.2 Dynamic Bayesian Networks
9.3 Gaussian Naïve Bayesian Networks
9.4 Lagrange Multipliers for Maximizing the Binomial Ratio Test


1 Introduction
The neuroanatomical behavior and underlying root causes of many mental illnesses are
only moderately understood. What changes in the neural activity of patients with
schizophrenia? How do the relationships among neuroanatomical regions change due to
the onset of senile dementia? Questions such as these are frequently addressed in the
work of many neuroscientists. Answers to these questions could result in better
diagnoses and treatment options for patients, ultimately leading to an improved livelihood
for many people.
But to answer these questions, a fundamental issue must be addressed: how can the
relationships among neuroanatomical regions be modeled? Neuroscientists have several
standardized analytical methods which have yielded much success in furthering the
current understanding of mental illness (Section 2.6). However, many of their modeling
techniques make overly restrictive assumptions, such as the linearity and Gaussian
distribution assumptions. These assumptions could result in the inability of their models
to represent many important relationships. The machine learning community, and the AI
community in general, have developed many modeling techniques that can be applied to
the neuroanatomical domain that do not require these assumptions.
I propose a method based on Bayesian networks (BNs) (Jensen 2001, Pearl 1986) capable
of modeling the temporal relationships among neuroanatomical regions. Specifically, I
use a subset of BNs referred to as dynamic Bayesian networks (DBNs) (Murphy 2002)
that are capable of modeling temporal processes. BNs are graphical models that
explicitly represent statistical relationships among random variables (RVs). They are
composed of two primary components: a topological structure indicating which RVs are
statistically dependent on each other and a set of parameters detailing the particular
behavior among interdependent RVs. In many systems, it is not known a priori which RVs are dependent on each other. For these systems, a BN's topological structure will be unknown and must be searched for, which is an NP-hard task. In conjunction with finding a structure, a method for determining the BN's parameters must also be devised. My research focuses on novel techniques capable of addressing these issues.
First, I observe that under certain circumstances, learning multiple Bayesian networks,
one for each possible class of data, is a better approach than the frequently employed
method of learning a single BN with a class node. Doing this falls under the general
multinet framework (Heckerman 1991; Geiger and Heckerman 1996) (Section 2.3).
Second, I propose a method for determining which structures are representative of class-
discriminating relationships. Third, I propose a structure search heuristic that takes
advantage of the relationships among hierarchically related RVs. As an extension to the
third proposal, I propose to incorporate shrinkage, a statistical technique previously used
in machine learning to improve classification efficacy (Section 5.4).
As structure search is an unsupervised learning problem, validation of experimental
results is an issue. For validation, I compare the classification efficacy of my methods
with other commonly used machine learning classification techniques. I also employ a
bootstrapping method for quantitatively measuring the confidence in the results which
indicates how likely the results were due to chance (Section 3.4). Further, I show that my
methods are capable of extracting known relationships in simulated data. Finally, I
analyze functional magnetic resonance imaging data collected in an experiment designed
to identify neuroanatomical differences between healthy and demented elderly patients.
A collaborating neuroscientist, Vincent P. Clark, compares my structural results with
putative neuroanatomical behavior.
2 Background
2.1 Bayesian Networks

For an introductory overview of BNs, I refer the reader to Charniak's aptly titled "Bayesian Networks without Tears" (1991) and, for a detailed analysis, to (Heckerman, Geiger & Chickering 1995; Jensen 2001). I will briefly introduce BNs and the notation I use to describe them. See Appendix 9.1 for a significantly more detailed overview.
Given a system composed of a set of RVs, it is desirable to assign a probability to every
possible event, where an event is a unique setting of the RVs. Modeling a system with a
joint probability distribution (JPD) does this. However, as the number of RVs in a
system increases, the number of JPD parameters increases exponentially. As such, using
a JPD becomes intractable for all but simplistic models. However, if some of the RVs are statistically independent, many of the JPD's parameters will be redundant. BNs take advantage of this redundancy and provide a method to succinctly represent the non-redundant parameters within the JPD.
A BN can be represented as a directed acyclic graph (DAG). For each RV in the system,
the DAG contains a single node. If two nodes are statistically dependent, the DAG
contains a directed link connecting the nodes. The link's origin is the parent node and the link's destination is the child. Since the link is directed, a sense of causality is often
semantically associated with the relation such that the parent causally influences the
child. Associated with each child node is a conditional probability distribution (CPD)
describing the statistical relationship between the node and its parents.
A subclass of BNs, dynamic Bayesian networks (DBNs), are used to model temporal
processes. Ideally, the DBNs topological structure would contain a single column of
nodes for each timeframe in the system. Each column would include one node for every
RV. Unfortunately, for all but the simplest domains, such models quickly become
intractably large. To deal with this intractability, two assumptions are frequently made:
the Markov assumption and the stationary assumption (Appendix 9.2). Under these
assumptions, the system can be modeled with a DBN containing two columns of RVs
(see Figure 13 in Appendix 9.2 for an example of such a DBN).
Unlike other BNs with fixed topologies, such as the Gaussian naïve Bayesian network
(Appendix 9.3) and the hidden Markov model, a DBN's topology is frequently not
known. Finding the correct DBN topology, a process referred to as structure search, is a
well-understood problem (Heckerman 1995) that is NP-Hard (Chickering, Geiger &
Heckerman 1994). Much of my research focuses on the details for finding the structures
that best describe a set of data.
2.2 Notation

A Bayesian network is described by the pair (B_S, B_Θ), where B_S is the graphical structure of the network and B_Θ specifies the conditional probability parameters of the BN. B_S contains a set of nodes, one for every RV in the system, X = {X_1, X_2, ..., X_n} with arities {r_1, r_2, ..., r_n}, and a link for every causal dependency between RVs.
Y is a RV used to represent the class associated with a set of observed RVs. To simplify the math, I will assume binary classification such that the domain of Y is {1, 2}. A data point consists of observable RVs and a class RV: d = {X, Y}. X_d and Y_d refer to the observable RVs and the class of data point d. A dataset, D, is a collection of m data points, {d_1, ..., d_m}. D_γ denotes a dataset containing all of (and only) the data points for a specific class γ, i.e., D_γ = {d : Y_d = γ}.
B denotes a Bayesian network containing nodes for RVs {X, Y}. The parent set for a RV X_i in B is denoted Pa_B(X_i), or just Pa(X_i) if the BN can be inferred. q_i is the number of possible configurations for the RVs in Pa(X_i). The substructure in a BN containing X_i and Pa(X_i) is referred to as a family. For convenience, the family that contains X_i as a child is referred to as family X_i.
Unless specified otherwise, all BNs are assumed to be discrete and their conditional
probability distributions are represented by conditional probability tables (CPTs). The
joint probability distribution represented by B is given in Equation (1),

P_B(X, Y) = P_B(Y | Pa(Y)) · ∏_{i=1..n} P_B(X_i | Pa(X_i)).   (1)

θ^B is the set of CPT parameters for BN B, θ^B_i is the CPT for node X_i in BN B, θ^B_{i,j} is the multinomial P_B(X_i | Pa(X_i) = j), and θ^B_{i,j,k} is the single CPT parameter P_B(X_i = k | Pa(X_i) = j).
Notation for DBNs is slightly modified. X_i^t (also denoted X_i^{t+0}) and X_i^{t+1} represent the i-th RV in columns t and t+1, respectively. The set of DBN RVs is denoted X^{0:1} = {X_i^t, X_i^{t+1} : 1 ≤ i ≤ n}. The parameters for P_B(X_i^{t+e} = k | Pa(X_i^{t+e}) = j) are denoted θ^B_{e,i,j,k}, e ∈ {0, 1}. Y represents the class associated with the entire time series.
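To make this notation concrete, the following sketch (my own illustration, not part of the proposal's implementation) computes the joint probability of Equation (1) for a small discrete BN whose CPTs are stored as nested dictionaries; the network, variable names and probabilities are hypothetical.

from itertools import product

# A toy discrete BN with structure X1 -> X2 and X1 -> X3.
# cpts[i][j][k] = P(X_i = k | Pa(X_i) = j), where j is the tuple of parent values.
parents = {"X1": [], "X2": ["X1"], "X3": ["X1"]}
cpts = {
    "X1": {(): {0: 0.6, 1: 0.4}},
    "X2": {(0,): {0: 0.7, 1: 0.3}, (1,): {0: 0.2, 1: 0.8}},
    "X3": {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.5, 1: 0.5}},
}

def joint_probability(assignment):
    """Equation (1) without the class node: prod_i P_B(X_i | Pa(X_i))."""
    p = 1.0
    for var, pa in parents.items():
        j = tuple(assignment[q] for q in pa)   # parent configuration j
        k = assignment[var]                    # child value k
        p *= cpts[var][j][k]                   # theta_{i,j,k}
    return p

# Sanity check: the joint distribution sums to 1 over all events.
total = sum(joint_probability(dict(zip(parents, vals)))
            for vals in product([0, 1], repeat=len(parents)))
print(total)  # ~1.0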
2.3 Bayesian Multinets

Frequently, when learning a BN for the purposes of class discrimination, a single BN
with a class node is trained. However, if the conditional independence assumptions
among RVs change based on the class, the resulting BN will have an overly complex
structure with many redundant parameters. Instead of learning a single BN, a BN for
each value of the class node should be learnt. This reduces the maximum number of
simultaneous parents needed for the families whose parent sets change across classes and
results in a better use of sparse data. A multinet (Heckerman 1991; Geiger and Heckerman 1996) is the use of multiple BNs to represent a single, more complex
Using multiple BNs also increases the comprehensibility of learned models. If a single
BN were used to simultaneously model both classes, the relationships applicable for each
individual class would have to be teased out from the parameters, a task significantly more difficult than analyzing the structure. For example, observing the topology for the BN in Figure 1, it is not apparent that X_1 and X_3 are only correlated when Y = 1.
The BN used to model class γ is referred to as B_γ. D_γ is B_γ's intra-class data and D_γ′, γ′ ≠ γ, is B_γ's extra-class data. An event e^γ_{i,j,k} is defined to be the occurrence of a data point d ∈ D_γ where X_i = k and Pa_B(X_i) is in its j-th configuration. The value N^γ_{i,j,k} is the number of times e^γ_{i,j,k} occurs.
Multinets have been previously employed for DBNs for the purposes of speech
recognition by Bilmes (2000). Bilmes employed multinets as alternatives for hidden
Markov models, achieving higher classification accuracies from models with fewer
parameters.
2.4 Functional Magnetic Resonance Imaging

Recently, magnetic resonance imaging (MRI) techniques have become widely used in the study and diagnosis of mental illness. They are non-invasive and allow properties of small cubic regions of brain tissue (voxels) to be measured. Depending on the type of MRI used, different properties of tissue can be measured. Of interest to my work is functional MRI (fMRI). fMRI measures the amount of oxygenated hemoglobin present in the blood surrounding neural tissue. This measurement is referred to as the blood oxygen level dependent (BOLD) response. The level of the measured BOLD response is believed to be positively correlated with the amount of neural activation within the voxel.
In order to test a hypothesis on the behavior of a mental illness, psychologists design
experiments for patients to undergo while receiving an fMRI scan. These experiments
are believed to elicit different responses between healthy controls and afflicted patients.
Typically, these experiments consist of a simple recognition task in which the patient is
exposed to an audio or visual stimulus to which they respond with a signal indicating that
they have observed the stimulus.
A typical fMRI session generates a tremendous amount of information which frequently
renders direct analysis intractable. Instead, the analysis must be performed on features
Figure 1. A system where X_3 is correlated with X_1 when Y = 1 and with X_2 when Y = 2. (left) The single BN required to model the system; its CPT contains O(r^4) parameters, where r is the arity of the nodes. (right) The multinet required to represent the system; its CPTs contain only O(r^3) parameters.
extracted from the data. For example, a standard fMRI scanning session can result in a
single 3D image every few seconds and will last on the order of an hour, yielding
hundreds of images. Each image will typically contain on the order of tens of thousands
of voxels. In total, an fMRI session can result in millions or even tens of millions of data
values per patient.
2.5 Talairach Database

To reduce the enormous quantity of data produced by an fMRI scan, I abstract from the
voxel level to the neuroanatomical region level. Several databases describe such regions
of interest (ROIs), with the most widely used being the Talairach database (Lancaster et
al. 2000).
The Talairach database separates voxels into 150 different regions layered into four
hierarchical levels. Each hierarchical level corresponds to a different level of granularity
of the brain. The topmost level contains gross regions such as the left and right
cerebrum, cerebellum and brain stems, while the bottommost level contains smaller, more specialized regions such as the Brodmann areas and individual nuclei. The Talairach
database maps every voxel into one region per hierarchical level. I use the mapping to
compute the aggregate mean voxel activation (AMVA) for each ROI. Analysis can then
be performed on the resulting AMVA time series.
2.6 Related Neuroimaging Analysis Work

Much of the previous work in the analysis of fMRI images uses statistical regression and
hypothesis testing, frequently via statistical software packages such as SPM (Friston
2002) or AFNI (Cox 1996). The general linear model (GLM) is frequently used to
determine if a stimulus is correlated with voxel activity (Friston et al. 1995). Derived
from the GLM, the t-test is also used (Worsley & Friston 1995).
Discriminant analysis techniques (e.g., multivariate regression) are commonly used to
analyze neuroimaging data and they share many modeling similarities with BNs. The
main difference lies in the methods used to measure correlation between RVs.
Discriminant analysis techniques generally measure correlation with an analysis of
variance (ANOVA), whereas the generality of the BN framework does not constrain the
means of measuring correlation to any single method. Instead, the BN framework allows
RVs to be modeled in an arbitrary fashion, with many possible methods for measuring correlation. Of course, for any particular method chosen, there will be
assumptions made that limit the representative power of the BN.
The most commonly employed method in neuroscience for modeling connectivity
between ROIs is structural equation modeling (SEM), also known as path analysis. The
goal in SEM is to use a graphical structure to model the covariance between groups of
RVs. Statistical inference is used to measure the goodness of fit of proposed covariance
structures to data and, by selecting the model that maximizes the goodness of fit, a set of
causal relationships can be elicited from the data. As an example of using SEM to
analyze fMRI data, Büchel and Friston (1997) were able to show significant changes in
the relationships between regions in the dorsal visual system caused by shifts in attention.
The most significant difference between SEM and BNs is that SEM assumes the random
variables can be accurately modeled with first and second order moments (i.e., that they
have values distributed with a Gaussian) and frequently makes linearity assumptions.
Again, the generality of the BN framework does not require these assumptions (though
they may be present in some BN applications) and can model relationships SEMs
typically cannot.
Mine is not the first BN technique applied within the neuroanatomical domain. Højen-Sørensen, Hansen and Rasmussen (2000) use Hidden Markov Models (HMMs), a special
case of DBNs, to learn a model of activity within the visual cortex from visual stimuli.
They were capable of reconstructing the patterns of visual stimulus presented to patients
based solely on fMRI data.
However, the work most similar to mine is that of Mitchell et al. (2003). They use a BN
approach to analyze fMRI data with the goal of classifying instantaneous cognitive states
such as reading a book or looking at a picture. They use three separate algorithms for
analyzing fMRI data: a Gaussian naïve Bayesian network (GNBN) classifier, a support
vector machine (SVM) and a K-Nearest Neighbor (KNN) classifier. Their work
significantly differs from mine as they solely attempt to classify fMRI data, a supervised
learning task, while I attempt to both classify and extract neuroanatomical relationships, a
more difficult unsupervised learning task.
Friston, Harrison and Penny (2003) employed Bayesian network modeling techniques to
attempt reconstruction of activation networks. Like me, they employ dynamic Bayesian
networks, which they refer to as dynamic causal models (DCMs). I refer the reader to
Penny et al. (2004) for a review and comparison of SEMs and DCMs. They restrict the
correlational model among RVs to bilinear relationships. Given the simplicity of their
correlational model, they do not need to perform a structure search. Further, using their
extensive knowledge of neurophysiology, they incorporate a detailed hemodynamic
response into their model. I instead assume a multinomial correlational model and do not
explicitly model the hemodynamic response. Though my correlational model is more
general and can capture many nonlinear relationships that theirs cannot, my approach leads to
a vastly more complex model identification problem and requires a number of
assumptions and approximations not required by Friston, Harrison and Penny.
3 Current Work  Analyzing Neuroanatomical Relationship Dynamics
I have experimented with the ability of DBNs to model changing relationships among
neuroanatomical regions caused by the onset of dementia. In this section, I will describe
the DBN analysis method, the experimental results, the validation methods and the
conclusions I reached during the experiment.
3.1 Experimental Design for fMRI Analysis

Figure 2 gives a flowchart for the steps performed in the analysis of the fMRI data. First,
raw fMRI data is obtained for each of the patients. Then, for each individual fMRI scan,
voxels are grouped into the ROIs they compose via the Talairach database. For each
structure, the aggregate mean voxel activation (AMVA) is computed at each time step,
resulting in a time series for each ROI. The collection of AMVA time series for every
ROI is collated into a single data point. The collection of all the data points is contained
in dataset D = {d_1, d_2, ..., d_40}, where d_i contains the data of the i-th patient. Each d_i can be thought of as a table, with the columns representing ROIs and the rows representing time frames. Each cell in the table is the AMVA of a single ROI at a single point in time.
D is divided into four datasets, D_h, D_d, D_e and D_y, containing data for the healthy elderly, the demented elderly, all the elderly and the young subjects, respectively. D_e and D_y form a partition of D. D_h and D_d form a partition of D_e.
At this point, an SVM and a GNBN can be applied to the data. However, the DBN I
employ is discrete and requires the continuous data to be quantized. Each data point is
separated into sequential non-overlapping temporal windows of eight time points. A
trend is then defined per window as the mean of the AMVA values within the window.
The difference between the AMVA value and trend value can then be used to assign a
region into a high state or a low state.
Three quantization methods are employed¹: Highlow, Highlow2 and Highlow4. The first method results in two quantized states: one state above the trend and one state below the trend. The second method results in four equally sized states, two above and two below the trend. The third method results in eight equally sized states, four above and four below the trend. Figure 2e illustrates the Highlow quantization for the AMVA time series in Figure 2c. After the data is quantized, a DBN is learnt for each of the datasets.


¹ I examined other methods of quantization, such as using medians and quartiles as opposed to means and extrema, and creating a separate region surrounding the trend constituting a "noisy" state, but found no significant impact on the results.
3.2 DBN Analysis

The DBNs make both the Markov order 1 and stationary assumptions. To perform the
structure search, I use the BDE score (Heckerman, Geiger & Chickering 1995) and employ the following greedy algorithm. For each node in column t + 1, the algorithm finds the parent that increases the BN's score the most. It then finds the next parent that increases the score the most in conjunction with the previously added parent. This process is repeated until a predetermined number of parents has been added.
The search algorithm is parameterized by two variables, numParents and numBestToKeep. numParents indicates the maximum number of parents each ROI should have. This is required since adding a parent will result in a higher BDE score but will eventually result in a loss of model generality and lower classification accuracy due to overfitting. I set numParents empirically, as the highest value possible before leave-one-out cross-validation accuracies begin to drop. numBestToKeep indicates how many nodes should retain their parents during classification. Generally, classification accuracy improves when only a certain subset of nodes is allowed to retain their parents. This is because many nodes contribute only noise or overfit the data, reducing model generalization.
The BDE score is a Bayesian approach that allows prior information to guide the results
of the structure search. I employ an uninformed structural prior and an uninformed
conditional probability prior. This makes the assertion that, prior to seeing any data on
the patients, there is no bias towards believing any two ROIs are more or less likely to
appear causally related. Nor is there a bias towards how any discovered relationships
should behave.
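A minimal sketch of the greedy parent-addition search described above, assuming a scoring function score_family(child, parents, data) that returns the family's contribution to the network score (for example, a BDE family score); that function and the names in the usage comment are placeholders, not the implementation used in the experiments.

def greedy_parent_search(children, candidates, data, score_family, num_parents):
    """For each column-(t+1) node, greedily add the parents (from column t)
    that most improve the family score, up to num_parents parents."""
    structure = {}
    for child in children:
        parents = []
        for _ in range(num_parents):
            current = score_family(child, parents, data)
            best_gain, best_parent = 0.0, None
            for cand in candidates:
                if cand in parents:
                    continue
                gain = score_family(child, parents + [cand], data) - current
                if gain > best_gain:
                    best_gain, best_parent = gain, cand
            if best_parent is None:      # no candidate improves the score
                break
            parents.append(best_parent)
        structure[child] = parents
    return structure

# Usage (hypothetical names):
# structure = greedy_parent_search(rois_t1, rois_t0, quantized_data,
#                                  bde_family_score, num_parents=3)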
Using this algorithm, four DBNs were learnt: DBN_h, DBN_d, DBN_e and DBN_y, obtained by training on the D_h, D_d, D_e and D_y datasets, respectively. To classify a new subject's data, D_new (which has not been used to train the DBNs), the posterior likelihood of each of the models given the subject's data, P(DBN_γ | D_new), γ ∈ {h, d, e, y}, was calculated. The model with the highest posterior likelihood was used to determine the class of the new subject. Datasets D_h and D_d were used to differentiate between healthy and demented patients while D_e and D_y were used to differentiate between elderly and young patients.
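The classification rule is simply an argmax over the per-class model likelihoods. A sketch, assuming each trained multinet exposes a log_likelihood(data) method (a hypothetical interface):

def classify(subject_data, models):
    """models: dict mapping a class label (e.g. 'h' or 'd') to a trained DBN.
    Returns the label whose model assigns the highest likelihood to the data;
    with equal class priors this is the maximum-posterior class."""
    return max(models, key=lambda label: models[label].log_likelihood(subject_data))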
3.3 Results

Analyzing the structural differences between DBNs can illuminate how the relationships
between neuroanatomical ROIs change due to dementia. One way to analyze the
structural differences is to examine how the parent set for a child changes between the
healthy and demented DBNs. Another telling feature is the list of the highest scoring
families. Figure 3 gives a sample of the top families for both the healthy and demented
DBNs. By looking at which families have the best scores, one can determine which ROIs
are most strongly correlated with other regions in the brain.
However, through interaction with my collaborating neuroscientist, I have found the most
informative structural features tend to be information on which ROIs are commonly
parents in one DBN, but not the other. Finding such regions can indicate significant
changes in the way patients use their brain. Figure 3b gives the structural results for
seven of the top ten demented families. It is striking that a region known as the amygdala
appears as a parent in all of them. In fact, of all the demented families, the amygdala is a
parent of nearly fifty percent of them. This is profound given that in the healthy DBN,
the amygdala is a parent of only a single family. These results strongly suggest that the
behavior of the amygdala has significantly changed in demented patients. Further
analysis of the structural results is outside the scope of this proposal. For a more detailed
discussion, see (Burge et al. 2004).
To validate the structure search method, I have compared the classification performance
of the DBNs against two other powerful machine learning classifiers, a support vector
machine (SVM) and a Gaussian Naïve Bayesian network (GNBN). For the classification
between healthy elderly and demented elderly patients, both the DBN and the GNBN
achieved a 73% classification accuracy. Of the healthy subjects, the DBN had a 93%
accuracy, while the GNBN had an 84% accuracy. Thus, while the overall classification
accuracy of the DBN and GNBN were the same, they show different biases towards
classifying healthy patients correctly vs. classifying demented patients correctly. Both
the DBN and GNBN achieved higher accuracies than the SVM which at its best only
achieved a classification accuracy of 65%. These results verify that the DBN is capable
of learning discriminative qualities between healthy and demented patients with at least
equal strength compared to two other powerful and commonly used machine learning
techniques.
Figure 3. DBN structural results. a) Highest scoring family in the healthy DBN. b) Seven of the highest scoring families in the demented DBN with the amygdala as a parent (highlighted to emphasize its prevalence in the structural results). Of all the demented families, the amygdala is a parent in 49% of them. Of all the healthy families, the amygdala is only a parent in one of them.

I have also attempted to compare the classification efficacy of the DBN with that of
typical neuroscience approaches. Unfortunately, as classification is not the primary goal
in many neuroscience applications, Buckner et al. did not attempt classification on the
demented data. However, they found few functional differences between groups which
suggests that classification would be difficult. There have been other studies that have
attempted to classify dementia based on neuroanatomical data. Using multivariate linear
regression on positron emission tomography (PET) data, Azari et al. (1993) examined the
neurophysiological effects of dementia in elderly patients. They were able to attain a
significantly higher classification accuracy than I have on my initial DBNs, correctly
classifying dementia with 87% accuracy. However, while the subject population between
Buckner et al.s study and Azari et al.s study are similar, Buckner et al.s demented
patients suffered only from very mild or mild dementia as opposed to moderate dementia
in Azari et al.s study. The comparison between my results and theirs are also
complicated by the different imaging techniques used (fMRI vs. PET) and different
experimental paradigms. I am not aware of other more-compatible experiments to
compare my classification results with.
To show that the results of the DBN structure search would not be found with linear
regression models (such as the one employed by Azari et al.), I compared the strength of
the linear fit in the relationships of the top five demented and healthy families with the
linear fit of 200,000 families whose structures were randomly chosen. For each family, a
linear regressor minimizing the sum of squared residuals was fit with the dependent RV
set as the familys child RV and the independent RVs set as the family s parent RVs.
The r values
2
for the top five demented families ranged from 0.19 to 0.59 and for the top
five healthy families, from 0.2 6 to 0.65. However, many of the randomly generated
families had significantly higher r values, with the maximum r value being 0.74. Further,
within the top 5 families, the best r values did not correspond to the best BDE scores.
For instance, of the top five healthy families, the family with the lowest BDE score had
the highest r value and thus, the strongest linear fit. These results indicate both that the
DBN structure search does not simply maximize linear fit and that linear regression based
models would not have found the relationships within the DBN (which could explain
why Bucker et al. found little differences between healthy and demented subjects).
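As an illustration of this check, the following sketch fits a least-squares regressor predicting a family's child ROI from its parent ROIs and computes the coefficient of determination from footnote 2; it is a generic reconstruction of the procedure, not the original analysis code, and the example data are synthetic.

import numpy as np

def family_linear_fit(child_series, parent_series):
    """Fit child ~ parents by least squares and return the coefficient of
    determination: sum((y_hat - mean)^2) / sum((y - mean)^2)."""
    X = np.column_stack([parent_series, np.ones(len(child_series))])  # add intercept
    coeffs, *_ = np.linalg.lstsq(X, child_series, rcond=None)
    y_hat = X @ coeffs
    y_mean = child_series.mean()
    return ((y_hat - y_mean) ** 2).sum() / ((child_series - y_mean) ** 2).sum()

# Hypothetical example: one child ROI and two parent ROIs over 240 time points.
rng = np.random.default_rng(0)
parents = rng.normal(size=(240, 2))
child = 0.5 * parents[:, 0] + rng.normal(scale=0.5, size=240)
print(family_linear_fit(child, parents))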
3.4 Confidence Measure

I have applied a technique to measure the confidence in the individual DBN structural
results. It is based on the bootstrap method used by Friedman et al. (2000) in their
analysis of gene expression data. The bootstrap method measures the likelihood that the
score of a DBN family was due to random chance.
Recall that the data for each patient is stored in a table where the columns represent the
activation values of neuroanatomical regions and the rows represent points in time.
Normally, the rows are ordered chronologically and temporal correlations between time
frames are intact. To destroy the temporal correlations, the rows can be randomly


² r is the coefficient of determination, a measure of the goodness of fit of a linear regressor: r = Σ(Ŷ − ⟨Y⟩)² / Σ(Y − ⟨Y⟩)², where Y is the dependent variable and Ŷ is the regressor's estimate of Y.
permuted. For any given row in the table, the relationships among regions are not
affected. DBNs are then learnt for the randomly permuted datasets to calibrate BDE
scores for networks based on data with no temporal relationships. I will refer to these
DBNs as random DBNs and the DBNs learnt on the actual data as true DBNs.
This process is repeated 20 times and a distribution of BDE scores for each family in the
random DBNs is created. I then assign a measure of confidence to the families in the true
DBNs based on the difference between the true DBN's BDE score and the random
DBNs' BDE scores. Equation (2) gives the confidence measure,
Conf(X_i) = ( BDE(X_{i,true}) − μ_{i,rand} ) / σ_{i,rand},   (2)

where BDE(X_{i,true}) is the BDE score for family X_i in the true DBN, μ_{i,rand} is the mean BDE score for the X_i family in the random DBNs and σ_{i,rand} is the standard deviation of the BDE scores for the X_i family in the random DBNs. Figure 4 gives a visualization of the confidence measure.
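A sketch of the permutation bootstrap and confidence computation described above, assuming a score_family(data) callable that returns the BDE score of one family on a (time frames x ROIs) data table; the helper name is hypothetical.

import numpy as np

def family_confidence(data, score_family, n_permutations=20, seed=0):
    """Equation (2): (true score - mean of permuted scores) / their std dev.

    data is a (time_frames x ROIs) table for one class; permuting the rows
    destroys temporal correlations while leaving within-frame values intact."""
    rng = np.random.default_rng(seed)
    true_score = score_family(data)
    permuted_scores = []
    for _ in range(n_permutations):
        shuffled = data[rng.permutation(len(data))]   # random row order
        permuted_scores.append(score_family(shuffled))
    mu, sigma = np.mean(permuted_scores), np.std(permuted_scores)
    return (true_score - mu) / sigma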
For all of the highest BDE scoring families, there was high confidence that the
relationships were not due to random chance. For instance, the top demented family had
a confidence of 7.64. This indicates that family's BDE score in the true DBN was 7.64
standard deviations higher than the average BDE score in the random DBNs. The top
five families (and many others), all had confidences of at least 4.9. Thus, it is extremely
unlikely the relationships in the DBN families were due to chance alone.
Figure 5 illustrates a  difference histogram which shows that the BDE score distribution
in the actual data was significantly higher than the BDE score distribution in the
randomized data. This indicates that overall, the true DBN relationships are significantly
stronger than random DBN relationshipsÐ as was expected.
Figure 4. Visualization of the confidence score. The confidence of each of the families in a DBN trained on a true dataset varies in its reliability. Computing the BDE score for families trained on randomly permuted datasets yields a distribution of BDE scores occurring due to random chance, assumed to be Gaussian with mean μ and standard deviation σ. The larger the distance of a true family's BDE score from μ, the more likely the true family did not occur by chance.
3.5 Experimental Conclusions

By comparing the structural differences between DBNs learnt on healthy and demented
subjects, I found many changes in neuroanatomical relationships congruent with known
neuroanatomical behavior (as confirmed by my collaborating neuroscientist). I also
uncovered a significant increase in the correlation of the amygdala to many other
neuroanatomical regions, especially within the parietal lobe. This is the first report that
my collaborating neuroscientist was aware of showing an increased correlation of the
amygdala in an emotionally neutral task due to the onset of dementia. Further, it agrees
with a previous study by Grady et al. (2001) which found an increased role of the amygdala in demented patients on an emotionally charged task (facial recognition).
It is my hope that the results of this experiment will foster interest into further research
towards determining the extent of the amygdalar involvement in dementia in order to lead
to a method for better diagnosis and treatment. I plan to continue my analysis of the
demented dataset with the DBN methods I am proposing in Section 5 that build on the
techniques and lessons learned in this experiment.
4 Bayesian Networks  Research Questions
I have uncovered too many BN research problems in my fMRI experiments to cover in a
single thesis. In this section I will briefly describe the open areas of research and specify
which of them I will pursue in my doctoral research. I will not be directly addressing all
of these issues in my thesis.
Much of the BN research can be put into two main categories. First, given a BN and its
parameters, how can the values of hidden variables be inferred from a set of observed
variables? This process is referred to as BN inference. While inference on typically
Figure 5. A difference histogram showing the difference between the number of times the true BDE scores were in a bin minus the number of times the randomized BDE scores were in a bin (x-axis: healthy family's BDE score, higher to the right). Positive bins indicate ranges of BDE scores seen more in the true DBNs. Negative bins indicate ranges seen more in random DBNs. The positive bins are located on the right side of the graph, corresponding to higher scores and indicating the scores were on average higher for the true DBNs.
employed BNs (e.g., those with common probability distribution functions such as
multinomials, Gaussians, etc.) is a well understood problem (Jensen 2001), many
modifications to the BN framework require modifications to inference techniques.
Second, given a collection of data, what is the most appropriate Bayesian network model
that describes it? My research focuses on this second category, which I will refer to as
model identification. BN model identification can be separated into two tasks: structure
search and parameter identification.
Model identification is also a well-understood problem (Heckerman 1995). Typically, it
involves searching many hypothesis structures, calculating the parameters for each
structure (frequently with a maximum likelihood estimate) and then scoring the model
with a function that measures the goodness of fit of the proposed BN to the data. While
this is a fairly straightforward procedure, many details need to be specified and many
issues arise.
4.1 Feature Extraction

Within many domains, the amount of data available is so large that learning directly on
the data is intractable. This is the case with fMRI data. In order to deal with this
intractability, learning is applied to features extracted from the data. The method for
extracting features is generally specific to the domain at hand. I use the Talairach
database (Lancaster et al. 2000) to convert a series of fMRI images into a time series
detailing the activation of neuroanatomical regions, but other methods for extracting
features could prove promising.
Of particular interest is the notion of a probabilistic 4D atlas proposed by Mazziotta et al.
(2001) in which individual voxels are not assigned absolute labels, as they are in the
Talairach database. Instead, each voxel is assigned a probability of being located within
a set of ROIs. This system could adequately deal with the differing anatomical structures
among patients due to age, disease, normal spatial differences, etc. Bayesian methods
such as the DBNs I employ offer a natural method for working with such a probabilistic
database.
4.2 Structure Search

Given a method for extracting features from a dataset, the next issue is structure search.
Ideally, every possible BN structure would be searched. However, the size of the
structure space is prohibitively large for most real-world domains. For example, the BNs
that I use in the fMRI domain have approximately 10^47 possible structures, clearly too many to search exhaustively. Search heuristics must be employed.
The most common heuristics applied are iterative greedy algorithms that repetitively
make small changes to a hypothesis structure, accepting the change only if it results in an
increase in the structure's score. Other more complex heuristics have also been applied
to BN structure search such as simulated annealing (Heckerman, Geiger and Chickering
1995) and genetic algorithms (Larrañaga et al. 1996) as well as techniques that attempt to
increase implementation efficiency such as Moore and Wong's (2003) optimal reinsertion. Techniques for searching structures with very large sets of RVs have also been developed (Goldberg and Moore 2004).
Further, in many real-world domains, RVs do not interact with each other haphazardly or
chaotically, but instead participate in meaningful semantic relationships, such as the is-a and has-a relationships from the object oriented programming paradigm. Within the neuroanatomical domain, ROIs are related to each other via the contains and contained-by relationships. Each ROI contains smaller regions or is contained by larger regions (or both). A hierarchy can be built as the conglomeration of these semantic relationships.
Structure search can benefit from a hierarchical relationship between RVs. This is one of
my main focuses for research and I will defer further discussion to Section 5.3.
An issue important in modeling temporal processes is time-delay. In the fMRI DBNs, I
make the first order Markov assumption. Doing so causes the DBN to ignore all
correlations among RVs that take more than a single time frame to develop. An obvious
approach to taking larger temporal relations into account is to simply have more columns
of RVs in the BN; however, this dramatically increases the search space. For example, simply adding a t+2 column of RVs in the fMRI DBNs increases the number of possible structures from 10^47 to 10^92. Further exacerbating the problem, many of these structures
have larger maximum family sizes, which exponentially increases the number of
parameters. One possible avenue for research would be to determine how structure search
in such large state spaces could progress.
Structure search has been shown to be an NP-hard problem (Chickering 1994). If it can also be shown to be NP-easy, then it could be reduced to other NP-complete
problems such as integer linear programming (ILP). The tremendous amount of research
developed for solving ILP problems could possibly be leveraged for solving structure
search problems. All that would be needed is a method for converting a BN structure
search into an ILP.
Another significant issue is the incorporation of hidden variables. I have constructed
BNs that are built solely upon observed data points. Others have built models that
assume a ROI's state cannot be directly measured and must be inferred from observed
data (Friston, Harrison and Penny 2003). Unfortunately, performing a structure search on BNs with hidden variables is not a well understood problem. The relatively obvious approach of estimating the distribution of values for hidden variables (perhaps via EM) for each proposed structure is not tractable. However, Friedman (1998) has made
significant advancement in this area with his Structural EM algorithm, which calculates
expected scores for hypothesis BNs as opposed to actual scores.
4.3 Parameter Identification

For each hypothesized BN structure, a set of parameters must be determined. The
method for doing so is generally tied to the method for scoring a BN. In discrete BNs,
due to Heckerman, Geiger and Chickering (1995), the parameters are typically calculated
by simply counting the number of times RVs take on certain values in the data. As I
propose a new BN scoring method, I must also propose an appropriate parameter
identification method, which I discuss in Section 5.1. As was the case with structure
search, I conjecture that a hierarchical relationship between RVs can be used to more
effectively determine the BN parameters. I will discuss this in Section 5.4.
4.4 Structure Scoring

When dealing with Bayesian scores such as Heckerman's BDE score, prior information
needs to be specified. I use uninformative priors. This means that all structures and
parameters are considered, a priori, as equally likely as possible. However, a wealth of
information exists in neuroscience that could be used to help guide BN structure search
and parameter learning. This information could be incorporated as prior information for
Bayesian methods.
For BN models that attempt to discriminate among separate classes of data, using
likelihood-based scores such as the BDE score is not ideal. Instead, the model should
favor structures that discriminate among the classes. This constitutes a significant
portion of my research and I will defer further discussion to Section 5.1.
4.5 Analyzing Network Structures

After identifying separate BNs for differing classes of data, analysis of the structural
differences in the BNs can indicate fundamental differences among the classes. I have
found in the analysis of DBN structures learnt from fMRI data that the most informative
structural information is the number of times ROIs appear as parents in each DBN. For
instance, if a RV X is a parent 50% of the time in class C_1, but only a parent 5% of the time in class C_2, then it is likely X plays a more significant role in C_1 than in C_2.
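A small sketch of this structural comparison: given the learnt structures for two classes (as child-to-parent-list mappings, the same form returned by the search sketch above), count how often each RV appears as a parent in each class; the names in the usage comment are hypothetical.

from collections import Counter

def parent_frequencies(structure):
    """structure: dict mapping each child RV to its list of parent RVs.
    Returns the fraction of families in which each RV appears as a parent."""
    counts = Counter(p for parents in structure.values() for p in parents)
    n_families = len(structure)
    return {rv: counts[rv] / n_families for rv in counts}

# Example (hypothetical inputs):
# freq_healthy = parent_frequencies(structure_healthy)
# freq_demented = parent_frequencies(structure_demented)
# Regions with a large difference in frequency are candidates for changed behavior.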
In many domains, including the neuroanatomical domain, it would also be useful to
identify feedback loops: networks of causal influences that are cyclic in nature.
Unfortunately, a method for inferring feedback loops from DBNs is not immediately
obvious. If a feedback loop is simply defined as a cyclic set of temporal correlations
between variables, there will be many possible feedback loops, perhaps too many to even
enumerate. A method for determining which feedback loops are important is necessary
and constitutes an interesting avenue for research.
4.6 Phase Transitions

A common problem in machine learning is determining under what conditions individual
algorithms do well. One possible approach for addressing this problem is phase
transition analysis (PTA). PTA is not frequently applied in machine learning but
Baskiotis and Sebag (2004) have introduced a PTA-based method for analyzing the behavior of the C4.5 decision tree algorithm (Hastie, Tibshirani and Friedman 2001).
Using a similar technique, PTA could help determine which categories of problems
DBNs should be applied to.
5 Bayesian Networks  Proposed Issues to be Addressed
In this section, I describe in depth the issues identified in Section 4 that constitute the
bulk of my doctoral research.
5.1 Comparative Scoring Methods and ACL

During a structure search, the quality of a proposed structure is measured via a scoring function. Likelihood-derived scores, such as Heckerman's BDE score detailed in
Appendix 9.1, result in parameters being trained based only on data for a single class. In
the multinet framework where one BN is learnt for each class of data, this corresponds to
each BN being trained solely on its intra-class data. Such BNs include the highly likely
features of a single class of data.
If the goal is class discrimination, this is not ideal. The score should not only increase as
the likelihood of a BN given its intra-class data increases, but it should also decrease as
the likelihood with respect to its extra-class data increases. Such a score will produce
networks that discriminate between classes since high-scoring structures will be
indicative of relationships that are strong in one class and weak in another. I will refer to
these scores as comparative scores. Figure 6 provides motivation.
Using a comparative score is not a new idea in training BNs. Bilmes et al. (2001) use comparative scores with the multinet framework in their analysis of speech recognition. Grossman and Domingos (2004) suggest maximizing class-conditional likelihood (CCL). In fact, if classification accuracy is the goal, the optimal discriminative classifier is the one that maximizes CCL (Duda and Hart 1973).
There are, however, limitations to BNs trained with comparative scores. Most
prominently, the resulting models will not be generative. Distributions of samples drawn
from the BNs will not match the distributions in the underlying system. Thus, if data
generation is the goal, comparative scores should not be employed.
There are also significant issues when using the CCL score in particular. The CCL score
cannot be maximized by independently maximizing a score from each of the DBN's
families, as can be done with most likelihood-derived scores (see Equations (21) and (22)
in Appendix 9.1). This has two prominent effects. 1) The iterative greedy structure
search heuristic cannot cache family scores between iterations. This increases the
Figure 6. A graphical structure giving the conditional dependencies between RVs X_1, X_2, X_3, X_4 for two classes of data, C_1 and C_2. Darker lines indicate a stronger correlation between RVs. In C_1, the correlation between X_1 and X_4 is stronger than the correlation between X_2 and X_4. Further, assume the dark-lined correlation does not change between C_1 and C_2. Likelihood-based scores will favor the dark-lined relationship and fail to differentiate between the classes. Comparative scores will favor the weaker but changing relationships and would succeed in class discrimination.
search's polynomial time complexity, rendering it intractable for larger domains. 2)
Individual BN parameters cannot be isolated in the score, preventing the parameters from
being easily maximized. In fact, at this time there is no known closed form solution for
the maximal CCL parameters.
Grossman and Domingos (2004) address the second issue by using the maximum
likelihood estimate for the parameters while scoring each hypothesized network with the
CCL score. For classification, they find that this technique significantly outperformed
standard scoring techniques such as the BDE score. However, by using the MLE
parameters, discriminative information is being discarded that could potentially be
beneficial for discriminative tasks.
I propose a scoring method that favors discriminative structures but also decomposes into
individual family scores and has a closed-form solution for maximizing parameters.
Using the notation given in Section 2.2, the CCL score is derived in Equation (3),
CCL(B | D) = ∏_{d∈D} P_B(Y_d | X_d) = ∏_{d∈D} P_B(X_d, Y_d) / [ P_B(X_d, Y_d = 1) + P_B(X_d, Y_d = 2) ].   (3)
It will be useful to separate the data into two datasets, D_1 and D_2, each containing only elements of class 1 or class 2, respectively. This results in Equation (4),

CCL(B | D) = CCL(B | D_1) × CCL(B | D_2), where
CCL(B | D_1) = ∏_{d∈D_1} P_B(X_d, Y_d = 1) / [ P_B(X_d, Y_d = 1) + P_B(X_d, Y_d = 2) ],
CCL(B | D_2) = ∏_{d∈D_2} P_B(X_d, Y_d = 2) / [ P_B(X_d, Y_d = 1) + P_B(X_d, Y_d = 2) ].   (4)
Replacing BN B with the two multinets B_1 and B_2 results in Equation (5),

CCL(B | D) = ∏_{d∈D_1} P_{B_1}(X_d) / [ P_{B_1}(X_d) + P_{B_2}(X_d) ] × ∏_{d∈D_2} P_{B_2}(X_d) / [ P_{B_1}(X_d) + P_{B_2}(X_d) ].   (5)
It will also be useful to swap the denominators, such that the CCL score is the product of
two ratios, each between differing datasets, as seen in Equation (6),
CCL(B | D) = ∏_{d∈D_1} P_{B_1}(X_d) / ∏_{d∈D_2} [ P_{B_1}(X_d) + P_{B_2}(X_d) ] × ∏_{d∈D_2} P_{B_2}(X_d) / ∏_{d∈D_1} [ P_{B_1}(X_d) + P_{B_2}(X_d) ]
           = PCCL(B_1 | D) × PCCL(B_2 | D),   (6)
where PCCL(B_1 | D) and PCCL(B_2 | D) are loosely said to represent the partial contributions of B_1 and B_2 to the CCL score for BN B. Since there are parameters for both Bayesian networks in each partial CCL term, the structure for each of the two Bayesian networks cannot be determined independently. Further, the summations in the PCCL's denominators prevent the score from decomposing into the aggregate score of individual DBN families.
As an alternative to using CCL, I propose the approximate conditional likelihood (ACL)
score given in Equation (7),
ACL(B | D) = PACL(B_1 | D) × PACL(B_2 | D) = ∏_{d∈D_1} P_{B_1}(X_d) / ∏_{d∈D_2} P_{B_1}(X_d) × ∏_{d∈D_2} P_{B_2}(X_d) / ∏_{d∈D_1} P_{B_2}(X_d).   (7)
Each PACL term is a likelihood ratio test of a BN on both the classes of data: its intra-
class data in the numerator and its extra-class data in the denominator. In order to
maximize each PACL term, the numerator should be as large as possible while the
denominator should be as small as possible. Thus, high ACL scores occur for BNs
whose intra-class data likelihood is high but whose extra-class data likelihood is low.
The only difference between the PACL terms in Equation (7) and the PCCL terms in
Equation (6) is that the summations in the PCCL denominators have been replaced by
single probabilities in the PACL denominators. This similarity is why I refer to ACL as
the approximate conditional likelihood. However, if ACL is considered as an
approximation to CCL, it is an unbounded one.
There are consequences for not including both probability terms in the denominators of
the ACL score that are present in the CCL score. If the likelihood of B_1 given D_1 is high, the P_{B_1}(X_d) term in the PCCL's denominator (Equation (6)) would lower the PCCL(B_2 | D) score. As this term does not exist in the ACL score, the PACL(B_2 | D) score is not lowered and may score the proposed structure too highly.
Even so, ACL has significant computational advantages over CCL. It is decomposable
while still favoring discriminating structures over high-likelihood structures, a feature
important to class-discrimination. In addition, closed form solutions exist for parameters
that maximize ACL.
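A sketch of the ACL computation in log space, assuming each multinet exposes a log_likelihood(d) method for a single data point (a hypothetical interface); it simply mirrors the ratio structure of Equation (7).

def log_acl(b1, b2, data_class1, data_class2):
    """log ACL(B|D) = log PACL(B1|D) + log PACL(B2|D), Equation (7).

    Each PACL term is the likelihood of a multinet on its intra-class data
    divided by its likelihood on its extra-class data."""
    log_pacl_b1 = (sum(b1.log_likelihood(d) for d in data_class1)
                   - sum(b1.log_likelihood(d) for d in data_class2))
    log_pacl_b2 = (sum(b2.log_likelihood(d) for d in data_class2)
                   - sum(b2.log_likelihood(d) for d in data_class1))
    return log_pacl_b1 + log_pacl_b2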
5.2 Maximum ACL parameters

To calculate the parameters that maximize the ACL score, notice that the BN parameters
compose sets of multinomials and that the binomial is a special case of the multinomial.
I conjecture that calculating the maximizing parameters is closely related to calculating
the maximizing parameters for what I will refer to as a binomial ratio test (BRT).
In a standard binomial, there are two parameters³, θ_a and θ_b = 1 − θ_a. Each gives the probability a RV takes on one of two possible values, a or b. The likelihood of the binomial model, M, given a set of data, D, is L(M | D) = θ_a^{N_a} × θ_b^{N_b}, where N_a is the number of times a was seen in D and N_b is the number of times b was seen in D. The MLE parameters are given by θ_a^{MLE} = N_a / (N_a + N_b) and θ_b^{MLE} = N_b / (N_a + N_b).

³ Since a binomial has a constraint that the sum of its two parameters is 1, it can be represented with a single parameter. Since I will be generalizing to a multinomial, it is helpful to represent it with two.
In the BRT, the data is divided into two classes of data, D={D
1,
D
2
}. The likelihood of
the BRT is given by Equation (8),
( ) ( )
( ) ( )
1 1
2 2
( | ),
a b
a b
a b
a b
M D
 
 
 × 
=
 × 

(8)
where \omega_\gamma^y is the number of times \gamma \in \{a, b\} was seen in class y. \theta_a and \theta_b must both be
greater than \epsilon, for some small value \epsilon, and \theta_a + \theta_b must equal 1. The BRT is a ratio of
two binomial likelihoods. The parameters that maximize the BRT can be determined
using Lagrange multipliers (see Appendix 9.4 for the derivation). This results in three
possible settings for the parameters. In the first case, \theta_a = \epsilon and \theta_b = 1 - \epsilon; in the second
case, \theta_a = 1 - \epsilon and \theta_b = \epsilon; in the third case,
\theta_a = (\omega_a^2 - \omega_a^1) / ((\omega_a^2 - \omega_a^1) + (\omega_b^2 - \omega_b^1)) and
\theta_b = (\omega_b^2 - \omega_b^1) / ((\omega_a^2 - \omega_a^1) + (\omega_b^2 - \omega_b^1)). The case that yields the highest BRT likelihood
corresponds to the maximizing parameters.
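A small sketch of the BRT maximization just described, assuming the counts \omega are available and that \epsilon is a user-chosen constant; the function name and interface are hypothetical. It evaluates the two boundary cases and, when it satisfies the constraints, the interior stationary point, and returns whichever setting maximizes the (log) BRT.

    import math

    def brt_max_params(w_a1, w_b1, w_a2, w_b2, eps=1e-3):
        # Return (theta_a, theta_b) maximizing the binomial ratio test, where
        # w_x1 / w_x2 are the counts of outcome x in class 1 / class 2.
        def log_brt(theta_a, theta_b):
            # Log of Equation (8): class-1 counts in the numerator,
            # class-2 counts in the denominator.
            return ((w_a1 - w_a2) * math.log(theta_a) +
                    (w_b1 - w_b2) * math.log(theta_b))

        candidates = [(eps, 1.0 - eps), (1.0 - eps, eps)]
        denom = (w_a1 - w_a2) + (w_b1 - w_b2)
        if denom != 0:
            ta = (w_a1 - w_a2) / denom
            tb = (w_b1 - w_b2) / denom
            # The interior stationary point is only a valid candidate when it
            # satisfies the constraints theta_a >= eps and theta_b >= eps.
            if ta >= eps and tb >= eps:
                candidates.append((ta, tb))
        return max(candidates, key=lambda t: log_brt(*t))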
A similar ratio test with more than two parameters can also be devised. I will refer to this
as the multinomial ratio test (MRT). The likelihood function for the MRT is given by
Equation (9). I conjecture that the maximizing parameters for the MRT are given by
Equation (10). I will prove (or disprove) this in my thesis.
\mathcal{L}(M | D) = \frac{\prod_{\gamma \in \Gamma} \theta_\gamma^{\omega_\gamma^1}}{\prod_{\gamma \in \Gamma} \theta_\gamma^{\omega_\gamma^2}},   (9)
\theta_\gamma^{MLE} = \frac{\omega_\gamma^2 - \omega_\gamma^1}{\sum_{\gamma' \in \Gamma} (\omega_{\gamma'}^2 - \omega_{\gamma'}^1)},   (10)
where \Gamma is the domain of the multinomial's RV. Equation (10) is actually only one case
of the MLE. The other cases remove an element \gamma from \Gamma and set \theta_\gamma to 0 if \omega_\gamma^2 > \omega_\gamma^1. As
a BN's CPT is merely a collection of multinomials, Equation (10) can be generalized for
CPT parameters, resulting in Equation (11),
\theta_{i,j,k} = \begin{cases} \beta\,(\omega_{i,j,k}^1 - \omega_{i,j,k}^2 + \lambda) & \text{if } \omega_{i,j,k}^1 - \omega_{i,j,k}^2 + \lambda > \epsilon \\ \epsilon & \text{otherwise,} \end{cases}   (11)
where \beta is a normalizing constant and \lambda is a Laplacian smoothing constant. I will refer to
setting the parameters via Equation (11) as the ACL-ML parameters.
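As a rough illustration of Equation (11), the sketch below sets one conditional multinomial (the column of a CPT corresponding to a single parent configuration) from intra-class and extra-class counts. The function name, the array-based interface, and the default values for \lambda and \epsilon are assumptions for illustration only.

    import numpy as np

    def acl_ml_multinomial(counts_intra, counts_extra, lam=1.0, eps=1e-3):
        # One conditional multinomial set via Equation (11): mass proportional
        # to the smoothed surplus of intra-class over extra-class counts;
        # entries without a surplus are floored at eps.
        counts_intra = np.asarray(counts_intra, dtype=float)
        counts_extra = np.asarray(counts_extra, dtype=float)
        surplus = counts_intra - counts_extra + lam
        keep = surplus > eps
        theta = np.full(counts_intra.shape, eps)
        if keep.any():
            # beta renormalizes the kept entries so the column sums to 1.
            beta = (1.0 - eps * np.count_nonzero(~keep)) / surplus[keep].sum()
            theta[keep] = beta * surplus[keep]
        return theta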
I conjecture that maximizing the parameters can be thought of as increasing a margin in
likelihood-space between the two classes. See Figure 7 for an example. Each point on
the graph indicates a single data point in likelihood space. Boxes represent the data
points d \in D_1 and ex's represent the data points d \in D_2. The x-axis is B_1's likelihood given d
and the y-axis is B_2's likelihood given d. If the BN with the highest score is used to
classify a data point, the classification boundary is the diagonal dashed line. Anything
below the line would be classified as a box and anything above, classified as an ex. The
classification boundary in the example would misclassify two points: the box above the
boundary and the ex below the boundary.
Assume that an event occurs more frequently in a BN's intra-class data than in its extra-
class data, i.e., \omega_{i,j,k}^1 > \omega_{i,j,k}^2, where D_1 is the intra-class data for B_1. I conjecture that
increasing the probability of the CPT element \theta_{i,j,k}^1 will increase the BN's likelihood
given its intra-class data more than the likelihood given its extra-class data. Intuitively, it
makes sense to increase these elements since the intra-class events occur more frequently
than the extra-class events. Using the ACL-ML parameters does exactly this. I also
conjecture that using the ACL-ML parameters significantly moves data points in
likelihood space away from the diagonal classification boundary and increases the chance
a data point will be correctly classified. I plan on proving (or disproving) this conjecture
in my thesis.
As motivation, Figure 8 gives an example for the CPT in BN B_1 for node b (which is
assumed to be the 1st node). Take the CPT element (b_2, a_4). This element corresponds to
two events in the intra-class data, but not to any events in the extra-class data. As such, it
should have high probability mass. On the other hand, the element (b_3, a_3) corresponds to
two intra-class events but also to two extra-class events, and the probability mass assigned
to it should be low. ACL-ML does this.
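Using the sketch given after Equation (11) with hypothetical counts that mirror this intuition (the full column counts of Figure 8 are not reproduced here), an element with an intra-class surplus receives most of the column's mass while elements whose extra-class counts match or exceed their intra-class counts are floored near \epsilon:

    # Hypothetical counts for one parent configuration (one CPT column),
    # child values b1..b4: intra-class counts vs. extra-class counts.
    theta = acl_ml_multinomial(counts_intra=[2, 0, 1, 0],
                               counts_extra=[0, 0, 1, 2])
    # Most of the column's mass goes to the first child value (intra surplus
    # of 2); the last value, whose extra-class count exceeds its intra-class
    # count, is floored at eps.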
5.3 Hierarchical Information

Within the neuro-anatomical domain, regions of interest (ROIs) are hierarchically related
via the contains and contained by relationships. In a strict hierarchy, an element may
only be contained by a single larger element. In a semi-hierarchy (Figure 9), an element
may be contained by more than one larger region. Elements that are contained by other
Figure 7. Class likelihood space. Boxes indicate the data points d \in D_1 and ex's indicate the data
points d \in D_2. The x-axis is \mathcal{L}(B_1 | d) and the y-axis is \mathcal{L}(B_2 | d). If the BN with the maximum
likelihood given a data point is used to classify the data point, the dashed diagonal line is the
classifier. Points crossing the line would be misclassified. Two data points, one in D_1 and
one in D_2, would be misclassified in this example.
Figure 8. Example CPT, P(B = k | A = j), for the BN B_1, A \rightarrow B. Node b is assumed to be the
node with index 1. An e_1 in the CPT element (a_j, b_k) indicates the occurrence of the event
\omega_{1,j,k}^1 in B_1's intra-class data. An e_2 indicates the occurrence of the event \omega_{1,j,k}^2 in B_1's
extra-class data. When training B_1's parameters, the CPT elements with many e_1's but few
e_2's should be given more probability mass than those with many e_2's and few e_1's. This
allows the parameters to incorporate class-discriminative information.
elements are h-children and elements that contain other elements are h-parents^4. The set
of RVs at the same hierarchical level is referred to as an h-level.
This hierarchy allows a system to be decomposed into multiple layers of granularity. The
system described in Figure 9 can be represented at a coarse level by RV A since all other
RVs are contained by A. At finer levels, the system can be represented as either {B C} or
{D E F}.
I propose that a hierarchical structure can be used to improve BN structure search since it
allows a system to be represented with varying numbers of RVs. Instead of learning one
BN for all the RVs, a single BN could be learned for the RVs in a single h-level.
The top h-level's BN would be trained first in an unconstrained manner. The candidate
parents for each node would consist of all other nodes in the top h-level. BN structures
for subsequent levels would be constrained by structures at the previous level. The only
links allowed would be between RVs whose h-parents were also linked. Using this
strategy makes the assumption that if two RVs are independent, their h-children will also
be independent. While this assumption is certainly not true in the general case, it
suggests a reasonable method to constrain structure searches at h-levels that cannot be
performed exhaustively.
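The constraint could be encoded roughly as follows. This is a sketch under the simplifying assumption of a strict hierarchy, and the data structures (an h-parent map and the parent sets chosen by the search at the level above) are hypothetical; it also allows links among RVs sharing the same h-parent, which is one possible reading of the constraint.

    def candidate_parents(nodes, h_parent, parent_links):
        # nodes: RVs at the current h-level.
        # h_parent: maps each RV to its h-parent at the level above.
        # parent_links: maps each higher-level RV to the set of higher-level
        #               RVs chosen as its parents by the previous search.
        # A node y is a candidate parent of x only if y's h-parent was linked
        # to (or is the same node as) x's h-parent in the level above.
        candidates = {}
        for x in nodes:
            allowed = parent_links.get(h_parent[x], set()) | {h_parent[x]}
            candidates[x] = [y for y in nodes if y != x and h_parent[y] in allowed]
        return candidates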
One issue that arises is the order in which structure search is performed at each h-level.
While I will not thoroughly investigate this issue in my proposed research, it is worth
mentioning. Should the structure search for a BN at a high h-level be fully completed
before searches at subsequent h-levels begin? Or, should the constraints implied by the


4
This "h-" notation is used because child and parent are terms used in Bayesian networks and a single ROI
may be part of both a hierarchy and a BN.
Figure 9. An example of a semi-hierarchy.
This is not a strict hierarchy since E has two
h-parents. The dashed lines separate the h-
levels in the hierarchy. Since the hierarchical
relationships are contains and contained by,
the sets of nodes {A}, {B C} and {D E F}
can each be thought of as a decomposition of
the system at differing levels of granularity.
Figure 10. Two methods for searching through the
structure state-space in a hierarchical search. Each
line represents a change in a structure; long lines
represent significant changes which correspond to
structure searches at high h-levels. a) For each
change made at the top h-level: propagate the
constraints to the lower h-level and perform structure
searches at the lower levels. b) Completely finish
the structure search at each level of the hierarchy
before proceeding to the next.
changes made to the structures in higher h-levels be propagated to the lower level
searches before fully completing the higher level searches? The two proposed methods
admit significantly different strategies for searching through structure space, as
visualized in Figure 10. Future work may find that each search pattern is better suited to
different domains.
A second advantage of using a hierarchy is the ability to set the nodes' CPT parameters at
subsequent h-levels based on CPT parameters at previous h-levels. For instance, in the
example in Figure 9, assume there is a wealth of information available on the RV B but
not much for RV D. Since B is D's h-parent, B's well-specified behavior can be used to
infer the relatively unspecified behavior of D. One previously used technique to achieve
this goal is shrinkage (Section 5.4).
There also seem to be parallels between what I suggest in hierarchical learning, and
relational learning, e.g. (Friedman, Getoor, Koller and Pfeffer 1999; Pfeffer 2001; Jaeger
1997; Poole 1993; Ng and Subrahmanian 1992; Kersting and Raedt 2000; Muggleton
2000; McGovern and Jensen 2003). I plan to further explore the similarities between my
hierarchical proposals and these methods in my thesis.
5.4 Shrinkage

To infer information about one RV based on another, I propose using shrinkage (Stein
1955). Shrinkage is a statistical technique that allows for the combination of low-
variance, high-bias data with high-variance, low-bias data. Take an example in the
neuroanatomical domain: the amygdala. It is a small region embedded within the uncus,
which is in turn embedded within the limbic lobe. As the amygdala is a small structure,
measurements of it are based on few voxels resulting in relatively high variance
measurements.
Conversely, both the uncus and limbic lobe are composed of a far greater number of
voxels, resulting in low-variance measurements. However, the activation of these
structures only roughly corresponds to the activation of the amygdala. I.e., they are a
biased source of information for the amygdala. Since the limbic lobe is both larger and
further abstracted from the amygdala than the uncus, it has an even smaller variance and
higher bias. Intuitively, it seems reasonable to complement high-variance data with
copious amounts of closely-related low-variance data. Shrinkage gives a mechanism for
doing this.
I have found three significant papers describing methods using shrinkage in the machine
learning literature. Anderson, Domingos and Weld (2002) employed shrinkage with
Relational Markov Models (an extension of standard Markov Models) in order to classify
a user's activity on internet web pages. McCallum et al. (1998) applied shrinkage to
multinomial models in order to improve text classification. Freitag and McCallum
(1999) applied shrinkage to hidden Markov Models (HMMs) in order to extract
information out of text documents.
Mathematically, shrinkage models are similar to mixture models. Both techniques
combine simpler models together to form a single more complex model. But they do so
in a significantly different fashion. An example differentiating the two techniques based
on a simple exponential model follows.
A single exponential model, parameterized by \lambda, is given by \lambda e^{-\lambda x}. For a mixture of m
exponential models, the collection of m parameters is \Lambda = \{\lambda_1, \ldots, \lambda_m\}. Equation (12)
shows the probability of a data point given a single model in the mixture. Equation (13)
shows the probability of all n data points given a single model. Equation (14) shows the
probability of all the data points given the full mixture of exponentials.
P(x_i | \lambda_j) = \lambda_j e^{-\lambda_j x_i}.   (12)
P(X | \lambda_j) = \prod_{i=1}^{n} \lambda_j e^{-\lambda_j x_i}.   (13)
(13)
( )
1 11
|,1.
j i
n
m m
x
j j j
j ji
P X e

  

= ==
 = =
 


(14)
Here i indexes the n data points and j indexes the m mixture components. The \lambda term in
a single exponential model would be trained via MLE or maximum a-posteriori (MAP) estimation.
Generally, the \alpha and \lambda terms in the mixture model would be learnt via expectation
maximization (EM).
In a shrinkage model, the equations for the probability of a single data point given a
single model and the probability of all the data given a single model remain the same as
in Equations (12) and (13). However, the probability of all the data given all of the
models changes, and is seen in Equation (15).
P(X | \Lambda) = \prod_{i=1}^{n} \tilde{\lambda} e^{-\tilde{\lambda} x_i}.   (15)
Equation (15) is identical to Equation (13) except the \lambda_j's have been replaced with \tilde{\lambda}'s.
Within a shrinkage model, the probability of the data given all of the models is computed
as the probability of all the data given a single model. Equation (16) details how to
compute the \tilde{\lambda} values.
\tilde{\lambda}_j = \alpha_j \lambda_j^{ml} + \sum_{k : \lambda_k \in HAnc(\lambda_j)} \alpha_k \lambda_k,   (16)
where \lambda_j^{ml} is the MLE (or MAP estimate if working in the Bayesian domain) for a single
exponential model based solely on the data and HAnc(\lambda_j) is the set of all h-ancestors of
the j-th exponential model in the hierarchy of exponential models. The \alpha values are learnt
via EM or some other parameter estimating procedure.
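A minimal sketch of Equation (16) for the exponential example, assuming the shrinkage coefficients \alpha have already been learnt (e.g., by EM) and are simply passed in; all names and the example numbers below are hypothetical.

    import numpy as np

    def shrunk_rate(lam_ml, ancestor_rates, alphas):
        # Equation (16): lam_tilde = alpha_0 * lam_ml + sum_k alpha_k * lam_k,
        # where lam_ml is the model's own MLE rate and ancestor_rates are the
        # rates of its h-ancestors. The alphas are assumed to sum to 1.
        alphas = np.asarray(alphas, dtype=float)
        rates = np.concatenate(([lam_ml], np.asarray(ancestor_rates, dtype=float)))
        return float(np.dot(alphas, rates))

    # Hypothetical example: a small region's own (high-variance) MLE rate is
    # shrunk toward the rates of its two h-ancestors.
    lam_tilde = shrunk_rate(lam_ml=2.4, ancestor_rates=[2.0, 1.5],
                            alphas=[0.6, 0.3, 0.1])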
One key difference between mixture and shrinkage models is that shrinkage models are
of the same form as the underlying model whereas mixture models aren't. For example,
with a mixture of exponentials, the resulting model is not an exponential model but
instead a linear combination of exponential models. The resulting shrinkage model
remains an exponential.
If a strict hierarchy relates a BN's RVs, the calculation of the CPT parameters is given by
Equation (17),
P(X_i | Pa(X_i)) = \alpha_1 P_{raw}(X_i | Pa(X_i)) + \sum_{i'=1}^{h-1} \alpha_{i'+1} P(HAnc_{i'}(X_i) | HAnc_{i'}(Pa(X_i))),   (17)
where the \alpha's are shrinkage coefficients, h equals the h-level X_i is in, and HAnc_{i'}(X) is the
i'-th h-ancestor of X. This form of shrinkage makes the assumption that RV X_i's
relationship with its parents will be similar to the relationship between its h-ancestors and
its parents' h-ancestors. The extent to which a RV's parent relationship is similar to each
of its h-ancestors' relationships is quantified by the \alpha_{i'} shrinkage coefficients.
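A sketch of Equation (17) for a strict hierarchy, under the additional assumption that a node's raw CPT and the CPTs relating its h-ancestors to its parents' h-ancestors share the same shape (the same arities); the interface is hypothetical.

    import numpy as np

    def shrunk_cpt(raw_cpt, ancestor_cpts, alphas):
        # Mix the raw CPT estimated from data with the CPTs of the node's
        # h-ancestors (given their parents' h-ancestors), weighted by the
        # shrinkage coefficients, which are assumed to sum to 1.
        cpts = [np.asarray(raw_cpt, dtype=float)] + \
               [np.asarray(c, dtype=float) for c in ancestor_cpts]
        return sum(a * c for a, c in zip(alphas, cpts))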
A significant issue arises when dealing with semi-hierarchies. In a semi-hierarchy, the
h-parent of a RV may actually be a set of other RVs (Figure 11). Equation (18) gives the
calculation of shrinkage CPTs for a semi-hierarchy.
P(X_i | Pa(X_i)) = \alpha_X P_{raw}(X_i | Pa(X_i)) + \sum_{i'=1}^{h-1} \sum_{(S,R) \in H_{i'}} \alpha_{S,R} P(S | R),
H_{i'} = \{ (S, R) : S \in HAnc_{i'}(X_i), R \in HAnc_{i'}(Pa(X_i)) \}.   (18)
There is a single term in (18) for every possible combination of X's h-ancestors and
Pa(X)'s h-ancestors. In this case, the shrinkage model assumes a RV's behavior is
similar to the multiple behaviors of all possible combinations of its h-parents and its
Figure 11. a) An example DBN. The dashed line separates the DBN RVs between the first and second
h-levels of a hierarchy. b) An example hierarchy for the RVs in the DBN. The boxes are square to
indicate they are a hierarchical structure as opposed to a BN. c) A Venn diagram depicting the
relationship between the RVs in the hierarchy. Assume that shrinkage is being applied to calculate the
CPT for node D_{t+1}. The possible corresponding CPTs that would be included as terms in Equation (18)
would contain the cross product between any two parent elements from {A_t, B_t, C_t} and any one child
element from {A_{t+1}, B_{t+1}}, resulting in a total of six terms. If the terms were constrained by the results
of a structure search on the first level, then there is only a single valid term: the CPT corresponding to
the family with A_t and B_t as parents and A_{t+1} as a child. All other possible combinations of parents and
children are not represented in the structure at the first h-level.
parents' h-parents. Depending on the number of h-parents for the nodes in the semi-
hierarchy, the number of H_{i'} terms could increase quickly.
There are several ways to deal with this issue. First, a method (yet to be determined) to
reduce the number of terms in H_{i'} could be devised by selecting which h-ancestors were
most important. Second, if the structure search for one level is constrained by the
structure search from a previous level (as described in Section 5.3), the H_{i'} terms in (18)
could be limited to terms containing RVs that were linked in previous h-levels.
6 Research Contributions
In this section, I succinctly iterate the contributions to neuroscience and machine learning
I will make in my thesis. My main contribution to neuroscience will be to devise a
method capable of indicating how the temporal relationships among neuroanatomical
regions change due to the onset of mental illness. The method will be applied to fMRI
data, but will be robust enough to be applied to other datasets relatively easily.
To accomplish this goal, I will model the fMRI data with dynamic Bayesian networks
(DBNs). I will augment DBN structure search with the three main methods proposed in
Section 5.

1) The approximate conditional likelihood (ACL) structure score and methods
for training parameters to maximize this score.
2) A hierarchical modification to standard BN structure search.
3) The incorporation of shrinkage in CPT parameter estimation.

I will validate the results of my techniques with the following four methods.

1) Domain experts will verify that results in the neuroscience domain are
meaningful and will hopefully find previously unknown and interesting
relationships among RVs.
2) A bootstrap confidence measure will be applied to show that the results are
not due to chance.
3) Simulated data with controllable RV relationships will be created and the
ability for my techniques to find the known relationships will be gauged.
4) My algorithms will be tested on well known and commonly accepted
examples from data repositories. This validation method is complicated by
the apparent absence of commonly used DBN benchmark data, but I will
attempt to locate appropriate databases.

The hierarchical search method I propose makes a specific assumption: the absence of a
correlation among RVs at a high h-level implies the absence of a correlation at lower h-
levels. I will test whether or not this assumption holds in the fMRI data.
7 Acknowledgements
I would like to thank my advisor, Terran Lane, for the large amount of guidance he
provided in my research; Vincent Clark, the collaborating neuroscientist I have been
working with, who has spent significant amounts of time and effort with me on my
research; and Shibin Qiu and Hamilton Link for developing the SVM and GNBN
classifiers (respectively) that I used to compare the DBN classification accuracy to in my
experimental work described in Section 3.3.
8 References
Akaike, H. (1973). Information Theory and an Extension of the Maximum Likelihood
Principle. In Second International Symposium on Information Theory. B.N. Petrov
and F. Csaki (eds.), Academiai Kiado, Budapest, pp. 267-281.
Azari, N. P., Pettigrew, K. D., Schapiro, M. B., Haxby, J. V., Grady, C. L., Pietrini, P.,
Salerno, J. A., Heston, L. L., Rapoport, S. I. and Horwitz, B. (1993). Early Detection
of Alzheimer's Disease: A Statistical Approach Using Positron Emission
Tomographic Data. Journal of Cerebral Blood Flow and Metabolism. v. 13, pg. 438-
447.
Anderson, C., Domingos, P. and Weld, D. (2002) Relational Markov models and their
application to adaptive web navigation. International Conference on Knowledge
Discovery and Data Mining (KDD). 143-152.
Baskiotis, N., Sebag, M. (2004) C4.5 Competence Map: a Phase Transition-inspired
Approach. International Conference on Machine Learning, 21, 73-80.
Bilmes, J.A. (2000). Dynamic Bayesian Multinets. Proceedings of the 16th Conference
on Uncertainty in Artificial Intelligence. pg. 38-45.
Bilmes, J., Zweig, G., Richardson, T., Filali, K., Livescu, K., Xu, P., Jackson, K.,
Brandman, Y., Sandness, E., Holtz, E., Torres, J., & Byrne, B. (2001).
Discriminatively structured graphical models for speech recognition (Tech. Report).
Center for Language and Speech Processing, Johns Hopkins Univ., Baltimore, MD.
Buchel, C., Friston, K.J. (1997) Modulation of connectivity in visual pathways by
attention: cortical interactions evaluated with structural equation modelling and fMRI.
(sic) Cerebral Cortex, Vol 7, 768-778.
Buntine, W. (1991). Theory Refinement on Bayesian Networks. Proceedings of the
Seventh Conference on UAI. 52-60.
Buckner, R. L., Snyder, A., Sanders, A., Marcus, R., Morris, J. (2000). Functional Brain
Imaging of Young, Nondemented, and Demented Older Adults. Journal of Cognitive
Neuroscience, 12, 2. 24-34.
Burge, J., Clark, V.P., Lane, T., Link, H., Qiu, S. (2004). Bayesian Classification of
FMRI Data: Evidence for Altered Neural Networks in Dementia. Submitted to
Human Brain Mapping. Temporary alternate: Tech Report TR-CS-2004-28,
University of New Mexico.
Cooper, G., & Herskovits, E. (1992). A Bayesian method for the induction of
probabilistic networks from data. Machine Learning, 9, 309-347.
Cox, R.W. (1996) Software for Analysis and Visualization of Functional Magnetic
Resonance Imaging Neuroimages. Computers and Biomedical Research. v. 29, pg.
162-173.
Charniak, E. (1991). Bayesian Networks Without Tears. AI Magazine, 12 (4).
Chickering, D., Geiger, D., Heckerman, D. (1994). Learning Bayesian Networks is NP-
Hard (Technical Report MSR-TR-94-17). Microsoft.
Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York,
NY: Wiley.
Friston, K.J., Harrison, L. and Penny, W. (2003). Dynamic causal modeling.
NeuroImage 19, 1273-1302.
Friston, J. et al. (1995). Statistical Parametric Maps in Functional Imaging: A General
Linear Approach. Human Brain Mapping 2, 189-210.
Friston, J. (2002). Online document. Statistical Parametric Mapping. Available at
http://www.fil.ion.ucl.ac.uk/spm/papers/SPM-Chapter.pdf.
Freitag, D. McCallum, A. (1999) Information Extraction with HMMs and Shrinkage.
Proceedings of the Sixteenth National Conference on Artificial Intelligence: Workshop
on Machine Learning for Information Extraction. Orlando, FL, pp. 31-36
Friedman, N. (1998). The Bayesian Structural EM Algorithm. Uncertainty in Artificial
Intelligence.
Friedman, N., Getoor, L., Koller, D., Pfeffer, A. (1999) Learning Probabilistic Relational
Models. International Joint Conferences on Artificial Intelligence. 1300-1309
Friedman, N., Linial, M., Nachman, I. and Pe'er, D. (2000) Using Bayesian networks to
analyze expression data. Research in Computational Molecular Biology (RECOMB).
p. 127-135.
Geiger, D., Heckerman, D. (1996). Knowledge representation and inference in similarity
networks and Bayesian Multinets. Artificial Intelligence. v. 82, pg. 45-74.
Grady, C. L., Furey, M. L., Pietrini, P., Horwitz, B., & Rapoport, S. L. (2001). Altered
brain functional connectivity and impaired short-term memory in Alzheimer's disease.
Brain, 124, 739-756.
Goldenberg, A., Moore, A. (2004) Tractable Learning of Large Bayes Net Structures
from Sparse Data. International Conference on Machine Learning, 21, 345-352
Grossman, D., Domingos, P. (2004) Learning Bayesian Network Classifiers by
Maximizing Conditional Likelihood. International Conference on Machine Learning,
21. 361-368
Hastie, T., Tibshirani, R., Friedman, J. (2001). The Elements of Statistical Learning:
Data Mining, Inference, and Prediction. Springer Verlag. Q325.75.F75 2001.
Heckerman, D. (1991). Probability Similarity Networks. MIT Press, 1991.
Heckerman, D., Geiger, D., Chickering, D.M. (1995). Learning Bayesian Networks: the
Combination of Knowledge and Statistical Data. Machine Learning, 20, 197-243.
H jen-S rensen, P., Hansen, L., Rasmussen, C. (2000). Bayesian Modelling of fMRI
Time Series. In S.A. Solla, T.K. Leen and K. R. Muller (eds.), Advances in Neural
Information Processing Systems. MIT Press.
Jaeger, M. (1997) Relational Bayesian Networks. Uncertainty in Artificial Intelligence.
pg. 266-273.
Jensen, F. V., (2001). Bayesian Networks and Decision Graphs. Springer-Verlag, New
York.
Kersting, K., Raedt, L.D. (2000) Bayesian Logic Programs. Proceedings of the Work-in-
Progress Track at the 10th International Conference on Inductive Logic
Programming. Pg. 138-155.
Larrañaga, P., Poza, M., Yurramendi, Y., Murga, R.H., Kuijpers, C.M.H. (1996).
Structure Learning of Bayesian Networks by Genetic Algorithms: A Performance
Analysis of Control Parameters. IEEE Journal on Pattern Analysis and Machine
Intelligence. v 18, s. 9, 912-926.
Lancaster JL, Woldorff MG, Parsons LM, Liotti M, Freitas CS, Rainey L, Kochunov PV,
Nickerson D, Mikiten SA, Fox PT. (2000). Automated Talairach Atlas labels for
functional brain mapping. Human Brain Mapping 10,120-131.
Mazziotta, J., Toga, A., Evans, A., Fox, P., Lancaster, J., Zilles, K., Woods, R., Paus, T.,
Simpson, G., Pike, B., Holmes, C., Collins, L., Thompson, P., MacDonald, D.,
Iacoboni, M., Schormann, T., Amunts, K., Palomero-Gallagher, N., Geyer, S.,
Parsons, L., Narr, K., Kabani, N., Goualher, G., Feidler, J., Smith, K., Boomsma, D.,
Pol, H. H., Cannon, T., Kawashima, R., Mazoyer, B. (2001) A Four-Dimensional
Probabilistic Atlas of the Human Brain. Journal of the American Medical Informatics
Association, v. 8, n. 5, p. 401-430.
McCallum, A., Rosenfeld, R., Mitchell, T. and Ng, A. Y. (1998) Improving Text
Classification by Shrinkage in a Hierarchy of Classes. International Conference on
Machine Learning.
McGovern, A., Jensen, D. (2003) Identifying Predictive Structures in Relational Data
Using Multiple Instance Learning. International Conference on Machine Learning.
Mitchell, T., Hutchinson, R., Just, M., Niculescu, R.S., Pereira, F., Wang, X. (2003).
Classifying Instantaneous Cognitive States from fMRI Data. American Medical
Informatics Association Annual Symposium.
Moore, A., Wong, W. (2003) Optimal Reinsertion: A new search operator for accelerated
and more accurate Bayesian network structure learning. Proceedings of the 20th
International Conference on Machine Learning, 2003
Muggleton, S. (2000) Learning Stochastic Logic Programs. Proceedings of the AAAI
2000 Workshop on Learning Statistical Models from Relational Data.
Murphy, K. (2002). Dynamic Bayesian Networks: Representation, Inference and
Learning. PhD dissertation, Berkeley, University of California, Computer Science
Division.
Ng, R., Subrahmanian, V.S. (1992). Probabilistic Logic Programming. Information and
Computation, v. 101, n. 2, pg 150-201.
Pearl, J. (1986) Fusion, Propagation and Structuring in Belief Networks. Artificial
Intelligence, v. 29, n. 3, 241-288
Penny, W.D., Stephan K.E., Mechelli, A., Friston, K.J. (2004) Modelling Functional
Integration: A Comparison of Structural Equation and Dynamic Causal Models. (sic)
Neuroimage v 23, supplement 1, pg. S264-S274.
Pfeffer, A. (2001) IBAL: An Integrated Bayesian Agent Language. International Joint
Conference on Artificial Intelligence.
Poole, D. (1993) Probabilistic Horn Abduction and Bayesian Networks. Artificial
Intelligence, v. 61, n. 1, pg. 81-129.
Rabiner, L.R. (1989) A tutorial on Hidden Markov Models and Selected Applications in
Speech Recognition. Proceedings of the IEEE, v. 77, n. 2, pg. 257-286.
Schwarz, G. (1978) Estimating the dimension of a model. Annals of Statistics, 6, 461-
464.
Stein, C. (1955) Inadmissibility of the usual estimator for the mean of a multivariate
normal distribution. Proceedings of the Third Berkeley Symposium on Mathematical
Statistics and Probability. v. 1, pg. 197-206. University of California Press.
Worsley, K. and Friston, K. (1995). Analysis of fMRI time-series revisited - again.
Neuroimage, 2, 173-181.
9 Appendices
9.1 Bayesian networks

A Bayesian network is a compact representation of a joint probability distribution (JPD)
over a set of n random variables (RVs), X_1, X_2, \ldots, X_n. The JPD is the function
P(X_1, X_2, \ldots, X_n) giving the probability that the RVs take on each combination of values
in their domains. When the variables have a finite discrete domain, as is the case with the
Bayesian networks I employ, the JPD can be represented by an n-dimensional table.
Unfortunately, as the number of RVs in a system increases, the size of the JPD grows
exponentially, quickly becoming too large to represent explicitly. Using the chain rule of
probability, the JPD can be rewritten as a multiplicative series of conditional
probabilities, as seen in Equation (19).

P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i | X_{i+1}, X_{i+2}, \ldots, X_n).   (19)
Recall that if two variables, X_a and X_b, are statistically independent, then P(X_a | X_b) = P(X_a).
Using the chain rule and assumptions of independence, the number of terms
required to represent the JPD can be dramatically reduced. Take, for instance, a system
with 3 variables: A, B and C. Using the chain rule of probability,
P(A, B, C) = P(A | B, C) P(B | C) P(C). If A and B were assumed to be independent, then
P(A | B, C) = P(A | C) and the JPD could be represented by the equation
P(A | C) P(B | C) P(C), which has fewer terms than the full JPD. In this example, the
reduction in terms is minor but in a system with many variables and (more importantly)
many conditional independence assertions, the reduction can be dramatic.
A Bayesian network is a graphical system that encodes conditional dependencies between
variables which can be used as a compact alternate representation of a JPD. Specifically,
it is a directed graph with nodes and arcs connecting the nodes. The nodes represent
variables in the system being modeled and the arcs represent causal relationships between
them. Figure 12 gives a hypothetical example of a Bayesian network that models the
price of tea in China. The arcs in the graph indicate that the price is directly influenced
(stochastically) by both the supply and the demand for tea in China. The absence of an
arc from the Current Lunar Cycle to any of the other nodes in the graph indicates that the
current lunar cycle does not statistically influence the price of tea in China. The Price of
Tea in China node is considered a child node with a parent set that includes the nodes
Demand for tea in China and Supply of tea in China.
Each node represents a random variable with a specific domain. In the example, these
domains could be {high or low} for the demand, supply and price of tea in China, {less
than 50, more than 50 } for the annual rain fall and {new, waxing, full, waning} for the
current lunar cycle. Each of these domains is finite and discrete, though a Bayesian
network can model continuous or infinitely-sized discrete RVs as well.
Associated with each node is a set of conditional probabilities. In the discrete case, these
probabilities are typically represented by a conditional probability table (CPT) and in the
continuous case, are typically represented by conditional probability distribution
functions, such as the conditional Gaussian distribution. The CPT for the Price of Tea in
China node is shown in Figure 12b. CPTs can also be thought of as collections of
multinomial distributions, with a single multinomial describing a node's behavior given a
unique setting of the node's parents.
The structure for a Bayesian network can also be thought of as a collection of families,
with one family for every node. Included in a node's family are all of the parents of that
node (each family contains a single child node and zero or more parent nodes). For
instance, the family for the Price of Tea in China node in the BN in Figure 12 contains
the child node Price of Tea in China as well as the parent nodes {Demand for tea in
China, Supply of tea in China}.
Formally, a Bayesian network is described by the pair (B_S, B_O), where B_S is the graphical
structure of the network and B_O specifies the conditional probability information of the
BN. B_S contains a set of nodes for every RV in the system, X = {X_1, X_2, \ldots, X_n}, and an
arc for every causal dependency between RVs. In the example given in Figure 12, B_S =
[Figure 12a contains five nodes: Annual rain fall in China, Demand for tea in China, Supply of tea in
China, Price of tea in China and Current Lunar Cycle. Figure 12b, the CPT for the Price of tea in China
node (columns: Demand; rows: Price):

Supply = High:  P(Price=High | Demand=High) = 0.50, P(Price=High | Demand=Low) = 0.25,
                P(Price=Low | Demand=High) = 0.50, P(Price=Low | Demand=Low) = 0.75.
Supply = Low:   P(Price=High | Demand=High) = 0.95, P(Price=High | Demand=Low) = 0.35,
                P(Price=Low | Demand=High) = 0.05, P(Price=Low | Demand=Low) = 0.65.]
Figure 12. a) Example of a hypothetical Bayesian network modeling the price of tea in China. The
directed arcs indicate that the price of tea is directly influenced by both the demand and the supply of
tea, and indirectly influenced by the annual rain fall in China. Conversely, the absence of an arc
indicates that the current lunar cycle has no statistical effect on the price of tea in China. b) The
conditional probability table for the Price of tea in China node. The table indicates, among other
things, that if the supply is high and the demand is low then 25% of the time the price will be high and
75% of the time the price will be low. Note that the Price node has two parents, resulting in a 3-
dimensional conditional probability table. There are also CPTs for all other nodes in the graph (not
shown).
{Annual rain fall in China, Demand for tea in China, Supply of tea in China, Price of tea
in China, Current Lunar Cycle}. It would also contain the three arcs connecting these
nodes. The parents of node X_i are denoted Pa(X_i), which can assume q possible
configurations where q is given by Equation (20),

q = \prod_{j=1}^{k} r_j,   (20)

where k is the number of parents and r_j is the arity of the j-th parent. For the node Price of
tea in China, Pa(X_i) = {Demand for tea in China, Supply of tea in China}.
B_O is the set of CPTs for all of the nodes in B_S and encodes the degree and direction in
which the nodes in Pa(X_i) influence X_i. If a node has k parents, i.e. |Pa(X_i)| = k, then
the CPT for that node will have k+1 dimensions, specifying
P(X_i | Pa(X_i)) = P(X_i | Pa_1(X_i), Pa_2(X_i), \ldots, Pa_k(X_i)), where Pa_\gamma(X_i) is the \gamma-th parent of X_i. A significant
advantage to using CPTs is the ability to model complex non-linear relationships.
However, as the number of parents in Pa(X_i) grows, the number of parameters in the CPT
grows exponentially, thus fundamentally limiting the number of parents each random
variable can reasonably have.
The formulation of the DBN search, and BN search in general, is a well-understood
problem (Heckerman 1995). Ideally, every possible structure in the DBN would be
individually constructed and the probability that structure represented the data would be
determined. Two significant problems arise. First, a network score must be devised
capable of indicating how well a given structure represents the relationships embedded
among variables in the data. Second, as structure search is NP-hard, it is not feasible
to construct every possible structure.
Several scoring methods have been proposed including AIC (Akaike 1973), BIC (Schwarz
1978), BDEU (Buntine 1991), BD (Cooper and Herskovits 1992) and BDE (Heckerman
1995). Likelihood itself can also be used as a score. One of the most commonly used
scores is Heckerman's BDE score, which has several significant advantages, including its
ability to easily incorporate prior knowledge and that it assigns two structures encoding
the same set of conditional dependencies the same BDE score (commonly referred to as
likelihood equivalence). Like most scoring methods, the BDE score is composed of the
aggregation of the scores of each of the families in the network. Setting up a score this
way has the significant advantage of allowing the BN structure search to be broken up
into smaller independent structure searches for each of the families.
Equation (21) gives the equation for calculating the scoring of the entire network, and
(22) gives the score for a single family in the network,
P(D, B_S | \xi) \propto \prod_{i=1}^{n} \rho(X_i, Pa(X_i)) \times score(X_i, Pa(X_i), D),   (21)

score(X_i, Pa(X_i), D) = \prod_{j=1}^{q_i} \left( \frac{\Gamma(N'_{i,j})}{\Gamma(N'_{i,j} + N_{i,j})} \prod_{k=1}^{r_i} \frac{\Gamma(N'_{i,j,k} + N_{i,j,k})}{\Gamma(N'_{i,j,k})} \right),   (22)
where \rho(X_i, Pa(X_i)) is a structural prior allowing certain structures to be considered more
likely than other structures. \Gamma(\cdot) is the gamma function. N_{i,j} is a count of the number of
times the parent set of X_i is in configuration j. N_{i,j,k} is the count of the number of times
the parent set of X_i is in configuration j while X_i = k. \xi refers to the set of prior
information, \{N'_{i,j}, N'_{i,j,k}, \rho(X_i, Pa(X_i))\} \; \forall i, j, k.
Prior information influencing conditional probabilities is allowed in the form of the N'
virtual counts, which are analogous to N except they are not counted from dataset D but
instead represent a hypothetical prior dataset that describes how variables in the system
are thought to behave in relation to each other. For instance, if we wanted to indicate that
the Price of tea in China should be low when the Demand for tea in China is high and the
Supply of tea in China is high, we could set the N'_{i,j,k} count to a positive number where X_i
= Price of tea in China, j = {high, high} and k = low. The magnitude of the number
would indicate the degree to which we believed this would be the case. Networks
connecting the Price of tea in China to either the Supply of tea in China or the Demand
for tea in China would then receive a higher score even without sufficient supporting
data. Specifying an uninformative prior is achieved by setting \rho(X_i, Pa(X_i)) = 1, N'_{i,j,k} to 1
and N'_{i,j} = \sum_{k=1}^{r} N'_{i,j,k} \; \forall i, j, k, where r is the number of states node X_i can be in.
While Heckerman's derivation of the BDE is somewhat complex, in a sense, the BDE
score measures the amount of uniformity (or entropy) in the CPTs' multinomial
distributions. The less uniform the distribution, the more informative the parents are and
the higher the BDE score will be.
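For concreteness, a sketch of the (log) family score of Equation (22) using the log-gamma function; the array-based interface is an assumption, and the full BDE score would further multiply in the structural prior and aggregate over families as in Equation (21).

    import numpy as np
    from scipy.special import gammaln

    def log_bde_family_score(N, N_prime):
        # Log of Equation (22) for one family. N and N_prime are q_i-by-r_i
        # arrays of data counts and virtual prior counts, indexed by parent
        # configuration j and child value k.
        N = np.asarray(N, dtype=float)
        N_prime = np.asarray(N_prime, dtype=float)
        Nj = N.sum(axis=1)               # N_{i,j}
        Nj_prime = N_prime.sum(axis=1)   # N'_{i,j}
        return float(np.sum(gammaln(Nj_prime) - gammaln(Nj_prime + Nj)) +
                     np.sum(gammaln(N_prime + N) - gammaln(N_prime)))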
Given a fully parameterized Bayesian network, it is possible to compute the posterior
likelihood that the Bayesian network describes a set of data. The posterior likelihood is
given in Equation (23),
\mathcal{L}(B | D) = \prod_{j=1}^{m} \prod_{i=1}^{n} P(X_i = D_i^j \,|\, Pa(X_i) = D_{Pa(X_i)}^j),   (23)
where D is a set of m data points in which each data point contains a value for each node
in the DBN, D_i^j is the value of the i-th node in the j-th data point and D_{Pa(X_i)}^j is the set of
values of the parent variables of the i-th node in the j-th data point. Given two Bayesian
networks and a set of data, the Bayesian network that more accurately describes that data
is defined to be the one with the higher posterior likelihood.
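A sketch of Equation (23) in log space; the dictionary-based representation of data points and CPTs below is purely illustrative.

    import math

    def log_posterior_likelihood(data, cpts, parents):
        # data: list of dicts mapping node -> value.
        # parents: dict mapping node -> tuple of parent nodes.
        # cpts: dict mapping node -> dict keyed by (parent-value tuple, value)
        #       giving P(node = value | parents = parent values).
        total = 0.0
        for d in data:
            for node, pa in parents.items():
                pa_vals = tuple(d[p] for p in pa)
                total += math.log(cpts[node][(pa_vals, d[node])])
        return total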
9.2 Dynamic Bayesian Networks

The class of BNs that explicitly model temporal processes are referred to as dynamic
Bayesian networks (DBNs). The topological structure for a DBN is divided into columns
of RVs, where each column represents a time frame in the process being modeled and
each system variable is represented by a single node in each of the columns. Causal
relations are allowed to connect nodes between columns, provided the direction of the
causality goes from a past time step to a future time step. Figure 13 (left) gives an
example of the structure of a DBN. The example lists a single possible structure, but
there are a super-exponential
5
number of other structural possibilities.


5
In the number of nodes in the BN.
Ideally, a DBN would contain one column of variables for each time step in the system
being modeled. However, this is computationally intractable for all but the simplest
models. To reduce the complexity of the DBN model, several assumptions are typically
made. The first assumption is the Markov assumption which requires that the values a
variable takes on at time t can be accurately predicted (in a stochastic sense) from only
knowledge of its parent variables' values from time t-m to time t, for some bounded
history length m
6
. The Markov assumption is expressed in Equation (24).
The second assumption is the stationary assumption, which requires that the dynamic behavior of
a variable from time t to time t+1 be independent of t, given in Equation (25).
where X_t is a set of all variables in the column representing time step t. Adhering to
both the stationary and first order Markov assumptions results in a DBN composed of
two columns of n variables where each of the columns no longer models a specific time


6
Frequently, m is set to 1. When it is, the assumption is referred to as a first order Markov assumption.
P(X_t | X_{t-1}, X_{t-2}, \ldots, X_1) = P(X_t | X_{t-1}, X_{t-2}, \ldots, X_{t-m}),   (24)

P(X_{t+1} | X_t) = P(X_{t'+1} | X_{t'}), \quad \forall t, t'.   (25)
Figure 13. (left) A possible DBN topology for two columns of RVs. The left column represents the
values of RVs at time t and the right column represents the values of neuroanatomical regions at time
t+1. The dashed lines indicate that DBN topology is generally not known, and must be determined.
The horizontal ellipses indicate this is just a portion of a bigger DBN. (right) The static structure for the
GNBN. There is a single discrete class variable whose CPT is represented with a multinomial, and n
observation variables whose CPTs are represented as a set of Gaussian distributions (one Gaussian for
each value the Class node can take on). The GNBN does not require a structure search.
point in the system, but instead models the average dynamic behavior of the system
across all time points.
The assumptions made by the DBN are far less restrictive than the assumptions made by
the GNBN and allow for a more sophisticated representation of the data. This causes
DBNs to be significantly more complex than GNBNs and introduces new challenges
when DBNs are used as a modeling framework. Most significantly, if the causal
structure of the DBN is unknown, as is the case in my work, the structure must be learnt.
Structure learning is a known NP-Hard problem (Chickering, Geiger and Heckerman
1994).
Figure 14 gives a simple example of how DBNs can be used to model fMRI data. In the
example are three ROIs: the amygdala, parietal lobe and BA 40. A hypothetical time
series based on the Highlow quantization is given for each region. The DBN family with
the BA 40 Time t+1 random variable node as a child is shown with three Time t candidate
parent nodes. Proposing each possible Time t node as the sole parent of the BA 40 Time
t+1 node results in the corresponding CPTs describing the statistical relationship between
the parent's values and the child's values at the next time step, averaged across all
consecutive pairs of time points. For example, in the top CPT describing the relationship
[Figure 14 content: a) Hypothetical time series for the three ROIs. b) BA 40 Time t+1 family structure
with three candidate parents: Amygdala Time t, BA 7 Time t and BA 40 Time t. c) Possible CPTs for
the BA 40 Time t+1 node (columns: parent value High/Low; rows: child value High/Low):
Amygdala time t parent: 0.66, 0 / 0.33, 1; BA 40 time t parent: 0.33, 0.5 / 0.66, 0.5;
BA 7 time t parent: 1, 0 / 0, 1.]
Figure 14. Example of a DBN modeling three ROIs. a) Hypothetical ROI average voxel activation
time series for each of the regions. b) DBN family for the BA 40 Time t+1 node with three candidate
parents: Amygdala Time t, BA 7 Time t, and BA 40 Time t. c) The CPTs for the BA 40 Time t+1 node
corresponding to each of the three candidate parents. Each entry in the CPTs is the probability the
child node takes on a value based on the previous value of its parent, averaged over all time points.
The columns correspond to parent values and the rows correspond to child values. The less uniform
the probabilities in a CPT are, the better a predictor the parent makes for the child. In this case, the best
predictor for the child node is the BA 7 Time t node, as its value gives perfect predictive ability of the
BA 40 Time t+1
node.

between the Amygdala Time t node and the BA 40 Time t+1 node, the bottom-left entry
of .33 indicates that when the amygdala is high, BA 40 has a 33% chance of being low in
the next time step.
Each column in a CPT is a single multinomial distribution detailing the behavior of the
child for a specific setting of the parents. The less uniform each of the multinomials is,
the better predictor the parent is of the child. As can be readily seen in the example, the
least uniform CPT results when the BA 7 Time t node is the parent and thus BA 7 is
considered the best predictor of the subsequent value of BA 40.
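A sketch of this procedure: estimate the CPT for one candidate parent from a pair of quantized time series and measure how far its conditional multinomials are from uniform, as a rough proxy for predictive value. The function names are hypothetical, the quantized values are assumed to be integer-coded, and every parent value is assumed to occur at least once in the series.

    import numpy as np

    def pairwise_cpt(parent_series, child_series, n_values=2):
        # Estimate P(child_{t+1} = k | parent_t = j) from two quantized time
        # series, averaged over all consecutive pairs of time points.
        counts = np.zeros((n_values, n_values))
        for j, k in zip(parent_series[:-1], child_series[1:]):
            counts[j, k] += 1
        return counts / counts.sum(axis=1, keepdims=True)

    def nonuniformity(cpt):
        # Mean squared deviation of each conditional multinomial (one per
        # parent value) from the uniform distribution; larger values suggest
        # a more informative parent.
        return float(np.mean((cpt - 1.0 / cpt.shape[1]) ** 2))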
This simple example only includes pairwise relationships, but CPTs are capable of
modeling more complex higher-order interactions among many parent and child
nodes. It is the ability of a multinomial CPT to describe complex non-linear relationships
that sets it apart from many continuous models (such as linear, bi-linear and Gaussian
models).
Ideally, a DBN modeling the temporal interaction among ROIs in fMRI data would
contain T columns of n nodes, where n is the number of ROIs and T is the number of time
steps in the experiment (i.e., the number of fMRI images). Each node would represent
the average activation of a single ROI at a specific time step and an arc between the
nodes would indicate a temporal causal relationship between the two ROIs. For example,
an arc from node X_i^t to node X_j^{t+1} would indicate a relationship between the value of the
i-th ROI at time step t and the value of the j-th ROI at time step t+1. Unfortunately, learning
the structure (i.e., which links to include) for such a DBN is intractable. To reduce the
size of the model, I make both the first order Markov and stationary assumptions
(Section 9.2). This reduces the DBN to two columns, each with n nodes. Nodes in each
of the columns no longer represent the values of ROIs at absolute time frames, but
instead represent the statistical dynamics of ROIs averaged across all time frames.
A commonly used special case of DBNs is the hidden Markov model (HMM) which
models the behavior of a hidden RV based on an observed RV (or a set of observed RVs).
The structure for a HMM is fixed. The hidden RV in the i-th column is a parent of the
observed RV in the i-th column and of the hidden RV in the (i+1)-th column. Further details of
the HMM are outside the scope of this paper, and I refer the interested reader to (Rabiner
1989).
9.3 Gaussian Naïve Bayesian Networks

In a Gaussian naïve Bayesian network (GNBN) there are two types of nodes: a single
class node, and multiple observable nodes. The class node is set as a parent of each of
the observable nodes, which have no other parents. The conditional probability
distribution within each observable node is a conditional Gaussian distribution, and the
class node's distribution is a multinomial.
A GNBN makes several strong assumptions about the underlying system being modeled.
First, a node's values are Gaussian distributed and, second, the values in each observed
node are statistically independent of the values in every other observed node given the
value of the class node. These are both strong assumptions that will almost certainly be
violated to some degree, however, GNBNs have often been found to perform quite well
in many classification applications and are commonly used as a point of reference in
machine learning literature. Figure 13 (right) gives an example of a GNBN.
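A minimal GNBN sketch consistent with this description: a multinomial over the class node and one Gaussian per observed node per class value, with classification by the largest log joint. The interface is hypothetical and, as noted above, no structure search is needed.

    import numpy as np

    def fit_gnbn(X, y):
        # X: (n_samples, n_nodes) array of observed values; y: integer labels.
        classes = np.unique(y)
        prior = np.array([np.mean(y == c) for c in classes])
        mean = np.array([X[y == c].mean(axis=0) for c in classes])
        var = np.array([X[y == c].var(axis=0) for c in classes])
        return classes, prior, mean, var

    def predict_gnbn(model, X):
        classes, prior, mean, var = model
        # Per-class Gaussian log-densities summed over the observed nodes
        # (the naive conditional-independence assumption), plus the class prior.
        log_lik = -0.5 * np.sum(
            np.log(2 * np.pi * var[:, None, :]) +
            (X[None, :, :] - mean[:, None, :]) ** 2 / var[:, None, :],
            axis=2)                               # shape: (n_classes, n_samples)
        log_joint = np.log(prior)[:, None] + log_lik
        return classes[np.argmax(log_joint, axis=0)]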
9.4 Lagrange Multipliers for Maximizing the Binomial Ratio Test

Assume each class of data is represented by a bucket of red and blue colored balls. Let r_1
= the number of red balls in class 1's bucket, r_2 = the number of red balls in class 2's
bucket, b_1 = the number of blue balls in class 1's bucket and b_2 = the number of blue balls
in class 2's bucket. The buckets are modeled by what I refer to as the binomial ratio test
(BRT). The BRT is parameterized by \Theta = \{\theta_r, \theta_b\}, where \theta_r is the parameter
for the number of red balls and \theta_b is the parameter for the number of blue balls. The
likelihood of the model, M, given the data, D, is given in Equation (26), and the log-
likelihood in Equation (27),
\mathcal{L}(M | D, \Theta) = \frac{\theta_r^{r_1} \, \theta_b^{b_1}}{\theta_r^{r_2} \, \theta_b^{b_2}},   (26)
\ell(M | D, \Theta) = r_1 \log \theta_r + b_1 \log \theta_b - r_2 \log \theta_r - b_2 \log \theta_b,   (27)
given the constraints \theta_r \geq \epsilon and \theta_b \geq \epsilon, where \epsilon is a small non-zero positive number, and
\theta_r + \theta_b = 1. To maximize the log likelihood function under these constraints, one Lagrange
multiplier term is added to the log likelihood function for each constraint. The
multipliers are:
Constraint                    Lagrange term
\theta_r \geq \epsilon              \lambda_1 (\theta_r - (S_1)^2 - \epsilon)
\theta_b \geq \epsilon              \lambda_2 (\theta_b - (S_2)^2 - \epsilon)
\theta_r + \theta_b = 1             \lambda_3 (\theta_r + \theta_b - 1)
The \lambda's are the Lagrange multipliers and the S's are slack variables. Adding these terms
into the log likelihood results in Equation (28).
\ell(M | D, \Theta) = r_1 \log \theta_r + b_1 \log \theta_b - r_2 \log \theta_r - b_2 \log \theta_b
                    + \lambda_1 (\theta_r - (S_1)^2 - \epsilon) + \lambda_2 (\theta_b - (S_2)^2 - \epsilon) + \lambda_3 (\theta_r + \theta_b - 1).   (28)
Taking the partial derivatives of (28) results in the following system of equations.

Partial derivative set to zero                                      Resulting equation
a) \partial\ell/\partial\theta_r:  (r_1 - r_2)/\theta_r + \lambda_1 + \lambda_3 = 0      \theta_r = (r_2 - r_1)/(\lambda_1 + \lambda_3)
b) \partial\ell/\partial\theta_b:  (b_1 - b_2)/\theta_b + \lambda_2 + \lambda_3 = 0      \theta_b = (b_2 - b_1)/(\lambda_2 + \lambda_3)
c) \partial\ell/\partial\lambda_1: \theta_r - (S_1)^2 - \epsilon = 0                     \theta_r = \epsilon + (S_1)^2
d) \partial\ell/\partial\lambda_2: \theta_b - (S_2)^2 - \epsilon = 0                     \theta_b = \epsilon + (S_2)^2
e) \partial\ell/\partial\lambda_3: \theta_r + \theta_b - 1 = 0                           \theta_r + \theta_b = 1
f) \partial\ell/\partial S_1:      -2\lambda_1 S_1 = 0                                   \lambda_1 = 0 or (S_1)^2 = 0
g) \partial\ell/\partial S_2:      -2\lambda_2 S_2 = 0                                   \lambda_2 = 0 or (S_2)^2 = 0
To find the values for the parameters, this system of equations is solved for \theta_r and \theta_b.
However, the f) and g) partial derivatives each admit two possible cases. As such, there are
a total of four possible parameter settings that must be considered to find the parameters
that maximize the likelihood.
Case 1: (S_1)^2 = 0 and \lambda_2 = 0. Parameter values: \theta_r = \epsilon, \theta_b = 1 - \epsilon. Resulting log
likelihood: \ell = (r_1 - r_2) \log \epsilon + (b_1 - b_2) \log(1 - \epsilon).
Case 2: \lambda_1 = 0 and (S_2)^2 = 0. Parameter values: \theta_r = 1 - \epsilon, \theta_b = \epsilon. Resulting log
likelihood: \ell = (r_1 - r_2) \log(1 - \epsilon) + (b_1 - b_2) \log \epsilon.
Case 3: \lambda_1 = 0 and \lambda_2 = 0. Parameter values: \theta_r = (r_2 - r_1)/((r_2 - r_1) + (b_2 - b_1)) and
\theta_b = (b_2 - b_1)/((r_2 - r_1) + (b_2 - b_1)). Resulting log likelihood:
\ell = (r_1 - r_2) \log \theta_r + (b_1 - b_2) \log \theta_b with these values substituted.
Case 4: (S_1)^2 = 0 and (S_2)^2 = 0. Parameter values: \theta_r = \epsilon, \theta_b = \epsilon. Violates the
constraint: \epsilon + \epsilon \neq 1.
The final parameters are then set by the case with the highest resulting log likelihood. If
the highest likelihood is given by the third case, \lambda_1 = 0 and \lambda_2 = 0, the resulting parameters
must be checked against the constraints. If they violate the constraints, the maximum
likelihood of the first two cases determines the parameter values.
The first case will result if there are more blue balls in class 1 than class 2 and fewer red
balls in class 1 than in class 2. This case results in the maximum value for \theta_b = 1 - \epsilon. The
second case will result if there are more red balls in class 1 than class 2 and fewer blue
balls in class 1 than class 2. This case results in the maximum value for \theta_r = 1 - \epsilon. If there
are more blue balls and more red balls in class 1 than class 2, the parameters will be
ranked based on which balls had the largest increase in class 1.