C. Project Description
1. Results from Prior NSF Support
Summary of Previous and Current Awards for Hsu
INFORMATION TECHNOLOGY RESEARCH (ITR) FOR NATIONAL PRIORITIES, 2004-2007: Hsu is a co-PI on ASE (sim+dmc)-0428826, $750,000, “ITR: Parallel Data Mining for Nanoscale Kinetic Monte Carlo Simulation Models”. This project provides partial support to two undergraduate programmers in computer science, one female, and is generating data used in the theses of one M.S. student in mathematics and one Ph.D. student in computer science. Since the award date (September, 2004), this project has initiated the implementation of dynamic Bayesian networks in Hsu’s software library, Bayesian Network tools in Java (BNJ), to be used in Rahman and Kara’s parallel kinetic Monte Carlo simulator (LEAP-KMC). This grant has also partially supported the publication of a paper on technique selection and one on Monte Carlo methods for probabilistic inference. Hsu has primary development responsibility for the machine learning aspect of this project and is investigating techniques based on k-nearest neighbor (k-NN), support vector machine (SVM), and symbolic regression approaches.
FRONTIERS IN INTEGRATIVE BIOLOGICAL RESEARCH (FIBR), 2004-2009: Hsu is a senior person on FIBR-0425759, “Molecular Evolutionary Ecology of Developmental Signaling Pathways in Complex Environments”. This project supports one M.S. student in mathematics and is generating data used in the student’s thesis and that of a second Ph.D. student in computer science. Since the award date (September, 2004), this project has led to a new release (v3.2) of BNJ and structure learning modules to be used in causal modeling in ecological genomics. One journal paper and several conference papers are in preparation.
REU DEVELOPMENT (1999-2000): During Hsu’s one-year appointment at NCSA he led a research program on applied KDD for commercial decision support, was responsible for industrial data mining projects, and developed machine learning and probabilistic reasoning software for large-scale KDD applications. This led to the PI’s participation in two summer NSF REU programs whose tutorial components were partly developed and piloted in a summer course on data mining at Kansas State. Hsu has been responsible for major contributions to the D2K reuse library, including modules for data clustering (1998), stochastic search wrappers for feature selection (1999), web clickstream mining (2000), Bayesian network structure learning for decision support (2000), and stochastic sampling-based inference in Bayesian networks (2001).
Specifics of Previous and Current Awards for Rahman
“Chemisorption Studies at Metal Surfaces” CHE-9812397 (99-02), $280,000; CHE-0205064 (02-05), $315,000 (PI); “Kansas Center for Advanced Scientific Computing” NSF/EPSCoR006169 (96-99), $302,837 (CoPI); “Upgrading of a High Performance Computational Facility” (CDA-9724289) (1997-00), $350,000 (PI); “US-Pakistan Workshop: 25th International Nathiagali Summer College” (2000), $15,000; 26th (2001), $15,000; 27th (2002), INT0215511, $20,000 (PI); “Evolution of Nanoscale Film Morphology” (ERC0085604) (2000-2003), $1,170,834 (with 3 Co-PI’s); “Single Molecule Magnet for Quantum Computing,” NER/CIS-0304665 (2003-04), $100,000 (PI); “Theoretical Studies of Intermetallic Surfaces”: US-Turkey Cooperative Research, INT-0244191 (2003-05), $40,000 (PI); “ITR: Parallel Data Mining for Nanoscale Kinetic Monte Carlo Simulation Models”, ASE (sim+dmc)-0428826 (2004-07), $750,000 (PI).
Highlights of Completed Projects (Rahman)
Funds from the above awards have provided financial support to eight graduate students and four post-doctoral associates. Five of the above twelve individuals are female. The details can be found in the relevant publications. Some details of the work relevant to this proposal follow:
Atomistic studies of initial stages of homoepitaxial growth on Ag(111). Using molecular statics (MS) and dynamics (MD) simulations, we show that while at low temperatures the formation of (100)-microfacetted step edges is favored over the (111)-type, the situation is reversed at higher temperatures. These results point to the importance of temperature-dependent atomic vibrations in considerations of epitaxial growth.
Morphology of ledge patterns during step flow growth. During step flow growth of step edge patterns on vicinals of Cu(001), in the presence of a meandering instability, we find an invariant shape of the step profile. The step morphologies change with increasing coverage from a somewhat triangular shape to a flatter, invariant steady-state form. Our KMC simulations show the kink Ehrlich-Schwoebel barrier to be critical for determining the ledge morphology.
Evolution of step morphology in thermal equilibrium. Our KMC simulations of thermally induced changes in the step profiles of vicinals of Cu(001), using a set of critical energy barriers obtained from reliable manybody potentials, yield kink formation energies and exponents for time correlation functions in good agreement with those obtained from STM data, over a large temperature range. A key element here is the ability to obtain macroscopic properties of steps, like the stiffness parameter, from microscopic considerations.
Self-teaching Kinetic Monte Carlo method. We are developing new KMC codes with automatic generation of microscopic events using manybody potentials and accurate methods for the calculation of energy barriers, as needed. Because of the automation and inherent pattern recognition ability, this code is expected to provide an accelerated, microscopic approach to examine issues related to non-equilibrium and equilibrium phenomena on metal surfaces.
Diffusion of 2D Cu clusters on Cu(111): Using a closed database consisting of 294 transition events involving periphery diffusion, we show that the dynamics of small clusters (8-38 atoms) is governed by their size and shape, and produces an effective diffusion barrier of 0.65 ± 0.02 eV, in good agreement with experiments. The larger islands (50-1000 atoms) show an interesting scaling with size and temperature.
Prefactors for interlayer diffusion on Ag/Ag(111): From calculated energy barriers and kinetic Monte Carlo simulations, we find that good agreement with experimental data requires that the prefactors for terrace and step-edge diffusion differ by two orders of magnitude.
Molecular dynamics of adatom and cluster diffusion on metal surfaces: These studies are providing insights into novel, complex, multi-atom diffusion mechanisms that appear as a function of surface temperature. By providing a measure of when anharmonic effects become important, these studies provide the limits of validity of simulations emerging from KMC for which the harmonic approximation is assumed. The above projects were performed under the grant ERC0085604, which has now expired, and no new funds for them are available.
Theoretical Studies of Chemisorption at Metal Surfaces. Rahman’s group is engaged in ab initio electronic structure calculations of a range of phenomena on metal surfaces, including examination of structural relaxation, changes in local electronic structure, reactivity, chemisorption, vibrational dynamics, and surface stress as induced by surface geometry, the presence of steps and kinks, and adsorbates. A number of efficient codes based on density functional theory with both the local density and the generalized gradient approximation are available to the group, and several current students are already very familiar with their usage. Rahman’s group is now collaborating with a team from Computing and Information Sciences on Parallel Data Mining for Nanoscale Kinetic Monte Carlo Simulation Models, which aims at scaling up simulations for 2-D epitaxial growth to much larger neighborhoods and irregular surface configurations.
2. Objectives, Expected Outcomes and Long-Term Goals
The main goal of the proposed work is to build models for nanoscale materials processes that are applicable to emerging technologies for computation. These models need to be able to represent three-dimensional phenomena at multiple time scales. Thus, our supporting technical objective is to extend current simulation infrastructures by developing more general geometric and temporal representations. Examples of processes that have been simulated in the past, but present a challenge to scale up, include:
- deposition of thin layers (e.g., metal-on-metal in semiconductor wafers and data storage media)
- initiation and propagation of surface defects
- dynamics, diffusion and adsorption of proteins, peptides, and other organic molecules
Such phenomena are not limited to crystal lattice structures as in the case of many existing simulations, but can include adsorption of organic compounds to inorganic surfaces, diffusion of proteins into media such as electrophoresis gels, etc. The types of computational models that are currently used to simulate material evolution are usually based on two-dimensional algorithms and propagate information in the third dimension using a dynamic programming or “sweep-plane” approach. A consequence of generalizing from these to real 3-D processes is that better representations are needed for material evolution and fault propagation neighborhoods.
An interdisciplinary team of researchers from Physics and Computer Science has formed to address the problem of generalizing and extending existing frameworks for “multiscale” simulation of physical phenomena such as the above, from the atomic level up.
Previous and related work addresses the spatial multiscale aspect of simulating material evolution and grain boundary diffusion in solids and nanostructures. However, representing and calculating the dynamics, using a mixture of models at different temporal scales, presents yet another theoretical challenge. For example, the events involved in initiation of a surface defect in a data storage medium range over time scales from picoseconds (10^-12 s) to seconds, while the propagation of the defect and its impact on data storage failure may be measured on scales of up to 10^7 seconds. Additionally, some components of this time axis are independent of the spatial scale.
The emphasis of this new work is on formalizing multi-time representations for 3-D nanostructures, using temporal graphical models such as dynamic Bayesian networks and relational and object-oriented extensions thereof. Besides the new applications to simulation of nanoscale materials processes, the novel contributions of this approach include the scaling up of stochastic simulations for a more general class of discrete phenomena over time, which has potential benefits for time series prediction, process control, and planning.
Specific desired outcomes include:
1. implementations of existing and new machine learning algorithms for approximation of energy functions (macroscopic and microscopic rates) in simulation of experimental processes
2. representations and semistructured data models for multi-time nanoscale phenomena
3. fielded applications such as parallel kinetic Monte Carlo simulators using these representations
The computer scientists in the group are interested in the challenge of scaling up simulations that involve state spaces in excess of 2^100 configurations, with about a teraflop (10^12 floating point operations) required to estimate the energy function for each atomic-scale configuration. Such problems are combinatorially intractable even when all symmetries, caching, and parallelization avenues are exploited. In the material evolution domain, spatial decomposition can only abstract this to billions of distinct configuration classes. The challenge is then to inductively generalize over previously seen configurations, in order to discover some equivalence classes of, or local regularities in, the energy function.
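The idea of collapsing configurations into equivalence classes can be illustrated on a toy case. The following is a minimal Python sketch, assuming a single ring of binary occupancy sites whose energy is invariant under rotation and reflection; the ring model and the names `canonical_ring` and `count_classes` are hypothetical illustrations and are not part of LEAP-KMC or BNJ.

```python
from itertools import product


def canonical_ring(bits):
    """Return the lexicographically smallest rotation or reflection of a ring.

    Two occupancy patterns that map to the same canonical tuple are assumed
    to share one energy value (a toy symmetry assumption).
    """
    n = len(bits)
    candidates = []
    for seq in (tuple(bits), tuple(reversed(bits))):
        for shift in range(n):
            candidates.append(seq[shift:] + seq[:shift])
    return min(candidates)


def count_classes(n_sites):
    """Count distinct symmetry classes among all 2**n_sites patterns."""
    return len({canonical_ring(p) for p in product((0, 1), repeat=n_sites)})


if __name__ == "__main__":
    n = 6  # hypothetical ring size, chosen only to keep enumeration small
    print(2 ** n, "raw configurations,", count_classes(n), "symmetry classes")
```

Exhaustive enumeration of this kind is only possible for very small rings; for realistic neighborhoods the classes would have to be discovered inductively from visited configurations, which is the learning problem described above.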
The wide range of time scales for simulations presents both a challenge and an opportunity. On the one hand, simulation at the finest temporal grain (15-20 decimal orders of magnitude shorter in duration than the longest episodes to be modeled) is not feasible; therefore, learning mixed-time models can provide a way to obtain useful estimates for event times. Such an abstraction may be required, for instance, to model long-term nanoparticle diffusion and sintering (synthesis of materials from powder) at a macroscopic scale. On the other hand, such an abstraction is often enough to approximate the outcome of an experiment to sufficient accuracy. For example, the experiment might be to test the effectiveness of a new catalyst with respect to some qualitative outcome of interest such as: “Does this nanoparticle resist self-adhesion in application for one year?”
The physicists in our group seek to extend the methodology of materials process modeling in revolutionary ways, supporting design and control of the properties of materials as needed for emerging technological applications. For them, the long-term goal is to have a set of general-purpose computational tools with which they can control the growth patterns of nanostructures (metallic films, nanoscale storage media, nanowires, C60-walled nanotubes, etc.) as a function of temperature, substrate geometry, and material composition. This in turn requires a computational modeling tool that can approximately map the “system inputs” to the parameters of rate equations for the structures of interest.
Figure 1. Illustration of the 3-shell, 36-atom neighborhood representation currently used for a (111) fcc 2-D system.
Together, we have identified three orthogonal improvements to existing frameworks that are needed to achieve the shared goal of this computational modeling tool. First, previous and current funded research has pushed towards finer-grained simulations by means of parallelization and data mining: “more energy evaluations, generalized to cover a larger state space, in the same amount of time”. One goal of this research project is to create a robust and reliable mapping function from a 2-D geometric specification of a local 3-shell neighborhood about an active atom to the activation energy for the state. This neighborhood is depicted in Figure 1. Second, an independent problem is to develop a true 3-D model rather than iterate over 2-D computations for one layer; this will allow interlayer (vertical) transitions in complex epitaxial growth simulations and support modeling of 3-D material evolution. Third, temporal abstractions are needed to simulate long-term events such as crack formation (both macroscopic and microscopic) from short-term, atomistic computations. As Figure 2 illustrates, this proposal addresses primarily the third necessary extension and aspects of the second one that are relevant to dynamics. This block diagram is explained in more detail in Section 4 (Proposed Research).
Figure 2. Overview: block diagram of improved system for nanoscale process modeling.
3. Present State of Knowledge
3.1 Background and Related Work
Theoretical physics research over the past ten years has included many efforts to model discrete phenomena at intermediary scales of time and distance. These range from basic processes, which are minimal both in duration and distance, to long-term effects.
Material evolution: Primitive processes that can be identified and simulated at the nanoscopic level in material evolution include: deposition, diffusion, nucleation, attachment, detachment, edge diffusion, diffusion down step, nucleation on top of islands, and dimer diffusion. Computing the prefactors and activation energy barriers for a typical state in a 36-atom neighborhood (3-shell atomic model) requires tens of CPU seconds.
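To make the role of prefactors and activation energy barriers concrete, the following minimal Python sketch shows how they enter a standard kinetic Monte Carlo step through the Arrhenius form, rate = prefactor * exp(-E_a / (k_B T)). It assumes the textbook rejection-free (residence-time) selection scheme; the function names and the event values in the demo are hypothetical and are not taken from the simulator described in this proposal.

```python
import math
import random

K_B = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_rate(prefactor_hz, barrier_ev, temperature_k):
    """Rate of a thermally activated process in the Arrhenius form."""
    return prefactor_hz * math.exp(-barrier_ev / (K_B * temperature_k))


def kmc_step(rates, rng):
    """Select one event with probability proportional to its rate.

    Returns (event_index, dt), where dt is an exponentially distributed
    waiting time with mean 1 / sum(rates).
    """
    total = sum(rates)
    threshold = rng.random() * total
    cumulative = 0.0
    chosen = len(rates) - 1  # fallback guards against round-off at the end
    for i, rate in enumerate(rates):
        cumulative += rate
        if threshold < cumulative:
            chosen = i
            break
    dt = -math.log(1.0 - rng.random()) / total
    return chosen, dt


if __name__ == "__main__":
    # Hypothetical event table: (name, prefactor in Hz, barrier in eV).
    events = [("terrace hop", 1e12, 0.05), ("edge diffusion", 1e12, 0.30)]
    rates = [arrhenius_rate(nu, ea, 400.0) for _, nu, ea in events]
    index, dt = kmc_step(rates, random.Random(0))
    print(events[index][0], dt)
```

Because each rate depends exponentially on its barrier, small errors in an estimated barrier translate into large errors in event frequencies, which is one reason the exact calculations are expensive to replace.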
Coating and crack initiation and propagation: Research on surface coatings for prevention of fatigue crack initiation typically focuses on suppressing the development of the critical persistent slip band surface morphology through high modulus coatings and the associated dislocation image forces. Brittle coatings of this type tend to have limited performance and impact on crack initiation resistance because cracks can initiate by other mechanisms. In comparison to traditional crack-resistant coatings, multilayered metallic thin films can be tailored to generate exceptional hardness and to retain substantial ductility from the constituents. It is this extensive flexibility that facilitates material-specific control of the critical surface morphology responsible for fatigue crack initiation. The propagation of cracks is facilitated by grain boundaries or easy planes (stacking faults) where the energy to separate two planes is the least.
Adatom decoration of stepped surfaces: Deposition on monoatomically stepped surfaces has been demonstrated with metals such as gold (Au) on platinum (Pt) and palladium (Pd). Vicinally stepped surfaces such as Pt(665) have in turn been used as templates for the formation of nanowires and similar metal deposits. Controllable synthesis of nanostructures is one reason why a general architecture for computational modeling of nanoscale phenomena is important and is needed for better characterization of events, discovery of interesting properties, and prediction of longer-term behavior from specification.
Figure 3. Fatigue crack initiation.
Figure 4. Generic monoatomically stepped surface.
3.2 Need for Extended Geometric and Temporal Representations of Materials Processes
Early experiments using parameters for 2-D Cu(111) islands (of 10-1000 atoms) at 400 Kelvin indicate that the number of CPU-seconds required to achieve a requisite 95% is greater than 4 × 10^4, or about half of a CPU-day on the test system. In preliminary benchmarking experiments, an existing high-performance FORTRAN code, courtesy of Oleg Trushin, was executed on a current-generation desktop PC (single Pentium IV processor) and began to saturate its cache of precomputed states only after the most frequent 1000 unique states had been visited. The first 10^4 seconds resulted in only 300 unique states being visited, after which a dramatic increase in previously visited states resulted in over 3 million state transition evaluations being achieved in the second 10^4 seconds. Realizing this speedup was largely a matter of caching the results. This performance curve indicated that scaling up to hundred-atom neighborhoods would pose a significant challenge. Throughput of such a system at 40 seconds of wall clock time per state evaluation prompted us and our collaborators to look at parallelizing kinetic Monte Carlo (KMC) simulations. Independent of this effort, we began to develop a representation that supported machine learning using instance-based methods, multivariate regression, and kernel methods (support vector machines). The resultant system aims at a combined speedup of a few hundred times.
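The two ideas in this paragraph, caching exact evaluations and falling back on an instance-based estimate, can be sketched together. The following minimal Python example assumes states are encoded as bit tuples and uses Hamming distance for the nearest-neighbor lookup; `BarrierCache`, `knn_estimate`, and the stand-in `toy_barrier` function are hypothetical names introduced for illustration only and do not describe the FORTRAN code or the actual learning system.

```python
class BarrierCache:
    """Memoize an expensive exact barrier calculation, keyed by state bits."""

    def __init__(self, exact_fn):
        self.exact_fn = exact_fn
        self.table = {}
        self.hits = 0
        self.misses = 0

    def lookup(self, state):
        key = tuple(state)
        if key in self.table:
            self.hits += 1
        else:
            self.misses += 1
            self.table[key] = self.exact_fn(key)
        return self.table[key]


def hamming(a, b):
    """Number of positions at which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_estimate(cache, state, k=3):
    """Average the barriers of the k cached states nearest to `state`."""
    if not cache.table:
        raise ValueError("cache is empty; no neighbors to average")
    key = tuple(state)
    nearest = sorted(cache.table.items(), key=lambda kv: hamming(kv[0], key))[:k]
    return sum(value for _, value in nearest) / len(nearest)


def toy_barrier(state):
    """Stand-in for an exact calculation (hypothetical linear toy model)."""
    return 0.1 + 0.05 * sum(state)


if __name__ == "__main__":
    cache = BarrierCache(toy_barrier)
    for s in [(1, 0, 0, 0), (1, 0, 0, 0), (0, 0, 0, 0), (1, 1, 0, 0)]:
        cache.lookup(s)
    print(cache.hits, cache.misses, knn_estimate(cache, (1, 0, 0, 1), k=1))
```

The cache only pays off once states begin to repeat, which matches the saturation behavior reported above; the nearest-neighbor estimate trades accuracy for the ability to answer queries on states that have never been evaluated exactly.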
Scaling up through parallelization and data mining allows more detailed modeling to be done in the same amount of wall clock time. However, this only addresses the first technical improvement to the existing frameworks, stated at the end of Section 2 above. The need for an extended geometric model can be seen in Figure 3 and Figure 4 above: modeling of faults and wire growth is limited to monoatomically stacked or stepped structures unless an ad hoc mechanism is defined for vertical jumps, or a true 3-D model is developed. For general-purpose modeling tools, the latter is more extensible and therefore preferable. At least as important to this proposed research is the need for multi-time models in order to represent longer-term cumulative, periodic, or gradual complex effects.
4. Proposed Research
This section describes specific technical contributions and outcomes of the proposed work. We first review the system architecture depicted in Figure 2 and list the key tasks and desired outcomes of current work to delineate it from incremental extensions and more significant, fundamental improvements.
The top box in Figure 2 lists five sources of input data for data mining using temporal graphical models:
1. the previously computed exact parameters for robust estimators that, as documented in Section 3, include prefactors and activation energy barriers; these also express some background knowledge
2. the process specification, a history of all relevant energy values, which in some cases may also include higher-level adjustable parameters such as priors for the rate equations to be expressed by the temporal graphical models. (In the current system, this is a simple qualitative selector and the rest of the specification is captured by the state specification. When different levels of temporal abstraction are introduced, however, Markovity may be violated.)
3. the spatial and geometric state specification, consisting of at least a bit vector representation of the “initial conditions”, describing the neighborhood about one or more active atoms
4. a semi-structured event language for propagation and caching of state transition information, prefactors and factors (in atemporal, i.e., spatial, abstractions), and mixture coefficients for the multi-time model
5. functional, parametric, or constraint network-based representation of simulations: these propagate information from higher and lower levels of a hierarchical temporal model such as a hierarchical hidden Markov model (HHMM)
The upper three boxes comprise the “user data”, whether given in configuration files or interactively; the lower box contains computed inputs and all of the variational factors that can be solved for within one atomistic simulation (at a single time “slice”).
The middle box decomposes the technical objective of finding the right temporal abstraction to simulate long(er)-term events into pattern recognition (inference), representation, and learning aspects. We now document these aspects. For clarity, we proceed in the order of representation (Section 4.1), learning (Section 4.2), and parameter estimation by inference (Section 4.3).
4.1 Representation: General Infrastructure for Nanoscale Materials Process Simulations
This section describes the proposed extensions to the computational framework on the input end that lead us to a new 3-D, multi-time model for material evolution and other phenomena (the central “representation” box of the middle layer in Figure 2).
In order to develop a data model that facilitates learning and multiple and mixed time scales, we must first consider the limitations of current practice:
1. Simulators of 3-D material evolution (simple homoepitaxial growth and more complex varieties, wire growth, self-assembly, etc.) currently use a “sweep-plane” representation that propagates the effects of isolated 2-D computations. While this approach is easier to parallelize using boundary methods, it cannot capture all 3-D effects, such as some vertical jumps, that we are interested in. This has an unwanted side effect on multi-time modeling: for many growth and diffusion processes, the vertical axis is tied to the time parameter. Iterating at a fixed layer “thickness” means iterating at a uniform time granularity, whereas we seek to develop an adaptive multi-time model.
2. 3-D materials processes are numerous and those that are being computationally simulated are becoming more complex. As we progress towards multi-time models, organizing and interfacing simulators will present an increasingly difficult information management task. An extensible ontology and semi-structured data model, such as the investigators are developing for the thin film and epitaxial growth domains, is needed for a more general family of phenomena.
A corollary of the above points is that the spatial state specification needs to be extended to a spherical neighborhood, rather than the “cylindrical” or “stack of hexes” neighborhood indicated by Figure 1. For this purpose we will adapt uniform (voxel) and adaptive (octree) representations [Foley et al., 1996] from the field of volume graphics. Voxel-based representations are more data parallel and easier to manipulate, but adaptive spatial decomposition makes a tradeoff between increased bookkeeping overhead and potential savings in dealing with volumes in bulk. We face this same representational tradeoff in designing our adaptive multi-time representation. Therefore, making synergistic design choices for the process specification and geometric state specification (the upper level of boxes in Figure 2) has a high impact on the whole framework: not only on the simulation, but also on inductively learning from the outputs and reasoning with them to iteratively improve the adaptive subdivision. The consequence of this choice is that with the right (generative) data model, we can bootstrap the process of searching for a good spatial and temporal abstraction.
Probabilistic graphical models¹ provide one such generative representation. We will now consider a framework for learning them from simulation data and outline some challenges.
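The voxel versus octree tradeoff discussed in this section can be illustrated with a small example. The following minimal Python sketch assumes a cubic region whose side is a power of two and an occupancy set of integer voxel coordinates; `build_octree` and `count_leaves` are hypothetical helper names, and the slab-plus-adatom geometry in the demo is an invented test case rather than a configuration from our simulations.

```python
def build_octree(occupied, origin, size):
    """Return 0 (empty), 1 (full), or a list of 8 child subtrees.

    `occupied` is a set of (x, y, z) voxel coordinates; `size` is the cube
    side length and is assumed to be a power of two.
    """
    x0, y0, z0 = origin
    cells = [(x, y, z)
             for x in range(x0, x0 + size)
             for y in range(y0, y0 + size)
             for z in range(z0, z0 + size)]
    filled = sum(cell in occupied for cell in cells)
    if filled == 0:
        return 0
    if filled == len(cells):
        return 1
    half = size // 2
    return [build_octree(occupied, (x0 + dx, y0 + dy, z0 + dz), half)
            for dx in (0, half) for dy in (0, half) for dz in (0, half)]


def count_leaves(node):
    """Count uniform (all-empty or all-full) regions in the tree."""
    if isinstance(node, list):
        return sum(count_leaves(child) for child in node)
    return 1


if __name__ == "__main__":
    # Hypothetical geometry: a filled 8 x 8 x 4 substrate slab plus one adatom.
    slab = {(x, y, z) for x in range(8) for y in range(8) for z in range(4)}
    tree = build_octree(slab | {(0, 0, 4)}, (0, 0, 0), 8)
    print(8 ** 3, "voxels versus", count_leaves(tree), "octree leaves")
```

Bulk regions collapse into single leaves while detail concentrates around the adatom, at the cost of the recursive bookkeeping that a flat voxel array avoids.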
4.2 Learning: Graphical Models for New Applications of Parallel Kinetic Monte Carlo
Figure 5 depicts a typical BNJ workflow for learning graphical models from scientific (or industrial) data. In the next section we discuss the combined application of learning and inference modules in BNJ and give a list of technical development milestones. Key technologies applied in the deployment of such an experimenter’s workbench are:
1. Parallel, distributed computation
2. Reusable software modules for high-level performance tuning in learning graphical models
3. A semi-structured data format for BNs (the XML Bayesian Network Interchange Format)
¹ The term “graphical” is overloaded in this context: in the cross-cutting fields of probabilistic reasoning and machine learning, it refers to models and algorithms based on graph (i.e., network) representations, not necessarily to computer graphics and graphical user interfaces (GUIs).
Figure 6 depicts a traditional dynamic Bayesian network (DBN), represented using a two-time-slice temporal Bayesian network and applied using an unrolling procedure as shown. Circles denote hidden states in this figure. Squares, which usually denote controllable quantities in a decision network, represent observable (possibly controllable) variables, including energies of activation and prefactors.
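The unrolling procedure can be sketched for the simplest possible case. The following minimal Python example assumes a DBN with one discrete hidden variable and one discrete observable per slice (that is, a hidden Markov model) and applies the two-slice transition model once per time step to filter the hidden state; the function name `forward_filter` and all probabilities in the demo are hypothetical and are not drawn from BNJ.

```python
def forward_filter(prior, transition, emission, observations):
    """Filter the hidden state of a two-slice model by unrolling it in time.

    prior[i]         : P(X_0 = i)
    transition[i][j] : P(X_{t+1} = j | X_t = i), the two-slice model
    emission[j][o]   : P(Y_t = o | X_t = j)
    Returns P(X_T | y_0..y_T) for the last time step T.
    """
    n = len(prior)
    belief = list(prior)
    for t, obs in enumerate(observations):
        if t > 0:
            # Predict: push the belief through one copy of the two-slice model.
            belief = [sum(belief[i] * transition[i][j] for i in range(n))
                      for j in range(n)]
        # Update: weight by the likelihood of the observed variable.
        weighted = [belief[j] * emission[j][obs] for j in range(n)]
        norm = sum(weighted)
        belief = [w / norm for w in weighted]
    return belief


if __name__ == "__main__":
    # Hypothetical two-state example with a binary observable.
    prior = [0.5, 0.5]
    transition = [[0.9, 0.1], [0.3, 0.7]]
    emission = [[0.9, 0.1], [0.2, 0.8]]
    print(forward_filter(prior, transition, emission, [0, 0, 1]))
```

With several hidden variables per slice the belief state no longer factorizes this cheaply, which is why approximate schemes such as the Boyen-Koller and factored frontier methods listed in Table 1 become relevant.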
Figure 5. Example BNJ workflow for learning graphical models from data.
Figure 6. Example Dynamic Bayesian Network for a generic material evolution process.
Specific important challenges to learning of temporal graphical models include:
1. aggregating and abstracting over time slices (as depicted by the inner rounded box labeled “temporal abstraction”) to obtain a multi-time model
2. incorporating other models of time: continuous time, extended lag or “long time lag” models, etc.
In addition, spatial decomposition for partial evaluation and modeling 3-D geometry (not shown) are parts of our current work that will need to be integrated into the multi-time models, as described in Section 2.
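The first challenge, aggregating over time slices, has a simple closed form in the most restrictive case. The following minimal Python sketch assumes a time-homogeneous, first-order Markov chain over a small discrete state set, for which k fine-grained slices collapse into one coarse slice whose transition matrix is the k-th power of the original; `aggregate_slices` is a hypothetical name and the matrix in the demo is invented. When Markovity is violated across abstraction levels, as cautioned earlier, this shortcut does not apply and the coarse model must be learned instead.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def aggregate_slices(transition, k):
    """Return the k-step transition matrix using repeated squaring."""
    n = len(transition)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in transition]
    while k > 0:
        if k & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        k >>= 1
    return result


if __name__ == "__main__":
    fine = [[0.9, 0.1], [0.5, 0.5]]  # hypothetical one-slice transition model
    print(aggregate_slices(fine, 2))
    print(aggregate_slices(fine, 1000))
```

Repeated squaring needs only on the order of log k matrix products, so spanning many orders of magnitude in time is cheap in this idealized setting; the difficulty in practice lies in the state space size and in non-Markovian effects.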
4.3 Inference: Graphical Models for Estimation and Prediction at Mixed Time Scales
| General Function | Specific Technique | Software Module | Year (to be) implemented |
|---|---|---|---|
| Inference | Exact | Clustering (Junction Tree) | Infrastructure complete (optimizations, animation in BNJ v3.1) |
| | | Conditioning | Infrastructure complete (opt. still needed) |
| | | Variable Elimination | Infrastructure complete (ditto) |
| | | Pointwise Product | Year 1 |
| | Approximate | Bounded loop cutset conditioning | Year 1 |
| | | Stochastic Sampling | Adaptive importance sampling (Cheng and Druzdzel, 2000) done; others in progress |
| | | Other (hybrids) | Years 1-3 |
| Learning | Structure | Constraint-based | Year 1: CMU Causality Lab integration |
| | | Score-based | K2 done; others in years 1-3 |
| | Distributions | | Years 1-2; gradient descent done; forward-backward algorithm in progress |
| Representation | | Objective & relational (PRMs, OOBNs, etc.) | Year 2 |
| | | Dynamic Bayesian networks | Years 1-2; Boyen-Koller, factored frontier |
| | | Decision networks, influence diagrams | Infrastructure extended; inference and learning, Years 1-2 |
| | | Continuous chance nodes | Year 1 |
| | | Continuous time | Years 2-3 |
| | New & hybrid representations | Hierarchical hidden Markov models | Years 1-3 |
| | | ILP, object/relational models | Years 1-3 |
| | | Latent variable models (Meek/Chickering, PC, FCI) | Years 1-3 |
| Applications | | 3-D diffusion | Year 1, initial implementation year 1 |
| | | Crack propagation | Years 1-2, initial implementation first 18 mos. |
| | | Other 3-D nanostructures: nanowire/nanotube | Years 2-3 |
Table 1. BNJ feature overview and development timetable for the EMT project.
This section briefly surveys the current state of practice in this area of research and the technological needs that must be addressed to meet the goal of developing graphical models and providing a usable software toolkit to the computational physics user community.
General-purpose software tools for data mining are abundant, but similar tools for learning models from data are not as accessible to computational science and engineering researchers and students. Outside the cross-cutting disciplines of artificial intelligence and statistical computation, this proves to be an educational lack. Specifically, many of the curricula in knowledge discovery in databases (KDD) using real data have covered only a few aspects of Naïve Bayes, clustering, and Bayesian statistics, and in the case of time series prediction, autoregressive integrated moving average (ARIMA) process models and some simple state transition models, usually without covering learning and inference. As a general practice, most such teaching programs have not incorporated learning and inference in graphical models in general, even where efficient and scalable algorithms are available.
Table 1 lists the learning and inference modules that make up the middle section of Figure 2. Those modules that are most directly relevant to this EMT project are bolded.
4.4 Value Added to EMT: Beyond Material Evolution
A significant part of this proposal is to develop the framework for bridging the gap between the length scales feasible for ab initio calculations of complex systems and those necessary for examining processes like diffusion of atoms and vacancies in grain boundaries in Fe and Fe-Ni based alloys. One way to bridge this gap is through a combination of ab initio electronic structure calculation with kinetic Monte Carlo (kMC) simulations. In the proposed work we will first consider the simpler case of diffusion in grain boundaries in the homogeneous system consisting of pure Fe and then move on to considerations of those in Fe-Ni based alloys. For both systems, activation energy barriers for selected cases will be obtained from ab initio calculations of the total energy using the nudged elastic band method [Johnson et al., 1998] and compared with those from model potentials. Although this is a relatively new technique, it is reliable and has already found its way into state-of-the-art electronic structure codes which are used by Rahman and Kara. The calculations for multi-component systems are indeed computationally intensive, but they have become feasible on present-day computers. Some model interaction potentials are already available and will be used. The proposed ab initio calculations will also facilitate the development of robust interatomic potentials for further application to the projects proposed here.
At the initial stage, energy barriers for important processes will provide the input for a kMC simulation of the diffusion of atoms and vacancies in the alloys under consideration, as a function of temperature. These calculations will be done using the model potentials. Classical molecular dynamics simulations will be carried out to obtain further insight into the importance and relevance of additional diffusion mechanisms at elevated temperatures, since ab initio results are most appropriate for low temperatures. Activation energy barriers for these new processes will then be obtained from ab initio calculations, and the database for kMC simulations will be enlarged. With further refinements, these sets of calculations at several length and time scales will greatly enhance our knowledge of the characteristics of diffusion in these alloys. Available experimental data on the systems will be used to validate the calculations. The calculations in turn will motivate experimentalists to investigate these processes further under controlled conditions.
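As a minimal sketch of how energy barriers feed a kMC simulation, the following uses the standard rejection-free (residence-time) scheme with Arrhenius rates. The barrier values, the common prefactor of 1e13 per second, and all function names are illustrative assumptions; they are not taken from this proposal or from any existing simulator.

```python
import math
import random

K_B = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_rate(barrier_ev, temperature_k, prefactor_hz=1.0e13):
    """Rate of one thermally activated process: nu * exp(-E / (k_B * T))."""
    return prefactor_hz * math.exp(-barrier_ev / (K_B * temperature_k))


def kmc_step(barriers_ev, temperature_k, rng):
    """One residence-time kMC step.

    Returns (index of the chosen process, time increment in seconds).
    """
    rates = [arrhenius_rate(e, temperature_k) for e in barriers_ev]
    total = sum(rates)
    # Pick a process with probability proportional to its rate.
    target = rng.random() * total
    cumulative = 0.0
    chosen = len(rates) - 1
    for i, rate in enumerate(rates):
        cumulative += rate
        if target < cumulative:
            chosen = i
            break
    # Advance the clock by an exponentially distributed waiting time.
    dt = -math.log(1.0 - rng.random()) / total
    return chosen, dt


def run_kmc(barriers_ev, temperature_k, n_steps, seed=0):
    """Run n_steps kMC steps; returns (per-process event counts, elapsed time)."""
    rng = random.Random(seed)
    counts = [0] * len(barriers_ev)
    elapsed = 0.0
    for _ in range(n_steps):
        chosen, dt = kmc_step(barriers_ev, temperature_k, rng)
        counts[chosen] += 1
        elapsed += dt
    return counts, elapsed
```

In a real simulation the list of barriers would be rebuilt after every event from the current local atomic configuration; here it is held fixed to keep the sketch short.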
Results from the ab initio study described in this proposal and the model interatomic potentials that we propose to develop will set the stage for a detailed, temperature-dependent investigation of fracture-related phenomena in steel alloys as a function of composition, segregation, and stoichiometry. These large-scale atomistic simulations will be carried out in several stages:
MC simulations to determine segregation profiles at and near grain boundaries.
kMC simulations with energy barriers calculated using ab initio techniques and model potentials.
For the study of fracture dynamics in systems containing grain boundaries, it is desirable to obtain a realistic, temperature-dependent configuration near and at the grain boundaries. It is well known that near a grain boundary or near a surface, the concentration of the two elements in a binary alloy differs from that of the bulk, resulting in strong segregation profiles. We propose to perform grand canonical MC simulations for a series of cells containing different types of (symmetrical tilt) grain boundaries, at different Fe-Ni stoichiometries and temperatures. For each system and temperature, energetics and dynamics will be monitored and analyzed in order to extract information about the mode of crack propagation and fracture. Of interest will be the region near the crack front, where dislocation nucleation and motion are expected, as well as the general behavior, at the atomic level, of the displacement field near the slip plane.
A new feature we will add to our approach, in which ab initio techniques and model potentials are used in tandem, is a continuous quality test of the model potentials used in the simulations. This will be done by regularly taking newly developed structures during the fracture dynamics and comparing the energetics and forces from the model potentials with those obtained from ab initio calculations. If large discrepancies are observed, a re-fit of the model potentials, including the newly calculated ab initio energies and forces, will be
obtained and the procedure continued until convergence is reached. We expect this quality test and improvement of the tailored model potentials to yield robust results with predictive power.
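The quality test described above can be sketched as a per-snapshot comparison of model-potential and ab initio energies and forces. The function names and the two tolerances (0.01 eV per atom, 0.1 eV/Angstrom) are hypothetical placeholders chosen for illustration; the proposal does not specify thresholds.

```python
import math


def force_rms_error(model_forces, reference_forces):
    """Root-mean-square difference between two lists of (fx, fy, fz) force vectors."""
    if len(model_forces) != len(reference_forces):
        raise ValueError("force lists must cover the same atoms")
    squared = 0.0
    for fm, fr in zip(model_forces, reference_forces):
        squared += sum((a - b) ** 2 for a, b in zip(fm, fr))
    return math.sqrt(squared / (3 * len(model_forces)))


def needs_refit(model_energy, reference_energy, model_forces, reference_forces,
                n_atoms, energy_tol=0.01, force_tol=0.1):
    """Flag a snapshot whose energy per atom (eV) or force RMS (eV/Angstrom)
    deviates from the reference by more than the given tolerance."""
    energy_error = abs(model_energy - reference_energy) / n_atoms
    force_error = force_rms_error(model_forces, reference_forces)
    return energy_error > energy_tol or force_error > force_tol
```

Snapshots flagged this way would be added to the fitting database before the model potential is re-fit and the comparison repeated.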
In the proposed studies we will calculate the pre-exponential factors for the diffusion of Fe, Ni, and vacancies at grain boundaries. A number of steps are involved in accurate, realistic calculations of the diffusion coefficient. Note that the kMC simulations proposed above, with ab initio calculations of activation energy barriers for possible diffusion mechanisms, will not give any information about the diffusion prefactor. We are thus proposing the development of robust model potentials and their use in calculating the vibrational entropy contribution to the diffusion coefficient. The details can be found elsewhere [Kuerpick et al., 1997]. Recently, we have been engaged in a number of advanced and accelerated computational techniques [Rahman et al., 2004], such as kinetic Monte Carlo, the nudged elastic band method for calculation of diffusion paths and barriers, and accelerated molecular dynamics simulations for examining growth processes on metal surfaces. We propose to apply these techniques to examine the process of diffusion in the various phases of Fe-Ni.
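For orientation, one common way to estimate a diffusion prefactor from vibrational data is the harmonic transition-state (Vineyard) expression, the ratio of the product of the N real normal-mode frequencies at the minimum to the product of the N - 1 real frequencies at the saddle point. The sketch below uses that textbook form as a stand-in; the proposal defers its own details to [Kuerpick et al., 1997], and the function name and the choice of THz units are assumptions.

```python
import math


def vineyard_prefactor(min_freqs_thz, saddle_freqs_thz):
    """Harmonic transition-state prefactor in THz.

    min_freqs_thz: the N real normal-mode frequencies at the minimum.
    saddle_freqs_thz: the N - 1 real frequencies at the saddle point
    (the single unstable mode is excluded).
    """
    if len(min_freqs_thz) != len(saddle_freqs_thz) + 1:
        raise ValueError("saddle point must have exactly one fewer real mode")
    # Sum logarithms so that products over many modes do not overflow.
    log_ratio = sum(math.log(f) for f in min_freqs_thz)
    log_ratio -= sum(math.log(f) for f in saddle_freqs_thz)
    return math.exp(log_ratio)
```

The resulting prefactor multiplies the Boltzmann factor of the activation barrier to give the rate of the corresponding process.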
4.5 Evaluation Plan
Our evaluation approach can be divided into model development, refinement, and application phases.
DEVELOPMENT: The first 12-month phase shall produce the temporal data models (event language) and a standardized, documented training corpus for learning temporal probabilistic models from our own simulators for material evolution, generalizing to 3-D boundary diffusion. Meanwhile, we will adapt data models for crack propagation and other surface defects. In an overlapping 18-month phase, we will develop new simulators using DBN structure learning and inference.
EVALUATION: We will develop algorithms for both learning and inference of graphical models and compare them to existing ones for DBN structure learning and exact and approximate inference. This project focuses on extending structure learning to relational, spatial, and multi-time models and on the evaluation of robustness by statistical validation of discovered models. We propose to generalize and refine existing evaluation methods by (i) conducting ablation studies to test the graceful degradation of the system given resource bounds; and (ii) checking dynamical models against samples of exact computations. Because bootstrap methods for model evaluation are computationally intensive, we will use high-performance Grid applications that we are developing on the existing cluster facility to accomplish the indicated experiments.
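To illustrate why bootstrap validation is costly, the sketch below estimates a confidence for each learned edge as the fraction of bootstrap resamples in which it reappears; the structure learner is rerun once per resample. The function name and the `learn_edges` callable are hypothetical stand-ins for any structure-learning routine and are not part of BNJ.

```python
import random
from collections import Counter


def bootstrap_edge_confidence(data, learn_edges, n_resamples=200, seed=0):
    """Fraction of bootstrap resamples in which each edge is learned.

    data: list of records.
    learn_edges: callable mapping a list of records to a set of edges
    (a stand-in for any structure-learning routine).
    """
    rng = random.Random(seed)
    counts = Counter()
    n = len(data)
    for _ in range(n_resamples):
        # Draw n records with replacement, then relearn the structure.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        counts.update(learn_edges(resample))
    return {edge: c / n_resamples for edge, c in counts.items()}
```

Because the resamples are independent of one another, the loop body can be distributed across cluster nodes with no communication beyond collecting the edge counts.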
APPLICATION: To validate the resultant models, we will use exact parameters for the external energy function as calculated using conjugate gradient, molecular dynamics (MD) cooling, and global Markov Chain Monte Carlo (MCMC) optimization when computationally feasible. Model application is intended to lead to a process of iterative improvement of models, encompassing time series learning, representation, and reasoning.
5. Broader Impacts
5.1 Value Added beyond EMT
Probabilistic graphical models have been used in numerous recent applications to classification, forecasting, and causal or compositional inference in many domains. These include ecological data, economic data, climate data, spectroscopic data of many kinds, and medical data, in all of which Bayesian networks have been shown to perform well in the above tasks. The PI recently received an NSF EPSCoR First Award (June 2002 through August 2003) for a research project on building probabilistic network models of cell cycle-regulated genes in yeast. Several additional research projects at KSU, Iowa State, and CMU focus on algorithms for learning graphical models from data. We intend to continue this work through the three years of this project, exposing students to open research problems and current challenges. Bayesian network structure learning is
an important but intractable subproblem of probabilistic reasoning. Therefore, greedy score-based algorithms for structure learning are sometimes used, but these are sensitive to the order in which variables are scored. Unfortunately, finding the optimal ordering of inputs entails search through the permutation space of variables. Furthermore, in real-world applications of structure learning, the gold standard (ground-truth) network is typically unknown. In response, we have developed a scoring method for orderings that uses a well-known greedy algorithm for structure learning (K2) and exact and approximate inferential loss, given specified evidence. [Hsu et al., 2002] describes how this scoring module fits within an optimization framework that evaluates the fitness of variable orderings.
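The order sensitivity noted above can be seen in a compact sketch of K2-style greedy search using the Cooper-Herskovits metric in log form. This is an illustrative reimplementation under simplifying assumptions (discrete data held as a list of dicts, a fixed parent limit); the function names are hypothetical and this is not the BNJ implementation.

```python
import math
from collections import defaultdict


def k2_log_score(data, child, parents, arity):
    """Log of the Cooper-Herskovits (K2) metric for one node given its parents.

    data: list of dicts mapping variable name -> integer state.
    arity: dict mapping variable name -> number of states.
    """
    r = arity[child]
    counts = defaultdict(lambda: [0] * r)
    for row in data:
        key = tuple(row[p] for p in parents)
        counts[key][row[child]] += 1
    score = 0.0
    # Unobserved parent configurations contribute a factor of 1 (log 0).
    for n_ijk in counts.values():
        n_ij = sum(n_ijk)
        # log[(r - 1)! / (N_ij + r - 1)!] + sum_k log(N_ijk!)
        score += math.lgamma(r) - math.lgamma(n_ij + r)
        score += sum(math.lgamma(n + 1) for n in n_ijk)
    return score


def k2_search(data, ordering, arity, max_parents=2):
    """Greedy parent selection; candidates are limited to predecessors in the ordering."""
    parents = {}
    for position, node in enumerate(ordering):
        chosen = []
        best = k2_log_score(data, node, chosen, arity)
        candidates = list(ordering[:position])
        while candidates and len(chosen) < max_parents:
            scored = [(k2_log_score(data, node, chosen + [c], arity), c)
                      for c in candidates]
            top_score, top = max(scored)
            if top_score <= best:
                break  # no candidate improves the score
            best = top_score
            chosen.append(top)
            candidates.remove(top)
        parents[node] = chosen
    return parents
```

On data in which two variables are perfectly correlated, the ordering that places A before B yields the edge A to B, while the reversed ordering yields B to A, so the learned structure depends on the ordering supplied.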
On the computer science side, our project has three points of greatest educational impact:
Fundamental concepts of graphical models: There are many courseware packages that introduce undergraduates to concepts of graph theory and probability theory, but integrative introductions to graphical models of probability are rare and tend to focus on just a few features. Furthermore, there are few such tools in the public domain, CMU’s Causal and Statistical Reasoning tutor [Glymour and Scheines, 2004] being the only one that to our knowledge can demonstrate fundamental concepts using general-purpose networks.
Learning and inference: The primary value added by BNJ is the ability to demonstrate and experiment with many algorithms for reasoning with graphical models and learning model structure and parameters from data. This is presented in a single, unified framework with many reusable, extendable classes [Kruger, 1992; Gamma et al., 1995] for visualization and error measurement.
Interfacing to real databases and working with network and data formats: BNJ can help introduce students to semistructured data, using a new XML Bayesian Network format that integrates and converts among existing formats from Microsoft Research, Hugin, Netica, etc., including several legacy formats; it can also use WEKA’s ARFF format for training and inference data.
Applications, examples, and student workbooks developed using BNJ will give students in computational sciences and applied mathematics an opportunity to study, produce, and use real models in interactive exercises. They will also support guided study with a variety of example networks, including not only small networks from many online repositories but also network structures and distributions automatically generated to specification or learned from real-world data. This real-world data includes experimental data that is a product of research uses of BNJ. Data from such projects is already being used in an educational setting in advanced undergraduate and graduate-level courses on machine learning and data mining. Equally important, each tutorial on an algorithm will include a discussion of the conditions under which it is known to be reliable. As part of this EMT project, we will develop supplementary data sets and a courseware module for BNJ that uses them to illustrate practical applications of data preparation, learning, and inference in graphical models.
Desired outcomes and benefits from use of BNJ-based materials
EDUCATIONAL BENEFITS: Our pedagogical goals are to help students attain understanding and interest in the areas of computational science and engineering (CSE) and applied mathematics that relate to CSE. Higher education should train students to be effective solvers and posers of problems. This is the focal challenge in every discipline of mathematics, science, and engineering. In the developing field of computational genomics, however, new application domains are identified and methodology is advanced, continuously and very rapidly. This poses a novel and complex challenge. We must prepare students for lifelong learning in computational sciences by imbuing them not only with technical background, but with enthusiastic interest for the subject matter and its theoretical and practical significance. Specifically, we must help undergraduates in computational physics to develop an awareness of computer science and applied mathematics as important facets of their discipline and to appreciate them as integrative subjects for professional practice and potential postgraduate study.
REUSE: The most significant side benefit of the proposed work is our plan to develop an extensible architecture and reuse library for research in materials evolution, 3-D propagation of cracks, and other nanoscale processes. The architecture is based upon data models and application programmer interfaces (APIs) that we have developed and are continuing to refine in our research. Into this framework, our student participants can incorporate codes that implement algorithms for analyzing data, then test, correct, compare, and refine them. The suite of software tools shall in turn serve as a foundation for training users to design
and carry out experiments in scientific applications of graphical models, by writing or adapting programs for their own use. Transfer of this training technology to other computational science and engineering curricula to serve the aims stated in Section 1 is one measure of success in achieving broader impacts.
We have developed new courses and a formal interdisciplinary curriculum at both the graduate and undergraduate levels and integrated them into a research program that emphasizes rigorous specification, development, and assessment of intelligent systems for computational science and engineering. We are delivering these to both traditional campus-based students and remote students, some of whom we anticipate shall be veterans of the computational science and engineering industry, seeking continuing education. We are also working to develop programs for outreach at pre-collegiate levels. Our desired pedagogical benefits, concrete outcomes, and approaches are:
Curriculum improvement: courses, degree programs, materials to facilitate active learning
Early educational outreach: improved recruitment and retention of underrepresented groups
Undergraduate involvement in research: collaborative experience; technology transfer
Increased competence: mentoring from pre-collegiate through postdoctoral level
Test Sites and Advisory Group
We propose to leverage our existing research by incorporating our basic theoretical advances in machine learning and probabilistic reasoning into our intelligent systems courses, demonstrating their benefit to students through hands-on development experience with the software packages we have developed: Machine Learning in Java (MLJ) and Bayesian Network Tools in Java (BNJ). This leads to concurrent engineering of our research codes with the educational code base, offering undergraduates and new graduate students the chance to learn about state-of-the-field software tools from the original developers. We have found that providing visual explanation of technologies to students facilitates the process of active learning by allowing them to interact with models, data, or algorithms. We shall provide students with visual programming infrastructures for development of distributed, high-performance KDD and collaborative filtering systems and require them to develop user interfaces and visualizations of models. We expect benefits of this visualization approach to accrue to new interdisciplinary teaching programs in problem solving in the colleges of engineering, arts and sciences, agriculture, and architecture.
The PI at KSU has been using BNJ in an introductory course in AI, an undergraduate course in data mining, and a graduate course on machine learning. To date, a total of 15 institutions (including KSU, Iowa State, and CMU) have elected to adopt BNJ as test sites for at least the duration of the proposed 3-year project. Other universities and companies have also made informal commitments to use BNJ on a trial basis. BNJ v3 is to be piloted as a teaching tool at a total of 25 universities.
5.2 Dissemination Plan for Research and Educational Modules
External Evaluation and Coordination with EMT Technical Leads
To address the comprehensive scope of this proposed project, the evaluation plan is multidimensional and is formed around three main elements: 1) the learning experience for the student; 2) the research environment; and 3) educational outreach. The framework of the evaluation design is based on the logic model recommended by NSF. The logic model highlights the breadth and depth of the possible impacts of the BNJ program as well as the long-term nature of the project. The model provides a visual representation of the interrelated aspects of the project and aligns with the Principal Investigators’ approach to research on student understanding, to the development of strategies for building capacity in computational science and engineering (CSE), and to the development of instructional materials.
The outcomes of the BNJ program will be expressed in terms of student motivation and learning, and changes in faculty’s knowledge and application of the BNJ toolkit. The evaluation will document the impact on undergraduate students, graduate students, and faculty. The program can also be expected to influence curriculum and the general culture of the learning environment. Thus, the evaluation will capture anticipated
and unanticipated changes the program may bring about. The evaluation will use a variety of indicators to measure the breadth and depth of impact of the Bayesian Network tools in Java on individuals and partnering institutions.
The evaluation plan is consistent with the NSF and other professional guidelines on evaluation for a project of this magnitude. Ongoing data generation and analysis (formative feedback) will be provided to the project leadership. This system of recursive review and analysis will allow the project leaders to make necessary modifications to the program activities and products throughout the implementation of the project’s goals. Summative evaluation feedback will be provided to the project personnel annually and at NSF’s request.
TECHNOLOGY TRANSFER: Our key dissemination effort is the development of research codes applicable to real-world learning and inference problems, including collaborative filtering and the specific computational science and engineering applications we have discussed. This began over two years ago with the first experimental prototype of Bayesian Network tools in Java. BNJ has been downloaded over 5000 times from SourceForge since 04 May 2002 and has over 250 registered users worldwide at the time of this writing. We anticipate that these codes will be refined over the next three years through interaction with our local and international collaborators and instructors of KDD-related courses at this and other universities, as documented in the supplementary letters of support.
LOCAL AND INTERNATIONAL MENTORING: In addition to funneling production-level research and development tools such as BNJ and MLJ back into the classroom, we have devoted focused effort to mentoring of students with potential to conduct research and become teachers in our subject area. This began in 1998 with our supervision of graduate research assistants and undergraduate programmers at NCSA and continued with our participation (1999-2000) in the Engineering Learning Enhancement Action/Resource Network (LEA/RN). Since 2001, the PI has served as the computer science undergraduate honors advisor, organized spring seminars and summer workshops for this program, mentored a student who received a Goldwater scholarship and an NSF Graduate Fellowship for her proposed contributions to BNJ, and served as faculty advisor on a 2-student project in the Computing Research Association Collaborative Research Experience for Women (CREW) program (2002-2003). We also intend to work with undergraduates from the KSU Developing Scholars Program (DSP), which is devoted to research experiences for students from underrepresented groups.
All faculty participants will incorporate postgraduate and postdoctoral mentoring into synergistic activities such as interdepartmental seminars, an activity supported by our respective universities.
DEVELOPMENT OF SOFTWARE TOOLS AND INFRASTRUCTURE: Table 1 in Section 3.1 lists the courseware modules to be implemented. To produce the tutorials and student workbook, we will develop visualization classes, front-end applications, example networks, and examples generated by recording learning and inference using these networks. Most of these tutorials emphasize fundamental theory and published algorithms, as Table 1 illustrates, but recently published research algorithms for structure learning and exact and approximate inference will also be included to give students exposure to comparative experimental methods. This project focuses on extending structure learning to semi-structured relational models and on the evaluation of robustness by statistical validation of discovered models. We propose to generalize and refine existing evaluation methods by (i) validating more model features; and (ii) checking automatically extracted models against published gold standard networks. Because bootstrap methods for model evaluation are computationally intensive, the BNJ development team will develop a high-performance Grid interface to support batch (non-interactive) benchmarking and experimentation with the core BNJ infrastructure.
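Checking an automatically extracted model against a gold standard network can be sketched as a comparison of directed edge sets. The function below is an illustrative assumption rather than a BNJ routine: it reports edge precision and recall together with a structural Hamming distance that, under one common convention, counts a reversed edge as a single error.

```python
def compare_to_gold(learned_edges, gold_edges):
    """Compare two sets of directed edges given as (parent, child) pairs."""
    learned = set(learned_edges)
    gold = set(gold_edges)
    correct = learned & gold
    # Learned edges whose opposite direction appears in the gold standard.
    reversed_edges = {(a, b) for (a, b) in learned - gold if (b, a) in gold}
    extra = (learned - gold) - reversed_edges
    # Gold edges absent from the learned graph in either direction.
    missing = {(a, b) for (a, b) in gold - learned if (b, a) not in learned}
    precision = len(correct) / len(learned) if learned else 1.0
    recall = len(correct) / len(gold) if gold else 1.0
    return {
        "precision": precision,
        "recall": recall,
        "extra": len(extra),
        "missing": len(missing),
        "reversed": len(reversed_edges),
        "shd": len(extra) + len(missing) + len(reversed_edges),
    }
```

Both inputs are assumed to be acyclic, so no pair of nodes is connected in both directions within a single graph.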
EDUCATIONAL OUTREACH: Our integrative research and education program includes planned demonstrations, workshops, and web learning materials for 8th-12th grade outreach. We aim to develop workshop activities for our university’s science and technology program for teen women, and one of the co-PIs (Hsu) has administered the summer science institute for high school juniors and seniors in our department for the past 2 years. The latter has about 40% female participation, significantly higher than the admissions or retention rates in the CS undergraduate program. We have noted a high level of interest
among women in such summer programs who are prospective majors in our CS program or in the computational physical sciences.
We introduce undergraduates, as early as their second year, to AI, machine learning, simulation, and visualization algorithms that they help implement and use in experiments. Furthermore, we believe that early undergraduate research experiences are a key to retention of female students and other underrepresented groups of students in engineering, giving them exposure to both theoretical background and commercial and industrial applications. Our efforts are supported by collaboration with our university’s Women in Engineering and Science Program. As a research scientist at NCSA, the KSU PI (Hsu) was able to hire 40-60% women research programmers at the 11th-12th grade through undergraduate level. To bring participation of women closer to this level in our programs and encourage retention through graduate study, we have regularly lectured at our undergraduate engineering honors and ethics and applications (KSU CIS 492: Computers and Society) seminars. The PI has served since 2001 as departmental honors chair. We are also planning integrative research experiences for undergraduates, including but not limited to honors students. All but one of the upper-division courses offered in our KDD program are open to undergraduates.
DISTRIBUTION OF EMT SOFTWARE TOOLS: The BNJ web site [Hsu et al., 2004] is http://bnj.sourceforge.net. Dissemination of results shall be achieved by development and distribution of open-source courseware built upon the KSU BNJ and CMU Causality Lab infrastructure, together with models in an open, semi-structured data format (the XML Bayesian Network Interchange Format). The software modules shall be added to BNJ, an open-source toolkit developed and disseminated by the research lab of one of the co-PIs (Hsu). Meanwhile, training materials produced for and as part of this research program in the form of digitally-recorded lectures, electronic manuals, and tutorials for BNJ software will be freely available. We will distribute these through our web-based distance learning infrastructure as we have done for the past two years. A web and file server, tape backup system, and DVD+RW drive to be used for dissemination of the electronic materials and courseware are budgeted as small computer items, and the expenditure schedule is given in the budget justification.
Integrating and complementing our efforts toward dissemination and outreach is the planned development of web training materials on graphical models and on applied machine learning and probabilistic reasoning. We shall provide Java source code from our source code control (CVS) repository and from multiple mirrors of the open source developer network, SourceForge. Rather than providing a heterogeneous, disorganized mixture of implementations of learning algorithms, our courseware project undertakes to document these algorithms in the simplest and clearest way by illustrating common data structures and representations of hypotheses using many common technical computing tools. Our goal is to bring probabilistic reasoning and the tools supporting our proposed research to as wide an audience as possible. Therefore we use a complementary approach of developing open-source software in general-purpose imperative languages such as Java and prototyping in technical computing languages such as MATLAB, Mathematica, and R. This trades off portability and interoperability (Java) against readability and accessibility (MATLAB, etc.) to students who may be new to both programming and intelligent systems.
WORKSHOPS: Regional workshops hosted on the KSU campus in Manhattan, KS, or at the University of Kansas test site in Lawrence, KS, will permit the lead PIs to meet annually for project coordination. The KS location is proximate to KSU and within short travel distance to several of the test sites. The workshops will include presentations and discussion panels on BNJ development and instructional use of BNJ, including in research seminars and graduate-level courses. The PI (Hsu) has successfully proposed, organized, and hosted three such satellite workshops of international conferences (IJCAI 2001, AAAI/UAI/KDD 2002, IJCAI 2003). Organizational costs for these workshops, including print copies of the workshop proceedings, documentation, mailings and postage, and presentation equipment rental, will be covered via a nominal registration fee ($25 per attendee) for professional and faculty attendees not from the advisory group or test site group. Students will not be assessed a registration fee; instead, the nominal cost of proceedings will be covered out of a requested operating budget.