Classification for High-Dimensional Problems Using Bayesian Neural Networks and Dirichlet Diffusion Trees
Radford M. Neal and Jianguo Zhang
Winners of the NIPS 2003 feature selection challenge
University of Toronto
The results
•
Combination of Bayesian
neural networks and
classification based on Bayesian
clustering with a Dirichlet
diffusion tree model.
•
A Dirichlet diffusion tree
method is used for Arcene.
•
Bayesian neural networks (as in BayesNN-large) are used for Gisette, Dexter, and Dorothea.
•
For Madelon, the class
probabilities from a Bayesian
neural network and from a
Dirichlet diffusion tree method
are averaged, then thresholded
to produce predictions.
Their General Approach
Use simple techniques to reduce the
computational difficulty of the problem,
then apply more sophisticated
Bayesian methods.
–
The simple techniques: PCA and feature selection by univariate significance tests.
–
The sophisticated methods: Bayesian neural networks with Automatic Relevance Determination (ARD) priors.
(I) First level feature
reduction
Feature selection using
significance tests (first level)
An initial feature subset was found by simple univariate significance tests (correlation coefficient, symmetrical uncertainty).
Assumption: Relevant variables will be at
least somewhat relevant on their own.
For all tests, a p-value was found by comparing to the distribution found when permuting the class labels.
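A minimal sketch of this permutation-based p-value for one feature, using |correlation| as the test statistic (the toy data and the `permutation_p_value` helper are assumptions for illustration, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_p_value(feature, labels, n_perm=1000):
    """p-value for one feature's |correlation| with the class labels,
    estimated against the null distribution from permuted labels."""
    obs = abs(np.corrcoef(feature, labels)[0, 1])
    null = np.array([
        abs(np.corrcoef(feature, rng.permutation(labels))[0, 1])
        for _ in range(n_perm)
    ])
    # add-one smoothing so the p-value is never exactly zero
    return (1 + np.sum(null >= obs)) / (1 + n_perm)

# toy data: one relevant feature, one pure-noise feature (assumed)
labels = np.repeat([0, 1], 50).astype(float)
relevant = labels + 0.5 * rng.standard_normal(100)
noise = rng.standard_normal(100)
p_rel = permutation_p_value(relevant, labels)
p_noise = permutation_p_value(noise, labels)
print(p_rel, p_noise)  # relevant feature gets a much smaller p-value
```

Relevant features pass a threshold on such p-values; purely irrelevant ones rarely do by chance.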
Dimensionality reduction with
PCA (an alternative for FS)
There are probably better dimensionality reduction methods than PCA, but that's what we used. One reason is that it's feasible even when p is huge, provided n is not too large: the time required is of order min(pn², np²).
PCA was done using all the data (training, validation, and test).
PCA was done using all the data (training,
validation, and test).
(II) Building the learning model & second-level feature selection
Bayesian Neural Networks
Conventional neural network
learning
Bayesian Neural Network
Learning
Based on the statistical interpretation of conventional neural network learning
Bayesian Neural Network
Learning
Bayesian predictions are found by integration rather than maximization. For a test case x, y is predicted by averaging over the posterior: P(y|x, D) = ∫ P(y|x, θ) p(θ|D) dθ.
A conventional neural network considers only the parameters with maximum posterior probability; a Bayesian neural network considers all possible parameters in the parameter space.
This can be implemented by a Gaussian approximation or by MCMC.
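With MCMC, the integral becomes an average of the network's output over posterior samples of the parameters. A toy sketch (the one-parameter logistic "network" and the pre-drawn samples are assumptions standing in for a real net and a real sampler):

```python
import numpy as np

rng = np.random.default_rng(0)

def net(x, w):
    """Toy one-parameter 'network': a single logistic unit."""
    return 1.0 / (1.0 + np.exp(-w * x))

# pretend these were drawn from the posterior p(w | data) by MCMC
w_samples = rng.normal(loc=2.0, scale=0.5, size=1000)

x_test = 0.7
# Bayesian prediction: average the output over posterior samples,
# instead of plugging in one "optimal" w
p_bayes = np.mean([net(x_test, w) for w in w_samples])
print(round(p_bayes, 3))
```

The averaged probability reflects parameter uncertainty, which a single maximum-posterior estimate discards.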
ARD Prior
Still remember weight decay?
How? (by optimizing the decay parameters)
–
Associate the weights from each input with their own decay parameter.
–
There are theories for optimizing the decays.
Result:
If an input feature x is irrelevant, its relevance hyperparameter β = 1/a will tend to be small, forcing the weights from that input to be near zero.
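The shrinkage effect can be sketched with ARD for a linear model using MacKay-style evidence updates (a stand-in for Neal's MCMC treatment of the hyperparameters; the toy data and fixed noise precision are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.standard_normal((n, p))
# only the first two inputs are relevant in this toy problem (assumed)
w_true = np.array([2.0, -3.0, 0.0, 0.0, 0.0])
y = X @ w_true + 0.1 * rng.standard_normal(n)

beta = 100.0        # noise precision, assumed known here
alpha = np.ones(p)  # one prior precision per input (the ARD hyperparameters)

for _ in range(50):
    # Gaussian posterior over weights under the current ARD prior
    S = np.linalg.inv(beta * X.T @ X + np.diag(alpha))
    m = beta * S @ X.T @ y
    # MacKay's evidence updates for the per-input precisions
    gamma = 1.0 - alpha * np.diag(S)
    alpha = gamma / (m ** 2 + 1e-12)

# irrelevant inputs end up with huge alpha (tiny prior variance),
# so their posterior mean weights shrink toward zero
print(np.round(m, 2))
```

This is the second-level feature selection: inputs whose hyperparameters allow only near-zero weights are effectively pruned.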
Some Strong Points of This
Algorithm
Bayesian learning integrates over the posterior distribution of the network parameters, rather than picking a single "optimal" set of parameters. This further helps to avoid overfitting.
ARD can be used to adjust the relevance of input features.
We can use priors to incorporate external knowledge.
Dirichlet Diffusion Trees
A Bayesian hierarchical clustering method
The methods
BayesNN-small:
features selected using significance tests.
BayesNN-large:
principal components.
BayesNN-DFT-combo:
the class probabilities from a Bayesian neural network and from a Dirichlet diffusion tree method are averaged, then thresholded to produce predictions.
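The combo step is a simple probability average followed by a threshold. A minimal sketch (the probability values and the ±1 label convention are illustrative assumptions):

```python
import numpy as np

# hypothetical class probabilities for 5 test cases
p_bnn = np.array([0.9, 0.2, 0.6, 0.45, 0.7])   # Bayesian neural network
p_ddt = np.array([0.8, 0.1, 0.4, 0.65, 0.6])   # Dirichlet diffusion trees

p_avg = (p_bnn + p_ddt) / 2           # average the two probability estimates
pred = np.where(p_avg > 0.5, 1, -1)   # threshold at 0.5 to get class labels
print(pred)  # → [ 1 -1 -1  1  1]
```

Averaging two well-calibrated but differently biased models often beats either one alone, which is what the combo entry exploited.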
About the datasets
The results
•
http://www.nipsfsc.ecs.soton.ac.uk/
Thanks.
Any Questions?