BayesLecture_hot - Computing

brewerobstructionAI and Robotics

Nov 7, 2013 (3 years and 9 months ago)

69 views

CSM10: Bayesian neural networks


Summary, we:



Looked at radial basis function networks.



Compared them with multi layer perceptrons and statistical techniques


Aims:



To examine the development of neural networks.



To critically investigate different types o
f neural network and to identify the key parameters which
determine network architectures suitable for specific applications


Learning outcomes:



Demonstrate an understanding of the key principles involved in neural network application
development.



Analyse

practical problems from a neural computing perspective and be able to select a suitable
neural network architecture for a given problem.



Differentiate between supervised and unsupervised methods of training neural networks, and
understand which of these m
ethods should be used for different classes of problems.



Demonstrate an understanding of the data selection and pre
-
processing techniques used in
constructing neural network applications.


An example of Bayesian statistics: “The probability of it raining t
omorrow is 0.3”


Suppose we want to reason with information that contains probabilities such as:
''There is a 70
\
% chance
that the patient has a bacterial infection''.

Bayes theories rest on the belief that for everything there is a
prior probability that

it could be true. Given a prior probability about some hypothesis (e.g. does the
patient have influenza?) there must be some evidence we can call on to adjust our views (beliefs) on the
matter. Given relevant evidence we can modify this prior probability
to produce a posterior probability of
the same hypothesis given new evidence. If the following terms are used:




p(X)

means prior probability of
X



p(X|Y)

means probability of
X

given that we have observed evidence
Y



p(Y)

is the probability of the evidence Y

occurring on its own.



p(Y|X)

is the probability of the evidence Y occurring given the hypothesis X is true (the likelihood).


We make of
Bayes Theorem
:


)
(
)
(
)
|
(
)
|
(
Y
p
X
p
X
Y
p
Y
X
p


In other words:


evidence
prior
likelihood
posterior




We know what p(X) is
-

the prior pro
bability of patients in general having influenza. Assuming that we find
that the patient has a fever, we would like to find P(X:Y) the probability of this particular patient having
influenza given that we can see that they have a fever (Y). If we don't act
ually know this we can ask the
opposite question, i.e. if a patient has influenza, what is the probability that they have a fever? Fever is
probably certain in this case, we'll assume that it is 1. The term p(Y) is the probability of the evidence
occurring

on it's own, i.e. what is the probability of
anyone

having a fever (whether they have influenza
or not? p(Y) can be calculated from:

Y)
notX)p(not
|
p(Y


Y)p(X)
|
p(X


p(Y)



This states that the probability of a fever occurring in anyone is the probability of a fever occ
urring in an
influenza patient times the probability of anyone having influenza plus the probability of fever occurring in
a non
-
influenza patient times the probability of this person being a non
-
influenza case. From the original
prior probability of p(X)h
eld in our knowledge base we can calculate p(X|Y) after having asked about the
patients fever, we can now forget about the original p(X) and instead use the new p(X|Y) as a new p(X).
So the whole process can be repeated time and time again as new evidence
comes in from the keyboard
(i.e. the user enters answers). Each time an answer is given the probability of the illness being present is
shifted up or down a bit using the Bayesian equation, each time a different prior probability being used
which has been
derived from the last posterior probability.

An example using Bayes Theorem
:



Suppose that the hypothesis X is that ‘X is a man’ and notX is that ‘X is a woman’, and we want to
calculate which is the most likely given the available evidence.



Suppose that we

have evidence that the prior probability of X, p(X) is 0.7, so that p(not X) = 0.3.



Suppose we have evidence Y that X has long hair, and suppose that p(Y|X) is 0.1 {i.e. most men
don’t have long hair} and p(Y) is 0.4 {i.e. quite a few
people

have long hai
r}.



Our new estimate of P(X|Y) i.e. that X is a man given that we now know that X has long hair is:

p(X|Y) = p(Y|X)P(X)/P(Y)

= (0.1*(0.7))/0.4

= 0.175

So our probability of ‘X is a man’ has moved from 0.7 to 0.175, given the evidence of long hair. In thi
s way
new P(X|Y) are calculated from old probabilities given new evidence. Eventually, having gathered all the
evidence concerning all of the hypotheses, we, or the system, can come to a final conclusion about the
patient. What most systems using this form

of inference do is set an upper and lower threshold. If the
probability exceeds the upper threshold that hypothesis is accepted as a likely conclusion to make. If it
falls below the lower threshold then it is rejected as unlikely.

Problems with Bayesian i
nference



Computationally expensive



The Prior probabilities are not always available and are often subjective



Often the Bayesian formulae don’t correspond with the expert’s degrees of belief. For Bayesian
systems to work correctly, an expert should tell us
that ‘The presence of evidence Y enhances the
probability of the hypothesis X, and the absence of evidence Y decreases the probability of X’, but in fact
many experts will say that ‘The presence of Y enhances the probability of X, but the absence of Y has
no
significance’, which is not true in a strict Bayesian framework.



Assumes independent evidence

Bayesian methods are often used in both statistics and Artificial Intelligence based around expert
systems. However, they can also be used with neural networks
. Conventional training methods for
multilayer perceptrons (such as backpropagation) can be interpreted in statistical terms as variations on
maximum likelihood estimation. The idea is to find a single set of weights for the network that maximize
the fit t
o the training data, perhaps modified by some sort of weight penalty to prevent overfitting.


The Bayesian school of statistics is based on a different view of what it means to learn from data, in which
probability is used to represent uncertainty about t
he relationship being learned (a use that is shunned in
Conventional frequentist statistics). Before we have seen any data, our
prior

opinions about what the true
relationship might be can be expressed in a probability distribution over the network weights

that define
this relationship. After we look at the data (or after our program looks at the data), our revised opinions
are captured by a
posterior

distribution over network weights. Network weights that seemed plausible
before, but which don't match the
data very well, will now be seen as being much less likely, while the
probability for values of the weights that do fit the data well will have increased.


Typically, the purpose of training is to make predictions for future cases where only the inputs to

the
network are known. The result of conventional network training is a single set of weights that can be used
to make such predictions. In contrast, the result of Bayesian training is a posterior distribution over
network weights. If the inputs of the ne
twork re set to the values for some new case, the posterior
distribution over network weights will give rise to a distribution over the outputs of the network, which is
known as the predictive distribution for this new case. If a single
-
valued prediction i
s needed, one might
use the mean of the predictive distribution, but the full predictive distribution also tells you how uncertain
this prediction is.


Why bother with all this? The hope is that Bayesian methods will provide solutions to such fundamental
problems as:




How to judge the uncertainty of predictions. This can be solved by looking at the predictive
distribution, as described above.



How to choose an appropriate network architecture (e.g., the number hidden layers, the number of
hidden units in
each layer).



How to adapt to the characteristics of the data (e.g., the smoothness of the function, the degree to
which different inputs are relevant).


Good solutions to these problems, especially the last two, depend on using the right prior distributi
on, one
that properly represents the uncertainty that you probably have about which inputs are relevant, how
smooth the function you are modelling is, how much noise there is in the observations, etc. Such carefully
vague prior distributions are usually de
fined in a hierarchical fashion, using
hyperparameters
, some of
which are analogous to the weight decay constants of more conventional training procedures. An
"Automatic Relevance Determination" scheme can be used to allow many possibly
-
relevant inputs to
be
included without damaging effects.


Selection of an appropriate network architecture is another place where prior knowledge plays a role. One
approach is to use a very general architecture, with lots of hidden units, maybe in several layers or
groups,
controlled using hyperparameters.


Implementing all this is one of the biggest problems with Bayesian methods. Dealing with a distribution
over weights (and perhaps hyperparameters) is not as simple as finding a single "best" value for the
weights. Exact
analytical methods for models as complex as neural networks are out of the question. Two
approaches have been tried:




Find the weights/hyperparameters that are most probable, using methods similar to conventional
training (with regularization), and then a
pproximate the distribution over weights using information
available at this maximum.



Use a Monte Carlo method to sample from the distribution over weights. The most efficient
implementations of this use dynamical Monte Carlo methods whose operation resem
bles that of backprop
with momentum.


Monte Carlo methods for Bayesian neural networks have been developed by Neal. In this approach, the
posterior distribution is represented by a sample of perhaps a few dozen sets of network weights


Pros of Bayesian ne
ural networks



Bayesian methods are a superset of conventional neural network methods.



Network complexity (such as number of hidden units) can be chosen as part of the training process,
without using cross
-
validation.



Better when data is in short supply as
you can (usually) use the validation data to train the network.



For classification problems the tendency of conventional approached to make overconfident
predictions in regions of sparse training data can be avoided.



Regularisation is a way of controlling
the complexity of a model by adding a penalty term to the error
function (such as weight decay). Regularization is a natural consequence of using Bayesian methods,
which allow us to set regularisation coefficients automatically (without cross
-
validation).
Large numbers of
regularisation coefficients can be used, which would be computationally prohibitive if their values had to
be optimised using cross
-
validation.



Confidence intervals and error bars can be obtained and assigned to the network outputs when th
e
network is used for regression problems.



Allows straightforward comparison of different neural network models (such as MLPs with different
numbers of hidden units or MLPs and RBFs) using only the training data.



Guidance is provided on where in the input

space to seek new data (active learning allows us to
determine where to sample the training data next).



Relative importance of inputs can be investigated (Automatic Relevance Detection)



Very successful in certain domains



Theoretically the most powerful me
thod


Cons of Bayesian neural networks



Requires to choose prior distributions, mostly based on analytical convenience rather than real
knowledge about the problem



Computationally intractable


Model comparison



Due to probability measure, models can be compa
red:



Complex models have lower probability density over large range of data sets



Simple models have high probability density over a small range of data sets



Thus, there should be a compromise in terms of complexity and confidence in the model


Ockham’s raz
or



More complex networks (e.g. more hidden units) give better fit to the training data.



However, the best generalisation requires that the model be neither too complex nor too simple.



The conventional solution is to use cross
-
validation which is time consu
ming and wasteful of data.



Bayesian methods have an automatic Ockham’s razor to select the appropriate level of model
complexity.


Consider three neural network models
H
1
,
H
2
,
H
3

of successively greater complexity (e.g. neural networks
with increasing numb
ers of hidden units. Then evaluate the probabilities of the models given the data
D
,
using Bayes theorem:


)
(
)
(
)
|
(
)
|
(
D
p
H
p
H
D
p
D
H
p
i
i
i



Assuming equal priors
p(H
i
)

then models are ranked according to
p(D| H
i
)

A particular data set
D
0

might favour the model
H
2

which has intermediate complexity:


So to extend Bayesian learning to neural networks we:



Assume probability distribution over network weights and some prior



Need to interpret network outputs probabilistically (use appropriate output activation function)


Then we apply Bayesian learning procedure:



Start with prior distribution p(w) and choose appropriate parameters (guess, usually broad distribution
to reflect uncertainty)



Observe data, calculate posterior of parameters by Bayes rule.



Continue updating if
more data comes in, replacing the prior with the posterior



In order to make prediction, the expectation given the posterior distribution has to be found (might be
very complex)


Summary



In practice, Bayesian networks often outperform standard networks (suc
h as MLPs trained with
backpropagation).



However there are several unresolved issues (such as how best to choose the priors) and more
research is needed.



Bayesian networks are computationally intensive and therefore take a long time to train.