A Common Neural Network Model for Unsupervised Exploratory Data Analysis and Independent Component Analysis

Keywords: Unsupervised Learning, Independent Component Analysis, Data Clustering, Data Visualisation, Blind Source Separation.

Mark Girolami‡†, Andrzej Cichocki, and Shun-Ichi Amari



RIKEN Brain Science Institute
Laboratory for Open Information Systems &
Laboratory for Information Synthesis
Hirosawa 2-1, Wako-shi, Saitama 351-01,
Japan.
Tel (+81) 48 467-9669
Fax (+81) 48 467-9687
{cia, amari}@brain.riken.go.jp
giro@open.brain.riken.go.jp

† Currently on Secondment from
Department of Computing and Information Systems
The University of Paisley
Paisley, PA1 2BE,
Scotland
Tel (+44) 141 848 3963
Fax (+44) 848 3404
giro0ci@paisley.ac.uk
IEEE Membership Number: 40216669
Manuscript Number TNN#3323
Accepted for IEEE Transactions on Neural Networks
BRIEF PAPER
Submitted: 13-08-97
Accepted: 28-07-98

‡ Corresponding Author
Abstract
This paper presents the derivation of an unsupervised learning algorithm which enables the identification and visualisation of latent structure within ensembles of high dimensional data. The algorithm provides a linear projection of the data onto a lower dimensional subspace in order to identify the characteristic structure of the observations' independent latent causes, and it is shown to be a very promising tool for unsupervised exploratory data analysis and data visualisation. Experimental results confirm the attractiveness of this technique for exploratory data analysis, and an empirical comparison is made with the recently proposed Generative Topographic Mapping (GTM) and standard principal component analysis (PCA). Based on standard probability density models, a generic nonlinearity is developed which allows both (1) identification and visualisation of dichotomised clusters inherent in the observed data and (2) separation of sources with arbitrary distributions from mixtures whose dimensionality may be greater than the number of sources. The resulting algorithm is therefore also a generalised neural approach to independent component analysis (ICA), and it is considered a promising method for the analysis of real-world data that may consist of sub- and super-Gaussian components, such as biomedical signals.
1. Introduction
This paper develops a generalisation of the adaptive neural forms of the
independent component analysis (ICA) algorithm primarily as a method for
interactive unsupervised data exploration, clustering and visualisation.
The ICA transformation has attracted a great deal of research focus in an attempt to solve the signal-processing problem of blind source separation (BSS) [1, 2, 3, 6, 7, 8, 10, 14, 15, 16, 18]. However, the work reported in this paper has been motivated by unsupervised data exploration and data visualisation. Unsupervised statistical analysis for classification or clustering of data is a subject of great interest when no classes are defined a priori. The projection pursuit (PP) methodology, as detailed in [13], was developed to seek or pursue P-dimensional projections of multi-dimensional data which maximise some measure of statistical interest, where P = 1 or 2 to enable visualisation. Projection pursuit therefore provides a means of latent data structure identification through visualisation [13].
The link between projection pursuit and independent component analysis (ICA) is discussed in [18], and a neural implementation of projection pursuit is developed and utilised for BSS in [15]. It is argued in [18] that the maximally independent index of projection offered by ICA best describes the fundamental nature of the data. This is of course in accordance with the latent variable model of data [12], which invariably assumes that the latent variables are orthogonal, i.e. independent.
A stochastic gradient-based algorithm is developed which is shown, ultimately, to be an extension of the natural gradient family of algorithms proposed in [1, 2]. The paper has the following structure: Section 2 introduces the ICA, or latent variable, model of data and briefly reviews the ICA data transformation. Section 3 presents the derivation of the algorithm for data visualisation and ICA. In Section 4 the classical Pearson Mixture of Gaussians (MOG) density model [19] for cluster identification is utilised as the nonlinear term of the developed algorithm. It is found that this provides an elegant closed-form generic activation function, which also provides a method of separating arbitrary mixtures of non-Gaussian sources. Section 5 reports on a data exploration simulation and a source separation experiment. The concluding section discusses potential extensions of the proposed approach.
2. The Independent Component Analysis Data Model
The particular ICA data model considered in this paper is defined as follows:

$$\mathbf{x}(t) = \mathbf{A}\mathbf{s}(t) + \mathbf{n}(t) \qquad (1)$$

The observed zero-mean data vector is real valued such that $\mathbf{x}(t) \in \Re^{N}$; the vector of underlying independent sources or factors is given as $\mathbf{s}(t) \in \Re^{M}$, such that $M \leq N$, and due to source independence the multivariate joint density is factorable: $p(\mathbf{s}) = \prod_{i=1}^{M} p_i(s_i)$. The noise vector $\mathbf{n}(t)$ is assumed to be Gaussian with a diagonal covariance matrix $\mathbf{R}_{nn} = E\{\mathbf{n}\mathbf{n}^{\mathrm{T}}\}$, where $E$ denotes expectation, and the variance of each noise component is usually assumed constant such that $\mathbf{R}_{nn} = \sigma_n^2 \mathbf{I}$. The unknown real-valued matrix $\mathbf{A} \in \Re^{N \times M}$ is designated the mixing matrix in the ICA literature [6] or, in factor analysis, the factor loading matrix [12].

Our objective is then to find a linear transformation

$$\mathbf{y} = \mathbf{W}\mathbf{x} \qquad (2)$$

with $\mathbf{W} \in \Re^{P \times N}$, where $P \ll N$, typically with $P = 2$ for visualisation purposes, such that the elements of $\mathbf{y}$ are as non-Gaussian and mutually independent as possible.
The objective of standard ICA is to recover all or some of the original source signals $\mathbf{s}(t)$, or, indeed, to extract one specific source, when only the observation vector $\mathbf{x}(t)$ is available [16]. Alternatively, the objective may be to estimate the mixing matrix $\mathbf{A}$. In contrast to these objectives, in this paper our primary task is not to estimate or extract any specific source signal $s_i(t)$ but rather to cluster the data into logical groupings and allow their visualisation. The non-Gaussian nature of the marginal components of $\mathbf{s}$, in terms of exploratory data analysis, is indicative of interesting structure such as bi-modalities, i.e. clusters and intrinsic classes. This indicates the potential for ICA to be applied to the clustering of data, and this shall be explored further herein. The following section derives an unsupervised learning algorithm which will be capable of identifying latent structure within data.
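The generative model (1), its factorable sources, and the zero-mean assumption can be sketched numerically. The source distributions, dimensions, and noise level below are illustrative choices for this sketch, not settings taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
T, M, N = 5000, 2, 4               # samples, sources, sensors (M <= N)

# Two independent non-Gaussian sources: a bimodal (sub-Gaussian) one and
# a Laplacian (super-Gaussian) one -- illustrative choices only.
s = np.vstack([
    rng.choice([-1.0, 1.0], T) + 0.2 * rng.standard_normal(T),
    rng.laplace(size=T),
])                                      # p(s) factorises over the rows

A = rng.standard_normal((N, M))         # mixing matrix A, N x M
n = 0.01 * rng.standard_normal((N, T))  # Gaussian noise, R_nn = sigma_n^2 I
x = A @ s + n                           # observations, eq. (1)
x -= x.mean(axis=1, keepdims=True)      # enforce the zero-mean assumption
```

Data generated this way is used purely to exercise the learning rules derived below; any linear transformation (2) is then a $P \times N$ matrix applied to the columns of `x`.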
3. A Gradient Algorithm Based on Maximal Marginal Negentropy Criterion
The projection pursuit methodology, which seeks linear projections of the observed data, can be considered as a means of seeking latent non-Gaussian structure within the observations. In many ways the ICA model, which assumes non-Gaussian sources, can be a more realistic data model than the FA model, which assumes independent Gaussian sources. In [13] the maximisation of higher order moments is utilised to pursue projections that will identify structure associated with the maximised moment, i.e. multiple modes or skewness. However, if the resulting moments are small, thus describing a mesokurtic (only slightly deviated from Gaussian) structure, then moment-based indices may not be suitable. Marriot, in [13], argues that information theoretic criteria for maximisation may need to be considered in this case. The most obvious choice of an information theoretic measure signifying departure from Gaussianity is negentropy [9].
Negentropy [9] is defined as the Kullback-Leibler divergence of a distribution $p_y(y_i)$ from a Gaussian distribution $p_G(y_i)$ with identical mean and variance. In the univariate case this is

$$KL\left(p_y \,\|\, p_G\right) = \int p_y(y_i) \log\frac{p_y(y_i)}{p_G(y_i)}\, dy_i \qquad (3)$$

where the subscript $i$ denotes the $i$'th marginal density of a data vector $\mathbf{y}$. Negentropy is always greater than or equal to zero and vanishes only when the distribution $p_y(y)$ is normally distributed [9]. It therefore lends itself to stochastic gradient ascent techniques.
The derivation of a learning rule for a simple single layer structure, which will drive the output of each neuron maximally from normality, is the goal of this particular section. Each output neuron is parameterised individually, as it is the intention for each neuron to respond optimally to differing independent features within the data. This is also in accordance with the factorial representation of the density of the underlying sources (or latent variables) in both the FA and ICA data models. We then use the factorial parametric form for the density of the network output, where $\mathbf{y} = \mathbf{W}\mathbf{x}$ and $\mathbf{W} \in \Re^{P \times N}$, $P \leq N$. The following criterion is proposed as a measure of non-Gaussianity for the $P$ outputs:

( )
( )
( )
( )
y
y
y
y
d
p
yp
pppKL
G
P
i
ii
GF

=

=
1
log (4)
where the subscript
F
denotes the factored form. This criterion can then be written as

( )
( )
( )
( )
( )
( )

=

+=
P
i
P
GF
d
i
y
i
ppeppKL
1
log det2log
2
1
y
y
R
y
yy
π (5)
The covariance matrices of the observed and transformed data, $\mathbf{R}_{xx} = E\{\mathbf{x}\mathbf{x}^{\mathrm{T}}\}$ and $\mathbf{R}_{yy} = E\{\mathbf{y}\mathbf{y}^{\mathrm{T}}\} = \mathbf{W}\mathbf{R}_{xx}\mathbf{W}^{\mathrm{T}}$, are both positive definite. By considering the maximisation of this criterion we note that the two individual terms play a significant role in the emergent properties of the learning. Consider the left-hand term, $\frac{1}{2}\log\left((2\pi e)^{P}\det(\mathbf{R}_{yy})\right)$. The Hadamard inequality [9] is given as

$$\det\left(\mathbf{R}_{yy}\right) \leq \prod_{i=1}^{P} \sigma_{y_i}^{2} \qquad (6)$$

where $\sigma_{y_i}^{2} = E\{y_i^2\}$ is the variance of the $i$-th output. This indicates that, for fixed output variances, the maximal value of the term will be attained when the covariance matrix of the network output is diagonalised. This is precisely the effect of the 'sphering' data transformation often discussed in the projection pursuit literature [12]. Likewise, the second term will be maximal once the transformed data conforms to a factorial representation, thus ensuring that each neuron will indeed be responding to distinct independent underlying characteristics of the data.
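The role of Hadamard's inequality (6) can be checked numerically. The matrix below is an arbitrary positive definite stand-in for $\mathbf{R}_{yy}$, chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((3, 3))
R = B @ B.T + 3.0 * np.eye(3)      # an arbitrary positive definite R_yy

det_R = np.linalg.det(R)           # det(R_yy)
prod_var = np.prod(np.diag(R))     # product of the output variances
# Hadamard: det(R_yy) <= prod_i sigma_i^2, with equality iff R_yy
# is diagonal -- the case reached by a sphering transformation.
```

For a diagonal matrix the determinant equals the product of the diagonal entries, so a diagonalised output covariance attains the bound exactly.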
In order to derive the learning algorithm, let us compute the gradient of the proposed criterion with respect to the weights $w_{ij}$ of the transformation matrix:

$$\frac{\partial KL\left(p_y \,\|\, p_{GF}\right)}{\partial \mathbf{W}} = \frac{\partial}{\partial \mathbf{W}}\left[\frac{P}{2}\log 2\pi e + \frac{1}{2}\log\det\left(\mathbf{W}\mathbf{R}_{xx}\mathbf{W}^{\mathrm{T}}\right) + \sum_{i=1}^{P} E\left\{\log p_i(y_i)\right\}\right] \qquad (7)$$
It is interesting to note that the term $\log\det(\mathbf{W}\mathbf{R}_{xx}\mathbf{W}^{\mathrm{T}})$ ensures that the rank of $\mathbf{W}$ is equal to $P$, and so each of the rows of $\mathbf{W}$ will be linearly independent. This is a naturally occurring term which ensures that each output neuron will seek mutually independent directions. Now it is straightforward to show that

$$\frac{\partial}{\partial \mathbf{W}}\,\frac{1}{2}\log\det\left(\mathbf{W}\mathbf{R}_{xx}\mathbf{W}^{\mathrm{T}}\right) = \left(\mathbf{W}\mathbf{R}_{xx}\mathbf{W}^{\mathrm{T}}\right)^{-1}\mathbf{W}\mathbf{R}_{xx} \qquad (8)$$
Taking the gradient of the output entropy term gives

$$\frac{\partial}{\partial \mathbf{W}} \sum_{i=1}^{P} E\left\{\log p_i(y_i)\right\} = E\left\{\mathbf{f}(\mathbf{y})\mathbf{x}^{\mathrm{T}}\right\} \qquad (9)$$

The function $\mathbf{f}(\mathbf{y})$ operates element-wise on the vector $\mathbf{y}$, such that $\mathbf{f}(\mathbf{y}) = \left[f_1(y_1), \ldots, f_P(y_P)\right]^{\mathrm{T}}$ and $f_i(y_i) = p_i'(y_i)/p_i(y_i)$. The final gradient term is then

$$\frac{\partial KL\left(p_y \,\|\, p_{GF}\right)}{\partial \mathbf{W}} = \left(\mathbf{W}\mathbf{R}_{xx}\mathbf{W}^{\mathrm{T}}\right)^{-1}\mathbf{W}\mathbf{R}_{xx} + E\left\{\mathbf{f}(\mathbf{y})\mathbf{x}^{\mathrm{T}}\right\} \qquad (10)$$
The standard instantaneous gradient ascent technique can now be used in a stochastic update equation; however, we consider utilising the more efficient natural gradient for the weight update [1, 2]. In the particular case under consideration, where $P \ll N$, the square symmetric matrix term $\mathbf{W}^{\mathrm{T}}\mathbf{W}$ will be positive semi-definite, so the standard natural gradient formula

$$\Delta\mathbf{W}_t = \eta_t\, \frac{\partial KL\left(p_y \,\|\, p_{GF}\right)}{\partial \mathbf{W}_t}\, \mathbf{W}_t^{\mathrm{T}}\mathbf{W}_t \qquad (11)$$
cannot be used directly. We propose the modified formula

$$\Delta\mathbf{W}_t = \eta_t\, \frac{\partial KL\left(p_y \,\|\, p_{GF}\right)}{\partial \mathbf{W}_t} \left(\mathbf{W}_t^{\mathrm{T}}\mathbf{W}_t + \varepsilon_t\mathbf{I}\right) \qquad (12)$$

where $\varepsilon_t$ is a small positive constant which ensures that the term $\left(\mathbf{W}_t^{\mathrm{T}}\mathbf{W}_t + \varepsilon_t\mathbf{I}\right)$ is always positive definite. This approach is similar to that used in many optimisation methods to keep the inverse of the Hessian positive definite.
Using (10) and (12) yields the following stochastic weight update:

$$\Delta\mathbf{W}_t = \eta_t\left(\mathbf{I} + \mathbf{f}(\mathbf{y}_t)\mathbf{y}_t^{\mathrm{T}}\right)\mathbf{W}_t + \eta_t\varepsilon_t\left(\left(\mathbf{W}_t\mathbf{R}_{xx}\mathbf{W}_t^{\mathrm{T}}\right)^{-1}\mathbf{W}_t\mathbf{R}_{xx} + \mathbf{f}(\mathbf{y}_t)\mathbf{x}_t^{\mathrm{T}}\right) \qquad (13)$$

For small values of $\varepsilon_t$, $O(\eta_t) \gg O(\eta_t\varepsilon_t)$, and the rightmost term will have a negligible effect on the weight update, so the learning equation can be approximated by

$$\Delta\mathbf{W}_t = \eta_t\left(\mathbf{I} + \mathbf{f}(\mathbf{y}_t)\mathbf{y}_t^{\mathrm{T}}\right)\mathbf{W}_t \qquad (14)$$
The weight updates (13) and (14) will seek maximally non-Gaussian projections onto lower dimensional subspaces for unsupervised exploratory data analysis. However, (14) can also be seen as a generalisation of the original equivariant ICA algorithms [1, 2, 3, 6, 7, 8], as it is capable of finding $P$ independent components in an $N$-dimensional observation space. An alternative approach to this problem is proposed in, for example, [16], where the notion of 'extracting' sources sequentially from the observed mixture is utilised.
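The approximate update (14) can be sketched with the score function $\mathbf{f}$ left pluggable. The particular data, dimensions, score, and learning rate below are illustrative assumptions rather than settings from the paper:

```python
import numpy as np

def natural_gradient_step(W, x, f, eta):
    """One stochastic step of eq. (14): dW = eta (I + f(y) y^T) W."""
    y = W @ x                       # project the current sample, y = Wx
    P = W.shape[0]
    return W + eta * (np.eye(P) + np.outer(f(y), y)) @ W

# Illustrative use with a bimodality-seeking score, f(y) = 2 tanh(2y) - y,
# on arbitrary zero-mean data (4-dimensional, projected to P = 2).
rng = np.random.default_rng(2)
X = rng.standard_normal((4, 2000))
W = 0.1 * rng.standard_normal((2, 4))
for t in range(2000):
    W = natural_gradient_step(W, X[:, t],
                              lambda y: 2.0 * np.tanh(2.0 * y) - y,
                              eta=0.01)
```

Because the update multiplies the gradient by $\mathbf{W}_t^{\mathrm{T}}\mathbf{W}_t$ from the right, no explicit matrix inversion of $\mathbf{W}\mathbf{R}_{xx}\mathbf{W}^{\mathrm{T}}$ is needed per step, which is what makes the rule attractive for online use.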
4. A Mixture Model to Identify Latent Clustered Data Structure
Distributions that are bi-modal exhibit one form of latent structure which is of interest to identify: multiple modes may indicate specific clusters and classes inherent in the data. Maximum likelihood estimation (MLE) approaches to data clustering [11] employ Mixture of Gaussians (MOG) models to seek transition regions of high and low density and so identify potential data clusters.

One univariate MOG model of particular interest in this study was originally proposed in [19]. The generic form of the Pearson model is given below:
$$p(y) = a\, p_G\!\left(y; \mu_1, \sigma_1^2\right) + (1-a)\, p_G\!\left(y; \mu_2, \sigma_2^2\right) \qquad (15)$$

where $0 \leq a \leq 1$ (see Figure 1). The distribution is symmetric, possessing two distinct modes, when the mixing coefficient $a = \frac{1}{2}$.
For the strictly symmetric case, where $a = \frac{1}{2}$, $\mu_1 = -\mu_2 = \mu$ and $\sigma_1^2 = \sigma_2^2 = \sigma^2$, the above MOG density model (15) can be written as

$$p(y) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{\mu^2}{2\sigma^2}\right)\exp\!\left(-\frac{y^2}{2\sigma^2}\right)\cosh\!\left(\frac{\mu y}{\sigma^2}\right) \qquad (16)$$
Employing (16) to compute the individual nonlinear terms in (14) produces the following:

$$f(y) = \frac{p'(y)}{p(y)} = -\frac{y}{\sigma^2} + \frac{\mu}{\sigma^2}\tanh\!\left(\frac{\mu y}{\sigma^2}\right) \qquad (17)$$

This is a particularly interesting form of nonlinearity, as both the linear and hyperbolic tangent terms have been studied as the activations for single layer unsupervised neural networks. A density model has now been identified with these particular activations, which allows a probabilistic interpretation of the hyperbolic tangent activation function.
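That (17) is indeed the derivative of the log of the density (16) can be verified numerically with a central difference; the grid and step size below are arbitrary choices for the check:

```python
import numpy as np

mu, sig2 = 2.0, 1.0               # the bimodal setting of Figure 1

def log_p(y):                     # log of the symmetric Pearson density (16)
    return (-0.5 * np.log(2 * np.pi * sig2) - mu**2 / (2 * sig2)
            - y**2 / (2 * sig2) + np.log(np.cosh(mu * y / sig2)))

def f(y):                         # the closed-form score of eq. (17)
    return -y / sig2 + (mu / sig2) * np.tanh(mu * y / sig2)

y = np.linspace(-4.0, 4.0, 81)
h = 1e-5
numeric = (log_p(y + h) - log_p(y - h)) / (2 * h)   # d/dy log p(y)
# `numeric` agrees with f(y) to within the finite-difference error.
```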
To gain some insight regarding the statistical nature of the proposed MOG model, the associated cumulant generating function (CGF) is employed [14]. The explicit form of the CGF for the generic Pearson model (15) in the case where $\mu_1 = -\mu_2 = \mu$ and $\sigma_1^2 = \sigma_2^2 = \sigma^2$ is simply $\phi(w) = \log\left(a\exp(A) + (1-a)\exp(B)\right)$, where $A = i\mu w - \frac{\sigma^2 w^2}{2}$ and $B = -i\mu w - \frac{\sigma^2 w^2}{2}$ [14]. The related cumulants of the distribution can now be computed, and after some tedious algebraic manipulation the kurtosis of the distribution under consideration is

$$\kappa_4 = \frac{16\,a(1-a)\left(1 - 6a(1-a)\right)\mu^4}{\left(4a(1-a)\mu^2 + \sigma^2\right)^2} \qquad (18)$$
For the symmetric case, where $a = \frac{1}{2}$, the expression for the kurtosis reduces to

$$\kappa_4 = -\frac{2\mu^4}{\left(\mu^2 + \sigma^2\right)^2} \qquad (19)$$

which, interestingly, takes strictly negative values for all $\mu > 0$ [14]. This is of particular significance, as this distribution and nonlinear function can also be utilised for BSS of strictly sub-Gaussian sources.
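The closed form (19) can be cross-checked against the exact moments of the symmetric mixture $\frac{1}{2}N(\mu,\sigma^2) + \frac{1}{2}N(-\mu,\sigma^2)$, namely $E\{y^2\} = \mu^2 + \sigma^2$ and $E\{y^4\} = \mu^4 + 6\mu^2\sigma^2 + 3\sigma^4$ (the odd cross terms cancel by symmetry):

```python
import numpy as np

def pearson_kurtosis(mu, sig2):
    """Closed-form kurtosis of the symmetric (a = 1/2) model, eq. (19)."""
    return -2.0 * mu**4 / (mu**2 + sig2)**2

mu, sig2 = 2.0, 1.0
m2 = mu**2 + sig2                            # E{y^2}
m4 = mu**4 + 6 * mu**2 * sig2 + 3 * sig2**2  # E{y^4}
kurt_from_moments = m4 / m2**2 - 3.0         # standard kurtosis definition
# For mu = 2, sig2 = 1 both routes give -32/25 = -1.28: sub-Gaussian.
```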
For the case where two distinct modes are defined, such that $\mu = 2$ and $\sigma^2 = 1$ (see Figure 1), the nonlinearity takes the very simple form

$$f(y) = 2\tanh(2y) - y \qquad (20)$$
The following weight adaptation will then seek projections which identify maximally dichotomised clustered structure inherent in the data:

$$\Delta\mathbf{W}_t = \eta_t\left(\mathbf{I} + 2\tanh(2\mathbf{y}_t)\mathbf{y}_t^{\mathrm{T}} - \mathbf{y}_t\mathbf{y}_t^{\mathrm{T}}\right)\mathbf{W}_t \qquad (21)$$
An alternative density model, which defines a unimodal super-Gaussian density, is given as $p(y) \propto \mathrm{sech}(y)\exp\!\left(-y^2/2\right)$, where the normalising constant is neglected [14]. The associated derivative of the log-density is then $f(y) = -\tanh(y) - y$. This can be combined with the nonlinear term based on the symmetric Pearson model, when $\mu = 1$ and $\sigma^2 = 1$, yielding, in vector format,

$$\mathbf{f}(\mathbf{y}) = -\mathbf{K}_4\tanh(\mathbf{y}) - \mathbf{y} \qquad (22)$$
The square diagonal matrix, which contains the sign of each individual output kurtosis, is defined as

$$\mathbf{K}_4 = \mathrm{diag}\left(\mathrm{sgn}\left(\kappa_4^1\right), \mathrm{sgn}\left(\kappa_4^2\right), \ldots, \mathrm{sgn}\left(\kappa_4^P\right)\right) \qquad (23)$$
The kurtosis of each output can be estimated online using the following moving average estimator:

$$\hat{m}_p\!\left[y_{t+1}\right] = \left(1 - \mu_t\right)\hat{m}_p\!\left[y_t\right] + \mu_t\, y_t^{\,p} \qquad (24)$$

$$\hat{\kappa}_4\!\left[y_t\right] = \frac{\hat{m}_4\!\left[y_t\right]}{\hat{m}_2^2\!\left[y_t\right]} - 3 \qquad (25)$$

The sample moments of order $p$ are estimated using (24); in this case the second and fourth order moments are required. The sample kurtosis estimate is then given by (25).
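A minimal sketch of the estimator (24)-(25); the smoothing rate, initial values, and test distributions are illustrative assumptions:

```python
import numpy as np

def online_kurtosis(y, rate=0.001):
    """Moving-average sample moments, eq. (24), and kurtosis, eq. (25).

    `rate` plays the role of the smoothing constant mu_t in (24)."""
    m2, m4 = 1.0, 3.0              # initialised at the Gaussian values
    for yt in y:
        m2 = (1 - rate) * m2 + rate * yt**2    # eq. (24), p = 2
        m4 = (1 - rate) * m4 + rate * yt**4    # eq. (24), p = 4
    return m4 / m2**2 - 3.0                    # eq. (25)

rng = np.random.default_rng(3)
k_sub = online_kurtosis(rng.uniform(-1, 1, 50000))  # uniform: kurtosis -1.2
k_sup = online_kurtosis(rng.laplace(size=50000))    # Laplacian: kurtosis +3
```

Only the sign of the estimate is needed for the $\mathbf{K}_4$ matrix of (23), so the estimator can be quite noisy without affecting the adaptation.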
The generic term (22) can be substituted into (14), finally giving

$$\Delta\mathbf{W}_t = \eta_t\left(\mathbf{I} - \mathbf{K}_4\tanh(\mathbf{y}_t)\mathbf{y}_t^{\mathrm{T}} - \mathbf{y}_t\mathbf{y}_t^{\mathrm{T}}\right)\mathbf{W}_t \qquad (26)$$

This update equation can then also be applied to general ICA where the number of outputs is less than the number of sensors. From the form of (26) it is clear that this adaptation rule can also be utilised to separate scalar mixtures which may contain arbitrary numbers of both sub- and super-Gaussian sources. The use of (21) will specifically seek linear projections identifying bi-modal and dichotomised clustered structure within the data.
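The complete rule (26), with the kurtosis signs learned online via (24)-(25), can be sketched end to end. The sources (one Laplacian, one uniform), the 3 x 2 mixing, the learning rates, and the number of passes are all illustrative choices, not the paper's experimental settings:

```python
import numpy as np

rng = np.random.default_rng(4)
T = 20000
# One super-Gaussian (Laplacian) and one sub-Gaussian (uniform) source,
# both unit variance -- illustrative stand-ins for real signals.
s = np.vstack([rng.laplace(scale=1 / np.sqrt(2), size=T),
               rng.uniform(-np.sqrt(3), np.sqrt(3), T)])
A = rng.standard_normal((3, 2))            # non-square 3 x 2 mixing matrix
x = A @ s
x -= x.mean(axis=1, keepdims=True)

P = 2
W = 0.1 * rng.standard_normal((P, 3))
m2, m4 = np.ones(P), 3.0 * np.ones(P)      # moment trackers, eq. (24)
eta, rate = 0.005, 0.01
for epoch in range(5):
    for t in range(T):
        y = W @ x[:, t]
        m2 = (1 - rate) * m2 + rate * y**2           # eq. (24), p = 2
        m4 = (1 - rate) * m4 + rate * y**4           # eq. (24), p = 4
        K4 = np.diag(np.sign(m4 / m2**2 - 3.0))      # eqs. (23) and (25)
        # eq. (26): dW = eta (I - K4 tanh(y) y^T - y y^T) W
        W += eta * (np.eye(P)
                    - K4 @ np.outer(np.tanh(y), y)
                    - np.outer(y, y)) @ W

# Each output should now track one source (up to sign and scale).
C = np.corrcoef(np.vstack([W @ x, s]))[:P, P:]
```

No prior assumption about which sources are sub- or super-Gaussian is made anywhere in the loop; the signs in $\mathbf{K}_4$ are learned from the outputs themselves, which is the point of (26).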
5. Experimental Results
5.1 Data Visualisation
The dataset¹ used in this experiment arises from synthetic data modelling of a non-invasive monitoring system which is used to measure the relative quantity of oil within a multi-phase pipeline carrying oil, water and gas. The data consists of twelve dimensions, which correspond to the measurements from six dual powered gamma ray densitometers [4]. There are three particular flow regimes which may occur within the pipeline, namely laminar, annular and homogeneous.

¹ This data set is available from http://www.ncrg.aston.ac.uk/GTM/3PhaseData.html
The Generative Topographic Mapping (GTM) [4] has been applied successfully to the problem of visualising the latent structure within this data set, and it is used here as a means of comparison with the derived adaptation rule (21). The data is first made zero-mean and then sequentially presented to the network until the weights achieve a steady value. A fixed learning rate of 0.001 was used in this simulation.

The results using the nonlinear GTM mapping under the conditions reported in [4] are given in Figure 2a. It is clear that the three clusters corresponding to the different phases have been clearly identified and separated. In comparison with principal component analysis (PCA) (Figure 2c), the results from GTM provide considerably more distinct separation of the clusters corresponding to the three flow regimes. Figure 2b shows the results using the adaptation rule (21); again, it is clear that the points relating to the laminar, annular and homogeneous flow regimes have been distinctly clustered. However, it is of significance to note that there exist two clusters corresponding to the laminar flow.

As the proportions of each phase change within the laminar flow over time, there will be a change in the physical boundary between the phases, which will trigger a step change in the across-pipe beams. It is this physical effect which gives rise to the distinct clusters within the laminar flow. The identification of this additional clustered structure within the laminar flow requires the use of a linear hierarchical approach to data visualisation, as demonstrated in [5].
5.2 Blind Source Separation
This simulation focuses on image enhancement and is used here to demonstrate the algorithm's performance when applied to ICA. The problem consists of three original source images which are mixed by a non-square linear memoryless mixing matrix, such that the number of mixtures is greater than the number of sources.

The pixel distribution of each image is such that two of them are sub-Gaussian, with negative values of kurtosis, and the other is super-Gaussian, with a positive value of kurtosis. The values of kurtosis for the images are computed as +0.3068, -1.3753 and -0.2415. It is interesting to note that two of the images (Figure 3) have relatively small absolute values of kurtosis and as such are approximately Gaussian. This is a particularly difficult problem due to the non-square mixing and the presence of both sub- and super-Gaussian sources within the mixture. The difficulty is further compounded by the small absolute values of kurtosis of two of the sources.

The first problem that has to be addressed is identifying the number of sources. This can be done simply by computing the rank of the covariance matrix of the mixture. Historically, the next problem would be twofold, as the mixture consists of a number of sources which are sub-Gaussian and some which are super-Gaussian; this of course affects the choice of the nonlinearity required to successfully separate the sources. However, from (26), all that is required is to 'learn' the diagonal terms of the K4 matrix. Figure 3 shows the observations and the final separated sources. Each value is drawn randomly from the mixture, and (26) is used to update the network weights. The learning rate is kept at a fixed value of 0.0001. It should be stressed that it is not necessary to make any assumptions about the type of non-Gaussian sources present in the mixture, nor to choose another form of nonlinearity or change the simple form of the algorithm. Figure 3 shows the final separated images, indicating the good performance of the algorithm.
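The rank-based source-count step can be sketched as follows; the dimensions and source distribution are illustrative and the mixing here is noiseless, so the eigenvalue threshold is simply a numerical tolerance:

```python
import numpy as np

rng = np.random.default_rng(5)
T, M, N = 10000, 3, 5
s = rng.laplace(size=(M, T))        # three hypothetical sources
A = rng.standard_normal((N, M))     # 5 x 3 mixing: more mixtures than sources
x = A @ s

# With (near) noiseless mixing, only M eigenvalues of the mixture
# covariance are significantly non-zero, so its rank reveals M.
Rxx = (x @ x.T) / T
eigvals = np.sort(np.linalg.eigvalsh(Rxx))[::-1]
n_sources = int(np.sum(eigvals > 1e-8 * eigvals[0]))
```

With observation noise present, the trailing eigenvalues sit near the noise floor rather than at zero, and the threshold would have to be set relative to that floor instead.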
6. Conclusions
By considering an information theoretic index of projection based on negentropy, a generalised learning algorithm has been derived which may be applied to both unsupervised exploratory data analysis and independent component analysis with an arbitrary number of outputs. The powerful capability of this approach for unsupervised exploratory data analysis has been demonstrated using the oil pipeline data and compared with the probabilistic GTM. The technique has also been applied to other classical data sets, such as the Iris, Crab and Swiss Banknote data; in each case the intrinsic clustered nature of the data is revealed by the use of the proposed learning algorithm (26).

In terms of ICA, a particularly difficult image enhancement problem has been used to demonstrate the algorithm's performance for blind source separation. Current work, approached from the perspective of data analysis, addresses the means by which this technique can be extended to a hierarchical method of dichotomising and clustering data.
Acknowledgements
M. Girolami is supported by a grant from the Knowledge Laboratory Advanced Technology Centre, NCR Financial Systems Limited, Dundee, Scotland. We are indebted to Dr. J.F. Cardoso for helpful discussions regarding this work. Mark Girolami is grateful to Dr. Michael Tipping and Prof. Chris Bishop for providing the oil pipeline data and giving helpful insights regarding the physical interpretation of the data analysis. This work was completed whilst Mark Girolami was an invited visiting researcher at the Laboratory for Open Information Systems, Brain Science Institute, RIKEN, the Institute of Physical and Chemical Research, Wako-shi, Japan.
References
[1] Amari, S., Chen, T.P., and Cichocki, A., Stability Analysis of Learning Algorithms for Blind Source Separation, Neural Networks, Vol. 10, No. 8, pp. 1345-1351, 1997.
[2] Amari, S., Cichocki, A., and Yang, H., A New Learning Algorithm for Blind Signal Separation, Advances in Neural Information Processing Systems, Vol. 8, pp. 757-763, MIT Press, 1995.
[3] Bell, A. and Sejnowski, T., An Information Maximisation Approach to Blind Separation and Blind Deconvolution, Neural Computation, Vol. 7, pp. 1129-1159, 1995.
[4] Bishop, C., Svensen, M., and Williams, C., GTM: The Generative Topographic Mapping, Neural Computation, Vol. 10, No. 1, pp. 215-234, 1998.
[5] Bishop, C. and Tipping, M., A Hierarchical Latent Variable Model for Data Visualisation, Technical Report NCRG/96/028, Aston University, 1997.
[6] Cardoso, J.-F. and Laheld, B.H., Equivariant Adaptive Source Separation, IEEE Transactions on Signal Processing, Vol. 43, pp. 3017-3029, 1997.
[7] Cichocki, A., Unbehauen, R., and Rummert, E., Robust Learning Algorithm for Blind Separation of Signals, Electronics Letters, Vol. 30, No. 17, pp. 1386-1387, 1994.
[8] Cichocki, A. and Unbehauen, R., Robust Neural Networks with On-Line Learning for Blind Identification and Blind Separation of Sources, IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, Vol. 43, pp. 894-906, 1996.
[9] Cover, T. and Thomas, J.A., Elements of Information Theory, Wiley Series in Telecommunications, 1991.
[10] Douglas, S.C., Cichocki, A., and Amari, S., Multichannel Blind Separation and Deconvolution of Sources with Arbitrary Distributions, Proc. IEEE Workshop on Neural Networks for Signal Processing, pp. 436-444, 1997.
[11] Everitt, B.S., Cluster Analysis, Heinemann Educational Books, 1993.
[12] Everitt, B.S., An Introduction to Latent Variable Models, London: Chapman and Hall, 1984.
[13] Friedman, J.H., Exploratory Projection Pursuit, Journal of the American Statistical Association, Vol. 82, No. 397, pp. 249-266, 1987.
[14] Girolami, M., An Alternative Perspective on Adaptive Independent Component Analysis Algorithms, Neural Computation, Vol. 10, No. 8, pp. 2103-2114, 1998.
[15] Girolami, M. and Fyfe, C., Extraction of Independent Signal Sources using a Deflationary Exploratory Projection Pursuit Network with Lateral Inhibition, IEE Proceedings on Vision, Image and Signal Processing, Vol. 144, No. 5, pp. 299-306, 1997.
[16] Hyvarinen, A. and Oja, E., A Fixed-Point Algorithm for Independent Component Analysis, Neural Computation, Vol. 9, No. 7, pp. 1483-1492, 1997.
[17] Jones, M.C. and Sibson, R., What is Projection Pursuit?, Journal of the Royal Statistical Society, Vol. 150, No. 1, pp. 1-36, 1987.
[18] Karhunen, J., Oja, E., Wang, L., Vigario, R., and Joutsensalo, J., A Class of Neural Networks for Independent Component Analysis, IEEE Transactions on Neural Networks, Vol. 8, pp. 487-504, 1997.
[19] Pearson, K., Contributions to the Mathematical Study of Evolution, Phil. Trans. Roy. Soc. A, Vol. 185, p. 71, 1894.
Figure 1

Figure 2a (cluster labels: Laminar, Annular, Homogeneous)

Figure 2b (cluster labels: Laminar, Annular, Homogeneous)

Figure 2c (cluster labels: Laminar, Annular, Homogeneous)
Figure 3 (panel labels: Image No. 1: Kurtosis = -1.37; Image No. 2: Kurtosis = +0.31; Image No. 3: Kurtosis = -0.25; Original Source Images; Observed Mixed Images)
Figure Captions
Figure 1: Examples of the univariate Pearson mixture model for μ = 2 and σ² = 1 and various values of the parameter a.
Figure 2a: Plot of the posterior mean for each point in latent space using the GTM; clearly the three flow regimes responsible for generating the twelve dimensional measurements have been clustered successfully.
Figure 2b: Plot of the twelve dimensional data projected onto a two dimensional subspace which maximises the negentropy of the data within the subspace. The three flow regimes responsible for the measurements have been clearly defined and, in addition, the clustered nature of the laminar flow has been identified.
Figure 2c: Plot of the twelve dimensional data projected onto the two dimensional subspace whose basis is the first two principal components. The three flow regimes responsible for the measurements have not been clearly defined or separated into distinct clusters.
Figure 3: Independent Component Analysis performed on a 5 x 3 mixture of three images. One is sub-Gaussian with a kurtosis value of -1.37, another has a very small value of positive kurtosis, +0.307, with the third having a negative kurtosis of -0.25. The two images with the small absolute values of kurtosis could be considered as approximately mesokurtic.