A hierarchical unsupervised growing neural network for clustering gene expression patterns
Javier Herrero, Alfonso Valencia & Joaquin Dopazo
Seminar "Neural Networks in Bioinformatics" by Barbara Hammer
Presentation by Nicolas Neubauer
January 25th, 2003
Topics
• Introduction
  – Motivation
  – Requirements
• Parent Techniques and their Problems
• SOTA
• Conclusion
Motivation
• DNA arrays create huge masses of data.
• Clustering may provide a first orientation.
• Clustering: group vectors so that similar vectors are together.
• Vectorizing DNA array data (see the example matrix below):
  – Each gene is one point in input space.
  – Each condition (i.e., each DNA array) contributes one component of an input vector.
• In reality:
  – several thousands of genes,
  – several dozens of DNA arrays
[Figure: example expression data arranged as input vectors, e.g. (.3 .7 .2 .5), (.2 .6 .3 .5), (.1 .5 .4 .5), …]
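A concrete, made-up example of this vectorization as a genes × conditions matrix in Python/numpy (names and values are illustrative only, not from the paper):

```python
import numpy as np

# Hypothetical expression matrix: one row per gene (one point in input
# space), one column per condition, i.e., per DNA array.
expression = np.array([
    [0.3, 0.7, 0.2, 0.5],   # gene 1, measured on 4 arrays
    [0.2, 0.6, 0.3, 0.5],   # gene 2
    [0.1, 0.5, 0.4, 0.5],   # gene 3
])
print(expression.shape)     # (3, 4): 3 genes in a 4-dimensional space
```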
Requirements
• Clustering algorithm should…
  … be tolerant to noise
  … capture high-level (inter-cluster) relations
  … be able to scale topology based on
    • topology of input data
    • users' required level of detail
• Clustering is based on a similarity measure
  – The biological sense of the distance function must be validated
Topics
• Introduction
• Parent Techniques and their Problems
  – Hierarchical Clustering
  – SOM
• SOTA
• Conclusion
Hierarchical Clustering
• Vectors are arranged in a binary tree
• Similar vectors are close in the tree hierarchy
• One node for each vector
• Quadratic runtime
• Result may depend on order of data
Hierarchical Clustering (II)
Clustering algorithm should…
… be tolerant to noise
  no → data is directly used to define positions
… capture high-level (inter-cluster) relations
  yes → tree structure gives very clear relationships between clusters
… be able to scale topology based on
  – topology of input data
    yes → tree is built to fit the distribution of the input data
  – users' required level of detail
    no → the tree may be reduced by later analysis, but it has to be fully built first
Self-Organising Maps
• Vectors are assigned to clusters
• Clusters are defined by the neurons, which serve as "prototypes" for those clusters
• Many vectors per cluster
• Linear runtime
Self-Organising Maps (II)
Clustering algorithm should…
… be tolerant to noise
  yes → data is not aligned directly but in relation to prototypes, which are averages
… capture high-level (inter-cluster) relations
  ? → the paper says no, but what about the neighbourhood of neurons?
… be able to scale topology based on
  – topology of input data
    no → the number of clusters is set beforehand; data may be stretched to fit the SOM's topology:
    "if some particular type of profile is abundant, … this type of data will populate the vast majority of clusters"
  – users' required level of detail
    yes → the choice of the number of clusters influences the level of detail
Topics
• Introduction
• Parent Techniques and their Problems
• SOTA
  – Growing Cell Structures
  – Learning Algorithm
  – Distance Measures
  – Abortion Criteria
• Conclusion
SOTA Overview
• SOTA stands for self-organising tree algorithm
• SOTA combines the best of hierarchical clustering and SOMs:
  – Align clusters in a hierarchical structure
  – Use cluster prototypes trained in a SOM-like way
• New idea: Growing Cell Structures
  – Topology is built up incrementally as the data requires it
Growing Cell Structures
• Topology consists of
  – cells, the clusters, and
  – nodes, the connections between the cells
• Cells can become nodes and get two daughter cells
• Result: a binary tree (a data-structure sketch follows below)
• SOTA: cells have a codebook vector serving as prototype for their cluster
• Good splitting criteria: the topology adapts to the data
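A minimal data-structure sketch of this idea (my own illustration; the class name `Cell` and its method names are not from the paper):

```python
import numpy as np

class Cell:
    """A tree element: a leaf is a cluster with a codebook vector as
    prototype; after splitting it becomes an internal node."""
    def __init__(self, codebook, parent=None):
        self.codebook = np.asarray(codebook, dtype=float)
        self.parent = parent
        self.children = []   # empty -> leaf cell; non-empty -> node
        self.patterns = []   # indices of patterns assigned to this cell

    def is_cell(self):
        return not self.children

    def split(self):
        # Both daughter cells inherit the mother's codebook vector.
        self.children = [Cell(self.codebook.copy(), parent=self),
                         Cell(self.codebook.copy(), parent=self)]
        return self.children
```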
Learning Algorithm
Repeat cycle
    Repeat epoch
        For each pattern, adapt cells:
        • Find winning cell
        • Adapt cells
    Until updating_finished()
    Split cell containing most heterogeneity
Until all_finished()

Finding the winning cell:
• Compare pattern P_j to each cell C_i
• The cell for which d(P_j, C_i) is smallest wins (see the sketch below)
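A one-line sketch of the winner search, reusing the hypothetical `Cell` class above (`d` is whichever distance function is chosen later):

```python
def find_winner(pattern, leaves, d):
    """Return the leaf cell C_i whose codebook minimises d(P_j, C_i)."""
    return min(leaves, key=lambda cell: d(pattern, cell.codebook))
```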
Learning Algorithm (II)
Adapting cells:
C_i(t+1) = C_i(t) + n · (P_j − C_i(t))
• Move the cell in the direction of the pattern, with a learning factor n depending on proximity
• Only three cells (at most) are updated:
  – the winner cell,
  – the ancestor cell, and
  – the sister cell
• n_winner > n_ancestor > n_sister
• If the sister is no longer a cell but a node, only the winner is adapted (see the sketch below)
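A sketch of the adaptation step, continuing the `Cell` illustration above. The learning-factor values are made up; only their ordering n_winner > n_ancestor > n_sister is from the slide:

```python
def adapt(pattern, winner):
    """One adaptation step: C_i(t+1) = C_i(t) + n * (P_j - C_i(t))."""
    n_winner, n_ancestor, n_sister = 0.1, 0.05, 0.01  # illustrative values
    parent = winner.parent
    sister = None
    if parent is not None:
        sister = next(c for c in parent.children if c is not winner)
    if sister is not None and sister.is_cell():
        updates = [(winner, n_winner), (parent, n_ancestor), (sister, n_sister)]
    else:
        # Sister has already been split into a node: only the winner moves.
        updates = [(winner, n_winner)]
    for cell, n in updates:
        cell.codebook += n * (pattern - cell.codebook)
```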
Learning Algorithm (III)
When is updating finished?
• Each pattern "belongs" to its winner cell from this epoch
• So each cell has a set of patterns assigned to it
• The resource R_i is the average distance between the cell's codebook and its assigned patterns
• The sum of all R_i is the error ε_t of epoch t
• Stop repeating epochs if (ε_t − ε_{t−1}) / ε_{t−1} < E (see the sketch below)
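A sketch of the resource and the convergence check, continuing the illustration above (numpy imported as np; taking the absolute value of the relative change and the threshold value E are my own defensive choices):

```python
import numpy as np

def resource(cell, patterns, d):
    """R_i: average distance between a cell's codebook and its patterns."""
    if not cell.patterns:
        return 0.0
    return float(np.mean([d(patterns[j], cell.codebook)
                          for j in cell.patterns]))

def updating_finished(err_t, err_prev, E=1e-6):
    """Slide criterion: (eps_t - eps_{t-1}) / eps_{t-1} < E."""
    return abs(err_t - err_prev) / err_prev < E
```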
Learning Algorithm (IV)
Splitting a cell:
• A new cluster is created.
• Most efficient: split the cell whose patterns are most heterogeneous
• Measures of heterogeneity:
  – Resource: mean distance between patterns and cell
  – Variability: maximum distance between patterns
• The cell turns into a node
• The two daughter cells inherit the mother's codebook (see the sketch below)
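A sketch of the split step under the resource criterion (my own combination of the pieces above; the variability variant would rank cells by the maximum pairwise pattern distance instead):

```python
def split_most_heterogeneous(leaves, patterns, d):
    """Split the leaf cell with the highest resource."""
    victim = max(leaves, key=lambda c: resource(c, patterns, d))
    leaves.remove(victim)
    leaves.extend(victim.split())  # both daughters inherit the codebook
    return leaves
```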
Learning Algorithm (V)
The big question: when to stop iterating cycles? (sketched below)
• When each pattern has its own cell
• When the maximum number of nodes is reached
• When the maximum resource or variability value drops below a certain level
  – See later for sophisticated calculations of such a threshold
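A sketch combining the slide's three possible stopping conditions into one check (the parameter names and the combination into a single function are my own):

```python
def all_finished(leaves, patterns, d, max_nodes=None, threshold=None):
    """Cycle-stopping conditions for the outer SOTA loop."""
    if all(len(c.patterns) <= 1 for c in leaves):
        return True                     # each pattern has its own cell
    if max_nodes is not None and len(leaves) >= max_nodes:
        return True                     # node budget exhausted
    if threshold is not None and \
       max(resource(c, patterns, d) for c in leaves) < threshold:
        return True                     # heterogeneity below threshold
    return False
```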
Distance Measures
• What does d(x,y) really look like?
• The distance function has to capture biological similarity!
• Euclidean distance:
  d(x,y) = √(Σ_i (x_i − y_i)²)
• Pearson correlation:
  d(x,y) = 1 − r, where
  r = Σ_i (e_{xi} − ê_x)(e_{yi} − ê_y) / (S_{ex} · S_{ey}), with −1 ≤ r ≤ 1
[Figure: two example expression vectors being compared componentwise]
• Empirical evidence suggests that Pearson correlation better captures similarity for DNA array data (both distances are sketched below).
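Both distances in a minimal numpy sketch (using numpy's built-in `corrcoef` for the Pearson coefficient rather than the explicit formula above):

```python
import numpy as np

def euclidean(x, y):
    """d(x, y) = sqrt(sum_i (x_i - y_i)^2)"""
    return float(np.sqrt(np.sum((x - y) ** 2)))

def pearson_distance(x, y):
    """d(x, y) = 1 - r, with r the Pearson correlation coefficient."""
    r = np.corrcoef(x, y)[0, 1]
    return 1.0 - r
```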
SOTA evaluation
Clustering algorithm should…
… be tolerant to noise
  yes → data is averaged via codebooks, just as in a SOM
… capture high-level (inter-cluster) relations
  yes → hierarchical structure, as in hierarchical clustering
… be able to scale topology based on
  – topology of input data
    yes → the tree is extended to match the distribution of variance in the input data
  – users' required level of detail
    yes → the tree can be grown to the desired level of detail; criteria may also be set to meet certain confidence levels…
Abortion Criteria
• What we are looking for: "an upper level of distance at which two genes can be considered to be similar at their profile expression levels"
• The distribution of distances also reflects non-biological characteristics of the data
  – Many points with few components cause a lot of high correlations
Abortion Criteria (II)
• Idea:
  – If we knew the part of the distance distribution that is due to randomness,
  – a confidence level could be given,
  – meaning that a given distance between two unrelated genes is not more probable than a chosen significance level α.
• Problem:
  – We cannot know the random distribution:
  – we only know the real distribution, which is partially due to random properties and partially due to "real" correlations.
• Solution:
  – Approximation by shuffling
Abortion Criteria (III)
• Shuffling (sketched below):
  – For each pattern, the components are randomly shuffled
  – Correlation is destroyed
  – The number of points, the ranges of values, and the frequencies of values are conserved
• Claim:
  – The distance distribution in this shuffled data approximates the random distance distribution in the real data
• Conclusion:
  – If p(corr. > a) ≤ 5% in the random data,
  – finding corr. > a in the real data is meaningful with 95% confidence
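A sketch of the shuffling-based threshold estimation (the function name and the quantile-based implementation are my own; the paper's exact procedure may differ):

```python
import numpy as np

def random_correlation_threshold(expression, confidence=0.95, rng=None):
    """Shuffle each pattern's components to destroy correlation while
    conserving the number of points and the value frequencies, then
    return the correlation exceeded by only (1 - confidence) of the
    shuffled gene pairs."""
    rng = np.random.default_rng() if rng is None else rng
    shuffled = expression.copy()
    for row in shuffled:          # shuffle components within each pattern
        rng.shuffle(row)
    corr = np.corrcoef(shuffled)            # pairwise gene-gene correlations
    iu = np.triu_indices_from(corr, k=1)    # each pair once, no self-pairs
    return float(np.quantile(corr[iu], confidence))
```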
Abortion Criteria (IV)
• Only 5% of the random data pairs have a correlation > .178
• Choosing .178 as the threshold, there is 95% confidence that genes in the same cluster are together for biological rather than merely statistical reasons
Topics
• Introduction
• Parent Techniques and their Problems
• SOTA
• Conclusion
  – Additional nice properties
  – Summary of differences compared to parent techniques
Additional properties
• As patterns do not have to be compared to each other, runtime is approximately linear in the number of patterns
  – Like a SOM
  – Hierarchical clustering uses a distance matrix relating each pattern to every other pattern
• The cells' vectors approach very closely the averages of their assigned data points
Summary of SOTA
• Compared to SOMs:
  – SOTA builds up a topology that reflects higher-order relations
  – The level of detail can be defined very flexibly
• Nicer topological properties
• Adding new data to an existing tree would be problematic(?)
• Compared to standard hierarchical clustering:
  – SOTA is more robust to noise,
  – has better runtime properties, and
  – has a more flexible concept of a cluster