Joaquín Dopazo. CNIO.
SOM and SOTA:
Clustering methods in
the analysis
of massive biological data
From genotype
to phenotype.
(only the genetic component)
>protein kinase
acctgttgatggcgacagggactgtatgctgatct
atgctgatgcatgcatgctgactactgatgtgggg
gctattgacttgatgtctatc....
…code for the
structure of
proteins...
…which accounts for
the function...
…providing they are expressed in
the proper moment and place...
…in
cooperation
with other
proteins…
…conforming complex
interaction networks...
Genes in the
DNA...
…whose final
effect can be
different because
of the variability.
Between 30,000 and
100,000 genes.
40-60% display
alternative splicing
Each protein has an average
of 8 interactions
A typical tissue
expresses between
5000 and 10000
genes
That undergo
post-translational
modifications
More than 3 million
SNPs have been
mapped
>protein kinase
acctgttgatggcgacagggactgtatgctga
tctatgctgatgcatgcatgctgactactgatg
tgggggctattgacttgatgtctatc....
Pre-genomics scenario in the lab
Sequence
Molecular
databases
Search results
Phylogenetic
tree
alignment
Conserved
region
Motif
Motif
databases
Information
Secondary and tertiary
protein structure
Bioinformatics tools for pre-genomic
sequence data analysis
The aim:
extracting as much
information as
possible from a
single piece of data
Genome
sequencing
Two-hybrid systems,
mass spectrometry
for protein complexes
Post-genomic vision
Expression
Arrays
Literature,
databases
Who?
Where, when and how much?
What do
we know?
In what way?
SNPs
And who else?
genes
interactions
Post-genomic vision
Gene
expression
Information
polymorphisms
Information
Databases
The new tools:
Clustering
Feature selection
Multiple correlation
Data mining
Brain and computers
Brain computes in a different way from digital computers:
Structural components: brain: neurons (Ramón y Cajal, 1911) / computers: chips
Speed: brain: slow (10^-3 s) / computers: fast (10^-9 s)
Processing units: brain: 10 billion neurons, massively interconnected
(60 trillion synapses) / computers: one or few
The brain is a highly complex, nonlinear, and parallel computer.
Neurons are organized to perform complex computations many times faster
than the fastest computers.
Neural Networks
What is a neural network?
A neural network is a massively parallel distributed
processor able to store experiential knowledge and to
make it available for use.
It resembles the brain in two respects:
Knowledge is acquired by the network through a
learning process.
Interneuron connection strengths (synaptic
weights) are used to store the knowledge.
Neural Net classifiers
Supervised: Perceptrons
Unsupervised: Kohonen SOM, Growing cell structures, SOTA
Supervised learning: the perceptron
[Figure: the perceptron model. Input signals x_1, x_2, ..., x_p are
multiplied by synaptic weights w_1, w_2, ..., w_p and combined in a
summing junction, u_k = sum_i w_i * x_i; an activation function phi(.)
with threshold theta_k produces the output.]
Supervised learning: training
Training set: 11111110000000 -> up; 00000001111111 -> down
Summing junction: u = x1*w1 + x2*w2, with trained weights w1 = 1, w2 = 0
Activation function: phi(u) = 1 if u >= 1; 0 if u < 1
Classes are coded as up = 1, down = 0.
Supervised learning: application
Input (x1, x2) = (1, 0): u = 1*1 + 0*0 = 1
phi(u) = 1 if u >= 1; 0 if u < 1, so phi(1) = 1 -> up
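The two slides above can be condensed into a short sketch (pure Python; the weights, threshold, and class coding are taken from the training example):

```python
# Perceptron from the training example above: weights w1 = 1, w2 = 0,
# step activation phi(u) = 1 if u >= 1 else 0, classes coded up = 1, down = 0.

def perceptron(x1, x2, w1=1.0, w2=0.0):
    u = x1 * w1 + x2 * w2        # summing junction
    return 1 if u >= 1 else 0    # activation function phi(u)

# Application example: input (1, 0) gives u = 1*1 + 0*0 = 1 -> phi(u) = 1 -> "up"
label = "up" if perceptron(1, 0) == 1 else "down"
```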
Supervised vs. unsupervised learning
Supervised:
The structure of the data is known beforehand. After a training process in
which the network learns how to distinguish among classes, you use the
network for assigning new items to the predefined classes.
Unsupervised:
The structure of the data is not known beforehand.
The network learns how data are distributed among classes, based on a
distance function.
Sensory pathways in the brain are organised in such a
way that their arrangement reflects some physical
characteristic of the external stimulus being sensed.
The brains of higher animals seem to contain many kinds of
"maps" in the cortex:
• In visual areas there are orientation and color maps.
• In the auditory cortex there exist the so-called tonotopic maps.
• The somatotopic maps represent the skin surface.
The basis
Unsupervised learning:
Kohonen self-organizing maps
Kohonen SOM
The causes of self-organisation
The Kohonen SOM mimics two-dimensional arrangements of
neurons in the brain. Effects leading to spatially organized
maps are:
• Spatial concentration of the network activity on the
neuron best tuned to the present input.
• Further sensitization of the best matching neuron and its
topological neighborhood.
Kohonen SOM
The topology
Two-dimensional network of cells with a hexagonal or
rectangular (or other) arrangement.
Inputs x_1, x_2, ..., x_n feed the output nodes on the grid.
The neighborhood of a cell is defined as a time-dependent function.
Kohonen SOM
The algorithm
Step 1. Initialize nodes to random values. Set the initial radius of the
neighborhood.
Step 2. Present new input: compute distances to all nodes.
Euclidean distances are commonly used.
Step 3. Select the output node j* with minimum distance d_j.
Update node j* and its neighbors. Nodes in the neighborhood NE_j*(t) are
updated as:
w_ij(t+1) = w_ij(t) + alpha(t) * (x_i(t) - w_ij(t)), for j in NE_j*(t)
where alpha(t) is a gain term that decreases in time.
Step 4. Repeat by going to Step 2 until convergence.
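A minimal sketch of one training pass of the SOM algorithm (pure Python; the grid size, gain schedule, and square neighborhood are illustrative assumptions, not prescriptions from the slides):

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def som_step(grid, x, t, alpha0=0.5, radius=1):
    """One presentation of input x to a SOM stored as {(row, col): weights}."""
    # Steps 2-3: find the best-matching node j* (minimum Euclidean distance)
    best = min(grid, key=lambda j: euclidean(grid[j], x))
    alpha = alpha0 / (1 + t)  # gain term alpha(t), decreasing in time
    for j, w in grid.items():
        # update j* and its neighborhood NE_j*(t):
        # w_ij(t+1) = w_ij(t) + alpha(t) * (x_i(t) - w_ij(t))
        if max(abs(j[0] - best[0]), abs(j[1] - best[1])) <= radius:
            grid[j] = [wi + alpha * (xi - wi) for wi, xi in zip(w, x)]
    return best

# Step 1: initialize a 2x2 grid of nodes to random values
random.seed(0)
grid = {(r, c): [random.random(), random.random()]
        for r in range(2) for c in range(2)}
# Step 4: repeat presenting inputs (here two clusters at opposite corners)
for t, x in enumerate([[0.0, 0.0], [1.0, 1.0]] * 5):
    som_step(grid, x, t)
```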
Kohonen SOM
Limitations
Arbitrary number of clusters
The number of clusters is arbitrarily fixed from the beginning.
Some clusters can remain unoccupied.
Lack of a tree structure
The use of a two-dimensional structure for the net makes it
impossible to recover a tree structure relating the clusters
and subclusters among them.
Non-proportional clustering
Clusters are made based on the number of items, so distances
among them are not proportional.
Growing cell structures
Growing cell structures produce a distribution-preserving
mapping: the number of clusters and the connections
among them are dynamically assigned
during the training of the network.
The Kohonen SOM produces a topology-preserving
mapping: the topology of the network and the
number of clusters are fixed prior to
the training of the network.
Insertion and deletion of neurons
• After a fixed number of adaptations, every neuron q with a signal
counter value h_q > h_c (a threshold) is used to create a new neuron.
The direct neighbor f of neuron q having the greatest signal counter
value is used to insert a new neuron between them.
The new neuron is connected so as to preserve the topology of the network.
• Signal counter values are adjusted in the neighborhood.
Similarly, neurons with signal counter values below a threshold can be
removed.
Growing cell structures
Network dynamics
Similar to that used by the Kohonen SOM, but with
several important differences:
Adaptation strength is constant over time (epsilon_b and epsilon_n for
the best-matching cell and its neighborhood).
Only the best-matching cell and its neighborhood are adapted.
Adaptation implies incrementing the signal counter of the best-matching
cell and decrementing it in the remaining cells.
New cells can be inserted and existing cells can be removed in
order to adapt the output map to the distribution of the input
vectors.
Growing cell structures
Limitations
Arbitrary number of clusters
The number of clusters is arbitrarily fixed from the beginning.
Some clusters can remain unoccupied.
Lack of a tree structure
The use of a two-dimensional structure for the net makes it
impossible to recover a tree structure relating the clusters
and subclusters among them.
Non-proportional clustering
Clusters are made based on the number of items, so distances
among them are not proportional.
Many molecular data have
different levels of structured
information,
e.g. phylogenies, molecular population
data, DNA expression data (to some
extent), etc.
But sometimes, behind the real
world there is some hierarchy...
Simulation
Mapping a hierarchical structure using a
non-hierarchical method (SOM)
[Figure: a simulated hierarchy of 20 items in groups A, B, C, D; the SOM
output yields the clusters A,B / C,D / E,F / G / H.]
Self Organising Tree Algorithm
(SOTA)
A new neural network designed to deal with data that are
related among them by means of a binary tree topology.
Dopazo & Carazo, 1997, J. Mol. Evol. 44:226-233
Derived from the Kohonen SOM and the growing
cell structures, but with several key differences:
The topology of the network is a binary tree.
Only growing of the network is allowed.
The growing mimics a speciation event, producing two new
neurons from the most heterogeneous neuron.
Only terminal neurons are directly adapted by the input data;
internal neurons are adapted through terminal neurons.
SOTA:
The algorithm
The Self Organising Tree Algorithm (SOTA) is a
hierarchical divisive method based on a neural
network.
SOTA, unlike other hierarchical methods, grows
from top to bottom until an appropriate level of
variability is reached.
Step 1. Initialize nodes to random values.
Step 2. Present new input: compute distances to all terminal nodes.
Step 3. Select the output node j* with minimum distance d_j.
Update node j* and its neighbors. Nodes in the neighborhood NE_j*(t) are
updated as:
w_ij(t+1) = w_ij(t) + alpha(t) * (x_i(t) - w_ij(t)), for j in NE_j*(t)
where alpha(t) is a gain term that decreases in time.
Step 4. Repeat by going to Step 2 until convergence.
Step 5. Reproduce the node with the highest variability.
Dopazo, Carazo (1997)
Herrero, Valencia, Dopazo (2001)
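A toy illustration of the divisive "reproduction" idea in Step 5 (pure Python). This is a sketch under assumptions, not the published SOTA implementation: the variability measure is the mean distance to the node's average profile, and the split heuristic (seeding the two children with the two most extreme profiles) is made up for the example.

```python
import statistics

def variability(profiles):
    """Mean distance of the profiles to their average profile."""
    if len(profiles) < 2:
        return 0.0
    mean = [statistics.mean(col) for col in zip(*profiles)]
    return statistics.mean(
        sum((p - m) ** 2 for p, m in zip(prof, mean)) ** 0.5
        for prof in profiles)

def reproduce_most_variable(clusters):
    """'Speciation': replace the most heterogeneous cluster by two children."""
    worst = max(clusters, key=variability)
    clusters = [c for c in clusters if c is not worst]
    ordered = sorted(worst, key=sum)        # crude ordering heuristic
    a, b = [ordered[0]], [ordered[-1]]      # seed children with the extremes
    for prof in ordered[1:-1]:
        # each remaining profile joins the child it disturbs the least
        (a if variability(a + [prof]) <= variability(b + [prof]) else b).append(prof)
    return clusters + [a, b]

data = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1]]
clusters = reproduce_most_variable([data])  # one split -> two clusters
```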
SOTA algorithm (neighborhood)
[Figure: the initial state, the update step, and the growing network with
its different neighborhoods.]
SOTA algorithm
[Flowchart: initialise system; run a cycle of epochs until cycle
convergence; if the network has not converged, add a cell (the winning
neuron becomes a mother with two daughters, winner and sister) and repeat;
end on network convergence.]
Cycle: repeat as many epochs as necessary to get convergence in the
present state of the network.
Convergence: the relative error of the network falls below a threshold.
When a cycle finishes, the network size increases: two new neurons are
attached to the neuron with the highest resources. This neuron becomes a
mother neuron and does not receive direct inputs any more.
Applications
Sequence analysis
Microarray data analysis
Population data analysis
• Massive data
• Information redundancy
Sequence analysis in the genomics era
Codification
Ambiguities: R = {A or G}; N = {A or G or C or T}
Vectors of N x 4 (nucleotides) or N x 20
(amino acids), plus one component to
represent deletions.
Other possible codifications: frequencies of dipeptides or
dinucleotides.
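The N x 4 codification above can be sketched as follows (each position becomes four components for A, C, G, T plus one for deletions; spreading an ambiguity code's weight uniformly over its possible bases is an assumption of this sketch):

```python
# N x 4 nucleotide codification plus a deletion component; IUPAC ambiguity
# codes share their weight among the possible bases (uniform weighting is
# an assumption made for this illustration).

BASES = "ACGT"
AMBIGUITY = {"A": "A", "C": "C", "G": "G", "T": "T",
             "R": "AG",          # R = {A or G}
             "N": "ACGT"}        # N = {A or G or C or T}

def encode(seq):
    vec = []
    for ch in seq.upper():
        comp = [0.0] * 5
        if ch == "-":
            comp[4] = 1.0                    # deletion component
        else:
            options = AMBIGUITY[ch]
            for b in options:
                comp[BASES.index(b)] = 1.0 / len(options)
        vec.extend(comp)
    return vec

v = encode("AR-")
# 'A' -> [1, 0, 0, 0, 0]; 'R' -> [0.5, 0, 0.5, 0, 0]; '-' -> [0, 0, 0, 0, 1]
```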
Updating the neurons
Classifying proteins with SOM
Ferrán, Pflugfelder and Ferrara (1994) Self-organized neural maps
of human protein sequences. Prot. Sci. 3:507-521.
Cy5
Cy3
cDNA arrays
Oligonucleotide arrays
Gene expression analysis using DNA microarrays
Research paradigm is shifting
Hypothesis driven: one PhD per gene
Ignorance driven: parallelized automated approach
Kb, Mb, Gb, Tb-Pb of sequences
DNA arrays
Expression patterns
Patterns can be:
• time series
• dosage series
• different patients
• different tissues
• etc.
Different DNA-arrays
The data
Characteristics of the data:
• Low signal-to-noise ratio
• High redundancy and intra-gene
correlations
• Most of the genes are not
informative with respect to the trait
we are studying (they account for unrelated
physiological conditions, etc.)
• Many genes have no annotation!!
Genes (thousands) x experimental conditions
(from tens up to no more than a few hundred)
A: expression profile of a gene across the
experimental conditions.
B: expression profile of all the genes for an
experimental condition (array).
C: different classes of experimental conditions, e.g.
cancer types, tissues, drug treatments, survival
time, etc.
Co-expressing genes...
What do they have in common?
Different phenotypes...
Which genes are responsible?
Genes interacting in a network (A, B, C, ...)...
What does the network look like?
Genes of a class...
What profile(s) do they display? And
are there more genes?
Molecular classification
of samples
Study of many conditions.
Types of problems
Can we find groups of
experiments with
similar gene expression
profiles?
Unsupervised
Supervised
Reverse engineering
What are we measuring?
red/green fluorescence: A (background), B (expression)
Differential expression: B/A
Ratio            log10
100/1 = 100       2
10/1  = 10        1
1/1   = 1         0
1/10  = 0.1      -1
1/100 = 0.01     -2
Problem: the ratio is asymmetrical.
Solution: log-transformation.
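The table above in code: the raw ratio is asymmetrical (a 100-fold increase is 100, a 100-fold decrease is 0.01), while the log10-transformed values are symmetric around 0:

```python
import math

# Log-transformation of expression ratios: symmetric up/down regulation.
ratios = [100 / 1, 10 / 1, 1 / 1, 1 / 10, 1 / 100]
logged = [round(math.log10(r), 2) for r in ratios]
# -> [2.0, 1.0, 0.0, -1.0, -2.0]
```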
Distance
[Figure: three expression profiles A, B, C. By raw differences (Euclidean
distance) B and C are the closest pair; by correlation A and B are.]
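The contrast can be illustrated in a few lines (pure Python; the profiles A, B, C here are made up for the example, not taken from the talk): Euclidean distance groups profiles of similar magnitude, while Pearson correlation groups profiles of similar shape.

```python
import math
from statistics import mean, pstdev

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (pstdev(x) * pstdev(y))

A = [1, 2, 3, 4]   # same shape as B, half the magnitude
B = [2, 4, 6, 8]
C = [3, 3, 7, 7]   # close to B in magnitude, different shape

# B and C are the closest pair by distance; A and B by correlation.
```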
Clustering methods
                  deterministic    neural network (NN)
Non-hierarchical  K-means, PCA     SOM
Hierarchical      UPGMA            SOTA
Properties: hierarchical methods provide different levels of information;
neural networks are robust.
Aggregative hierarchical
clustering
CLUSTER
Relationships among profiles are
represented by branch lengths.
Recursively links the closest pair of
profiles until the complete hierarchy
is reconstructed.
Allows exploring the relationships
among groups of related genes at
higher levels.
Problems
• lack of robustness
• the solution may not be unique
• dependent on the data order
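The aggregative procedure above can be sketched in a few lines (pure Python): recursively link the closest pair of clusters until one hierarchy remains. Average linkage (UPGMA-style) is an assumption of this sketch; leaves in the resulting nested tuple are profile indices.

```python
# Toy agglomerative clustering: merge the closest pair until one tree remains.

def dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def avg_link(c1, c2):
    """Average-linkage distance between two clusters (lists of profiles)."""
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def agglomerate(profiles):
    clusters = [(i, [p]) for i, p in enumerate(profiles)]  # (tree, members)
    while len(clusters) > 1:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: avg_link(clusters[ij[0]][1],
                                           clusters[ij[1]][1]))
        (t1, m1), (t2, m2) = clusters[i], clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(((t1, t2), m1 + m2))
    return clusters[0][0]

tree = agglomerate([[0, 0], [0, 1], [10, 10]])  # -> (2, (0, 1))
```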
Aggregative hierarchical clustering
What level would you consider for defining a cluster?
Subjective cluster definition.
Properties of neural networks for
molecular data classification
• Robust
• Manage real-world data sets containing noisy,
ill-defined items with irrelevant variables and
outliers
• Statistical distributions do not need to be
parametric
• Fast and scalable to big data sets
Kohonen SOM
Applied to microarray data
[Tables: input matrix of samples x time points (t_1, t_2, ..., t_p), with
sample_i = (a_i1, a_i2, ..., a_ip); the trained map assigns samples to
output groups (Group 11: sample_a, sample_b, ...; Group 12: ...; Group 13:
...; Group 14: ...), each output node carrying its own reference profile
(node 44: x_1, ..., x_p; node 34: y_1, ..., y_p; node 24: z_1, ..., z_p).]
Kohonen SOM
microarray patterns
[Table: input matrix of samples x genes (gen_1, gen_2, ..., gen_p), with
sample_i = (a_i1, a_i2, ..., a_ip).]
Kohonen SOM
Example
Response of human fibroblasts to serum.
Iyer et al., 1999, Science 283:83-87
The Self Organising Tree
Algorithm (SOTA)
The Self Organising Tree Algorithm
(SOTA) is a divisive hierarchical
method based on a neural network.
SOTA, unlike other clustering
methods, grows from top to bottom:
growing can be stopped at the
desired level of variability.
SOTA nodes are weighted averages of
every item under the node.
SOTA
Advantages of SOTA
Clusters' patterns
Each node of the tree has an associated pattern
which corresponds to the cluster under it.
Divisive algorithm
SOTA grows from top to bottom: growing can be
stopped at any desired level of variability.
Distribution preserving
The number of clusters depends on the
variability of the data.
Robustness against noise
TEST
From low resolution...
...to high resolution.
Where to stop growing?
Permutation test for cluster
size definition
[Tables: the original genes x experiments data matrix, gen_i = (a_i1,
a_i2, ..., a_ip), and a permuted version of it in which each gene's values
are shuffled across the experiments.]
The 95% quantile of the distances between permuted profiles defines the
cutoff. TEST: are d_ij > 0.4?
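A sketch of the permutation idea above (pure Python; the exact procedure and parameters here are assumptions for illustration): shuffling each gene's values across experiments destroys real correlations, so distances between permuted profiles estimate what "unrelated" looks like, and the 95% quantile of that null distribution gives a distance cutoff.

```python
import random

def dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def null_threshold(profiles, n_perm=500, quantile=0.95, seed=1):
    """Distance cutoff from a null distribution of permuted profiles."""
    rng = random.Random(seed)
    null = []
    for _ in range(n_perm):
        # shuffle each profile's values across the experiments
        shuffled = [rng.sample(p, len(p)) for p in profiles]
        a, b = rng.sample(shuffled, 2)   # distance between two permuted rows
        null.append(dist(a, b))
    null.sort()
    return null[int(quantile * (len(null) - 1))]

profiles = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1], [1, 3, 2, 4]]
cutoff = null_threshold(profiles)
# clusters whose internal distances stay below `cutoff` are unlikely
# to arise by chance
```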
SOTA/SOM vs classical clustering (UPGMA)
SOTA vs SOM
Accuracy: the silhouette
Is the object closer to its own cluster or to
the nearest other cluster?
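The silhouette width behind this comparison can be computed directly (pure Python; the example clusters are made up): for an object, a is the mean distance to the other members of its own cluster, b the mean distance to the nearest other cluster, and s = (b - a) / max(a, b); values near 1 mean the object sits firmly inside its cluster.

```python
from statistics import mean

def dist(x, y):
    return sum((p - q) ** 2 for p, q in zip(x, y)) ** 0.5

def silhouette(obj, own, other_clusters):
    """Silhouette width s = (b - a) / max(a, b) for one object."""
    a = mean(dist(obj, o) for o in own if o != obj)
    b = min(mean(dist(obj, o) for o in grp) for grp in other_clusters)
    return (b - a) / max(a, b)

own = [[0, 0], [0, 1], [1, 0]]
far = [[10, 10], [10, 11]]
s = silhouette([0, 0], own, [far])   # well-separated clusters -> s near 1
```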