SOM and SOTA: Clustering methods in the analysis of massive biological data


Joaquín Dopazo. CNIO.


From genotype to phenotype
(only the genetic component)

>protein kinase
acctgttgatggcgacagggactgtatgctgatct
atgctgatgcatgcatgctgactactgatgtgggg
gctattgacttgatgtctatc....

Genes in the DNA...

...code for the structure of proteins...

...which accounts for their function...

...provided they are expressed at the proper moment and place...

...in cooperation with other proteins...

...forming complex interaction networks...

...whose final effect can differ because of variability.

Between 30,000 and 100,000 genes.

40-60% display alternative splicing.

Each protein has an average of 8 interactions.

A typical tissue expresses between 5,000 and 10,000 genes...

...whose products undergo post-translational modifications.

More than 3 million SNPs have been mapped.

>protein kinase
acctgttgatggcgacagggactgtatgctga
tctatgctgatgcatgcatgctgactactgatg
tgggggctattgacttgatgtctatc....

Pre-genomics scenario in the lab

[Diagram: a sequence is searched against molecular databases; the search results feed an alignment, from which conserved regions and motifs are identified (with the help of motif databases), a phylogenetic tree is built, and secondary and tertiary protein structure is predicted, all yielding information.]

Bioinformatics tools for pre-genomic sequence data analysis

The aim: extracting as much information as possible from one single piece of data.

Post-genomic vision

[Diagram: Genome sequencing answers "Who?" (genes). Expression arrays answer "Where, when and how much?" (gene expression). Two-hybrid systems and mass spectrometry for protein complexes answer "And who else?" (interactions). SNPs answer "In what way?" (polymorphisms). Literature and databases answer "What do we know?" (information).]

The new tools:

Clustering
Feature selection
Multiple correlation
Data mining

Brain and computers

The brain computes in a different way from digital computers:

Structural components: neurons (Ramón y Cajal, 1911) vs. chips
Speed: slow (10^-3 s) vs. fast (10^-9 s)
Processing units: 10 billion massively interconnected neurons (60 trillion synapses) vs. one or a few

The brain is a highly complex, nonlinear, and parallel computer. Its neurons are organized to perform complex computations many times faster than the fastest computers.

Neural networks

What is a neural network?

A neural network is a massively parallel distributed processor able to store experiential knowledge and to make it available for use.

It resembles the brain in two respects:

Knowledge is acquired by the network through a learning process.

Interneuron connection strengths (synaptic weights) are used to store the knowledge.

Neural net classifiers

Supervised: perceptrons
Unsupervised: Kohonen SOM, growing cell structures, SOTA

Supervised learning: the perceptron

[Diagram: input signals x1, x2, ..., xp enter with synaptic weights w1, w2, ..., wp; a summing junction computes u_k = Σ_i w_i·x_i, and an activation function φ(·), with threshold θ_k, produces the output.]

Supervised learning: training

Training set:
up:   11111110000000
down: 00000001111111

u = x1·w1 + x2·w2, with w1 = 1, w2 = 0; targets: up = 1, down = 0

φ(u) = 1 if u ≥ 1; 0 if u < 1

Supervised learning: application

Input x = (1, 0):

u = 1·1 + 0·0 = 1

φ(u) = 1 if u ≥ 1; 0 if u < 1

φ(1) = 1 → up
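The training and application slides above amount to a few lines of code. A minimal Python sketch (illustrative only; the function names are ours, not from the slides):

```python
# Perceptron from the slides: u = x1*w1 + x2*w2,
# phi(u) = 1 if u >= 1 else 0, with trained weights w1 = 1, w2 = 0.
def phi(u, threshold=1.0):
    # Step activation: fire (1 = "up") once the weighted sum reaches the threshold
    return 1 if u >= threshold else 0

def perceptron(x, w):
    u = sum(xi * wi for xi, wi in zip(x, w))  # summing junction
    return phi(u)

w = [1.0, 0.0]                 # weights learned during training
print(perceptron([1, 0], w))   # u = 1*1 + 0*0 = 1 -> phi(1) = 1 -> "up"
print(perceptron([0, 1], w))   # u = 0             -> phi(0) = 0 -> "down"
```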

Supervised vs. unsupervised learning

Supervised:
The structure of the data is known beforehand. After a training process in which the network learns how to distinguish among classes, the network is used to assign new items to the predefined classes.

Unsupervised:
The structure of the data is not known beforehand. The network learns how data are distributed among classes, based on a distance function.

The basis

Sensory pathways in the brain are organised in such a way that their arrangement reflects some physical characteristic of the external stimulus being sensed.

The brain of higher animals seems to contain many kinds of "maps" in the cortex:

In visual areas there are orientation and color maps.

In the auditory cortex there exist the so-called tonotopic maps.

The somatotopic maps represent the skin surface.

Unsupervised learning: Kohonen self-organizing maps

Kohonen SOM

The causes of self-organisation

The Kohonen SOM mimics two-dimensional arrangements of neurons in the brain. Effects leading to spatially organized maps are:

Spatial concentration of the network activity on the neuron best tuned to the present input.

Further sensitization of the best-matching neuron and its topological neighborhood.

Kohonen SOM

The topology

A two-dimensional network of cells with a hexagonal or rectangular (or other) arrangement. The inputs x1, x2, ..., xn feed the output nodes.

The neighborhood of a cell is defined as a time-dependent function.


Kohonen SOM

The algorithm

Step 1. Initialize nodes to random values. Set the initial radius of the neighborhood.

Step 2. Present new input: compute distances to all nodes. Euclidean distances are commonly used.

Step 3. Select the output node j* with minimum distance d_j. Update node j* and its neighbors. Nodes in the neighborhood NE_j*(t) are updated as:

w_ij(t+1) = w_ij(t) + α(t)·(x_i(t) − w_ij(t)), for j ∈ NE_j*(t)

where α(t) is a gain term that decreases in time.

Step 4. Repeat by going to Step 2 until convergence.
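Steps 2 and 3 are easy to sketch in code. A single training iteration in Python (our illustration, not the original implementation; the gain and radius schedules are arbitrary assumptions):

```python
import numpy as np

def som_step(weights, grid, x, t, t_max):
    """One SOM iteration. weights: (n_nodes, p) prototype vectors;
    grid: (n_nodes, 2) node coordinates on the 2D map; x: input vector."""
    alpha = 0.5 * (1.0 - t / t_max)           # gain term alpha(t), decreasing in time
    radius = 1.0 + 3.0 * (1.0 - t / t_max)    # shrinking neighborhood radius
    d = np.linalg.norm(weights - x, axis=1)   # Step 2: Euclidean distances to all nodes
    j_star = int(np.argmin(d))                # Step 3: best-matching node j*
    on_map = np.linalg.norm(grid - grid[j_star], axis=1)
    in_NE = on_map <= radius                  # nodes inside the neighborhood NE_j*(t)
    # w_ij(t+1) = w_ij(t) + alpha(t) * (x_i(t) - w_ij(t)), for j in NE_j*(t)
    weights[in_NE] += alpha * (x - weights[in_NE])
    return weights
```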

Kohonen SOM

Limitations

Arbitrary number of clusters:
The number of clusters is arbitrarily fixed from the beginning. Some clusters can remain unoccupied.

Lack of a tree structure:
The use of a two-dimensional structure for the net makes it impossible to recover a tree structure relating the clusters and subclusters.

Non-proportional clustering:
Clusters are formed based on the number of items, so distances among them are not proportional.

Growing cell structures

Growing cell structures produce a distribution-preserving mapping: the number of clusters and the connections among them are dynamically assigned during the training of the network.

The Kohonen SOM produces a topology-preserving mapping: the topology of the network and the number of clusters are fixed prior to the training of the network.

Insertion and deletion of neurons

After a fixed number of adaptations, every neuron q with a signal counter value h_q > h_c (a threshold) is used to create a new neuron.

The direct neighbor f of the neuron q having the greatest signal counter value is used to insert a new neuron between them.

The new neuron is connected so as to preserve the topology of the network.

Signal counter values are adjusted in the neighborhood.

Similarly, neurons with signal counter values below a threshold can be removed.


Growing cell structures

Network dynamics

Similar to that used by the Kohonen SOM, but with several important differences:

Adaptation strength is constant over time (ε_b and ε_n for the best-matching cell and its neighborhood, respectively).

Only the best-matching cell and its neighborhood are adapted.

Adaptation implies incrementing the signal counter of the best-matching cell and decrementing it in the remaining cells.

New cells can be inserted and existing cells can be removed in order to adapt the output map to the distribution of the input vectors.
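The insertion rule above can be sketched as follows (illustrative Python; the counter-adjustment scheme is an assumption, not the published one):

```python
import numpy as np

def insert_cell(weights, counters, neighbors, h_c):
    """If some cell q has signal counter h_q > h_c, insert a new cell
    between q and its direct neighbor f with the greatest counter.
    neighbors is a dict mapping each cell index to a set of neighbors."""
    q = int(np.argmax(counters))
    if counters[q] <= h_c:
        return weights, counters, neighbors    # nothing to insert
    f = max(neighbors[q], key=lambda n: counters[n])
    weights = np.vstack([weights, 0.5 * (weights[q] + weights[f])])
    new = len(weights) - 1
    # Rewire so the topology is preserved: the new cell sits between q and f
    neighbors[q].discard(f); neighbors[f].discard(q)
    neighbors[new] = {q, f}
    neighbors[q].add(new); neighbors[f].add(new)
    # Adjust signal counters in the neighborhood (assumed simple scheme)
    counters = np.append(counters, 0.5 * (counters[q] + counters[f]))
    counters[q] *= 0.75; counters[f] *= 0.75
    return weights, counters, neighbors
```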

Growing cell structures

Limitations

Arbitrary number of clusters:
The number of clusters is arbitrarily fixed from the beginning. Some clusters can remain unoccupied.

Lack of a tree structure:
The use of a two-dimensional structure for the net makes it impossible to recover a tree structure relating the clusters and subclusters.

Non-proportional clustering:
Clusters are formed based on the number of items, so distances among them are not proportional.







Many molecular data have different levels of structured information: e.g., phylogenies, molecular population data, DNA expression data (to some extent), etc.

But sometimes, behind the real world there is some hierarchy...

[Simulation: a hierarchical structure with clusters A, B, C, D; 20 items]

Mapping a hierarchical structure using a non-hierarchical method (SOM)

[Map: the items fall into cells labelled A,B; C,D; E,F; G; H]

Self Organising Tree Algorithm (SOTA)

A new neural network designed to deal with data that are related by means of a binary tree topology.

Dopazo & Carazo (1997) J. Mol. Evol. 44:226-233

Derived from the Kohonen SOM and the growing cell structures, but with several key differences:

The topology of the network is a binary tree.

Only growing of the network is allowed.

The growing mimics a speciation event, producing two new neurons from the most heterogeneous neuron.

Only terminal neurons are directly adapted by the input data; internal neurons are adapted through the terminal neurons.

SOTA: The algorithm

The Self Organising Tree Algorithm (SOTA) is a hierarchical divisive method based on a neural network. Unlike other hierarchical methods, SOTA grows from top to bottom until an appropriate level of variability is reached.

Step 1. Initialize nodes to random values.

Step 2. Present new input: compute distances to all terminal nodes.

Step 3. Select the output node j* with minimum distance d_j. Update node j* and its neighbors. Nodes in the neighborhood NE_j*(t) are updated as:

w_ij(t+1) = w_ij(t) + α(t)·(x_i(t) − w_ij(t)), for j ∈ NE_j*(t)

where α(t) is a gain term that decreases in time.

Step 4. Repeat by going to Step 2 until convergence.

Step 5. Reproduce the node with the highest variability.

Dopazo & Carazo (1997)
Herrero, Valencia & Dopazo (2001)

SOTA algorithm (neighborhood)

[Diagram: cells w, a and s; initial state, updating, growing and the different neighborhoods]

SOTA algorithm

[Flowchart: initialise system → cycle of epochs until convergence → add cell (winner, sister, mother) → repeat cycles until network convergence → end]

Cycle: repeat as many epochs as necessary to get convergence in the present state of the network.

Convergence: the relative error of the network falls below a threshold.

When a cycle finishes, the network size increases: two new neurons are attached to the neuron with the highest resources. This neuron becomes a mother neuron and no longer receives direct inputs.
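The cycle/epoch structure can be sketched in Python (a heavily simplified illustration: only the winning terminal neuron is adapted here, whereas SOTA also adapts the mother and sister neurons; gains and thresholds are placeholders):

```python
import numpy as np
rng = np.random.default_rng(0)

def sota(data, max_leaves=8, eta=0.1, tol=1e-4):
    """Minimal SOTA-like sketch: adapt terminal neurons until the cycle
    converges, then split the leaf with the highest resources."""
    leaves = [data.mean(axis=0)]
    while len(leaves) < max_leaves:
        prev_err = np.inf
        while True:                                  # one cycle = epochs until convergence
            err = 0.0
            for x in data:                           # one epoch: present every input
                j = int(np.argmin([np.linalg.norm(x - w) for w in leaves]))
                leaves[j] += eta * (x - leaves[j])   # adapt the winning terminal neuron
                err += np.linalg.norm(x - leaves[j])
            if abs(prev_err - err) / err < tol:
                break                                # relative error below threshold
            prev_err = err
        # Resources: mean distance of the items that map to each leaf
        assign = [int(np.argmin([np.linalg.norm(x - w) for w in leaves])) for x in data]
        res = [np.mean([np.linalg.norm(x - leaves[k])
                        for x, a in zip(data, assign) if a == k] or [0.0])
               for k in range(len(leaves))]
        # The most heterogeneous leaf becomes a mother with two daughters
        mother = leaves.pop(int(np.argmax(res)))
        leaves += [mother + 1e-3 * rng.standard_normal(mother.shape) for _ in range(2)]
    return leaves
```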

Applications

Sequence analysis
Microarray data analysis
Population data analysis

[Diagram: massive data, with high redundancy, is distilled into information]

Sequence analysis in the genomics era

Codification

Indeterminacies: R = {A or G}; N = {A or G or C or T}

Vectors of N × 4 (nucleotides) or N × 20 (amino acids), plus one component to represent deletions.

Other possible codifications: frequencies of dipeptides or dinucleotides.
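A sketch of the N × 4 codification (plus the deletion component) in Python; spreading an indeterminate base as equal fractions over its possible nucleotides is an assumed convention for illustration:

```python
import numpy as np

ALPHABET = "acgt-"                  # four nucleotides plus a deletion component
CODES = {"a": "a", "c": "c", "g": "g", "t": "t",
         "r": "ag",                 # R = {A or G}
         "n": "acgt",               # N = {A or G or C or T}
         "-": "-"}

def encode(seq):
    """Encode a sequence as an N x 5 matrix; an indeterminate base
    contributes an equal fraction to each nucleotide it may represent."""
    m = np.zeros((len(seq), len(ALPHABET)))
    for i, base in enumerate(seq.lower()):
        for b in CODES[base]:
            m[i, ALPHABET.index(b)] = 1.0 / len(CODES[base])
    return m

print(encode("acgr-n"))   # 'r' -> 0.5/0.5 on a/g; 'n' -> 0.25 on each nucleotide
```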

Updating the neurons

[Diagram: missing vs. updated positions]

Classifying proteins with SOM

Ferrán, Pflugfelder & Ferrara (1994) Self-organized neural maps of human protein sequences. Prot. Sci. 3:507-521.


Gene expression analysis using DNA microarrays

[Images: cDNA arrays (two-colour Cy5/Cy3 labelling) and oligonucleotide arrays]

The research paradigm is shifting:

Hypothesis-driven: one PhD per gene.
Ignorance-driven: parallelized, automated approach.

[Scale: Kb, Mb, Gb for sequences; Tb-Pb for DNA arrays]

Expression patterns

[Figure: four example expression patterns, 1-4]

Patterns can be:

time series
dosage series
different patients
different tissues
etc.

Different DNA-arrays

The data

Characteristics of the data:

Low signal-to-noise ratio.

High redundancy and intra-gene correlations.

Most of the genes are not informative with respect to the trait we are studying (they account for unrelated physiological conditions, etc.).

Many genes have no annotation!!

[Data matrix: genes (thousands) × experimental conditions (from tens up to no more than a few hundred)]

[Matrix annotations: (A) expression profile of a gene across the experimental conditions; (B) expression profile of all the genes for an experimental condition (array); (C) different classes of experimental conditions, e.g. cancer types, tissues, drug treatments, survival time, etc.]

Types of problems

Co-expressing genes... What do they have in common?

Different phenotypes... Which genes are responsible?

Genes interacting in a network (A, B, C, D, E)... What does the network look like?

Genes of a class... What profile(s) do they display? And... are there more such genes?

Molecular classification of samples: study of many conditions. Can we find groups of experiments with similar gene expression profiles?

These map onto unsupervised, supervised, and reverse-engineering problems.

What are we measuring?

[Image: red and green channel intensities; A (background), B (expression)]

Differential expression: B/A

Problem: the ratio is asymmetrical → solution: log-transformation

ratio            log10
100/1 = 100       2
10/1  = 10        1
1/1   = 1         0
1/10  = 0.1      -1
1/100 = 0.01     -2
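The table above is one line of NumPy (illustrative; log10 matches the values on the slide):

```python
import numpy as np

ratios = np.array([100 / 1, 10 / 1, 1 / 1, 1 / 10, 1 / 100])
print(ratios)             # [100.  10.  1.  0.1  0.01] -- asymmetrical around 1
print(np.log10(ratios))   # [  2.   1.  0. -1.  -2.  ] -- symmetrical around 0
```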

Distance

[Figure: three expression profiles A, B, C]

By differences: B <=> C

By correlation: A <=> B
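The point is easy to reproduce numerically (illustrative Python; the profiles are made up to mimic the figure):

```python
import numpy as np

A = np.array([1.0, 2.0, 3.0, 4.0])   # same shape as B, lower level
B = A + 5.0                          # parallel to A, level close to C
C = np.array([9.0, 8.0, 7.0, 6.0])   # opposite trend, level close to B

print(np.linalg.norm(B - C) < np.linalg.norm(A - B))     # True: B and C closest by differences
print(np.corrcoef(A, B)[0, 1], np.corrcoef(B, C)[0, 1])  # 1.0 and -1.0: A and B most correlated
```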

Clustering methods

                  Non-hierarchical    Hierarchical
Deterministic     K-means, PCA        UPGMA
Neural networks   SOM                 SOTA

Properties: hierarchical methods provide different levels of information; neural networks are robust.

Aggregative hierarchical clustering

CLUSTER

Links recursively the closest pair of profiles until the complete hierarchy is reconstructed. Relationships among profiles are represented by branch lengths.

Allows exploring the relationships among groups of related genes at higher levels.

Problems:

lack of robustness

the solution may not be unique

dependence on the data order
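With SciPy, this kind of aggregative clustering takes a few lines (an illustration of the approach, not the CLUSTER program itself; UPGMA corresponds to average linkage):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
profiles = rng.normal(size=(50, 8))       # 50 gene profiles x 8 conditions
Z = linkage(profiles, method="average")   # UPGMA: recursively join the closest pair
labels = fcluster(Z, t=4, criterion="maxclust")  # cut the hierarchy into 4 clusters
print(labels)
```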

Aggregative hierarchical clustering

What level would you consider for defining a cluster?

Subjective cluster definition.

Properties of neural networks for molecular data classification

Robust:

They manage real-world data sets containing noisy, ill-defined items with irrelevant variables and outliers.

Statistical distributions do not need to be parametric.

Fast and scalable to big data sets.


Kohonen SOM

Applied to microarray data

         t1   t2   ..  tp
sample1  a11  a12  ..  a1p
sample2  a21  a22  ..  a2p
:        :    :        :
samplen  an1  an2  ..  anp

Output map: Group 11: sample_a, sample_b, ...; Group 12: sample_a, sample_b, ...; Group 13: ...; Group 14: ...

Node prototypes: node 44: x1 x2 .. xp; node 34: y1 y2 .. yp; node 24: z1 z2 .. zp

Kohonen SOM

Microarray patterns

         gen1  gen2  ..  genp
sample1  a11   a12   ..  a1p
sample2  a21   a22   ..  a2p
:        :     :         :
samplen  an1   an2   ..  anp

Kohonen SOM

Example: response of human fibroblasts to serum.

Iyer et al. (1999) Science 283:83-87

The Self Organising Tree Algorithm (SOTA)

The Self Organising Tree Algorithm (SOTA) is a divisive hierarchical method based on a neural network.

Unlike other clustering methods, SOTA grows from top to bottom: growing can be stopped at the desired level of variability.

SOTA nodes are weighted averages of every item under the node.

Advantages of SOTA

Cluster patterns: each node of the tree has an associated pattern which corresponds to the cluster beneath it.

Divisive algorithm: SOTA grows from top to bottom; growing can be stopped at any desired level of variability.

Distribution preserving: the number of clusters depends on the variability of the data.

Robustness against noise.

TEST

From low resolution... to high resolution.

Where to stop growing?


Permutation test for cluster size definition

Original data:

       exp1  exp2  ..  expp
gen1   a11   a12   ..  a1p
gen2   a21   a22   ..  a2p
:      :     :         :
genn   an1   an2   ..  anp

Permuted data (each gene profile shuffled):

       exp1  exp2  ..  expp
gen1   a14   a17   ..  a1q
gen2   a23   a21   ..  a2r
:      :     :         :
genn   an9   an4   ..  ans

TEST: the 95% percentile of distances among permuted profiles gives a threshold: are d_ij > 0.4?
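The permutation idea can be sketched in Python (our illustration; a threshold like the 0.4 on the slide would come out of this kind of null distribution):

```python
import numpy as np
rng = np.random.default_rng(0)

def null_distance_threshold(data, n_perm=1000, pct=95):
    """Shuffle each gene profile independently, collect distances between
    random pairs of shuffled profiles, and return the pct-th percentile:
    a cutoff against which the real distances d_ij are compared."""
    n = len(data)
    null_d = []
    for _ in range(n_perm):
        shuffled = np.array([rng.permutation(row) for row in data])
        i, j = rng.choice(n, size=2, replace=False)
        null_d.append(np.linalg.norm(shuffled[i] - shuffled[j]))
    return np.percentile(null_d, pct)
```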

SOTA/SOM vs. classical clustering (UPGMA)

SOTA vs. SOM

Accuracy: the silhouette

Is the object closer to its own cluster or to the closest other cluster?
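The silhouette of a single object can be computed directly (illustrative Python): s = (b - a) / max(a, b), where a is the mean distance to the object's own cluster and b the mean distance to the closest other cluster:

```python
import numpy as np

def silhouette(i, X, labels):
    """s near 1: the object sits well inside its cluster;
    s < 0: it is closer to another cluster than to its own."""
    d = np.linalg.norm(X - X[i], axis=1)
    own = labels == labels[i]
    a = d[own].sum() / max(own.sum() - 1, 1)   # mean distance within own cluster (self excluded)
    b = min(d[labels == c].mean()              # mean distance to the closest other cluster
            for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)
```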