Lecture 3
Data types in computational biology/systems biology
Useful websites
Handling multivariate data: concept and types of metrics, distances, etc.
K-means clustering
What is systems biology?
Each lab/group has its own definition of systems biology.
This is because systems biology requires understanding and integrating
different levels of OMICS information using knowledge from different
branches of science, and individual labs/groups work on different areas.
Theoretical target: Understanding life as a system.
Practical targets: Serving humanity by developing new-generation
medical tests, drugs, foods, fuels, materials, sensors, logic gates, ...
Understanding life, or even a cell, as a system is complicated and
requires comprehensive analysis of different data types and/or sub-systems.
Mostly, individual groups or people work on different sub-systems.
Some of the currently (partially) available and useful data types:
Genome sequences
Binding motifs in DNA sequences or cis-regulatory regions
Codon usage
Gene expression levels for global gene sets/microRNAs
Protein sequences
Protein structures
Protein domains
Protein-protein interactions
Binding relations between proteins and DNA
Regulatory relations between genes
Metabolic pathways
Metabolite profiles
Species-metabolite relations
Plant usage in traditional medicines
Usually, experiments are conducted in wet labs to generate such data.
In dry labs like ours, we analyze these data to extract targeted information using
different algorithms, statistics, etc.
Data Types in computational biology/Systems biology
>gi|15223276|ref|NP_171609.1| ANAC001 (Arabidopsis NAC domain containing protein 1); transcription factor [Arabidopsis thaliana]
MEDQVGFGFRPNDEELVGHYLRNKIEGNTSRDVEVAISEVNICSYDPWNLRFQSKYKSRDAMWYFFSRRE
NNKGNRQSRTTVSGKWKLTGESVEVKDQWGFCSEGFRGKIGHKRVLVFLDGRYPDKTKSDWVIHEFHYDL
LPEHQRTYVICRLEYKGDDADILSAYAIDPTPAFVPNMTSSAGSVVNQSRQRNSGSYNTYSEYDSANHGQ
QFNENSNIMQQQPLQGSFNPLLEYDFANHGGQWLSDYIDLQQQVPYLAPYENESEMIWKHVIEENFEFLV
DERTSMQQHYSDHRPKKPVSGVLPDDSSDTETGSMIFEDTSSSTDSVGSSDEPGHTRIDDIPSLNIIEPL
HNYKAQEQPKQQSKEKVISSQKSECEWKMAEDSIKIPPSTNTVKQSWIVLENAQWNYLKNMIIGVLLFIS
VISWIILVG
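Sequence data (Genome/Protein sequence)
Usually, alignment algorithms are used to determine how well two or more
sequences match each other: dynamic programming methods (e.g. Needleman-Wunsch,
Smith-Waterman) give optimal alignments, and BLAST is a faster heuristic
built on the same scoring ideas.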
Sequence matching/alignments
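The dynamic programming idea behind sequence alignment can be sketched as follows. This is a minimal Needleman-Wunsch global alignment score, not BLAST itself (which adds heuristics for speed). The function name and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score of the best global alignment of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


print(global_alignment_score("ACGT", "AGT"))  # three matches and one gap
```

Each cell depends only on its three neighbours, so the table is filled in time proportional to the product of the two sequence lengths.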
CODONS
CODON USAGE
Multivariate data (Gene expression data/Metabolite profiles)
There are many types of clustering algorithms applicable
to multivariate data, e.g. hierarchical, K-means, SOM, etc.
Multivariate data can also be modeled using multivariate
probability distribution functions.
Binary relational data (protein-protein interactions, regulatory relations
between genes, metabolic pathways) are networks.
Clustering is usually used to extract information from networks.
Multivariate data and sequence data can also be easily converted to
networks, and then network clustering can be applied.
AtpB  AtpA
AtpG  AtpE
AtpA  AtpH
AtpB  AtpH
AtpG  AtpH
AtpE  AtpH
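A small sketch of how such binary relations can be held as a network in code. It reads the protein names listed above as consecutive pairs (an assumption about the original figure) and builds an undirected adjacency list; the function name `build_adjacency` is hypothetical.

```python
from collections import defaultdict

# Each tuple is one binary relation (an edge of the network).
edges = [
    ("AtpB", "AtpA"),
    ("AtpG", "AtpE"),
    ("AtpA", "AtpH"),
    ("AtpB", "AtpH"),
    ("AtpG", "AtpH"),
    ("AtpE", "AtpH"),
]


def build_adjacency(edge_list):
    """Return {node: set of neighbours} for an undirected edge list."""
    adjacency = defaultdict(set)
    for u, v in edge_list:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return dict(adjacency)


adjacency = build_adjacency(edges)
# Degree = number of interaction partners of each protein.
degree = {node: len(neighbours) for node, neighbours in adjacency.items()}
print(degree)
```

Network clustering algorithms typically start from a structure like `adjacency`.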
Useful Websites
www.geneontology.org
www.genome.ad.jp/kegg
www.ncbi.nlm.nih.gov
www.ebi.ac.uk/databases
http://www.ebi.ac.uk/uniprot/
http://www.yeastgenome.org/
http://mips.helmholtz-muenchen.de/proj/ppi/
http://www.ebi.ac.uk/trembl
http://dip.doe-mbi.ucla.edu/dip/Main.cgi
www.ensembl.org
Some websites where we can find different types of data and links to
other databases
Source: Knowledge-Based Bioinformatics: From Analysis to Interpretation, Gil Alterovitz, Marco Ramoni (Editors)
NETWORK TOOLS
Handling Multivariate data: Concept and types of metrics
Multivariate data format
Multivariate data example
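Distances, metrics, dissimilarities and similarities are related concepts.
A metric is a function d that satisfies the following properties:
(i) non-negativity: d(a, b) >= 0
(ii) symmetry: d(a, b) = d(b, a)
(iii) identification mark: d(a, a) = 0
(iv) definiteness: d(a, b) = 0 if and only if a = b
(v) triangle inequality: d(a, b) + d(b, c) >= d(a, c)
A function that satisfies only conditions (i)-(iii) is referred to as a distance.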
Source: Bioinformatics and Computational Biology Solutions Using R and Bioconductor (Statistics for Biology and Health), Robert Gentleman, Vincent Carey, Wolfgang Huber, Rafael Irizarry, Sandrine Dudoit (Editors)
Example:
Let,
X = (4, 6, 8)
Y = (5, 3, 9)
These measures consider the expression measurements as points in some
metric space.
Widely used metrics for finding similarity
Correlation
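As a worked illustration, the example vectors X = (4, 6, 8) and Y = (5, 3, 9) can be compared with a few common measures. The choice of Euclidean distance, Manhattan distance and Pearson correlation is an assumption about which measures the original slides showed; the function names are hypothetical.

```python
import math

X = (4, 6, 8)
Y = (5, 3, 9)


def euclidean(x, y):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def pearson(x, y):
    """Pearson correlation coefficient (a similarity, not a distance)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


print(euclidean(X, Y))  # sqrt(1 + 9 + 1) = sqrt(11), about 3.317
print(manhattan(X, Y))  # 1 + 3 + 1 = 5
print(pearson(X, Y))    # about 0.655
```

Note that the two distances are small for similar vectors, whereas the correlation is large for similar vectors.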
Statistical distance between points
The Euclidean distance between points Q and P is larger than that between Q and
the origin O, yet P and Q appear to belong to the same cluster, while Q and O do not.
The statistical distance / Mahalanobis distance between two vectors can be
calculated if the variance-covariance matrix is known or estimated.
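A minimal sketch of the Mahalanobis distance, d(x, y) = sqrt((x - y)^T S^-1 (x - y)), restricted to two dimensions so that the covariance matrix S can be inverted by hand. The points O, Q, P and the covariance matrix are hypothetical values chosen to reproduce the situation described above; they are not taken from the original figure.

```python
import math


def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance between 2-D points x and y for a 2x2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = x[0] - y[0], x[1] - y[1]
    quad = (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))
    return math.sqrt(quad)


# Hypothetical data: strongly positively correlated variables.
cov = ((1.0, 0.9), (0.9, 1.0))
O, Q, P = (0.0, 0.0), (1.0, -1.0), (3.0, 1.0)

print(math.dist(Q, P), math.dist(Q, O))                      # Euclidean: Q is closer to O
print(mahalanobis_2d(Q, P, cov), mahalanobis_2d(Q, O, cov))  # statistical: Q is closer to P
```

Displacements along the direction in which the data vary together are cheap, so the ordering of the two distances is reversed relative to the Euclidean case.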
Distances between distributions
Unlike the previous approach (i.e. considering expression measurements as
points in some metric space), the data for each feature can be considered as an
independent sample from a population.
Therefore, the data reflect the underlying population, and we need to measure
similarities between two densities/distributions.
Kullback-Leibler Information
Mutual information
KLI measures how much the shape of one distribution resembles the other.
MI is large when the joint distribution is quite different from the product
of the marginals.
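Both quantities can be sketched for discrete distributions. The snippet below assumes the standard definitions, KL(p || q) = sum p_i log(p_i / q_i) and MI = sum p(x, y) log(p(x, y) / (p(x) p(y))), with natural logarithms; the function names and the example probability tables are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler information of p relative to q.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mutual_information(joint):
    """Mutual information of a joint probability table (list of rows)."""
    px = [sum(row) for row in joint]         # marginal of the row variable
    py = [sum(col) for col in zip(*joint)]   # marginal of the column variable
    return sum(
        pxy * math.log(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )


print(kl_divergence([0.5, 0.5], [0.5, 0.5]))              # identical shapes
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))              # different shapes
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))   # independent variables
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))       # fully dependent variables
```

The Kullback-Leibler value is zero when the two shapes coincide and grows as they differ; it is not symmetric in p and q.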
K-means clustering
Source: "Clustering Challenges in Biological Networks", edited by S. Butenko et al.
Source: Teknomo, Kardi. K-Means Clustering Tutorials. http://people.revoledu.com/kardi/tutorial/kMean/
1. Initial value of centroids: Suppose we use medicine A and medicine B as
the first centroids. Let c1 and c2 denote the coordinates of the centroids;
then c1 = (1,1) and c2 = (2,1).
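The remaining steps of the procedure (assign each object to the nearest centroid, recompute the centroids, repeat until nothing changes) can be sketched as below. Only the coordinates of medicines A and B come from the text above; the coordinates of C and D are assumed for illustration, and the function name `kmeans` is hypothetical.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Plain K-means: returns (final centroids, cluster index per point)."""
    centroids = [tuple(c) for c in centroids]
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no object changed cluster: converged
        assignment = new_assignment
        # Step 3: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment


# A and B are from the text; C and D are assumed values.
medicines = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
final_centroids, final_assignment = kmeans(
    list(medicines.values()), centroids=[(1, 1), (2, 1)]
)
print(final_centroids)   # [(1.5, 1.0), (4.5, 3.5)]
print(final_assignment)  # [0, 0, 1, 1]
```

With these values, B starts as its own centroid but ends up grouped with A after the second iteration.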
Hierarchical clustering
AtpB  AtpA
AtpG  AtpE
AtpA  AtpH
AtpB  AtpH
AtpG  AtpH
AtpE  AtpH
Data are not always available as binary relations, as in the case of
protein-protein interactions, where we can directly apply network
clustering algorithms.
In many cases, for example in microarray gene expression analysis,
the data are of multivariate type.
An Introduction to Bioinformatics Algorithms by Jones & Pevzner
We can convert multivariate data into networks and then apply
network clustering algorithms, which we will discuss in a later class.
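One simple way to do such a conversion, shown here only as a sketch, is to connect two objects whenever their distance falls below a cutoff. The expression profiles, the gene names, the threshold and the function name `profiles_to_network` are all hypothetical; other similarity measures (for example correlation) could be used instead of Euclidean distance.

```python
import math
from itertools import combinations

# Hypothetical expression profiles: gene name -> measurements.
profiles = {
    "g1": (1.0, 2.0, 3.0, 4.0),
    "g2": (1.1, 2.1, 2.9, 4.2),
    "g3": (4.0, 3.0, 2.0, 1.0),
    "g4": (3.9, 3.2, 2.1, 0.8),
}


def profiles_to_network(data, threshold):
    """Return the edges (u, v) whose Euclidean distance is <= threshold."""
    edges = []
    for (u, x), (v, y) in combinations(data.items(), 2):
        if math.dist(x, y) <= threshold:
            edges.append((u, v))
    return edges


network = profiles_to_network(profiles, threshold=0.5)
print(network)  # [('g1', 'g2'), ('g3', 'g4')]
```

The resulting edge list is a binary relation of the same kind as a protein-protein interaction list, so network clustering can be applied to it.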
If the dimension of the multivariate data is 3 or less, we can cluster
them by plotting directly.
However, when the dimension is more than 3, we can apply
hierarchical clustering to multivariate data.
In hierarchical clustering, the data are not partitioned into
particular clusters in a single step. Instead, a series of partitions
takes place.
Some data reveal good cluster structure when plotted, but some
data do not.
Data plotted in 2 dimensions
Hierarchical clustering is a technique that organizes
elements into a tree.
A tree is a connected graph that has no cycle.
A tree with n nodes has exactly n-1 edges.
A Graph
A tree
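The difference between a general graph and a tree can be checked in code. This sketch uses the standard characterization that an undirected graph on n nodes is a tree exactly when it is connected and has n-1 edges; the function name `is_tree` and the example graphs are hypothetical.

```python
def is_tree(nodes, edges):
    """Return True if the undirected graph is connected and has no cycle."""
    nodes = list(nodes)
    if not nodes or len(edges) != len(nodes) - 1:
        return False
    adjacency = {node: set() for node in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    # Depth-first search from an arbitrary node to test connectivity.
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:
        current = stack.pop()
        for neighbour in adjacency[current] - seen:
            seen.add(neighbour)
            stack.append(neighbour)
    return len(seen) == len(nodes)


print(is_tree("abcd", [("a", "b"), ("b", "c"), ("b", "d")]))              # a tree
print(is_tree("abcd", [("a", "b"), ("b", "c"), ("b", "d"), ("c", "d")]))  # has a cycle
```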
Hierarchical clustering is subdivided into 2 types:
1. agglomerative methods, which proceed by a series of fusions of the n objects
into groups, and
2. divisive methods, which separate the n objects successively into finer
groupings.
Agglomerative techniques are more commonly used.
The data can be viewed at any level, from a single cluster containing all
objects to n clusters each containing a single object.
Distance measurements
Euclidean distance between g_1 and g_2
Instead of Euclidean distance, correlation can also be used as
a distance measurement (for example 1 - r, so that highly correlated
objects are close).
For biological analysis involving genes and proteins, nucleotide
and/or amino acid sequence similarity can also be used as the
distance between objects.
• An agglomerative hierarchical clustering procedure produces
a series of partitions of the data, P_n, P_(n-1), ..., P_1. The first, P_n,
consists of n single-object 'clusters'; the last, P_1, consists of a
single group containing all n cases.
• At each stage, the method joins together the two
clusters which are closest together (most similar).
(At the first stage, of course, this amounts to joining together the two
objects that are closest together, since at the initial stage each
cluster has one object.)
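The procedure above can be sketched as a loop that records every partition from P_n down to P_1. The cluster distance is passed in as a function; single linkage (minimum pairwise Euclidean distance) is used here only as an assumed default, and the points and function names are hypothetical.

```python
import math
from itertools import combinations


def single_linkage(a, b):
    """Smallest Euclidean distance between a member of a and a member of b."""
    return min(math.dist(x, y) for x in a for y in b)


def agglomerate(points, cluster_distance=single_linkage):
    """Return the list of partitions [P_n, P_(n-1), ..., P_1]."""
    clusters = [[p] for p in points]            # P_n: one object per cluster
    partitions = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Find the pair of clusters that are closest together.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        partitions.append([list(c) for c in clusters])
    return partitions


points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]  # hypothetical data
partitions = agglomerate(points)
for partition in partitions:
    print(len(partition), partition)
```

This brute-force search over all cluster pairs is easy to read but slow for large n; it is meant to mirror the description, not to be efficient.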
Differences between methods arise because of the
different ways of defining distance (or similarity)
between clusters.
How can we measure distances between clusters?
Single linkage clustering
Distance between two clusters A and B, D(A,B), is computed as
D(A,B) = Min { d(i,j) : where object i is in cluster A and object j is in cluster B }
Complete linkage clustering
Distance between two clusters A and B, D(A,B), is computed as
D(A,B) = Max { d(i,j) : where object i is in cluster A and object j is in cluster B }
Average linkage clustering
Distance between two clusters A and B, D(A,B), is computed as
D(A,B) = T_AB / (N_A * N_B)
where T_AB is the sum of all pairwise distances between objects of cluster A
and cluster B, and N_A and N_B are the sizes of clusters A and B, respectively.
Total N_A * N_B edges
Average group linkage clustering
Distance between two clusters A and B, D(A,B), is computed as
D(A,B) = Average { d(i,j) : where observations i and j are in
cluster t, the cluster formed by merging clusters A and B }
Total n(n-1)/2 edges, where n is the number of objects in cluster t
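The four cluster-to-cluster distances defined above can be written side by side. This sketch assumes Euclidean distance for d(i,j); the function names and the one-dimensional example clusters are hypothetical.

```python
import math
from itertools import combinations


def single_linkage(A, B):
    """Minimum of d(i, j) over i in A and j in B."""
    return min(math.dist(x, y) for x in A for y in B)


def complete_linkage(A, B):
    """Maximum of d(i, j) over i in A and j in B."""
    return max(math.dist(x, y) for x in A for y in B)


def average_linkage(A, B):
    """T_AB / (N_A * N_B): mean over the N_A * N_B between-cluster pairs."""
    total = sum(math.dist(x, y) for x in A for y in B)
    return total / (len(A) * len(B))


def average_group_linkage(A, B):
    """Mean of d(i, j) over all n(n-1)/2 pairs in the merged cluster."""
    merged = list(A) + list(B)
    pairs = list(combinations(merged, 2))
    return sum(math.dist(x, y) for x, y in pairs) / len(pairs)


A = [(0.0,), (1.0,)]
B = [(3.0,), (5.0,)]
print(single_linkage(A, B))         # 2.0  (between 1 and 3)
print(complete_linkage(A, B))       # 5.0  (between 0 and 5)
print(average_linkage(A, B))        # (3 + 5 + 2 + 4) / 4 = 3.5
print(average_group_linkage(A, B))  # 17 / 6, about 2.833
```

Average group linkage also counts the within-cluster pairs, which is why it gives a smaller value than average linkage here.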
Alizadeh et al., Nature 403: 503-511 (2000).
Classifying bacteria based on 16S rRNA sequences.