Lecture 3 (pptx) - the Computational Systems Biology Lab.


4 Oct 2013


Lecture 3

Data Types in computational
biology/Systems biology

Useful websites

Handling Multivariate data: Concept and
types of metrics, distances etc.

K-means clustering


What is systems biology?


Each lab/group has its own definition of systems biology.


This is because systems biology requires understanding and integrating
different levels of OMICS information, using knowledge from different
branches of science, and individual labs/groups work on different areas.


Theoretical target: Understanding life as a system.


Practical targets: Serving humanity by developing new-generation
medical tests, drugs, foods, fuels, materials, sensors, logic gates, etc.

Understanding life, or even a single cell, as a system is complicated and
requires comprehensive analysis of different data types and/or sub-systems.

Mostly, individual groups or people work on different sub-systems.


Some of the currently partially available and useful data types:


Genome sequences

Binding motifs in DNA sequences or cis-regulatory regions

Codon usage

Gene expression levels for global gene sets/microRNAs

Protein sequences

Protein structures

Protein domains

Protein-protein interactions

Binding relations between proteins and DNA

Regulatory relations between genes

Metabolic Pathways

Metabolite profiles

Species-metabolite relations

Plant usage in traditional medicines

Usually, experiments are conducted in wet labs to generate such data.

In dry labs like ours, we analyze these data to extract targeted information using
different algorithms, statistics, etc.

Data Types in computational biology/Systems biology

>gi|15223276|ref|NP_171609.1| ANAC001 (Arabidopsis NAC domain containing protein 1); transcription factor [Arabidopsis thaliana]

MEDQVGFGFRPNDEELVGHYLRNKIEGNTSRDVEVAISEVNICSYDPWNLRFQSKYKSRDAMWYFFSRRE

NNKGNRQSRTTVSGKWKLTGESVEVKDQWGFCSEGFRGKIGHKRVLVFLDGRYPDKTKSDWVIHEFHYDL

LPEHQRTYVICRLEYKGDDADILSAYAIDPTPAFVPNMTSSAGSVVNQSRQRNSGSYNTYSEYDSANHGQ

QFNENSNIMQQQPLQGSFNPLLEYDFANHGGQWLSDYIDLQQQVPYLAPYENESEMIWKHVIEENFEFLV

DERTSMQQHYSDHRPKKPVSGVLPDDSSDTETGSMIFEDTSSSTDSVGSSDEPGHTRIDDIPSLNIIEPL

HNYKAQEQPKQQSKEKVISSQKSECEWKMAEDSIKIPPSTNTVKQSWIVLENAQWNYLKNMIIGVLLFIS

VISWIILVG


Sequence data (Genome /Protein sequence)

Alignment algorithms are usually used to determine how well two or more
sequences match each other: dynamic programming methods such as
Needleman-Wunsch and Smith-Waterman, and faster heuristic tools such as BLAST.

Sequence matching/alignments
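
As a rough illustration of the dynamic programming idea behind sequence alignment, the sketch below scores the best global alignment of two sequences in the style of Needleman-Wunsch. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for this example, not values from the lecture; BLAST itself relies on a faster heuristic search rather than filling the full table.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score of the best global alignment of sequences a and b.

    The scoring scheme (match/mismatch/gap) is an illustrative assumption.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


print(global_alignment_score("ACGT", "AGT"))
```

With these assumed scores, identical sequences score their own length, and every inserted gap or mismatch lowers the total.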

CODONS

CODON USAGE


Multivariate data (Gene expression data/Metabolite profiles)

There are many types of clustering algorithms applicable
to multivariate data, e.g. hierarchical, K-means, SOM, etc.

Multivariate data can also be modeled using a multivariate
probability distribution function.

Binary relational data (protein-protein interactions, regulatory relations
between genes, metabolic pathways) are networks.

Clustering is usually used to extract information from networks.

Multivariate data and sequence data can also be easily converted to
networks, and then network clustering can be applied.

AtpB - AtpA
AtpG - AtpE
AtpA - AtpH
AtpB - AtpH
AtpG - AtpH
AtpE - AtpH
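
To make the idea of binary relational data as a network concrete, here is a minimal sketch that builds an undirected adjacency list. It assumes the protein names listed above are to be read as consecutive pairs (AtpB with AtpA, AtpG with AtpE, and so on); that pairing is an interpretation of the slide layout, and the function name is hypothetical.

```python
from collections import defaultdict

# Protein pairs, assuming the names above form consecutive pairs.
edges = [
    ("AtpB", "AtpA"),
    ("AtpG", "AtpE"),
    ("AtpA", "AtpH"),
    ("AtpB", "AtpH"),
    ("AtpG", "AtpH"),
    ("AtpE", "AtpH"),
]


def build_adjacency(pairs):
    """Return an undirected adjacency list {node: set of neighbours}."""
    adjacency = defaultdict(set)
    for u, v in pairs:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return dict(adjacency)


network = build_adjacency(edges)
# Degree = number of interaction partners of each protein.
degrees = {node: len(neighbours) for node, neighbours in network.items()}
print(degrees)
```

Under this reading, AtpH is the most connected node, which is the kind of structure network clustering algorithms work with.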

Useful Websites

www.geneontology.org
www.genome.ad.jp/kegg
www.ncbi.nlm.nih.gov
www.ebi.ac.uk/databases
http://www.ebi.ac.uk/uniprot/
http://www.yeastgenome.org/
http://mips.helmholtz-muenchen.de/proj/ppi/
http://www.ebi.ac.uk/trembl
http://dip.doe-mbi.ucla.edu/dip/Main.cgi
www.ensembl.org


Some websites where we can find different types of data and links to other databases

Source: Knowledge-Based Bioinformatics: From Analysis to Interpretation, Gil Alterovitz and Marco Ramoni (Editors)


NETWORK TOOLS


Handling Multivariate data: Concept and types of metrics

Multivariate data format

Multivariate data example

Distances, metrics, dissimilarities and similarities are related concepts


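A metric d is a function that satisfies the following properties:

(i) non-negativity: d(a, b) ≥ 0
(ii) symmetry: d(a, b) = d(b, a)
(iii) identification mark: d(a, a) = 0
(iv) definiteness: d(a, b) = 0 if and only if a = b
(v) triangle inequality: d(a, b) + d(b, c) ≥ d(a, c)

A function that satisfies only conditions (i)-(iii) is referred to as a distance.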

Source: Bioinformatics and Computational Biology Solutions Using R and Bioconductor (Statistics for Biology and Health), Robert Gentleman, Vincent Carey, Wolfgang Huber, Rafael Irizarry, Sandrine Dudoit (Editors)


Example:

Let,

X = (4, 6, 8)

Y = (5, 3, 9)

These measures consider the expression measurements as points in some
metric space.

Widely used metrics for finding similarity




Correlation
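
The example vectors X = (4, 6, 8) and Y = (5, 3, 9) can be used to try out a few measures. The sketch below computes the Euclidean distance, the Manhattan distance and the Pearson correlation; the original slide formulas are not recoverable here, so the choice of Euclidean and Manhattan as representative metrics is an assumption.

```python
import math


def euclidean(x, y):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def pearson(x, y):
    """Pearson correlation coefficient (a similarity, not a distance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


X = (4, 6, 8)
Y = (5, 3, 9)
print(euclidean(X, Y))  # sqrt(1 + 9 + 1) = sqrt(11)
print(manhattan(X, Y))  # 1 + 3 + 1 = 5
print(pearson(X, Y))    # sqrt(3/7), roughly 0.65
```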








Statistical distance between points

The Euclidean distance between points Q and P is larger than that between Q and
the origin O, yet P and Q appear to belong to the same cluster, whereas Q and O do not.

The statistical (Mahalanobis) distance between two vectors can be calculated if the
variance-covariance matrix is known or estimated.
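
A small two-dimensional sketch may help show how the statistical distance can reverse the ordering given by the Euclidean distance. It uses the standard Mahalanobis form d(x, y) = sqrt((x - y)^T S^-1 (x - y)), where S is the variance-covariance matrix. The coordinates of P, Q and O and the matrix S below are hypothetical values invented for illustration; they are not taken from the slide's figure.

```python
import math


def mahalanobis_2d(p, q, cov):
    """Mahalanobis distance between 2-D points p and q.

    cov is the 2x2 variance-covariance matrix [[s11, s12], [s21, s22]].
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a 2x2 matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx, dy = p[0] - q[0], p[1] - q[1]
    quad = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(quad)


# Hypothetical cluster stretched along the direction (1, -1).
S = [[1.0, -0.9], [-0.9, 1.0]]
Q, P, O = (1.0, 1.0), (3.0, -1.0), (0.0, 0.0)

print(math.dist(Q, P), math.dist(Q, O))                  # Euclidean: Q-P is larger
print(mahalanobis_2d(Q, P, S), mahalanobis_2d(Q, O, S))  # statistical: Q-P is smaller
```

With an identity covariance matrix the function reduces to the ordinary Euclidean distance.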

Distances between distributions


Unlike the previous approach (i.e. considering expression measurements as
points in some metric space), the data for each feature can be considered an independent
sample from a population.


Therefore the data reflect the underlying population, and we need to measure
similarities between two densities/distributions.



Kullback-Leibler Information (KLI)

KLI measures how much the shape of one distribution resembles the other.

Mutual information (MI)

MI is large when the joint distribution is quite different from the product of the marginals.
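
For discrete distributions, both quantities can be sketched in a few lines. The code below uses the standard definitions, KLI(p, q) = sum of p_i log(p_i / q_i) and MI = sum of p_ij log(p_ij / (p_i. p_.j)), with natural logarithms. The function names and the toy probability tables are assumptions for illustration, and the KLI function assumes q_i > 0 wherever p_i > 0.

```python
import math


def kl_information(p, q):
    """Kullback-Leibler information of q relative to p (discrete case)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mutual_information(joint):
    """Mutual information of a discrete joint distribution given as a table."""
    row_marginal = [sum(row) for row in joint]
    col_marginal = [sum(col) for col in zip(*joint)]
    total = 0.0
    for i, row in enumerate(joint):
        for j, p_ij in enumerate(row):
            if p_ij > 0:
                total += p_ij * math.log(p_ij / (row_marginal[i] * col_marginal[j]))
    return total


independent = [[0.25, 0.25], [0.25, 0.25]]  # joint equals product of marginals
dependent = [[0.5, 0.0], [0.0, 0.5]]        # knowing one variable fixes the other

print(kl_information([0.5, 0.5], [0.9, 0.1]))
print(mutual_information(independent), mutual_information(dependent))
```

The independent table gives an MI of zero, while the fully dependent table gives log 2, in line with the statement that MI grows as the joint distribution departs from the product of the marginals.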

K-means clustering


Source: “Clustering Challenges in Biological Networks”, edited by S. Butenko et al.

Source: Teknomo, Kardi. K-Means Clustering Tutorials. http://people.revoledu.com/kardi/tutorial/kMean/

1. Initial value of centroids: Suppose we use medicine A and medicine B as
the first centroids. Let c1 and c2 denote the coordinates of the centroids;
then c1 = (1,1) and c2 = (2,1).
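
The remaining iterations can be sketched as a short loop that alternates between assigning each object to its nearest centroid and recomputing the centroids. Only c1 = (1,1) and c2 = (2,1) are given in the text above; the coordinates used here for medicines C and D, (4,3) and (5,4), are assumed for illustration, and the function is a minimal sketch rather than the tutorial's own code. It needs Python 3.8 or later for `math.dist`.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Plain K-means: returns (cluster index per point, final centroids)."""
    centroids = [tuple(c) for c in centroids]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no object changed cluster, so the result is stable
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centroids


# A and B match c1 and c2 above; C and D are assumed values.
medicines = [(1, 1), (2, 1), (4, 3), (5, 4)]
labels, centres = kmeans(medicines, [(1, 1), (2, 1)])
print(labels, centres)
```

With these assumed points, the loop settles with A and B in one cluster and C and D in the other.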

Hierarchical clustering


AtpB - AtpA
AtpG - AtpE
AtpA - AtpH
AtpB - AtpH
AtpG - AtpH
AtpE - AtpH

Data are not always available as binary relations, as in the case of
protein-protein interactions, where we can directly apply network
clustering algorithms.

In many cases, for example in microarray gene expression analysis,
the data are multivariate.

An Introduction to Bioinformatics Algorithms by Jones & Pevzner

We can convert multivariate data into networks and then apply
network clustering algorithms, which we will discuss in a later class.

If the dimension of the multivariate data is 3 or less, we can cluster
them by plotting directly.


However, when dimension is more than 3, we can apply
hierarchical clustering to multivariate data.


In hierarchical clustering the data are not partitioned into a
particular cluster in a single step. Instead, a series of partitions
takes place.

Some data reveal good cluster structure when plotted but some
data do not.

Data plotted in 2 dimensions


Hierarchical clustering is a technique that organizes
elements into a tree.

A tree is a connected graph that has no cycles.

A tree with n nodes has exactly n-1 edges.



A Graph

A tree


Hierarchical clustering is subdivided into 2 types:

1. agglomerative methods, which proceed by a series of fusions of the n objects
into groups, and

2. divisive methods, which separate the n objects successively into finer
groupings.

Agglomerative techniques are more commonly used.

The data can be viewed at any level, from a single cluster containing all
objects to n clusters each containing a single object.


Distance measurements

Euclidean distance between g1 and g2


Instead of Euclidean distance, correlation can also be used as
a distance measurement.

For biological analysis involving genes and proteins, nucleotide
and/or amino acid sequence similarity can also be used as the
distance between objects.
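
One common way to turn correlation into a distance is d = 1 - r, where r is the Pearson correlation; this convention is an assumption here (others use 1 - |r|), and the expression profiles g1, g2 and g3 below are made-up values. The sketch shows that two profiles with the same shape but very different magnitudes are far apart in Euclidean terms yet have a correlation distance of essentially zero.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def correlation_distance(x, y):
    """1 - r: near 0 for identical shapes, near 2 for opposite shapes."""
    return 1.0 - pearson(x, y)


# Hypothetical expression profiles across four conditions.
g1 = (1, 2, 3, 4)
g2 = (10, 20, 30, 40)  # same shape as g1, ten times the magnitude
g3 = (4, 3, 2, 1)      # opposite trend to g1

print(math.dist(g1, g2), correlation_distance(g1, g2))
print(math.dist(g1, g3), correlation_distance(g1, g3))
```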



An agglomerative hierarchical clustering procedure produces
a series of partitions of the data, P_n, P_{n-1}, ..., P_1. The first, P_n,
consists of n single-object 'clusters'; the last, P_1, consists of a
single group containing all n cases.




At each particular stage the method joins together the two
clusters which are closest together (most similar).


(At the first
stage, of course, this amounts to joining together the two
objects that are closest together, since at the initial stage each
cluster has one object.)
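
The procedure can be sketched as a loop that starts from n single-object clusters and repeatedly merges the closest pair, recording every partition along the way. The between-cluster distance is passed in as a function; the default used here (smallest pairwise Euclidean distance) is only one possible choice, and the sample points and function names are assumptions for illustration.

```python
import math


def closest_pair_distance(cluster_a, cluster_b):
    """Smallest Euclidean distance between any object of A and any of B."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)


def agglomerate(points, cluster_distance=closest_pair_distance):
    """Return the list of partitions, from n singletons down to one cluster."""
    clusters = [[p] for p in points]
    partitions = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Find the pair of clusters that are closest together.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        partitions.append([list(c) for c in clusters])
    return partitions


sample = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
for partition in agglomerate(sample):
    print(len(partition), partition)
```

Each pass reduces the number of clusters by one, so n objects yield n partitions in total.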




Differences between methods arise because of the
different ways of defining distance (or similarity)
between clusters.


How can we measure distances between clusters?

Single linkage clustering


The distance between two clusters A and B, D(A,B), is computed as

D(A,B) = Min { d(i,j) : object i is in cluster A and object j is in cluster B }


Complete linkage clustering


The distance between two clusters A and B, D(A,B), is computed as

D(A,B) = Max { d(i,j) : object i is in cluster A and object j is in cluster B }
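
The single and complete linkage definitions differ only in taking the minimum or the maximum over the same set of pairwise distances. A minimal sketch, assuming Euclidean distance for d(i,j) and two made-up clusters on a line:

```python
import math


def single_linkage(cluster_a, cluster_b, d=math.dist):
    """D(A,B) = smallest d(i,j) with i in A and j in B."""
    return min(d(i, j) for i in cluster_a for j in cluster_b)


def complete_linkage(cluster_a, cluster_b, d=math.dist):
    """D(A,B) = largest d(i,j) with i in A and j in B."""
    return max(d(i, j) for i in cluster_a for j in cluster_b)


# Hypothetical clusters used only for illustration.
A = [(0, 0), (1, 0)]
B = [(3, 0), (6, 0)]

print(single_linkage(A, B))    # closest pair: (1,0) and (3,0)
print(complete_linkage(A, B))  # farthest pair: (0,0) and (6,0)
```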



Average linkage clustering


The distance between two clusters A and B, D(A,B), is computed as

D(A,B) = T_AB / (N_A * N_B)

where T_AB is the sum of all pairwise distances between objects of cluster A
and objects of cluster B, and N_A and N_B are the sizes of clusters A and B,
respectively.

Total N_A * N_B edges
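
As a short sketch of the average linkage formula, assuming Euclidean distance for the pairwise distances and two made-up clusters:

```python
import math


def average_linkage(cluster_a, cluster_b, d=math.dist):
    """D(A,B) = T_AB / (N_A * N_B), the mean of all between-cluster distances."""
    t_ab = sum(d(i, j) for i in cluster_a for j in cluster_b)
    return t_ab / (len(cluster_a) * len(cluster_b))


# Hypothetical clusters used only for illustration.
A = [(0, 0), (1, 0)]
B = [(3, 0), (6, 0)]

# Between-cluster distances are 3, 6, 2 and 5, so T_AB = 16 over 2 * 2 edges.
print(average_linkage(A, B))
```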


Average group linkage clustering


The distance between two clusters A and B, D(A,B), is computed as

D(A,B) = Average { d(i,j) : observations i and j are in
cluster t, the cluster formed by merging clusters A and B }

Total n(n-1)/2 edges
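
Average group linkage averages over every pair inside the merged cluster t, including pairs that were already together in A or in B. The sketch below assumes Euclidean distance and reads n as the size of the merged cluster (N_A + N_B), which is an interpretation of the edge count above; the clusters are made-up values.

```python
import math
from itertools import combinations


def average_group_linkage(cluster_a, cluster_b, d=math.dist):
    """Mean of d(i,j) over all pairs in the merged cluster t = A + B."""
    merged = list(cluster_a) + list(cluster_b)
    pairs = list(combinations(merged, 2))  # n(n-1)/2 pairs for n objects
    return sum(d(i, j) for i, j in pairs) / len(pairs)


# Hypothetical clusters used only for illustration.
A = [(0, 0), (1, 0)]
B = [(3, 0), (6, 0)]

# Pairwise distances in t are 1, 3, 6, 2, 5 and 3, so the mean is 20 / 6.
print(average_group_linkage(A, B))
```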

Hierarchical Clustering

Alizadeh et al. Nature 403: 503-511 (2000).


Hierarchical Clustering

Classifying bacteria based on 16S rRNA sequences.