BioInformatics (3)

underlingbuddha, Biotechnology

2 Oct 2013

Computational Issues


Data Warehousing: organising biological information into a structured
entity (the world's largest distributed DB)

Function Analysis (Numerical Analysis):

Gene Expression Analysis: applying sophisticated data mining and
visualisation to understand gene activities within an environment
(clustering)

Integrated Genomic Study: relating structural analysis with
functional analysis

Structure Analysis (Symbolic Analysis):

Sequence Alignment: analysing a sequence using comparative methods
against existing databases to develop hypotheses concerning relatives
(genetics) and functions (dynamic programming and HMM)

Structure Prediction: predicting the 3D structure of a protein from
its sequence (inductive LP)




Data Warehousing: Mapping Biologic into Data Logic

Structure Analysis :

Alignments & Scores

Global (e.g. haplotype)


ACCACACA


::xx::x:


ACACCATA

Score = 5(+1) + 3(-1) = 2

Suffix (shotgun assembly)


ACCACACA
     :::
     ACACCATA

Score= 3(+1) =3


Local (motif)


  ACCACACA
  ::::
ACACCATA

Score= 4(+1) = 4
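
The three scores above can be checked with a minimal sketch. It assumes the slide's scoring of +1 per match and -1 per mismatch, no gaps, and treats the local case as the longest exact shared run. The function names are illustrative and do not come from the slides.

```python
def global_score(a, b, match=1, mismatch=-1):
    """Ungapped global score: compare equal-length sequences position by position."""
    if len(a) != len(b):
        raise ValueError("ungapped global comparison needs equal-length sequences")
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def suffix_prefix_overlap(a, b):
    """Length of the longest suffix of `a` that exactly equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def longest_common_substring(a, b):
    """Length of the longest exact run shared by `a` and `b` (dynamic programming)."""
    best = 0
    prev = [0] * (len(b) + 1)  # run lengths for the previous character of `a`
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                cur[j] = prev[j - 1] + 1  # extend the run ending one step back
                best = max(best, cur[j])
        prev = cur
    return best


s, t = "ACCACACA", "ACACCATA"
print(global_score(s, t))              # 5 matches and 3 mismatches
print(suffix_prefix_overlap(s, t))     # suffix "ACA" of s equals prefix "ACA" of t
print(longest_common_substring(s, t))  # longest exact shared run
```

A full local alignment with gaps (for example Smith-Waterman) would need a richer recurrence; this sketch only covers the ungapped cases shown on the slide.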


A comparison of the homology search and the motif search
for functional interpretation of sequence information.

[Figure: two flow diagrams. Homology Search panel: new sequence,
retrieval, sequence database (primary data), similar sequence, expert
knowledge, sequence interpretation. Motif Search panel: sequence
database (primary data), knowledge acquisition, motif library
(empirical rules), expert knowledge, new sequence, inference, sequence
interpretation.]

Search and learning problems in sequence analysis

(Whole genome)

Gene Expression Analysis


Quantitative Analysis of Gene Activities
(Transcription Profiles)


[Figure: gene expression measurement with a GeneChip expression
analysis probe array. Labels: biotinylated RNA from experiment; each
probe cell contains millions of copies of a specific oligonucleotide
probe; streptavidin-phycoerythrin conjugate; image of hybridized probe
array.]

(Sub)cellular inhomogeneity (see figure)

Cell-cycle differences in expression.

XIST RNA localized on inactive X-chromosome

Cluster Analysis

Protein/protein complex

Genes

DNA regulatory elements

Functional Analysis via Gene Expression

Pairwise Measures

Clustering

Motif Searching/...

Clustering Algorithms

A clustering algorithm attempts to find natural groups of components
(or data) based on some measure of similarity. It also finds the
centroid of each group. To determine cluster membership, most
algorithms evaluate the distance between a point and the cluster
centroids. The output of a clustering algorithm is essentially a
statistical description of the cluster centroids, together with the
number of components in each cluster.
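
As a rough illustration of membership by nearest centroid and of the summary output described above, here is a small sketch. The centroids and points are made-up values, and Euclidean distance is an assumption.

```python
import math
from collections import Counter


def nearest_centroid(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))


def summarize(points, centroids):
    """Number of points assigned to each centroid, in centroid order."""
    counts = Counter(nearest_centroid(p, centroids) for p in points)
    return [counts.get(j, 0) for j in range(len(centroids))]


# Hypothetical centroids and data, for illustration only.
centroids = [(0.0, 0.0), (5.0, 5.0)]
points = [(0.5, 0.2), (4.8, 5.1), (5.5, 4.9), (-0.3, 0.1), (6.0, 6.0)]
print(summarize(points, centroids))
```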

Clusters of Two-Dimensional Data

Key Terms in Cluster Analysis


Distance & Similarity measures


Hierarchical & non-hierarchical


Single/complete/average linkage


Dendrograms & ordering

Distance Measures: Minkowski Metric

Most Common Minkowski Metrics

An Example

[Figure: two points x and y, separated by 4 along one axis and 3 along
the other.]


Manhattan distance is called Hamming distance when all features are
binary.

Gene Expression Levels Under 17 Conditions (1-High, 0-Low)
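
The formulas on these slides did not survive extraction. The sketch below assumes the standard definition of the Minkowski metric of order r: the r-th root of the sum of |x_i - y_i|^r, where r = 1 gives Manhattan distance, r = 2 gives Euclidean distance, and the limit r to infinity gives the sup (maximum) distance. The two 17-condition binary profiles are hypothetical stand-ins for the lost table.

```python
import math


def minkowski(x, y, r):
    """Minkowski distance of order r; r = math.inf gives the sup (max) distance."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(r):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1.0 / r)


# Two points that differ by 4 on one axis and 3 on the other.
x, y = (0, 0), (4, 3)
print(minkowski(x, y, 1))         # Manhattan: 4 + 3
print(minkowski(x, y, 2))         # Euclidean: sqrt(4^2 + 3^2)
print(minkowski(x, y, math.inf))  # sup: max(4, 3)

# Hypothetical high/low (1/0) expression profiles under 17 conditions.
gene_a = [0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1]
gene_b = [0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1]
# On binary features the Manhattan distance counts the differing positions,
# which is the Hamming distance.
print(minkowski(gene_a, gene_b, 1))
```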

Similarity Measures: Correlation Coefficient

[Figure: three plots of expression level against time, each comparing
Gene A and Gene B.]
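
The correlation formula itself is missing from the extracted slides. The sketch below assumes the usual Pearson correlation coefficient, and the expression profiles are invented for illustration. It shows why correlation can be preferable to a distance for expression profiles: two genes with the same shape but different magnitude correlate perfectly even though they are far apart in Euclidean terms.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (sx * sy)


# Hypothetical expression levels over five time points.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [12.0, 14.0, 16.0, 18.0, 20.0]  # same trend, different scale and offset
gene_c = [5.0, 4.0, 3.0, 2.0, 1.0]       # opposite trend

print(pearson(gene_a, gene_b))    # same shape: correlation close to +1
print(pearson(gene_a, gene_c))    # opposite shape: correlation close to -1
print(math.dist(gene_a, gene_b))  # yet A and B are far apart in Euclidean terms
```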

Distance-based Clustering


Assign a distance measure between data


Find a partition such that:


Distance between objects within partition (i.e. same cluster) is
minimized


Distance between objects from different clusters is maximised


Issues:


Requires defining a distance (similarity) measure in situations
where it is unclear how to assign one


What relative weighting to give to one attribute vs another?


The number of possible partitions is super-exponential
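
To give a feel for how fast the number of partitions grows, the sketch below counts them with Stirling numbers of the second kind (partitions of n items into exactly k non-empty clusters) and Bell numbers (partitions into any number of clusters). These are standard combinatorial quantities; the slides do not name them, so treat the connection as an illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of ways to partition n labelled items into k non-empty clusters."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    # Item n either joins one of the k existing clusters or starts its own.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


def bell(n):
    """Number of partitions of n labelled items into any number of clusters."""
    return sum(stirling2(n, k) for k in range(n + 1))


for n in (5, 10, 20, 40):
    print(n, stirling2(n, 3), bell(n))
```

Exhaustive search over partitions is therefore out of reach even for modest n, which is why heuristic procedures are used.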

Normalized Expression Data

hierarchical & non-

Hierarchical Clustering Techniques

Hierarchical Clustering

Given a set of N items to be clustered, and an NxN distance (or
similarity) matrix, the basic process of hierarchical clustering is this:



1. Start by assigning each item to its own cluster, so that if you have N items,
you now have N clusters, each containing just one item. Let the distances
(similarities) between the clusters equal the distances (similarities) between the
items they contain.



2. Find the closest (most similar) pair of clusters and merge them into a single
cluster, so that now you have one less cluster.



3. Compute distances (similarities) between the new cluster and each of the old
clusters.



4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
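
The four steps can be sketched as follows. This is a minimal, unoptimized version that recomputes cluster distances from the item distances on demand. The `linkage` argument decides how cross-cluster item distances are reduced to one number (`min` or `max` here), and the four-item distance matrix is hypothetical.

```python
def agglomerate(items, dist, linkage=min):
    """Merge clusters bottom-up; returns the merges as (cluster_a, cluster_b, distance).

    `dist` maps an item pair (p, q) to its distance (either order may be stored).
    `linkage` reduces all cross-cluster item distances to a single value.
    """
    def d(p, q):
        return dist[(p, q)] if (p, q) in dist else dist[(q, p)]

    def cluster_dist(a, b):
        return linkage(d(p, q) for p in a for q in b)

    clusters = [(item,) for item in items]  # step 1: one cluster per item
    merges = []
    while len(clusters) > 1:                # step 4: repeat until one cluster remains
        # Step 2: find the closest pair of clusters.
        a, b = min(
            ((x, y) for i, x in enumerate(clusters) for y in clusters[i + 1:]),
            key=lambda pair: cluster_dist(*pair),
        )
        merges.append((a, b, cluster_dist(a, b)))
        # Step 3: distances to the merged cluster are recomputed by cluster_dist
        # on the next pass, so only the cluster list needs updating here.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical distance matrix on four items.
dist = {("a", "b"): 2, ("a", "c"): 5, ("a", "d"): 6,
        ("b", "c"): 3, ("b", "d"): 5, ("c", "d"): 4}

for a, b, height in agglomerate("abcd", dist, linkage=min):
    print("nearest-member rule:", a, b, height)
for a, b, height in agglomerate("abcd", dist, linkage=max):
    print("furthest-member rule:", a, b, height)
```

With these made-up distances the two rules agree on the first merge and then diverge, which is the kind of difference a dendrogram comparison makes visible.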

The distance between two clusters is defined as the distance between:

Single-Link Method / Nearest Neighbor: their closest members.

Complete-Link Method / Furthest Neighbor: their furthest members.

Centroid: their centroids.

Average: the average of all cross-cluster pairs.

Computing Distances

single-link clustering (also called the connectedness or minimum
method): we consider the distance between one cluster and another cluster to be
equal to the shortest distance from any member of one cluster to any member of
the other cluster. If the data consist of similarities, we consider the similarity
between one cluster and another cluster to be equal to the greatest similarity from
any member of one cluster to any member of the other cluster.

complete-link clustering (also called the diameter or maximum
method): we consider the distance between one cluster and another cluster to be
equal to the longest distance from any member of one cluster to any member of the
other cluster.

average-link clustering: we consider the distance between one cluster and
another cluster to be equal to the average distance from any member of one cluster
to any member of the other cluster.
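
A small sketch of the three rules on two clusters of 2-D points. The coordinates are made up and Euclidean distance between members is assumed.

```python
import math
from itertools import product


def cross_distances(A, B):
    """Distances between every member of A and every member of B."""
    return [math.dist(p, q) for p, q in product(A, B)]


def single_link(A, B):
    return min(cross_distances(A, B))    # shortest member-to-member distance


def complete_link(A, B):
    return max(cross_distances(A, B))    # longest member-to-member distance


def average_link(A, B):
    d = cross_distances(A, B)
    return sum(d) / len(d)               # mean over all cross-cluster pairs


# Hypothetical clusters.
A = [(0.0, 0.0), (0.0, 1.0)]
B = [(3.0, 0.0), (4.0, 0.0)]
print(single_link(A, B), average_link(A, B), complete_link(A, B))
```

For any pair of clusters the single-link value is at most the average-link value, which is at most the complete-link value.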


Single-Link Method

[Figure: worked example on items a, b, c, d using a Euclidean distance
matrix, updated in three steps. Clusters after each step: (1) a,b / c / d;
(2) a,b,c / d; (3) a,b,c,d.]

Complete-Link Method

[Figure: worked example on items a, b, c, d using a Euclidean distance
matrix, updated in three steps. Clusters after each step: (1) a,b / c / d;
(2) a,b / c,d; (3) a,b,c,d.]

Compare Dendrograms

[Figure: single-link and complete-link dendrograms side by side, with
a height axis running from 0 to 6.]

Ordered dendrograms

$2^{n-1}$ linear orderings of n elements
(n = # genes or conditions)


Maximizing adjacent similarity is impractical. So order by:


Average expression level,


Time of max induction, or


Chromosome positioning

Eisen98
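
The count of orderings follows from the tree shape: a binary dendrogram with n leaves has n-1 internal nodes, and each node's two branches can be drawn in either order. The sketch below enumerates the orderings for a small, hypothetical tree written as nested tuples, a representation chosen only for this illustration.

```python
def orderings(tree):
    """All leaf orders obtainable by flipping the two branches at internal nodes.

    A leaf is a string; an internal node is a 2-tuple (left, right).
    """
    if isinstance(tree, str):
        return [[tree]]
    left, right = tree
    result = []
    for l in orderings(left):
        for r in orderings(right):
            result.append(l + r)  # branches in the given order
            result.append(r + l)  # branches flipped
    return result


# Hypothetical dendrogram with n = 5 leaves (4 internal nodes).
tree = ((("g1", "g2"), "g3"), ("g4", "g5"))
found = orderings(tree)
print(len(found), 2 ** (5 - 1))
```

Enumeration like this is only feasible for tiny trees, which is why simple ordering criteria such as those listed above are used in practice.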

Which clustering methods do you suggest for
the following two-dimensional data?

Nadler and Smith, Pattern Recognition Engineering, 1993

Problems of Hierarchical Clustering


It is concerned more with the complete tree
structure than with the optimal number of
clusters.


There is no possibility of correcting for a
poor initial partition.


Similarity and distance measures rarely
have strict numerical significance.


Normalized Expression Data

Tavazoie et al. 1999 (http://arep.med.harvard.edu)

Non-hierarchical clustering

Clustering by K-means


Given a set S of N p-dimensional vectors without any prior knowledge about the
set, the K-means clustering algorithm forms K disjoint nonempty subsets such
that each subset minimizes some measure of dissimilarity locally. The algorithm
converges to a local optimum of the total dissimilarity over all subsets; a
globally optimal partition is not guaranteed.


The K-means algorithm has time complexity O(RKN), where K is the number of
desired clusters and R is the number of iterations needed to converge.


The Euclidean distance between the coordinates of any two genes in the space
is used as the metric, reflecting ignorance of a more biologically relevant
measure of distance. K-means is an unsupervised, iterative algorithm that
minimizes the within-cluster sum of squared distances from the cluster mean.


The first cluster center is chosen as the centroid of the entire data set, and
subsequent centers are chosen by finding the data point farthest from the centers
already chosen. 200-400 iterations.

K-Means Clustering Algorithm

1) Select an initial partition of k clusters

2) Assign each object to the cluster with the closest center

3) Compute the new center of each cluster as the mean of the objects
assigned to it

4) Repeat steps 2 and 3 until no object changes cluster
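
The steps can be sketched in plain Python as below. Two choices are assumptions rather than something the slides specify: "farthest from the centers already chosen" is read as the point whose nearest chosen center is most distant, and a cluster that ends up empty keeps its previous center. The data points are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def init_centers(data, k):
    """First center: centroid of all data; then repeatedly add the farthest point."""
    centers = [mean(data)]
    while len(centers) < k:
        far = max(data, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(far)
    return centers


def kmeans(data, k, max_iter=400):
    centers = init_centers(data, k)                      # step 1
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each object to the cluster with the closest center.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in data]
        if new_labels == labels:                         # step 4: stop when nothing moves
            break
        labels = new_labels
        # Step 3: recompute each center as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:                                  # empty cluster keeps its old center
                centers[j] = mean(members)
    return labels, centers


# Hypothetical two-dimensional data with two obvious groups.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(data, 2)
print(labels)
print(centers)
```

Because the result depends on the starting centers, different initializations can end in different local optima.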


Representation of expression data

[Figure: normalized expression data from microarrays shown as a matrix
with rows Gene 1 to Gene N and columns T1, T2, T3, and as points
(Gene 1, Gene 2) on axes Time-point 1 and Time-point 3, with d_ij
marked.]
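
The slides do not say how the expression data were normalized. A common choice, assumed here, is to scale each gene's profile to mean 0 and standard deviation 1, so that genes are compared by the shape of their profile instead of its magnitude. The gene names and values below are hypothetical.

```python
import math


def normalize(profile):
    """Scale one gene's expression profile to mean 0 and standard deviation 1."""
    n = len(profile)
    mu = sum(profile) / n
    sd = math.sqrt(sum((v - mu) ** 2 for v in profile) / n)
    if sd == 0:
        raise ValueError("a constant profile cannot be normalized")
    return [(v - mu) / sd for v in profile]


# Hypothetical raw expression levels at time points T1, T2, T3.
expression = {
    "Gene 1": [120.0, 340.0, 560.0],
    "Gene 2": [8.0, 5.0, 2.0],
}
normalized = {gene: normalize(values) for gene, values in expression.items()}
for gene, values in normalized.items():
    print(gene, [round(v, 3) for v in values])
```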

Identifying prevalent expression patterns (gene clusters)

[Figure: genes plotted on axes Time-point 1 and Time-point 3, with
three panels of normalized expression against time-point (1, 2, 3),
one per gene cluster.]

Evaluate Cluster contents

[Figure: genes in each cluster against MIPS functional category:
Glycolysis, Nuclear Organization, Ribosome, Translation, Unknown.]
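
The slides do not state how cluster contents are scored against functional categories. One common approach, assumed here, is a hypergeometric tail probability: the chance of seeing at least k genes from a category in a cluster of n genes if the cluster were a random draw from the genome. All counts in the example are hypothetical.

```python
from math import comb


def enrichment_p(N, K, n, k):
    """P(X >= k): n genes drawn without replacement from N, of which K are in the category."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical counts: 6000 genes in total, 120 in the category,
# and a cluster of 100 genes containing 40 of them.
print(enrichment_p(N=6000, K=120, n=100, k=40))
```

A very small value suggests the category is over-represented in the cluster. When many clusters and categories are tested, the threshold should account for multiple comparisons.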