
Oct 23, 2013


OVERVIEW OF GENE CLUSTERING AND ALGORITHMIC METHODOLOGIES

Beth Benas

Rizwan Habib

Alexander Lowitt

Piyush Malve

CONTENTS

1 Overview of Gene Clustering: Biological Definitions

2 Algorithmic Methodologies

3 Software: Cluster 3 Program

4 Case Study / Article Overview

WHAT IS GENE CLUSTERING?

A gene cluster is two or more genes that code for the same or similar products

The original genes are duplicated by two different processes:

1) Homologous recombination

2) Transposition events




HOMOLOGOUS RECOMBINATION

Genetic recombination in which nucleotide sequences are exchanged between similar or identical strands of DNA

Involves breaking and rejoining strands of DNA

Occurs during meiosis, where it provides for more genetic variability

HOMOLOGOUS RECOMBINATION

*http://www.web-books.com/MoBio/Free/Ch8D1.htm

MISALIGNMENT DURING HOMOLOGOUS RECOMBINATION

http://jeb.biologists.org/cgi/reprint/203/6/1059.pdf

RETROTRANSPOSON

Transposons: mobile DNA

Sequences of DNA that are capable of moving to alternative positions within the genome of a single cell

“jumping genes”

Retrotransposon: a type of transposon able to become amplified within a genome

Relatively stable and tend to withstand natural selection

Thus, prevalent across generations


MUTATIONS IN DUPLICATED GENE

The second copy generated is free from selective pressure

The second copy can therefore mutate more quickly

The changes are not necessarily lasting



WHAT DOES ALL THIS MEAN?

Useful technique to group similar genetic code together

Relational understanding between homologous objects

Trends and patterns of gene expression

Functional relatedness

Phenotypic relatedness


WHAT IS GENE CLUSTERING?

Presume:

The genome is a 2D Cartesian space, like a sheet of graph paper

Genes are now points on this graph paper

Let's see how many lines and hyperbolas there are

Gene clustering is the process of assigning two or more genes that encode the same or similar products to a “gene cluster”

As populations from a common ancestor tend to possess the same varieties of gene clusters, gene clusters are useful for tracing recent evolutionary history.

An example of a gene cluster is the human β-globin gene cluster, which contains five functional genes and one non-functional gene for similar proteins.

Every hemoglobin molecule contains two identical proteins from this gene cluster; which protein depends on the molecule's specific role.




HIERARCHICAL CLUSTERING

Allows the organization of the clustered data to be represented as a tree (dendrogram)

Agglomerative (bottom up): each observation starts as its own cluster. Clusters are merged based on similarity

Divisive (top down): all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

In general, merges and splits in the tree are determined in a greedy manner.




HIERARCHICAL CLUSTERING

[Figure: divisive vs. agglomerative clustering]

HIERARCHICAL CLUSTERING

A measure of dissimilarity between sets of observations is required to decide which clusters to combine or divide.

This is achieved by use of an appropriate metric (a measure of distance between pairs of observations), and a linkage criterion, which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets.

HIERARCHICAL CLUSTERING

The choice of metric influences the shape of the clusters, as some elements may be close to one another according to one distance and farther apart according to another.

The linkage criterion determines the distance between sets of observations as a function of the pairwise distances between observations.



ADVANTAGE

Hierarchical clustering has the distinct advantage that any valid measure of distance can be used. In fact, the observations themselves are not required: all that is used is a matrix of distances.
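The point that only a distance matrix is needed can be illustrated with a minimal sketch of agglomerative clustering. The function name `agglomerative`, the choice of complete linkage as the default, and the four-gene distance matrix are assumptions made for illustration; they are not taken from the slides or from Cluster 3.0.

```python
from itertools import combinations

def agglomerative(dist, linkage=max):
    """Merge clusters bottom-up using only a matrix of pairwise distances.

    dist    -- symmetric matrix (list of lists) of distances between observations
    linkage -- max gives complete linkage, min gives single linkage
    Returns the merges in order as (cluster_a, cluster_b, linkage_distance).
    """
    def set_distance(a, b):
        # Linkage criterion: distance between two sets of observations.
        return linkage(dist[i][j] for i in a for j in b)

    clusters = [(i,) for i in range(len(dist))]  # every observation starts alone
    merges = []
    while len(clusters) > 1:
        # Greedy step: merge the two clusters that are currently closest.
        a, b = min(combinations(clusters, 2), key=lambda pair: set_distance(*pair))
        merges.append((a, b, set_distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical distances between four genes: 0 and 1 are close, 2 and 3 are close.
D = [
    [0.0, 1.0, 6.0, 7.0],
    [1.0, 0.0, 5.0, 6.0],
    [6.0, 5.0, 0.0, 2.0],
    [7.0, 6.0, 2.0, 0.0],
]
merges = agglomerative(D)
for a, b, d in merges:
    print(a, "+", b, "at distance", d)
```

The list of merges, read in order, is the dendrogram: early merges are short branches, late merges are long ones. Passing `linkage=min` switches to single linkage without touching the rest of the code.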

K-MEANS CLUSTERING

k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.

It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that they both attempt to find the centers of natural clusters in the data.

K-MEANS CLUSTERING

Regarding computational complexity, the k-means clustering problem is:

NP-hard in general Euclidean space of dimension d, even for 2 clusters.

NP-hard for a general number of clusters k, even in the plane.

If k and d are fixed, the problem can be solved exactly in time $O(n^{dk+1} \log n)$, where n is the number of entities to be clustered.

Thus, a variety of heuristic algorithms are generally used.


K-MEANS CLUSTERING

Heuristic algorithm: no guarantee that it will converge to the global optimum

The algorithm is usually very fast, so it is common to run it multiple times with different starting conditions.

It has been shown that there exist certain point sets on which k-means takes superpolynomial time, $2^{\Omega(\sqrt{n})}$, to converge.
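A minimal sketch of the usual heuristic (Lloyd's algorithm) with random restarts is given below, keeping the run with the lowest within-cluster sum of squares. The function names, the restart count, the seed and the six sample points are all assumptions chosen for illustration.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def lloyd(points, k, rng, max_iter=100):
    """One run of Lloyd's heuristic from a random starting condition."""
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest mean.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [mean(g) if g else centers[c] for c, g in enumerate(groups)]
        if new_centers == centers:
            break  # converged, possibly only to a local optimum
        centers = new_centers
    cost = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, cost

def kmeans(points, k, restarts=10, seed=0):
    """Run Lloyd's heuristic several times and keep the lowest-cost result."""
    rng = random.Random(seed)
    return min((lloyd(points, k, rng) for _ in range(restarts)), key=lambda r: r[1])

# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, cost = kmeans(points, k=2)
print(centers, cost)
```

Restarting is cheap because each run is fast; it does not remove the dependence on the choice of k, which still has to be supplied by the user.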

K-MEANS CLUSTERING

Two key features make k-means efficient:

The number of clusters k is an input parameter: an inappropriate choice of k may yield poor results.

Euclidean distance is used as a metric and variance is used as a measure of cluster scatter.

These features are often regarded as its biggest drawbacks.

APPLICATIONS OF K-MEANS

Image segmentation

The k-means clustering algorithm is commonly used in computer vision as a form of image segmentation.

The results of the segmentation are used to aid border detection and object recognition.

Standard Euclidean distance is usually insufficient in forming the clusters.

Instead, a weighted distance measure utilizing pixel coordinates, RGB pixel color and/or intensity, and image texture is commonly used.


SOM

A self-organizing map (SOM) is a learning method which produces low-dimensional data (e.g. 2D) from high-dimensional data (nD)

E.g. an apple differs from a banana in more than two ways, but they can be differentiated based on their size and color only.

If we represent apples and bananas as points and similarity as lines, then:

Two points connected by a shorter line are of the same kind

Two points connected by a longer line are of different kinds

Shorter line = line with length less than threshold t

Longer line = line with length greater than threshold t

We just created a map to differentiate an apple from a banana based on two traits only.

We have successfully “trained” the SOM; now anyone can use it to “map” apples from bananas and vice versa

DEMO for SOM Training

DEMO for SOM Mapping
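The training and mapping steps can be sketched with a very small SOM: a line of two nodes trained on two traits per fruit. Everything specific here is an assumption for illustration: the (size, redness) feature values, the initialisation from training samples, and the linear decay of learning rate and neighbourhood radius. It is not the demo shown in the talk.

```python
import math

def best_matching_unit(weights, x):
    """Mapping step: index of the node whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda j: math.dist(weights[j], x))

def train_som(samples, n_nodes=2, epochs=50):
    """Train a 1-D SOM (a line of n_nodes >= 2 nodes) on feature vectors."""
    # Assumption: initialise the nodes from evenly spaced training samples.
    step = (len(samples) - 1) / (n_nodes - 1)
    weights = [list(samples[round(i * step)]) for i in range(n_nodes)]
    for epoch in range(epochs):
        decay = 1 - epoch / epochs          # shrinks towards 0 over training
        rate = 0.5 * decay                  # learning rate
        radius = (n_nodes / 2) * decay      # neighbourhood radius on the node line
        for x in samples:
            bmu = best_matching_unit(weights, x)
            for j, w in enumerate(weights):
                # Nodes near the winner on the map are pulled towards x as well.
                influence = math.exp(-((j - bmu) ** 2) / (2 * radius ** 2))
                for d in range(len(w)):
                    w[d] += rate * influence * (x[d] - w[d])
    return weights

# Hypothetical (size, redness) measurements scaled to the range 0..1.
apples = [(0.10, 0.90), (0.15, 0.85), (0.12, 0.95)]
bananas = [(0.80, 0.10), (0.85, 0.15), (0.90, 0.10)]

weights = train_som(apples + bananas)
apple_nodes = {best_matching_unit(weights, a) for a in apples}
banana_nodes = {best_matching_unit(weights, b) for b in bananas}
print("apples map to node(s)", apple_nodes, "bananas map to node(s)", banana_nodes)
```

Once trained, `best_matching_unit(weights, new_fruit)` maps an unseen fruit to a node, which is the "mapping" use described above.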


APPLICATION OF SOM

Genome clustering

Goal: understand the phylogenetic relationship between different genomes.

Compute: bootstrap support of individual genomes for different phylogenetic tree topologies, then cluster based on the topology support.

Clustering proteins based on the architecture of their activation loops

Align the proteins under investigation

Extract the functional centers

Turn the 3D representation into 1D feature vectors

Cluster based on the feature vectors

PCA

Principal component analysis (PCA) is a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components

A dimension reduction technique; related to, but not the same as, independent component analysis (ICA)

SOM and PCA are related (SOM can be viewed as a non-linear generalization of PCA)

PCA decomposes complex data relationships into simple components and then represents all data in terms of these simple components

SOM is more efficient than PCA, but PCA is more versatile.
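As a rough sketch of the idea, the first principal component can be found by power iteration on the covariance matrix. The function name, the iteration count and the three perfectly correlated variables below are assumptions for illustration; real analyses would normally use a linear algebra library.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, scores) of the first principal component.

    rows -- list of observations, each a list of variable values
    """
    n, dim = len(rows), len(rows[0])
    # Center each variable on its mean.
    means = [sum(r[d] for r in rows) / n for d in range(dim)]
    centered = [[r[d] - means[d] for d in range(dim)] for r in rows]
    # Sample covariance matrix of the variables.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dim)]
           for i in range(dim)]
    # Power iteration: repeated multiplication converges to the top eigenvector.
    v = [1.0] * dim
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project every observation onto the component: one number replaces three.
    scores = [sum(c[d] * v[d] for d in range(dim)) for c in centered]
    return v, scores

# Hypothetical data: three correlated variables with X2 = 2*X1 and X3 = -X1.
rows = [[x, 2 * x, -x] for x in (1.0, 2.0, 3.0, 4.0, 5.0)]
direction, scores = first_principal_component(rows)
print(direction)
print(scores)
```

Because the three variables here move together, a single component captures all of the variation, which is the sense in which PCA reduces dimension.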

PCA EXAMPLE

Suppose three entities X1, X2 and X3 act together to define a process, i.e. their graph will have three dimensions




A 3D PLOT

APPLY PCA

It is hard to guess the relationships

X1 vs. X2

X2 vs. X3

X3 vs. X1

PCA can transform this 3D graph into four 2D graphs to reveal the individual relationships among the three Xi.

One 3D graph = four 2D graphs

ONE OF THE FOUR 2D GRAPHS


CLUSTER 3.0

Implements the most commonly used clustering methods for gene expression data analysis

Provides a computational and graphical environment for analyzing data from DNA microarray experiments, or other genomic datasets

Data_set.txt => Cluster 3.0 => cluster_output.txt

cluster_output.txt => TreeView => Visualization

Cluster 3.0 and TreeView are both open source, and sample data is also provided to play around with.






LOADING FILE

Rows are genes

Columns are samples (BLUE)

YORF (yeast open reading frame) is used by TreeView to specify how rows are linked to external websites

The table is stored as a tab-delimited file so that Cluster can read it

FILTER DATA

The Filtering tab allows you to remove genes that do not have certain desired properties from your dataset

% Present >= X. This removes all genes that have missing values in greater than (100 - X) percent of the columns.

SD (Gene Vector) >= X. This removes all genes that have standard deviations of observed values less than X.

At least X Observations with abs(Val) >= Y. This removes all genes that do not have at least X observations with absolute values greater than Y.

MaxVal - MinVal >= X. This removes all genes whose maximum minus minimum values are less than X.
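The four filters can be sketched as a single predicate applied to each gene row. This is only an illustration of the rules as worded above, not Cluster 3.0's code: the function name, the parameter names, the use of `None` for missing values and the use of the sample standard deviation are all assumptions.

```python
import statistics

def keep_gene(values, pct_present=80, min_sd=None, min_obs=None, max_min=None):
    """Return True if one gene (row) passes every enabled filter.

    values      -- expression values for the gene; None marks a missing value
    pct_present -- X in "% Present >= X"
    min_sd      -- X in "SD (Gene Vector) >= X" (None disables the filter)
    min_obs     -- (X, Y) in "At least X Observations with abs(Val) >= Y"
    max_min     -- X in "MaxVal - MinVal >= X"
    """
    observed = [v for v in values if v is not None]
    # % Present: share of columns that actually hold a value.
    if 100 * len(observed) / len(values) < pct_present:
        return False
    # SD: assumed here to be the sample standard deviation of observed values.
    if min_sd is not None and (len(observed) < 2 or statistics.stdev(observed) < min_sd):
        return False
    # At least X observations with absolute value of at least Y.
    if min_obs is not None:
        count, threshold = min_obs
        if sum(abs(v) >= threshold for v in observed) < count:
            return False
    # Range between the largest and smallest observed value.
    if max_min is not None and (not observed or max(observed) - min(observed) < max_min):
        return False
    return True

# Hypothetical log-ratio rows for three genes across four samples.
genes = {
    "flat": [0.1, 0.0, -0.1, 0.1],
    "sparse": [2.0, None, None, None],
    "variable": [2.5, -2.1, 0.3, None],
}
kept = [name for name, vals in genes.items()
        if keep_gene(vals, pct_present=75, min_sd=1.0, min_obs=(2, 2.0), max_min=2.0)]
print(kept)
```

Only the gene that is both well measured and clearly changing survives, which is the purpose of filtering before clustering.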


ADJUSTING DATA

Cluster allows you to perform a number of operations that alter the underlying data in the imported file

Log Transform Data: replace all data values x by log2(x). Why?

Center genes [mean or median]: Subtract the row-wise mean or median from the values in each row of data, so that the mean or median value of each row is 0.

Center arrays [mean or median]: Subtract the column-wise mean or median from the values in each column of data, so that the mean or median value of each column is 0.

Normalize genes: Multiply all values in each row of data by a scale factor S so that the sum of the squares of the values in each row is 1.0 (a separate S is computed for each row).

Normalize arrays: Multiply all values in each column of data by a scale factor S so that the sum of the squares of the values in each column is 1.0 (a separate S is computed for each column).

These operations do not commute, so the order in which they are applied is very important

Log transforming centered genes is not the same as centering log-transformed genes.
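A short sketch of the row-wise operations shows why the order matters. The helper names and the example ratios are assumptions for illustration.

```python
import math
from statistics import mean

def log_transform(row):
    """Replace every value x by log2(x)."""
    return [math.log2(x) for x in row]

def center(row):
    """Subtract the row mean so that the mean of the row is 0."""
    m = mean(row)
    return [x - m for x in row]

def normalize(row):
    """Scale the row so that the sum of squares of its values is 1.0."""
    s = 1 / math.sqrt(sum(x * x for x in row))
    return [x * s for x in row]

ratios = [1.0, 2.0, 4.0]  # hypothetical fluorescence ratios for one gene

# Centering log-transformed values works as intended.
centered_logs = center(log_transform(ratios))

# The reverse order is not even defined here: centering the raw ratios first
# produces negative values, and log2 of a negative number raises ValueError.

# Centering and normalizing also give different results in different orders.
center_then_normalize = normalize(center(ratios))
normalize_then_center = center(normalize(ratios))

print(centered_logs)
print(sum(v * v for v in center_then_normalize))
print(sum(v * v for v in normalize_then_center))
```

Only `center_then_normalize` ends with both a zero mean and a unit sum of squares; centering after normalizing destroys the unit length again.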

LOG TRANSFORMATION

Experiment: analyzing gene expression data from a DNA microarray as fluorescent ratios

We are looking at gene expression over time

Results are expression levels relative to time 0

Time 0: base time

Time 1: gene is unchanged

Time 2: gene is up-regulated 2-fold

Time 3: gene is down-regulated 2-fold

“Is 2-fold up the same magnitude of change as 2-fold down but just in the opposite direction?”

If yes, then log transform the sample data

If no, then use the data as it is
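The time-course example can be worked through numerically. The ratios 1.0, 2.0 and 0.5 follow from the description above (unchanged, 2-fold up, 2-fold down); the variable names are chosen for illustration.

```python
import math

# Expression ratios relative to time 0.
ratios = {"time 1": 1.0, "time 2": 2.0, "time 3": 0.5}

# On the raw scale, 2-fold up and 2-fold down look like changes of different size.
raw_change = {t: r - 1.0 for t, r in ratios.items()}

# On the log2 scale they are equal in magnitude and opposite in sign.
log_change = {t: math.log2(r) for t, r in ratios.items()}

print(raw_change)  # {'time 1': 0.0, 'time 2': 1.0, 'time 3': -0.5}
print(log_change)  # {'time 1': 0.0, 'time 2': 1.0, 'time 3': -1.0}
```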


MEAN/MEDIAN CENTERING

Experiment: analyzing a large number of tumor samples, all compared to a common reference sample made from a collection of cell lines.

For each gene, you have a series of ratio values that are relative to the expression level of that gene in the reference sample.

Since the reference sample really has nothing to do with your experiment, you want your analysis to be independent of the amount of a gene present in the reference sample.

“Is the reference sample unrelated to the experimental samples, i.e. should the analysis be independent of the amount of a gene present in the reference sample?”

If yes, then use centering

If no, then work with raw data

Median centering is preferred over mean centering, as the median is more robust to outliers
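A small sketch of why the choice between median and mean matters when one sample is an outlier. The log-ratio values and the helper name are assumptions for illustration.

```python
from statistics import mean, median

def center(row, stat=median):
    """Subtract the row's median (or another statistic) from every value."""
    m = stat(row)
    return [x - m for x in row]

# Hypothetical log ratios for one gene across five tumor samples; the last is an outlier.
log_ratios = [-0.2, 0.0, 0.1, 0.3, 6.0]

by_median = center(log_ratios)
by_mean = center(log_ratios, stat=mean)

# Count how many samples still look "close to typical" after centering.
near_zero_median = sum(abs(v) < 0.5 for v in by_median)
near_zero_mean = sum(abs(v) < 0.5 for v in by_mean)
print(near_zero_median, near_zero_mean)
```

With median centering the four ordinary samples stay near zero and only the outlier stands out; with mean centering the single outlier shifts every value.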



DISTANCE/SIMILARITY MEASURE

“Is the graph on the left the same as the graph on the right?”

The Pearson correlation coefficient says they are similar, i.e. it treats x, 2x and 2x + y (a scaled and shifted copy) as the same. Cluster 3.0 also offers Spearman rank correlation and Kendall's τ.

Euclidean distance says they are not similar, i.e. x != 2x.

Pearson measures only similarity in shape, while Euclidean distance also measures differences in magnitude.
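The contrast between the two measures can be checked on a small profile and its scaled and shifted copies. The function names and the example values (including the shift of 5) are assumptions for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def euclidean(x, y):
    """Euclidean distance between two equally long profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x = [1.0, 2.0, 3.0, 4.0]            # hypothetical expression profile
doubled = [2 * v for v in x]        # same shape, twice the amplitude
shifted = [2 * v + 5 for v in x]    # same shape, scaled and offset

print(pearson(x, doubled), pearson(x, shifted))      # both approximately 1
print(euclidean(x, doubled), euclidean(x, shifted))  # both clearly non-zero
```

Which measure is appropriate depends on whether two genes with the same pattern but different amplitude should count as similar.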



ARTICLE


“Systematic Variation in Gene Expression
Patterns in Human Cancer Cell Lines”


2000: Nature America, Inc.; Princeton
University


http://genetics.nature.com


Primary authors


Douglas Ross, Uwe Scherf, Michael Eisen




BACKGROUND

Cell lines from human tumors: used for many years as experimental models of neoplasia or neoplastic disease

60 cancer cell lines: National Cancer Institute’s Developmental Therapeutics Program (DTP)

DNA microarrays to show variation in the prevalence of transcripts

Comparing RNA from:

Two breast cancer biopsy samples

Sample of normal breast tissue

NCI60 cell lines derived from breast cancers (excluding MDA-MB-435 and MDA-N)

Leukaemias

Pattern shared between the cancer specimens and individual cell lines derived from breast cancers and leukaemias


BACKGROUND

cDNA microarrays were used to explore variation in 8,000 different genes across 60 cell lines

National Cancer Institute: screen for anti-cancer drugs

Purpose:

Show phenotypic variation: cell reproduction rate, drug metabolism

Location of tumors

To compare gene expression patterns in cell lines with those of normal breast tissue or tumor samples within breast tissue

Clustering to look at outliers that would validate or dismiss previous classification efforts


CLUSTERING IN ACTION

Process: build a table with genes as rows and microarray hybridizations as columns

Normalized fluorescence ratios from the database

Subtraction of local background

Established specific criteria to select a subset of the 9,703 cDNA elements from the arrays

Centered data by subtracting the arithmetic mean of all ratios measured

log2 (ratio) > 2.8

Centering makes all future analysis independent of the amount of mRNA in the reference pool


CLUSTERING IN ACTION

Display representing microarray hybridizations and genes

Normalized the data and converted the quantitative data to a color gradient

Each color represents the mean-adjusted expression level of the gene in the cell line



CLUSTERING IN ACTION

Hierarchical clustering algorithm

Pearson correlation coefficient: compares similarity in expression pattern while ignoring differences in the magnitude of variation across cell lines

Similar expression is characterized by short branches, and longer branches denote dissimilarities

CLUSTERING IN ACTION

Dendrogram: gene expression patterns of cell lines, grouped by tissue of origin

Cell lines derived from leukaemia, melanoma, central nervous system, colon, renal and ovarian tissue.

CONCLUSIONS

cDNAs provided 8,000 genes: only 3,700 represented previously classified human proteins

1,900 had homologues in other organisms and 2,400 were identified via ESTs

Estimated that 80% of the genes were correctly identified

Able to analyze intact tumors within their specific microenvironment

Dendrograms provide the possibility of an improved taxonomy of cancer

Helpful to explain heterogeneity of breast cancer

Possibility of individual treatment regimens (personalized medicine)



Thank You!