OVERVIEW OF GENE CLUSTERING AND ALGORITHMIC METHODOLOGIES
Beth Benas, Rizwan Habib, Alexander Lowitt, Piyush Malve
CONTENTS
1 • Overview of Gene Clustering: Biological Definitions
2 • Algorithmic Methodologies
3 • Software
– Cluster 3 Program
4 • Case Study / Article Overview
WHAT IS GENE CLUSTERING?
• A gene cluster is two or more genes that code for the same or similar products
• Clusters arise when an original gene is duplicated by one of two processes:
1) Homologous recombination
2) Transposition events
HOMOLOGOUS RECOMBINATION
• Genetic recombination in which nucleotide sequences are exchanged between similar or identical strands of DNA
• Involves breaking and rejoining strands of DNA
• Occurs during meiosis, where it provides for more genetic variability
HOMOLOGOUS RECOMBINATION
[Figure: homologous recombination. Source: http://www.web-books.com/MoBio/Free/Ch8D1.htm]
MISALIGNMENT DURING HOMOLOGOUS RECOMBINATION
[Figure: misalignment during homologous recombination. Source: http://jeb.biologists.org/cgi/reprint/203/6/1059.pdf]
RETROTRANSPOSON
• Transposons: mobile DNA
– Sequences of DNA that are capable of moving to alternative positions within the genome of a single cell
– Also called “jumping genes”
• Retrotransposon: a type of transposon able to become amplified within a genome
– Relatively stable and tend to withstand natural selection
– Thus prevalent across generations
MUTATIONS IN DUPLICATED GENE
• The second copy generated is free from selective pressure
• The second copy can therefore mutate more quickly
• The changes are not necessarily lasting
WHAT DOES ALL THIS MEAN?
• Clustering is a useful technique for grouping similar genetic code together
• Relational understanding between homologous objects
• Trends / patterns of genetic expression
• Functional relatedness
• Phenotypic relatedness
WHAT IS GENE CLUSTERING?
• Presume
– The genome is a 2D Cartesian space, like a sheet of graph paper
– Genes are points on this graph paper
– Let's see how many lines and hyperbolas there are
• Gene clustering is the process of assigning two or more genes that encode the same or similar products to a “gene cluster”
• Because populations descended from a common ancestor tend to possess the same varieties of gene clusters, the clusters are useful for tracing recent evolutionary history.
• An example of a gene cluster is the human β-globin gene cluster, which contains five functional genes and one non-functional gene for similar proteins.
– All hemoglobin molecules contain two identical proteins from this gene cluster; which ones depends on their specific role.
HIERARCHICAL CLUSTERING
• Allows the clustering to be represented as a tree (dendrogram)
• Agglomerative (bottom up): each observation starts as its own cluster, and clusters are merged based on similarity
• Divisive (top down): all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy
• In general, the merges and splits in the tree are determined in a greedy manner
HIERARCHICAL CLUSTERING
[Figure: divisive versus agglomerative clustering]
HIERARCHICAL CLUSTERING
• A measure of dissimilarity between sets of observations is required for combining and dividing clusters.
• This is achieved by use of an appropriate metric (a measure of distance between pairs of observations) and a linkage criterion, which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets.
HIERARCHICAL CLUSTERING
• The choice of an appropriate metric will influence the shape of the clusters, as some elements may be close to one another according to one distance and farther apart according to another.
• The linkage criterion determines the distance between sets of observations as a function of the pairwise distances between observations.
ADVANTAGE
• Hierarchical clustering has the distinct advantage that any valid measure of distance can be used. In fact, the observations themselves are not required: all that is used is a matrix of distances.
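The point that only a matrix of distances is needed can be illustrated with a minimal sketch of agglomerative clustering. The function name `agglomerative`, the `linkage` parameter and the four-gene distance matrix are hypothetical, chosen only for illustration; this is not the implementation used by any particular package.

```python
def agglomerative(dist, linkage=min):
    """Merge clusters bottom-up using only a symmetric distance matrix.

    dist    -- square list of lists, dist[i][j] = distance between items i and j
    linkage -- min for single linkage, max for complete linkage
    Returns the merge history as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [(i,) for i in range(len(dist))]  # every item starts as its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        # Greedy step: find the pair of clusters with the smallest linkage distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


# Hypothetical distances between four genes (0 and 1 are close, 2 and 3 are close).
D = [
    [0.0, 1.0, 6.0, 7.0],
    [1.0, 0.0, 5.0, 6.5],
    [6.0, 5.0, 0.0, 2.0],
    [7.0, 6.5, 2.0, 0.0],
]
history = agglomerative(D)
for left, right, d in history:
    print(left, "+", right, "at distance", d)
```

The merge history is the information a dendrogram draws: each merge becomes a branch point, and the merge distance becomes the branch height. Passing `linkage=max` switches the sketch from single to complete linkage without touching the observations themselves.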
K-MEANS CLUSTERING
• k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
• It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that they both attempt to find the centers of natural clusters in the data.
K-MEANS CLUSTERING
• Regarding computational complexity, the k-means clustering problem is:
– NP-hard in general Euclidean space of dimension d, even for 2 clusters.
– NP-hard for a general number of clusters k, even in the plane.
– If k and d are fixed, the problem can be solved exactly in time $O(n^{dk+1} \log n)$, where n is the number of entities to be clustered.
• Thus, a variety of heuristic algorithms are generally used.
K-MEANS CLUSTERING
• Heuristic algorithm: there is no guarantee that it will converge to the global optimum
• The algorithm is usually very fast, so it is common to run it multiple times with different starting conditions.
• It has been shown that there exist certain point sets on which k-means takes superpolynomial time, $2^{\Omega(\sqrt{n})}$, to converge.
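The dependence on starting conditions can be seen in a minimal sketch of the standard iterative heuristic (Lloyd's algorithm). The function names, the six data points and the two sets of starting centers are hypothetical and were chosen so that one start ends in a poor local optimum.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centers, iters=100):
    """Lloyd's heuristic: alternate assignment and mean-update steps."""
    centers = list(centers)
    k = len(centers)
    for _ in range(iters):
        # Assignment step: each point joins the cluster with the nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[c]
            for c, g in enumerate(groups)
        ]
        if new_centers == centers:  # assignments no longer change
            break
        centers = new_centers
    # Cost: total squared distance of each point to its nearest center.
    cost = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, cost


data = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2), (5.0, 5.0), (5.2, 4.8), (4.8, 5.2)]
bad = kmeans(data, [(1.0, 1.0), (1.2, 0.8)])    # both starting centers in one group
good = kmeans(data, [(1.0, 1.0), (5.0, 5.0)])   # one starting center per group
best = min([bad, good], key=lambda result: result[1])
print("bad start cost:", bad[1], "good start cost:", good[1])
```

Both runs converge, but only the second finds the two obvious groups. Keeping the lowest-cost result over several starts is the usual remedy.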
K-MEANS CLUSTERING
• Two key features of k-means underlie its efficiency:
– The number of clusters k is an input parameter: an inappropriate choice of k may yield poor results.
– Euclidean distance is used as the metric and variance is used as the measure of cluster scatter.
• These same features are often regarded as its biggest drawbacks.
APPLICATIONS OF K-MEANS
• Image segmentation
– The k-means clustering algorithm is commonly used in computer vision as a form of image segmentation.
– The results of the segmentation are used to aid border detection and object recognition.
• Standard Euclidean distance is usually insufficient for forming the clusters.
– Instead, a weighted distance measure utilizing pixel coordinates, RGB pixel color and/or intensity, and image texture is commonly used.
SOM
• A self-organizing map (SOM) is a learning method that produces a low-dimensional (e.g. 2D) representation of high-dimensional (nD) data
• E.g. an apple differs from a banana in more than two ways, but the two can be differentiated based on their size and color alone.
• If we represent apples and bananas as points and similarity as lines, then
– Two points connected by a shorter line are of the same kind
– Two points connected by a longer line are of different kinds
– Shorter line = line with length less than threshold t
– Longer line = line with length greater than threshold t
• We have just created a map that differentiates an apple from a banana based on two traits only.
• We have successfully “trained” the SOM; now anyone can use it to “map” apples apart from bananas and vice versa
• DEMO for SOM Training
• DEMO for SOM Mapping
APPLICATION OF SOM
• Genome clustering
– Goal: trying to understand the phylogenetic relationship between different genomes.
– Compute: bootstrap support of individual genomes for different phylogenetic tree topologies, then cluster based on the topology support.
• Clustering proteins based on the architecture of their activation loops
– Align the proteins under investigation
– Extract the functional centers
– Turn the 3D representation into 1D feature vectors
– Cluster based on the feature vectors
PCA
• Principal component analysis (PCA) is a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components
• It is a dimension reduction technique (related to, but distinct from, independent component analysis)
• SOM and PCA are related (a SOM can be viewed as a non-linear generalization of PCA)
• PCA decomposes complex relationships in the data into simple components and then represents all the data in terms of these components
• SOM is more efficient than PCA, but PCA is more versatile.
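As a rough illustration of how correlated variables collapse onto a principal component, the sketch below finds the first component of a small three-variable dataset by power iteration on the covariance matrix. Power iteration is one simple way to obtain the leading eigenvector and is an assumption of this sketch, not necessarily what any given package does. The function name and data are hypothetical.

```python
import math


def first_principal_component(rows, iters=200):
    """Return the unit direction of greatest variance, found by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered variables.
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    v = [1.0] * d  # arbitrary starting direction
    for _ in range(iters):
        # Multiply by the covariance matrix, then rescale to unit length.
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Hypothetical data: the second variable is exactly twice the first and the third
# is constant, so a single component captures all of the variation.
rows = [(1.0, 2.0, 5.0), (2.0, 4.0, 5.0), (3.0, 6.0, 5.0), (4.0, 8.0, 5.0)]
pc1 = first_principal_component(rows)
print(pc1)
```

Here the returned direction is proportional to (1, 2, 0): the two correlated variables are summarized by one component, and the constant variable contributes nothing.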
PCA EXAMPLE
• Suppose three entities X1, X2 and X3 act together to define a process, i.e. their graph has three dimensions
A 3D PLOT
[Figure: 3D plot]
APPLY PCA
• It is hard to guess the relationships
– X1 vs. X2
– X2 vs. X3
– X3 vs. X1
• PCA can transform this 3D graph into four 2D graphs to reveal the individual relationships among the three Xi.
• 1 3D = 4 2D
ONE OF THE FOUR 2Ds
[Figure: one of the four 2D plots]
CLUSTER 3.0
• Implements the most commonly used clustering methods for gene expression data analysis
• Provides a computational and graphical environment for analyzing data from DNA microarray experiments or other genomic datasets
• Data_set.txt => Cluster 3.0 => cluster_output.txt
• cluster_output.txt => TreeView => Visualization
• Cluster 3.0 and TreeView are both open source, and sample data is provided to experiment with.
LOADING FILE
• Rows are genes
• Columns are samples (BLUE)
• YORF (yeast open reading frame) is used by TreeView to specify how rows are linked to external websites
• The table is stored as a tab-delimited file so that Cluster can read it
FILTER DATA
• The Filtering tab allows you to remove genes that do not have certain desired properties from your dataset
– % Present >= X. This removes all genes that have missing values in more than (100 - X) percent of the columns.
– SD (Gene Vector) >= X. This removes all genes whose observed values have a standard deviation less than X.
– At least X Observations with abs(Val) >= Y. This removes all genes that do not have at least X observations with absolute values of at least Y.
– MaxVal - MinVal >= X. This removes all genes whose maximum minus minimum value is less than X.
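The four filters above can be sketched as a single predicate applied to each gene row. This is an illustration only, not Cluster 3.0's own code. The function name, parameter names and example rows are hypothetical, `None` is assumed to mark a missing value, and the sample standard deviation is assumed.

```python
import statistics


def passes_filters(values, pct_present=80, min_sd=0.0, min_obs=1, min_abs=0.0, min_range=0.0):
    """Return True if one gene row survives all four filters (None = missing value)."""
    present = [v for v in values if v is not None]
    if 100.0 * len(present) / len(values) < pct_present:
        return False  # fails "% Present >= X"
    if len(present) < 2 or statistics.stdev(present) < min_sd:
        return False  # fails "SD (Gene Vector) >= X" (sample SD assumed)
    if sum(1 for v in present if abs(v) >= min_abs) < min_obs:
        return False  # fails "At least X Observations with abs(Val) >= Y"
    if max(present) - min(present) < min_range:
        return False  # fails "MaxVal - MinVal >= X"
    return True


# Hypothetical log-ratio rows for three genes across four samples.
genes = {
    "geneA": [0.1, 0.2, None, 0.1],     # present often enough, but almost flat
    "geneB": [2.5, -3.0, 0.4, 1.9],     # strongly varying
    "geneC": [None, None, None, 1.0],   # mostly missing
}
kept = [
    name
    for name, row in genes.items()
    if passes_filters(row, pct_present=75, min_sd=1.0, min_obs=2, min_abs=2.0, min_range=3.0)
]
print(kept)
```

With these thresholds only the strongly varying gene is kept: the flat gene fails the standard deviation filter and the sparse gene fails the percent-present filter.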
ADJUSTING DATA
• Cluster allows you to perform a number of operations that alter the underlying data in the imported file
– Log Transform Data: replace all data values x by log2 (x). Why?
– Center genes [mean or median]: subtract the row-wise mean or median from the values in each row of data, so that the mean or median value of each row is 0.
– Center arrays [mean or median]: subtract the column-wise mean or median from the values in each column of data, so that the mean or median value of each column is 0.
– Normalize genes: multiply all values in each row of data by a scale factor S so that the sum of the squares of the values in each row is 1.0 (a separate S is computed for each row).
– Normalize arrays: multiply all values in each column of data by a scale factor S so that the sum of the squares of the values in each column is 1.0 (a separate S is computed for each column).
• These operations do not commute, so the order in which they are applied is very important
• Log transforming centered genes is not the same as centering log transformed genes.
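A small numeric check shows why the order matters. The helper names and the three ratio values are hypothetical; the sketch uses mean centering on a single gene row.

```python
import math


def log2_transform(row):
    """Replace every value x by log2(x)."""
    return [math.log2(x) for x in row]


def mean_center(row):
    """Subtract the row mean so that the values average to 0."""
    m = sum(row) / len(row)
    return [x - m for x in row]


ratios = [1.0, 2.0, 4.0]  # hypothetical expression ratios for one gene

# Log transform first, then center: a symmetric row around 0.
log_then_center = mean_center(log2_transform(ratios))

# Center first: some values become negative, so log2 can no longer be applied.
centered_first = mean_center(ratios)

print(log_then_center)
print(centered_first)
```

Centering the raw ratios first leaves negative values, for which the logarithm is undefined, so the two orders cannot give the same result.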
LOG TRANSFORMATION
• Experiment: analyzing gene expression data from DNA microarrays as fluorescence ratios
• We are looking at gene expression over time
• Results are expression levels relative to time 0
– Time 0: base time
– Time 1: gene is unchanged
– Time 2: gene is up-regulated 2-fold
– Time 3: gene is down-regulated 2-fold
• “Is 2-fold up the same magnitude of change as 2-fold down, just in the opposite direction?”
– If yes, then log transform the sample data
– If no, then use the data as it is
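The time-course example can be checked numerically. The ratio values below (1, 1, 2 and 0.5) are an assumed encoding of "unchanged", "up 2-fold" and "down 2-fold" relative to time 0.

```python
import math

# Assumed ratios relative to time 0: unchanged = 1, up 2-fold = 2, down 2-fold = 0.5.
ratios = {"time 0": 1.0, "time 1": 1.0, "time 2": 2.0, "time 3": 0.5}

# On the raw scale, the distance from "no change" (ratio 1) is asymmetric.
raw_change = {t: r - 1.0 for t, r in ratios.items()}

# On the log2 scale, 2-fold up and 2-fold down are equal and opposite.
log_change = {t: math.log2(r) for t, r in ratios.items()}

for t in ratios:
    print(t, "raw:", raw_change[t], "log2:", log_change[t])
```

On the raw scale the 2-fold increase is a change of +1 while the 2-fold decrease is only -0.5. After the log2 transform they become +1 and -1.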
MEAN/MEDIAN CENTERING
• Experiment: analyzing a large number of tumor samples, all compared to a common reference sample made from a collection of cell lines.
• For each gene, you have a series of ratio values that are relative to the expression level of that gene in the reference sample.
• Since the reference sample really has nothing to do with your experiment, you want your analysis to be independent of the amount of a gene present in the reference sample.
• “Should the analysis be independent of the amount of a gene present in the reference sample?”
– If yes, then use centering
– If no, then work with raw data
• Median centering is preferred over mean centering
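One plausible reason for preferring the median, assumed here rather than stated in the slides, is that it is less sensitive to outlying values. The sketch below centers one hypothetical gene row both ways. The `center` helper and the data are illustrative only.

```python
import statistics


def center(row, how="median"):
    """Subtract the row median (or mean) from every value in the row."""
    ref = statistics.median(row) if how == "median" else statistics.mean(row)
    return [x - ref for x in row]


# Hypothetical log ratios for one gene; the last value is an outlier.
log_ratios = [0.1, -0.2, 0.0, 0.3, 6.0]

by_mean = center(log_ratios, "mean")
by_median = center(log_ratios, "median")

print("mean-centered:  ", by_mean)
print("median-centered:", by_median)
```

With mean centering the single outlier shifts every other value well below zero. With median centering the typical values stay close to zero and only the outlier stands out.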
DISTANCE/SIMILARITY MEASURE
• “Is the graph on the left the same as the graph on the right?”
• The Pearson correlation coefficient says they are similar, i.e. it treats x, 2x and 2x+y as equivalent. Alternatively, use the Spearman rank correlation or Kendall's τ in Cluster 3.0.
• Euclidean distance says they are not similar, i.e. x != 2x.
• Pearson measures only the similarity of shape, while Euclidean distance also measures the magnitude of the difference.
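The contrast between the two measures can be reproduced with a few lines. The profiles below are hypothetical, and the offset in the slide's "2x+y" is assumed here to be a constant.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def euclidean(x, y):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


x = [1.0, 2.0, 3.0, 4.0]
scaled = [2 * v for v in x]        # same shape, twice the amplitude
shifted = [2 * v + 1 for v in x]   # scaled and offset by a constant

print("Pearson:  ", pearson(x, scaled), pearson(x, shifted))
print("Euclidean:", euclidean(x, scaled), euclidean(x, shifted))
```

Pearson correlation is 1 (up to rounding) for both comparisons because it ignores scale and offset, while the Euclidean distance grows with the difference in magnitude.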
ARTICLE
• “Systematic Variation in Gene Expression Patterns in Human Cancer Cell Lines”
• 2000: Nature America, Inc.; Princeton University
• http://genetics.nature.com
• Primary authors: Douglas Ross, Uwe Scherf, Michael Eisen
ACKGROUND
Cell lines from human tumors
used for many years as
experimental models to show
neoplasia
or
neoplastic
disease
60 cancer cell lines
National Cancer Institute’s
Developmental Therapeutics Program (DPT)
DNA microarrays to show variation in the prevalence of
transcripts
Comparing RNA from:
Two breast cancer biopsy samples
Sample of normal breast tissue
NCI60 cell lines derived from breast cancers (excluding MDA

MB

435 and MDA

N)
Leukaemias
Pattern shared between the cancer specimens and
individual cell lines derived from breast cancers and
leukaemias
BACKGROUND
• cDNA microarrays were used to explore variation in 8,000 different genes across 60 cell lines
– National Cancer Institute
– Screen for anti-cancer drugs
• Purpose:
– Show phenotypic variation: cell reproduction rate, drug metabolism
– Location of tumors
– Compare gene expression patterns in cell lines with those of normal breast tissue or breast tumor samples
– Use clustering to look for outliers that would validate or dismiss previous classification efforts
CLUSTERING IN ACTION
• Process: build a table with rows of genes and columns of microarray hybridizations
• Normalized fluorescence ratios were taken from the database, after subtraction of local background
• Specific criteria were established to select a subset of the 9,703 cDNA elements on the arrays
– Centered data by subtracting the arithmetic mean of all ratios measured
– log2 (ratio) > 2.8
• Centering makes all further analysis independent of the amount of mRNA in the reference pool
CLUSTERING IN ACTION
• Display representing microarray hybridizations and genes
• The data were normalized and the quantitative values converted to a color gradient
• Each color represents the mean-adjusted expression level of a gene in a cell line
CLUSTERING IN ACTION
• Hierarchical clustering algorithm
• Pearson correlation coefficient: compares similarities and ignores differences in the amount of variation across cell-line genes
• Similar expression is characterized by short branches; longer branches denote dissimilarity
CLUSTERING IN ACTION
• Dendrogram: gene expression patterns of cell lines in relation to their tissue of origin
• Cell lines derived from leukaemia, melanoma, central nervous system, colon, renal and ovarian tissue.
CONCLUSIONS
• The cDNAs provided 8,000 genes:
– only 3,700 represented previously classified human proteins
– 1,900 had homologues in other organisms, and 2,400 were identified via ESTs
• Estimated that 80% of the genes were correctly identified
• Able to analyze intact tumors within their specific microenvironment
• Dendrograms offer the possibility of an improved taxonomy of cancer
• Helpful in explaining the heterogeneity of breast cancer
• Possibility of individual treatment regimens (personalized medicine)
Thank You!