A New Biclustering Algorithm for Analyzing Biological Data
Prashant Paymal
Advisor: Dr. Hesham Ali
Introduction
• Microarray technology is used to study the expression of many genes at once
• Microarray experiments produce large amounts of data
• Proper analysis of the data is essential for extracting meaningful information from it
• There is a need for new analysis techniques
Data Analysis
• From data to knowledge
• We need to process data by grouping and synthesizing information into a “big picture” based upon characteristics and relationships
• One of the most widely used analysis techniques is traditional clustering
Traditional Clustering
• Applied to either the rows or the columns of the data matrix separately
• Each gene is defined using all the conditions
• Each condition is characterized by the activity of all the genes that belong to it
[Figure: two data matrices with Genes and Conditions axes]
Motivation
• The large amount of data makes analysis a major challenge
• Clustering algorithms consider all the conditions to group genes and all the genes to group conditions
• Biologically, the data may show similar behavior only in a subset of the conditions rather than in all of them
• Traditional clustering algorithms will very likely miss some important information
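To illustrate the point above, here is a minimal sketch with two made-up gene profiles (the values, the condition subset, and the helper name `euclidean` are assumptions for illustration only). The two genes behave identically under three conditions, yet a distance computed over all conditions makes them look dissimilar.

```python
import math

def euclidean(a, b, conditions):
    """Euclidean distance between two profiles, restricted to the given conditions."""
    return math.sqrt(sum((a[c] - b[c]) ** 2 for c in conditions))

# Made-up expression levels over six conditions.
gene_a = [2.0, 4.0, 6.0, 9.0, 1.0, 7.0]
gene_b = [2.0, 4.0, 6.0, 1.0, 8.0, 0.0]

all_conditions = range(6)
subset = [0, 1, 2]  # the conditions under which the two genes agree

dist_all = euclidean(gene_a, gene_b, all_conditions)   # large: the genes look unrelated
dist_subset = euclidean(gene_a, gene_b, subset)        # zero: identical on the subset

print(dist_all, dist_subset)
```

A method that only ever measures `dist_all` would not group these two genes, which is the kind of information a traditional clustering can miss.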
Biclustering
• The term “biclustering” was first used in gene expression data analysis by Cheng and Church (2000)
• Clusters do not need to include all parameters (genes, in bioinformatics) for all conditions
• Data matrix
  ▫ Each gene: one row
  ▫ Each condition: one column
  ▫ Each element: expression level of a gene under a specific condition
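The data-matrix layout above can be sketched as follows. The matrix values and the helper name `submatrix` are hypothetical and chosen only for illustration; the point is that a bicluster is a subset of rows taken together with a subset of columns.

```python
# Rows are genes, columns are conditions; values are made-up expression levels.
matrix = [
    [1.0, 5.0, 5.0, 0.2],   # gene 0
    [0.3, 5.1, 4.9, 2.0],   # gene 1
    [2.2, 0.1, 0.4, 1.5],   # gene 2
]

def submatrix(data, genes, conditions):
    """Return the expression values restricted to the given rows and columns."""
    return [[data[g][c] for c in conditions] for g in genes]

# A candidate bicluster: genes 0 and 1 under conditions 1 and 2 only.
bicluster = submatrix(matrix, genes=[0, 1], conditions=[1, 2])
print(bicluster)
```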
Biclustering (Cont.)
• Performs clustering in both dimensions (genes and conditions) simultaneously
• Each gene is selected using only a subset of the conditions
• Each condition is selected using only a subset of the genes
[Figure: data matrix with Genes and Conditions axes]
Goal of Biclustering
• To identify subgroups of genes and subgroups of conditions by clustering the rows and columns of the gene expression matrix simultaneously, instead of clustering these two dimensions separately
• Finding biclusters is an NP-hard problem: it is actually a generalized version of traditional clustering
Previous Work
• A systematic comparison and evaluation of biclustering methods for gene expression data, Amela Prelic et al. (2006)
• Algorithms:
  ▫ Statistical-Algorithmic Method for Bicluster Analysis (SAMBA)
  ▫ Order Preserving Submatrix Algorithm (OPSM)
  ▫ Iterative Signature Algorithm (ISA)
  ▫ Cheng and Church algorithm
  ▫ xMotif
  ▫ Bimax
Previous Work (Cont.)
• Comparative Analysis of Biclustering Algorithms, Doruk Bozdag … (2010)
• Algorithms:
  ▫ Correlated Pattern Bicluster Algorithm (CPB)
  ▫ Cheng and Church algorithm
  ▫ Order Preserving Submatrix Algorithm (OPSM)
  ▫ HARP Algorithm
  ▫ Minimum Sum-Squared Residue-based CoClustering Algorithm (MSSRCC)
  ▫ Statistical-Algorithmic Method for Bicluster Analysis (SAMBA)
The Importance of Assessment
• Different algorithms give different solutions for the same data
• There is no agreed-upon guideline for choosing among them
• Validation techniques
  ▫ External validation measures: evaluate a result based on knowledge of the correct class labels
  ▫ Internal validation measures: evaluate a result based on information intrinsic to the data alone
Validation
• In most biclustering papers, external validation measures are used to assess the methods
  ▫ It is not clear how to extend notions such as homogeneity and separation to the biclustering context (Gat-Viks et al. 2003)
  ▫ Internal measures do not work well for biclustering, which is why Gat-Viks et al. (2003) and Handl et al. (2005) recommend external measures
Objectives of the Project
• Comprehensive Assessment Technique
  ▫ Internal measures as well as external measures
• Customized Biclustering Method
  ▫ Input domain
Validation using Synthetic Data
• Testing using manufactured (synthetic) data
  ▫ The portion of the implanted bicluster that the algorithm was able to return
  ▫ The portion of the output that is external or irrelevant to the implanted bicluster
  ▫ Two metrics to evaluate cluster quality
    U: uncovered portion of the implanted bicluster
    E: portion of the output cluster external to the implanted bicluster
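A minimal sketch of the U and E metrics, assuming both are measured over matrix cells, that is (gene, condition) pairs; the slides do not state the unit of measurement, so this is an assumption, as are the function names `cells` and `uncovered_and_external`.

```python
from itertools import product

def cells(genes, conditions):
    """All (gene, condition) cells covered by a bicluster."""
    return set(product(genes, conditions))

def uncovered_and_external(implanted, output):
    """Return (U, E) for one implanted bicluster and one output bicluster.

    U: fraction of the implanted cells that the output misses.
    E: fraction of the output cells that lie outside the implanted bicluster.
    Assumes the implanted bicluster is non-empty.
    """
    imp = cells(*implanted)
    out = cells(*output)
    u = len(imp - out) / len(imp)
    e = len(out - imp) / len(out) if out else 0.0
    return u, e

# Implanted: genes 0-9 under conditions 0-9; output is shifted by five genes.
implanted = (range(0, 10), range(0, 10))
output = (range(5, 15), range(0, 10))
u, e = uncovered_and_external(implanted, output)
print(u, e)
```

A perfect recovery gives U = 0 and E = 0; lower is better for both.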
Validation using Synthetic Data (Cont.)
• Testing using real (domain-specific) data, for example using the gene match score
  ▫ Let M1 and M2 be two sets of biclusters
  ▫ Average of the maximum match scores for all biclusters in M1 with respect to the biclusters in M2
• Potential improvements
  ▫ The gene match score does not consider samples / conditions
  ▫ Specificity and sensitivity
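The gene match score described above can be sketched as follows, assuming the match score between two gene sets is the Jaccard coefficient, as in Prelic et al. The function names and the example biclusters are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient |a ∩ b| / |a ∪ b| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def gene_match_score(m1, m2):
    """Average, over biclusters in m1, of the best gene-set match found in m2.

    m1 and m2 are lists of (genes, conditions) pairs; conditions are ignored,
    which is the limitation noted under "Potential improvements".
    """
    if not m1 or not m2:
        return 0.0
    best = [max(jaccard(g1, g2) for g2, _ in m2) for g1, _ in m1]
    return sum(best) / len(m1)

m1 = [({1, 2, 3}, {0, 1}), ({7, 8}, {2})]
m2 = [({1, 2, 3, 4}, {0, 1}), ({8, 9}, {2, 3})]
score = gene_match_score(m1, m2)
print(score)
```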
Proposed Assessment
• Calculate sensitivity and specificity scores
  ▫ Specificity: proportion of actual negatives that are correctly identified
  ▫ Sensitivity: proportion of actual positives that are correctly identified
• Improve existing measures:
  ▫ Average of the maximum match scores for all biclusters in M1 with respect to the biclusters in M2 (considering both genes and samples)
• Assessment based on knowledge of domain data
  ▫ The resulting biclusters were evaluated based on the enrichment of Gene Ontology (GO) terms
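One way the sensitivity and specificity scores might be computed is at the level of matrix cells: a cell is "positive" if it belongs to any true bicluster. This cell-level reading is an assumption, not something the slides specify, and the names below are hypothetical.

```python
from itertools import product

def cell_set(biclusters):
    """Union of all (gene, condition) cells covered by a list of biclusters."""
    covered = set()
    for genes, conditions in biclusters:
        covered |= set(product(genes, conditions))
    return covered

def sensitivity_specificity(found, truth, n_genes, n_conditions):
    """Cell-level sensitivity TP/(TP+FN) and specificity TN/(TN+FP)."""
    predicted = cell_set(found)
    actual = cell_set(truth)
    tp = len(predicted & actual)
    fp = len(predicted - actual)
    fn = len(actual - predicted)
    tn = n_genes * n_conditions - tp - fp - fn
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity

# True bicluster: genes 0-9, conditions 0-9; found one is shifted by five conditions.
truth = [(range(0, 10), range(0, 10))]
found = [(range(0, 10), range(5, 15))]
sens, spec = sensitivity_specificity(found, truth, n_genes=100, n_conditions=100)
print(sens, spec)
```

Because most cells of a sparse matrix are negatives, specificity tends to stay close to 1, so it is most informative when read alongside sensitivity.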
Experiments
• Given two biclustering results
  ▫ M1: result of a biclustering algorithm
  ▫ M2: true result
  ▫ $(G_1, C_1) \in M_1$ and $(G_2, C_2) \in M_2$
• Calculate the similarity score (Jaccard coefficient)
  ▫ $\frac{|G_1 \cap G_2|}{|G_1 \cup G_2|}$ and $\frac{|C_1 \cap C_2|}{|C_1 \cup C_2|}$
• Calculate the two scores
  ▫ Score 1: % of the result of an algorithm that is included in the true result
  ▫ Score 2: % of the true result that an algorithm can find
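A hedged sketch of how the two scores could be derived from the Jaccard coefficients on genes and conditions. The slides do not say how the two coefficients are combined or averaged; here they are multiplied, and each score is the average best match in one direction. Both choices, and all names below, are assumptions for illustration.

```python
def jaccard(a, b):
    """Jaccard coefficient |a ∩ b| / |a ∪ b| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similarity(b1, b2):
    """Assumed combination: product of the gene and condition Jaccard coefficients."""
    (g1, c1), (g2, c2) = b1, b2
    return jaccard(g1, g2) * jaccard(c1, c2)

def directed_score(source, target):
    """Average, over biclusters in source, of the best similarity found in target."""
    if not source or not target:
        return 0.0
    return sum(max(similarity(s, t) for t in target) for s in source) / len(source)

m1 = [({0, 1, 2, 3}, {0, 1}), ({20, 21}, {5, 6})]  # algorithm output
m2 = [({0, 1, 2, 3}, {0, 1})]                       # true result

score1 = directed_score(m1, m2)  # how much of the output lies in the true result
score2 = directed_score(m2, m1)  # how much of the true result was found
print(score1, score2)
```

In this example the algorithm finds the true bicluster (Score 2 is 1.0) but also reports a spurious one, which halves Score 1.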
Results
• Synthetic data: 100 genes and 100 samples
• 10 implanted biclusters, each of size 10 × 10 (10 genes and 10 samples)
• Used different publicly available implementations of biclustering algorithms
• Score 1: % of the result of an algorithm that is included in the true result
• Score 2: % of the true result that an algorithm can find

Algorithm                                                      No. of biclusters  Score 1   Score 2
Cheng and Church Algorithm (CC)                                8                  0.475     0.38
Iterative Signature Algorithm (ISA)                            9                  1         0.9
Order Preserving Submatrix (OPSM) Algorithm                    32                 0.273139  0.874044
Statistical-Algorithmic Method for Bicluster Analysis (SAMBA)  9                  0.5       0.45
xMotif Algorithm                                               87                 0.100023  0.870204
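A test matrix of the shape described above (100 genes, 100 samples, ten implanted 10 × 10 biclusters) could be generated along these lines. The slides do not describe how the biclusters were implanted; the non-overlapping diagonal blocks, the constant bicluster value, and the uniform background noise are all assumptions, as is the function name `make_synthetic`.

```python
import random

def make_synthetic(n_genes=100, n_conditions=100, n_biclusters=10, size=10, seed=0):
    """Build a noisy matrix with constant-valued biclusters on the diagonal.

    Returns the matrix and the list of implanted (genes, conditions) sets.
    """
    rng = random.Random(seed)
    # Background: uniform noise in [0, 1].
    matrix = [[rng.uniform(0.0, 1.0) for _ in range(n_conditions)]
              for _ in range(n_genes)]
    truth = []
    for k in range(n_biclusters):
        genes = range(k * size, (k + 1) * size)
        conditions = range(k * size, (k + 1) * size)
        for g in genes:
            for c in conditions:
                matrix[g][c] = 5.0  # assumed constant signal, well above the noise
        truth.append((set(genes), set(conditions)))
    return matrix, truth

matrix, truth = make_synthetic()
print(len(matrix), len(matrix[0]), len(truth))
```

The returned `truth` list plays the role of M2 (the true result) when scoring the output of a biclustering algorithm.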
Conclusion
• Traditional clustering is too restrictive a technique for analyzing datasets in various application domains
• We need new, flexible analysis techniques such as biclustering to deal with possible imperfections in the input datasets
• Assessment of data analysis is critical and must be considered when selecting the right tool for each application domain
• Biclustering is a powerful tool for data analysis in a variety of domains and is applicable to datasets beyond biology
References
• Madeira, S.C., Oliveira, A.L.: Biclustering algorithms for biological data analysis: A survey
• Prelic, A., et al.: A systematic comparison and evaluation of biclustering methods for gene expression data
• http://cheng.ececs.uc.edu/biclustering
• http://www.tik.ethz.ch/~sop/bicat/
• http://acgt.cs.tau.ac.il/expander/
Thank you…