A New Biclustering Algorithm for Analyzing Biological Data


Nov 25, 2013 (4 years and 1 month ago)



Prashant Paymal

Advisor: Dr. Hesham Ali

Introduction


Microarray technology is used to study the
expression of many genes at once



Microarray experiments produce large
amounts of data



Proper analysis of the data is important to get
meaningful information from it



There is a need for new analysis techniques



Data Analysis


From data to knowledge



We need to process data by grouping and
synthesizing information into a “big picture”
based upon characteristics and relationships



One of the most widely used analysis techniques is
traditional clustering





Traditional Clustering


Applied to either rows or columns of the data matrix
separately



Each gene is defined using all the conditions



Each condition is characterized by the activity of all
the genes that belong to it


[Figure: data matrix panels with axes Genes and Conditions]

Motivation


The large amount of data poses great
challenges for analysis



Clustering algorithms consider all the conditions to
group genes and all the genes to group conditions



Biologically, genes may show similar behavior only in
a subset of conditions rather than in all of them



Traditional clustering algorithms will very likely
miss some important information
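The point above can be illustrated with a small sketch. The expression values below are hypothetical and chosen only for illustration: two genes move together in the first four conditions and unrelatedly in the rest, so a similarity computed over all conditions hides the relationship.

```python
import numpy as np

# Hypothetical expression levels of two genes under 8 conditions.
# In the first four conditions gene_b is exactly twice gene_a;
# in the last four the two genes behave unrelatedly.
gene_a = np.array([1, 2, 3, 4, 9, 0, 7, 2], dtype=float)
gene_b = np.array([2, 4, 6, 8, 0, 9, 1, 6], dtype=float)


def pearson(x, y):
    """Pearson correlation coefficient of two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


corr_all = pearson(gene_a, gene_b)              # uses every condition
corr_subset = pearson(gene_a[:4], gene_b[:4])   # uses the first four conditions only

print(f"all conditions:     {corr_all:.2f}")
print(f"first 4 conditions: {corr_subset:.2f}")
```

A method that compares genes over all conditions would see these two genes as dissimilar, although they are perfectly correlated on a subset.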




Biclustering


The term “biclustering” was first used by Cheng and
Church (2000) in gene expression data analysis



Clusters do not need to include all parameters
(genes in Bioinformatics) for all conditions



Data Matrix


Each gene: one row

Each condition: one column

Each element: expression level of a gene under a
specific condition
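As a minimal sketch of this layout, the matrix can be held in a NumPy array. The gene names, condition names, and values below are made up for illustration; a bicluster is then just a subset of rows taken together with a subset of columns.

```python
import numpy as np

genes = ["g1", "g2", "g3", "g4"]       # hypothetical gene names (rows)
conditions = ["c1", "c2", "c3"]        # hypothetical condition names (columns)

# expression[i, j] = expression level of genes[i] under conditions[j]
expression = np.array([
    [5.1, 0.3, 5.0],
    [4.9, 7.2, 5.2],
    [0.2, 6.8, 0.1],
    [5.0, 0.5, 4.8],
])

# A bicluster: a subset of rows together with a subset of columns.
row_idx = [0, 1, 3]    # g1, g2, g4
col_idx = [0, 2]       # c1, c3
bicluster = expression[np.ix_(row_idx, col_idx)]

print(bicluster.shape)
```

Here `np.ix_` builds the row-by-column index so that the selected submatrix keeps its two-dimensional shape.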


Biclustering (Cont.)


Performs clustering in these two dimensions
simultaneously



Each gene is selected using only a subset of the
conditions



Each condition is selected using only a subset of the
genes


[Figure: data matrix with axes Genes and Conditions]

Goal of Biclustering


To identify subgroups of genes and subgroups of
conditions by performing simultaneous
clustering of both rows and columns of the gene
expression matrix, instead of clustering these
two dimensions separately



Finding biclusters is an NP-hard problem: it is
actually a generalized version of traditional
clustering

Previous Work


A systematic comparison and evaluation of
biclustering methods for gene expression data
- Amela Prelic (2006)



Algorithms:


Statistical-Algorithmic Method for Bicluster Analysis
(SAMBA)


Order Preserving Submatrix Algorithm (OPSM)


Iterative Signature Algorithm (ISA)


Cheng and Church algorithm


xMotif


Bimax


Previous Work (Cont.)


Comparative Analysis of Biclustering Algorithms
- Doruk Bozdag (2010)



Algorithms


Correlated Pattern Bicluster Algorithm (CPB)


Cheng and Church Algorithm


Order Preserving Submatrix Algorithm (OPSM)


HARP Algorithm



Minimum Sum-Squared Residue-based CoClustering Algorithm
(MSSRCC)


Statistical-Algorithmic Method for Bicluster Analysis
(SAMBA)


The Importance of Assessment


Different algorithms give different solutions for
the same data



There is no agreed-upon guideline for choosing
among them



Validation Techniques


External Validation Measures


Evaluate a result based on the knowledge of the correct
class labels


Internal Validation Measures


Evaluate a result based on the information intrinsic to the
data alone



Validation


In most biclustering papers, external validation
measures are used to assess the methods



It is not clear how to extend notions such as
homogeneity and separation to the biclustering
context (Gat-Viks et al. 2003)



Internal measures don’t work well for
biclustering, which is why Gat-Viks et al. (2003) and
Handl et al. (2005) recommend external measures


Objectives of the Project


Comprehensive Assessment Technique


Internal measures as well as external measures



Customized Biclustering Method


Input domain



Validation using Synthetic Data


Testing using Manufactured data


The portion of the implanted bicluster the
algorithm was able to return



The portion external or irrelevant to the implanted
bicluster which the algorithm returns



Two metrics to evaluate cluster quality


U: Uncovered portion of the implanted bicluster


E: Portion of the output cluster external to the
implanted bicluster
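The two metrics U and E above can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes that "portion" is measured in (gene, condition) cells, and the function names and example output bicluster are hypothetical.

```python
from itertools import product


def cells(genes, conditions):
    """All (gene, condition) cells covered by a bicluster."""
    return set(product(genes, conditions))


def uncovered_and_external(implanted, output):
    """Return (U, E) for one implanted and one output bicluster.

    Each bicluster is a (genes, conditions) pair of iterables.
    Assumption: portions are counted in matrix cells.
    """
    imp = cells(*implanted)
    out = cells(*output)
    u = len(imp - out) / len(imp) if imp else 0.0   # implanted cells the output missed
    e = len(out - imp) / len(out) if out else 0.0   # output cells outside the implant
    return u, e


implanted = (range(0, 10), range(0, 10))   # 10 x 10 implanted bicluster
output = (range(0, 8), range(0, 12))       # hypothetical algorithm output
u, e = uncovered_and_external(implanted, output)
print(u, e)
```

Both values are 0 for a perfect result; U grows when the algorithm misses part of the implanted bicluster, and E grows when it returns cells that do not belong to it.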

Validation using Real Data


Testing using real (domain-specific) data, for
example using the gene match score

Let M1 and M2 be two sets of biclusters

Gene match score: average of the maximum match scores for all
biclusters in M1 with respect to the biclusters in M2



Potential improvements


The gene match score does not consider samples / conditions


Specificity and Sensitivity
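The gene match score described above can be sketched as below. The sketch assumes that the match between two biclusters is the Jaccard coefficient of their gene sets; the function names and the toy biclusters are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient |a ∩ b| / |a ∪ b| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def gene_match_score(m1, m2):
    """Average, over biclusters in m1, of the best gene-set match in m2.

    m1 and m2 are lists of (genes, conditions) pairs.
    Only the gene sets are compared; conditions are ignored.
    """
    if not m1 or not m2:
        return 0.0
    best = [max(jaccard(g1, g2) for g2, _ in m2) for g1, _ in m1]
    return sum(best) / len(m1)


# Toy example: the second biclusters share genes but not conditions.
m1 = [({"a", "b", "c"}, {"x", "y"}), ({"d", "e"}, {"z"})]
m2 = [({"a", "b"}, {"x"}), ({"d", "e"}, {"y"})]
score = gene_match_score(m1, m2)
print(round(score, 3))
```

In the toy example the second pair of biclusters matches perfectly on genes even though the conditions differ, which is the limitation noted above.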




Proposed Assessment


Calculate sensitivity and specificity scores


Specificity: proportion of negatives which are correctly
identified


Sensitivity: proportion of actual positives which are
correctly identified



Improve existing measures:


Average of the maximum match scores for all biclusters in
M1 with respect to biclusters in M2 (considering both
genes and samples)



Assessment based on knowledge of domain data


The resulting biclusters were evaluated based on the
enrichment of Gene Ontology (GO) terms
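The sensitivity and specificity scores defined above can be sketched for a single pair of biclusters. The slides do not say what is counted as a positive, so this sketch makes an assumption: positives are the matrix cells inside the true bicluster and negatives are all other cells. The function name and example values are hypothetical.

```python
from itertools import product


def sensitivity_specificity(true_bic, found_bic, n_genes, n_conditions):
    """Cell-level sensitivity and specificity of a found bicluster.

    true_bic and found_bic are (genes, conditions) pairs of iterables;
    n_genes and n_conditions give the size of the full data matrix.
    """
    true_cells = set(product(*true_bic))
    found_cells = set(product(*found_bic))
    tp = len(true_cells & found_cells)          # true cells that were found
    fn = len(true_cells - found_cells)          # true cells that were missed
    fp = len(found_cells - true_cells)          # found cells that are not true
    tn = n_genes * n_conditions - tp - fn - fp  # everything else
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity


true_bic = (range(0, 10), range(0, 10))    # 10 x 10 true bicluster
found_bic = (range(0, 8), range(0, 12))    # hypothetical algorithm output
sens, spec = sensitivity_specificity(true_bic, found_bic, 100, 100)
print(sens, spec)
```

Because a bicluster usually covers a small part of the matrix, specificity computed this way tends to stay close to 1, so it is most informative when read together with sensitivity.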

Experiments


Given two biclustering results

M1: result of a biclustering algorithm

M2: true result

$(G_1, C_1) \in M_1$ and $(G_2, C_2) \in M_2$

Calculate the similarity scores (Jaccard coefficient)

$\frac{|G_1 \cap G_2|}{|G_1 \cup G_2|}$ and $\frac{|C_1 \cap C_2|}{|C_1 \cup C_2|}$

Calculate the two scores:

Score 1: fraction of an algorithm's result that is included in the
true result

Score 2: fraction of the true result that an algorithm can find
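One way to turn Score 1 and Score 2 into code is sketched below. The slides do not state how the gene and condition Jaccard coefficients are combined or aggregated, so two assumptions are made here: the match between two biclusters is the product of the two coefficients, and each score is an average of best matches. All function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient |a ∩ b| / |a ∪ b| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match(b1, b2):
    """Assumed bicluster similarity: gene Jaccard times condition Jaccard."""
    (g1, c1), (g2, c2) = b1, b2
    return jaccard(g1, g2) * jaccard(c1, c2)


def average_best_match(source, target):
    """Average, over biclusters in source, of the best match in target."""
    if not source or not target:
        return 0.0
    return sum(max(match(s, t) for t in target) for s in source) / len(source)


# M2: true result with two biclusters; M1: an algorithm that found only one.
m2 = [({1, 2, 3, 4}, {1, 2}), ({5, 6}, {3, 4})]
m1 = [({1, 2, 3, 4}, {1, 2})]

score1 = average_best_match(m1, m2)   # how much of the result is in the truth
score2 = average_best_match(m2, m1)   # how much of the truth was found
print(score1, score2)
```

In this toy case everything the algorithm returned is correct, but it recovered only one of the two true biclusters, so Score 1 is higher than Score 2.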


Results


Synthetic Data: 100 genes and 100 samples


10 implanted biclusters, each of size 10 × 10 (10 genes and 10
samples)

Used publicly available implementations of different
biclustering algorithms










Score 1: fraction of an algorithm's result that is included in the true result


Score 2: fraction of the true result that an algorithm can find









Algorithm                                                     | No. of biclusters | Score 1  | Score 2
Cheng and Church Algorithm (CC)                               | 8                 | 0.475    | 0.38
Iterative Signature Algorithm (ISA)                           | 9                 | 1        | 0.9
Order Preserving Submatrix (OPSM) Algorithm                   | 32                | 0.273139 | 0.874044
Statistical-Algorithmic Method for Bicluster Analysis (SAMBA) | 9                 | 0.5      | 0.45
xMotif Algorithm                                              | 87                | 0.100023 | 0.870204
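A synthetic dataset with the dimensions described above (100 genes, 100 samples, 10 implanted 10 × 10 biclusters) could be generated as in the sketch below. The slides do not say how the biclusters were implanted, so the details here are assumptions: Gaussian background noise, non-overlapping blocks placed along the diagonal, and a constant upward shift inside each block. The function name and parameters are hypothetical.

```python
import numpy as np


def make_synthetic(n_genes=100, n_conditions=100, n_biclusters=10,
                   size=10, shift=5.0, seed=0):
    """Return (data, truth) for a matrix with implanted biclusters.

    Assumptions: standard-normal background, non-overlapping diagonal
    blocks, and a constant shift added inside each block.
    """
    rng = np.random.default_rng(seed)
    data = rng.normal(0.0, 1.0, size=(n_genes, n_conditions))
    truth = []
    for k in range(n_biclusters):
        rows = np.arange(k * size, (k + 1) * size)
        cols = np.arange(k * size, (k + 1) * size)
        data[np.ix_(rows, cols)] += shift          # implant the block
        truth.append((set(rows.tolist()), set(cols.tolist())))
    return data, truth


data, truth = make_synthetic()
print(data.shape, len(truth))
```

The returned `truth` list plays the role of the true result M2, so the output of any biclustering implementation run on `data` can be scored against it.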

Conclusion


Traditional clustering is too restrictive a technique for
analyzing datasets in various application domains



We need new, flexible analysis techniques such as biclustering
to deal with possible imperfections in the input datasets



Assessment of data analysis is critical and must be
considered when selecting the right tool for each
application domain



Biclustering is a powerful tool for analysis of
data in a variety of domains and is applicable to
datasets outside biology

References


Madeira, S.C., Oliveira, A.L.: Biclustering algorithms
for biological data analysis: A survey




Prelic, A., et al.: A systematic comparison and
evaluation of biclustering methods for gene
expression data



http://cheng.ececs.uc.edu/biclustering



http://www.tik.ethz.ch/~sop/bicat/



http://acgt.cs.tau.ac.il/expander/



Thank you…