Vol. 22 no. 10 2006, pages 1259–1268
Gene expression
Incorporating biological knowledge into distance-based
clustering analysis of microarray gene expression data
Desheng Huang and Wei Pan
Department of Mathematics, China Medical University, Shenyang, China and
Division of Biostatistics, School of Public Health, University of Minnesota,
Minneapolis, MN 55455-0392, USA
Received on November 28, 2005; revised on January 16, 2006; accepted on February 20, 2006
Advance Access publication February 24, 2006
Associate Editor: Joaquin Dopazo
Motivation: Because co-expressed genes are likely to share the same
biological function, cluster analysis of gene expression profiles has
been applied for gene function discovery. Most existing clustering
methods ignore known gene functions in the process of clustering.
Results: To take advantage of accumulating gene functional annotations,
we propose incorporating known gene functions into a new
distance metric, which shrinks a gene expression-based distance
towards 0 if and only if the two genes share a common gene function.
A two-step procedure is used. First, the shrinkage distance metric is
used in any distance-based clustering method, e.g. K-medoids or
hierarchical clustering, to cluster the genes with known functions.
Second, while keeping the clustering results from the first step for
the genes with known functions, the expression-based distance metric
is used to cluster the remaining genes of unknown function, assigning
each of them to either one of the clusters obtained in the first step or
some new clusters. A simulation study and an application to gene
function prediction for the yeast demonstrate the advantage of our
proposal over the standard method.
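
As a rough illustration of the shrinkage distance described above, the following Python sketch computes an expression-based Euclidean distance matrix and then shrinks the distance for every pair of genes that share at least one known function. The multiplicative form and the parameter name `shrink` are assumptions made for this example only; the abstract states merely that the distance is shrunk towards 0 for such pairs. The function names and the representation of annotations as sets of labels are likewise hypothetical.

```python
import numpy as np

def expression_distance(X):
    """Pairwise Euclidean distances between the rows of X (genes x arrays)."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def shrinkage_distance(D, annotations, shrink=0.5):
    """Return a copy of D with D[i, j] shrunk towards 0 when genes i and j
    share at least one known function; all other entries are unchanged.

    annotations: one set of function labels per gene (empty set = unknown).
    shrink: factor in [0, 1]; the multiplicative form is an assumption
            of this sketch, not a formula taken from the text above.
    """
    D_star = np.array(D, dtype=float, copy=True)
    n = len(annotations)
    for i in range(n):
        for j in range(i + 1, n):
            if annotations[i] & annotations[j]:  # common known function
                D_star[i, j] = D_star[j, i] = shrink * D[i, j]
    return D_star
```

With `shrink=1` the metric reduces to the plain expression-based distance, and with `shrink=0` any two genes with a common annotation are treated as identical. Distances involving a gene of unknown function are never altered, because an empty annotation set cannot intersect any other set.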

This article concerns incorporating biological knowledge into
clustering genes for gene function discovery using microarray
expression data, though the methodology has broader
applications. It has been observed that genes with the same function
or involved in the same biological process are likely to co-express,
hence clustering genes’ expression profiles provides a means for
gene function prediction; see, e.g. Eisen et al. (1998), Brown et al.
(2000), Wu et al. (2002), Xiao and Pan (2005) and references
therein. However, most existing approaches ignore known functions
of the genes in the process of clustering. The limitations of these
approaches and the importance of incorporating gene functional or
other types of biological knowledge into clustering analysis have
been increasingly recognized. In particular, it is well known that
results of clustering gene expression data are not stable (Zhang and
Zhao, 2000; Kerr and Churchill, 2001). In general, with relatively
high noise levels of genomic data, it is recognized that incorporating
biological knowledge into statistical analysis is a reliable way to
maximize statistical efficiency and enhance the interpretability of
analysis results.

Hanisch et al. (2002) proposed incorporating a
metabolic pathway while Cheng et al. (2004) incorporated the
Gene Ontology (GO) (Ashburner et al., 2000) into clustering.
Both approaches work by first defining a distance metric based
on a biological network, either a pathway or GO. Then this metric
is combined with the usual expression-based metric using their
average, which is used in a clustering algorithm. As will be discussed,
owing to incomplete biological knowledge, such a combined metric
is in general biased: it over-estimates the distance between any two
genes of unknown function, as compared with the genes with known
functions. Specifically, two genes that in truth share the same
function, which however is not known yet, will have an incorrectly large
distance in the biological function metric, and thus an incorrectly
large distance in the combined metric. We propose a novel idea to
handle this problem. Among other attempts, Fang et al. (2006) used
GO to guide clustering; i.e. only GO nodes are possible clusters. A
downside is that it does not permit the discovery of new gene
functional categories.

Pan (2006) and Huang et al. (2006) proposed
two modifications to standard model-based clustering so that the
genes sharing the same biological function are allowed to have a
common prior probability in any cluster while the genes with
different functions may have varying prior probabilities. The two
modifications are only applicable to model-based clustering,
which has been extensively studied in the context of clustering
gene expression data (e.g. Yeung et al., 2001; Broet et al., 2002;
Ghosh and Chinnaiyan, 2002; McLachlan et al., 2002; Pan et al.,
2002); see McLachlan and Peel (2002) for a nice introduction to
model-based clustering. Here we consider another class,
distance-based clustering, ranging from partitional to hierarchical methods,
some of which are among the most competitive for gene expression
data; see Datta and Datta (2003). To be concrete, we focus on the
K-medoids method, also called partitioning around medoids (PAM)
(Kaufman and Rousseeuw, 1990), which is a robust version of
K-means and has been shown to work well for gene expression data
(van der Laan et al., 2003).
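
To make the role of the distance matrix in K-medoids concrete, the sketch below clusters items using only a precomputed matrix of pairwise distances. It is a simple alternating heuristic (assign each item to its nearest medoid, then re-pick each medoid as the member with the smallest total within-cluster distance) and is not the full build/swap procedure of PAM; the function name `k_medoids` and its parameters are hypothetical.

```python
import numpy as np

def k_medoids(D, k, max_iter=100, seed=0):
    """Alternating K-medoids on a precomputed n x n distance matrix D.

    Returns (medoids, labels): the indices of the k medoids and, for each
    item, the position in `medoids` of its nearest medoid.
    """
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    rng = np.random.default_rng(seed)
    medoids = rng.choice(n, size=k, replace=False)  # random initial medoids
    for _ in range(max_iter):
        # Assignment step: each item joins its nearest medoid.
        labels = np.argmin(D[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue  # keep the old medoid if a cluster is empty
            # Update step: choose the member with the smallest total
            # distance to the other members of its cluster.
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break  # medoids are stable, so the assignment is too
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels
```

Because the routine touches the data only through `D`, any distance can be supplied, whether a plain expression-based one or a modified metric, without changing the clustering code. Each cluster centre is an actual observation rather than a mean, which is the usual reason given for the robustness of K-medoids relative to K-means.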
In our proposal, we assume that each gene of a genome can be
assigned to at least one, possibly several, of a few groups: one
group contains the genes of unknown function while each of the
others contains the genes sharing a common function. The grouping
can be based on biological pathways or gene functional annotation
systems, such as GO (Ashburner et al., 2000), MIPS (Mewes et al.,
2004) and KEGG (Kanehisa, 1996). As usual, we can calculate a
distance matrix for all the genes based on their gene expression
profiles using, for example, Euclidean distance or Pearson’s
To whom correspondence should be addressed.
© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please