Markers improve clustering of CGH data

wickedshortpumpBiotechnology

Oct 1, 2013 (3 years and 9 months ago)

75 views

Vol.23 no.4 2007,pages 450–457
doi:10.1093/bioinformatics/btl624
BIOINFORMATICS ORIGINAL PAPER
Gene expression
Markers improve clustering of CGH data
Jun Liu
￿
,Sanjay Ranka and Tamer Kahveci
Computer and Information Science and Engineering,University of Florida,Gainesville,FL 32611,USA
Received on September 24,2006;revised and accepted on December 4,2006
Advance Access publication December 6,2006
Associate Editor:John Quackenbush
ABSTRACT
Motivation:We consider the problem of clustering a population of
Comparative Genomic Hybridization (CGH) data samples using simi-
larity based clustering methods.A key requirement for clustering is to
avoid using the noisy aberrations in the CGH samples.
Results:We develop a dynamic programming algorithm to identify a
small set of important genomicintervalscalledmarkers.Theadvantage
of using these markers is that the potentially noisy genomic intervals
are excluded during the clustering process.We also develop two clus-
tering strategies using these markers.The first one,prototype-based
approach,maximizes the support for the markers.The second one,
similarity-based approach,develops a new similarity measure called
RSimandrefinesclusterswiththeaimof maximizingtheRSimmeasure
between the samples in the same cluster.Our results demonstrate
that the markers we found represent the aberration patterns of cancer
types well and they improve the quality of clustering significantly.
Availability:All software developed in this paper and all the datasets
used are available fromthe authors upon request.
Contact:juliu@cise.ufl.edu
1 INTRODUCTION
Numerical and structural chromosomal imbalances are one of
the most prominent and pathogenetically relevant features of neo-
plastic cells Mitelman et al.(1972) One method for measuring
genomic aberrations is Comparative Genomic Hybridization
(CGH) Kallioniemi et al.(1992).CGH is a molecular-cytogenetic
analysis method for detecting regions with genomic imbalances
(gains or losses of DNA segments).Applying microarray techno-
logy to CGH measures thousands of copy number information
distributed throughout the genome simultaneously Pinkel and
Albertson (2005).Raw data from array CGH experiments is
expressed as the ratio of normalized fluorescence of tumor and
reference DNA.Normalized CGH ratio data surpassing predefined
thresholds is considered indicative for genomic gains or losses,
respectively.In contrast to the array CGH,the chromosomal
CGHresults (on which this paper is based) are annotated in a reverse
in situ karyotype format (Mitelman,1995) describing imbalanced
genomic regions with reference to their chromosomal location.
CGH data of an individual tumor can be considered as an ordered
list of status values,where each value corresponds to a genomic
interval (e.g.a single chromosomal band).The status can be
expressed as a real number (positive,negative or zero for gain,
loss or no aberration,respectively).We use this strategy and rep-
resent gain,loss and no change with +1,1 and 0,respectively.
Chromosomal and array CGH recently account for the majority of
published analyses in cancer cytogenetics Gray et al.(1994);Bentz
et al.(1996);Desper et al.(1999);Hoglund et al.(2005);Mattfeldt
et al.(2001);Vandesompele et al.(2005);Joos et al.(2002).
A lot of different strategies for structural analysis of CGH data
have been developed.Some of these analysis have been aimed at
the spatial coherence of genomic segments with different copy
number levels.Many attempts have been made on this area,such
as a Gaussian model-based segmentation method Picard et al.
(2005b),unsupervised hidden Markov model approach Fridlyand
et al.(2004),hierarchical clustering tree Wang et al.(2005)
and segmentation-clustering approach Picard et al.(2005a) etc.
Willenbrock and Fridlyand (2005) pointed out that the smoothed
(segmented) CGH data aids the downstream analyses,such as
classification.
Unsupervised clustering methods are often employed to discover
previously unknown sub-categories of cancer and to identify genetic
biomarkers associated with the differentiation.Towards this end,
many attempts have been done on the studies of gene microarray
Jiang et al.(2004).However,to the best of our knowledge,so far
there has been very limited study on the clustering of smoothed
CGH data,especially on the heterogeneous sets of data that incl-
ude a large (more than several hundreds) population of samples.
In Mattfeldt et al.(2001),used a self-organizing map as a tool for
cluster analysis of CGH data.However,their studies only focus on
several tens of cases from a single cancer type.
Existing clustering algorithms Jain et al.(1999) cannot be applied
to CGH data directly.This is because CGH data are structurally
different from ordinary high-dimensional data since consecutive
dimensions (i.e.genomic intervals) may be correlated.A point-
like genomic aberration may expand to the neighboring intervals
and result in a contiguous run of gain or loss status in CGH data.
In our earlier work Liu et al.(2006),we developed a similarity
measure,Sim,for a pair of CGH samples.Sim measures the simi-
larity as the number of overlapping segments (contiguous runs of
aberrations of the same type).Although Sim leads to better clus-
tering than existing measures over a large population of CGH sam-
ples,it has a drawback.Sim is affected from noisy aberrations in
CGH data since it cannot distinguish random aberrations from the
aberrations that actually lead to the underlying cancer.
Given a set of samples belonging to the same cancer type,we
observe that a small number of point aberrations are common to a
subset of samples.We say that these point aberrations are supported
by these samples.(See Fig.1 in Appendix for an example).Such
aberrations with high-support (we will use the term markers) rep-
resent the recurrent genomic imbalances and can be used to char-
acterize the imbalance profile of the sample set.Some recent
￿
To whom correspondence should be addressed.
450
 The Author 2006.Published by Oxford University Press.All rights reserved.For Permissions,please email:journals.permissions@oxfordjournals.org
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
analysis of CGH data discovered important genes or chromosomal
regions that are similar to our notion of markers.For example,
Rouveirol et al.(2006) proposed algorithms for computing recurrent
minimal genomic alterations from discretized CGH data.Fu and
Fu-Liu (2005) proposed a probability model of selection to eva-
luate the importance of genes in microarray data and choose a small
set of biologically significant genes for classifier design.Beattie
and Robinson (2006) used a digital paradigm to discretize the
gene microarray and transfered the data into binary state patterns.
A clustering based on Hamming distance was applied to discover
biomarkers.However,these works have not discussed the useful-
ness of markers for the clustering of CGH data.
The aimof this paper is to demonstrate that better clusters can be
found for CGHdatasets with the help of markers.The main thesis is
that markers represent the key imbalances between all the patients
and the non-marker areas are patient specific and hence negatively
impact the clustering process.
Towards this end,we develop a dynamic programming technique
to detect important markers in a group of samples.The resulting
markers can be seen as the prototype of these samples.Clustering
is an unsupervised learning process by its definition.The quality of
a clustering can be measured using several criteria.Two important
such criteria are coverage and the NMI measure.These are internal
and external measures,respectively.We develop two methods,
each working well for different measure.Depending on the purpose
of clustering either of the two methods can be appropriate.
In particular:
(1) The first one is a prototype-based approach.It starts with a
randominitial clustering and iteratively improves the clusters
in two steps.The first step identifies a set of markers for each
cluster that have sufficient support.The second step assigns
each sample tothe cluster whose markers are supportedbythat
sample the most.This algorithm is guaranteed to converge.
(2) The second one is a similarity-based algorithm.It starts by
identifying a set of markers that capture the aberration pat-
terns globally.Wedevelopanewsimilaritymeasure,RSim,by
focusing on the aberrations at these marker positions.It then
computes the average similarity between that sample and all
the samples in each cluster using RSim.It assigns each sample
to the cluster with the highest average similarity.
Our results demonstrate the utility of finding markers.Experi-
mental results show that the prototype-based method has better
internal measure than the competing methods.The similarity-
based method has better external measure.
The rest of the paper is organized as follows.Section 2 proposes
the technique of finding markers in a group of samples.Section 3
describes the prototype-based algorithm and proves its conver-
gence.Section 4 describes the refined Sim measure and the
similarity-based algorithm.Section 5 presents the experimental
results.Section 6 concludes our work in this paper.
2 DETECTION OF MARKERS
Most cancers result from genomic instability and display various
genomic aberrations.A recurrent alteration is a set of aberrations
common to sufficiently many CGH samples.The recurrent alter-
ations can be used to characterize the aberration pattern of samples.
Due to the correlation between adjacent genomic intervals,recurrent
regions can be represented using a small number of genomic inter-
vals within these regions.We call these genomic intervals markers.
Each marker is represented by two numbers <p,q>,where p and q
denote the position and the type of aberration,respectively.
Given a set S of NCGHsamples {s
1
,s
2
,  ,s
N
}.Let x
j
d
denote the
status value for sample j at genomic interval d,8d,1 d D,where
D is the number of intervals.Let s
j
[u,v] be the segment of s
j
that
starts at the uth interval and ends at the vth interval.We use the term
segment to represent a contiguous block of aberrations of the same
type.Formally,a list of status values x
j
u
‚x
j
uþ1
‚    ‚x
j
v
,for 1  u 
v  D define a segment if genomic intervals u through v are in the
same chromosome,the values fromx
j
u
to x
j
v
are all gains or all losses,
and x
j
u1
and x
j
vþ1
are different than x
j
u
.For example,in Figure 1,
sample X contains four segments.The first and third segments are
gain type while the second and fourth segment are loss type.
Let {m
t
¼ < p
t
,q
t
>j 1  t  R},be a set of markers that are
ordered along the genomic intervals,i.e.p
1
< p
2
<    < p
R
.We
define the support of s
j
to m
t
as s(s
j
,m
t
).Here,s(s
j
,m
t
) ¼1 only if
both of the following conditions are satisfied:
(1) Support:There exists a segment s
j
[u,v] overlapping with m
t
,
i.e.up
t
vandthe typeof s
j
[u,v] is thesame as that of m
t
,i.e.
x
j
u
¼ q
t
.
(2) Uniqueness:There exists no other marker m
t
0
,t
0
< t,in the
same chromosome band such that u  p
t
0
 v,x
j
u
¼ q
t
0
and
s(s
j
,m
t
0
) ¼ 1.
Otherwise,s(s
j
,m
t
) ¼ 0.We say s
j
supports m
t
or m
t
covers s
j
if s(s
j
,m
t
) ¼ 1.We define the support value of marker m
t
as the sum of its support from all the samples.Formally,
Supportðm
t
Þ ¼
P
N
j¼1
sðs
j
‚m
t
Þ.
The intuition behind these two conditions is as follows.
(1) Support:The support value of a marker counts the number of
samples that share the same aberration status as the type of this
marker.Large support value for a marker implies that that marker is
important in characterizing the aberration pattern of samples.The
first condition ensures that a sample supports a marker only if it
has the same aberration as that marker in the position specified by
that marker.(2) Uniqueness:The aberrations in the same segment
may correspond to a single aberration that spread to neighboring
genomic intervals Liu et al.(2006).The second condition forces a
segment overlapping with multiple markers of the same aberration
type to support only one of those markers.
Marker detection problem can be formally defined as follows.
Given a set of CGH samples S ¼ {s
1
,s
2
,  ,s
N
} and a positive
integer R,the goal is to find the set of R markers M ¼ {m
1
,
m
2
,  ,m
R
},p
1
< p
2
<...< p
R
,such that the sum of support of
markers in M,i.e.
P
R
t¼1
Supportðm
t
Þ,is maximized.Next,we
develop a dynamic programming algorithm to solve this problem
optimally.
Fig.1.Two CGHsamples Xand Y with the values of genomic intervals listed
in the order of positions.The segments are underlined.
Markers improve clustering of CGH data
451
Let Oðd‚rÞ ¼
P
r
t¼1
Supportðm
t
Þ,for 1 p
t
 d  D,8t 2[1,r]
denote the largest possible support that r markers can get from the
genomic intervals in the range [1..d].Here,[1..d] denote the integers
1,2,...,d.O(1,1) ¼ support of the single marker at the first
genomic interval.The value of O(d,r) in general,where 1 
r R,r d D,can be computed as the maximumof two possible
cases.
 Case 1:There exists a marker m
r
at the dth genomic interval.
In this case,O(d,r) can be computed as the maximum sum of
support of m
r
and the optimal value of locating r  1 markers.
Formally,it can be written as O(d,r) ¼ max
b1d
0
 d  1
{O(d
0
,r1) + Support(m
r
)},where b denotes the least genomic
interval that is in the same chromosome as the dth genomic
interval.Note that O(d
0
,r1) may correspond to different set
of r 1 markers for different values of d
0
.The type of marker m
r
can be either gain or loss.We choose the marker type as the one
that leads to a larger Support(m
r
) value.
 Case 2:There is no marker at the dth genomic interval.O(d,r) ¼
O (d  1,r) in this case.
Thus,O(d,r) can be computed using the following recursive
equation,O(d,r) ¼
max

Oðd  1‚r  1Þ þSupportðm
r
Þ
Oðd  2‚r  1Þ þSupportðm
r
Þ
  
Oðb  1‚r  1Þ þSupportðm
r
Þ

if m
r
appears at interval d
Oðd  1‚rÞ otherwise:
The marker set Mthat leads to O(D,R) corresponds to R markers
with the largest sumof supports.We call the above approach a fixed
number of markers approach.
An important feature of the dynamic programming approach is
that optimal solutions to subproblems are retained so as to avoid
recomputing their values Horowitz et al.(1998).We construct a D·
R matrix with cell (d,r) storing the optimal value of O(d,r).An
iterative program is implemented to fill this matrix.For each cell
(d,r),we need to revisit cell (g,r1),b 1 g d 1,which takes
constant time that is proportional to the average length of chromo-
some.Besides,we need O(N) time to compute Support(m
r
).So
the time complexity of filling one cell is O(N).The overall time
complexity of filling the whole matrix would be O(DNR).
3 PROTOTYPE-BASED CLUSTERING
Given a set of CGHsamples,our marker detection technique finds a
set of markers that characterize the aberration pattern of the samples
in that set and can be considered as the prototype or model repre-
senting the samples.In this section,we develop a prototype-based
clustering algorithm to partition the dataset into subsets such that
each subset has a prototype common to the samples in that subset.It
is similar in spirit to the k-means algorithm.The user specifies the
number of clusters,k,and the number of markers per cluster,R.It
starts with a random partitioning of the samples.It then iteratively
maximizes an objective function,which we call cohesion function,
in two steps.We discuss the cohesion function later.The two steps
are described as follows.
 Refinement step:For each cluster,the dynamic programming
technique developed in Section 2 is used to identify the optimal
set of R markers.These markers serve as the prototype of this
cluster.
 Reassignment step:Each sample is assignedto the cluster whose
prototype is supported the most by that sample.
Essentially,both steps optimize a cohesion function alternatively
until the value of this function does not change,i.e.converges to a
stable value.We formally prove this below.
Given a set of CGHsamples S ¼{s
1
,s
2
,...,s
N
}.Let Kdenote the
number of clusters.Let c
i
denote the ith cluster.A clustering of
samples can be represented by a K-partition,C ¼c
1
[ c
2
[ c
K
,that
divides samples into K disjoint clusters.Let R be the number of
markers in each cluster.Let M
i
¼ {m
i,t
j 1  t  R} denote the set
of markers (i.e.prototype) for cluster i,where 1  i  K,and
m
i,t
denote the tth marker in the ith cluster.Each marker m
i,t
is a
tuple <p
i,t
,q
i,t
>,where p
i,t
and q
i,t
denote the position and type
of this marker respectively.Let M denote the set of prototypes
M¼{M
1
,M
2
,...,M
K
}.We define a cohesion function of clustering
as below
cohesionðC‚MÞ ¼
X
K
i¼1
X
s
j
2c
i
X
R
t¼1
sðs
j
‚m
i‚ t
Þ‚
where s(s
j
,m
i,t
) is defined as same as in Section 2.Essentially,the
cohesion function computes the intra-cluster similarity between
samples and cluster prototypes.
In the refinement step,we optimize the cohesion function by
refining the prototype (markers) of clusters given that samples
are partitioned into K clusters.For cluster i,let O
i
ðd‚rÞ ¼
P
r
t¼1
Supportðm
t
Þ ¼
P
r
t¼1
P
s
j
2c
i
sðs
j
‚m
i‚ t
Þ denote the largest
support that t markers can get from the genomic intervals in the
range [1..d].Thus the optimal cohesion function can be written as
cohesionðC‚M
*
Þ ¼
P
K
i¼1
O
i
ðD‚RÞ where M* denotes the set of
refined prototypes and M* ¼ Argmax
M
(cohesion(C,M)).The
dynamic programming technique described in Section 2 is used
to compute O
i
(D,R) for 1  i  K and,in this way,the cohesion
function is optimized.
In the reassignment step,we reassign the sample s
j
to the ith
cluster whose prototype M
i
is supported the most by s
j
,i.e.M
i
covers
the largest number of segments in s
j
.This is because,otherwise,the
cohesion function could always be improved by moving s
j
into the
ith cluster.Formally,C* ¼ Argmax
C
(cohesion(C,M)) where C*
denotes the new partition of samples after reassignment step.
Proof of convergence:It can be seen that both steps,refinement
and reassignment step,are connected through the cohesion function
they alternatively optimize.The dynamic programming in refine-
ment step optimize the cohesion with respect to M.At iteration (h)
we have
cohesionðC
ðhÞ
‚M
ðhþ1Þ
Þ  cohesionðC
ðhÞ
‚M
ðhÞ
Þ
On the other hand,the reassignment step optimize the cohesion
based on C,i.e.
cohesionðC
ðhþ1Þ
‚M
ðhþ1Þ
Þ  cohesionðC
ðhÞ
‚M
ðhþ1Þ
Þ
Put together,our algorithmgenerates a sequence (C
(h)
,M
(h)
),h 0,
that increases the cohesion function as
cohesionðC
ðhþ1Þ
‚M
ðhþ1Þ
Þ  cohesionðC
ðhÞ
‚M
ðhÞ
Þ
J.Liu et al.
452
Since the maximum value of cohesion function is finite given a set
of samples,our algorithm converges to a stable value at the end.It
is worth noting that our algorithm,similar to k-means,does not
guarantee to converges to a global maxima.
Note that it is possible to generalize this method to variable
number of markers for different clusters with two modifications
to the algorithm.Instead of fixing the number of markers,we fix
the segment coverage of markers for each cluster in the refinment
step.The segment coverage is defined as the ratio of segments that
are covered by the markers to the total number of segments in the
cluster.The coverage ratio is a normalized value between 0 and 1.
It increases as the number of markers increases.The reassignment
step needs to be modified slightly as well as follows.A sample is
moved to a new cluster only if this sample supports the prototype
of new cluster more than that of old cluster and the coverage ratio
of the clusters do not drop to less than the given coverage ratio
cutoff.We tested the method with variable number of markers.
It did not produce better results than the prototype-based method
with fixed number of markers.Therefore,we do not report any
experimental results.
4 PAIRWISE SIMILARITY BASED CLUSTERING
Pairwise similarity based algorithms partition the samples into clus-
ters so that the similarity between samples fromthe same cluster are
larger than the similarity between samples of different cluster.This
generally requires calculating the similarity between two samples.
In our previous work Liu et al.(2006),we proposed a segment-
based similarity measure,called Sim,for CGH data.In this section,
we develop a new similarity measure,called RSim,which avoids
noisy aberrations with the help of markers.
Let Ddenote the number of genomic intervals of each sample.Let
s
i
¼ x
i
1
‚x
i
2
‚    ‚x
i
D
and s
j
¼ x
j
1
‚x
j
2
‚    ‚x
j
D
be two CGH samples.
Here,x
i
d
and x
j
d
denote the value or status of the dth genomic
interval of s
i
and s
j
,respectively.
We,first,summarize the Sim measure we developed for com-
puting the similarity of two CGH data.We call two segments from
two samples overlapping if they have at least one common genomic
interval of the same type.Sim constructs maximal segments by
combining as many contiguous aberrations of the same type as
possible.Thus,each sample translates into a sequence of segments.
For example,in Figure 1,s
i
and s
j
are two samples that have four
and two segments,respectively.After this transformation,Sim
computes the similarity between two CGH samples as the number
of overlapping segment pairs.This is because each overlap may
indicate a common point-like aberration in both samples which
potentially led to the overlapping segments.In Figure 1,the first
segment of s
i
is overlapping with the first segment of s
j
.Similarly
the third segment of s
i
is overlapping with the second segment of
s
j
.Sim computes the similarity between two samples based on the
genomic aberrations local to these samples.Thus,Sim cannot dis-
tinguish the true aberrations from noisy ones.As a result,Sim is a
local measure that is easily biased by the noise.Next,we develop a
new approach that addresses this limitation.
We propose to employ markers to eliminate the contribution of
noise to the pairwise similarity.We develop a refined Simmeasure,
called RSim,as follows.Let M ¼ {m
1
,m
2
,  ,m
R
},p
1
< p
2
<    <
p
R
,denote the set of markers that are globally identified in all
samples.The markers imply the important genomic intervals that
are associated with the aberration patterns of samples.Given two
CGHsamples s
i
and s
j
,RSimcomputes the similarity between them
as the number of overlapping segments pairs,such that these seg-
ments satisfy both of the following two conditions:
(1) At least one of the markers in Mis contained in both segments.
(2) Both of the overlapping segments have the same aberration
type as the marker they both contain.
Formally,let x
i
u
‚x
i
uþ1
‚    ‚x
i
v
and x
j
u
0
‚x
j
u
0
þ1
‚    ‚x
j
v
0
be a pair of
segment from samples s
i
and s
j
,respectively.RSim counts this
pair of segments as one only if
(1) There exists a marker m
t
¼<p
t
,q
t
>,m
t
2M,suchthat up
t
v
and u
0
 p
t
 v
0
,and
(2) The aberration type of both segments is the same as that of m
t
,
i.e.x
i
u
¼ x
j
u
0
¼ q
t
.
Unlike Sim,RSim does not consider the overlapping segments that
do not intersect with any marker.This is because RSim considers
such segments as noise.For example,assume that there are two
markers in Figure 1,one at the 3rd and the other at the 11th genomic
interval.Then RSim measure for s
i
and s
j
is computed as one,
whereas Sim measure is two.
An important observation on RSim is as follows.As the number
of markers approaches to the number of genomic intervals in the
CGH data,RSim becomes equivalent to the Sim measure.This is
because all segments in the CGH data will overlap with a marker
and contribute to the similarity.Thus Simis a special case of RSim
when noisy aberrations are not eliminated.
Our previous work Liu et al.(2006) showed that Sim works best
when combined with topdown algorithm,compared to other popular
clustering algorithms,such as bottom-up and k-means.In this paper,
we propose to use the topdown clustering method with RSimas the
pairwise similarity measure for pairs of CGH samples.
Note that it is possible to extend the Rawmeasure in our previous
work Liu et al.(2006) by taking the markers into account.The
extended Raw measure works as follows.For each pair of samples,
we compute the similarity between themas the number of genomic
intervals that meet the following two conditions.First,both samples
have the same aberration type (gain or loss) at this interval.Second,
one marker of the same type as both samples appears at this interval.
Our experiments (results ommited due to space limit) show that
although the markers improve the original Raw measure,RSim is
always superior to this measure.
5 EXPERIMENTAL RESULTS
Dataset.With >12000 cases Baudis (2006),the largest resource
for published CGH data can be found in the Progenetix database
Baudis and Cleary (2001) (http://www.progenetix.net).We use
three different datasets,dissimilarDS,interDS and similarDS,
taken from Progenetix databases.Each dataset contains >800
CGHsamples (i.e.cytogenetic imbalance profiles of tumor samples)
fromfour different histopathological cancer types.Each sample has
been coded according to the ICD-O-3 systemFritz et al.(2000) and
consists of 862 ordered genomic intervals extracted from 24 chro-
mosomes.In principle,each dataset can be mapped to an integer
matrix of size N· 862,where Ndenotes the number of samples.The
difference of these three datasets are the divergence of aberration
Markers improve clustering of CGH data
453
patterns in distinct cancer types.In dissimilarDS,the samples of
different cancer types contain diverse aberration patterns that are
easily distinguished from each other.The samples in similarDS
contain similar aberration patterns.The interDS dataset is at an
intermediate degree.The choices of these datasets are based on a
visual inspection of the matrices (as in Fig.1 in appendix) for each
of the cancer types.
System specifications.Our experimental simulations were run
on a systemwith dual 2.59 GHz AMDOpteron Processors,8 GB of
RAM,and an Linux operating system.
5.1 Quality of clustering
Measuring the quality of clustering is a challenging task as it is
an unsupervised learning task.There are a number of internal
and external cluster validation techniques that are described in
the literature.In the following,we describe two measures that
can be used for evaluating the clustering.
(1) Coverage Measure (CM):An internal measure evaluates the
quality of clusters if the class labels of samples are not known
a priori.One possible internal measure is the cohesionfunction
defined in Section 3.This function measures the total support
of markers over each cluster.We use the termCoverage Mea-
sure (CM) to denote this measure.Markers with high-support
potentially convey some meaningful biological information
and potentially can serve as the first step for further analysis,
such as the identification of new oncogenes and tumor sup-
pressor genes.A group of markers can be considered as bio-
logically relevant if they cover most of the segments in all the
patients.
(2) Normalized Mutual Information (NMI):One way to measure
the quality of clustering is to see if each cluster predominantly
consists of samples from a single cancer type.This clearly
makes the assumption that this information is available (as
was the case for datasets used in this paper) and those samples
from the same cancer type will generally be similar to each
other.The external measure,knownas the Normalized Mutual
Information,described belowcan be used for this purpose.Let
N,Uand Kdenote the total number of samples,the number of
different cancer types andthe number of clusters,respectively.
Let a
1
,a
2
,  ,a
m
denote the number of samples that belong to
each cancer type.Similarly,let b
1
,b
2
,  ,b
k
be the number of
samples that belong to each cluster.Let c
i,j
,8i,j,1 i Uand
1  j  K,denote the number of samples in jth cluster that
belongtotheithcancer type.TheNMI functionis computedas:
NMI ¼
P
U
i¼1
P
K
j¼1
c
i‚ j
log

N∙ c
i‚ j
a
i
b
j

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

P
i
a
i
log
a
i
N

P
j
b
j
log
b
j
N

s
:
NMI function takes a value in [0..1] interval.Higher values
indicate better clustering qualities.Unlike other external mea-
sures,NMI was computedbasedonmutual informationI(X;Y)
between a randomvariable X,governing the cluster labels and
a random variable Y,governing the cancer types.It has been
argued that the mutual information is a superior measure than
purity or entropy Strehl and Ghosh (2002).Moreover,NMI is
quite impartial to the number of clusters Zhong and Ghosh
(2005).
5.2 Quality of markers
Measuring the quality (or the biological relevance) of the markers
identified for the clusters is also important.Here,we develop a
measure to address this.We combine all the markers found in
each cluster.For each marker in combined marker set,we first
compute the ratio of the samples that support it among the samples
in each cancer type.Thus,if there are T cancer types in the dataset,
T values are computed for each marker.We define the maximum
of these ratios as the biological relevance of this marker.This is
because a larger ratio of support from one cancer type indi-
cates that the marker better captures the aberration pattern of
that cancer type.Therefore,this marker is biologically relevant
in that cancer type.We use the term Global Maximum Support
(GMS) to represent this measure.Formally,let M¼
fm
1
‚m
2
‚...‚m
r
g denote the set of markers.Let C
i
,1 i T denote
the set of samples that belong to ith cancer type,where T is the
number of different cancer types in the dataset.GMS measure for
marker m
i
2 Mis computed as:
GMSðm
i
Þ ¼ max
1tT

1
j C
t
j
X
s
j
2C
t
sðs
j
‚m
i
Þ

Note that GMS differs greatly from CM for the CM measure is
computed over clusters identified by the underlying clustering
strategy,whereas GMS is computed over the cancer types.
5.3 Evaluation
We tested the prototype-based approach and two similarity-based
approaches,RSim and Sim Liu et al.(2006) over three datasets,
dissimilarDS,interDS and similarDS.For each clustering method
and dataset,we created 6,8 and 10 clusters.For each number of
clusters,we tried different number of markers,i.e.4,6,8,10 and 12
markers,per cluster in the prototype-based approach.This is
because biologists have pointed out that a total of 4–7 genetic
alterations can be estimated for the malignant transformation of
a cell Renan (1993).Thus,we estimated that the number of aber-
rations common to the samples of one cluster could be around 10.
For the consistency,we also employed different numbers of markers
for each clustering of 6,8 and 10 clusters in RSim.Here,the number
of markers is determined as the product of number of clusters and
the number of markers per cluster used in the prototype-based
approach.For example,24,36,48,60 and 72 markers were
found to create six clusters using RSim.We compared three
methods according to the qualities of clusters.We evaluated the
cluster qualities using both NMI and CMmeasures.To evaluate the
cluster qualities using CM,for both RSim and Sim,we identified
the markers for each resulting cluster.The number of markers per
cluster is the same as that used in the prototype-based approach.
We compute the error bars for part of the results.We also used
GMS to evaluate the biological relevance of markers found in
prototype-based approach and RSim approach.
The CM results are shown in Table 1.The CM values mono-
tonically increase as the number of clusters or number of markers
increase.Thus,we compare three clustering methods for the same
number of clusters and markers.We observe that prototype-based
approach has 8–34 % better coverage than RSim and 15–41%
J.Liu et al.
454
better coverage than Sim.This is because the cohesion function
optimized in prototype-based approach is the same as the Coverage
Measure.RSim and Sim do not optimize CM directly.The results
also show that RSim is superior to Sim most of the time.This is
because the use of markers in RSim filters out the noise that are
irrelevant to the aberration patterns.As a result,the markers in each
cluster are supported by more samples in RSimas compared to Sim.
The NMI results are shown in Table 2.Since Sim produces
clusters without finding markers,we list its results on a separate
column.The results show that all three methods perform better on
dissimilarDS than interDS and similarDS.This is because the aber-
ration patterns of distinct cancer types in dissimilarDS are diver-
gent.Thus,it is harder to cluster interDS and similarDS datasets.
From the results,we observe that RSim and Sim always beat
prototype-based approach in terms of the NMI values.This obser-
vation,together with the conclusion from results in Table 1,indi-
cates that NMI measure has no apparent relationship with CM.This
is because NMI computes the quality based on the class labels of
samples.On the other hand,CM evaluates the compactness of
samples in each cluster based on chromosomal aberration patterns
and completely ignores the class labels.Therefore,we conclude that
the pairwise similarity-based clustering approaches are more suit-
able to external measures,such as NMI,while the prototype-based
approach works better for the CM.RSim usually has the best NMI
results using 10 markers.When compared to Sim,RSimusually has
better NMI values.This indicates that the use of markers in refining
the pairwise similarity also leads to a better clustering in terms of
NMI.Given that RSim has better CM (Table 1) and NMI values
than Sim (Table 2),we conclude that RSim is a better pairwise
similarity measure than Sim.
We compute the error bars of experimental results as follows.We
randomly sample 50% of each dataset and cluster it using three
methods,prototype-based,RSim and Sim,described in the paper.
To reduce the amount of computations,we choose one configuration
from different combinations of parameters in the experiments.For
each dataset,we create eight clusters.We identify 10 markers per
cluster in the prototype-based approach and 80 markers in RSim
approach.We then compute both NMI and CM values for the
resulting clusters using 10 markers per cluster.We repeat this pro-
cess 100 times and compute the error bar for the values of NMI and
CM.The error bar indicates the interval where 5–95%of the results
lie.Table 3 shows the results with error bars.Note that the results of
CM are roughly half the results shown in Table 1 because the
calculation of CM depends on the dataset and we sample 50% of
each dataset in the experiments.The results show that RSim is
superior to Sim most of the time in terms of both NMI and CM.
Moreover,among the three clustering methods,RSim and
prototype-based approach works the best for NMI and CM mea-
sures,respectively.These observations are compatible to those we
obtained fromTables 1 and 2.Therefore,the error bars confirmour
earlier conclusions.
Next experiment compares the GMS values for RSIM and
prototype-based clustering approaches.For each dataset,we created
eight clusters and identified 10 markers per cluster.This is because
these results are among the best results of each dataset in Table 2.
We then sort the markers in descending GMS value order.We plot
the sorted results of both RSim and prototype-based approach
(Fig.2).The plots show that the maximumglobal support of mark-
ers found by RSim is always comparable to or better than those
found by the prototype-based approach.
6 CONCLUSIONS
We considered the problemof clustering CGH data of a population
of cancer patient samples.There are three main contributions of
our work:
(1) We developed a dynamic programming algorithmto identify
the optimal set of important genomic intervals called markers.
The advantage of using these markers is that the potentially
noisy genomic intervals are excluded in the computation of
pairwise similarity between samples.
(2) We developed two clustering strategies using these markers.
The first one,prototype-based approach,maximizes the
support for the markers.The second one,similarity-based
approach,develops a new similarity measure called RSim.
It iteratively refines clusters with the aim of maximizing the
RSim measure between the samples in the same cluster.
We demonstrated the utility of such a measure in improving
the quality of clustering using the classified disease entities
from the Progenetix database.Our results show that the
markers we found represent the aberration patterns of cancer
types well.
Table 1.The coverage measure for the three clustering methods applied over
each of three datasets
Dataset K Alg Number of markers
4K 6K 8K 10K 12K
Proto 1655 2107 2465 2707
2953
6
RSim
1520 1916 2190 2512 2722
Sim
1316 1741 2051 2325 2571
Proto
1738 2192 2524 2769 3038
DissimilarDS
8
RSim
1585 1925 2254 2516 2817
Sim
1470 1883 2209 2484 2719
Proto
1839 2214 2613 2928 3149
10
RSim
1645 1938 2371 2575 2806
Sim
1480 1895 2212 2480 2716
Proto
1328 1696 2007 2318 2531
6
RSim
993 1463 1764 2006 2257
Sim
945 1303 1611 1884 2123
Proto
1345 1770 2104 2415 2709
InterDS
8
RSim
1011 1473 1821 2114 2373
Sim
1004 1382 1706 1993 2246
Proto
1443 1860 2189 2519 2779
10
RSim
1192 1533 1825 2156 2467
Sim
1113 1486 1798 2078 2322
Proto
1793 2197 2512 2834 3161
6
RSim
1430 1868 2265 2541 2808
Sim
1414 1830 2206 2530 2775
Proto
1827 2257 2643 2973 3298
SimilarDS
8
RSim
1627 2061 2400 2694 3044
Sim
1530 1956 2316 2625 2889
Proto
1946 2438 2908 3274 3551
10
RSim
1734 2153 2506 2831 3197
Sim
1636 2100 2492 2822 3116
Here,K denotes the number of clusters.Proto denotes the prototype-based approach.
Markers improve clustering of CGH data
455
Table 2.The NMI values of the three clustering methods are applied over each of three datasets
Dataset K Sim Alg Number of markers
4K 6K 8K 10K 12K
DissimilarDS 6 0.41 Proto
RSim
0.31
0.45
0.31
0.49
0.29
0.50
0.29
0.49
0.24
0.50
8
0.47 Proto
RSim
0.32
0.48
0.28
0.49
0.30
0.49
0.25
0.51
0.28
0.49
10
0.43 Proto
RSim
0.35
0.46
0.33
0.44
0.37
0.47
0.34
0.50
0.29
0.49
InterDS
6
0.34 Proto
RSim
0.06
0.35
0.08
0.34
0.08
0.38
0.08
0.40
0.11
0.39
8
0.32 Proto
RSim
0.07
0.34
0.10
0.34
0.09
0.36
0.10
0.38
0.11
0.36
10
0.32 Proto
RSim
0.10
0.35
0.09
0.36
0.10
0.36
0.15
0.37
0.12
0.36
SimilarDS
6
0.39 Proto
RSim
0.14
0.29
0.13
0.41
0.06
0.39
0.07
0.39
0.07
0.40
8
0.36 Proto
RSim
0.14
0.36
0.09
0.38
0.11
0.38
0.07
0.38
0.05
0.37
10
0.33 Proto
RSim
0.08
0.35
0.08
0.37
0.10
0.35
0.06
0.37
0.07
0.37
Here,K denotes the number of clusters.Proto,denotes the prototype-based approach.
Table 3.The percentage error for three clustering methods,prototype-based (denoted as Proto),RSim and Sim,over each of three datasets,dissimilarDS,
interDS and similarDS,to create eight clusters and identify 10 markers per cluster
NMI CM
5% Median 95% 5% Median 95%
Proto 0.20 0.28 0.34 1372 1431 1487
DissimilarDS RSim 0.45 0.51 0.55
1230
1291 1337
Sim 0.36 0.42 0.47
1187
1241 1289
Proto 0.05 0.09 0.13
1156
1228 1285
InterDS RSim 0.32 0.37 0.41
985
1042 1120
Sim 0.27 0.30 0.33
963
1032 1098
Proto 0.06 0.09 0.13
1480
1535 1581
SimilarDS RSim 0.33 0.37 0.40
1294
1342 1401
Sim 0.29 0.33 0.37
1283
1327 1376
The resulting clusters are evaluated using both NMI and CM.Here,5 and 95% denote the 5th and 95th percentile,respectively.
Fig.2.Plots of Global Maximumsupport of markers foundin(a) dissimilarDS,(b) interDSand(c) similarDS,respectively,usingbothprototype-based approach
and RSim.The results of prototype-based approach (denoted as Proto) and RSim are plotted in solid line and dashed line,respectively.
J.Liu et al.
456
(3) We developed several measures for comparing markers and
different clustering methods.Our experimental results show
that optimizingfor the coverage measure maynot leadtobetter
values of NMI and vice versa.
ACKNOWLEDGEMENTS
The authors would like to thank Dr Michael Baudis for confirming
the selection of datasets.The work of authors is supported in part by
the National Science Foundation under Grant ITR 0325459.Any
opinions,findings and conclusions or recommendations expressed
in this material are those of the author(s) and do not necessarily
reflect the views of the National Science Foundation.
Conflict of Interest:none declared.
REFERENCES
Baudis,M.and Cleary,M.L.(2001) Progenetix.net:an online repository for molecular
cytogenetic aberration data.Bioinformatics,17,1228–1229.
Baudis,M.(2006) An online database and bioinformatics toolbox to support data
mining in cancer cytogenetics.Biotechniques,40,269–270,272.
Beattie,B.J.and Robinson,P.N.(2006) Binary state pattern clustering:A digital para-
digm for class and biomarker discovery in gene microarray studies of cancer.
J.Comput.Biol.,13,1114–1130.
Bentz,M.et al.(1996) High incidence of chromosomal imbalances and gene
amplifications in the classical follicular variant of follicle center lymphoma.
Blood,88,1437–1444.
Kallionieni,A.et al.(1992) Comparative genomic hybridization for molecular
cytogenetic analysis of solid tumors.Science,258,818–821.
Fridlyand,J.et al.(2004) Hidden markov models approach to the analysis of array CGH
data.J.Multivar.Anal.,90,132–153.
Desper,R.et al.(1999) Inferring tree models for oncogenesis fromcomparative genome
hybridization data.J.Comput.Biol.,6,37–52.
Fritz,A.,Percy,C.,Jack,A.,Sobin,L.and Parkin,M.(ed.) (2000) International Classi-
fication of Diseases for Oncology (ICD-O),3rd edn.World Health Organization,
Geneva.
Fu,L.M.and Fu-Liu,C.S.(2005) Evaluation of gene importance in microarray data
based upon probability of selection.BMC Bioinformatics,6,67.
Gray,J.W.et al.(1994) Molecular cytogenetics of human breast cancer.Cold Spring
Harb.Symp.Quant.Biol.,59,645–652.
Hoglund,M.et al.(2005) Statistical behavior of complex cancer karyotypes.
Genes Chromosomes Cancer,42,327–341.
Horowitz,E.,Sahni,S.and Rajasekaran,S.(1998) Computer Algorithms/C++.Computer
Science Press,New York,NY,USA.
Jain,A.K.et al.(1999) Data clustering:a review.ACM Comput.Surv.,31,
264–323.
Jiang,D.et al.(2004) Cluster analysis for gene expression data:a survey.IEEE Trans.
Knowl.Data Eng.,16,1370–1386.
Joos,S.et al.(2002) Classical hodgkin lymphoma is characterized by recurrent
copy number gains of the short arm of chromosome 2.Blood,99,
1381–1387.
Liu,J.et al.(2006) Distance-based clustering of CGH data.Bioinformatics,22,
1971–1978.
Mattfeldt,T.et al.(2001) Cluster analysis of comparative genomic hybridization CGH
data using self-organizing maps:application to prostate carcinomas.Anal.Cell.
Pathol.,23,29–37.
Mitelman,F.et al.(1972) Tumor etiology and chromosome pattern.Science,176,
1340–1341.
Mitelman,F.(ed.) (1995) International System for Cytogenetic Nomenclature.Karger,
Basel.
Picard,F.,Robin,S.,Lebarbier,E.and Daudin,J.-J.(2005a) A segmentation-clustering
problem for the analysis of array CGH data.In Applied Stochastic Models and
Data Analysis.
Picard,F.et al.(2005b) A statistical approach for array CGH data analysis.BMC
Bioinformatics,6,27.
Pinkel,D.and Albertson,D.G.(2005) Array comparative genomic hybridization and its
applications in cancer.Nat.Genet.,37 (Suppl.),S11–S17.
Renan,M.J.(1993) How many mutations are required for tumorigenesis?implications
from human cancer data Mol.Carcinog.,7,139–146.
Rouveirol,C.et al.(2006) Computation of recurrent minimal genomic alterations from
array-cgh data.Bioinformatics,22,849–856.
Strehl,A.and Ghosh,J.(2002) Cluster ensembles—a knowledge reuse framework for
combining partitionings.In Proceedings of AAAI 2002,AAAI,Edmonton,Canada,
pp.93–98.
Vandesompele,J.et al.(2005) Unequivocal delineation of clinicogenetic subgroups and
development of a new model for improved outcome prediction in neuroblastoma.
J.Clin.Oncol.,23,2280–2299.
Wang,P.et al.(2005) A method for calling gains and losses in array CGH data.
Biostatistics,6,45–58.
Willenbrock,H.and Fridlyand,J.(2005) A comparison study:applying segmentation
to array CGH data for downstream analyses.Bioinformatics,21,4084–4091.
Zhong,S.and Ghosh,J.(2005) Generative model-based document clustering:a
comparative study.Knowl.Inf.Syst.,8,374–384.
Markers improve clustering of CGH data
457