Vol.23 no.4 2007,pages 450–457

doi:10.1093/bioinformatics/btl624

BIOINFORMATICS ORIGINAL PAPER

Gene expression

Markers improve clustering of CGH data

Jun Liu

,Sanjay Ranka and Tamer Kahveci

Computer and Information Science and Engineering,University of Florida,Gainesville,FL 32611,USA

Received on September 24,2006;revised and accepted on December 4,2006

Advance Access publication December 6,2006

Associate Editor:John Quackenbush

ABSTRACT

Motivation:We consider the problem of clustering a population of

Comparative Genomic Hybridization (CGH) data samples using simi-

larity based clustering methods.A key requirement for clustering is to

avoid using the noisy aberrations in the CGH samples.

Results:We develop a dynamic programming algorithm to identify a

small set of important genomicintervalscalledmarkers.Theadvantage

of using these markers is that the potentially noisy genomic intervals

are excluded during the clustering process.We also develop two clus-

tering strategies using these markers.The first one,prototype-based

approach,maximizes the support for the markers.The second one,

similarity-based approach,develops a new similarity measure called

RSimandrefinesclusterswiththeaimof maximizingtheRSimmeasure

between the samples in the same cluster.Our results demonstrate

that the markers we found represent the aberration patterns of cancer

types well and they improve the quality of clustering significantly.

Availability:All software developed in this paper and all the datasets

used are available fromthe authors upon request.

Contact:juliu@cise.ufl.edu

1 INTRODUCTION

Numerical and structural chromosomal imbalances are one of

the most prominent and pathogenetically relevant features of neo-

plastic cells Mitelman et al.(1972) One method for measuring

genomic aberrations is Comparative Genomic Hybridization

(CGH) Kallioniemi et al.(1992).CGH is a molecular-cytogenetic

analysis method for detecting regions with genomic imbalances

(gains or losses of DNA segments).Applying microarray techno-

logy to CGH measures thousands of copy number information

distributed throughout the genome simultaneously Pinkel and

Albertson (2005).Raw data from array CGH experiments is

expressed as the ratio of normalized ﬂuorescence of tumor and

reference DNA.Normalized CGH ratio data surpassing predeﬁned

thresholds is considered indicative for genomic gains or losses,

respectively.In contrast to the array CGH,the chromosomal

CGHresults (on which this paper is based) are annotated in a reverse

in situ karyotype format (Mitelman,1995) describing imbalanced

genomic regions with reference to their chromosomal location.

CGH data of an individual tumor can be considered as an ordered

list of status values,where each value corresponds to a genomic

interval (e.g.a single chromosomal band).The status can be

expressed as a real number (positive,negative or zero for gain,

loss or no aberration,respectively).We use this strategy and rep-

resent gain,loss and no change with +1,1 and 0,respectively.

Chromosomal and array CGH recently account for the majority of

published analyses in cancer cytogenetics Gray et al.(1994);Bentz

et al.(1996);Desper et al.(1999);Hoglund et al.(2005);Mattfeldt

et al.(2001);Vandesompele et al.(2005);Joos et al.(2002).

A lot of different strategies for structural analysis of CGH data

have been developed.Some of these analysis have been aimed at

the spatial coherence of genomic segments with different copy

number levels.Many attempts have been made on this area,such

as a Gaussian model-based segmentation method Picard et al.

(2005b),unsupervised hidden Markov model approach Fridlyand

et al.(2004),hierarchical clustering tree Wang et al.(2005)

and segmentation-clustering approach Picard et al.(2005a) etc.

Willenbrock and Fridlyand (2005) pointed out that the smoothed

(segmented) CGH data aids the downstream analyses,such as

classiﬁcation.

Unsupervised clustering methods are often employed to discover

previously unknown sub-categories of cancer and to identify genetic

biomarkers associated with the differentiation.Towards this end,

many attempts have been done on the studies of gene microarray

Jiang et al.(2004).However,to the best of our knowledge,so far

there has been very limited study on the clustering of smoothed

CGH data,especially on the heterogeneous sets of data that incl-

ude a large (more than several hundreds) population of samples.

In Mattfeldt et al.(2001),used a self-organizing map as a tool for

cluster analysis of CGH data.However,their studies only focus on

several tens of cases from a single cancer type.

Existing clustering algorithms Jain et al.(1999) cannot be applied

to CGH data directly.This is because CGH data are structurally

different from ordinary high-dimensional data since consecutive

dimensions (i.e.genomic intervals) may be correlated.A point-

like genomic aberration may expand to the neighboring intervals

and result in a contiguous run of gain or loss status in CGH data.

In our earlier work Liu et al.(2006),we developed a similarity

measure,Sim,for a pair of CGH samples.Sim measures the simi-

larity as the number of overlapping segments (contiguous runs of

aberrations of the same type).Although Sim leads to better clus-

tering than existing measures over a large population of CGH sam-

ples,it has a drawback.Sim is affected from noisy aberrations in

CGH data since it cannot distinguish random aberrations from the

aberrations that actually lead to the underlying cancer.

Given a set of samples belonging to the same cancer type,we

observe that a small number of point aberrations are common to a

subset of samples.We say that these point aberrations are supported

by these samples.(See Fig.1 in Appendix for an example).Such

aberrations with high-support (we will use the term markers) rep-

resent the recurrent genomic imbalances and can be used to char-

acterize the imbalance proﬁle of the sample set.Some recent

To whom correspondence should be addressed.

450

The Author 2006.Published by Oxford University Press.All rights reserved.For Permissions,please email:journals.permissions@oxfordjournals.org

by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from

by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from

by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from

by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from

by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from

by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from

by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from

by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from

analysis of CGH data discovered important genes or chromosomal

regions that are similar to our notion of markers.For example,

Rouveirol et al.(2006) proposed algorithms for computing recurrent

minimal genomic alterations from discretized CGH data.Fu and

Fu-Liu (2005) proposed a probability model of selection to eva-

luate the importance of genes in microarray data and choose a small

set of biologically signiﬁcant genes for classiﬁer design.Beattie

and Robinson (2006) used a digital paradigm to discretize the

gene microarray and transfered the data into binary state patterns.

A clustering based on Hamming distance was applied to discover

biomarkers.However,these works have not discussed the useful-

ness of markers for the clustering of CGH data.

The aimof this paper is to demonstrate that better clusters can be

found for CGHdatasets with the help of markers.The main thesis is

that markers represent the key imbalances between all the patients

and the non-marker areas are patient speciﬁc and hence negatively

impact the clustering process.

Towards this end,we develop a dynamic programming technique

to detect important markers in a group of samples.The resulting

markers can be seen as the prototype of these samples.Clustering

is an unsupervised learning process by its deﬁnition.The quality of

a clustering can be measured using several criteria.Two important

such criteria are coverage and the NMI measure.These are internal

and external measures,respectively.We develop two methods,

each working well for different measure.Depending on the purpose

of clustering either of the two methods can be appropriate.

In particular:

(1) The first one is a prototype-based approach.It starts with a

randominitial clustering and iteratively improves the clusters

in two steps.The first step identifies a set of markers for each

cluster that have sufficient support.The second step assigns

each sample tothe cluster whose markers are supportedbythat

sample the most.This algorithm is guaranteed to converge.

(2) The second one is a similarity-based algorithm.It starts by

identifying a set of markers that capture the aberration pat-

terns globally.Wedevelopanewsimilaritymeasure,RSim,by

focusing on the aberrations at these marker positions.It then

computes the average similarity between that sample and all

the samples in each cluster using RSim.It assigns each sample

to the cluster with the highest average similarity.

Our results demonstrate the utility of ﬁnding markers.Experi-

mental results show that the prototype-based method has better

internal measure than the competing methods.The similarity-

based method has better external measure.

The rest of the paper is organized as follows.Section 2 proposes

the technique of ﬁnding markers in a group of samples.Section 3

describes the prototype-based algorithm and proves its conver-

gence.Section 4 describes the reﬁned Sim measure and the

similarity-based algorithm.Section 5 presents the experimental

results.Section 6 concludes our work in this paper.

2 DETECTION OF MARKERS

Most cancers result from genomic instability and display various

genomic aberrations.A recurrent alteration is a set of aberrations

common to sufﬁciently many CGH samples.The recurrent alter-

ations can be used to characterize the aberration pattern of samples.

Due to the correlation between adjacent genomic intervals,recurrent

regions can be represented using a small number of genomic inter-

vals within these regions.We call these genomic intervals markers.

Each marker is represented by two numbers <p,q>,where p and q

denote the position and the type of aberration,respectively.

Given a set S of NCGHsamples {s

1

,s

2

, ,s

N

}.Let x

j

d

denote the

status value for sample j at genomic interval d,8d,1 d D,where

D is the number of intervals.Let s

j

[u,v] be the segment of s

j

that

starts at the uth interval and ends at the vth interval.We use the term

segment to represent a contiguous block of aberrations of the same

type.Formally,a list of status values x

j

u

‚x

j

uþ1

‚ ‚x

j

v

,for 1 u

v D deﬁne a segment if genomic intervals u through v are in the

same chromosome,the values fromx

j

u

to x

j

v

are all gains or all losses,

and x

j

u1

and x

j

vþ1

are different than x

j

u

.For example,in Figure 1,

sample X contains four segments.The ﬁrst and third segments are

gain type while the second and fourth segment are loss type.

Let {m

t

¼ < p

t

,q

t

>j 1 t R},be a set of markers that are

ordered along the genomic intervals,i.e.p

1

< p

2

< < p

R

.We

deﬁne the support of s

j

to m

t

as s(s

j

,m

t

).Here,s(s

j

,m

t

) ¼1 only if

both of the following conditions are satisﬁed:

(1) Support:There exists a segment s

j

[u,v] overlapping with m

t

,

i.e.up

t

vandthe typeof s

j

[u,v] is thesame as that of m

t

,i.e.

x

j

u

¼ q

t

.

(2) Uniqueness:There exists no other marker m

t

0

,t

0

< t,in the

same chromosome band such that u p

t

0

v,x

j

u

¼ q

t

0

and

s(s

j

,m

t

0

) ¼ 1.

Otherwise,s(s

j

,m

t

) ¼ 0.We say s

j

supports m

t

or m

t

covers s

j

if s(s

j

,m

t

) ¼ 1.We deﬁne the support value of marker m

t

as the sum of its support from all the samples.Formally,

Supportðm

t

Þ ¼

P

N

j¼1

sðs

j

‚m

t

Þ.

The intuition behind these two conditions is as follows.

(1) Support:The support value of a marker counts the number of

samples that share the same aberration status as the type of this

marker.Large support value for a marker implies that that marker is

important in characterizing the aberration pattern of samples.The

ﬁrst condition ensures that a sample supports a marker only if it

has the same aberration as that marker in the position speciﬁed by

that marker.(2) Uniqueness:The aberrations in the same segment

may correspond to a single aberration that spread to neighboring

genomic intervals Liu et al.(2006).The second condition forces a

segment overlapping with multiple markers of the same aberration

type to support only one of those markers.

Marker detection problem can be formally deﬁned as follows.

Given a set of CGH samples S ¼ {s

1

,s

2

, ,s

N

} and a positive

integer R,the goal is to ﬁnd the set of R markers M ¼ {m

1

,

m

2

, ,m

R

},p

1

< p

2

<...< p

R

,such that the sum of support of

markers in M,i.e.

P

R

t¼1

Supportðm

t

Þ,is maximized.Next,we

develop a dynamic programming algorithm to solve this problem

optimally.

Fig.1.Two CGHsamples Xand Y with the values of genomic intervals listed

in the order of positions.The segments are underlined.

Markers improve clustering of CGH data

451

Let Oðd‚rÞ ¼

P

r

t¼1

Supportðm

t

Þ,for 1 p

t

d D,8t 2[1,r]

denote the largest possible support that r markers can get from the

genomic intervals in the range [1..d].Here,[1..d] denote the integers

1,2,...,d.O(1,1) ¼ support of the single marker at the ﬁrst

genomic interval.The value of O(d,r) in general,where 1

r R,r d D,can be computed as the maximumof two possible

cases.

Case 1:There exists a marker m

r

at the dth genomic interval.

In this case,O(d,r) can be computed as the maximum sum of

support of m

r

and the optimal value of locating r 1 markers.

Formally,it can be written as O(d,r) ¼ max

b1d

0

d 1

{O(d

0

,r1) + Support(m

r

)},where b denotes the least genomic

interval that is in the same chromosome as the dth genomic

interval.Note that O(d

0

,r1) may correspond to different set

of r 1 markers for different values of d

0

.The type of marker m

r

can be either gain or loss.We choose the marker type as the one

that leads to a larger Support(m

r

) value.

Case 2:There is no marker at the dth genomic interval.O(d,r) ¼

O (d 1,r) in this case.

Thus,O(d,r) can be computed using the following recursive

equation,O(d,r) ¼

max

Oðd 1‚r 1Þ þSupportðm

r

Þ

Oðd 2‚r 1Þ þSupportðm

r

Þ

Oðb 1‚r 1Þ þSupportðm

r

Þ

if m

r

appears at interval d

Oðd 1‚rÞ otherwise:

The marker set Mthat leads to O(D,R) corresponds to R markers

with the largest sumof supports.We call the above approach a ﬁxed

number of markers approach.

An important feature of the dynamic programming approach is

that optimal solutions to subproblems are retained so as to avoid

recomputing their values Horowitz et al.(1998).We construct a D·

R matrix with cell (d,r) storing the optimal value of O(d,r).An

iterative program is implemented to ﬁll this matrix.For each cell

(d,r),we need to revisit cell (g,r1),b 1 g d 1,which takes

constant time that is proportional to the average length of chromo-

some.Besides,we need O(N) time to compute Support(m

r

).So

the time complexity of ﬁlling one cell is O(N).The overall time

complexity of ﬁlling the whole matrix would be O(DNR).

3 PROTOTYPE-BASED CLUSTERING

Given a set of CGHsamples,our marker detection technique ﬁnds a

set of markers that characterize the aberration pattern of the samples

in that set and can be considered as the prototype or model repre-

senting the samples.In this section,we develop a prototype-based

clustering algorithm to partition the dataset into subsets such that

each subset has a prototype common to the samples in that subset.It

is similar in spirit to the k-means algorithm.The user speciﬁes the

number of clusters,k,and the number of markers per cluster,R.It

starts with a random partitioning of the samples.It then iteratively

maximizes an objective function,which we call cohesion function,

in two steps.We discuss the cohesion function later.The two steps

are described as follows.

Refinement step:For each cluster,the dynamic programming

technique developed in Section 2 is used to identify the optimal

set of R markers.These markers serve as the prototype of this

cluster.

Reassignment step:Each sample is assignedto the cluster whose

prototype is supported the most by that sample.

Essentially,both steps optimize a cohesion function alternatively

until the value of this function does not change,i.e.converges to a

stable value.We formally prove this below.

Given a set of CGHsamples S ¼{s

1

,s

2

,...,s

N

}.Let Kdenote the

number of clusters.Let c

i

denote the ith cluster.A clustering of

samples can be represented by a K-partition,C ¼c

1

[ c

2

[ c

K

,that

divides samples into K disjoint clusters.Let R be the number of

markers in each cluster.Let M

i

¼ {m

i,t

j 1 t R} denote the set

of markers (i.e.prototype) for cluster i,where 1 i K,and

m

i,t

denote the tth marker in the ith cluster.Each marker m

i,t

is a

tuple <p

i,t

,q

i,t

>,where p

i,t

and q

i,t

denote the position and type

of this marker respectively.Let M denote the set of prototypes

M¼{M

1

,M

2

,...,M

K

}.We deﬁne a cohesion function of clustering

as below

cohesionðC‚MÞ ¼

X

K

i¼1

X

s

j

2c

i

X

R

t¼1

sðs

j

‚m

i‚ t

Þ‚

where s(s

j

,m

i,t

) is deﬁned as same as in Section 2.Essentially,the

cohesion function computes the intra-cluster similarity between

samples and cluster prototypes.

In the reﬁnement step,we optimize the cohesion function by

reﬁning the prototype (markers) of clusters given that samples

are partitioned into K clusters.For cluster i,let O

i

ðd‚rÞ ¼

P

r

t¼1

Supportðm

t

Þ ¼

P

r

t¼1

P

s

j

2c

i

sðs

j

‚m

i‚ t

Þ denote the largest

support that t markers can get from the genomic intervals in the

range [1..d].Thus the optimal cohesion function can be written as

cohesionðC‚M

*

Þ ¼

P

K

i¼1

O

i

ðD‚RÞ where M* denotes the set of

reﬁned prototypes and M* ¼ Argmax

M

(cohesion(C,M)).The

dynamic programming technique described in Section 2 is used

to compute O

i

(D,R) for 1 i K and,in this way,the cohesion

function is optimized.

In the reassignment step,we reassign the sample s

j

to the ith

cluster whose prototype M

i

is supported the most by s

j

,i.e.M

i

covers

the largest number of segments in s

j

.This is because,otherwise,the

cohesion function could always be improved by moving s

j

into the

ith cluster.Formally,C* ¼ Argmax

C

(cohesion(C,M)) where C*

denotes the new partition of samples after reassignment step.

Proof of convergence:It can be seen that both steps,reﬁnement

and reassignment step,are connected through the cohesion function

they alternatively optimize.The dynamic programming in reﬁne-

ment step optimize the cohesion with respect to M.At iteration (h)

we have

cohesionðC

ðhÞ

‚M

ðhþ1Þ

Þ cohesionðC

ðhÞ

‚M

ðhÞ

Þ

On the other hand,the reassignment step optimize the cohesion

based on C,i.e.

cohesionðC

ðhþ1Þ

‚M

ðhþ1Þ

Þ cohesionðC

ðhÞ

‚M

ðhþ1Þ

Þ

Put together,our algorithmgenerates a sequence (C

(h)

,M

(h)

),h 0,

that increases the cohesion function as

cohesionðC

ðhþ1Þ

‚M

ðhþ1Þ

Þ cohesionðC

ðhÞ

‚M

ðhÞ

Þ

J.Liu et al.

452

Since the maximum value of cohesion function is ﬁnite given a set

of samples,our algorithm converges to a stable value at the end.It

is worth noting that our algorithm,similar to k-means,does not

guarantee to converges to a global maxima.

Note that it is possible to generalize this method to variable

number of markers for different clusters with two modiﬁcations

to the algorithm.Instead of ﬁxing the number of markers,we ﬁx

the segment coverage of markers for each cluster in the reﬁnment

step.The segment coverage is deﬁned as the ratio of segments that

are covered by the markers to the total number of segments in the

cluster.The coverage ratio is a normalized value between 0 and 1.

It increases as the number of markers increases.The reassignment

step needs to be modiﬁed slightly as well as follows.A sample is

moved to a new cluster only if this sample supports the prototype

of new cluster more than that of old cluster and the coverage ratio

of the clusters do not drop to less than the given coverage ratio

cutoff.We tested the method with variable number of markers.

It did not produce better results than the prototype-based method

with ﬁxed number of markers.Therefore,we do not report any

experimental results.

4 PAIRWISE SIMILARITY BASED CLUSTERING

Pairwise similarity based algorithms partition the samples into clus-

ters so that the similarity between samples fromthe same cluster are

larger than the similarity between samples of different cluster.This

generally requires calculating the similarity between two samples.

In our previous work Liu et al.(2006),we proposed a segment-

based similarity measure,called Sim,for CGH data.In this section,

we develop a new similarity measure,called RSim,which avoids

noisy aberrations with the help of markers.

Let Ddenote the number of genomic intervals of each sample.Let

s

i

¼ x

i

1

‚x

i

2

‚ ‚x

i

D

and s

j

¼ x

j

1

‚x

j

2

‚ ‚x

j

D

be two CGH samples.

Here,x

i

d

and x

j

d

denote the value or status of the dth genomic

interval of s

i

and s

j

,respectively.

We,ﬁrst,summarize the Sim measure we developed for com-

puting the similarity of two CGH data.We call two segments from

two samples overlapping if they have at least one common genomic

interval of the same type.Sim constructs maximal segments by

combining as many contiguous aberrations of the same type as

possible.Thus,each sample translates into a sequence of segments.

For example,in Figure 1,s

i

and s

j

are two samples that have four

and two segments,respectively.After this transformation,Sim

computes the similarity between two CGH samples as the number

of overlapping segment pairs.This is because each overlap may

indicate a common point-like aberration in both samples which

potentially led to the overlapping segments.In Figure 1,the ﬁrst

segment of s

i

is overlapping with the ﬁrst segment of s

j

.Similarly

the third segment of s

i

is overlapping with the second segment of

s

j

.Sim computes the similarity between two samples based on the

genomic aberrations local to these samples.Thus,Sim cannot dis-

tinguish the true aberrations from noisy ones.As a result,Sim is a

local measure that is easily biased by the noise.Next,we develop a

new approach that addresses this limitation.

We propose to employ markers to eliminate the contribution of

noise to the pairwise similarity.We develop a reﬁned Simmeasure,

called RSim,as follows.Let M ¼ {m

1

,m

2

, ,m

R

},p

1

< p

2

< <

p

R

,denote the set of markers that are globally identiﬁed in all

samples.The markers imply the important genomic intervals that

are associated with the aberration patterns of samples.Given two

CGHsamples s

i

and s

j

,RSimcomputes the similarity between them

as the number of overlapping segments pairs,such that these seg-

ments satisfy both of the following two conditions:

(1) At least one of the markers in Mis contained in both segments.

(2) Both of the overlapping segments have the same aberration

type as the marker they both contain.

Formally,let x

i

u

‚x

i

uþ1

‚ ‚x

i

v

and x

j

u

0

‚x

j

u

0

þ1

‚ ‚x

j

v

0

be a pair of

segment from samples s

i

and s

j

,respectively.RSim counts this

pair of segments as one only if

(1) There exists a marker m

t

¼<p

t

,q

t

>,m

t

2M,suchthat up

t

v

and u

0

p

t

v

0

,and

(2) The aberration type of both segments is the same as that of m

t

,

i.e.x

i

u

¼ x

j

u

0

¼ q

t

.

Unlike Sim,RSim does not consider the overlapping segments that

do not intersect with any marker.This is because RSim considers

such segments as noise.For example,assume that there are two

markers in Figure 1,one at the 3rd and the other at the 11th genomic

interval.Then RSim measure for s

i

and s

j

is computed as one,

whereas Sim measure is two.

An important observation on RSim is as follows.As the number

of markers approaches to the number of genomic intervals in the

CGH data,RSim becomes equivalent to the Sim measure.This is

because all segments in the CGH data will overlap with a marker

and contribute to the similarity.Thus Simis a special case of RSim

when noisy aberrations are not eliminated.

Our previous work Liu et al.(2006) showed that Sim works best

when combined with topdown algorithm,compared to other popular

clustering algorithms,such as bottom-up and k-means.In this paper,

we propose to use the topdown clustering method with RSimas the

pairwise similarity measure for pairs of CGH samples.

Note that it is possible to extend the Rawmeasure in our previous

work Liu et al.(2006) by taking the markers into account.The

extended Raw measure works as follows.For each pair of samples,

we compute the similarity between themas the number of genomic

intervals that meet the following two conditions.First,both samples

have the same aberration type (gain or loss) at this interval.Second,

one marker of the same type as both samples appears at this interval.

Our experiments (results ommited due to space limit) show that

although the markers improve the original Raw measure,RSim is

always superior to this measure.

5 EXPERIMENTAL RESULTS

Dataset.With >12000 cases Baudis (2006),the largest resource

for published CGH data can be found in the Progenetix database

Baudis and Cleary (2001) (http://www.progenetix.net).We use

three different datasets,dissimilarDS,interDS and similarDS,

taken from Progenetix databases.Each dataset contains >800

CGHsamples (i.e.cytogenetic imbalance proﬁles of tumor samples)

fromfour different histopathological cancer types.Each sample has

been coded according to the ICD-O-3 systemFritz et al.(2000) and

consists of 862 ordered genomic intervals extracted from 24 chro-

mosomes.In principle,each dataset can be mapped to an integer

matrix of size N· 862,where Ndenotes the number of samples.The

difference of these three datasets are the divergence of aberration

Markers improve clustering of CGH data

453

patterns in distinct cancer types.In dissimilarDS,the samples of

different cancer types contain diverse aberration patterns that are

easily distinguished from each other.The samples in similarDS

contain similar aberration patterns.The interDS dataset is at an

intermediate degree.The choices of these datasets are based on a

visual inspection of the matrices (as in Fig.1 in appendix) for each

of the cancer types.

System speciﬁcations.Our experimental simulations were run

on a systemwith dual 2.59 GHz AMDOpteron Processors,8 GB of

RAM,and an Linux operating system.

5.1 Quality of clustering

Measuring the quality of clustering is a challenging task as it is

an unsupervised learning task.There are a number of internal

and external cluster validation techniques that are described in

the literature.In the following,we describe two measures that

can be used for evaluating the clustering.

(1) Coverage Measure (CM):An internal measure evaluates the

quality of clusters if the class labels of samples are not known

a priori.One possible internal measure is the cohesionfunction

defined in Section 3.This function measures the total support

of markers over each cluster.We use the termCoverage Mea-

sure (CM) to denote this measure.Markers with high-support

potentially convey some meaningful biological information

and potentially can serve as the first step for further analysis,

such as the identification of new oncogenes and tumor sup-

pressor genes.A group of markers can be considered as bio-

logically relevant if they cover most of the segments in all the

patients.

(2) Normalized Mutual Information (NMI):One way to measure

the quality of clustering is to see if each cluster predominantly

consists of samples from a single cancer type.This clearly

makes the assumption that this information is available (as

was the case for datasets used in this paper) and those samples

from the same cancer type will generally be similar to each

other.The external measure,knownas the Normalized Mutual

Information,described belowcan be used for this purpose.Let

N,Uand Kdenote the total number of samples,the number of

different cancer types andthe number of clusters,respectively.

Let a

1

,a

2

, ,a

m

denote the number of samples that belong to

each cancer type.Similarly,let b

1

,b

2

, ,b

k

be the number of

samples that belong to each cluster.Let c

i,j

,8i,j,1 i Uand

1 j K,denote the number of samples in jth cluster that

belongtotheithcancer type.TheNMI functionis computedas:

NMI ¼

P

U

i¼1

P

K

j¼1

c

i‚ j

log

N∙ c

i‚ j

a

i

b

j

ﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ

P

i

a

i

log

a

i

N

P

j

b

j

log

b

j

N

s

:

NMI function takes a value in [0..1] interval.Higher values

indicate better clustering qualities.Unlike other external mea-

sures,NMI was computedbasedonmutual informationI(X;Y)

between a randomvariable X,governing the cluster labels and

a random variable Y,governing the cancer types.It has been

argued that the mutual information is a superior measure than

purity or entropy Strehl and Ghosh (2002).Moreover,NMI is

quite impartial to the number of clusters Zhong and Ghosh

(2005).

5.2 Quality of markers

Measuring the quality (or the biological relevance) of the markers

identiﬁed for the clusters is also important.Here,we develop a

measure to address this.We combine all the markers found in

each cluster.For each marker in combined marker set,we ﬁrst

compute the ratio of the samples that support it among the samples

in each cancer type.Thus,if there are T cancer types in the dataset,

T values are computed for each marker.We deﬁne the maximum

of these ratios as the biological relevance of this marker.This is

because a larger ratio of support from one cancer type indi-

cates that the marker better captures the aberration pattern of

that cancer type.Therefore,this marker is biologically relevant

in that cancer type.We use the term Global Maximum Support

(GMS) to represent this measure.Formally,let M¼

fm

1

‚m

2

‚...‚m

r

g denote the set of markers.Let C

i

,1 i T denote

the set of samples that belong to ith cancer type,where T is the

number of different cancer types in the dataset.GMS measure for

marker m

i

2 Mis computed as:

GMSðm

i

Þ ¼ max

1tT

1

j C

t

j

X

s

j

2C

t

sðs

j

‚m

i

Þ

Note that GMS differs greatly from CM for the CM measure is

computed over clusters identiﬁed by the underlying clustering

strategy,whereas GMS is computed over the cancer types.

5.3 Evaluation

We tested the prototype-based approach and two similarity-based

approaches,RSim and Sim Liu et al.(2006) over three datasets,

dissimilarDS,interDS and similarDS.For each clustering method

and dataset,we created 6,8 and 10 clusters.For each number of

clusters,we tried different number of markers,i.e.4,6,8,10 and 12

markers,per cluster in the prototype-based approach.This is

because biologists have pointed out that a total of 4–7 genetic

alterations can be estimated for the malignant transformation of

a cell Renan (1993).Thus,we estimated that the number of aber-

rations common to the samples of one cluster could be around 10.

For the consistency,we also employed different numbers of markers

for each clustering of 6,8 and 10 clusters in RSim.Here,the number

of markers is determined as the product of number of clusters and

the number of markers per cluster used in the prototype-based

approach.For example,24,36,48,60 and 72 markers were

found to create six clusters using RSim.We compared three

methods according to the qualities of clusters.We evaluated the

cluster qualities using both NMI and CMmeasures.To evaluate the

cluster qualities using CM,for both RSim and Sim,we identiﬁed

the markers for each resulting cluster.The number of markers per

cluster is the same as that used in the prototype-based approach.

We compute the error bars for part of the results.We also used

GMS to evaluate the biological relevance of markers found in

prototype-based approach and RSim approach.

The CM results are shown in Table 1.The CM values mono-

tonically increase as the number of clusters or number of markers

increase.Thus,we compare three clustering methods for the same

number of clusters and markers.We observe that prototype-based

approach has 8–34 % better coverage than RSim and 15–41%

J.Liu et al.

454

better coverage than Sim.This is because the cohesion function

optimized in prototype-based approach is the same as the Coverage

Measure.RSim and Sim do not optimize CM directly.The results

also show that RSim is superior to Sim most of the time.This is

because the use of markers in RSim ﬁlters out the noise that are

irrelevant to the aberration patterns.As a result,the markers in each

cluster are supported by more samples in RSimas compared to Sim.

The NMI results are shown in Table 2.Since Sim produces

clusters without ﬁnding markers,we list its results on a separate

column.The results show that all three methods perform better on

dissimilarDS than interDS and similarDS.This is because the aber-

ration patterns of distinct cancer types in dissimilarDS are diver-

gent.Thus,it is harder to cluster interDS and similarDS datasets.

From the results,we observe that RSim and Sim always beat

prototype-based approach in terms of the NMI values.This obser-

vation,together with the conclusion from results in Table 1,indi-

cates that NMI measure has no apparent relationship with CM.This

is because NMI computes the quality based on the class labels of

samples.On the other hand,CM evaluates the compactness of

samples in each cluster based on chromosomal aberration patterns

and completely ignores the class labels.Therefore,we conclude that

the pairwise similarity-based clustering approaches are more suit-

able to external measures,such as NMI,while the prototype-based

approach works better for the CM.RSim usually has the best NMI

results using 10 markers.When compared to Sim,RSimusually has

better NMI values.This indicates that the use of markers in reﬁning

the pairwise similarity also leads to a better clustering in terms of

NMI.Given that RSim has better CM (Table 1) and NMI values

than Sim (Table 2),we conclude that RSim is a better pairwise

similarity measure than Sim.

We compute the error bars of experimental results as follows.We

randomly sample 50% of each dataset and cluster it using three

methods,prototype-based,RSim and Sim,described in the paper.

To reduce the amount of computations,we choose one conﬁguration

from different combinations of parameters in the experiments.For

each dataset,we create eight clusters.We identify 10 markers per

cluster in the prototype-based approach and 80 markers in RSim

approach.We then compute both NMI and CM values for the

resulting clusters using 10 markers per cluster.We repeat this pro-

cess 100 times and compute the error bar for the values of NMI and

CM.The error bar indicates the interval where 5–95%of the results

lie.Table 3 shows the results with error bars.Note that the results of

CM are roughly half the results shown in Table 1 because the

calculation of CM depends on the dataset and we sample 50% of

each dataset in the experiments.The results show that RSim is

superior to Sim most of the time in terms of both NMI and CM.

Moreover,among the three clustering methods,RSim and

prototype-based approach works the best for NMI and CM mea-

sures,respectively.These observations are compatible to those we

obtained fromTables 1 and 2.Therefore,the error bars conﬁrmour

earlier conclusions.

Next experiment compares the GMS values for RSIM and

prototype-based clustering approaches.For each dataset,we created

eight clusters and identiﬁed 10 markers per cluster.This is because

these results are among the best results of each dataset in Table 2.

We then sort the markers in descending GMS value order.We plot

the sorted results of both RSim and prototype-based approach

(Fig.2).The plots show that the maximumglobal support of mark-

ers found by RSim is always comparable to or better than those

found by the prototype-based approach.

6 CONCLUSIONS

We considered the problemof clustering CGH data of a population

of cancer patient samples.There are three main contributions of

our work:

(1) We developed a dynamic programming algorithmto identify

the optimal set of important genomic intervals called markers.

The advantage of using these markers is that the potentially

noisy genomic intervals are excluded in the computation of

pairwise similarity between samples.

(2) We developed two clustering strategies using these markers.

The first one,prototype-based approach,maximizes the

support for the markers.The second one,similarity-based

approach,develops a new similarity measure called RSim.

It iteratively refines clusters with the aim of maximizing the

RSim measure between the samples in the same cluster.

We demonstrated the utility of such a measure in improving

the quality of clustering using the classified disease entities

from the Progenetix database.Our results show that the

markers we found represent the aberration patterns of cancer

types well.

Table 1.The coverage measure for the three clustering methods applied over

each of three datasets

Dataset K Alg Number of markers

4K 6K 8K 10K 12K

Proto 1655 2107 2465 2707

2953

6

RSim

1520 1916 2190 2512 2722

Sim

1316 1741 2051 2325 2571

Proto

1738 2192 2524 2769 3038

DissimilarDS

8

RSim

1585 1925 2254 2516 2817

Sim

1470 1883 2209 2484 2719

Proto

1839 2214 2613 2928 3149

10

RSim

1645 1938 2371 2575 2806

Sim

1480 1895 2212 2480 2716

Proto

1328 1696 2007 2318 2531

6

RSim

993 1463 1764 2006 2257

Sim

945 1303 1611 1884 2123

Proto

1345 1770 2104 2415 2709

InterDS

8

RSim

1011 1473 1821 2114 2373

Sim

1004 1382 1706 1993 2246

Proto

1443 1860 2189 2519 2779

10

RSim

1192 1533 1825 2156 2467

Sim

1113 1486 1798 2078 2322

Proto

1793 2197 2512 2834 3161

6

RSim

1430 1868 2265 2541 2808

Sim

1414 1830 2206 2530 2775

Proto

1827 2257 2643 2973 3298

SimilarDS

8

RSim

1627 2061 2400 2694 3044

Sim

1530 1956 2316 2625 2889

Proto

1946 2438 2908 3274 3551

10

RSim

1734 2153 2506 2831 3197

Sim

1636 2100 2492 2822 3116

Here,K denotes the number of clusters.Proto denotes the prototype-based approach.

Markers improve clustering of CGH data

455

Table 2.The NMI values of the three clustering methods are applied over each of three datasets

Dataset K Sim Alg Number of markers

4K 6K 8K 10K 12K

DissimilarDS 6 0.41 Proto

RSim

0.31

0.45

0.31

0.49

0.29

0.50

0.29

0.49

0.24

0.50

8

0.47 Proto

RSim

0.32

0.48

0.28

0.49

0.30

0.49

0.25

0.51

0.28

0.49

10

0.43 Proto

RSim

0.35

0.46

0.33

0.44

0.37

0.47

0.34

0.50

0.29

0.49

InterDS

6

0.34 Proto

RSim

0.06

0.35

0.08

0.34

0.08

0.38

0.08

0.40

0.11

0.39

8

0.32 Proto

RSim

0.07

0.34

0.10

0.34

0.09

0.36

0.10

0.38

0.11

0.36

10

0.32 Proto

RSim

0.10

0.35

0.09

0.36

0.10

0.36

0.15

0.37

0.12

0.36

SimilarDS

6

0.39 Proto

RSim

0.14

0.29

0.13

0.41

0.06

0.39

0.07

0.39

0.07

0.40

8

0.36 Proto

RSim

0.14

0.36

0.09

0.38

0.11

0.38

0.07

0.38

0.05

0.37

10

0.33 Proto

RSim

0.08

0.35

0.08

0.37

0.10

0.35

0.06

0.37

0.07

0.37

Here,K denotes the number of clusters.Proto,denotes the prototype-based approach.

Table 3.The percentage error for three clustering methods,prototype-based (denoted as Proto),RSim and Sim,over each of three datasets,dissimilarDS,

interDS and similarDS,to create eight clusters and identify 10 markers per cluster

NMI CM

5% Median 95% 5% Median 95%

Proto 0.20 0.28 0.34 1372 1431 1487

DissimilarDS RSim 0.45 0.51 0.55

1230

1291 1337

Sim 0.36 0.42 0.47

1187

1241 1289

Proto 0.05 0.09 0.13

1156

1228 1285

InterDS RSim 0.32 0.37 0.41

985

1042 1120

Sim 0.27 0.30 0.33

963

1032 1098

Proto 0.06 0.09 0.13

1480

1535 1581

SimilarDS RSim 0.33 0.37 0.40

1294

1342 1401

Sim 0.29 0.33 0.37

1283

1327 1376

The resulting clusters are evaluated using both NMI and CM.Here,5 and 95% denote the 5th and 95th percentile,respectively.

Fig.2.Plots of Global Maximumsupport of markers foundin(a) dissimilarDS,(b) interDSand(c) similarDS,respectively,usingbothprototype-based approach

and RSim.The results of prototype-based approach (denoted as Proto) and RSim are plotted in solid line and dashed line,respectively.

J.Liu et al.

456

(3) We developed several measures for comparing markers and

different clustering methods.Our experimental results show

that optimizingfor the coverage measure maynot leadtobetter

values of NMI and vice versa.

ACKNOWLEDGEMENTS

The authors would like to thank Dr Michael Baudis for confirming

the selection of datasets.The work of authors is supported in part by

the National Science Foundation under Grant ITR 0325459.Any

opinions,findings and conclusions or recommendations expressed

in this material are those of the author(s) and do not necessarily

reflect the views of the National Science Foundation.

Conflict of Interest:none declared.

REFERENCES

Baudis,M.and Cleary,M.L.(2001) Progenetix.net:an online repository for molecular

cytogenetic aberration data.Bioinformatics,17,1228–1229.

Baudis,M.(2006) An online database and bioinformatics toolbox to support data

mining in cancer cytogenetics.Biotechniques,40,269–270,272.

Beattie,B.J.and Robinson,P.N.(2006) Binary state pattern clustering:A digital para-

digm for class and biomarker discovery in gene microarray studies of cancer.

J.Comput.Biol.,13,1114–1130.

Bentz,M.et al.(1996) High incidence of chromosomal imbalances and gene

ampliﬁcations in the classical follicular variant of follicle center lymphoma.

Blood,88,1437–1444.

Kallionieni,A.et al.(1992) Comparative genomic hybridization for molecular

cytogenetic analysis of solid tumors.Science,258,818–821.

Fridlyand,J.et al.(2004) Hidden markov models approach to the analysis of array CGH

data.J.Multivar.Anal.,90,132–153.

Desper,R.et al.(1999) Inferring tree models for oncogenesis fromcomparative genome

hybridization data.J.Comput.Biol.,6,37–52.

Fritz,A.,Percy,C.,Jack,A.,Sobin,L.and Parkin,M.(ed.) (2000) International Classi-

ﬁcation of Diseases for Oncology (ICD-O),3rd edn.World Health Organization,

Geneva.

Fu,L.M.and Fu-Liu,C.S.(2005) Evaluation of gene importance in microarray data

based upon probability of selection.BMC Bioinformatics,6,67.

Gray,J.W.et al.(1994) Molecular cytogenetics of human breast cancer.Cold Spring

Harb.Symp.Quant.Biol.,59,645–652.

Hoglund,M.et al.(2005) Statistical behavior of complex cancer karyotypes.

Genes Chromosomes Cancer,42,327–341.

Horowitz,E.,Sahni,S.and Rajasekaran,S.(1998) Computer Algorithms/C++.Computer

Science Press,New York,NY,USA.

Jain,A.K.et al.(1999) Data clustering:a review.ACM Comput.Surv.,31,

264–323.

Jiang,D.et al.(2004) Cluster analysis for gene expression data:a survey.IEEE Trans.

Knowl.Data Eng.,16,1370–1386.

Joos,S.et al.(2002) Classical hodgkin lymphoma is characterized by recurrent

copy number gains of the short arm of chromosome 2.Blood,99,

1381–1387.

Liu,J.et al.(2006) Distance-based clustering of CGH data.Bioinformatics,22,

1971–1978.

Mattfeldt,T.et al.(2001) Cluster analysis of comparative genomic hybridization CGH

data using self-organizing maps:application to prostate carcinomas.Anal.Cell.

Pathol.,23,29–37.

Mitelman,F.et al.(1972) Tumor etiology and chromosome pattern.Science,176,

1340–1341.

Mitelman,F.(ed.) (1995) International System for Cytogenetic Nomenclature.Karger,

Basel.

Picard,F.,Robin,S.,Lebarbier,E.and Daudin,J.-J.(2005a) A segmentation-clustering

problem for the analysis of array CGH data.In Applied Stochastic Models and

Data Analysis.

Picard,F.et al.(2005b) A statistical approach for array CGH data analysis.BMC

Bioinformatics,6,27.

Pinkel,D.and Albertson,D.G.(2005) Array comparative genomic hybridization and its

applications in cancer.Nat.Genet.,37 (Suppl.),S11–S17.

Renan,M.J.(1993) How many mutations are required for tumorigenesis?implications

from human cancer data Mol.Carcinog.,7,139–146.

Rouveirol,C.et al.(2006) Computation of recurrent minimal genomic alterations from

array-cgh data.Bioinformatics,22,849–856.

Strehl,A.and Ghosh,J.(2002) Cluster ensembles—a knowledge reuse framework for

combining partitionings.In Proceedings of AAAI 2002,AAAI,Edmonton,Canada,

pp.93–98.

Vandesompele,J.et al.(2005) Unequivocal delineation of clinicogenetic subgroups and

development of a new model for improved outcome prediction in neuroblastoma.

J.Clin.Oncol.,23,2280–2299.

Wang,P.et al.(2005) A method for calling gains and losses in array CGH data.

Biostatistics,6,45–58.

Willenbrock,H.and Fridlyand,J.(2005) A comparison study:applying segmentation

to array CGH data for downstream analyses.Bioinformatics,21,4084–4091.

Zhong,S.and Ghosh,J.(2005) Generative model-based document clustering:a

comparative study.Knowl.Inf.Syst.,8,374–384.

Markers improve clustering of CGH data

457

## Comments 0

Log in to post a comment