

Ph.D. Student: Cand.scient. Mark Burton

Enrolment: 1 May 2008

Project Title: Development, characterization and application of data-analysis methods for construction of gene expression based classifiers in breast cancer

Supervisors: Professor, lic.scient., Torben Kruse, head of research department of clinical genetics, Odense University Hospital; ph.d., molecular biologist, Mads Thomassen, department of clinical genetics, Odense University Hospital; Qihua Tan, associate professor (lektor), dr.med., Institute of Clinical Research, SDU

Institute: Institute of Clinical Research

Research Unit: Human Genetics

Abstract:

Aims: 1) Use DNA microarray gene expression data to identify genes and metagenes (i.e. genes belonging to a biological pathway, chromosomal region or transcription factors associated with breast cancer metastasis) that are differentially expressed between primary breast tumors which have metastasized and those which have not; 2) Build DNA microarray-based classifiers which can predict metastasis outcome in new patients; 3) Investigate whether metagene models are better than single-gene models at predicting metastasis outcome; 4) Benchmark classification methods with respect to prediction accuracy (ACC), defined as the mean of sensitivity and specificity; 5) Benchmark validation methods (conservative, leave-one-out cross-validation (LOOCV), ½- and […]-splitting).
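The ACC measure in aim 4 is the mean of sensitivity and specificity. A minimal sketch of that calculation follows; the function name `balanced_accuracy` and the toy labels are illustrative assumptions, not taken from the project.

```python
def balanced_accuracy(y_true, y_pred, positive=1):
    """Mean of sensitivity and specificity for binary labels."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1  # metastasis correctly predicted
            else:
                fn += 1  # metastasis missed
        else:
            if pred == positive:
                fp += 1  # false alarm
            else:
                tn += 1  # non-metastasis correctly predicted
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Invented toy labels: 1 = metastasis, 0 = no metastasis
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
acc = balanced_accuracy(y_true, y_pred)  # sensitivity 3/4, specificity 4/6
```

Unlike the plain fraction of correct predictions, this measure weights the two outcome groups equally even when one group is much larger than the other.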

Methods: Eight breast cancer DNA microarray datasets were chosen for our analysis. As we are faced with the curse of dimensionality, meaning that the number of genes exceeds the number of samples, we need to reduce the number of gene and metagene candidates. This was done using our new method, which ranks the genes and metagenes within each dataset according to their signal-to-noise ratio, followed by a meta-approach which calculates each gene's mean rank across the eight datasets (Xi).
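The ranking step can be sketched as follows. The abstract does not define the signal-to-noise ratio, so the sketch assumes the common form (difference of group means divided by the sum of group standard deviations) and ranks by its absolute value. It also assumes every dataset contains the same genes, so no rank scaling is applied. All names and expression values are invented for illustration.

```python
from statistics import mean, stdev


def signal_to_noise(group_a, group_b):
    """Assumed S2N form: difference of means over sum of standard deviations."""
    return (mean(group_a) - mean(group_b)) / (stdev(group_a) + stdev(group_b))


def rank_genes(dataset):
    """Map gene -> rank within one dataset (rank 1 = largest |S2N|)."""
    scores = {g: abs(signal_to_noise(a, b)) for g, (a, b) in dataset.items()}
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {g: i + 1 for i, g in enumerate(ordered)}


def mean_rank(datasets):
    """Mean rank of each gene across all datasets (the Xi value)."""
    per_dataset = [rank_genes(d) for d in datasets]
    return {g: mean([r[g] for r in per_dataset]) for g in per_dataset[0]}


# gene -> (expression in metastasis group, expression in non-metastasis group)
d1 = {
    "g1": ([5.0, 5.2, 4.8], [1.0, 1.1, 0.9]),
    "g2": ([2.0, 2.5, 1.5], [1.8, 2.2, 2.0]),
    "g3": ([3.0, 3.4, 2.6], [2.0, 2.3, 1.7]),
}
d2 = {
    "g1": ([4.0, 4.4, 3.6], [1.0, 1.2, 0.8]),
    "g2": ([2.0, 2.2, 1.8], [1.0, 1.3, 0.7]),
    "g3": ([3.0, 3.5, 2.5], [2.5, 3.0, 2.0]),
}
xi = mean_rank([d1, d2])
```

In this toy input, `g1` is ranked first in both datasets, while `g2` and `g3` swap second and third place and therefore share the same mean rank.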

To test whether a gene was significantly differentially expressed between the two tumor groups, 8 random numbers ranging between 1 and the maximum scaled gene rank value were sampled, and their mean was calculated. This calculation was repeated 1000 times, thus generating a distribution of random average values. The FDR value for each gene was then calculated by comparing the Xi values with this distribution. Choosing an FDR of 0.05 left a list of significantly differentially expressed genes.
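The random reference distribution can be sketched as below. The abstract does not say whether the random numbers are integers or continuous, nor exactly how the FDR value is derived from the comparison. The sketch therefore assumes continuous uniform draws and reports only the fraction of random means at or below an observed Xi, which may differ from the project's actual FDR calculation. The function names and the maximum rank of 1000 are hypothetical.

```python
import random


def null_mean_ranks(max_rank, n_datasets=8, n_repeats=1000, seed=0):
    """Means of n_datasets random ranks, repeated n_repeats times."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [
        sum(rng.uniform(1, max_rank) for _ in range(n_datasets)) / n_datasets
        for _ in range(n_repeats)
    ]


def empirical_tail(xi, null_means):
    """Fraction of random means that are as small as or smaller than xi."""
    return sum(1 for m in null_means if m <= xi) / len(null_means)


null = null_mean_ranks(max_rank=1000)
low = empirical_tail(5.0, null)    # gene ranked near the top in every dataset
mid = empirical_tail(500.0, null)  # gene ranked near the middle on average
```

A gene that is consistently ranked near the top has a mean rank far below almost all random means, so its tail fraction is close to zero, whereas a gene with an unremarkable mean rank falls in the bulk of the distribution.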

In the further analysis we focused on node-negative patients from two of the above datasets (A and R) and a combination of the two (AR). For validation, two datasets which had not been used for choosing the candidate genes were used (T and M), together with a combination of the two (TM).

Results: 283 genes and 71 metagenes were identified as significantly differentially expressed. These were used to develop prediction models by 10-fold cross-validation within A, R and AR, respectively, using either Random Forest (RF), logistic regression (LR), R-SVM, L-SVM, S-SVM or neural network (NN), leading to a total of 26 single-gene and 27 metagene prediction models. Amongst these models, T and TM were predicted with 68% and 67% ACC using R- and AR-trained RF-based models having 273 and 99 genes, respectively. M was validated with 68% ACC using an R-trained L-SVM model consisting of 19 metagenes. Wilcoxon paired tests showed that metagenes outperform single genes (p = 0.0096) in all validation setups, and that conservative validation performs better than LOOCV (p = 0.0012), ½-splitting (p = 7.39 x 10e-14) and […]-splitting (p = 9.07 x 10e-11). Furthermore, the results show that in most cases RF outperforms the other classification methods.
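The paired comparisons reported above use the Wilcoxon test. As an illustration of how such a test works on paired ACC values, the following sketch computes the signed-rank statistic and an exact two-sided p-value by enumerating all sign patterns, which is only feasible for a small number of pairs. The ACC percentages are invented and the function names are hypothetical; this is not the project's own analysis code.

```python
from itertools import product


def signed_rank_statistic(x, y):
    """Return (W+, ranks) for paired samples, dropping zero differences."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied absolute differences
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    return w_plus, ranks


def exact_two_sided_p(w_plus, ranks):
    """Exact p-value: share of sign patterns at least as extreme as observed."""
    total = sum(ranks)
    observed = abs(w_plus - total / 2)
    hits = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - total / 2) >= observed - 1e-12:
            hits += 1
    return hits / 2 ** len(ranks)


# Invented ACC values (%) for six paired validation setups
metagene_acc = [68, 66, 70, 65, 67, 69]
single_gene_acc = [64, 65, 63, 62, 66, 60]
w_plus, ranks = signed_rank_statistic(metagene_acc, single_gene_acc)
p_value = exact_two_sided_p(w_plus, ranks)
```

With many pairs the enumeration becomes impractical, and a library routine such as `scipy.stats.wilcoxon` would normally be used instead.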



Keywords: Oncology and Haematology