Bioinformatics System for Gene Diagnostics and Expression Studies

Justin Chan Shao Ling
National University of Singapore, 10 Kent Ridge Crescent, Singapore 119260
jcsl76@yahoo.com


Abstract

This report is based on work undertaken on the CADRA (Coronary Artery Disease Risk Assessment) project. Pattern recognition techniques were employed in the classification of CAD (Coronary Artery Disease) cases using various risk factors and information from microarray gene-expression data. The machine learning algorithms used are the Naïve Bayes Classifier (NBC), the C4.5 Decision Tree and Support Vector Machines (SVM). The Chinese samples in the dataset were segregated and investigated using NBC and C4.5, allowing for comparison with 'all-race' classifiers. Lastly, the implementation of a web-based Bioinformatics system using Microsoft technology, undertaken with my colleagues, is described.



1. Introduction


Advances in the Life Sciences and various related technologies have fuelled much interest amongst researchers in the study of genes and diseases. The Biosensors-Focused Interest Group (BFIG) of the National University of Singapore (NUS) was formed with the focus of harnessing DNA microarrays for genetic diagnosis and gene expression studies. CADRA (Coronary Artery Disease Risk Assessment) was one of the projects undertaken, and involved the study of 30 genes associated with atherosclerosis. This project involves the development and implementation of a Web-based front-end for integrating the DNA information with Microsoft technologies, as well as understanding and implementing machine learning algorithms for classifying CAD patients.


Coronary Artery Disease (CAD) is a major cause of death in developed countries. It is described as a complex disease due to its multifactorial nature. The disease is further complicated by gene-environment and gene-gene interactions, resulting in risk factor variability.

CADRA researchers are utilising microarray technology in studying various candidate genes involved in CAD. Having obtained the large quantity of candidate gene-expression data, the bioinformatics group of the CADRA project has been involved in data mining and analysis in relation to CAD.




The algorithms used for classification are the Naïve Bayes Classifier, the C4.5 Decision Tree and Support Vector Machines.



2. Experiment Issues and Methodology


Jonatrim 2 is the dataset used for classification. It consists of a total of 29 independent attributes and 1 dependent attribute (i.e. the CAD status). 19 of the independent attributes are categorical data pertaining to genotype/phenotype information from microarray analysis (henceforth known as the A attributes).


Several issues and complications have arisen in the dataset. These include possible noise in the genotype/phenotype classification process, contaminated risk factors (TC, HDL, LDL and TG), missing data and class imbalances.


2.1 Conduct of Training and Testing


The training and testing of models for the Naïve Bayes Classifier and C4.5 Decision Tree were carried out as described by Russell and Norvig [1].


The results of classification on the test set are quoted for the various algorithms used in their respective chapters. The average results are given across the entire range of training set sizes (except for SVM). The following steps and measures were also taken in the conduct of the various training and testing of models.




- Sampling was carried out based on a uniform distribution. The test sets were sampled without replacement, while the training set was sampled with replacement when insufficient examples arose (a minimal sketch of this scheme is given after this list).

- Training set sizes ranging from 100 to 800 instances (in increments of 100) and 1000 to 2000 instances (in increments of 200) were used. The test set size was fixed at 100 samples throughout.

- SVM calls for a different method of training and testing. The test set is still fixed at 100 samples, but no resampling was carried out for the training sets, which were fixed at a size of 500.

- The Chinese training sets were of size 280 and the test sets were of size 100.
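
The sampling scheme in the first point can be sketched minimally in Python; the function and parameter names below are illustrative stand-ins, not the original project code, and a fixed seed is assumed for repeatability.

```python
import random

def sample_sets(data, train_size, test_size=100, seed=0):
    """Draw a test set without replacement, then a training set from the
    remaining pool, falling back to sampling with replacement when the
    pool has fewer examples than requested."""
    rng = random.Random(seed)
    pool = list(data)
    rng.shuffle(pool)                      # uniform-distribution sampling
    test_set = pool[:test_size]            # without replacement
    remainder = pool[test_size:]
    if train_size <= len(remainder):
        train_set = remainder[:train_size]                              # without replacement
    else:
        train_set = [rng.choice(remainder) for _ in range(train_size)]  # with replacement
    return train_set, test_set
```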




To address the various issues of contaminated attributes (TC, HDL, LDL and TG), and for comparative studies with previous investigations [3], the Jonatrim 2 dataset has been split into several categorical types, as described in Table 1.


Type   Attributes
A      All attributes
B      All attributes except A
C      All attributes except TC, HDL, LDL, TG
D      All attributes except A, TC, HDL, LDL, TG
E      A from Type A
F      A
G      All attributes except Race
H      All attributes except A, Race
I      All attributes except TC, HDL, LDL, TG, Race
J      All attributes except A, TC, HDL, LDL, TG, Race
K      A from Type A with Race

Table 1: Description of dataset types used


Within each run of the algorithm concerned, the test set used is the same throughout. However, each run using Type F uses a different test set. The training and test sets also differ between algorithms.


2.2 Performance Measure


To evaluate the performance of the algorithms, the various measures in the confusion matrix as well as others (e.g. sensitivity, predictive value positive) are presented in tabular form for the various algorithms used. It should be noted that in the case of medical diagnosis of a condition like CAD, not all classification errors carry the same penalty [4]. Specifically, one would like to avoid classifying CAD cases as non-CAD cases.
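
For reference, the measures quoted in the tables of Section 7 follow directly from the confusion matrix; a minimal Python sketch is given below. It is illustrative only, and because the tabulated percentages are averages over runs, measures derived this way may differ slightly from the quoted values.

```python
def confusion_measures(tp, tn, fp, fn):
    """Derive the reported statistics from confusion-matrix counts
    (here percentages of a 100-sample test set, so they act as counts)."""
    total = tp + tn + fp + fn
    return {
        "sensitivity": tp / (tp + fn),                 # true positive rate
        "specificity": tn / (tn + fp),                 # true negative rate
        "predictive value positive": tp / (tp + fp),
        "predictive value negative": tn / (tn + fn),
        "success rate": (tp + tn) / total,             # overall accuracy
        "failure rate": (fp + fn) / total,             # overall error rate
    }

# Example: the NBC column of Table 2
print(confusion_measures(tp=44.64, tn=42.36, fp=7.64, fn=5.36))
```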



3. Introduction to Naïve Bayes Classifier


The Naïve Bayes Classifier (NBC) is a simple yet effective method for pattern classification in cases involving both discrete and continuous attributes. Its basis is rooted in Probability Theory, particularly Bayes' Rule. Although it is less expressive and assumes attributes to be equally important and independent of one another, its simplicity has been known to rival and outperform more complicated classifiers like decision trees and instance-based learners [5].


The concept of conditional probability in Probability Theory provides the link in obtaining Bayes' Rule, which is

    Pr{Y|X} = Pr{X|Y} Pr{Y} / Pr{X}

Bayes' rule of conditional probability shows how to predict the class of a previously unseen example, given a training sample. The chosen class, y, should be the one which maximizes

    Pr{y|x} = Pr{x|y} Pr{y} / Pr{x}

where y represents the target class out of all the various possible classes (i.e. y ∈ Y) and x = {x1, x2, ..., xn} the attributes used for classification. The probabilities Pr{Y} and Pr{X}, and the conditional probability Pr{X|Y}, are obtained from the training sample. In the case where the various attributes making up X are statistically independent,

    Pr{X1, X2, ..., Xn | Y} = ∏i Pr{Xi | Y}

Therefore, NBC predicts the class, y, that maximizes

    Pr{y} ∏i Pr{xi | y}


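For concreteness, the decision rule above can be sketched for categorical attributes as follows. This toy Python version adds add-one smoothing to avoid zero probabilities (a detail not discussed in this report); it is illustrative only and is not the Weka implementation actually used.

```python
from collections import Counter, defaultdict
import math

def train_nbc(X, y):
    """X: list of attribute tuples; y: list of class labels."""
    class_counts = Counter(y)
    cond_counts = defaultdict(Counter)     # (attribute index, class) -> value counts
    for xs, label in zip(X, y):
        for i, v in enumerate(xs):
            cond_counts[(i, label)][v] += 1
    return class_counts, cond_counts

def predict_nbc(class_counts, cond_counts, xs):
    """Return the class y maximizing log Pr{y} + sum_i log Pr{x_i | y}."""
    n = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, c in class_counts.items():
        score = math.log(c / n)                        # log Pr{y}
        for i, v in enumerate(xs):
            counts = cond_counts[(i, label)]
            # add-one smoothing guards against unseen attribute values
            score += math.log((counts[v] + 1) / (c + len(counts) + 1))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```
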

4. Introduction to C4.5 Decision Trees


The divide-and-conquer approach to decision tree induction, sometimes called top-down induction of decision trees, was developed and refined over many years by Ross Quinlan [6]. Its expressiveness and ability to generate rules are its major points of attraction for researchers.


A decision tree is a collection of branches (paths from the root to the leaves), leaves (indicating the class) and nodes (specifying the test to be carried out). Decision trees classify cases as identified through attributes (or features). Essentially, they recursively partition regions of the attribute space into sub-regions according to the most 'informative' attribute [12]. As the criterion for selecting the most 'informative' attribute, the information gain ratio criterion is used in C4.5. Originally, only the information gain criterion was used (i.e. in ID3), but it was observed that this measure had a strong bias in favour of attributes with many outcomes. These criterion measures are largely based on Information Theory.


4.1 Information Measures


The following are some of the information measures used in C4.5. The entropy H(X) of a discrete random variable X is defined as

    H(X) = - Σx Pr{x} log2 Pr{x}

The conditional entropy H(X|Y) of a random variable X given the random variable Y is

    H(X|Y) = - Σy Pr{y} Σx Pr{x|y} log2 Pr{x|y}

The information gain, or mutual information, between random variables X and Y is given as

    I(X;Y) = H(X) - H(X|Y)

The information gain ratio criterion, or gain ratio as it is more commonly known, is thus defined as

    gain ratio = I(Y;X) / H(X)

where Y is the class and X the attribute under consideration, so that the gain is normalised by the split information H(X).



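These measures translate directly into code. The sketch below is an illustrative rendering of the formulas above (not the C4.5 source), with the probabilities estimated by frequency from samples.

```python
import math
from collections import Counter

def entropy(values):
    """H(X) = -sum_x Pr{x} log2 Pr{x}, with Pr{x} estimated by frequency."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def information_gain(classes, attribute):
    """I(Y;X) = H(Y) - H(Y|X): reduction in class entropy from splitting on X."""
    n = len(classes)
    h_cond = sum((c / n) * entropy([y for y, x in zip(classes, attribute) if x == v])
                 for v, c in Counter(attribute).items())
    return entropy(classes) - h_cond

def gain_ratio(classes, attribute):
    """Gain normalised by the split information H(X); undefined if X is constant."""
    return information_gain(classes, attribute) / entropy(attribute)
```
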

5. Introduction to Support Vector Machines


Support Vector Machines (SVM) are learning systems that use a hypothesis space of linear functions in a high-dimensional feature space, trained with a learning algorithm from Optimisation Theory that implements a learning bias derived from Statistical Learning Theory [7]. The motivation for using a higher-dimensional feature space is that there is a higher probability of encountering linearly separable patterns. SVM incorporates many ideas, including quadratic programming and the use of kernels.


5.1 Linearly Separable Patterns


Consider a training sample

    S = ((x1, y1), ..., (xl, yl))

that is linearly separable in the feature space implicitly defined by the kernel K(xi, xj), and suppose the parameters α* and b* solve the following quadratic optimisation problem:

    maximise   W(α) = Σi αi - (1/2) Σi Σj yi yj αi αj K(xi, xj)
    subject to Σi yi αi = 0,  αi ≥ 0, i = 1, ..., l.

Then the decision rule given by

    f(x) = sgn( Σi yi αi* K(xi, x) + b* )

is equivalent to the maximal margin hyperplane in the feature space implicitly defined by the kernel K(xi, xj).
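
The decision rule lends itself to a direct sketch. The Python snippet below renders it with the degree-2 polynomial kernel used later in this report; the support vectors, multipliers and bias are hypothetical placeholders, not a trained CADRA model.

```python
import numpy as np

def poly_kernel(xi, xj, degree=2):
    """K(xi, xj) = (xi . xj + 1)^degree."""
    return (np.dot(xi, xj) + 1.0) ** degree

def decide(x, support_x, support_y, alphas, b):
    """f(x) = sgn( sum_i yi * alpha_i * K(xi, x) + b )."""
    s = sum(a * y * poly_kernel(xs, x)
            for xs, y, a in zip(support_x, support_y, alphas))
    return np.sign(s + b)
```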


5.2 Linearly Non-Separable Patterns


In actual real-life situations, the data obtained may be contaminated or corrupted with noise, causing patterns to be linearly non-separable. SVM provides a 'soft' approach to such classification through the use of slack variables,

    ξi ≥ 0, i = 1, ..., l,

along with relaxed constraints,

    yi (w · xi + b) ≥ 1 - ξi, i = 1, ..., l.

A classifier that generalises well is then found by controlling both the complexity of the classifier (via w · w) and the number of training errors, minimizing the objective function

    (1/2) w · w + C Σi ξi

subject to the constraints of the slack variables, the relaxed constraints and C > 0. The only modification in the dual formulation is the constraint on the αi values, namely

    0 ≤ αi ≤ C

instead of αi ≥ 0. The parameter C controls the tradeoff between the complexity of the machine and the number of non-separable points.



6. Software Used


The Weka package (version 3.0.1), from the University of Waikato, was used to run NBC and C4.5. For NBC, the package offers the use of kernel density estimation for modelling continuous attributes. For C4.5, the confidence interval was set to 0.1 for pruning. The Jonatrim 2 dataset was pre-processed into the required ARFF format before running the software. References with regard to the ARFF format and software features are available [8].


An application was developed to create the required models for SVM classification. The software uses the CSV (comma-separated values) data file format, which is readily supported by major spreadsheet programs like Microsoft Excel. The polynomial kernel of degree 2 was used in building the classifier (i.e. K(xi, xj) = (xi · xj + 1)^2).
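
The original application is not reproduced here, but an equivalent classifier can be sketched with a modern library. The snippet below uses scikit-learn's SVC with the same degree-2 polynomial kernel; the file name, column layout and numeric encoding of the attributes are hypothetical assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Assume a numeric CSV whose last column is the CAD status (header row skipped).
data = np.loadtxt("jonatrim2.csv", delimiter=",", skiprows=1)
X, y = data[:, :-1], data[:, -1]

# K(xi, xj) = (xi . xj + 1)^2  ->  kernel="poly", degree=2, gamma=1, coef0=1
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0)
clf.fit(X[:500], y[:500])                    # fixed training size of 500, as above
print(clf.score(X[500:600], y[500:600]))     # 100-sample test set
```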



7. Comparing Results from Different Algorithms


The classification results from the 3 algorithms explored, based on the Type A to Type F datasets, are summarised in the tables below. Note that the run that gave the better results for the particular dataset type for the NBC and C4.5 classifiers is quoted here. Also, the results stated here are still limited by the various issues and methodologies involved in training and testing the classifiers, and serve to indicate the estimated potential of the classifiers.


Statistics                   NBC       C4.5      SVM
True Positive (%)            44.64     44.57     36.00
True Negative (%)            42.36     43.57     33.00
False Positive (%)            7.64      6.43     17.00
False Negative (%)            5.36*     5.43     14.00
Sensitivity                  0.8929    0.8914    0.72
Specificity                  0.8471    0.8714    0.66
Predictive Value Positive    0.8548    0.8762    0.679245
Predictive Value Negative    0.8884    0.8916    0.702128
Success Rate                 0.87      0.8814*   0.69
Failure Rate                 0.13      0.1186    0.31

Table 2: Classification result comparison on Type A



Statistics                   NBC       C4.5      SVM
True Positive (%)            45.29     44.64     47.00
True Negative (%)            41.93     44.57     39.00
False Positive (%)            8.07      5.43     11.00
False Negative (%)            4.71      5.36      3.00*
Sensitivity                  0.9057    0.8929    0.94
Specificity                  0.8386    0.8914    0.78
Predictive Value Positive    0.8488    0.8938    0.810345
Predictive Value Negative    0.8996    0.8945    0.928571
Success Rate                 0.8721    0.8921*   0.86
Failure Rate                 0.1279    0.1079    0.14

Table 3: Classification result comparison on Type B





Statistics                   NBC       C4.5      SVM
True Positive (%)            41.36     42.57     36.00
True Negative (%)            43.07     44.86     35.00
False Positive (%)            6.93      5.14     15.00
False Negative (%)            8.64      7.43*    14.00
Sensitivity                  0.8271    0.85143   0.72
Specificity                  0.8614    0.89714   0.70
Predictive Value Positive    0.8629    0.8969    0.705882
Predictive Value Negative    0.8397    0.86299   0.714286
Success Rate                 0.8443    0.87429*  0.71
Failure Rate                 0.1557    0.12571   0.29

Table 4: Classification result comparison on Type C




Statistics                   NBC       C4.5      SVM
True Positive (%)            44.00     41.36     44.00
True Negative (%)            37.21     43.07     45.00
False Positive (%)           12.79      6.93      5.00
False Negative (%)            6.00*     8.64      6.00*
Sensitivity                  0.88      0.8271    0.88
Specificity                  0.7443    0.8614    0.90
Predictive Value Positive    0.7751    0.8629    0.897959
Predictive Value Negative    0.8624    0.8397    0.882353
Success Rate                 0.8121    0.8443    0.89*
Failure Rate                 0.1879    0.1557    0.11

Table 5: Classification result comparison on Type D



Statistics                   NBC       C4.5      SVM
True Positive (%)            28.57     26.79     24.00
True Negative (%)            31.71     27.29     31.00
False Positive (%)           18.29     22.71     19.00
False Negative (%)           21.43*    23.21     26.00
Sensitivity                  0.5714    0.5357    0.48
Specificity                  0.6343    0.5457    0.62
Predictive Value Positive    0.6101    0.5398    0.55814
Predictive Value Negative    0.5982    0.5471    0.54386
Success Rate                 0.6029*   0.5407    0.55
Failure Rate                 0.3971    0.4593    0.45

Table 6: Classification result comparison on Type E




Statistics                   NBC        C4.5      SVM
True Positive (%)            30.07      31.07     39.00
True Negative (%)            25.29      23.50     22.00
False Positive (%)           24.71      26.50     28.00
False Negative (%)           19.93      18.93     11.00*
Sensitivity                  0.601429   0.621429  0.78
Specificity                  0.505714   0.47      0.44
Predictive Value Positive    0.549736   0.54034   0.58209
Predictive Value Negative    0.55879    0.557516  0.666667
Success Rate                 0.553571   0.545714  0.61*
Failure Rate                 0.446429   0.454286  0.39

Table 7: Classification result comparison on Type F


As shown by the tables above, it is observed that the learning algorithms generally perform better on the Type A, B, C and D datasets. The exception is the SVM, which performs only average on Types A and C. Types E and F, which use only the A attributes, perform poorly. The highest success rate and the lowest false negative rate for each type are marked with an asterisk (*) for reference.





8. Classification based on Chinese samples


It was shown that for both algorithms explored (i.e. NBC and C4.5), better classification was generally observed when Chinese-trained classifiers were used on Chinese test cases. This may be due to less genetic and environmental variation within the race. The best result was from a Type G C4.5 decision tree, with a success rate of 95%.



9. CADRA's Web-based Bioinformatics System


A web-based Bioinformatics system providing a patient database, data mining and delivery of microarray gene-expression data has been proposed, and its prototype has been implemented by my colleagues and myself using various Microsoft technologies.


A web server was set up at the Clinical Research Centre to house the website. The server runs on the Windows 2000 operating system with Internet Information Server 5. Microsoft's Active Server Pages 3.0 (ASP 3.0) was used. Access 2000 was used as the prototype database. Microsoft BackOffice 4.5 and SQL Server 7 have also been installed. Entry into the website can be found at http://bfigcad.nus.edu.sg/cad_ra/login.asp.


9.1 Variations on the CADRA Prototype


The prototype of the CADRA web-based system is a generic template for more elaborate structures. One such possible structure is described in Figure 1.



Figure 1: Alternative structure of service delivery with
Appli
cation and Internet Service Provider separate from
Microarray Lab


It was required that other technologies and products be explored for the project, and several discoveries were made. Java technology from Sun Microsystems offers a viable alternative to Microsoft's products. In combination with the Linux operating system, the Apache Web Server and the MySQL database, this solution offers a low-cost yet robust option for the Internet and Application Service Provider.


10. Conclusion


Generally good classification results were obtained for all 3 algorithms used, except when only microarray gene-expression data were used. It is felt that for SVM, better results could be obtained using other kernels. Furthermore, better classification results were generally observed when Chinese test sets were shown to Chinese-trained classifiers, as opposed to classifiers trained on all races, based on NBC and C4.5. Lastly, a web-based bioinformatics system implemented with my colleagues for both medical information retrieval and CAD risk assessment has been described.



References

[1] Stuart Russell & Peter Norvig, Artificial Intelligence: A Modern Approach, Prentice Hall, 1995, p. 538.

[2] Gary M. Weiss & Foster Provost, The Effects of Class Distribution on Classifier Learning, Rutgers University, 2001.

[3] Chin W.C., Gene Identification of DNA Microarray Patterns Using Neural Network Techniques, National University of Singapore, 2000.

[4] Congdon, C.B., A Comparison of Genetic Algorithms and Other Machine Learning Systems on a Complex Classification Task from Common Disease Research, University of Michigan, 1995.

[5] Domingos, P. & Pazzani, M., Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier, Proceedings of the 13th International Conference on Machine Learning, Morgan Kaufmann, San Francisco, CA, 1996, pp. 105-112.

[6] Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993, p. vii.

[7] Nello Cristianini & John Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000, pp. 7, 30-31, 93-110.

[8] Witten, I. & Frank, E., Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2000, pp. 265-286.