
A COMPARATIVE STUDY OF MACHINE LEARNING ALGORITHMS APPLIED TO PREDICTIVE TOXICOLOGY DATA MINING


Neagu C.D.*, Guo G.*, Trundle P.R.* and Cronin M.T.D.**

*Department of Computing, University of Bradford, Bradford, BD7 1DP, UK
{D.Neagu, G.Guo, P.R.Trundle}@bradford.ac.uk

**School of Pharmacy and Chemistry, Liverpool John Moores University, L3 3AF, UK
M.T.Cronin@ljmu.ac.uk



Abstract: This paper reports the results of a comparative study of widely used machine learning algorithms applied to predictive toxicology data mining. The machine learning algorithms were chosen for their representability and diversity, and are extensively evaluated on seven toxicity data sets drawn from real-world applications. We highlight experimental results based on visual analysis of the correlations of different descriptors to the class values of chemical compounds, and on the relationship between the range of chosen descriptors and the performance of the machine learning algorithms. We present some interesting findings on data and model quality: no single algorithm appears best for all seven toxicity data sets, and up to five descriptors are sufficient to create classification models of good accuracy for each toxicity data set. We suggest that, for a specific data set, model accuracy is affected by both the feature selection method and the model development technique. Models built with too many or too few descriptors are both undesirable, and finding the optimal feature subset appears at least as important as selecting appropriate algorithms with which to build a final model.

Keywords: predictive toxicology, data mining, algorithms, visual analysis, feature selection

1. Introduction

The increasing amount and complexity of data used in predictive toxicology calls for new and flexible approaches to mine the data. Traditional manual data analysis has become inefficient, and computer-based analysis is indispensable. Statistical methods [1], expert systems [2], fuzzy neural networks [3] and other machine learning algorithms [4, 5] have been extensively studied and applied to predictive toxicology for model development and decision making. However, existing toxicity data sets are difficult to model because of numerous irrelevant descriptors, skewed distributions, missing values and noisy data, and no dominant machine learning algorithm can be proposed to model all the available toxicity data sets accurately. This motivated us to conduct a comparative study of machine learning algorithms applied to seven toxicity data sets. The intention of this study was to discuss the applicability of some widely used machine learning algorithms to the toxicity data sets at hand. For this purpose, seven machine learning algorithms, described in the next section, were chosen for this comparative study in terms of their representability and diversity, and a library of models was built in order to provide useful model benchmarks for researchers working in this area.


2. Methods

2.1. Machine Learning Algorithms

Seven algorithms have been chosen for this study in terms of their representability, i.e. their ability to learn numerical data as reported by the machine learning community [6]. They were also chosen in terms of their diversity, i.e. the different ways in which they learn data and represent the final models [6]. A brief introduction to the seven machine learning algorithms applied in this study is given below:



Support Vector Machine [7] - SVM is based on the Structural Risk Minimization principle from statistical learning theory. Given a training set in a vector space, SVM finds the best decision hyperplane that separates the instances of two classes. The quality of a decision hyperplane is determined by the distance (referred to as the margin) between two hyperplanes that are parallel to the decision hyperplane and touch the closest instances from each class.



Bayes Net [8] - Given a data set with instances characterized by features A1,...,Ak, the BN method assigns to a new instance with observed feature values a1 through ak the most probable class value c, i.e. the value c for which P(c | A1 = a1,...,Ak = ak) is maximal.
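For illustration only, under an independence assumption over the features this rule reduces to the familiar naive Bayes classifier, which maximises P(c) multiplied by the product of the P(ai | c). A minimal sketch over discrete features, with Laplace smoothing (this is not the Bayes Net implementation used in our experiments):

```python
import math
from collections import Counter

def naive_bayes_predict(train, labels, instance):
    """Assign the most probable class value c to `instance`, i.e. the c
    maximising P(c) * prod_i P(a_i | c), estimated from `train` with
    Laplace smoothing over discrete feature values."""
    class_counts = Counter(labels)
    n = len(train)
    best_class, best_score = None, float("-inf")
    for c, count_c in class_counts.items():
        score = math.log(count_c / n)               # log prior P(c)
        for i, a in enumerate(instance):
            # training instances of class c whose i-th feature equals a
            match = sum(1 for x, y in zip(train, labels) if y == c and x[i] == a)
            n_values = len({x[i] for x in train})   # distinct values of feature i
            score += math.log((match + 1) / (count_c + n_values))
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

A full Bayes Net additionally models dependencies between features; the sketch above keeps only the class-conditional independence special case.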



Decision Tree [9] - DT is a widely used classification method in machine learning and data mining. The decision tree is grown by recursively splitting the training set based on a locally optimal criterion until all or most of the records belonging to each of the leaf nodes bear the same class label.



Instance-Based Learners - IBLs [10] classify an instance by comparing it to a set of pre-classified instances and choosing the dominant class of the most similar instances as the classification result.
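A minimal sketch of this idea, assuming numerical features and Euclidean distance (illustrative only; the experiments used the Weka implementation, with k as listed in Tables 2 and 3):

```python
import math
from collections import Counter

def ibl_classify(train, labels, instance, k=5):
    """Classify `instance` by majority vote among its k nearest
    pre-classified neighbours (Euclidean distance)."""
    dists = sorted(
        (math.dist(x, instance), y) for x, y in zip(train, labels)
    )
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]   # dominant class of similar instances
```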



Repeated Incremental Pruning to Produce Error Reduction - RIPPER [11] is a propositional rule learning algorithm that performs efficiently on large noisy data sets. It induces classification (if-then) rules from a set of pre-labelled instances, searching for a rule set that predicts the class of those instances. It also allows users to specify constraints on the learned if-then rules to add prior knowledge about the concepts, in order to obtain a more accurate hypothesis.



Multi-Layer Perceptrons - MLPs [11] are feedforward neural networks with one or two hidden layers, trained with the standard backpropagation algorithm. They can approximate virtually any input-output map and have been shown to approximate the performance of optimal statistical classifiers in difficult problems.



Fuzzy Neural Networks - FNNs [12] are connectionist structures that implement fuzzy rules and fuzzy inference. We use the Back Propagation (BP) algorithm to identify and express input-output relationships in the form of fuzzy rules, thus further enabling possible knowledge extraction by humans.


2.2. Toxicity Data Sets

For the purpose of evaluation, seven data sets from real-world applications were chosen. Five of them, i.e. TROUT, ORAL_QUAIL, DAPHNIA, DIETARY_QUAIL and BEE, come from the DEMETRA project [13]; the APC data set was provided by the Central Science Laboratory (CSL), York, England [14]; and the Phenols data set comes from the TETRATOX database [15]. A random division of each data set into a training set and a testing set was carried out before evaluation. General information about these data sets is given in Table 1.

<Table 1>

In Table 1, the meaning of each column title is as follows: NI - Number of Instances; NF_FS - Number of Features after Feature Selection using a correlation-based method which identifies subsets of features that are highly correlated to the class [16]; NC - Number of Classes; CD - Class Distribution; CD_TR - Class Distribution of the TRaining set; CD_TE - Class Distribution of the TEsting set.
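As a loose illustration of correlation-driven feature filtering: the sketch below merely ranks individual features by the absolute value of their Pearson correlation with a numeric class, whereas the correlation-based subset selection of [16] also accounts for redundancy between the features themselves.

```python
def pearson(xs, ys):
    """Plain Pearson correlation coefficient (assumes non-constant inputs)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def rank_features(data, labels):
    """Rank feature indices by |correlation| with the (numeric) class,
    most correlated first."""
    n_feat = len(data[0])
    corr = [abs(pearson([row[f] for row in data], labels))
            for f in range(n_feat)]
    return sorted(range(n_feat), key=lambda f: corr[f], reverse=True)
```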


3. Results

Experimental results of the different algorithms evaluated on these seven data sets are presented in Tables 2 and 3, where the parameter LR for MLP stands for the learning rate and the parameter k for IBL stands for the number of nearest neighbours used for classifying new instances. The learning rate is a parameter that controls the adjustment of the connection strengths during the training process of a neural network [11].

The classification accuracies of the models created by each algorithm vary between data sets: some accuracies are relatively poor when compared to 'benchmark' data sets from the University of California at Irvine (UCI) machine learning repository [17]. The UCI machine learning repository is a collection of databases, domain theories and data generators used by the machine learning community for the empirical analysis of machine learning algorithms. We ran the same algorithms against some UCI data sets and found that the performances obtained were on average better than for the toxicity models [18]. This indicates that the data in the seven toxicity data sets used in this paper, which are often noisy, unevenly distributed across the multi-dimensional attribute space, and have a low ratio of instances (rows) to features (columns), can make accurate class predictions difficult.

In Tables 2 and 3, the classification accuracy is defined by eq. (1):

    Accuracy = (number of correctly classified instances / total number of instances) x 100%    (1)
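In code, the accuracy computation is simply (a trivial sketch, assuming the usual definition as the percentage of correctly classified instances):

```python
def classification_accuracy(predicted, actual):
    """Percentage of correctly classified instances."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)
```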

<Table 2>

In Tables 2 and 3, the figures in bold in each row represent the best classification accuracy for the data set named to the left. Table 2 helps identify the best model developed by the considered algorithms. Table 3 focuses on identifying the most suitable algorithm to develop good models for the data sets under consideration. Moreover, Table 2 reports accuracies for a single train/test split of the data (see Table 1), whereas the data used for the models in Table 3 have been automatically split 90/10 ten times (ten-fold cross validation): in each of the 10 cases, 90 percent of the toxicity data were used for training and the remaining 10 percent for testing. The results reported in Table 3 are the average classification accuracy over the 10 tests. This means that the models listed in Table 2 are more dependent on the division of the data sets than the models reported in Table 3. Consequently, the classification accuracies listed in Table 3 more fairly reflect the learning ability of each machine learning algorithm.
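The evaluation protocol behind Table 3 can be sketched as a generic ten-fold cross-validation loop (an illustration; the actual experiments relied on Weka, so details such as shuffling and stratification may differ from this sketch):

```python
import random

def ten_fold_accuracy(data, labels, train_and_score, folds=10, seed=0):
    """Split the data into `folds` parts; in each round, train on the
    other folds (90%) and test on the held-out fold (10%); report the
    mean test accuracy over the rounds."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    parts = [idx[i::folds] for i in range(folds)]   # 10 disjoint folds
    scores = []
    for test_idx in parts:
        held = set(test_idx)
        train = [(data[i], labels[i]) for i in idx if i not in held]
        test = [(data[i], labels[i]) for i in test_idx]
        scores.append(train_and_score(train, test))
    return sum(scores) / folds
```

Here `train_and_score` is a placeholder callback that builds a model on the training pairs and returns its accuracy on the test pairs.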

<Table 3>

Data set properties like noisiness, uneven distribution and size can make creating accurate models difficult. As shown in Table 3, some algorithms appear more suitable for particular data sets, i.e. obtain higher classification accuracy: IBL for BEE, SVM for PHENOLS and BN for APC. They exhibit higher than average accuracy on these data sets compared to their results across all seven data sets. This implies that careful algorithm selection can make the creation of accurate models more straightforward.

A case study of visual analysis [19] of the correlations of different descriptors to the class values of chemical compounds has been carried out on two data sets: PHENOLS and TROUT. Figures 1 and 2 show the three selected attributes that are most highly correlated to the class for these data sets. For PHENOLS the three selected attributes were Log P, the magnitude of the dipole moment and the molecular weight, and the class is described by the mechanism of action. For TROUT the three selected attributes were the 3rd order valence-corrected cluster molecular connectivity, the specific polarity and Log D at pH 9, and the class value is given by LC50 (mg/l) after 96 hours for the rainbow trout.

Figure 1 (PHENOLS) shows a moderately good distribution of data, but lacks clearly defined boundaries between classes. In particular, Class 2 and Class 3 show a large amount of overlap in the lower portion of the graph. Figure 2 (TROUT) shows the same lack of boundaries between classes, but also shows an uneven distribution of data: a large cluster of data-points from all three classes can be seen to the left of the graph, with only a small number of data-points falling in the remaining attribute space. These factors contribute to the relatively low prediction accuracies obtained on these toxicity data sets.

Whilst it is common practice to remove outliers from data sets with the intention of improving the prediction accuracy of models, the aim of this paper was not to create highly predictive models, but rather to investigate the probable causes of poor model performance; undoubtedly outliers are one such cause.



<Figure 1>

<Figure 2>



A further study on the implications of data quality for classification accuracy has been carried out. Two data sets, PHENOLS and ORAL_QUAIL, and six algorithms (BN, MLP, IBL, DT, RIPPER and SVM) were considered in the experiment. The top 20 descriptors from each data set with the highest correlation to class values were extracted using the feature selection method ReliefF [20] implemented in Weka (a). ReliefF is an extension of the Relief algorithm, which works only for binary classification problems. The Relief algorithm works for two-class problems by randomly sampling an instance and locating its nearest neighbour from the same and the opposite class. The values of the features of the nearest neighbours are compared to the sampled instance and used to update the relevance scores for each feature. ReliefF aims to cope with multi-class, noisy and incomplete data.
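The two-class Relief update described above can be condensed into a short sketch (illustrative only; the experiments used the ReliefF implementation in Weka):

```python
import math
import random

def relief_scores(data, labels, n_samples=50, seed=0):
    """Relief for two-class data: repeatedly sample an instance, find its
    nearest hit (same class) and nearest miss (opposite class), and move
    each feature's relevance score up by the miss difference and down by
    the hit difference."""
    rng = random.Random(seed)
    n_feat = len(data[0])
    scores = [0.0] * n_feat
    for _ in range(n_samples):
        i = rng.randrange(len(data))
        x, y = data[i], labels[i]
        def nearest(same_class):
            cands = [data[j] for j in range(len(data))
                     if j != i and (labels[j] == y) == same_class]
            return min(cands, key=lambda z: math.dist(x, z))
        hit, miss = nearest(True), nearest(False)
        for f in range(n_feat):
            scores[f] += abs(x[f] - miss[f]) - abs(x[f] - hit[f])
    return scores   # higher score = more relevant feature
```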

Twenty models were created for each data set, with each model using the n most correlated descriptors to the class, where n varied from 1 to 20. The 10-fold cross-validated accuracies of these models are presented in Figures 3 and 4.
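The experimental loop of this descriptor-count study can be sketched as follows (illustrative; a simple 1-NN classifier and a single holdout split stand in for the actual algorithms and the 10-fold protocol used in the paper):

```python
import math
import random

def knn1_accuracy(train, test):
    """Holdout accuracy of a 1-nearest-neighbour classifier (illustrative)."""
    correct = 0
    for x, y in test:
        _, pred = min((math.dist(x, tx), ty) for tx, ty in train)
        correct += pred == y
    return correct / len(test)

def accuracy_by_n_descriptors(data, labels, ranking, max_n=20, holdout=0.3, seed=0):
    """For n = 1..max_n, keep only the n top-ranked descriptors and measure
    the holdout accuracy of a simple classifier on the reduced data."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - holdout))
    results = []
    for n in range(1, min(max_n, len(ranking)) + 1):
        keep = ranking[:n]                       # n most correlated descriptors
        proj = [tuple(row[f] for f in keep) for row in data]
        train = [(proj[i], labels[i]) for i in idx[:cut]]
        test = [(proj[i], labels[i]) for i in idx[cut:]]
        results.append((n, knn1_accuracy(train, test)))
    return results
```

Plotting the returned (n, accuracy) pairs reproduces the shape of curves such as those in Figures 3 and 4 for whichever classifier is plugged in.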

<Figure 3>

<Figure 4>

Figure 3 shows that increasing the number of descriptors used to build the models on the PHENOLS data set has little impact once the top 3-4 descriptors (1: an indicator variable for the presence of a 2- or 4-dihydroxy phenol (OH OH); 2: the maximum donor superdelocalisability; 3: Log P (calculated by the ACD software); 4: the number of elements in each molecule (Nelem)) are included. After this point the accuracies of the various algorithms vary by little more than 5%. This suggests that the first 4 descriptors of the PHENOLS data set have a high correlation to the class value, and that they are sufficient to describe the majority of variation within the data.




(a) Weka: a free data mining software: http://www.cs.waikato.ac.nz/~ml/weka

Figure 4 (ORAL_QUAIL data) shows that increasing the number of descriptors used to create a model can decrease the subsequent accuracy. This reflects the unreliability of the ORAL_QUAIL data set, i.e. a large amount of noise, less relevant descriptors, etc. The first 4-5 descriptors (1: SdsssP_acnt - count of all (-> P =) groups in the molecule; 2: SdsssP - sum of all (-> P =) E-state values; 3: SdS_acnt - count of all (= S) groups in the molecule; 4: SdS - sum of all (= S) E-state values in the molecule; 5: SssO_acnt - count of all (- O -) groups in the molecule) of this data set appear to be sufficient for creating models, and including any further descriptors could lead to possible overfitting on the noisy and irrelevant data they contain.


4. Conclusions

The outcomes of our comparative study and experiments show that single classifier-based models are not sufficiently discriminative for all the data sets considered, given the main characteristics of toxicity data (noisiness, uneven distribution and size).

Case studies of a multiple classifier combination system [21] indicate that hybrid intelligent systems are worthy of further research in order to obtain better performance for specific applications in predictive toxicology data mining. This is because multiple classifier combination systems have the advantage that they can manage complex class distributions through combinations of different model learning abilities.

The authors would also speculate that model accuracy could be improved further by choosing a particular feature selection method based on the data set and algorithm used. The inclusion of more feature selection methods, e.g. kNNMFS [22] and ReliefF [20], is proposed as future work.

The comparison of models created using different numbers of features highlights the need for care when using feature selection techniques. Reducing the number of descriptors in a data set is commonly accepted as a necessary step towards highly predictive, yet interpretable, models. However, as our results show, an optimum number of descriptors exists, at least for the data sets used here. Models built with too many or too few descriptors are both undesirable, and finding the optimal feature subset appears at least as important as selecting appropriate algorithms with which to build a final model.


Acknowledgements

This work was partially supported by the EU FP5 project DEMETRA (http://www.demetra-tox.net). DN, MC and GG acknowledge the support of the EPSRC project PYTHIA GR/T02508/01 (http://pythia.inf.brad.ac.uk). PT acknowledges the support of the EPSRC + CSL grant.


References

1. Eriksson, L., Johansson, E. & Lundstedt, T. (2004). Regression- and Projection-based Approaches in Predictive Toxicology. In Predictive Toxicology (ed. C. Helma). Marcel Dekker, New York.
2. Parsons, S. & McBurney, P. (2004). The Use of Expert Systems for Toxicology Risk Prediction. In Predictive Toxicology (ed. C. Helma). Marcel Dekker, New York.
3. Mazzatorta, P., Benfenati, E., Neagu, D. & Gini, G. (2000). Tuning Neural and Fuzzy-neural Networks for Toxicity Modelling. Journal of Chemical Information and Computer Sciences, American Chemical Society, Washington, 42(5), 1250-1255.
4. Craciun, M.V., Neagu, D., Craciun, C.A. & Smiesko, M. (2004). A Study of Supervised and Unsupervised Machine Learning Methodologies for Predictive Toxicology. In Intelligent Systems in Medicine (ed. H.N. Teodorescu), pp. 61-69. Performantica, Iasi, Romania.
5. Guo, G. & Neagu, D. (2005). Fuzzy kNNModel Applied to Predictive Toxicology Data Mining. Journal of Computational Intelligence and Applications, Imperial College Press, 5(3), 1-13.
6. Caruana, R. & Niculescu-Mizil, A. (2006). An Empirical Comparison of Supervised Learning Algorithms. In Procs. of ICML 2006, pp. 161-168.
7. Cristianini, N. & Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines (and other kernel-based learning methods). Cambridge University Press.
8. Cooper, G.F. & Herskovits, E. (1992). A Bayesian Method for the Induction of Probabilistic Networks from Data. Machine Learning, Kluwer Academic Publishers, 9, 309-347.
9. Quinlan, J.R. (1986). Induction of Decision Trees. Machine Learning, Kluwer Academic Publishers, 1(1), 81-106.
10. Aha, D.W., Kibler, D. & Albert, M.K. (1991). Instance-based Learning Algorithms. Machine Learning, Kluwer Academic Publishers, 6, 37-66.
11. Witten, I.H. & Frank, E. (2000). Data Mining: Practical Machine Learning Tools with Java Implementations. Morgan Kaufmann, San Francisco.
12. Liu, P. & Li, H. (2004). Fuzzy Neural Network Theory and Application. Series in Machine Perception and Artificial Intelligence, 59.
13. Website of the EU FP5 Quality of Life project DEMETRA QLRT-2001-00691: Development of Environmental Modules for Evaluation of Toxicity of pesticide Residues in Agriculture, 2001-2006: http://www.demetra-tox.net
14. Project CSL: Development of Artificial Intelligence-based In-silico Toxicity Models for Use in Pesticide Risk Assessment, 2004-2007.
15. Schultz, T.W. (1997). TETRATOX: Tetrahymena pyriformis Population Growth Impairment Endpoint - A Surrogate for Fish Lethality. Toxicol. Methods, 7, 289-309.
16. Hall, M.A. (1998). Correlation-based Feature Subset Selection for Machine Learning. PhD Thesis, University of Waikato.
17. Newman, D.J., Hettich, S., Blake, C.L. & Merz, C.J. (1998). UCI Repository of Machine Learning Databases: http://www.ics.uci.edu/~mlearn/MLRepository.html. Irvine, CA: University of California, Department of Information and Computer Science.
18. Shhab, A., Guo, G. & Neagu, D. (2005). A Study on Applications of Machine Learning Techniques in Data Mining. In Procs. of the 22nd BNCOD Workshop on Data Mining and Knowledge Discovery in Databases (eds. David Nelson, Sue Stirk, Helen Edwards, Kenneth McGarry), pp. 131-138. University of Sunderland Press.
19. Keim, D.A. (2002). Information Visualization and Visual Data Mining. IEEE Transactions on Visualization and Computer Graphics, 8(1), 1-8.
20. Kononenko, I. (1994). Estimating Attributes: Analysis and Extension of Relief. In Procs. of ECML'94, the Seventh European Conference on Machine Learning, Springer-Verlag, pp. 171-182.
21. Neagu, D. & Guo, G. (2006). An Effective Combination based on Class-wise Expertise of Diverse Classifiers for Predictive Toxicology Data Mining. In Procs. of ADMA 2006, Springer Berlin/Heidelberg, LNAI 4093/2006, 165-172.
22. Guo, G., Neagu, D. & Cronin, M.T.D. (2005). Using kNN Model for Automatic Feature Selection. In Proc. of ICAPR 2005, Springer Berlin/Heidelberg, LNCS 3686/2005, 410-419.


Tables

Table 1. General information about toxicology data sets

Data sets       NI   NF_FS  NC  CD              CD_TR           CD_TE
TROUT           282  22     3   129:89:64       109:74:53       20:15:11
ORAL_QUAIL      116  8      4   4:28:24:60      3:24:19:51      1:4:5:9
DAPHNIA         264  20     4   122:65:52:25    105:53:43:21    17:12:9:4
DIETARY QUAIL   123  12     5   8:37:34:34:10   7:31:28:29:8    1:6:6:5:2
BEE             105  11     5   13:23:13:42:14  12:18:11:35:12  1:5:2:7:2
PHENOLS         250  11     3   61:152:37       43:106:26       18:46:11
APC             60   6      4   17:16:16:11     12:12:12:9      5:4:4:2


Table 2. Classification accuracies of different algorithms on seven data sets

Data sets       BN     MLP    LR   IBL    k   DT     RIPPER  SVM    FNN
TROUT           56.52  65.22  0.3  63.04  5   56.52  54.35   60.87  50.00
ORAL_QUAIL      47.37  47.37  0.3  47.37  5   47.37  42.10   47.37  47.37
DAPHNIA         47.62  54.76  0.3  64.29  5   45.24  57.14   52.38  57.14
DIETARY QUAIL   40.00  70.00  0.9  60.00  10  45.00  40.00   55.00  40.00
BEE             58.82  58.82  0.9  70.59  1   58.82  58.82   58.82  47.06
PHENOLS         70.67  86.67  0.3  73.33  5   77.33  72.00   78.67  73.33
APC             40.00  53.33  0.9  53.33  5   53.33  46.67   46.67  40.00
Average         51.57  62.31  /    61.71  /   54.80  53.76   57.11  50.70

Table 3. Classification accuracies of different algorithms on seven data sets using ten-fold cross validation

Data sets       BN     MLP    LR   IBL    k   DT     RIPPER  SVM    FNN
TROUT           61.70  58.16  0.9  59.93  5   55.32  56.74   62.06  59.79
ORAL_QUAIL      62.07  51.72  0.3  57.76  5   62.93  60.34   65.52  55.27
DAPHNIA         50.38  53.41  0.3  54.17  5   50.00  50.00   54.55  50.00
DIETARY QUAIL   42.28  55.28  0.3  48.78  5   45.53  39.84   48.78  37.50
BEE             49.52  51.43  0.3  58.09  5   45.71  46.67   53.33  55.89
PHENOLS         76.40  78.40  0.3  74.80  10  74.40  76.40   80.00  72.67
APC             58.33  40.00  0.3  43.33  5   43.33  40.00   43.33  40.00
Average         57.24  55.49  /    56.69  /   53.89  52.86   58.22  53.02

Figures

Figure 1: Three attributes most correlated to class in the PHENOLS data set

Figure 2: Three attributes most correlated to class in the TROUT data set

Figure 3: Performances for PHENOLS

Figure 4: Performances for ORAL_QUAIL