Rule Extraction From Trained
Neural Networks
Brian Hudson
University of Portsmouth, UK
Artificial Neural Networks
Advantages
High accuracy
Robust
Noisy data
Disadvantages
Lack of comprehensibilty
Trepan
A method for extracting a decision tree from
an artificial neural network (Craven, 1996).
The tree is built by expanding nodes in a best
first manner, producing an
unbalanced
tree.
The splitting tests at the nodes are m

of

n
tests
e.g
. 2

of

{x
1
, ¬x
2
, x
3
}, where the x
i
are Boolean
conditions
The network is used as an
oracle
to answer
queries during the learning process.
Splitting Tests
Start with a set of candidate tests
binary tests on each value for nominal features
binary tests on thresholds for real

valued features
Find optimal splitting test by a beam search,
initializing beam with candidate test
maximizing the
information gain.
Splitting Tests
To each m

of

n test in the beam and each
candidate test, apply two operators:
m

of

(n+1)
e.g
. 2

of

{x
1
, x
2
} => 2

of

{x
1
, x
2
, x
3
}
(m+1)

of

(n+1)
e.g
. 2

of

{x
1
, x
2
} => 3

of

{x
1
, x
2
, x
3
}
Admit new tests to the beam if they increase
the information gain
and differ significantly
(chi

squared) from existing tests.
Data Modelling
The amount of training data reaching each
node decreases with depth of tree.
TREPAN creates new training cases by
sampling the distributions of the training data
empirical distributions for nominal inputs
kernel density estimates for continuous inputs
Apply oracle (
i.e.
neural network) to new
training cases to assign output values.
Application to Bioinformatics
Prediction of Splice Junction sites
in Eukaryotic DNA
Splice Junction Sites
Consensus Sequences
Donor

3

2

1 +1 +2 +3 +4 +5 +6
C/G A G
 G T
A/G A G T
Acceptor

12

11

10

9

8

7

6

5

4

3

2

1 1
C/T C/T C/T C/T C/T C/T C/T C/T C/T C/T
A G 
G
EBI Dataset
Clean dataset generated at EBI
(Thanaraj, 1999)
Donors
training set: 567 positive, 943 negative
test set: 229 positive, 373 negative
Acceptors
training set: 637 positive, 468 negative
test set: 273 positive, 213 negative
Results
TREPAN Donor Tree
3 of {

2=A,

1=G, +3=A, +4=A, +5=G}
Positive
869:74
Negative
43:533
C/G A G
 G T
A/G A G T
Yes
No
C5 Donor Tree (extract)
p5=G
p3=C
or
p3=T => NEGATIVE
p3=A
p2=G => POSITIVE
p2=A
p4=A
or
p4=G => POSITIVE
p4=C
or
p4=T => NEGATIVE
p2=C
p4=A => POSITIVE
else
=> NEGATIVE
p2=T
p6=A
or
p6=G => NEGATIVE
p6=C
or
p6=T => POSITIVE
p3=G
p4=T => NEGATIVE
p4=C
p6=T => POSITIVE
else
=> NEGATIVE
Trepan Acceptor Tree
1 of {

3=G,

5=G}
NEGATIVE
{

3=A}
NEGATIVE
POSITIVE
NEGATIVE
2 of {+1!=G,

5=G}
C/T … C/T
A G

G
Application to
Chemoinformatics
1.
Learning general rules
2.
Conformational Analysis
3.
QSAR dataset
Oprea Dataset
137 diverse compounds
Classification
62 leads, 75 drugs
14 descriptors (from Cerius

2)
MW, MR, AlogP
Ndonor, Nacceptor, Nrotbond
Number of Lipinski violations
T.I. Oprea, A.M. Davis, S.J. Teague & P.D. Leeson, “Is there a
difference between Leads & Drugs? A Historical Perspective”, J. Chem.
Inf. & Comput. Sci.,
41
, 1308

1315, (2001).
C5 tree
MW <= 380 [ Mode: lead ]
Rule of 5 Violations = 0 [ Mode: lead ]
Hbond acceptor <= 2 [ Mode: lead ] => lead
Hbond acceptor > 2 [ Mode: drug ] => drug
Rule of 5 Violations > 0 [ Mode: lead ] => lead
MW > 380 [ Mode: drug ] => drug
Trepan Oprea Tree
1 of { MW<296, MR<85 }
Lead
52:3
Unclassified
12:49
MW<454
Drug
1:20
Conformational Analysis
300 conformations from
5ns MD simulation of rosiglitazone
Classified by length of long axis into
Extended
–
distance > 10A
Folded
–
distance < 10A
8 torsion angles
In house data.
Rosiglitazone
Agonist of PPAR gamma Nuclear Receptor
Regulates HDL/LDL and triglycerides
Active ingredient of Avandia for Type II
Diabetes
Distances
C5 tree
T5 <= 269 [ Mode: extended ]
T5 <= 52 [ Mode: extended ]
T7 <= 185 [ Mode: extended ] => extended
T7 > 185 [ Mode: folded ]
T6 <= 75 [ Mode: folded ] => folded
T6 > 75 [ Mode: extended ]
T5 <= 41 [ Mode: folded ]
T8 <= 249 [ Mode: folded ] => folded
T8 > 249 [ Mode: extended ] => extended
T5 > 41 [ Mode: extended ] => extended
T5 > 52 [ Mode: extended ]
T6 <= 73 [ Mode: extended ]
T8 <= 242 [ Mode: extended ]
T5 <= 7 [ Mode: extended ]
T8 <= 22 [ Mode: extended ] => extended
T8 > 22 [ Mode: folded ] => folded
T5 > 7 [ Mode: extended ] => extended
T8 > 242 [ Mode: extended ] => extended
T6 > 73 [ Mode: extended ] => extended
T5 > 269 [ Mode: folded ] => folded
Trepan Conformation Tree
T5 < 180
Extended
133:0
Unclassified
2:5
2 of { T7<181, T2>172}
Folded
0:161
Ferreira Dataset
“typical” QSAR dataset
48 HIV

1 Protease inhibitors
Activity as pIC50
Low pIC50 < 8.0
High pIC50 > 8.0
14 descriptors (mostly topological)
R. Kiralj and M.M.C. Ferreira, “A

priori Molecular Descriptors in QSAR :
a case of HIV

1 protease inhibitors I. The Chemometric Approach”, J.
Mol. Graph. & Modell.
21
, 435

448, (2003)
Original Results
PLS model
Activity determined by
X9,X11,X10,X13
R
2
= 0.91, Q
2
=0.85, Ncomps=3
C5 tree
X11 <= 2.5 [ Mode: low ]
X13 <= 16.7 [ Mode: low ] => low
X13 > 16.7 [ Mode: high ] => high
X11 > 2.5 [ Mode: high ] => high
Trepan Ferreira Tree
1 of { X13<16.1, X9<3.4 }
High
1:24
X1<552
Low
17:1
Low
4:1
High
0:1
X6<0.04
Accuracy
Conclusions
Reasonable Accuracy
Comprehensible Rules
Acknowledgements
David Whitley.
Tony Browne.
Martyn Ford.
BBSRC grant reference BIO/12005.
Enter the password to open this PDF file:
File name:

File size:

Title:

Author:

Subject:

Keywords:

Creation Date:

Modification Date:

Creator:

PDF Producer:

PDF Version:

Page Count:

Preparing document for printing…
0%
Comments 0
Log in to post a comment