
Rule-Based Clinical Data Mining: Inductive Rule Learning Algorithms

CS 6795: Semantic Web Techniques

Instructor: Dr. Harold Boley
Advisor: Dr. Hongyu Liu, Liqing Geng

Prepared by: Andriy Drozdyuk
             Ki Hyang Lee
             Shihyon Park

Dec. 15, 2009





1. Introduction

The vision behind the Semantic Web is that computers should also be able to understand and exploit information offered on the web [6]. As its main knowledge representation formalisms, the Semantic Web uses rules and ontologies. Rules describe the logical inferences that can be drawn from particular facts. If an ontology provides domain-specific background information, then web information can be annotated by statements readable and interpretable by machines via the common ontological background knowledge [6]. Therefore it is important to provide domain-specific background information or knowledge. Reasoning plays an important role on the Semantic Web: "Based on ontological background knowledge and the set of asserted statements, logical reasoning can derive new statements" [6]. However, there are some limitations: uncertain information on the Semantic Web is not easily handled by logical reasoning, and logical reasoning is completely based on axiomatic prior knowledge and does not exploit regularities in the data that have not been formulated as ontological background knowledge [6]. Since our data is uncertain and is not based on axiomatic prior knowledge, it is not suitable for logical reasoning, so it is important to start by analyzing machine learning algorithms that are suitable for Semantic Web applications. Therefore, in this project, we explore rule-based data mining methods to solve some biomedical problems using machine learning algorithms and inductive logic programming.

The attribute information of our raw data is as follows (the class attribute has been moved to the last column):

#    Attribute                       Domain
------------------------------------------------------------------------
1.   Sample code number              id number (we removed this attribute)
2.   Clump Thickness                 1 - 10
3.   Uniformity of Cell Size         1 - 10
4.   Uniformity of Cell Shape        1 - 10
5.   Marginal Adhesion               1 - 10
6.   Single Epithelial Cell Size     1 - 10
7.   Bare Nuclei                     1 - 10
8.   Bland Chromatin                 1 - 10
9.   Normal Nucleoli                 1 - 10
10.  Mitoses                         1 - 10
11.  Class                           2 for benign, 4 for malignant

Class distribution: Benign: 458 (65.5%), Malignant: 241 (34.5%)




2. Machine Learning Algorithms: Weka

With the widespread use of electronic medical records, researchers want to apply data mining and machine learning techniques to this data in order to extract individual treatment plans. The ultimate goal is to predict a potential disease and provide medical information from a patient's clinical history. However, there exist many challenges for machine learning and data mining techniques. One of the big challenges is to normalize data from several tables, but we did not cover this in this project. Therefore, we used an already normalized medical data set (the Breast Cancer Wisconsin Data Set) from the University of Wisconsin. We introduce and investigate several machine learning techniques, namely Naïve Bayes, SVM, decision trees, and association rules, to determine which learning algorithm gives us the highest accuracy. Moreover, we apply Inductive Logic Programming to complement these investigations.




Naïve Bayes Classifier

Bayesian reasoning provides a probabilistic approach to inference. It is based on the assumption that the quantities of interest are governed by probability distributions and that optimal decisions can be made by reasoning about these probabilities together with observed data. In machine learning, researchers are interested in determining the best hypothesis from some space H, given the observed training data D. Bayes theorem provides a way to calculate the probability of a hypothesis based on its prior probability P(h) [4].

Bayes theorem:    P(h|D) = P(D|h)P(h) / P(D)

From this, we should understand why Bayes theorem is needed. Usually, we can get a probability under a certain condition. This is called conditional probability and is written P(X|Y). It denotes the probability of X given Y.


Let's use our cancer data set to understand Bayes theorem. We want to know the probability that a patient has cancer: P(cancer=4 | thickness=1, cellsize=1, cellshape=1, adhesion=1, epithelialcellsize=1, nuclei=1, chromatin=1, nucleoli=1, mitoses=1). Each attribute has a range of values from 1 to 10. We are trying to calculate the probability that a patient has cancer given this information. We can get the probability from the above formula, and we can write it using conditional probability.






P(cancer=4 | thickness=1, cellsize=1, cellshape=1, adhesion=1, epithelialcellsize=1, nuclei=1, chromatin=1, nucleoli=1, mitoses=1)
  = P(cancer=4, thickness=1, cellsize=1, cellshape=1, adhesion=1, epithelialcellsize=1, nuclei=1, chromatin=1, nucleoli=1, mitoses=1)
    / P(thickness=1, cellsize=1, cellshape=1, adhesion=1, epithelialcellsize=1, nuclei=1, chromatin=1, nucleoli=1, mitoses=1).


From this, the numerator can be read as: what is the probability that cancer = 4 and all of the attributes are 1? This is called a joint probability. For example, if there are 1000 total sample instances, and there are 5 matches where cancer = 4 and all the attributes are equal to 1, then P(cancer=4, thickness=1, …, mitoses=1) = 5/1000 = 0.005. Another way to calculate a probability from the same 1000 instances is: if there are 10 instances where all the attributes are equal to 1, and of those 10 instances 5 have cancer = 4, then P(cancer=4 | thickness=1, …, mitoses=1) = 5/10 = 0.5. It looks easy to calculate.
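As a minimal sketch of these two estimates (using the hypothetical counts from the example above, not values from the actual data set), both can be computed directly from counts:

# Hypothetical counts from the example above (not from the real data set).
total_instances = 1000      # total number of instances
all_ones = 10               # instances where every attribute equals 1
all_ones_and_cancer = 5     # of those, instances that also have cancer = 4

# Joint probability P(cancer=4, thickness=1, ..., mitoses=1)
joint = float(all_ones_and_cancer) / total_instances     # 0.005

# Conditional probability P(cancer=4 | thickness=1, ..., mitoses=1)
conditional = float(all_ones_and_cancer) / all_ones      # 0.5

print joint, conditional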


However, in the real world, is this really the right way to find the probability? It is difficult, because under conditional probability all the attributes have to satisfy the condition, and as the number of attributes increases it becomes even more difficult to calculate the probability. For example, consider P(cancer=4 | thickness=1, …, more attributes=1). By definition, P(cancer=4 | thickness=1, …, more attributes=1) = P(cancer=4, thickness=1, …, more attributes=1) / P(thickness=1, …, more attributes=1). This means looking at the instances which satisfy the condition thickness=1, …, more attributes=1 and have cancer=4. However, it may be impossible to find such instances, so this conditional probability estimate becomes useless. Therefore, we need to consider conditional independence. Using conditional independence in our case, if we assume that the attributes in an instance have no relationship with each other, then we can multiply the probabilities of the individual attributes as follows:




P(cancer=4 | thickness=1, …, more attributes=1)
  = P(cancer=4, thickness=1, …, more attributes=1) / P(thickness=1, …, more attributes=1).

This is changed to

P(cancer=4) * P(thickness=1) * … * P(more attributes=1) / (P(thickness=1) * … * P(more attributes=1)).


From this formula, the attribute terms cancel out and we are left with only P(cancer=4). This means the probability would not change even when the conditions exist. Therefore, this is not the correct method, because we are trying to calculate a probability based on conditions. This is the reason why we need to use Bayes rule. Bayes rule gives us the following:



P(cancer=4 | thickness=1, …, more attributes=1)
  = P(cancer=4) * P(thickness=1, …, more attributes=1 | cancer=4) / (P(thickness=1) * … * P(more attributes=1)).


However, the second factor of the numerator, P(thickness=1, …, more attributes=1 | cancer=4), is also difficult to estimate from cases, as discussed earlier. Therefore, this alone is not the right way to calculate the probability, and we need to apply conditional independence more carefully: we condition each attribute on cancer=4 and assume the attributes are independent given the class. Finally, we can write this as follows:

P(thickness=1, …, more attributes=1 | cancer=4)
  = P(thickness=1 | cancer=4) * … * P(more attributes=1 | cancer=4).

P(thickness=1 | cancer=4) can be estimated from the data set as (number of instances which have thickness=1 and cancer=4) / (number of instances which have cancer=4).

Finally, the Naïve Bayes classifier estimates these per-attribute conditional probabilities together with the class priors, and selects the class with the maximum resulting probability. The formula is the following [4]:

Vmax = argmax over Vj of P(a1, a2, ..., aN | Vj) P(Vj)
     = argmax over Vj of P(Vj) P(a1 | Vj) * ... * P(aN | Vj)
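To make the counting estimates above concrete, here is a minimal illustrative sketch of the Naïve Bayes decision rule computed from per-class counts (our own illustration, not the Weka implementation; the count structures are hypothetical placeholders):

# Illustrative sketch of the Naive Bayes decision rule described above.
# The count structures below are hypothetical placeholders, not values
# taken from the data set.
def naive_bayes_class(instance, counts, class_counts, total):
    # instance:     dict attribute -> value, e.g. {"thickness": 1, ...}
    # counts:       counts[c][attr][value] = number of class-c instances
    #               with that attribute value
    # class_counts: class_counts[c] = number of instances of class c
    # total:        total number of instances
    best_class, best_score = None, -1.0
    for c in class_counts:
        score = float(class_counts[c]) / total            # prior P(c)
        for attr, value in instance.items():
            # conditional P(attr=value | c) estimated from counts
            score *= float(counts[c][attr].get(value, 0)) / class_counts[c]
        if score > best_score:
            best_class, best_score = c, score
    return best_class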

=== Summary ===

Correctly Classified Instances      680      97.2818 %
Incorrectly Classified Instances     19       2.7182 %
Kappa statistic                       0.9405
Mean absolute error                   0.0278
Root mean squared error               0.1593
Relative absolute error               6.1504 %
Root relative squared error          33.5215 %
Total Number of Instances           699


=== Detailed Accuracy By Class ===

                TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                0.967    0.017    0.991      0.967   0.979      0.993     2
                0.983    0.033    0.94       0.983   0.961      0.993     4
Weighted Avg.   0.973    0.022    0.974      0.973   0.973      0.993

=== Confusion Matrix ===

   a    b    <-- classified as
 443   15  |  a = 2
   4  237  |  b = 4

[Result from Weka Naïve Bayes]

This result is from Weka. These results show us that Naïve Bayes gives 97% accuracy. The test is done with 10-fold cross-validation, the standard way to evaluate a learner on a data set. What this means is that, for example, if we have 100 instances, 90 instances are randomly picked and used for training, and the remaining 10 instances are then tested against the trained model. The other 9 folds are handled in the same way, and finally the results from all folds are averaged.

The above confusion matrix tells us that 458 (443 + 15) of the 699 instances actually belong to class 2. After training, NB classifies 443 of them as class 2, but labels 15 as class 4 even though they are actually class 2. We can match this with the 680 correctly classified and 19 incorrectly classified instances.
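As a quick check, the summary numbers follow directly from the confusion matrix above; a small sketch of the arithmetic:

# Entries of the Naive Bayes confusion matrix reported by Weka above.
correct = 443 + 237        # instances on the diagonal (correctly classified)
incorrect = 15 + 4         # off-diagonal instances (misclassified)
total = correct + incorrect

print correct, incorrect, total      # 680 19 699
print 100.0 * correct / total        # ~97.28 % accuracy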




Support Vector Machines (SVM)

Support Vector Machines are based on the concept of how to make a decision boundary that divides the data. SVM mainly performs classification tasks by constructing hyperplanes in a multidimensional space that separate cases of different class labels [1]. A common task in machine learning is to classify data: if we have data whose points belong to one of two classes, then we need to decide which class a new data point belongs to. In SVM, a data point is represented as a p-dimensional vector, and we need to know whether we can separate the classes with a (p-1)-dimensional hyperplane. This is called a linear classifier. There are many hyperplanes that might divide the data. The question we need to ask is: of all these hyperplanes, how can we pick the best one?


One reasonable choice for the best hyperplane is the one that represents the largest separation, or margin, between the two classes. A hyperplane with this property maximizes the distance from the hyperplane to the nearest data points, and it is known as the maximum-margin hyperplane [3]. SVM overcomes two problems. One is conceptual: how to control the complexity of the set of approximation functions in a high-dimensional space in order to provide good generalization ability, using penalized linear estimators with a large number of basis functions. The other is computational: how to perform numerical optimization in a high-dimensional space, using the dual kernel representation of linear functions.
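As an illustrative sketch of the maximum-margin idea, the following uses scikit-learn's linear SVC on a tiny made-up two-class data set; this is only an illustration and not the Weka SMO implementation that produced the results below:

# Small made-up 2-D example of a linear maximum-margin classifier.
# Illustration only; the results below come from Weka's SMO.
from sklearn.svm import SVC

X = [[1, 1], [2, 1], [1, 2],      # class 2 (benign-like points)
     [6, 6], [7, 6], [6, 7]]      # class 4 (malignant-like points)
y = [2, 2, 2, 4, 4, 4]

clf = SVC(kernel="linear")        # linear kernel -> separating hyperplane
clf.fit(X, y)

print clf.support_vectors_               # the points that define the margin
print clf.predict([[2, 2], [7, 7]])      # classify two new points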




=== Summary ===

Correctly Classified Instances      670      95.8512 %
Incorrectly Classified Instances     29       4.1488 %
Kappa statistic                       0.9083
Mean absolute error                   0.0415
Root mean squared error               0.2037
Relative absolute error               9.1793 %
Root relative squared error          42.854  %


=== Detailed Accuracy By Class ===

                TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                0.967    0.058    0.969      0.967   0.968      0.955     2
                0.942    0.033    0.938      0.942   0.94       0.955     4
Weighted Avg.   0.959    0.049    0.959      0.959   0.959      0.955

=== Confusion Matrix ===

   a    b    <-- classified as
 443   15  |  a = 2
  14  227  |  b = 4

[Result from Weka SMO]

These results show us that SVM gives 95% accuracy. The tests in this case are conducted in the same way as for Naïve Bayes (10-fold cross-validation).




Decision Tree Algorithm

The trees algorithm creates a hierarchical structure of classification rules. Therefore, rules provided by decision tree induction are easy to understand: the tree can be read as "If ... Then ..." clauses. The decision tree algorithm is generally best suited to problems where instances are represented by attribute-value pairs and are described by a fixed set of attributes and their values [5]. In our example, we have a fixed set of 10 attributes with values from 1 to 10. One of the advantages of the decision tree algorithm is its handling of noisy training data, where the errors may lie in the independent variables, in the dependent variable, or in both [5]. Decision trees also offer a good deal of flexibility in handling datasets with missing values [11]. However, our example may not be ideally suited to this algorithm, since the easiest situation for decision tree learning occurs when each attribute takes on a small number of disjoint possible values [11]. The following table shows a set of training data that could be used to predict cancer or no cancer. In this example, normalized information about each patient was generated, including thickness, cell size, cell shape, adhesion, epithelial cell size, nuclei, chromatin, nucleoli, mitoses, and whether the patient had cancer or not.

Thickness  Cellsize  Cellshape  Adhesion  Epithelial cellsize  Nuclei  Chromatin  Nucleoli  Mitoses  Cancer
    2         4          4         5              7              10        3          2        1       2
    3         6          5         6              7               ?        4          9        1       2
    2         4          4         4              6               5        7          3        1       2
    6         4          5         3              7               3        4          6        1       2
    3         3          2         6              3               3        3          5        1       2
    2         9          7         5              5               8        4          2        1       2
    7         3          3         3              3               2        6          1        1       2
    3         4          5         1              8               1        3          6        1       2
    1         1          1         3              1               5        2          1        1       4
    1         2          2         1              2               6        1          1        2       4
    2         3          2         1              3               4        4          1        1       4

[Result from Weka with the J48 algorithm]

In this example, the decision tree algorithm might determine that cell size is the most significant attribute for predicting cancer or not. The first split in the decision tree is therefore made on cell size. There are ten branches from the root, one for each value of the cell size attribute. In this example, if the patient has a cell size of 1, then the patient does not have cancer. If the patient has a cell size of 3, the next level of the tree splits on nuclei, and the decision tree algorithm then considers the value of nuclei to predict whether the patient has cancer or not. If nuclei has the value 1, the patient does not have cancer; if the patient has a nuclei value of 4, the patient also does not have cancer. With our sample data, the result from Weka is the following:

=== Summary ===

Correctly Classified Instances      660      94.4206 %
Incorrectly Classified Instances     39       5.5794 %
Kappa statistic                       0.8769
Mean absolute error                   0.0796
Root mean squared error               0.218
Relative absolute error              17.6026 %
Root relative squared error          45.8562 %
Total Number of Instances           699

=== Detailed Accuracy By Class ===

                TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                0.954    0.075    0.96       0.954   0.957      0.955     2
                0.925    0.046    0.914      0.925   0.92       0.955     4
Weighted Avg.   0.944    0.065    0.944      0.944   0.944      0.955

=== Confusion Matrix ===

   a    b    <-- classified as
 437   21  |  a = 2
  18  223  |  b = 4

[Result from Weka J48]

These results show us that the decision tree algorithm gives 94% accuracy. The tests in this case are conducted in the same way as for Naïve Bayes.
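To make the "If ... Then ..." reading of the tree concrete, the branch described above (a split on cell size first, then on nuclei) can be written as simple rules; this is only an illustration of the idea, not the exact tree that J48 produced:

# Illustrative if/then reading of the branch described above; the remaining
# branches of the actual J48 tree are omitted.
def classify(instance):
    # instance is a dict of attribute values, e.g. {"cellsize": 3, "nuclei": 1}
    if instance["cellsize"] == 1:
        return 2                      # benign
    if instance["cellsize"] == 3:
        if instance["nuclei"] in (1, 4):
            return 2                  # benign
        return None                   # deeper splits of the tree not shown here
    return None                       # other cell size branches not shown here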




Association Rules

Association rule mining finds correlations in large data sets; such a correlation is called an association rule. In association rule mining, each data sequence is a list of transactions and each transaction is a set of items. One of the typical examples of association rules is the "market-basket" problem [10]. If a supermarket keeps track of all purchase transactions, each purchase transaction is a subset of all items available in the store. The problem is whether analyzing a large set of transactions can discover correlations between subsets, e.g. people buying beer have a high tendency to buy diapers, or people buying diapers tend to buy beer [10]. In our example, we need to find correlations between subsets of the 10 attributes: {thickness, cell size, cell shape, adhesion, epithelial cell size, nuclei, chromatin, nucleoli, mitoses, cancer}. We applied the Weka Apriori association rule learner to our cancer data set. Best rules found:


1.  cellsize=1 nucleoli=1 356 ==> cancer=2 355    conf:(1)
2.  cellsize=1 nucleoli=1 mitoses=1 350 ==> cancer=2 349    conf:(1)
3.  cellshape=1 353 ==> cancer=2 351    conf:(0.99)
4.  cellsize=1 mitoses=1 377 ==> cancer=2 374    conf:(0.99)
5.  nuclei=1 nucleoli=1 355 ==> cancer=2 352    conf:(0.99)
6.  cellsize=1 384 ==> cancer=2 380    conf:(0.99)
7.  cellsize=1 cancer=2 380 ==> mitoses=1 374    conf:(0.98)
8.  adhesion=1 cancer=2 375 ==> mitoses=1 369    conf:(0.98)
9.  cellsize=1 nucleoli=1 356 ==> mitoses=1 350    conf:(0.98)
10. cellsize=1 nucleoli=1 cancer=2 355 ==> mitoses=1 349    conf:(0.98)
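The confidence values in this output are simply the fraction of instances matching the rule body that also match the rule head; a small sketch of the arithmetic using the counts of rule 1 above:

# Rule 1 above: "cellsize=1 nucleoli=1 356 ==> cancer=2 355  conf:(1)"
# 356 instances satisfy the body (cellsize=1 and nucleoli=1),
# and 355 of those also satisfy the head (cancer=2).
body_count = 356
body_and_head_count = 355
total_instances = 699

support = float(body_and_head_count) / total_instances    # fraction of all instances
confidence = float(body_and_head_count) / body_count      # ~0.997, reported as (1)

print support, confidence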




Results

Since the result format of the association rules is different from the other classifiers (Naïve Bayes, SVM, and Decision Tree), we need to convert the association rule results into the same format as the others. Simply, we count the true and false predictions after applying the association rules. We retrieve the true positive, false negative, false positive and true negative counts, and with these numbers we can finally compare against the following results from Weka.

Algorithms          True Positive   False Negative   False Positive   True Negative
Naïve Bayes              443              15                4               237
SVM                      443              15               14               227
Decision Tree            437              21               18               223
Association rule         415              43                6               235


Algorithms          True Positive Rate   False Positive Rate   Precision      Recall
Naïve Bayes             0.967248908           0.01659751       0.991051454    0.967248908
SVM                     0.967248908           0.058091286      0.969365427    0.967248908
Decision Tree           0.954148472           0.074688797      0.96043956     0.954148472
Association rule        0.906113537           0.024896266      0.985748219    0.906113537


From these results, we find that the best algorithm in our case is Naïve Bayes with only 19 false instances, whereas SVM has 29, Decision Tree has 39 and Association rule has 49 false instances.
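The rate table above follows directly from the count table; for example, for Naïve Bayes (a short sketch of the arithmetic, with class 2 treated as the positive class):

# Counts for Naive Bayes from the first table above.
tp, fn, fp, tn = 443, 15, 4, 237

tp_rate   = float(tp) / (tp + fn)    # 443/458 ~= 0.9672 (also the recall)
fp_rate   = float(fp) / (fp + tn)    # 4/241   ~= 0.0166
precision = float(tp) / (tp + fp)    # 443/447 ~= 0.9911

print tp_rate, fp_rate, precision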




3. Inductive Logic Programming: Aleph

"Inductive

Logic Programming (ILP) investigates the construction of logic programs from
training examples and background knowledge. ILP is a new research field that combines
the tec
hniques and theories from inductive concept l
earning and logic programming."[8]
An empirical ILP system can be classified into either a bottom up or a top down learner.


Bottom
-
up systems search for program clauses by considering generalizations. They star
t
from the most specific clause that covers a positive training example and then generalizes
the clause until it cannot be further generalized without covering some negative examples.

Top
-
down methods apply specialization operators to learn program clauses

by searching
from general to specific.


One of the most famous emipircal top
-
down ILP system
s

is First Order Inductive Learner
(FOIL)

[
7
] algorithm. Aleph

[2] is also an inductive logic programming system. It is
written in Prolog.

In our project we make u
se of Aleph because it is written in Prolog,
which makes
it easier to learn how to use.




Methods

Aleph is a program that can reason over some data to produce certain rules. These rules are meant to generalize the knowledge about the supplied data to provide a classification for future instances.

Because Aleph's input format is drastically different from the one we have, we start out by converting our data into a suitable format for the Aleph learner. Aleph requires three files to produce output:

- Positive examples file (filename.f): the ground facts of the concept to be learned
- Negative examples file (filename.n): ground facts that are false
- Background knowledge file (filename.b): encodes information that is relevant to the domain of the problem





We write a Python script, convert.py, that separates our input data file into two files: positive_n.txt and negative_n.txt. We write the knowledge base file by hand. Here is a snippet from the script that splits the input file into two:

import csv

def split_into_pos_neg(kb_in, pos_out, neg_out):
    pos = open(pos_out, 'w')
    neg = open(neg_out, 'w')

    # Initial data
    input = csv.reader(open(kb_in))

    for row in input:
        # Parameter that determines whether the instance is malignant
        cancer = row[9]

        # Benign instances (class 2) become negative examples of the
        # cancer concept; malignant instances (class 4) become positive.
        if cancer == "2":
            write_to = neg
        else:
            write_to = pos

        # write_row (defined elsewhere in convert.py) writes the instance
        # out to the chosen file.
        write_row(row, write_to)


After we have the data separated into two parts, we create a number of "folds" to run the algorithm on. Each fold consists of positive and negative training files, and positive and negative testing files. Training files contain about 90% of our data, while the testing files store the rest of the data. Once Aleph processes the training fold, we can run the rules it generates against our testing fold and measure the success rate. We create a total of 10 folds, thus effectively splitting our data into 10 independent instances. The code is too lengthy to include here, so please refer to the code section or the file itself [12].
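Since the fold-generation code itself is not reproduced here, the following is only a rough sketch of how the positive_n.txt and negative_n.txt files produced above could be split into 10 train/test folds; it is an illustration, not the actual code referenced in [12]:

# Rough sketch of 10-fold splitting; the layout is illustrative only,
# not the actual fold-generation code from [12].
def make_folds(examples, n_folds=10):
    folds = []
    for i in range(n_folds):
        # every n_folds-th example goes to the test set of fold i,
        # the rest form the training set (roughly 90% / 10%)
        test = [e for j, e in enumerate(examples) if j % n_folds == i]
        train = [e for j, e in enumerate(examples) if j % n_folds != i]
        folds.append((train, test))
    return folds

pos_folds = make_folds(open("positive_n.txt").readlines())
neg_folds = make_folds(open("negative_n.txt").readlines())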

For completeness, we include a description of how the basic Aleph algorithm works:

1. Select an example. This selects an example to be generalized.

2. Build most-specific-clause. This step creates the most specific clause that entails the selected example.

3. Search. This step tries to find a more general clause than the one from the previous step.

4. Remove redundant. This step selects the best general clause found, adds it to the theory, and removes all redundant clauses that are covered by this more general clause.

Repeat the above algorithm until no more examples are left to be generalized.

After Aleph generates the rules, we need a way to check, during the testing phase, whether the new data instances match the rules correctly. We do this with a simple Python script, check.py, that basically tries to match the two clauses. Here is a snippet of code that tests our matching method and provides an easy way to see its functionality:


def test_match():
    rule = "cancer(A, A, A, A, B, A, B, A, A)."
    str  = "cancer(1, 1, 1, 1, 2, 1, 2, 1, 1)."
    str2 = "cancer(2, 2, 2, 2, 3, 2, 3, 2, 2)."
    str3 = "cancer(2, 2, 2, 2, 3, 2, 3, 2, 6)."

    print "Should match", match(str, rule)
    print "Should match", match(str2, rule)
    print "Should NOT match", match(str3, rule)

    rule2 = "cancer(1, 1, 1, 1, 2, 1, 2, 1, 1)."
    str3  = "cancer(1, 1, 1, 1, 2, 1, 2, 1, 1)."
    str4  = "cancer(1, 1, 1, 1, 2, 1, 2, 1, 2)."

    print "Should match", match(str3, rule2)
    print "Should NOT match", match(str4, rule2)




Results

During our 10-fold testing we found that, out of 690 instances, 450 were correctly diagnosed as benign, while 240 were false negatives.

This result is mostly due to the way Aleph found rules for the dataset. A lot of the rules were just definite ground clauses like:

cancer(1, 3, 4, 5, 6, 2).

Any new instance that was not exactly the same would then be classified as benign.


              True Positive   False Negative   False Positive   True Negative
Cancer               0              240                0              450
Not Cancer           7              443                0              240


The problem here is that the negative knowledge base is bigger than the positive one. Upon switching the knowledge bases and forming our query to be of the type not_cancer, we found that we had 7 correctly classified malignant instances (see the "Not Cancer" row in the table above).

This was possible because, having a larger dataset, Aleph was able to formulate some rules that were more abstract than the definite ground clauses. The rules were of a form similar to:

cancer(A, B, B, C).

which means that any instance of this form has the same value for its second and third attributes.


4. Combining ILP with Machine Learning Algorithms

Inductive learning methods formulate general hypotheses by finding regularities over the training examples. Inductive methods such as Inductive Logic Programming (ILP) are used to seek prior knowledge. On the other hand, machine learning algorithms such as Naïve Bayes, SVM, Decision Tree, and association rules are used to train on the data set and calculate explicit probabilities for hypotheses. The main goal of combining ILP with a machine learning algorithm is to obtain the benefits of both approaches: better generalization accuracy. Inductive methods offer the advantage that they require no explicit prior knowledge and learn regularities based solely on the training data. However, they can fail when given insufficient training data.

In this project, our goal is to find which machine learning algorithm gives us the highest accuracy and to combine that one, the Naïve Bayes classifier, with ILP. ILP builds rules from the data sets. Inductive logic builds hypothesis rules, which is different from deductive logic. This means that we need to carefully analyze the data set to make correct rules from the given data, so the more domain knowledge we have, the more accurate the rules we get.

Many machine learning researchers are working to combine Naïve Bayes with ILP. Combining two algorithms is not just running the two algorithms together. What it means is that if one algorithm gives us 70% accuracy on a knowledge domain, the next step is to see whether the other algorithm can increase the accuracy by looking only at the 30% that the first algorithm got wrong.
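As a rough sketch of this hand-off (illustrative only, not the actual project code), the instances the first learner got wrong are collected and passed on to the second learner:

# Illustrative sketch of the combination strategy described above:
# collect the instances the first learner misclassified and hand only
# those to the second learner (in our case, Aleph).
def misclassified(instances, true_labels, predicted_labels):
    wrong = []
    for inst, truth, pred in zip(instances, true_labels, predicted_labels):
        if truth != pred:          # false positive or false negative
            wrong.append(inst)
    return wrong

# The result would then be converted to Aleph's format, as in convert.py:
# second_stage_input = misclassified(instances, labels, nb_predictions)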

In this way, combining the two algorithms can increase the accuracy. In order to do this, we retrieved all false instances, i.e. the "false positive" and "false negative" ones, from the results of Naïve Bayes, as follows:



Thickness  Cellsize  Cellshape  Adhesion  Epithelial cellsize  Nuclei  Chromatin  Nucleoli  Mitoses  Cancer
    5         4          4         5              7              10        3          2        1       2
    6         8          8         1              3               4        3          7        1       2
    6         6          6         9              6               ?        7          8        1       2
    8         4          4         5              4               7        7          8        2       2
    8         4          6         3              3               1        4          3        1       2
    6         3          3         5              3              10        3          5        3       2
    5         7          7         1              5               8        3          4        1       2
    5         3          4         3              4               5        4          7        1       2
    4         6          5         6              7               ?        4          9        1       2
    4         4          4         4              6               5        7          3        1       2
    3         4          5         3              7               3        4          6        1       2
    3         3          2         6              3               3        3          5        1       2
    6         9          7         5              5               8        4          2        1       2
    6         3          3         3              3               2        6          1        1       2
    5         4          5         1              8               1        3          6        1       2
    4         1          1         3              1               5        2          1        1       4
   10         2          2         1              2               6        1          1        2       4
    6         3          2         1              3               4        4          1        1       4

[False Instances from Weka Naïve Bayes]


This corresponds to the "30%" we discussed above: the portion on which the first algorithm could not achieve accuracy. We then applied these false instances to ILP to find more accurate rules.



cancer(A, B, B, A, C, D, E, F, G).
cancer(6, 6, 6, 9, 6, ?, 7, 8, 1).
cancer(8, 4, 4, 5, 4, 7, 7, 8, 2).
cancer(8, 4, 6, 3, 3, 1, 4, 3, 1).
cancer(6, 3, 3, 5, 3, 10, 3, 5, 3).
cancer(5, 7, 7, 1, 5, 8, 3, 4, 1).
cancer(5, 3, 4, 3, 4, 5, 4, 7, 1).
cancer(A, B, C, B, D, E, A, F, G).
cancer(3, 4, 5, 3, 7, 3, 4, 6, 1).
cancer(3, 3, 2, 6, 3, 3, 3, 5, 1).
cancer(6, 9, 7, 5, 5, 8, 4, 2, 1).
cancer(5, 4, 5, 1, 8, 1, 3, 6, 1).

[Result from Aleph]

However, the result from Aleph shows that our approach of combining ILP with NB does not improve the accuracy. We still need additional expert domain knowledge to get a more reasonable result from ILP.




5. Conclusion

It would be interesting to see if adding expert domain knowledge to the knowledge base would increase the accuracy. Since Aleph is a rule-driven system, it does not work very well with raw data without any relationships. Semantic Web techniques could be used to build an ontology whose information could then be encoded into simple clauses for Aleph.

Another question is whether combining Aleph's findings with those of a machine learning technique yields better results. It is not rational to try this in our case, as our Aleph correctness was less than that of a random decision (50%). However, once more rules are added it is possible that Aleph may produce high quality results, which could then be combined with algorithms like decision trees to reinforce a hypothesis or discover new relationships in the data.




References

[1] http://www.statsoft.com/textbook/stsvm.html
[2] http://www.comlab.ox.ac.uk/activities/machinelearning/Aleph/
[3] http://en.wikipedia.org/wiki/Support_vector_machine
[4] Tom M. Mitchell, "Machine Learning", McGraw Hill, 1997.
[5] http://www.bandmservices.com/DecisionTrees/DecisionTrees.htm
[6] Volker Tresp, Markus Bundschus, Achim Rettinger, and Yi Huang, "Towards Machine Learning on the Semantic Web", Springer Berlin/Heidelberg, Volume 5327, 2008.
[7] http://www.csc.liv.ac.uk/~frans/KDD/Software/FOIL_PRM_CPAR/foil.html
[8] Man Leung Wong, Kwong Sak Leung, "Data Mining Using Grammar Based Genetic Programming and Applications", 2000.
[9] http://www.statsoft.com/textbook/stassrul.html
[10] R. Agrawal, R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases", Proceedings of the International Conference on Very Large Data Bases (VLDB), 1994.
[11] Jesse Davis, Eric Lantz, David Page, Jan Struyf, Peggy Peissig, Humberto Vidaillet, Michael Caldwell, "Machine Learning for Personalized Medicine: Will This Drug Give Me a Heart Attack?", http://www.cs.ualberta.ca/~szepesva/ICML2008Health/Davis.pdf
[12] Rule Based Clinical Data website, http://rule-data-mining.appspot.com/



Appendix

1. FOIL algorithm: http://www.cs.sfu.ca/~oschulte/socialnetwork/foil.pdf
2. Aleph: http://www.doc.ic.ac.uk/~shm/ilp.html
3. Weka: http://www.cs.waikato.ac.nz/ml/weka/ and http://wekadocs.com
4. Data: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
5. William S. Noble, "What is a support vector machine?", 2006, Nature Publishing Group.