Rule-Based Clinical Data Mining: Inductive Rule Learning Algorithms

CS 6795: Semantic Web Techniques
Instructor: Dr. Harold Boley
Advisors: Dr. Hongyu Liu, Liqing Geng
Prepared by: Andriy Drozdyuk, Ki Hyang Lee, Shihyon Park
Dec. 15, 2009
1. Introduction

The vision behind the Semantic Web is that computers should be able to understand and exploit information offered on the web [6]. As its main knowledge representation formalisms, the Semantic Web uses rules and ontologies. Rules describe the logical inferences that can be drawn from particular facts. If an ontology provides domain-specific background information, then web information can be annotated with statements readable and interpretable by machines via the common ontological background knowledge [6]. It is therefore important to provide domain-specific background knowledge.

Reasoning plays an important role on the Semantic Web: "Based on ontological background knowledge and the set of asserted statements, logical reasoning can derive new statements" [6]. However, there are some limitations: uncertain information on the Semantic Web is not easily handled by logical reasoning, and logical reasoning is based entirely on axiomatic prior knowledge, so it does not exploit regularities in the data that have not been formulated as ontological background knowledge [6]. Since our data is uncertain and not based on axiomatic prior knowledge, it is not suitable for logical reasoning, and it is therefore important to start by analyzing machine learning algorithms that are suitable for Semantic Web applications. In this project, we explore rule-based data mining methods to solve some biomedical problems using machine learning algorithms and inductive logic programming.
Our raw data is described as follows (the class attribute has been moved to the last column):

 #   Attribute                      Domain
 1.  Sample code number             id number (we removed this attribute)
 2.  Clump Thickness                1-10
 3.  Uniformity of Cell Size        1-10
 4.  Uniformity of Cell Shape       1-10
 5.  Marginal Adhesion              1-10
 6.  Single Epithelial Cell Size    1-10
 7.  Bare Nuclei                    1-10
 8.  Bland Chromatin                1-10
 9.  Normal Nucleoli                1-10
 10. Mitoses                        1-10
 11. Class                          2 for benign, 4 for malignant

Class distribution: Benign: 458 (65.5%), Malignant: 241 (34.5%)
2. Machine Learning Algorithms: Weka

With the widespread use of electronic medical records, researchers want to apply data mining and machine learning techniques to this data in order to extract individual treatment plans. The ultimate question is whether a potential disease can be predicted, and medical information provided, from a patient's clinical history. However, many challenges remain for machine learning and data mining techniques. One of the big challenges is normalizing data drawn from several tables, but we did not cover this in this project. Instead, we used an already normalized medical data set (the Breast Cancer Wisconsin Data Set) from the University of Wisconsin. We introduce and investigate several machine learning techniques, namely naïve Bayes, SVM, decision trees, and association rules, to determine which learning algorithm gives us the highest accuracy. In addition, we apply Inductive Logic Programming to complement these investigations.
Naïve Bayes Classifier

Bayesian reasoning provides a probabilistic approach to inference. It is based on the assumption that the quantities of interest are governed by probability distributions and that optimal decisions can be made by reasoning about these probabilities together with observed data. In machine learning, researchers are interested in determining the best hypothesis from some space H, given the observed training data D. Bayes' theorem provides a way to calculate the probability of a hypothesis based on its prior probability P(h) [4]:

    P(h|D) = P(D|h) P(h) / P(D)

From this, we should understand why Bayes' theorem is needed. Usually, we can get a probability under a certain condition. This is called a conditional probability and is written P(X|Y), denoting the probability of X given Y. Let us use our cancer data set to understand Bayes' theorem. We want to know the probability that a patient has cancer, P(cancer=4 | thickness=1, cellsize=1, cellshape=1, adhesion=1, epithelialcellsize=1, nuclei=1, chromatin=1, nucleoli=1, mitoses=1), where each attribute ranges from 1 to 10. We can express this probability using the definition of conditional probability:
    P(cancer=4 | thickness=1, cellsize=1, cellshape=1, adhesion=1, epithelialcellsize=1, nuclei=1, chromatin=1, nucleoli=1, mitoses=1)
    = P(cancer=4, thickness=1, cellsize=1, cellshape=1, adhesion=1, epithelialcellsize=1, nuclei=1, chromatin=1, nucleoli=1, mitoses=1)
      / P(thickness=1, cellsize=1, cellshape=1, adhesion=1, epithelialcellsize=1, nuclei=1, chromatin=1, nucleoli=1, mitoses=1).

The numerator can be read as: what is the probability that cancer = 4 and all of the attributes are 1? This is called a joint probability. For example, if there are 1000 instances in total and 5 of them have cancer = 4 with all attributes equal to 1, then P(cancer=4, thickness=1, ..., mitoses=1) = 5/1000 = 0.005. To obtain the conditional probability from the same 1000 instances, suppose there are 10 instances where all attributes equal 1, and of those 10, 5 have cancer = 4; then P(cancer=4 | thickness=1, ..., mitoses=1) = 5/10 = 0.5. This looks easy to calculate.
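The two estimates above can be checked with a few lines of Python; the counts (5 joint matches out of 1000 instances, 10 instances with all attributes equal to 1) are the hypothetical ones from the example, not taken from the real data set.

```python
# Hypothetical counts from the example above.
total = 1000          # total number of instances
all_ones = 10         # instances where every attribute equals 1
cancer_and_ones = 5   # of those, instances that also have cancer = 4

# Joint probability: P(cancer=4, thickness=1, ..., mitoses=1)
joint = cancer_and_ones / total
# Conditional probability: P(cancer=4 | thickness=1, ..., mitoses=1)
conditional = cancer_and_ones / all_ones

print(joint)        # 0.005
print(conditional)  # 0.5
```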
However, is this really the right way to find the probability in the real world? It is difficult, because the conditional probability requires instances that satisfy every condition simultaneously, and as the number of attributes increases it becomes ever harder to find such instances. For example, consider P(cancer=4 | thickness=1, ..., more attributes=1), defined as P(cancer=4, thickness=1, ..., more attributes=1) / P(thickness=1, ..., more attributes=1). This means looking only at the instances with thickness=1 through more attributes=1 and counting how many of them have cancer=4. In practice it may be impossible to find any such instances, so the direct conditional probability estimate is useless. One might try to fix this with conditional independence: if we assume that the attributes of an instance have no relationship to one another, the joint probabilities factor into products:

    P(cancer=4 | thickness=1, ..., more attributes=1)
    = P(cancer=4, thickness=1, ..., more attributes=1) / P(thickness=1, ..., more attributes=1)
    = P(cancer=4) * P(thickness=1) * ... * P(more attributes=1) / (P(thickness=1) * ... * P(more attributes=1)).

In this formula the attribute terms cancel, leaving only P(cancer=4). This means the probability would not change even though conditions exist, so this naive factorization is not correct either: we are trying to calculate a probability that depends on the conditions. This is why we need Bayes' rule.
Bayes' rule gives:

    P(cancer=4 | thickness=1, ..., more attributes=1)
    = P(cancer=4) * P(thickness=1, ..., more attributes=1 | cancer=4) / P(thickness=1, ..., more attributes=1).

However, the second factor of the numerator, P(thickness=1, ..., more attributes=1 | cancer=4), is just as difficult to estimate directly, for the reasons discussed earlier. So we apply conditional independence more carefully: we assume the attributes are independent of one another given the class cancer=4, which lets us factor the likelihood attribute by attribute:

    P(thickness=1, ..., more attributes=1 | cancer=4)
    = P(thickness=1 | cancer=4) * ... * P(more attributes=1 | cancer=4).

Here P(thickness=1 | cancer=4) is estimated from the data set as (number of instances with thickness=1 and cancer=4) / (number of instances with cancer=4). Finally, the naïve Bayes classifier computes this product for each class value and selects the class with the maximum probability [4]:

    v_NB = argmax_{v_j} P(a_1, a_2, ..., a_n | v_j) P(v_j) = argmax_{v_j} P(v_j) * prod_i P(a_i | v_j)
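The training and classification steps described above can be sketched in a few lines of Python. This is a minimal illustration on a tiny made-up data set (not our real data), estimating all probabilities as raw frequency counts without smoothing; Weka's implementation is more sophisticated.

```python
from collections import Counter, defaultdict

def train_nb(rows):
    """rows: list of (attributes_tuple, label).
    Returns class priors and per-class attribute-value counts."""
    priors = Counter(label for _, label in rows)
    cond = defaultdict(Counter)  # (class, attr_index) -> value counts
    for attrs, label in rows:
        for i, v in enumerate(attrs):
            cond[(label, i)][v] += 1
    return priors, cond, len(rows)

def classify_nb(attrs, priors, cond, n):
    """Pick the class maximizing P(v_j) * prod_i P(a_i | v_j)."""
    best_label, best_score = None, -1.0
    for label, count in priors.items():
        score = count / n  # P(v_j)
        for i, v in enumerate(attrs):
            score *= cond[(label, i)][v] / count  # P(a_i | v_j)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up two-attribute example: class 2 has small values, class 4 large.
data = [((1, 1), 2), ((1, 2), 2), ((9, 8), 4), ((8, 9), 4)]
priors, cond, n = train_nb(data)
print(classify_nb((1, 1), priors, cond, n))  # 2
```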
=== Summary ===

Correctly Classified Instances       680    97.2818 %
Incorrectly Classified Instances      19     2.7182 %
Kappa statistic                        0.9405
Mean absolute error                    0.0278
Root mean squared error                0.1593
Relative absolute error                6.1504 %
Root relative squared error           33.5215 %
Total Number of Instances            699
=== Detailed Accuracy By Class ===

                TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                0.967    0.017    0.991      0.967   0.979      0.993     2
                0.983    0.033    0.94       0.983   0.961      0.993     4
Weighted Avg.   0.973    0.022    0.974      0.973   0.973      0.993

=== Confusion Matrix ===

   a   b   <-- classified as
 443  15 |   a = 2
   4 237 |   b = 4

[Result from Weka Naïve Bayes]
This result is from Weka. It shows that Naïve Bayes gives 97% accuracy. The test was done with 10-fold cross-validation, the standard way to evaluate a classifier on a data set. This means, for example, that if we have 100 instances, the data is split into 10 parts; in each round, 90 instances are used for training and the remaining 10 are used for testing. This is repeated for each of the 10 folds in the same way, and finally the results from all folds are averaged.
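The 10-fold procedure just described can be sketched without Weka; here the training and scoring functions are stand-ins to be supplied by the caller, and the random shuffle is our assumption about how folds are drawn.

```python
import random

def ten_fold_indices(n, seed=0):
    """Shuffle the n instance indices and split them into 10 folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::10] for i in range(10)]

def cross_validate(data, train_fn, test_fn):
    """Train on ~90% of the data and test on the remaining ~10%,
    once per fold, then average the fold accuracies."""
    accuracies = []
    for fold in ten_fold_indices(len(data)):
        held_out = set(fold)
        test = [data[i] for i in fold]
        train = [d for i, d in enumerate(data) if i not in held_out]
        model = train_fn(train)
        accuracies.append(test_fn(model, test))
    return sum(accuracies) / len(accuracies)
```

With our 699 instances, each fold holds 69 or 70 test instances.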
The confusion matrix above tells us that, out of 699 total instances, 458 (443 + 15) actually belong to class 2. After training, Naïve Bayes correctly classifies 443 of them as class 2 but labels 15 as class 4 even though they are actually class 2. These numbers are consistent with the 680 correctly classified and 19 incorrectly classified instances reported in the summary.
Support Vector Machines (SVM)

Support Vector Machines are based on the idea of choosing decision boundaries. SVM mainly performs classification tasks by constructing hyperplanes in a multidimensional space that separate cases of different class labels [1]. A common task in machine learning is to classify data: given data whose points belong to one of two classes, we need to decide which class a new point belongs to. In SVM, a data point is represented as a p-dimensional vector, and we ask whether we can separate the points with a (p-1)-dimensional hyperplane. This is called a linear classifier. There are many hyperplanes that might divide the data, so the question becomes: of all these hyperplanes, how do we pick the best one? A reasonable choice is the hyperplane that gives the largest separation, or margin, between the two classes, i.e. the hyperplane whose distance to the nearest data points is maximized. This hyperplane is known as the maximum-margin hyperplane [3]. SVM overcomes two problems. One is conceptual: how to control the complexity of the set of approximation functions in a high-dimensional space in order to provide good generalization ability, using penalized linear estimators with a large number of basis functions. The other is computational: how to perform numerical optimization in a high-dimensional space, using the dual kernel representation of linear functions.
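The linear classifier idea can be illustrated in a few lines: a hyperplane is a weight vector w and a bias b, and a point is classified by which side of the hyperplane it falls on. The weights below are made up for illustration, not learned by SMO.

```python
def classify(w, b, x):
    """Linear classifier: sign of the signed distance w.x + b."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical 2-dimensional hyperplane x1 + x2 - 10 = 0.
w, b = [1.0, 1.0], -10.0
print(classify(w, b, [1, 1]))  # -1 (one side of the hyperplane)
print(classify(w, b, [8, 9]))  # 1  (the other side)
```

SVM training then amounts to choosing w and b so that the margin, which is inversely proportional to the length of w, is as large as possible.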
=== Summary ===

Correctly Classified Instances       670    95.8512 %
Incorrectly Classified Instances      29     4.1488 %
Kappa statistic                        0.9083
Mean absolute error                    0.0415
Root mean squared error                0.2037
Relative absolute error                9.1793 %
Root relative squared error           42.854  %

=== Detailed Accuracy By Class ===

                TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                0.967    0.058    0.969      0.967   0.968      0.955     2
                0.942    0.033    0.938      0.942   0.94       0.955     4
Weighted Avg.   0.959    0.049    0.959      0.959   0.959      0.955

=== Confusion Matrix ===

   a   b   <-- classified as
 443  15 |   a = 2
  14 227 |   b = 4

[Result from Weka SMO]

These results show that SVM gives 95% accuracy. The tests were conducted in the same way as for Naïve Bayes.
Decision Tree Algorithm

The decision tree algorithm creates a hierarchical structure of classification rules. The rules provided by decision tree induction are therefore easy to understand: the tree reads like a set of "If ... Then ..." clauses. Decision tree algorithms are generally best suited to problems where instances are represented by attribute-value pairs and are described by a fixed set of attributes and their values [5]. In our case, we have a fixed set of 10 attributes with values from 1 to 10. One advantage of the decision tree algorithm is its handling of noisy training data, where errors may lie in the independent variables, in the dependent variable, or in both [5]. Decision trees also offer a good deal of flexibility in handling data sets with missing values [11]. However, our data may not be ideally suited to this algorithm, since the easiest situation for decision tree learning occurs when each attribute takes on a small number of disjoint possible values [11].
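J48, Weka's C4.5 implementation, chooses the attribute to split on using an information-gain-style criterion. A minimal sketch of plain information gain on a hypothetical two-class sample (not our real data):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in counts.values())

def info_gain(rows, attr_index):
    """rows: list of (attributes_tuple, label).
    Entropy reduction achieved by splitting on the given attribute."""
    labels = [lab for _, lab in rows]
    gain = entropy(labels)
    n = len(rows)
    for value in {attrs[attr_index] for attrs, _ in rows}:
        subset = [lab for attrs, lab in rows if attrs[attr_index] == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

# Hypothetical sample: attribute 0 separates the classes, attribute 1 does not.
rows = [((1, 1), 2), ((1, 2), 2), ((3, 1), 4), ((3, 2), 4)]
print(info_gain(rows, 0))  # 1.0 -> perfect split
print(info_gain(rows, 1))  # 0.0 -> useless split
```

The attribute with the highest gain becomes the split at the root, and the procedure recurses on each branch.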
The following table shows a set of training data that could be used to predict cancer or no cancer. In this example, normalized information about each patient was recorded: thickness, cell size, cell shape, adhesion, epithelial cell size, nuclei, chromatin, nucleoli, mitoses, and whether the case represented cancer or not.

Thickness  Cellsize  Cellshape  Adhesion  Epithelialcellsize  Nuclei  Chromatin  Nucleoli  Mitoses  Cancer
    2         4         4          5              7             10        3          2         1       2
    3         6         5          6              7              ?        4          9         1       2
    2         4         4          4              6              5        7          3         1       2
    6         4         5          3              7              3        4          6         1       2
    3         3         2          6              3              3        3          5         1       2
    2         9         7          5              5              8        4          2         1       2
    7         3         3          3              3              2        6          1         1       2
    3         4         5          1              8              1        3          6         1       2
    1         1         1          3              1              5        2          1         1       4
    1         2         2          1              2              6        1          1         2       4
    2         3         2          1              3              4        4          1         1       4

[Decision tree produced by Weka with the J48 algorithm]

In this example, the decision tree algorithm might determine which attribute is most significant for predicting cancer or not. The first split in the decision tree is therefore made on cell size. There are ten branches from the root, one for each value of the cell size attribute. In this example, if the patient has a cell size of 1, then the patient does not have cancer. If the patient has a cell size of 3, then a third level of nodes tests nuclei, and the decision tree algorithm considers the value of nuclei to predict whether the patient has cancer or not. If nuclei has the value 1, the patient does not have cancer; if nuclei has the value 4, the patient also does not have cancer. With our sample data, the result from Weka is the following:
=== Summary ===

Correctly Classified Instances       660    94.4206 %
Incorrectly Classified Instances      39     5.5794 %
Kappa statistic                        0.8769
Mean absolute error                    0.0796
Root mean squared error                0.218
Relative absolute error               17.6026 %
Root relative squared error           45.8562 %
Total Number of Instances            699

=== Detailed Accuracy By Class ===

                TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                0.954    0.075    0.96       0.954   0.957      0.955     2
                0.925    0.046    0.914      0.925   0.92       0.955     4
Weighted Avg.   0.944    0.065    0.944      0.944   0.944      0.955

=== Confusion Matrix ===

   a   b   <-- classified as
 437  21 |   a = 2
  18 223 |   b = 4

[Result from Weka J48]

These results show that the decision tree algorithm gives 94% accuracy. The tests were conducted in the same way as for Naïve Bayes.
Association Rules

Association rule mining finds correlations in large data sets; such a correlation is called an association rule. In association rule mining, the data is a list of transactions, and each transaction is a set of items. A typical example is the "market-basket" problem [10]: a supermarket keeps track of all purchase transactions, each of which is a subset of all items available in the store. The problem is whether analyzing a large set of transactions can discover correlations between subsets, e.g. people buying beer have a high tendency to also buy diapers, or people buying diapers tend to buy beer [10]. In our case, we need to find correlations between subsets of the 10 attributes {thickness, cell size, cell shape, adhesion, epithelial cell size, nuclei, chromatin, nucleoli, mitoses, cancer}. We applied Weka's Apriori association rule learner to our cancer data set.
Best rules found:

 1. cellsize=1 nucleoli=1 356 ==> cancer=2 355            conf:(1)
 2. cellsize=1 nucleoli=1 mitoses=1 350 ==> cancer=2 349  conf:(1)
 3. cellshape=1 353 ==> cancer=2 351                      conf:(0.99)
 4. cellsize=1 mitoses=1 377 ==> cancer=2 374             conf:(0.99)
 5. nuclei=1 nucleoli=1 355 ==> cancer=2 352              conf:(0.99)
 6. cellsize=1 384 ==> cancer=2 380                       conf:(0.99)
 7. cellsize=1 cancer=2 380 ==> mitoses=1 374             conf:(0.98)
 8. adhesion=1 cancer=2 375 ==> mitoses=1 369             conf:(0.98)
 9. cellsize=1 nucleoli=1 356 ==> mitoses=1 350           conf:(0.98)
10. cellsize=1 nucleoli=1 cancer=2 355 ==> mitoses=1 349  conf:(0.98)
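Each confidence value above is simply the rule's joint count divided by its antecedent count; rule 1 can be checked by hand:

```python
# Rule 1: cellsize=1 nucleoli=1 356 ==> cancer=2 355  conf:(1)
antecedent_count = 356   # instances with cellsize=1 and nucleoli=1
joint_count = 355        # of those, instances that also have cancer=2

confidence = joint_count / antecedent_count
print(round(confidence, 2))  # 1.0 (Weka rounds 0.997... up to 1)
```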
Results

Since the output of association rule mining differs in form from that of the other classifiers (Naïve Bayes, SVM, and decision tree), we need to put its results into the same format as the others. We simply count the true and false classifications obtained after applying the association rules, recording the true positive, false negative, false positive, and true negative counts. With these numbers we finally get the following results from Weka:
Algorithm          True Positive  False Negative  False Positive  True Negative
Naïve Bayes             443             15               4             237
SVM                     443             15              14             227
Decision Tree           437             21              18             223
Association rule        415             43               6             235
Algorithm          True Positive Rate  False Positive Rate  Precision     Recall
Naïve Bayes             0.967248908        0.016597510      0.991051454   0.967248908
SVM                     0.967248908        0.058091286      0.969365427   0.967248908
Decision Tree           0.954148472        0.074688797      0.960439560   0.954148472
Association rule        0.906113537        0.024896266      0.985748219   0.906113537
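The second table is derived directly from the first. Taking class 2 as the positive class, the Naïve Bayes row works out as follows:

```python
def rates(tp, fn, fp, tn):
    """Derive the standard rates from confusion-matrix counts."""
    tpr = tp / (tp + fn)        # true positive rate (equals recall)
    fpr = fp / (fp + tn)        # false positive rate
    precision = tp / (tp + fp)
    return tpr, fpr, precision

# Naïve Bayes confusion counts from the first table
tpr, fpr, precision = rates(tp=443, fn=15, fp=4, tn=237)
print(round(tpr, 9))        # 0.967248908
print(round(fpr, 9))        # 0.01659751
print(round(precision, 9))  # 0.991051454
```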
From these results, we find that the best algorithm in our case is Naïve Bayes, with only 19 false instances, whereas SVM has 29, the decision tree has 39, and the association rule has 49 false instances.
3. Inductive Logic Programming: Aleph

"Inductive Logic Programming (ILP) investigates the construction of logic programs from training examples and background knowledge. ILP is a new research field that combines the techniques and theories from inductive concept learning and logic programming." [8]

An empirical ILP system can be classified as either a bottom-up or a top-down learner. Bottom-up systems search for program clauses by considering generalizations: they start from the most specific clause that covers a positive training example and then generalize the clause until it cannot be generalized further without covering some negative examples. Top-down methods apply specialization operators to learn program clauses by searching from general to specific. One of the most famous empirical top-down ILP systems is the First Order Inductive Learner (FOIL) algorithm [7]. Aleph [2] is also an inductive logic programming system, written in Prolog. In our project we make use of Aleph; being written in Prolog makes it easier to learn how to use.
Methods

Aleph is a program that can reason over data to produce rules. These rules are meant to generalize the knowledge about the supplied data so as to provide a classification for future instances.

Because Aleph's input format is drastically different from the one we have, we start by converting our data into a format suitable for the Aleph learner. Aleph requires three files to produce output:

- Positive examples file (filename.f): the ground facts of the concept to be learned
- Negative examples file (filename.n): ground facts that are false
- Background knowledge file (filename.b): encodes information relevant to the problem domain
We wrote a Python script convert.py that separates our input data file into two files: positive_n.txt and negative_n.txt. We wrote the knowledge base file by hand. Here is a snippet from the script that splits the input file into two:

import csv

def split_into_pos_neg(kb_in, pos_out, neg_out):
    pos = open(pos_out, 'w')
    neg = open(neg_out, 'w')
    # Initial data
    input = csv.reader(open(kb_in))
    for row in input:
        # Parameter that determines whether the instance is malignant
        cancer = row[9]
        if cancer == "2":
            write_to = neg
        else:
            write_to = pos
        write_row(row, write_to)
After we have the data separated into two parts, we create a number of "folds" to run the algorithm on. Each fold consists of positive and negative training files, and positive and negative testing files. The training files contain about 90% of our data, while the testing files hold the rest. Once Aleph processes a training fold, we can run the rules it generates against the corresponding testing fold and measure the success rate. We create a total of 10 folds, effectively splitting our data into 10 independent instances. The code is too lengthy to include here, so please refer to the code section or the file itself [12].
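A compressed sketch of the fold construction is shown below. This is an illustration of the idea, not the actual convert.py: the round-robin split and the in-memory lists stand in for the real script's file handling.

```python
def make_folds(examples, k=10):
    """Split a list of examples into k (train, test) pairs.
    Fold i holds out every k-th example (offset i) for testing and keeps
    the rest for training, giving roughly a 90%/10% split when k=10."""
    folds = []
    for i in range(k):
        test = examples[i::k]
        train = [e for j, e in enumerate(examples) if j % k != i]
        folds.append((train, test))
    return folds

# e.g. 20 examples -> each fold trains on 18 and tests on 2
folds = make_folds(list(range(20)))
print(len(folds), len(folds[0][0]), len(folds[0][1]))  # 10 18 2
```

In the real pipeline, this split is applied separately to the positive and negative example files, producing the four files each fold requires.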
For completeness, we include a description of how the basic Aleph algorithm works:

1. Select an example. Select an example to be generalized.
2. Build the most specific clause. Construct the most specific clause that entails the selected example.
3. Search. Try to find a clause more general than the one built in the previous step.
4. Remove redundant. Select the best general clause found, add it to the theory, and remove all redundant clauses that are covered by this more general clause.

Repeat the above steps until no more examples are left to be generalized.
After Aleph generates the rules, we need a way to check, during the testing phase, whether new data instances match the rules correctly. We do this with a simple Python script check.py that tries to match the two clauses. Here is a snippet of code that tests our matching method and provides an easy way to see its functionality:

def test_match():
    rule = "cancer(A, A, A, A, B, A, B, A, A)."
    str = "cancer(1, 1, 1, 1, 2, 1, 2, 1, 1)."
    str2 = "cancer(2, 2, 2, 2, 3, 2, 3, 2, 2)."
    str3 = "cancer(2, 2, 2, 2, 3, 2, 3, 2, 6)."
    print "Should match", match(str, rule)
    print "Should match", match(str2, rule)
    print "Should NOT match", match(str3, rule)

    rule2 = "cancer(1, 1, 1, 1, 2, 1, 2, 1, 1)."
    str3 = "cancer(1, 1, 1, 1, 2, 1, 2, 1, 1)."
    str4 = "cancer(1, 1, 1, 1, 2, 1, 2, 1, 2)."
    print "Should match", match(str3, rule2)
    print "Should NOT match", match(str4, rule2)
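The match function itself is not shown in check.py above. A minimal Python 3 sketch consistent with test_match is the following: each uppercase term in the rule is treated as a variable that binds to the corresponding value, repeated variables must agree, and constants must match exactly (a simplified one-way unification, our reconstruction rather than the script's actual code).

```python
import re

def parse(clause):
    """Extract the comma-separated arguments of e.g. 'cancer(1, 2, A).'"""
    return [a.strip() for a in
            re.search(r"\((.*)\)", clause).group(1).split(",")]

def match(fact, rule):
    bindings = {}
    for value, term in zip(parse(fact), parse(rule)):
        if term[0].isupper():               # variable: bind or check binding
            if bindings.setdefault(term, value) != value:
                return False
        elif term != value:                 # constant: must match exactly
            return False
    return True

print(match("cancer(1, 1, 2).", "cancer(A, A, B)."))  # True
print(match("cancer(1, 2, 2).", "cancer(A, A, B)."))  # False
```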
Results

During our 10-fold testing we found that, out of 690 instances, 450 were correctly diagnosed as benign, while 240 were false negatives. This result is mostly due to the way Aleph found rules for the data set: a lot of the rules were just definite ground clauses like

    cancer(1, 3, 4, 5, 6, 2).

so any new instance that was not identical would be classified as benign.

              True Positive  False Negative  False Positive  True Negative
Cancer              0             240              0              450
Not Cancer          7             443              0              240

The problem here is that the negative knowledge base is bigger than the positive one. Upon switching the knowledge bases and forming our query as not_cancer, we found that we had 7 correctly classified malignant instances (see the "Not Cancer" row in the table above). This was possible because, with the larger data set, Aleph was able to formulate some rules more abstract than the definite ground clauses. The rules were of a form similar to:

    cancer(A, B, B, C).
This means that any instance of this form in which the second and third attributes are the same matches the rule.
4. Combining ILP with Machine Learning Algorithms

Inductive learning methods formulate general hypotheses by finding regularities over the training examples. Inductive methods such as Inductive Logic Programming (ILP) are used to seek prior knowledge. On the other hand, machine learning algorithms such as naïve Bayes, SVM, decision trees, and association rules are used to train on the data set and calculate explicit probabilities for hypotheses. The main goal of combining ILP with a machine learning algorithm is to obtain the benefits of both approaches: better generalization accuracy. Inductive methods offer the advantage that they require no explicit prior knowledge and learn regularities based solely on the training data; however, they can fail when given insufficient training data. In this project, our goal is to find which machine learning algorithm gives us the highest accuracy and to combine that one, the naïve Bayes classifier, with ILP. ILP builds rules based on the data set. Inductive logic forms hypothesis rules, unlike deductive logic, which means that we need to analyze the data set carefully to derive correct rules from it. The more domain knowledge we have, the more accurate the rules we get.
Many machine learning researchers are working on combining naïve Bayes with ILP. Combining two algorithms does not mean simply running them side by side: if one algorithm achieves 70% accuracy on a knowledge domain, the next step is to see whether the other algorithm can improve accuracy by looking only at the 30% of instances the first algorithm got wrong. Combining two algorithms in this way should increase the accuracy.
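The combination scheme described above can be sketched as a simple fallback cascade: the second learner is consulted only on the instances the first one is known to handle poorly. Both classifiers and the trust test below are placeholders, not our actual NB or Aleph models.

```python
def cascade_predict(x, first, second, trusted):
    """Use the first classifier when it is trusted on x,
    otherwise fall back to the second classifier."""
    return first(x) if trusted(x) else second(x)

# Placeholder classifiers: pretend 'first' is unreliable for inputs >= 5.
first = lambda x: 2       # always predicts benign
second = lambda x: 4      # always predicts malignant
trusted = lambda x: x < 5

print(cascade_predict(1, first, second, trusted))  # 2
print(cascade_predict(9, first, second, trusted))  # 4
```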
In order to do this, we retrieved all false instances (both "false positive" and "false negative" ones) from the Naïve Bayes results, as follows:
Thickness  Cellsize  Cellshape  Adhesion  Epithelialcellsize  Nuclei  Chromatin  Nucleoli  Mitoses  Cancer
    5         4         4          5              7             10        3          2         1       2
    6         8         8          1              3              4        3          7         1       2
    6         6         6          9              6              ?        7          8         1       2
    8         4         4          5              4              7        7          8         2       2
    8         4         6          3              3              1        4          3         1       2
    6         3         3          5              3             10        3          5         3       2
    5         7         7          1              5              8        3          4         1       2
    5         3         4          3              4              5        4          7         1       2
    4         6         5          6              7              ?        4          9         1       2
    4         4         4          4              6              5        7          3         1       2
    3         4         5          3              7              3        4          6         1       2
    3         3         2          6              3              3        3          5         1       2
    6         9         7          5              5              8        4          2         1       2
    6         3         3          3              3              2        6          1         1       2
    5         4         5          1              8              1        3          6         1       2
    4         1         1          3              1              5        2          1         1       4
   10         2         2          1              2              6        1          1         2       4
    6         3         2          1              3              4        4          1         1       4

[False instances from Weka Naïve Bayes]
These are the "30%" on which the first algorithm could not achieve accuracy. We then applied these false instances to ILP to find more accurate rules:

cancer(A, B, B, A, C, D, E, F, G).
cancer(6, 6, 6, 9, 6, ?, 7, 8, 1).
cancer(8, 4, 4, 5, 4, 7, 7, 8, 2).
cancer(8, 4, 6, 3, 3, 1, 4, 3, 1).
cancer(6, 3, 3, 5, 3, 10, 3, 5, 3).
cancer(5, 7, 7, 1, 5, 8, 3, 4, 1).
cancer(5, 3, 4, 3, 4, 5, 4, 7, 1).
cancer(A, B, C, B, D, E, A, F, G).
cancer(3, 4, 5, 3, 7, 3, 4, 6, 1).
cancer(3, 3, 2, 6, 3, 3, 3, 5, 1).
cancer(6, 9, 7, 5, 5, 8, 4, 2, 1).
cancer(5, 4, 5, 1, 8, 1, 3, 6, 1).

[Result from Aleph]

However, the Aleph result shows that our approach of combining ILP with NB does not improve the accuracy. We still need additional expert domain knowledge to get a more reasonable result from ILP.
5. Conclusion

It would be interesting to see whether adding expert domain knowledge to the knowledge base would increase the accuracy. Since Aleph is a rule-driven system, it does not work very well with raw data lacking any relationships. Semantic Web techniques could be used to build an ontology whose information could then be encoded into simple clauses for Aleph. Another open question is whether combining Aleph's findings with those of a machine learning technique would yield better results. It was not rational to try this in our case, as our Aleph correctness was less than that of a random decision (50%). However, once more rules are added, it is possible that Aleph may produce high-quality results, which could then be combined with algorithms like decision trees to reinforce a hypothesis or discover new relationships in the data.
References

[1] http://www.statsoft.com/textbook/stsvm.html
[2] http://www.comlab.ox.ac.uk/activities/machinelearning/Aleph/
[3] http://en.wikipedia.org/wiki/Support_vector_machine
[4] Tom M. Mitchell, "Machine Learning", McGraw-Hill, 1997.
[5] http://www.bandmservices.com/DecisionTrees/DecisionTrees.htm
[6] Volker Tresp, Markus Bundschus, Achim Rettinger, and Yi Huang, "Towards Machine Learning on the Semantic Web", Springer Berlin/Heidelberg, Volume 5327, 2008.
[7] http://www.csc.liv.ac.uk/~frans/KDD/Software/FOIL_PRM_CPAR/foil.html
[8] Man Leung Wung, Kwong Sak Leung, "Data Mining using grammar based genetic programming and applications", 2000.
[9] http://www.statsoft.com/textbook/stassrul.html
[10] R. Agrawal, R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases", Proceedings of the International Conference on Very Large Data Bases, VLDB, 1994.
[11] Jesse Davis, Eric Lantz, David Page, Jan Struyf, Peggy Peissig, Humberto Vidaillet, Michael Caldwell, "Machine Learning for Personalized Medicine: Will This Drug Give Me a Heart Attack?", http://www.cs.ualberta.ca/~szepesva/ICML2008Health/Davis.pdf
[12] Rule Based Clinical Data website, http://rule-data-mining.appspot.com/

Appendix

1. FOIL algorithm: http://www.cs.sfu.ca/~oschulte/socialnetwork/foil.pdf
2. Aleph: http://www.doc.ic.ac.uk/~shm/ilp.html
3. Weka: http://www.cs.waikato.ac.nz/ml/weka/ and http://wekadocs.com
4. Data: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
5. William S. Noble, "What is a support vector machine?", 2006, Nature Publishing Group.