
Copyright © by Yu-Chi Ho

LECTURE NOTES 12_2
VERSION: 1.10
DATE: 2002-5-11
FROM ES 205 LECTURE #16: INDUCTIVE LEARNING

Lecture Note 16: Inductive Learning

Yu-Chi Ho, Zhaohui Chen, Jonathan T. Lee

A fundamental step towards intelligent machines is learning, namely to discover knowledge from previous experience. This note gives a brief overview of the methods and ideas in supervised machine learning. We point out that learning is a multistage iterative process, and that incorporating domain knowledge is crucial for the success of a learning system.

1. Supervised Learning

People have long been dreaming of inventing intelligent machines that can replace some human cognitive activities. From our viewpoint, the very first step towards this ambitious goal is to automate the procedure of knowledge acquisition, which is the major topic of this thesis.


The problem considered here is supervised learning. An instance can be specified by N features X ∈ 𝒳 and a label Y ∈ 𝒴, where 𝒳 ⊆ R^N is the value space of the features and 𝒴 ⊆ R is the set of all possible labels. In this note, we focus on classification problems, where Y has categorical value and 𝒴 = {1, 2, …, K} is a finite discrete set. For a given domain, (X, Y) can be seen as a random vector with unknown joint distribution D, and the training set S = {(Xⁱ, Yⁱ), i = 1, …, M} is a collection of M labeled instances randomly sampled from the distribution D. The training set S is essentially the instructor in supervised learning.

The task of a learning algorithm is to learn a model H_S from training set S to predict the label of unseen instances. H_S, known as a hypothesis, is a mapping from the feature space to all the possible labels, namely, H: 𝒳 → 𝒴. To simplify the notation, S is usually omitted, and a learned model is denoted as H.



There are some fundamental difficulties associated with a learning procedure. First, the number of features N can be large, and each feature may have many, sometimes infinitely many, possible values, which makes the feature space tremendously large. Second, for a given dataset, features are often correlated; as a result, the information conveyed in training examples is redundant. In addition, there may be many features entirely irrelevant to the classification problem. Redundant and irrelevant features can confuse a learning algorithm if not treated properly. Third, observation noise may be present in both features and labels. For many domains, some features can simply be missing. Noisy data and missing features can further confuse the learning algorithms. Fourth, as we will see later on, for most learning algorithms the tradeoff between underfitting and overfitting¹ can be very subtle. We will make an effort to address the above issues in this note.


An obvious criterion for a learning algorithm is to minimize the classification error rate of the learned hypothesis, which can be defined as

    e(H) = E_{(X,Y)∼D}[ I(H(X) ≠ Y) ]        (1.1)

where I(·) is an indicator function. If an instance (X, Y) is randomly drawn according to distribution D, e(H) is the probability that H makes an error on it.
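The error rate of Eq. (1.1) can be estimated empirically by sampling. The sketch below uses a made-up one-dimensional domain (uniform X, threshold label, 10% label noise); the distribution and the hypothesis here are illustrative assumptions, not part of the note.

```python
import random

# Empirical estimate of the error rate e(H) of Eq. (1.1): the fraction of
# sampled instances (X, Y) on which the hypothesis disagrees with the label.
def error_rate(H, samples):
    return sum(H(x) != y for x, y in samples) / len(samples)

random.seed(0)

# Hypothetical domain D: X uniform on [0, 1], Y = 1 iff X > 0.5, with the
# label flipped 10% of the time (observation noise).
def draw():
    x = random.random()
    y = int(x > 0.5)
    if random.random() < 0.1:
        y = 1 - y
    return x, y

samples = [draw() for _ in range(10_000)]
H = lambda x: int(x > 0.5)  # the rule that generated the noiseless labels
print(error_rate(H, samples))  # close to the 10% noise level
```

Because of the label noise, even this best possible rule cannot push the error rate below roughly 0.10, which previews the Bayesian risk discussion of Section 3.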


In order to achieve a low error rate in Eq. (1.1), a learned model H must capture the rules that govern the "labeling procedure" of instances. One of the fundamental principles of supervised learning is that H should fit the training data well. The rationale lies in the belief that one should be able to explain the observations before making predictions. If one truly understands the problem, hopefully, one will predict well.





¹ By "underfitting" we mean a model does not fit the data well; by "overfitting" we mean a model fits the data perfectly, but cannot predict well.


In general, learning can be viewed as a non-linear fitting problem, and there are two major approaches to tackle it:

1. Assuming explicitly the form of the hypothesis H(X; θ), where θ are the parameters in the model, and using training set S to fit θ.

2. The Bayesian method, which estimates the posterior probability P(Y | X) from S, and makes decisions based on it.


In the following sections, we will discuss these two methods briefly. A comprehensive treatment is beyond the scope of this note. Readers interested in this topic are referred to references [3], [7], [8], [11], [18].


2. Model Fitting

Model fitting is a long-standing problem. In ancient times, people measured the positions of planets and tried to figure out their orbits. Hypotheses were formulated by fitting the accumulated observation data and used to predict the positions of planets. If a hypothesis made correct predictions, the chance that it matched the true orbit increased; otherwise, either the hypothesis needed to be revised, or the original data might have been too noisy and new observations had to be made.


The above cycle covers the major steps of knowledge acquisition by human beings. Machine learning tries to mimic this procedure except that a computer is utilized to process information. As mentioned, a hypothesis H is a mapping 𝒳 → 𝒴. If there is no restriction on the form of a hypothesis, a classification problem can be stated as "finding H* among all the possible mappings H such that the expected error rate of Eq. (1.1) is minimized."








    H* = argmin_H e(H)        (2.1)

This is an optimization problem in functional space. By definition, H* is the global optimum of a classification problem. In general, (2.1) is not solvable if no further assumption is made.



If some structural form of the model is predefined, a hypothesis can be denoted as H(X; θ), where θ are parameters. In the orbit-fitting problem, for example, when a circle is used to model the orbit, θ includes the position of the center, the radius and the orbit inclination. In such cases, problem (2.1) can be restated as "given the form of the hypothesis H(X; θ), finding θ* among all the possible parameters θ such that the error rate (1.1) is minimized."








    θ* = argmin_θ e(H(X; θ))        (2.2)

This is an optimization problem in parametric space. Different algorithms have been developed to deal with it.
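As a toy instance of Eq. (2.2), suppose the hypothesis form is fixed as a one-parameter threshold rule H(x; t) = 1 iff x > t; fitting then reduces to searching the parameter space for the t that minimizes the training error. The data points below are made-up numbers, not from the note.

```python
# Toy version of Eq. (2.2): fix the hypothesis form H(x; t) = 1 iff x > t
# and pick the parameter t that minimizes the error rate on the training set.

train = [(0.1, 0), (0.3, 0), (0.4, 0), (0.6, 1), (0.8, 1), (0.9, 1)]

def training_error(t):
    return sum(int(x > t) != y for x, y in train) / len(train)

# Brute-force search over a grid of candidate parameter values.
candidates = [i / 100 for i in range(101)]
t_star = min(candidates, key=training_error)
print(t_star, training_error(t_star))
```

Any t between the largest negative point (0.4) and the smallest positive point (0.6) fits this training set perfectly; a grid search simply returns the first such value.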


In model fitting, the predefined structure is crucial for the quality of a hypothesis. If the assumption is informative, solving (2.2) gives a good approximation to H*, namely

    H(X; θ*) ≈ H*        (2.3)

and H(X; θ) is often referred to as an unbiased learning model. Otherwise, H(X; θ*) can be far from the global optimum.


People used to model the orbit of a planet as a circle. For some planets, this model is fairly accurate; however, large errors remained for other planets until people figured out that ellipses are a better model. As one can see, the initial model is simply not sophisticated enough to capture the underlying mechanism that generates the data. We say that the model underfits the observed data; in other words, the domain is not learnable by the model. On the other hand, if high-degree interpolations, say a 15th-degree polynomial, are used to fit the data, as is easy to see from Figure 2-1b, we may end up with a hypothesis that fits all the observed data perfectly, but no one would expect it to make correct predictions. This phenomenon is known as overfitting.



(a) Using a circle as the model (underfitting)
(b) Using high-degree interpolations (overfitting)

Figure 2-1 Fitting the Orbit of a Planet


Most machine learning algorithms struggle with this dilemma: with too rough a model, the hypothesis cannot even fit the training data; with too complicated a model, the training set can be overfitted, which leads to poor predictive performance. The tradeoff between model complexity and the number of training data can be very subtle. One of the famous principles here is Occam's razor, which says that given two fitted models, which can be seen as explanations of the observed data, all other things being equal, the simpler model is preferable. It can be shown (see [2]) that, under general assumptions, lower model complexity leads to better predictive performance.


Several famous learning models can be seen as special cases of the model-fitting approach: for example, artificial neural networks (see [10]), inductive logic programming (see [14]) and decision trees (see [15]), etc.

2.1 Artificial neural networks (ANN)

ANN is a connectivity model, which consists of a large number of processing units known as neurons (see Figure 2-2a). Neurons work together to perform a given task. Figure 2-2b shows a type of network topology, where the nodes are neurons and the directed edges represent connections among them.


Mathematically, a neuron computes the weighted sum of its input signals, and outputs some activity level according to a given activation function f:

    Y = f( Σ_{i=1}^{N} w_i X_i )        (2.4)

Neurons interact with each other according to the network topology. Different weights are assigned to the edges, i.e., W_IJ, to represent the strength of the connections.

[Figure: (a) a generic neuron with inputs X1, …, XN, weights w1, …, wN, a summing junction and activation function f; (b) a layered topology with layers I, J, K and connection weights W_IJ, W_JK]

(a) A generic ANN neuron    (b) Network topology

Figure 2-2 The Artificial Neural Network


Learning in the context of ANN can be seen as updating the network topology and connection weights based on available information so that the network achieves a certain performance criterion. If the topology and weights are taken as the parameters θ of the model, this is exactly the problem stated in Eq. (2.2). To illustrate the idea, we here give an example of how to train a perceptron. Suppose we are given a training set:

    #   X1  X2  X3  Y
    0   1   1   1   1
    1   1   1   0   1
    2   1   0   1   1
    3   1   0   0   1
    4   0   1   1   1
    5   0   1   0   0
    6   0   0   1   0
    7   0   0   0   0

Table 2-1 Example training set

The perceptron has three inputs X1, X2 and X3, and one output Y; the activation function is a threshold function. The decision rule is

    H(X) = 1 if w1·X1 + w2·X2 + w3·X3 + θ > 0, and H(X) = 0 otherwise,

where θ is the threshold (bias) term, which is updated during learning like a weight whose input is always 1.


According to the Perceptron Learning Rule, the weights are changed by an amount proportional to the difference between the desired output and the actual output:

    Δw_i = η (Y − H(X)) X_i        (2.5)

where η is the learning rate, Y is the desired output, and H(X) is the actual output of the perceptron.


Table 2-2 shows the process of learning. We choose η = 0.4, and the initial weights to be w1 = 0.8, w2 = 0.4, w3 = 0.4 and θ = 0.5. After training, we find that the weights w1 = 1.2, w2 = 0.4, w3 = 0.4 and θ = −0.7 fit all the data points. This is the learned perceptron model for the given dataset.


First pass through the training set (sum = w1X1 + w2X2 + w3X3 + θ; H(X) = 1 if sum > 0; the Δ columns show the updates of Eq. (2.5)):

    X1 X2 X3 Y |   w1    w2    w3     θ |   sum H(X) |  Δw1   Δw2   Δw3    Δθ
    1  1  1  1 | 0.80  0.40  0.40  0.50 |  2.10    1 |    0     0     0     0
    1  1  0  1 | 0.80  0.40  0.40  0.50 |  1.70    1 |    0     0     0     0
    1  0  1  1 | 0.80  0.40  0.40  0.50 |  1.70    1 |    0     0     0     0
    1  0  0  1 | 0.80  0.40  0.40  0.50 |  1.30    1 |    0     0     0     0
    0  1  1  1 | 0.80  0.40  0.40  0.50 |  1.30    1 |    0     0     0     0
    0  1  0  0 | 0.80  0.40  0.40  0.50 |  0.90    1 |    0 −0.40     0 −0.40
    0  0  1  0 | 0.80  0.00  0.40  0.10 |  0.50    1 |    0     0 −0.40 −0.40
    0  0  0  0 | 0.80  0.00  0.00 −0.30 | −0.30    0 |    0     0     0     0

    … (intermediate passes omitted) …

Final pass, in which every instance is classified correctly:

    X1 X2 X3 Y |   w1    w2    w3     θ |   sum H(X) |  Δw1   Δw2   Δw3    Δθ
    1  1  1  1 | 1.20  0.40  0.40 −0.70 |  1.30    1 |    0     0     0     0
    1  1  0  1 | 1.20  0.40  0.40 −0.70 |  0.90    1 |    0     0     0     0
    1  0  1  1 | 1.20  0.40  0.40 −0.70 |  0.90    1 |    0     0     0     0
    1  0  0  1 | 1.20  0.40  0.40 −0.70 |  0.50    1 |    0     0     0     0
    0  1  1  1 | 1.20  0.40  0.40 −0.70 |  0.10    1 |    0     0     0     0
    0  1  0  0 | 1.20  0.40  0.40 −0.70 | −0.30    0 |    0     0     0     0
    0  0  1  0 | 1.20  0.40  0.40 −0.70 | −0.30    0 |    0     0     0     0
    0  0  0  0 | 1.20  0.40  0.40 −0.70 | −0.70    0 |    0     0     0     0

Table 2-2 Training the perceptron to fit the dataset
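The hand trace of Table 2-2 can be reproduced in a few lines. The sketch below implements the update rule of Eq. (2.5) with η = 0.4 and the initial weights of the text, treating θ as a bias added to the weighted sum and updated like a weight whose input is always 1; it cycles through the instances of Table 2-1 until a full pass makes no error.

```python
# Perceptron training on the dataset of Table 2-1 with the rule of Eq. (2.5).
data = [(1, 1, 1, 1), (1, 1, 0, 1), (1, 0, 1, 1), (1, 0, 0, 1),
        (0, 1, 1, 1), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 0)]

def predict(w, theta, x):
    # H(X) = 1 if w.X + theta > 0, else 0 (threshold activation)
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + theta > 0 else 0

def train(data, eta=0.4, w=(0.8, 0.4, 0.4), theta=0.5):
    w = list(w)
    while True:
        errors = 0
        for *x, y in data:
            h = predict(w, theta, x)
            if h != y:
                errors += 1
                for i in range(3):
                    w[i] += eta * (y - h) * x[i]   # Eq. (2.5)
                theta += eta * (y - h)             # bias updated like a weight
        if errors == 0:                            # a clean pass: converged
            return w, theta

w, theta = train(data)
print([round(v, 1) for v in w], round(theta, 1))   # the weights of Table 2-2
```

Presenting the instances in the order of Table 2-1 on every pass, this converges to w = (1.2, 0.4, 0.4) and θ = −0.7, the fitted model quoted in the text.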


One of the most important types of topology in ANN is the feedforward network, where neurons are arranged in layers, the output of a neuron is an input of another neuron in the next layer, and no loop is allowed in the graph. One of the fundamental results in ANN (see [10]) is that with an arbitrary number of non-linear neurons, a 3-layer or even 2-layer feedforward ANN is capable of approximating any function with arbitrary accuracy. That is to say, H(X; θ) can be arbitrarily close to the global optimum H* if there are enough neurons and the optimal parameters can be found. Therefore, a feedforward ANN with a large number of neurons is known as an unbiased learner. On the other hand, if an ANN has too many neurons, and there is not enough training data to produce accurate estimates of the parameters, the ANN model can easily overfit the training data, and generalization performance can be poor.


2.2 Inductive logic programming (ILP)

Given training examples and prior knowledge, an ILP system uses a set of first-order logic² clauses to predict the label of unseen instances. ILP only deals with binary values. For some domains, data transformation is necessary before the model can be applied. If we take the maximal number of literals in a clause, the presence or absence of each literal, and the logic operators as the parameters of an ILP model, the problem becomes a special case of Eq. (2.2).


The basic idea of ILP is to find as few short clauses (in terms of the number of literals used) as possible to explain all the positive examples. To search for such clauses, ILP starts with the short ones. As for the training set in Table 2-1, we can see that the clause X1 ⟹ Y explains instances #0, #1, #2 and #3 in the training data, and X2 ∧ X3 ⟹ Y covers instance #4. The two clauses fit all the positive instances in the training set; otherwise, an instance is negative.
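The coverage claim is easy to verify mechanically. The sketch below encodes two short clauses that fit Table 2-1 (X1 ⟹ Y and X2 ∧ X3 ⟹ Y, reconstructed from the data) as a single hypothesis and checks it against all eight instances.

```python
# Hypothesis built from the two clauses: Y = 1 iff X1, or (X2 and X3).
data = [(1, 1, 1, 1), (1, 1, 0, 1), (1, 0, 1, 1), (1, 0, 0, 1),
        (0, 1, 1, 1), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 0)]

def hypothesis(x1, x2, x3):
    return 1 if x1 == 1 or (x2 == 1 and x3 == 1) else 0

# The clauses explain every positive instance and cover no negative one.
assert all(hypothesis(x1, x2, x3) == y for x1, x2, x3, y in data)
print("both clauses fit all 8 instances of Table 2-1")
```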


Several issues arise when an algorithm is devised to solve the problem. First, the size of the parameter space increases exponentially as the maximal number of literals in a clause goes up. Hence, an efficient search scheme becomes very important. Second, it is not easy to work directly with the error rate of Eq. (1.1); therefore, some metric score must be defined to evaluate the goodness of a candidate clause. Mutual information and other entropy-based scores are among the popular choices. Another important issue is the learnability of an ILP system. [14] presents a brief discussion of this problem. Generally, it depends on the underlying domain and the maximal number of literals. As one can see, the more literals allowed in a clause, the better the training examples can be explained. However, this violates Occam's razor and may lead to overfitting.




² In symbolic logic, a variable is called a literal. First-order logic takes only individual literals as arguments and quantifiers; for example, A ∧ B ∨ C ∧ ¬D reads "A and B, or C and not D."


2.3 Decision tree

Decision trees can be seen as an extension of ILP in the sense that real-number inputs and multi-class outputs are allowed. For the fitting part, a tree-growing process tries to partition the training set into single-class subsets according to the values of the input features. If one feature cannot achieve a single-class partition, more features are used to grow the decision tree. Again, if we take all the splitting points as parameters of the model, the problem becomes Eq. (2.2) once again.


For example, given the training set in Table 2-1, we have the following partition:

    S (5:3)
    ├─ X1 = 1: (4:0)
    └─ X1 = 0: (1:3)
        ├─ X2 = 1: (1:1)
        │   ├─ X3 = 1: (1:0)
        │   └─ X3 = 0: (0:1)
        └─ X2 = 0: (0:2)

In each box, the first number is the number of positive examples in the subset, and the second is the number of negative examples.

Figure 2-3 Decision tree: split the training set according to the value of a feature
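The (positive : negative) counts of this partition can be checked directly against Table 2-1. The sketch below splits the training set on X1 and then on X2 and prints the class counts of each subset.

```python
# Reproduce the subset counts of the decision-tree partition from Table 2-1.
data = [(1, 1, 1, 1), (1, 1, 0, 1), (1, 0, 1, 1), (1, 0, 0, 1),
        (0, 1, 1, 1), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 0)]

def counts(rows):
    pos = sum(r[3] for r in rows)
    return (pos, len(rows) - pos)       # (positive, negative)

x1_1 = [r for r in data if r[0] == 1]   # branch X1 = 1
x1_0 = [r for r in data if r[0] == 0]   # branch X1 = 0
print(counts(data), counts(x1_1), counts(x1_0))   # (5, 3) (4, 0) (1, 3)

x2_1 = [r for r in x1_0 if r[1] == 1]   # within X1 = 0, branch X2 = 1
x2_0 = [r for r in x1_0 if r[1] == 0]   # within X1 = 0, branch X2 = 0
print(counts(x2_1), counts(x2_0))                 # (1, 1) (0, 2)
```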

Different criteria are used to decide which feature should be selected at each decision point. As in ILP, entropy-based scores are among the popular choices. For example, the mutual information between two random variables X and Y is defined as

    I(X; Y) = Σ_{x,y} P(x, y) log₂ [ P(x, y) / (P(x) P(y)) ]        (2.6)

In our example, the mutual information between each feature X_i and the class label Y is 0.55 bit, 0.05 bit and 0.05 bit respectively; that is why we first choose X1 to grow the decision tree.
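The feature scores can be computed directly from Eq. (2.6) and Table 2-1. The sketch below evaluates I(X_i; Y) for each feature with base-2 logarithms; X1 scores highest, which is why it is chosen as the root split.

```python
from collections import Counter
from math import log2

data = [(1, 1, 1, 1), (1, 1, 0, 1), (1, 0, 1, 1), (1, 0, 0, 1),
        (0, 1, 1, 1), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 0)]

def mutual_information(pairs):
    # Eq. (2.6): I(X;Y) = sum_xy P(x,y) log2[ P(x,y) / (P(x) P(y)) ]
    n = len(pairs)
    pxy = Counter(pairs)                      # joint counts of (x, y)
    px = Counter(x for x, _ in pairs)         # marginal counts of x
    py = Counter(y for _, y in pairs)         # marginal counts of y
    return sum(c / n * log2((c / n) / (px[x] / n * py[y] / n))
               for (x, y), c in pxy.items())

labels = [row[3] for row in data]
for i in range(3):
    mi = mutual_information(list(zip((row[i] for row in data), labels)))
    print(f"I(X{i + 1}; Y) = {mi:.2f} bit")
```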


As one can see, with enough features, a single-class partition is always achievable. However, the goal is not merely to find such a partition, but to build a tree that reveals the structure of the domain and makes predictions. To avoid overfitting, we need a significant number of instances at each leaf or, to put it another way, there should be as few partitions in the decision tree as possible.


3. Bayesian Decision Analysis

Another approach to the classification problem is the probabilistic model. As mentioned above, the training set S can be viewed as a set of random samples from the unknown distribution D. The basic idea of probabilistic methods is to estimate this distribution based on the available training data.


The criterion of minimizing the error rate in Eq. (1.1) implicitly assumes that all the classes have equal priority. Let us consider a more general case where the loss function can be defined by Table 3-1. Namely, if the true label of an instance X is Y = k, and the hypothesis predicts it as H(X) = l, it incurs a loss of L_kl.



            H(X) = 1   H(X) = 2   …   H(X) = K
    Y = 1    L11        L12       …    L1K
    Y = 2    L21        L22       …    L2K
    ⋮         ⋮          ⋮              ⋮
    Y = K    LK1        LK2       …    LKK

Table 3-1 The loss function of the classification problem


For a given instance X, the posterior probability P(Y = k | X) is the probability that k is the correct label of X. If a hypothesis predicts H(X) = l, the expected loss associated with this decision is known as the risk of the decision:

    R(l | X) = Σ_{k=1}^{K} L_kl · P(Y = k | X)        (3.1)

which can be minimized by choosing

    H(X) = argmin_l R(l | X)        (3.2)

It is easy to see that, if a hypothesis minimizes the risk in Eq. (3.1) at every instance X ∈ 𝒳, i.e.,

    H(X) = argmin_l R(l | X)  for every X ∈ 𝒳,        (3.3)

the overall risk

    R(H) = E_X[ R(H(X) | X) ]        (3.4)

is also minimized.
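A minimal sketch of the decision rule of Eqs. (3.1)-(3.3): given the posterior P(Y = k | X) for one instance and a loss matrix L, compute the risk of each possible prediction and pick the minimizer. The posterior values and loss matrices below are made-up numbers for illustration.

```python
# Bayesian decision rule: choose the label with minimal expected loss.
def bayes_decide(posterior, loss):
    """posterior[k] = P(Y=k|X); loss[k][l] = L_kl, the cost of predicting l
    when the true label is k."""
    K = len(posterior)
    risk = [sum(loss[k][l] * posterior[k] for k in range(K))  # Eq. (3.1)
            for l in range(K)]
    return min(range(K), key=risk.__getitem__)                # Eq. (3.3)

posterior = [0.5, 0.3, 0.2]  # hypothetical P(Y=k|X) for a single instance X

# Under zero-one loss the rule reduces to picking the largest posterior.
zero_one = [[0 if k == l else 1 for l in range(3)] for k in range(3)]
print(bayes_decide(posterior, zero_one))   # class 0, the largest posterior

# An asymmetric loss can overturn that choice: here, predicting anything
# other than class 1 when the truth is class 1 costs 10.
costly = [[0, 1, 1], [10, 0, 10], [1, 1, 0]]
print(bayes_decide(posterior, costly))     # class 1, despite its lower posterior
```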


According to Eq. (3.4), R(H) can be seen as the risk associated with decision rule H. Eq. (3.3) is known as the Bayesian decision rule, and the risk of the Bayesian decision rule is called the Bayesian risk:

    R* = E_X[ min_l R(l | X) ]        (3.5)

which is the theoretical lower bound of the achievable risk.


Many machine learning algorithms assume a zero-one loss function, where all the diagonal elements in Table 3-1 are 0 and all the off-diagonal elements are 1. Namely, a hypothesis loses nothing when making a correct prediction, and all errors are equally costly. In such a case, the risk of Eq. (3.1) becomes

    R(l | X) = 1 − P(Y = l | X)        (3.6)

To minimize (3.6), a hypothesis has to maximize the posterior probability, i.e.,

    H(X) = argmax_l P(Y = l | X)        (3.7)


Since P(Y = l | X) is the probability that l is the correct label, 1 − max_l P(Y = l | X) is the minimal probability of making an error on instance X. Consequently, the (Bayesian) risk associated with decision rule (3.7) becomes

    e* = E_X[ 1 − max_l P(Y = l | X) ]        (3.8)

which is the theoretical lower bound of the prediction error rate in Eq. (1.1).


In Bayesian decision analysis, under the zero-one loss function, the posterior probability P(Y = k | X) is a discriminant function for classification. Namely, given X ∈ 𝒳, if

    g_k(X) = P(Y = k | X)        (3.9)

can be evaluated for each class k ∈ 𝒴, the one with the largest g_k(X) is the optimal decision. We shall point out that Eq. (3.9) is not the unique discriminant function that attains the Bayesian risk (3.8). It is obvious that for any monotonically increasing function g, g(P(Y = k | X)) also attains (3.8).


Now, the only thing left is how to estimate the posterior probability P(Y = k | X), or some g_k(X), from the training set S. In the Bayesian framework, P(Y = k | X) can be calculated by Bayes rule

    P(Y = k | X) = p(X | Y = k) P(Y = k) / p(X)        (3.10)

where

    p(X) = Σ_k p(X | Y = k) P(Y = k)        (3.11)



P(Y = k) is the prior or unconditional probability of class k, and p(X | Y = k) is known as the likelihood function or class density function of class k. Since p(X) is the same for all classes, it does not influence the decision. Several discriminant functions equivalent to Eq. (3.9) can be derived straightforwardly from the above discussion, for example

    g_k(X) = p(X | Y = k) P(Y = k)        (3.12)

and

    g_k(X) = log p(X | Y = k) + log P(Y = k)        (3.13)


The basic idea of Bayesian decision analysis is to minimize the risk of the hypothesis H. Under the zero-one loss function, this is equivalent to maximizing the posterior probability. Therefore, estimating P(Y = k | X) is the critical step. By Bayes rule, the problem can be turned into estimating the class density function p(X | Y = k) and the prior probability P(Y = k). The latter is relatively simple, but for many real-world domains, since the dimensionality of the feature space 𝒳 is high, estimating p(X | Y = k) can be difficult. One common approach is to assume some special form of the distribution, say a multivariate normal distribution, and to estimate its parameters θ, say the mean and variance, using the training data. Under such circumstances, the probabilistic method becomes a special form of the model-fitting approach.
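A minimal sketch of this parametric route, assuming one-dimensional features and a Gaussian class density for each class: estimate the mean and variance of p(X | Y = k) and the prior P(Y = k) from made-up training data, then classify with the log discriminant of Eq. (3.13).

```python
from math import log, pi

# Parametric Bayesian classifier: Gaussian class densities plus priors.
def fit_gaussian(xs):
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / len(xs)
    return m, v

def log_density(x, m, v):
    # log of the normal density N(m, v) evaluated at x
    return -0.5 * log(2 * pi * v) - (x - m) ** 2 / (2 * v)

# Hypothetical training set: feature values grouped by class label.
train = {0: [1.0, 1.2, 0.8, 1.1], 1: [2.9, 3.1, 3.0, 2.8]}
n = sum(len(xs) for xs in train.values())
params = {k: fit_gaussian(xs) for k, xs in train.items()}   # p(X|Y=k)
prior = {k: len(xs) / n for k, xs in train.items()}         # P(Y=k)

def classify(x):
    # g_k(x) = log p(x | Y=k) + log P(Y=k), the discriminant of Eq. (3.13)
    return max(params, key=lambda k: log_density(x, *params[k]) + log(prior[k]))

print(classify(1.1), classify(2.7))
```

With well-separated classes, points near each class mean are assigned to that class; the interesting behavior is near the decision boundary, where both the densities and the priors matter.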



This note focuses on the probabilistic model, and we will talk further about how to estimate the class density function when we discuss specific algorithms.


4. Machine Learning: a Multistage Process

Human knowledge acquisition is a multistage iterative process; the orbit-fitting problem in Section 2 can be seen as a simplified version of this procedure. Figure 4-1 summarizes some of the key stages of learning. The boxes with gray background represent information flow, while the ones with clear background represent different activities, among which are data collection, feature selection, model choice, inductive learning, decision making and interpretation, etc. At each stage, we may correct previous mistakes or adjust choices based on new information. The rectification in the learning process is represented by the feedback loops in Figure 4-1.


[Figure: stages of learning — Data collection → Raw Data → Data preprocessing (data cleansing, feature extraction / selection) → Training Set → Inductive learning → Predictive Model → Prediction / generalization → Generalization Results → Model evaluation → Interpretation → Final Decisions, with feedback loops between stages]

Figure 4-1 Knowledge acquisition as a multistage procedure


As mentioned, the fundamental idea of machine learning is to replicate human cognitive activities by employing some man-made computational power other than the human brain. Hence, from our viewpoint, machine learning should also follow the same cycle as that in Figure 4-1.


Traditional machine learning literature focuses largely on the learning stage; relatively little is said about data preprocessing and model interpretation, the latter sometimes known as post-processing. The reason is that learning might be the only stage in the cycle that is not as problem-dependent as the other stages. However, in the real world, the interface of an algorithm (for example, how to transform raw data into a format readable by the algorithm, or how to interpret the outputs to the end user) can be crucial to the success of a learning system.


Data collection is the starting point of a learning system; the quality of the raw data determines the overall achievable performance. If a problem can be well formulated at the beginning and adequate information can be collected, we shall have a better chance of achieving good results. Collecting data can be very costly and time-consuming, and sometimes accounts for a large portion of the total cost of a system. Thanks to advances in information technology, data storage is no longer a serious concern, but on the other hand, we may be overwhelmed by too much data.


The goal of data preprocessing is to discover important features from the raw data. This is the first step towards problem solving. Data preprocessing includes data cleaning, normalization, transformation, feature extraction and selection, etc. Learning can be straightforward if informative features can be identified at this stage. The detailed procedure depends highly on the nature of the raw data and the problem domain. Prior knowledge can sometimes be extremely valuable. For many systems, this stage is still primarily conducted by experienced human beings.


The product of data preprocessing is the training set. Given the training set, a learning model has to be chosen to learn from it. As seen in Sections 2 and 3, people have invented and studied many different models. However, given a specific domain, there is no clear-cut guideline for model choice. In practice, the implementation of a learning system depends mainly on the knowledge of human experts and the available resources. The basic idea of inductive learning is to incorporate domain knowledge and training information to produce a predictive model. As mentioned, it can always be viewed as an optimization problem. The No-Free-Lunch (NFL) theorem for optimization (see [23] and [24]) states that if problem-specific structure is unknown, on average, all algorithms perform the same as blind search. In the context of machine learning, NFL simply implies that no learning algorithm is intrinsically superior to another if all possible problems are considered. The only way to make a difference is to utilize domain-specific knowledge explicitly in the learning procedure.


After a predictive model is learned, the hypothesis is exposed to new problem instances to test whether it captures the domain knowledge or not. As shown in Figure 4-1, feedback is sometimes necessary to correct earlier mistakes. For certain domains, an interpretation system is required to justify the learned hypothesis before it can be deployed to make decisions. For example, in medical testing, no one would trust a black box without asking why it works. The justification often depends on inter-domain knowledge that is not accessible to the learning model. Hence, at least up to now, it still relies on human experts.



One of the misleading concepts in machine learning is that a comprehensive learning algorithm can be applied to any dataset and solve any problem. However, powerful learning models tend to overfit the training data and give poor predictive performance. Furthermore, the NFL theorem says that without domain knowledge, no algorithm is superior to the others. Hence, to achieve better performance, it is always important to gain as much insight into the domain as possible from the very beginning.


References

[1] Anderson, J. (1995), An Introduction to Neural Networks, The MIT Press
[2] Angluin, D. and C. H. Smith (1983), "Inductive Inference: Theory and Methods," Computing Surveys, Vol. 15, No. 3, pp. 327-369
[3] Bishop, C. M. (1995), Neural Networks for Pattern Recognition, Oxford University Press
[4] Blumer, A., A. Ehrenfeucht, D. Haussler and M. Warmuth (1987), "Occam's Razor," Information Processing Letters, Vol. 24, pp. 377-380
[5] Cestnik, B. (1990), "Estimating Probabilities: a Crucial Task in Machine Learning," in Proceedings of the Ninth European Conference on Artificial Intelligence
[6] Clark, P. and T. Niblett (1989), "The CN2 Induction Algorithm," Machine Learning, Vol. 3
[7] Duda, R. O. and P. E. Hart (1973), Pattern Classification and Scene Analysis, John Wiley & Sons Inc.
[8] Duda, R. O., P. E. Hart and D. G. Stork (2001), Pattern Classification, Second Edition, John Wiley & Sons Inc.
[9] Friedman, N., D. Geiger and M. Goldszmidt (1997), "Bayesian Network Classifiers," Machine Learning, No. 29, pp. 131-163
[10] Hertz, J., A. Krogh and R. G. Palmer (1991), Introduction to the Theory of Neural Computation, Addison-Wesley
[11] Kearns, M. J. and U. V. Vazirani (1997), An Introduction to Computational Learning Theory, The MIT Press
[12] Kohavi, R. and D. H. Wolpert (1996), "Bias plus variance decomposition for zero-one loss functions," in Proceedings of the Thirteenth International Conference on Machine Learning
[13] Langley, P., W. Iba and K. Thompson (1992), "An Analysis of Bayesian Classifiers," in Proceedings of the Tenth National Conference on Artificial Intelligence
[14] Muggleton, S. and L. De Raedt (1994), "Inductive Logic Programming: Theory and Methods," Journal of Logic Programming, Vol. 19-20, pp. 629-679
[15] Quinlan, J. R. (1986), "Induction of Decision Trees," Machine Learning, Vol. 1
[16] Quinlan, J. R. (1990), "Learning Logical Definitions from Relations," Machine Learning, Vol. 5, pp. 239-266
[17] Quinlan, J. R. (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann
[18] Ripley, B. D. (1996), Pattern Recognition and Neural Networks, Cambridge University Press
[19] Sahami, M. (1996), "Learning Limited Dependence Bayesian Classifiers," in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 335-338, AAAI Press
[20] Sammut, C. and R. Banerji (1986), "Learning Concepts by Asking Questions," Machine Learning: An Artificial Intelligence Approach, Vol. 2, pp. 167-192, Kaufmann
[21] Srinivasan, A., S. Muggleton, R. King and M. Sternberg (1996), "Theories for Mutagenicity: a Study of First-Order and Feature-based Induction," Artificial Intelligence, Vol. 85, pp. 277-299
[22] Valiant, L. G. (1984), "A Theory of the Learnable," Communications of the ACM, Vol. 27, pp. 1134-1142
[23] Wolpert, D. H. and W. G. Macready (1995), "No free lunch theorems for search," Technical Report SFI-TR-95-02-010, The Santa Fe Institute
[24] Wolpert, D. H. and W. G. Macready (1997), "No Free Lunch Theorems for Optimization," IEEE Transactions on Evolutionary Computation, Vol. 1, No. 1