Copyright © by Yu-Chi Ho
LECTURE NOTES 12_2, VERSION: 1.10, DATE: 2002/5/11
FROM ES 205 LECTURE #16: INDUCTIVE LEARNING
Lecture Note 16: Inductive Learning
Yu-Chi Ho, Zhaohui Chen, Jonathan T. Lee
A fundamental step towards intelligent machines is learning, namely discovering knowledge from previous experience. This note gives a brief overview of the methods and ideas in supervised machine learning. We point out that learning is a multistage iterative process, and that incorporating domain knowledge is crucial to the success of a learning system.
1. Supervised Learning

People have long been dreaming of inventing intelligent machines that can replace some human cognitive activities. From our viewpoint, the very first step towards this ambitious goal is to automate the procedure of knowledge acquisition, which is the major topic of this note.
The problem considered here is supervised learning. An instance can be specified by N features X ∈ 𝒳 and a label Y ∈ 𝒴, where 𝒳 ⊆ R^N is the value space of the features and 𝒴 ⊆ R is the set of all possible labels. In this note, we focus on classification problems, where Y has categorical value and 𝒴 = {1, 2, …, K} is a finite discrete set. For a given domain, (X, Y) can be seen as a random vector with unknown joint distribution D, and the training set S = {(X_m, Y_m), m = 1, …, M} is a collection of M labeled instances randomly sampled from the distribution D. The training set S is essentially the instructor in supervised learning.

The task of a learning algorithm is to learn a model H(X; S) from training set S to predict the label of unseen instances. H, known as a hypothesis, is a mapping from the feature space to all the possible labels, namely, H : 𝒳 → 𝒴. To simplify the notation, S is usually omitted, and a learned model is denoted as H(X).
There are some fundamental difficulties associated with a learning procedure. First, the number of features N can be large, and each feature may have many, sometimes infinitely many, possible values, which makes the feature space tremendously gigantic. Second, for a given dataset, features are often correlated; as a result, information conveyed in training examples is redundant. In addition, there may be many features entirely irrelevant to the classification problem. Redundant and irrelevant features can be confusing for a learning algorithm if not treated properly. Third, observation noise may be present in both features and labels. For many domains, some features can be simply missing. Noisy data and missing features can further confuse the learning algorithms. Fourth, as we will see later on, for most learning algorithms the tradeoff between underfitting and overfitting¹ can be very subtle. We will make an effort to address the above issues in this note.
An obvious criterion of a learning algorithm is to minimize the classification error rate of the learned hypothesis, which can be defined as

$$\mathrm{Err}(H) = E_{(X,Y)\sim D}\big[\, I\big(H(X)\neq Y\big) \big] \tag{1.1}$$

where I(·) is an indicator function. If an instance (X, Y) is randomly drawn according to distribution D, Err(H) is the probability that H makes an error on it.
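In practice, the expectation in Eq. (1.1) is approximated by an empirical average over a sample. A minimal sketch in Python (the hypothesis and the labeled sample below are made up for illustration):

```python
# Empirical estimate of Eq. (1.1): the fraction of instances with H(X) != Y.
def error_rate(h, samples):
    return sum(1 for x, y in samples if h(x) != y) / len(samples)

# A made-up hypothesis that predicts from the first feature only.
h = lambda x: 1 if x[0] == 1 else 0
samples = [((1, 1), 1), ((1, 0), 1), ((0, 1), 0), ((0, 0), 1)]
print(error_rate(h, samples))   # -> 0.25 (one of four instances is misclassified)
```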
In order to achieve a low error rate in Eq. (1.1), a learned model H(X) must capture the rules that govern the "labeling procedure" of instances. One of the fundamental principles of supervised learning is that H(X) should fit the training data well. The rationale lies in the belief that one should be able to explain the observations before making predictions. If one truly understands the problem, hopefully, one will predict well.
¹ By "underfitting" we mean a model does not fit the data well; by "overfitting" we mean a model fits the data perfectly but cannot predict well.
In general, learning can be viewed as a non-linear fitting problem, and there are two major approaches to tackle it:

1. Assuming explicitly the form of the hypothesis H(X; θ), where θ are the parameters in the model, and using training set S to fit θ.
2. The Bayesian method, which estimates the posterior probability P(Y | X) from S and makes decisions based on it.

In the following sections, we will discuss these two methods briefly. A comprehensive treatment is beyond the scope of this note. Should the readers be interested in this topic, they are referred to references [3], [7], [8], [18], [11].
2. Model Fitting

Model fitting is a long-standing problem. In ancient times, people measured the positions of planets and tried to figure out their orbits. Hypotheses were formulated by fitting the accumulated observation data and used to predict the positions of planets. If a hypothesis makes correct predictions, the chance that it matches the true orbit increases; otherwise, either the hypothesis needs to be revised or the original data may be too noisy and new observations must be made.
The above cycle covers the major steps of knowledge acquisition by human beings. Machine learning tries to mimic this procedure except that a computer is utilized to process information. As mentioned, a hypothesis H is a mapping 𝒳 → 𝒴. If there is no restriction on the form of a hypothesis, a classification problem can be stated as "finding H* among all the possible mappings ℋ such that the expected error rate of Eq. (1.1) is minimized."
$$H^{*} = \arg\min_{H \in \mathcal{H}} \mathrm{Err}(H) \tag{2.1}$$

This is an optimization problem in functional space. By definition, H* is the global optimum of a classification problem. In general, (2.1) is not solvable if no further assumption is made.
If some structural form of the model is predefined, a hypothesis can be denoted as H(X; θ), where θ are parameters. In the orbit-fitting problem, for example, when a circle is used to model the orbit, θ includes the position of the center, the radius and the orbit inclination. In such cases, problem (2.1) can be restated as "given the form of the hypothesis H(X; θ), finding θ* among all the possible parameters Θ such that the error rate of Eq. (1.1) is minimized."

$$\theta^{*} = \arg\min_{\theta \in \Theta} \mathrm{Err}\big(H(X;\theta)\big) \tag{2.2}$$
This is an optimization problem in parametric space. Different algorithms have been developed to deal with it.

In model fitting, the predefined structure is highly crucial for the excellence of a hypothesis. If the assumption is informative, solving (2.2) gives a good approximation to H*, namely

$$\mathrm{Err}\big(H(X;\theta^{*})\big) \approx \mathrm{Err}(H^{*}) \tag{2.3}$$

and H(X; θ) is often referred to as an unbiased learning model. Otherwise, H(X; θ*) can be far from the global optimum.
People used to model the orbit of a planet as a circle. For some planets, this model is fairly accurate; however, large errors persisted for other planets until people figured out that ellipses are a better model. As one can see, the initial model is simply not sophisticated enough to capture the underlying mechanism that generates the data. We say that the model underfits the observed data; in other words, the domain is not learnable by the model. On the other hand, if high-degree interpolations, say a 15th-degree polynomial, are used to fit the data, it is easy to see from Figure 2-1b that we may end up with a hypothesis that fits all the observed data perfectly, but no one would expect it to make correct predictions. This phenomenon is known as overfitting.
(a) Using a circle as model (underfitting)    (b) Using high-degree interpolations (overfitting)

Figure 2-1 Fitting the Orbit of a Planet
Most machine learning algorithms struggle with this dilemma: with too rough a model, the hypothesis cannot even fit the training data; with too complicated a model, the training set can be overfitted, which leads to poor predictive performance. The tradeoff between model complexity and the number of training data can be very subtle. One of the famous principles here is Occam's razor, which says that given two fitted models, which can be seen as explanations of the observed data, all other things being equal, the simpler model is preferable. It can be shown (see [2]) that, under general assumptions, lower model complexity will lead to better predictive performance.
Several famous learning models can be seen as special cases of the model-fitting approach, for example, artificial neural networks (see [10]), inductive logic programming (see [14]) and decision trees (see [15]), etc.
2.1 Artificial neural networks (ANN)

An ANN is a connectionist model, which consists of a large number of processing units known as neurons (see Figure 2-2a). Neurons work together to perform a given task. Figure 2-2b shows a type of network topology, where the nodes are neurons and the directed edges represent connections among them. Mathematically, a neuron computes the weighted sum of its input signals, and outputs some activity level according to some given activation function f:
$$y = f\Big(\sum_{i=1}^{N} w_i X_i\Big) \tag{2.4}$$

Neurons interact with each other according to the network topology. Different weights are assigned to the edges, e.g., W_IJ, to represent the strength of connections.
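Eq. (2.4) is a one-line computation per neuron. A small sketch (the weights, inputs and activation below are made up for illustration):

```python
# A single neuron (Eq. 2.4): output = f( sum_i w_i * X_i ).
def neuron(weights, inputs, f):
    return f(sum(w * x for w, x in zip(weights, inputs)))

step = lambda s: 1 if s > 0 else 0   # a simple threshold activation
print(neuron([0.5, -0.3, 0.2], [1, 1, 1], step))   # weighted sum 0.4 > 0, so it fires: 1
```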
(a) A generic ANN neuron: inputs X1, …, XN with weights w1, …, wN feeding a summing junction and an activation f.    (b) Network topology: layers I, J, K connected by weights W_IJ and W_JK.

Figure 2-2 The Artificial Neural Network
Learning in the context of ANN can be seen as updating the network topology and connection weights based on available information so that the network can achieve a certain performance criterion. If the topology and weights are taken as the parameters θ of the model, this is exactly the problem stated in Eq. (2.2). To illustrate the idea, we here give an example of how to train a perceptron. Suppose we are given the following training set:
 #   X1   X2   X3   Y
 0    1    1    1   1
 1    1    1    0   1
 2    1    0    1   1
 3    1    0    0   1
 4    0    1    1   1
 5    0    1    0   0
 6    0    0    1   0
 7    0    0    0   0

Table 2-1 Example training set
The perceptron has 3 inputs X0, X1 and X2 (the three features of Table 2-1, relabeled), and one output Y, and the activation function is a threshold function. The decision rule is

$$H(X) = \begin{cases} 1 & \text{if } w_0 X_0 + w_1 X_1 + w_2 X_2 + \theta > 0 \\ 0 & \text{otherwise} \end{cases}$$
where θ plays the role of the threshold: it enters the weighted sum as a bias term, so the perceptron fires when the weighted sum exceeds −θ.
According to the Perceptron Learning Rule, the weights are changed by an amount proportional to the difference between the desired output and the actual output:

$$\Delta w_i = \eta\,\big(Y - H(X)\big)\,X_i, \qquad \Delta\theta = \eta\,\big(Y - H(X)\big) \tag{2.5}$$

where η is the learning rate, Y is the desired output, and H(X) is the actual output of the perceptron.
Table 2-2 shows the process of learning. We choose η = 0.4, and the initial weights to be w0 = 0.8, w1 = 0.4, w2 = 0.4 and θ = 0.5. After training, we find that the weights w0 = 1.2, w1 = 0.4, w2 = 0.4 and θ = −0.7 fit all the data points. This is the learned perceptron model for the given dataset.
X0  X1  X2  Y |   w0    w1    w2     θ |   sum  H(X) |  Δw0   Δw1   Δw2    Δθ
 1   1   1  1 | 0.80  0.40  0.40  0.50 |  2.10    1  |    0     0     0     0
 1   1   0  1 | 0.80  0.40  0.40  0.50 |  1.70    1  |    0     0     0     0
 1   0   1  1 | 0.80  0.40  0.40  0.50 |  1.70    1  |    0     0     0     0
 1   0   0  1 | 0.80  0.40  0.40  0.50 |  1.30    1  |    0     0     0     0
 0   1   1  1 | 0.80  0.40  0.40  0.50 |  1.30    1  |    0     0     0     0
 0   1   0  0 | 0.80  0.40  0.40  0.50 |  0.90    1  |    0 -0.40     0 -0.40
 0   0   1  0 | 0.80  0.00  0.40  0.10 |  0.50    1  |    0     0 -0.40 -0.40
 0   0   0  0 | 0.80  0.00  0.00 -0.30 | -0.30    0  |    0     0     0     0
 0   1   0  0 | 0.80  0.00  0.00 -0.30 | -0.30    0  |    0     0     0     0
 0   0   1  0 | 0.80  0.00  0.00 -0.30 | -0.30    0  |    0     0     0     0
 …   …   …  … |    …     …     …     … |     …    …  |    …     …     …     …
 1   1   1  1 | 1.20  0.40  0.40 -0.70 |  1.30    1  |    0     0     0     0
 1   1   0  1 | 1.20  0.40  0.40 -0.70 |  0.90    1  |    0     0     0     0
 1   0   1  1 | 1.20  0.40  0.40 -0.70 |  0.90    1  |    0     0     0     0
 1   0   0  1 | 1.20  0.40  0.40 -0.70 |  0.50    1  |    0     0     0     0
 0   1   1  1 | 1.20  0.40  0.40 -0.70 |  0.10    1  |    0     0     0     0
 0   1   0  0 | 1.20  0.40  0.40 -0.70 | -0.30    0  |    0     0     0     0
 0   0   1  0 | 1.20  0.40  0.40 -0.70 | -0.30    0  |    0     0     0     0
 0   0   0  0 | 1.20  0.40  0.40 -0.70 | -0.70    0  |    0     0     0     0

Table 2-2 Training the perceptron to fit the dataset
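The update loop of Table 2-2 is easy to reproduce. Below is a minimal sketch (our own encoding, not from the note); weights are stored in tenths as integers so the 0.4-sized steps stay exact, and cycling through the examples of Table 2-1 in order happens to reach the same final weights as reported above.

```python
# Perceptron learning rule (Eq. 2.5): dw_i = eta*(Y - H(X))*X_i, dtheta = eta*(Y - H(X)).
# All quantities are kept in tenths (integers) to avoid floating-point drift.
DATA = [((1, 1, 1), 1), ((1, 1, 0), 1), ((1, 0, 1), 1), ((1, 0, 0), 1),
        ((0, 1, 1), 1), ((0, 1, 0), 0), ((0, 0, 1), 0), ((0, 0, 0), 0)]

def predict(w, theta, x):
    # Decision rule: H(X) = 1 if w.x + theta > 0, else 0 (theta acts as a bias).
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + theta > 0 else 0

def train(data, w, theta, eta, max_epochs=100):
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            h = predict(w, theta, x)
            if h != y:
                mistakes += 1
                w = [wi + eta * (y - h) * xi for wi, xi in zip(w, x)]
                theta += eta * (y - h)
        if mistakes == 0:          # a full error-free pass: converged
            return w, theta
    raise RuntimeError("not linearly separable within max_epochs")

# Initial values from the note, scaled by 10: w = (0.8, 0.4, 0.4), theta = 0.5, eta = 0.4.
w, theta = train(DATA, [8, 4, 4], 5, 4)
print([wi / 10 for wi in w], theta / 10)   # -> [1.2, 0.4, 0.4] -0.7
```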
One of the most important types of topology in ANN is the feedforward network, where neurons are arranged in layers, the output of a neuron is an input of another neuron in the next layer, and no loop is allowed in the graph. One of the fundamental results in ANN is that (see [10]) with an arbitrary number of non-linear neurons, a 3-layer or even 2-layer feedforward ANN is capable of approximating any function with arbitrary accuracy. That is to say, H(X; θ*) can be arbitrarily close to the global optimum H* if there are enough neurons and the optimal parameters θ* can be found. Therefore, a feedforward ANN with a large number of neurons is known as an unbiased learner. On the other hand, if an ANN has too many neurons, and there is not enough
training data to produce accurate estimates of the parameters, the ANN model can overfit the training data easily, and generalization performance can be poor.
2.2 Inductive logic programming (ILP)

Given training examples and prior knowledge, an ILP system uses a set of first-order logic² clauses to predict the label of unseen instances. ILP only deals with binary values; for some domains, data transformation is necessary before the model can be applied. If we take the maximal number of literals in a clause, the presence or absence of each literal and the logic operators as the parameters of an ILP model, the problem becomes a special case of Eq. (2.2).
The basic idea of ILP is to find as few short (in terms of the number of literals used) clauses as possible to explain all the positive examples. To search for such clauses, ILP starts with the shortest ones. As for the training set in Table 2-1, we can see that the clause Y ← X1 explains instances #0, #1, #2 and #3 in the training data, and Y ← X2 ∧ X3 covers instance #4. The two clauses fit all the positive instances in the training set; otherwise, an instance is negative.
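The two clauses can be checked mechanically against Table 2-1; a small sketch (encoding the clauses as Python predicates is our illustration, not ILP syntax):

```python
# Table 2-1 as ((X1, X2, X3), Y) pairs.
DATA = [((1, 1, 1), 1), ((1, 1, 0), 1), ((1, 0, 1), 1), ((1, 0, 0), 1),
        ((0, 1, 1), 1), ((0, 1, 0), 0), ((0, 0, 1), 0), ((0, 0, 0), 0)]

# The two learned clauses: Y <- X1, and Y <- X2 AND X3.
clauses = [lambda x: x[0] == 1,
           lambda x: x[1] == 1 and x[2] == 1]

def predict(x):
    # An instance is classified positive iff some clause covers it.
    return 1 if any(c(x) for c in clauses) else 0

assert all(predict(x) == y for x, y in DATA)   # the clauses fit every instance
```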
Several issues arise when an algorithm is devised to solve the problem. First, the size of the parameter space Θ increases exponentially fast as the maximal number of literals in a clause goes up. Hence, an efficient search scheme becomes very important. Second, it is not easy to work directly with the error rate of Eq. (1.1); therefore, some metric score must be defined to evaluate the goodness of a candidate clause. Mutual information and other entropy-based scores are among the popular choices. Another important issue is the learnability of an ILP system. [14] presents a brief discussion on this problem. Generally, it depends on the underlying domain and the maximal number of literals. As one can see, the more literals allowed in a clause, the better the training examples can be explained. However, this violates Occam's razor and may lead to overfitting.
² In symbolic logic, a variable is called a literal. First-order logic takes only individual literals as arguments and quantifiers; for example, (A ∧ B) ∨ (C ∧ ¬D) reads "A and B, or C and not D".
2.3 Decision tree

Decision trees can be seen as an extension of ILP in the sense that real-number inputs and multi-class outputs are allowed. For the fitting part, a tree-growing process tries to partition the training set into single-class subsets according to the values of the input features. If one feature cannot achieve a single-class partition, more features are used to grow the decision tree. Again, if we take all the splitting points as parameters of the model, the problem becomes Eq. (2.2) once again.
For example, given the training set in Table 2-1, we have the following partition, where in each box the first number is the number of positive examples in the subset and the second is the number of negative examples:

    S (5:3)
    ├─ X1 = 1 → (4:0)
    └─ X1 = 0 → (1:3)
        ├─ X2 = 1 → (1:1)
        │   ├─ X3 = 1 → (1:0)
        │   └─ X3 = 0 → (0:1)
        └─ X2 = 0 → (0:2)

Figure 2-3 Decision tree: split the training set according to the value of a feature
Different criteria are used to decide which feature is to be selected at each decision point. As in ILP, entropy-based scores are among the popular choices. For example, the mutual information between two random variables X and Y is defined as

$$I(X;Y) = \sum_{x}\sum_{y} P(x,y)\,\log_2\frac{P(x,y)}{P(x)\,P(y)} \tag{2.6}$$
In our example, the mutual information between each feature X_i and the class label Y comes to approximately 0.55 bit, 0.05 bit and 0.05 bit respectively; that is why we first choose X1 to grow the decision tree.
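Eq. (2.6) can be evaluated directly on the eight rows of Table 2-1; the following sketch recomputes the score for each feature:

```python
from collections import Counter
from math import log2

# Table 2-1 as ((X1, X2, X3), Y) pairs.
DATA = [((1, 1, 1), 1), ((1, 1, 0), 1), ((1, 0, 1), 1), ((1, 0, 0), 1),
        ((0, 1, 1), 1), ((0, 1, 0), 0), ((0, 0, 1), 0), ((0, 0, 0), 0)]

def mutual_information(pairs):
    # Eq. (2.6): I(X;Y) = sum_{x,y} P(x,y) * log2( P(x,y) / (P(x) P(y)) ),
    # with probabilities estimated by empirical frequencies.
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

for i in range(3):
    mi = mutual_information([(x[i], y) for x, y in DATA])
    print(f"I(X{i + 1}; Y) = {mi:.2f} bit")   # X1 scores ~0.55; X2 and X3 only ~0.05
```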
As one can see, with a large enough number of features, a single-class partition is always achievable. However, the goal is not merely to find such a partition, but to build a tree that reveals the structure of the domain and makes predictions. To avoid overfitting,
we need a significant number of instances at each leaf or, to put it another way, there should be as few partitions in the decision tree as possible.
3. Bayesian Decision Analysis

Another approach to the classification problem is the probabilistic model. As mentioned above, training set S can be viewed as a set of random samples from the unknown distribution D. The basic idea of probabilistic methods is to estimate this distribution based on the available training data.
The criterion of minimizing the error rate in Eq. (1.1) implicitly assumes that all the classes have equal priority. Let us consider a more general case where the loss function is defined by Table 3-1. Namely, if the true label of an instance X is Y = k, and the hypothesis predicts it as H(X) = l, it encounters a loss of L_kl.
          H(X) = 1   H(X) = 2    …    H(X) = K
 Y = 1      L_11       L_12      …      L_1K
 Y = 2      L_21       L_22      …      L_2K
  …          …          …        …       …
 Y = K      L_K1       L_K2      …      L_KK

Table 3-1 The loss function of a classification problem
For a given instance X, the posterior probability P(Y = k | X) is the probability that k is the correct label of X. If a hypothesis predicts H(X) = l, the expected loss associated with this decision is known as the risk of the decision

$$R(l \mid X) = \sum_{k=1}^{K} L_{kl}\,P(Y = k \mid X) \tag{3.1}$$

which can be minimized by choosing

$$H(X) = \arg\min_{l \in \mathcal{Y}}\; R(l \mid X) \tag{3.2}$$
It is easy to see that, if a hypothesis minimizes the risk in Eq. (3.1) at every instance X ∈ 𝒳, i.e.,
$$H_B(X) = \arg\min_{l}\; R(l \mid X), \qquad \forall\, X \in \mathcal{X} \tag{3.3}$$

the overall risk

$$R = E_X\big[\, R(H(X) \mid X) \big] \tag{3.4}$$

is also minimized.

According to Eq. (3.4), R can be seen as the risk associated with decision rule H(X). Eq. (3.3) is known as the Bayesian decision rule, and the risk of the Bayesian decision rule is called the Bayesian risk

$$R_B = E_X\Big[\, \min_{l}\, R(l \mid X) \Big] \tag{3.5}$$

which is the theoretical lower bound of achievable risk.
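Eqs. (3.1)-(3.3) amount to a couple of lines once the posterior and the loss matrix are given. A sketch with a made-up two-class posterior and an asymmetric loss (all numbers are illustrative only):

```python
# Bayesian decision rule (Eq. 3.3): pick the label l minimizing the risk of Eq. (3.1).
def bayes_decision(posterior, loss):
    # posterior[k] = P(Y=k | X); loss[k][l] = L_kl, cost of predicting l when truth is k.
    K = len(posterior)
    risk = [sum(loss[k][l] * posterior[k] for k in range(K)) for l in range(K)]
    return min(range(K), key=lambda l: risk[l]), risk

# Made-up example: class 1 is only 30% likely, but missing it costs 10x more.
posterior = [0.7, 0.3]
loss = [[0, 1],    # truth 0: predicting 1 costs 1
        [10, 0]]   # truth 1: predicting 0 costs 10
decision, risk = bayes_decision(posterior, loss)
print(decision, risk)   # picks the less likely class 1, because its risk is lower
```

Note that the risk-minimizing decision here differs from the posterior-maximizing one; the two coincide only under the zero-one loss discussed next.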
Many machine learning algorithms assume the zero-one loss function, where all the diagonal elements in Table 3-1 are 0 and all the off-diagonal elements are 1. Namely, a hypothesis loses nothing when making a correct prediction and all errors are equally costly. In such a case, the risk of Eq. (3.1) becomes

$$R(l \mid X) = \sum_{k \neq l} P(Y = k \mid X) = 1 - P(Y = l \mid X) \tag{3.6}$$
To minimize (3.6), a hypothesis has to maximize the posterior probability, i.e.,

$$H_B(X) = \arg\max_{l}\; P(Y = l \mid X) \tag{3.7}$$

Since P(Y = l | X) is the probability that l is the correct label, 1 − max_l P(Y = l | X) is the minimal probability of making an error on instance X. Consequently, the (Bayesian) risk associated with decision rule (3.7) becomes
$$R_B = E_X\Big[\, 1 - \max_{l}\, P(Y = l \mid X) \Big] \tag{3.8}$$

which is the theoretical lower bound of the prediction error rate in Eq. (1.1).
In Bayesian decision analysis, under the zero-one loss function, the posterior probability is a discriminant function for classification. Namely, given X ∈ 𝒳, if

$$g_k(X) = P(Y = k \mid X) \tag{3.9}$$

can be evaluated for each class k ∈ 𝒴, the one with the largest g_k(X) is the optimal decision. We shall point out that Eq. (3.9) is not the unique discriminant function that attains the Bayesian risk (3.8). It is obvious that, for any monotonically increasing function g, g(P(Y = k | X)) also attains (3.8).
Now, the only thing left is how to estimate the posterior probability P(Y = k | X), or some g_k(X), from the training set S. In the Bayesian framework, P(Y = k | X) can be calculated by Bayes rule

$$P(Y = k \mid X) = \frac{P(X \mid Y = k)\, P(Y = k)}{P(X)} \tag{3.10}$$

where

$$P(X) = \sum_{k \in \mathcal{Y}} P(X \mid Y = k)\, P(Y = k) \tag{3.11}$$

Here P(Y = k) is the prior or unconditional probability of class k, and P(X | Y = k) is known as the likelihood function or class density function given Y = k. Since P(X) is always the same for all classes, it will not influence the decision. Several discriminant functions equivalent to Eq. (3.9) can be derived straightforwardly from the above discussion, for example

$$g_k(X) = P(X \mid Y = k)\, P(Y = k) \tag{3.12}$$

and
$$g_k(X) = \log P(X \mid Y = k) + \log P(Y = k) \tag{3.13}$$
The basic idea of Bayesian decision analysis is to minimize the risk of the hypothesis H(X). Under the zero-one loss function, this is equivalent to maximizing the posterior probability. Therefore, estimating P(Y = k | X) is the critical step. By Bayes rule, the problem can be turned into estimating the class density function P(X | Y = k) and the prior probability P(Y = k). The latter is relatively simple, but for many real-world domains, since the dimensionality of the feature space 𝒳 is high, estimating P(X | Y = k) can be difficult. One common approach is to assume some special form of the distribution, say a multivariate normal distribution, and to estimate its parameters, say the mean and variance, using the training data. Under such circumstances, the probabilistic method becomes a special form of the model fitting approach.
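As a concrete instance of this reduction, one can assume each class density is normal (one-dimensional here, for brevity), estimate its mean and variance from the training data, and classify with the log-discriminant of Eq. (3.13). Everything below, the toy data included, is illustrative only:

```python
from collections import defaultdict
from math import log, pi

# Toy 1-D training data: (feature, class) pairs, made up for illustration.
TRAIN = [(1.0, 0), (1.2, 0), (0.8, 0), (3.0, 1), (3.3, 1), (2.9, 1)]

# Estimate the prior P(Y=k) and a normal class density P(x | Y=k) per class.
by_class = defaultdict(list)
for x, y in TRAIN:
    by_class[y].append(x)

params = {}
for k, xs in by_class.items():
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    params[k] = (mean, var, len(xs) / len(TRAIN))

def g(k, x):
    # Eq. (3.13): g_k(x) = log P(x | Y=k) + log P(Y=k), with a normal density.
    mean, var, prior = params[k]
    log_density = -0.5 * log(2 * pi * var) - (x - mean) ** 2 / (2 * var)
    return log_density + log(prior)

def classify(x):
    return max(params, key=lambda k: g(k, x))

print(classify(1.1), classify(3.1))   # -> 0 1
```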
This note focuses on the probabilistic model, and we will talk further about how to estimate the class density function when we discuss specific algorithms.
4. Machine Learning: a Multistage Process

Human knowledge acquisition is a multistage iterative process; the orbit-fitting problem in Section 2 can be seen as a simplified version of this procedure. Figure 4-1 summarizes some of the key stages of learning. The boxes with gray background represent information flow, while the ones with clear background represent different activities, among which are data collection, feature selection, model choice, inductive learning, decision making and interpretation, etc. At each stage, we may correct previous mistakes or adjust choices based on new information. The rectification in the learning process is represented by the feedback loops in Figure 4-1.
[Figure: raw data flows through data preprocessing (data cleansing, feature extraction / selection) into the training set; inductive learning produces a predictive model; prediction / generalization yields generalization results; interpretation yields final decisions; feedback loops run back through model evaluation and data collection.]

Figure 4-1 Knowledge acquisition as a multistage procedure
As mentioned, the fundamental idea of machine learning is to replicate human cognitive activities by employing some man-made computational power other than the human brain. Hence, from our viewpoint, machine learning should also follow the same cycle as that in Figure 4-1.
Traditional machine learning literature focuses largely on the learning stage; relatively little is said about data preprocessing and model interpretation, sometimes known as post-processing. The reason is that learning might be the only stage in the cycle that is not as problem-dependent as the other stages. However, in the real world, the interface of an algorithm, for example, how to transform raw data into a format readable by the algorithm, how to interpret the outputs to the end-user, etc., can be very crucial to the success of a learning system.
Data collection
is the starting point of a learning system; the quality of raw data
determines the overall achievable performance. If a problem can be
well formulated at
the beginning and adequate information can be collected, we shall have a better
chance to achieve good results. Collecting data can be very costly and time-consuming, sometimes accounting for a large portion of the total cost of a system. Thanks to advances in information technology, data storage is now no longer a serious concern; on the other hand, we may be overwhelmed by too much data.
The goal of data preprocessing is to discover important features from raw data. This is the first step towards problem solving. Data preprocessing includes data cleaning, normalization, transformation, feature extraction and selection, etc. Learning can be straightforward if informative features can be identified at this stage. The detailed procedure depends highly on the nature of the raw data and the problem domain. Prior knowledge can sometimes be extremely valuable. For many systems, this stage is still primarily conducted by experienced human beings. The product of data preprocessing is the training set.
Given the training set, a learning model has to be chosen to learn from it. As seen in Sections 2 and 3, people have invented and studied many different models. However, given a specific domain, there is no clear-cut guideline for model choice. In practice, the implementation of a learning system depends mainly on the knowledge of human experts and the available resources. The basic idea of inductive learning is to incorporate domain knowledge and training information to produce a predictive model. As mentioned, it can always be viewed as an optimization problem. The No-Free-Lunch (NFL) theorem for optimization (see [23] and [24]) states that if problem-specific structure is unknown, then on average all algorithms perform the same as blind search. In the context of machine learning, NFL simply implies that no learning algorithm is intrinsically superior to another if all possible problems are considered. The only way to make a difference is to utilize domain-specific knowledge explicitly in the learning procedure.
After a predictive model is learned, the hypothesis is exposed to new instances of problems to test whether it captures the domain knowledge or not. As shown in Figure 4-1, feedback is sometimes necessary to correct earlier mistakes. For certain domains, an interpretation system is required to justify the learned hypothesis before it can be deployed to make decisions. For example, in medical tests, no one would trust a black box without asking why it works. The justification often depends on inter-domain knowledge that is not accessible to the learning model. Hence, at least up to now, it still relies on human experts.
One of the misleading concepts in machine learning is that a comprehensive learning algorithm can be applied to any dataset and solve any problem. However, powerful learning models tend to overfit the training data and get poor predictive performance. Furthermore, the NFL theorem says that without domain knowledge, no algorithm is superior to the others. Hence, to achieve better performance, it is always important to gain as much insight into the domain as possible from the very beginning.
References

[1] Anderson, J. (1995), An Introduction to Neural Networks, The MIT Press
[2] Angluin, D. and C. H. Smith (1983), "Inductive Inference: Theory and Methods," Computing Surveys, Vol. 15, No. 3, pp. 327-369
[3] Bishop, C. M. (1995), Neural Networks for Pattern Recognition, Oxford University Press
[4] Blumer, A., A. Ehrenfeucht, D. Haussler and M. Warmuth (1987), "Occam's Razor," Information Processing Letters, Vol. 24, pp. 377-380
[5] Cestnik, B. (1990), "Estimating Probabilities: a Crucial Task in Machine Learning," in Proceedings of the Ninth European Conference on Artificial Intelligence
[6] Clark, P. and T. Niblett (1989), "The CN2 Induction Algorithm," Machine Learning, Vol. 3
[7] Duda, R. O. and P. E. Hart (1973), Pattern Classification and Scene Analysis, John Wiley & Sons Inc.
[8] Duda, R. O., P. E. Hart and D. G. Stork (2001), Pattern Classification, Second Edition, John Wiley & Sons Inc.
[9] Friedman, N., D. Geiger and M. Goldszmidt (1997), "Bayesian Network Classifiers," Machine Learning, No. 29, pp. 131-163
[10] Hertz, J., A. Krogh and R. G. Palmer (1991), Introduction to the Theory of Neural Computation, Addison-Wesley
[11] Kearns, M. J. and U. V. Vazirani (1997), An Introduction to Computational Learning Theory, The MIT Press
[12] Kohavi, R. and D. H. Wolpert (1996), "Bias plus variance decomposition for zero-one loss functions," in Proceedings of the Thirteenth International Conference on Machine Learning
[13] Langley, P., W. Iba and K. Thompson (1992), "An Analysis of Bayesian Classifiers," in Proceedings of the Tenth National Conference on Artificial Intelligence
[14] Muggleton, S. and L. De Raedt (1994), "Inductive Logic Programming: Theory and Methods," Journal of Logic Programming, Vol. 19-20, pp. 629-679
[15] Quinlan, J. R. (1986), "Induction of Decision Trees," Machine Learning, Vol. 1
[16] Quinlan, J. R. (1990), "Learning Logical Definitions from Relations," Machine Learning, Vol. 5, pp. 239-266
[17] Quinlan, J. R. (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann
[18] Ripley, B. D. (1996), Pattern Recognition and Neural Networks, Cambridge University Press
[19] Sahami, M. (1996), "Learning Limited Dependence Bayesian Classifiers," in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 335-338, AAAI Press
[20] Sammut, C. and R. Banerji (1986), "Learning Concepts by Asking Questions," Machine Learning: An Artificial Intelligence Approach, Vol. 2, pp. 167-192, Kaufmann
[21] Srinivasan, A., S. Muggleton, R. King and M. Sternberg (1996), "Theories for Mutagenicity: a Study of First-Order and Feature-based Induction," Artificial Intelligence, Vol. 85, pp. 277-299
[22] Valiant, L. G. (1984), "A Theory of the Learnable," Communications of the ACM, Vol. 27, pp. 1134-1142
[23] Wolpert, D. H. and W. G. Macready (1995), "No Free Lunch Theorems for Search," Technical Report SFI-TR-95-02-010, The Santa Fe Institute
[24] Wolpert, D. H. and W. G. Macready (1997), "No Free Lunch Theorems for Optimization," IEEE Transactions on Evolutionary Computation, Vol. 1, No. 1