Learning in Document Analysis and
Understanding
Tutorial  ICDAR’99
Prof. Floriana Esposito
Prof. Donato Malerba
Dipartimento di Informatica
University of Bari, Italy
http://lacam.di.uniba.it:8000/
ICDAR’99  Tutorial
Overview
Motivations
What’s learning
Attribute-based learning
Statistical learning
Decision-tree learning
Learning sets of rules
First-order learning
Applications to intelligent document processing
WISDOM++
Conclusions
Objectives
Heighten awareness: many machine learning tools can be used for document processing applications
Connect: inspire/encourage collaborations
Learning in Document Processing: some data
DOCBIB (http://documents.cfar.umd.edu/biblio/) is a collection of over 2750 references on topics such as preprocessing, representation, feature extraction, OCR, on-line recognition, text/graphics discrimination, signature verification and layout analysis.
Updated through 1997, the bibliography is maintained by the Laboratory for Language and Media Processing (LAMP) at the University of Maryland at College Park.
Learning in document processing: some data
Querying DOCBIB by keyword:
LEARNING: thirty-two references are found; twenty-six are papers published in the 1990s.
CLUSTERING: six new references
NEURAL NET: ninety-three independent references
The bibliography is certainly incomplete!
Document processing requires a large amount of knowledge
Flexible document processing systems require a large amount of knowledge.
The segmentation of the document image can be based on the layout structure of specific classes;
The separation of text from graphics requires knowledge of how text blocks can be discriminated from non-text blocks.
Document analysis and understanding as a branch of artificial intelligence (Tang et al., 1994).
Hand-coding knowledge?
Problems related to knowledge representation and acquisition are relevant for the development of “intelligent” document processing systems.
A great effort is made to hand-code the necessary knowledge according to some formalism:
block grammars (Nagy et al., 1992)
geometric trees (Dengel & Barth, 1989)
frames (Bayer et al., 1994)
Application of different machine learning algorithms in order to solve the knowledge acquisition problem.
Typical machine learning applications
Inputs are examples
Reduce problem to classification
Generate rules from examples
An alternative to eliciting rules from experts
Experts prefer (and excel in) describing
expertise with examples, not rules
Advice of experts is costly
Expertise unavailable.
Greater development speed.
Comparing development times (Michie, 1989)
Name     Application                                                ML tool                  # Rules   Develop (yrs)   Maintain (yrs)
MYCIN    medical diagnosis                                          none                     100       100             N/A
XCON     VAX computer configuration                                 none                     8000      180             30
GASOIL   gasoil separator system configuration                      ExpertEase, Extran7      2800      1               0.1
BMT      configuration of fire-protection equipment in buildings    1st Class, RuleMaster    >30000    9               2
Overview
Motivations
What’s learning
Attribute-based learning
Statistical learning
Decision-tree learning
Learning sets of rules
First-order learning
Applications to intelligent document processing
WISDOM++
Conclusions
What is learning?
A common view: a system learns if it makes changes in itself that enable it to better perform a given task
Learning is a multifaceted phenomenon comprising:
acquisition of declarative knowledge
development of motor and cognitive skills through instruction and practice
organization of knowledge into new, more effective representations
discovery of new facts & theories through observation and experimentation
History
Three paradigms:
Neural modeling and Decision-Theoretic techniques
Symbolic Concept-oriented techniques
Knowledge-intensive Learning
Neural modeling (1955-1965) (1986…)
Building general-purpose learning systems with little or no initial structure and/or a priori knowledge
Tabula Rasa = learning without knowledge
Neural modeling = self-organizing systems
Neural Networks
[McCulloch & Pitts, 1943], [Rashevski, 1948], [Rosenblatt, 1958], [Selfridge, 1959], [Widrow, 1962], [Minsky & Papert, 1969], [Rumelhart & McClelland, 1986], [Hinton, 1989]
Simulation of evolutionary processes
[Holland, 1980], [De Jong, 1989]
Decision-Theoretic Techniques (1955-1965)
Learning is estimating parameters from a set of training examples
Discriminant functions
[Nilsson, 1965], [Uhr, 1966], [Samuel, 1959, 1963]
Statistical Decision Theory
[Fu, 1968], [Watanabe, 1960], [Duda & Hart, 1973], [Kanal, 1974]
Adaptive Control Systems
[Truxal, 1955], [Tsypkin, 1968, 1971, 1973], [Fu, 1971, 1974]
Symbolic Concept-oriented Techniques (1962-1980)
Learning is obtaining high-level concept descriptions from a set of observations, using logic or graph structures
Psychological studies
[Hunt & Hovland, 1963], [Feigenbaum, 1963], [Simon & Lea, 1974]
Task-oriented systems
[Buchanan, 1978]
General-Purpose Inductive Systems
[Winston, 1975], [Michalski, 1972-1978], [Hayes-Roth & McDermott, 1978], [Vere, 1975], [Mitchell, 1978]
Knowledge-Intensive Learning Systems and Multistrategy Learning (1980 - today)
Learning uses models of deep knowledge and integrates different methods
Exploration and integration of different strategies
Knowledge-intensive techniques
Inductive Logic Programming
Successful applications
Machine Learning Workshops/Conferences ICML (CMU 1980, … Bled 1999)
ICML Proceedings 1987-1999 by Morgan Kaufmann
Machine Learning Journal (1986, …), Kluwer Academic Publishers
The general model of a Learning System
Learning is the improvement of performance in some environment through the acquisition of knowledge resulting from experience in that environment
(Pat Langley, “Elements of Machine Learning”, Morgan Kaufmann, 1996)
[Figure: the general model, with the components Environment, Knowledge, Performance and Learning]
Learning Systems
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
A well-defined learning program must identify the class of tasks, the measure of performance to be improved and the source of experience.
Example: A handwriting recognition learning problem
Task T: recognizing and classifying handwritten words within images
Performance measure P: percent of words correctly classified
Training experience E: a database of handwritten words with given classification
(Tom M. Mitchell, “Machine Learning”, McGraw Hill, 1997)
Basic Questions
What do machines learn? (Knowledge Representation)
What kind of problems can be addressed? (Tasks)
How do machines learn? (Reasoning Mechanisms)
What do Machines learn?
Numerical Parameters / Probability Distributions
Grammars / Automata
Decision Trees
Production Rules / Logical Formulas
Taxonomies
Schemes / Frames / Graphs
Neural nets
Subsymbolic and Symbolic Learning
A first criterion to distinguish among the methods lies in the cognitive aspects of the knowledge representation employed
The subsymbolic methods (neural networks, genetic algorithms, etc.) employ a general and uniform knowledge representation, use little background knowledge, and are easy to experiment with.
The symbolic methods have the goal of acquiring human-type concepts, analyzing and transforming the knowledge already possessed into a more “operational” form
Both learning and performance rely on the ability to represent knowledge
Representing experience (the INPUT to learning)
Representing and organizing the acquired knowledge (the OUTPUT of learning)
Representing experience
The simplest approach: boolean or binary features
A slightly more sophisticated formalism: a set of nominal attributes
Sometimes it is also convenient to use numeric attributes
This makes it possible to define the instance space: given k numeric attributes, one can represent any given instance as a point in a k-dimensional space, with the attributes defining the axes
(the feature vectors can be viewed as a special case of a numeric encoding with values limited to 0 and 1)
Representing experience
Some problems require more sophisticated formalisms (not only attributes but also relations among objects)
In this case it is convenient to use more powerful notations that can describe objects and express relationships among objects: sets of relational literals
Representing experience
LOGICAL ZERO-ORDER LANGUAGES
Feature vectors: $s = (v_1, v_2, \ldots, v_n)$
Pairs (attribute = value): (shape=triangle), (size=medium), (color=red), (area=0.75)
Representing experience
LOGICAL FIRST-ORDER LANGUAGES: PREDICATE CALCULUS
[Figure: an IMAGE WORLD containing two entities, A and B, with B on A, mapped to Symbols]
Entities: A, B
Relations: On-relation
Predicates: ON(B, A)
Functions: universal function ON; X = ON(B), X = A, A = ON(B)
Representing the knowledge
In order to represent the acquired knowledge many symbolic learning systems, sometimes defined as “conceptual”, express the output as conjunctive concepts
These systems generally perform categorizations and make predictions.
A “concept” is the intensional definition of a class (category), extensionally represented by all the instances of that class.
Representing the knowledge
Unstructured Domains
Instance: vector of <Attribute, Value> pairs
$e_1$ = (<color, red>, <shape, square>, <edgelength, 2>)
A propositional calculus is sufficient to express concepts, i.e. descriptions that are combinations of attribute values:
[color = red] $\wedge$ [shape = square $\vee$ triangle]
Representing the knowledge
Structured Domains
Instance: Parts, Attributes, Relations
[Figure: an instance made of two parts, a and b]
e = square(a) $\wedge$ triangle(b) $\wedge$ large(a) $\wedge$ small(b) $\wedge$ ontable(a) $\wedge$ on(b,a) $\wedge$ red(b) $\wedge$ green(a)
A first-order predicate calculus is necessary to express concepts, i.e. descriptions that are combinations of attribute values and relations among parts:
red(y) $\wedge$ triangle(y) $\wedge$ square(x) $\wedge$ on(y,x)
CONCEPTS
Levels of Concept Descriptions
LOGICAL DESCRIPTIONS
STRUCTURAL CONJUNCTIVE DESCRIPTIONS
ATTRIBUTIONAL CONJUNCTIVE DESCRIPTIONS
[Figure: example descriptions at each level; the attributional example is $[x_i = a \vee b]$ & $[x_j = 2..5]$ & $[x_k = C]$, where $x_i$ is a nominal, $x_j$ a linear and $x_k$ a structured attribute]
The task
A WAY TO CLASSIFY THE SYSTEMS
One-step classification and prediction
Goal: to increase the accuracy of the performance system
Multistep inference or problem solving
Goal: to increase the efficiency of the problem-solving learner
Discovery (acquisition of objective knowledge & theory formation)
Goal: to increase the performance when controlling the environment
The degree of supervision
SUPERVISED LEARNING
The tutor of the learner gives direct feedback about the goodness of its performance
for classification problems, each training instance includes an attribute specifying the class of that instance (learning from examples). The goal is to induce a concept description that predicts this attribute
for problem solving tasks, the tutor suggests the correct step at each point in the reasoning process (learning apprentices)
The degree of supervision
UNSUPERVISED LEARNING
The feedback by the tutor is absent
for classification problems, the entire set of observations is supplied (learning from observations): the goal is to induce a categorization, i.e. clusters of observations
for problem solving tasks, credit assignment, i.e. determining the degree to which each move in a sequence deserves credit or blame for the final outcome
How do Machines Learn?
Induction
Deduction
Abduction
Analogy
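How do Machines Learn?
[Figure: INDUCTION goes from facts, events and observations to theories, rules and models; DEDUCTION goes from theories, rules and models back to facts, events and observations]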
Inferences
Deductive inference
all men are mortal
Socrates is a man
Socrates is mortal
Approximate deductive inference
Smoking people are liable to cancer
John is a smoker
it is possible that John will have cancer
Inferences
Inductive inference
$x_1$ has property A
$x_2$ has property A
…
$x_n$ has property A
$\forall x \in X$: “x has property A”
Abductive inference
A man who has drunk a lot loses his balance
I see Peter lose his balance and fall down
perhaps Peter has drunk a lot
Inferences
Analogical inference
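[Figure: an analogy between the elements a, b, c and the elements 1, 2, 3, 4, with the correspondences labelled A:C and 1:4]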
Abduction-deduction-induction cycle
[Figure: a cycle linking Abduction, Deduction and Induction, with the labels Initial information, Ranking Hypotheses, Expected consequences and Request New Information]
The Inductive Paradigm
GIVEN
A set of observed facts F
A background knowledge B
A set of hypotheses H
FIND
those hypotheses that, together with B, explain the observed facts F:
$H \cup B \models F$
Empirical Learning (inductively learning from many data)
SUPERVISED (heuristic induction)
GIVEN
A set of classes (concepts) expressed in the language $L_c$: $\{C_i\}_{i=1..k}$
A set of examples of each class, expressed in the language $L_o$: $E = \{E_{c_i}\}$, with $E_{c_i} \cap E_{c_j} = \emptyset$ for $i \neq j$
A background knowledge $B$
FIND
for each class $C_i$ the inductive hypothesis $H_i$ such that
$H_i \cup B \models e$ where $e \in E_{c_i}$
$H_i \cup B \not\models e$ where $e \in E_{c_j}$ with $j \neq i$
Empirical Learning (inductively learning from many data)
UNSUPERVISED (learning from observation)
GIVEN
A set of observations, $E$
A number of concepts we want to discover, $N$
A background knowledge $B$
FIND
the best partitioning of the set of observations into $N$ concepts such that, for each class, the inductive hypothesis $H_i$ is formalized so that
$H_i \cup B \models e$ where $e \in C_i$, $i = 1..N$
$H_i \cup B \not\models e$ where $e \in C_j$, $j \neq i$
Example
The objects are, for example, Saturday mornings and the classification task involves the weather
ATTRIBUTES:
outlook, with values {sunny, overcast, rain}
temperature, with values {cool, mild, hot}
humidity, with values {high, normal}
windy, with values {true, false}
A small training set
No    Outlook    Temp.   Humidity   Windy   Class
D1    sunny      hot     high       F       N
D2    sunny      hot     high       T       N
D3    overcast   hot     high       F       P
D4    rain       mild    high       F       P
D5    rain       cool    normal     F       P
D6    rain       cool    normal     T       N
D7    overcast   cool    normal     T       P
D8    sunny      mild    high       F       N
D9    sunny      cool    normal     F       P
D10   rain       mild    normal     F       P
D11   sunny      mild    normal     T       P
D12   overcast   mild    high       T       P
D13   overcast   hot     normal     F       P
D14   rain       mild    high       T       N
How many hypotheses?
Universe containing $u$ objects
Training set containing $m$ of them
$k$ classes
Concept labels each element of universe with a class
Number of concepts consistent with the training set is $k^{u-m}$
As to the previous example
14 of 36 possible objects
$2^{22}$ possible completions (about 4.2 million!)
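The count for the weather example can be checked with a few lines of Python. This is only a sketch: the function name `consistent_concepts` is an illustrative assumption, and the universe size 36 is taken as the product of the attribute domain sizes (3 outlooks, 3 temperatures, 2 humidity levels, 2 wind values).

```python
def consistent_concepts(k: int, u: int, m: int) -> int:
    """Number of ways to label a universe of u objects with k classes
    while agreeing with m already-labelled training objects: k ** (u - m)."""
    if not 0 <= m <= u:
        raise ValueError("m must lie between 0 and u")
    return k ** (u - m)


# Weather example: 3 * 3 * 2 * 2 = 36 possible objects, 14 observed, 2 classes.
universe_size = 3 * 3 * 2 * 2
print(universe_size)                                         # 36
print(consistent_concepts(k=2, u=universe_size, m=14))       # 4194304
```

Every one of these labelings agrees with the training set, which is why some preference among them is needed.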
BIAS
Any reason for preferring one of several hypotheses consistent with the training set
Examples of biases
consider only hypotheses expressible in some language
prefer the hypothesis with the most concise expression
prefer the first discovered consistent hypothesis
assume that the relative frequency of class C objects in the training set is the same as that of class C objects in the universe
…
How many examples do we need?
Theorem (Blumer, Ehrenfeucht, Haussler & Warmuth):
Consider a training set of $m$ examples classified in accordance with some theory $L$
suppose $N$ hypotheses $\{H_i, i = 1..N\}$ which include $L$
let $H_k$ be any hypothesis that agrees with $L$ on all $m$ examples
Provided $m \geq \frac{1}{\varepsilon} \ln \frac{N}{\delta}$
the probability that $H_k$ and $L$ differ by more than $\varepsilon$ is less than or equal to $\delta$
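To get a feeling for the bound, the sketch below computes the smallest number of examples that satisfies it, reading the condition as m ≥ (1/ε) ln(N/δ) with ε the tolerated error and δ the tolerated failure probability. The function name and the sample values of ε and δ are assumptions chosen for illustration only.

```python
import math


def examples_needed(n_hypotheses: int, epsilon: float, delta: float) -> int:
    """Smallest integer m with m >= (1 / epsilon) * ln(N / delta)."""
    if n_hypotheses < 1:
        raise ValueError("need at least one hypothesis")
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    return math.ceil(math.log(n_hypotheses / delta) / epsilon)


# Hypothetical setting: 2**22 candidate hypotheses, 10% error, 5% failure probability.
print(examples_needed(n_hypotheses=2 ** 22, epsilon=0.1, delta=0.05))
```

The required number of examples grows only logarithmically with the number of hypotheses N, but linearly with 1/ε.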
The deductive paradigm (explanation-based learning)
GIVEN
a domain theory: a set of rules and facts representing the knowledge in the domain
a single example or a set of examples of a concept
a goal concept: an approximate definition describing the concept to be learned
an operationality criterion: a predicate over concept definitions, specifying the form in which the learned concept must be expressed
FIND
A new concept definition satisfying the operationality criterion
A multicriteria classification of machine learning methods
[Figure: LEARNING PROCESSES arranged under four CLASSIFICATION CRITERIA]
Primary purpose: Synthetic; Analytic
Type of input: learning from examples; learning from observation; example-guided; specification-guided
Type of primary inference: Inductive; Deductive; Analogy
Role of prior knowledge: empirical induction; constructive induction; multistrategy; constructive deduction; axiomatic
Learning processes: empirical generalization, qualitative discovery, conceptual clustering, neural nets, genetic algorithms; abduction, constructive generalization; integrated empirical & explanation-based learning, multistrategy constructive learning; abstraction, deductive generalization; explanation-based learning (pure), automatic program synthesis
Overview
Motivations
What’s learning
Attribute-based learning
Statistical learning
Decision-tree learning
Learning sets of rules
First-order learning
Applications to intelligent document processing
WISDOM++
Conclusions
Statistical learning methods
Learning in pattern recognition, devoted to the automatic classification of patterns, is traditionally carried out by trainable classifiers, definable as “devices that sort data into classes, capable of improving their performances in response to information they receive as function of time”
Learning is the measure of the classifier effectiveness or performance, often associated with feedback and the capability of adapting to a changing environment
Statistical learning methods
Trainable classifiers are inductive methods devoted to classification and can be considered as a special case of empirical learning.
Using methods of statistical decision theory, they improve their performance by adjusting internal parameters rather than by structural changes.
They do not use symbolic descriptions to represent higher-level knowledge.
Trainable classifiers
Any trainable pattern classifier can then be seen as a machine with “adjustable” discriminant functions. These define the behavior of the classifier.
Note: Within this common framework, the various classification methods differ in the following main aspects:
the family of discriminant functions used and their properties;
the training method used
The basic model for a trainable pattern classifier
Key assumption: data to be classified must be transformed into n-dimensional vectors of real-number features, with finite n;
$X = (x_1, x_2, \ldots, x_n)$
In geometric terms…
Any pattern can be represented by a point in an n-dimensional Euclidean space $E^n$, called the “pattern space”
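The basic model for a trainable pattern classifier
Aim: define regions $\Omega_i$ of the space, so that
$X \in \Omega_i \iff X \in \omega_i$, $i = 1, \ldots, N$
where N is the number of classes; $\omega_i$ is the ith class
[Figure: the pattern space partitioned into decision regions, with a pattern P]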
The basic model for a trainable pattern classifier
In mathematical terms…
The surfaces separating the regions $\Omega_i$ can be implicitly defined by N scalar and single-valued functions
$g_1(X), g_2(X), \ldots, g_N(X)$
such that
$\forall X \in \Omega_i: \; g_i(X) > g_j(X)$, $i, j = 1, \ldots, N$, $i \neq j$
We call $g_i(X)$ “discriminant functions”
The basic model for a trainable pattern classifier
The basic model of a classifier
[Figure: the input pattern $X$, with components $x_i$, is fed to the discriminators, which compute the discriminants $g_i(X)$; a maximum selector returns the response]
Discriminant analysis (Fisher 1936)
Discriminant functions used: linear functions established by Statistical Decision Theory. They have the following general formulation:
$g_i(X) = p(X \mid \omega_i) \, P(\omega_i)$, $i = 1, \ldots, N$
where
$p(X \mid \omega_i)$ is the probability density function of pattern $X$ given $\omega_i$
$P(\omega_i)$ is the a priori probability of occurrence of category $\omega_i$
Discriminant analysis (Fisher 1936)
The training method is parametric. The following assumptions are made about the input population:
a. the $p(X \mid \omega_i)$ are multivariate normal probability density functions with unknown mean vectors $M_i$ and unknown covariance matrices $\Sigma_i$
$p(X \mid \omega_i) = \left( (2\pi)^{n/2} \, |\Sigma_i|^{1/2} \right)^{-1} \exp\left( -\tfrac{1}{2} (X - M_i)^T \Sigma_i^{-1} (X - M_i) \right)$
b. all the covariance matrices $\Sigma_i$ are identical
Fisher classification functions
$g_i(X) = X^T \Sigma^{-1} M_i + \log P(\omega_i) - \tfrac{1}{2} M_i^T \Sigma^{-1} M_i$, $i = 1, \ldots, N$
The training examples are used to estimate the parameters $\Sigma$ and $M_i$ and the a priori probabilities $P(\omega_i)$.
Note: the classification method used by discriminant analysis is a nearest-neighbor one: each pattern $X$ is classified according to the class with the centroid closest to $X$ in the Generalized Quadratic Distance
$D_i^2 = \log |\Sigma| + (X - M_i)^T \Sigma^{-1} (X - M_i) - 2 \log P(\omega_i)$
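A minimal sketch of how the linear classification functions can be evaluated, assuming the shared covariance matrix, the class means and the priors have already been estimated. It is restricted to two features so that the matrix inverse fits in plain Python; all function names and the numeric parameters are hypothetical, chosen only to illustrate the formula g_i(X) = Xᵀ Σ⁻¹ M_i + log P_i − ½ M_iᵀ Σ⁻¹ M_i.

```python
import math


def inverse_2x2(matrix):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def mat_vec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fisher_scores(x, means, priors, cov):
    """One linear score per class: x^T S^-1 M_i + log P_i - 1/2 M_i^T S^-1 M_i."""
    inv = inverse_2x2(cov)
    scores = []
    for mean, prior in zip(means, priors):
        w = mat_vec(inv, mean)  # S^-1 M_i
        scores.append(dot(x, w) + math.log(prior) - 0.5 * dot(mean, w))
    return scores


def classify(x, means, priors, cov):
    """Index of the class whose classification function is largest."""
    scores = fisher_scores(x, means, priors, cov)
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical, hand-picked parameters (not estimated from any real data).
means = [[0.0, 0.0], [4.0, 4.0]]
priors = [0.5, 0.5]
cov = [[1.0, 0.0], [0.0, 1.0]]

print(classify([1.0, 0.5], means, priors, cov))
print(classify([3.0, 3.5], means, priors, cov))
```

With equal priors and an identity covariance matrix, as in this toy setting, the rule reduces to picking the class with the nearest centroid in Euclidean distance.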
Overview
Motivations
What’s learning
Attribute-based learning
Statistical learning
Decision-tree learning
Learning sets of rules
First-order learning
Applications to intelligent document processing
WISDOM++
Conclusions
Decision tree learning
Decision trees classify instances by sorting them down the tree from the root to a leaf node, which provides the classification
The learning problem
GIVEN
A set $S$ of objects (instances, examples, observations)
A set $C$ of classes
A set $\mathcal{A}$ of attributes
For each attribute $A \in \mathcal{A}$, a set $\{a_1, a_2, \ldots, a_r\}$ of discrete values that $A$ can assume
FIND
a decision tree $T$ which correctly classifies the objects in $S$
Decision tree learning
HAVING TWO CLASSES (positive, negative)
the set $S$ is associated with the root
for each node, the best attribute $A^*$ is selected, according to a criterion defined by the user, to make the test in that node
for each leaf node the name of a class is defined
Decision tree learning
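[Figure: a node holding the set $S$ with counts $(p, n)$ tests attribute $A$; the branch for value $a_i$ leads to the subset $S_i$ with counts $(p_i, n_i)$, $i = 1, \ldots, r$]
$S_i = \{ e \in S \mid A(e) = a_i \}$
$S_i \cap S_j = \emptyset$ for $i \neq j$
$S = \bigcup_{i=1}^{r} S_i$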
Decision tree learning
The systems: CLS, IDR, C4, ASSISTANT, ID5, etc.
Appropriate problems:
instances are represented by attribute-value pairs
the target function has discrete output values
disjunctive concept descriptions may be required
the training data may contain errors
the training data may contain missing values
Decision tree learning
The algorithms are generally based on the divide & conquer strategy. Since it is necessary to find a “good” test, or the best classifier among the attributes, many different criteria are used to select a test
Decision tree learning
The divide & conquer strategy
handles both continuous and categorical attributes
expresses concepts in a symbolic form
is non-incremental
is efficient
works with noisy data, although it generates large trees (pruning is necessary)
requires no parameters
Decision tree learning
The heuristic criteria
based on information
    entropy gain (minimal entropy)
    gain ratio
    normalized information gain
    reduced description length
error based
    error reduction in the training set
    dissimilarity
    GINI index
significance
    $\chi^2$
    various statistics
Decision tree learning
The basic algorithm (ID3)
1. If all the examples in $S$ belong to the same class C, then the result is the label C
2. Else:
a. select the most discriminating attribute A, whose values are $a_1, a_2, \ldots, a_r$
b. partition $S$ into $S_1, S_2, \ldots, S_r$ based on the values of A
c. recursively apply the algorithm to each subset $S_i$
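The steps of the basic algorithm can be sketched in Python as follows, using the small weather training set shown earlier and information gain as the criterion for the "most discriminating" attribute. The function names and the nested-dictionary tree layout are assumptions made for this illustration; the majority-vote fallback for an exhausted attribute list is an addition that the slide does not mention.

```python
import math
from collections import Counter

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]
DATA = [  # (outlook, temperature, humidity, windy, class), rows D1..D14
    ("sunny", "hot", "high", "F", "N"), ("sunny", "hot", "high", "T", "N"),
    ("overcast", "hot", "high", "F", "P"), ("rain", "mild", "high", "F", "P"),
    ("rain", "cool", "normal", "F", "P"), ("rain", "cool", "normal", "T", "N"),
    ("overcast", "cool", "normal", "T", "P"), ("sunny", "mild", "high", "F", "N"),
    ("sunny", "cool", "normal", "F", "P"), ("rain", "mild", "normal", "F", "P"),
    ("sunny", "mild", "normal", "T", "P"), ("overcast", "mild", "high", "T", "P"),
    ("overcast", "hot", "normal", "F", "P"), ("rain", "mild", "high", "T", "N"),
]


def entropy(rows):
    counts = Counter(row[-1] for row in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())


def partition(rows, index):
    """Step 2b: split the rows according to the value of one attribute."""
    subsets = {}
    for row in rows:
        subsets.setdefault(row[index], []).append(row)
    return subsets


def gain(rows, index):
    remainder = sum(len(s) / len(rows) * entropy(s)
                    for s in partition(rows, index).values())
    return entropy(rows) - remainder


def id3(rows, indices):
    classes = {row[-1] for row in rows}
    if len(classes) == 1:                       # step 1: pure node becomes a leaf
        return classes.pop()
    if not indices:                             # assumption: majority vote fallback
        return Counter(row[-1] for row in rows).most_common(1)[0][0]
    best = max(indices, key=lambda i: gain(rows, i))        # step 2a
    rest = [i for i in indices if i != best]
    return {ATTRIBUTES[best]: {value: id3(subset, rest)     # step 2c
                               for value, subset in partition(rows, best).items()}}


tree = id3(DATA, list(range(len(ATTRIBUTES))))
print(tree)
```

On this training set the sketch selects outlook at the root, then humidity under sunny and windy under rain, and turns the overcast branch into a leaf.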
Decision tree learning
As to the forecast problem (positive and negative conditions to play tennis)
outlook = Sunny (1,2,8,9,11)
    humidity = high : N (1,2,8)
    humidity = normal : P (9,11)
outlook = Overcast (3,7,12,13) : P
outlook = Rain (4,5,6,10,14)
    windy = T : N (6,14)
    windy = F : P (4,5,10)
Decision tree learning
Example for the decision trees (ID3)
A similar training set, but with temperature in °F and humidity in %
No    Outlook    °F   humid.%   Windy   Class
D1    rain       71   96        T       N
D2    rain       65   70        T       N
D3    overcast   72   90        T       P
D4    overcast   83   78        F       P
D5    rain       75   80        F       P
D6    overcast   64   65        T       P
D7    sunny      75   70        T       P
D8    sunny      80   90        T       N
D9    sunny      85   85        F       N
D10   overcast   81   75        F       P
D11   rain       68   80        F       P
D12   rain       70   96        F       P
D13   sunny      72   95        F       N
D14   sunny      69   70        F       P
DECISION TREE
Outlook = sunny
    humidity > 77.5 : N
    humidity < 77.5 : P
Outlook = overcast : P
Outlook = rain
    windy = True : N
    windy = False : P
Decision tree learning
ANOTHER DECISION TREE
temperature < 69.5 :
    Outlook = sunny : P
    Outlook = overcast : P
    Outlook = rain
        windy = True : N
        windy = False : P
temperature > 69.5 :
    temperature < 79.5 :
        Outlook = sunny
            windy = True : P
            windy = False : N
        Outlook = overcast : P
        Outlook = rain :
            humidity > 80.5 :
                windy = True : P
                windy = False : N
            humidity < 80.5 : P
    temperature > 79.5 :
        windy = True : N
        windy = False :
            humidity > 80.5 :
                outlook = sunny : N
                outlook = overcast : P
                outlook = rain : P
            humidity < 80.5 : P
Decision tree learning
Which attribute is the best classifier?
The entropy can be used to measure the “impurity” of an arbitrary collection of examples.
Given a collection $S$ containing positive and negative examples of a certain concept, the entropy of $S$ is
$\mathrm{Entropy}(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$
where $p_+$ is the proportion of positive examples in $S$ and $p_-$ is the proportion of negative examples in $S$
Decision tree learning
[Figure: the entropy function relative to a boolean classification, as the proportion $p_+$ varies between 0 and 1 (Mitchell, 1997); both $p_+$ and $\mathrm{Entropy}(S)$ range from 0.0 to 1.0]
Decision tree learning
Example
Consider the usual weather forecast problem (14 examples, 9 positive, 5 negative)
the entropy of $S$ is
$\mathrm{Entropy}([9+, 5-]) = -(9/14) \log_2 (9/14) - (5/14) \log_2 (5/14) = 0.940$
note: if all members are positive, $\mathrm{Entropy}(S) = 0$
if $S$ contains an equal number of positive and negative examples, $\mathrm{Entropy}(S) = 1$
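The worked value and the two boundary cases can be reproduced with a short Python function. This is a sketch for the two-class case only; the function name and the convention that 0 · log2(0) counts as 0 are assumptions stated here rather than taken from the slides.

```python
import math


def entropy(positives: int, negatives: int) -> float:
    """Entropy, in bits, of a collection with the given class counts."""
    total = positives + negatives
    if total == 0:
        raise ValueError("the collection must not be empty")
    result = 0.0
    for count in (positives, negatives):
        if count:  # 0 * log2(0) is taken as 0
            p = count / total
            result -= p * math.log2(p)
    return result


print(round(entropy(9, 5), 3))   # the [9+, 5-] collection
print(entropy(14, 0))            # all members positive
print(entropy(7, 7))             # equal numbers of positives and negatives
```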
Decision tree learning
More generally, if the target concept can take on $c$ different values, then the entropy of $S$ as to the $c$-wise classification is
$\mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$
where $p_i$ is the proportion of $S$ belonging to class $i$
Note: the logarithm is still base 2 because entropy is a measure of expected encoding length measured in bits
Decision tree learning
Information gain
the information gain of an attribute A relative to a collection of examples $S$ is defined as
$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{a \in A} \frac{|S_a|}{|S|} \mathrm{Entropy}(S_a)$
where the sum ranges over the values $a$ of A and $S_a$ is the subset of $S$ in which the attribute A takes the value $a$
$\mathrm{Gain}(S, A)$ is therefore the expected reduction in entropy caused by knowing the value of attribute A
Decision tree learning
Example
SUPPOSE $S$ is a collection of training example days described by attributes including wind (weak, strong)
ASSUME 14 examples [9+, 5-]. Of these, suppose 6 of the positive and 2 of the negative have wind = weak and the remainder have wind = strong
$S$ = [9+, 5-]
$S_{weak}$ = [6+, 2-]
$S_{strong}$ = [3+, 3-]
Decision tree learning
$\mathrm{Gain}(S, \mathrm{wind}) = \mathrm{Entropy}(S) - \sum_{v \in \{weak, strong\}} \frac{|S_v|}{|S|} \mathrm{Entropy}(S_v)$
$= \mathrm{Entropy}(S) - (8/14) \, \mathrm{Entropy}(S_{weak}) - (6/14) \, \mathrm{Entropy}(S_{strong})$
$= 0.940 - (8/14) \, 0.811 - (6/14) \, 1.00 = 0.048$
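The gain for the wind attribute can be recomputed from the class counts alone. The sketch below assumes each collection is summarised as a tuple of per-class counts, for example (9, 5) for [9+, 5-]; the function names are illustrative, not taken from the tutorial.

```python
import math


def entropy(counts):
    """Entropy, in bits, of a collection summarised by its per-class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of its subsets."""
    total = sum(parent)
    remainder = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - remainder


# S = [9+, 5-], S_weak = [6+, 2-], S_strong = [3+, 3-]
gain_wind = information_gain((9, 5), [(6, 2), (3, 3)])
print(round(gain_wind, 3))
```

The same function applies unchanged to attributes with more than two values, since `children` may hold any number of subsets.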
Decision tree learning
outlook (D1, D2, …, D14) [9+, 5-]
    Sunny (D1,D2,D8,D9,D11) [2+, 3-] : ?
    Overcast (D3,D7,D12,D13) [4+, 0-] : P
    Rain (D4,D5,D6,D10,D14) [3+, 2-] : ?
Which attribute should be tested here?
Decision tree learning
$S_{sunny}$ = {D1, D2, D8, D9, D11}
$\mathrm{Gain}(S_{sunny}, \mathrm{humidity}) = .970 - (3/5) \, 0.0 - (2/5) \, 0.0 = .970$
$\mathrm{Gain}(S_{sunny}, \mathrm{temperature}) = .970 - (2/5) \, 0.0 - (2/5) \, 1.0 - (1/5) \, 0.0 = .570$
$\mathrm{Gain}(S_{sunny}, \mathrm{wind}) = .970 - (2/5) \, 1.0 - (3/5) \, .918 = .019$
In the partially learned decision tree resulting from the first step of ID3, the training examples are sorted to the corresponding descendant nodes. The Overcast descendant has only positive examples and therefore becomes a leaf node with classification Yes. The other two nodes will be further expanded, by selecting the attribute with highest information gain relative to the new subsets of examples
Decision tree learning
[Figure: partial decision trees of increasing size, testing the attributes A1, A2, A3, A4 on positive (+) and negative (−) examples]
Hypothesis space searched by ID3. ID3 searches through the space of possible decision trees from simplest to increasingly complex, guided by the information gain heuristic.
Decision tree learning
Overfitting in Decision Trees
outlook = Sunny
    humidity = high : no
    humidity = normal : yes
outlook = Overcast : yes
outlook = Rain
    wind = strong : no
    wind = weak : yes
Consider adding noisy training example #15:
Sunny, Hot, Normal, Strong, PlayTennis = N
What effect on earlier tree?
Decision tree learning
Avoiding overfitting
How can we avoid overfitting?
stop growing when the data split is not statistically significant
grow the full tree, then post-prune
How to select the “best” tree:
measure performance over training data
measure performance over a separate validation data set
MDL: minimize size(tree) + size(misclassifications(tree))
Decision tree learning
Reduced-Error Pruning
Split data into training and validation set
Do until further pruning is harmful:
1. Evaluate impact on validation set of pruning each possible node (plus those below it)
2. Greedily remove the one that most improves validation set accuracy
produces smallest version of most accurate subtree
what if data is limited?
Decision tree learning
When observations are presented in a stream and when responses to these observations are required in a timely manner we need
Incrementality
The incremental versions of these strategies (Fisher, Schlimmer, Utgoff) assume that
the instances are presented one at a time
statistics are maintained relative to different tests for each node of the tree
when it is decided to change a test it is possible
to prune the subtrees or
to modify the subtrees
Decision tree learning
Other methods inspired by decision trees allow the leaf to be replaced with
the probability distribution of the class
a linear function (perceptron, regression models, etc.)
From Decision Trees to Decision Rules
outlook = Sunny
    humidity = high : no
    humidity = normal : yes
outlook = Overcast : yes
outlook = Rain
    wind = strong : no
    wind = weak : yes
IF Outlook = Sunny AND Humidity = High THEN Class = no
IF Outlook = Sunny AND Humidity = Normal THEN Class = yes
IF Outlook = Overcast THEN Class = yes
...
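The translation from a tree to rules amounts to emitting one rule per root-to-leaf path. A minimal Python sketch follows, assuming the tree is stored as nested dictionaries of the form {attribute: {value: subtree_or_label}}; this representation and the function names are assumptions made for illustration.

```python
def tree_to_rules(tree, conditions=()):
    """Yield one (conditions, label) pair for every root-to-leaf path."""
    if not isinstance(tree, dict):           # a leaf holds the class label
        yield list(conditions), tree
        return
    (attribute, branches), = tree.items()    # each inner node tests one attribute
    for value, subtree in branches.items():
        yield from tree_to_rules(subtree, conditions + ((attribute, value),))


def format_rule(conditions, label):
    tests = " AND ".join(f"{attribute} = {value}" for attribute, value in conditions)
    return f"IF {tests} THEN Class = {label}"


TREE = {"Outlook": {
    "Sunny": {"Humidity": {"High": "no", "Normal": "yes"}},
    "Overcast": "yes",
    "Rain": {"Wind": {"Strong": "no", "Weak": "yes"}},
}}

rules = [format_rule(conditions, label) for conditions, label in tree_to_rules(TREE)]
for rule in rules:
    print(rule)
```

Each leaf produces exactly one rule, so the five leaves of this tree give five rules, including the two for the Rain branch.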
Overview
Motivations
What’s learning
Attribute-based learning
Statistical learning
Decision-tree learning
Learning sets of rules
First-order learning
Applications to intelligent document processing
WISDOM++
Conclusions
Learning Rules Directly
A concept description is expressed in a logic-style form as a set of decision rules (if-then rules)
decision rules are one of the most expressive and comprehensible knowledge representations
Two classes of methods:
Learning attributional rules (AQ family, CN2, etc.)
Learning relational descriptions (INDUCE, FOIL, PROGOL, etc.)
Concept Learning
Concept [from Latin concipere = to seize (a thought)]
Two opposing views:
EXTENSIONAL: a specified set of physical or abstract objects
INTENSIONAL: a set of (necessary and) sufficient conditions (Wittgenstein, 1953).
Induce a set of sufficient conditions from a sample of positive and negative examples of the concept
A Concept Learning Task
Sky     AirTemp   Humid    Wind     Water   Forecst   EnjoySport
Sunny   Warm      Normal   Strong   Warm    Same      Yes
Sunny   Warm      High     Strong   Warm    Same      Yes
Rainy   Cold      High     Strong   Warm    Change    No
Sunny   Warm      High     Strong   Cool    Change    Yes
Positive and negative training examples (instances) for the target concept EnjoySport
Task: to learn to predict the value of EnjoySport for an arbitrary day
ICDAR’99  Tutorial
Representing Hypotheses
A hypothesis h is a conjunction of constraints on the instance attributes (conjunctive concept)
Each constraint can be:
a specific value (e.g., “Water=Warm”)
don’t care (e.g., “Water=?”)
no value allowed (e.g., “Water=Ø”)
Sky  Temp  Humid  Wind  Water  Forecast
?    Cold  High   ?     ?      ?
h = ⟨Sunny, Warm, ?, Strong, ?, ?⟩
can be rewritten as:
IF Sky=Sunny ∧ AirTemp=Warm ∧ Wind=Strong
THEN EnjoySport=Yes
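A conjunctive hypothesis of this kind can be represented as a tuple of constraints, one per attribute. The sketch below is a minimal illustration; the names `ANY`, `NONE` and `satisfies` are assumptions introduced for the example.

```python
ATTRIBUTES = ("Sky", "AirTemp", "Humidity", "Wind", "Water", "Forecast")
ANY = "?"     # don't care
NONE = "Ø"    # no value allowed


def satisfies(instance, hypothesis):
    """True when every constraint of the hypothesis accepts the instance.

    A "Ø" constraint never equals an attribute value, so a hypothesis that
    contains it rejects every instance.
    """
    return all(c == ANY or c == v for c, v in zip(hypothesis, instance))


h = ("Sunny", "Warm", ANY, "Strong", ANY, ANY)
x_pos = ("Sunny", "Warm", "High", "Strong", "Cool", "Change")
x_neg = ("Rainy", "Cold", "High", "Strong", "Warm", "Change")

print(satisfies(x_pos, h))   # True
print(satisfies(x_neg, h))   # False
```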
ICDAR’99  Tutorial
A Formalization
Given
Instances X: possible days, each described by the attributes Sky, Temp, Humidity, Wind, Water, Forecast
Training examples D: ⟨x_i, c(x_i)⟩
Target concept c: EnjoySport : X → {0,1}
c(x)=1 if EnjoySport=yes
c(x)=0 if EnjoySport=no
Hypotheses H: expressed as conjunctions of constraints on the attributes
ICDAR’99  Tutorial
A Formalization
Determine
A hypothesis h in H such that h(x) = c(x) for all x in X
The learning task is to determine a hypothesis h identical to the target concept c over the entire set of instances X.
ICDAR’99  Tutorial
What information is available?
Training set D:
Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
Sunny  Warm     Normal    Strong  Warm   Same      Yes
Sunny  Warm     High      Strong  Warm   Same      Yes
Rainy  Cold     High      Strong  Warm   Change    No
Sunny  Warm     High      Strong  Cool   Change    Yes
• Typically, D ⊂ X.
• If Sky has three possible values, and
• AirTemp, Humidity, Wind, Water, Forecast each have two possible values
• Then X has 3·2·2·2·2·2 = 96 distinct instances.
• |D| = 4
ICDAR’99  Tutorial
The inductive learning hypothesis
Any hypothesis found to approximate
the target function well over a
sufficiently large set of training
examples will also approximate the
target function well over other
unobserved examples.
ICDAR’99  Tutorial
Concept Learning as Search
Concept learning can be viewed as the task of searching through a large space of hypotheses H.
The goal of the search is to find the hypothesis that best fits the training examples.
The space of hypotheses is implicitly defined by the hypothesis representation.
ICDAR’99  Tutorial
An example
A hypothesis h is a conjunction of constraints on the instance attributes.
Each constraint can be:
a specific value
don’t care, ?
no value allowed, Ø
H has 5·4·4·4·4·4 = 5120 syntactically distinct hypotheses.
ICDAR’99  Tutorial
Semantically distinct hypotheses
     Sky    Temp  Humid  Wind  Water  Forecast
h    ?      Cold  Ø      ?     ?      ?
h’   Sunny  Warm  High   Ø     ?      ?
Both hypotheses represent the empty set of instances: every instance is classified as negative.
Semantically distinct hypotheses: 1+(4·3·3·3·3·3) = 973
Typically the search space is much larger, sometimes infinite.
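The three counts quoted for the EnjoySport task (96 instances, 5120 syntactically distinct hypotheses, 973 semantically distinct hypotheses) can be reproduced with a short calculation. The variable names below are assumptions made for this sketch.

```python
from math import prod

# Number of values per attribute, as stated on the slides:
# Sky has three values, the other five attributes have two each.
values_per_attribute = [3, 2, 2, 2, 2, 2]

# |X|: one value chosen for every attribute.
instances = prod(values_per_attribute)

# Syntactic hypotheses: every slot also admits "?" and "Ø".
syntactic = prod(n + 2 for n in values_per_attribute)

# Semantic hypotheses: every hypothesis containing "Ø" denotes the empty
# set, so they all collapse into a single hypothesis.
semantic = 1 + prod(n + 1 for n in values_per_attribute)

print(instances, syntactic, semantic)   # 96 5120 973
```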
ICDAR’99  Tutorial
Efficient search: how ?
Naive search strategy: generate-and-test all hypotheses in H.
Impossible for an infinite (or very large) search space.
The search can rely on the structure defined by a general-to-specific ordering of hypotheses.
      Sky    Temp  Humid  Wind    Water  Forecast
h_1   Sunny  ?     ?      Strong  ?      ?
h_2   Sunny  ?     ?      ?       ?      ?
h_2 is more general than h_1.
ICDAR’99  Tutorial
General-to-specific ordering
Given two hypotheses h_k and h_j
h_j is more general than or equal to h_k, written h_j ≥_g h_k, if and only if any instance that satisfies h_k also satisfies h_j.
h_j is strictly more general than h_k, written h_j >_g h_k, if and only if h_j ≥_g h_k and h_k ≱_g h_j.
The inverse relation more_specific_than can be defined as well.
[Figure: instances X and hypotheses H, with H ordered from Specific to General by ≥_g]
h_1 = ⟨Sunny, ?, ?, Strong, ?, ?⟩
h_2 = ⟨Sunny, ?, ?, ?, ?, ?⟩
h_3 = ⟨Sunny, ?, ?, ?, Cool, ?⟩
x_1 = ⟨Sunny, Warm, High, Strong, Cool, Same⟩
x_2 = ⟨Sunny, Warm, High, Light, Warm, Same⟩
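For conjunctive hypotheses over nominal attributes, the more-general-than-or-equal relation can be checked constraint by constraint. The sketch below assumes that every attribute has at least two values (so that a specific value is never equivalent to "?"); the function names are illustrative.

```python
ANY, NONE = "?", "Ø"


def more_general_or_equal(hj, hk):
    """hj >=_g hk: every instance that satisfies hk also satisfies hj."""
    if NONE in hk:          # hk covers no instance, so any hj is >=_g hk
        return True
    return all(cj == ANY or cj == ck for cj, ck in zip(hj, hk))


def strictly_more_general(hj, hk):
    """hj >_g hk: hj >=_g hk holds but hk >=_g hj does not."""
    return more_general_or_equal(hj, hk) and not more_general_or_equal(hk, hj)


h1 = ("Sunny", ANY, ANY, "Strong", ANY, ANY)
h2 = ("Sunny", ANY, ANY, ANY, ANY, ANY)
h3 = ("Sunny", ANY, ANY, ANY, "Cool", ANY)

print(strictly_more_general(h2, h1))   # True
print(strictly_more_general(h2, h3))   # True
print(more_general_or_equal(h1, h3))   # False: h1 and h3 are not comparable
```

The relation is only a partial order: h1 and h3 are both more specific than h2, yet neither is more general than the other.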
ICDAR’99  Tutorial
Terminology
A hypothesis h covers a positive example if it correctly classifies the example as positive: h(x)=c(x)=1
An example ⟨x, c(x)⟩ satisfies hypothesis h when h(x)=1, regardless of whether x is a positive or negative example of the target concept.
A hypothesis h is consistent with an example ⟨x, c(x)⟩ when h(x)=c(x).
A hypothesis h is consistent with a training set D if it is consistent with each example x ∈ D
ICDAR’99  Tutorial
Taking advantage of the general-to-specific ordering
One way is to begin with the most specific possible hypothesis in H, then generalize this hypothesis each time it fails to cover an observed positive training example (bottom-up search)
Find a maximally specific hypothesis
ICDAR’99  Tutorial
Example
h ← ⟨Ø, Ø, Ø, Ø, Ø, Ø⟩
Consider the first example in D: ⟨Sunny, Warm, Normal, Strong, Warm, Same⟩, +
The maximally specific hypothesis is h ← ⟨Sunny, Warm, Normal, Strong, Warm, Same⟩
Consider the second example in D: ⟨Sunny, Warm, High, Strong, Warm, Same⟩, +
The maximally specific hypothesis is h ← ⟨Sunny, Warm, ?, Strong, Warm, Same⟩
The third example is ignored since it is negative: ⟨Rainy, Cold, High, Strong, Warm, Change⟩, −
Consider the fourth example in D: ⟨Sunny, Warm, High, Strong, Cool, Change⟩, +
The maximally specific hypothesis is h ← ⟨Sunny, Warm, ?, Strong, ?, ?⟩
[Figure: instances x_1 ... x_4 in X and the corresponding hypotheses h_0, h_1, h_2 = h_3, h_4 in H, moving from Specific to General]
ICDAR’99  Tutorial
Find-S Algorithm
1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x
   2.1 For each attribute constraint a_i in h
       If the constraint a_i is satisfied by x
       Then do nothing
       Else replace a_i in h by the next more general constraint that is satisfied by x
3. Output hypothesis h
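The steps of Find-S can be sketched in Python for conjunctive hypotheses over nominal attributes. This is a minimal sketch under that assumption: the "next more general constraint" is taken to be the observed value when the current constraint is Ø, and "?" when it is a different specific value. The function name `find_s` and the data layout are illustrative.

```python
ANY, NONE = "?", "Ø"


def find_s(examples, n_attributes):
    """Return the maximally specific conjunctive hypothesis covering the positives."""
    h = [NONE] * n_attributes               # step 1: most specific hypothesis
    for x, positive in examples:
        if not positive:                    # negative examples are ignored
            continue
        for i, value in enumerate(x):       # step 2.1: revise each constraint
            if h[i] == NONE:
                h[i] = value                # Ø generalizes to the observed value
            elif h[i] != ANY and h[i] != value:
                h[i] = ANY                  # conflicting values generalize to "?"
    return tuple(h)                         # step 3


D = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]

result = find_s(D, 6)
print(result)   # ('Sunny', 'Warm', '?', 'Strong', '?', '?')
```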
ICDAR’99  Tutorial
No revision in case of negative example: Why?
Basic assumptions:
the target concept c is in H
no errors in training data
h is the most specific hypothesis in H consistent with the positive examples seen so far, therefore c ≥_g h
but c will never be satisfied by a negative example, thus neither will h.
ICDAR’99  Tutorial
Limitations of Find-S
Can’t tell whether the learner converged to the correct target concept
Has it found the only hypothesis in H consistent with the data, or are there many other consistent hypotheses as well?
Picks a maximally specific hypothesis.
Why should we prefer this hypothesis over, say, the most general?
ICDAR’99  Tutorial
Limitations of Find-S (cont.)
Can’t tell when training data are inconsistent
Inconsistency in training examples can mislead Find-S, since it ignores negative examples. Is it possible to detect such inconsistency?
There might be several maximally specific consistent hypotheses.
Find-S would have to backtrack on its choices in order to explore a different branch of the partial ordering than the branch it has selected.
ICDAR’99  Tutorial
Version Space
Return a version space instead of a single hypothesis.
The version space, VS_{H,D}, with respect to hypothesis space H and training examples D, is the subset of hypotheses from H consistent with the training examples in D.
VS_{H,D} ≡ {h ∈ H | Consistent(h, D)}
ICDAR’99  Tutorial
The List-Then-Eliminate algorithm
1. VS_{H,D} ← a list containing every hypothesis in H
2. For each training example ⟨x, c(x)⟩, remove from VS_{H,D} any hypothesis h for which h(x) ≠ c(x)
3. Output the list of hypotheses in VS_{H,D}
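Because the EnjoySport hypothesis space is small (5120 syntactically distinct hypotheses), List-Then-Eliminate can actually be run by brute force. The sketch below assumes the attribute domains suggested by the examples in this tutorial (Sky with three values, two values for every other attribute); `DOMAINS`, `predict` and `list_then_eliminate` are names chosen for the illustration.

```python
from itertools import product

ANY, NONE = "?", "Ø"
DOMAINS = [   # assumed attribute domains for the EnjoySport task
    ("Sunny", "Cloudy", "Rainy"), ("Warm", "Cold"), ("Normal", "High"),
    ("Strong", "Light"), ("Warm", "Cool"), ("Same", "Change"),
]


def predict(h, x):
    """h(x): True when x satisfies every constraint of h."""
    return all(c == ANY or c == v for c, v in zip(h, x))


def list_then_eliminate(examples):
    # Step 1: list every syntactically distinct hypothesis in H.
    version_space = list(product(*[d + (ANY, NONE) for d in DOMAINS]))
    # Step 2: remove every hypothesis that misclassifies some example.
    for x, label in examples:
        version_space = [h for h in version_space if predict(h, x) == label]
    # Step 3: what is left is the version space.
    return version_space


D = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]

vs = list_then_eliminate(D)
for h in vs:
    print(h)
```

On the four training examples, six hypotheses survive. The exhaustive enumeration in step 1 is what makes the approach unrealistic for larger hypothesis spaces.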
ICDAR’99  Tutorial
Pros and cons
Guaranteed to output all hypotheses
consistent with the training data
Can detect inconsistencies in the
training data
Exhaustive enumeration of all
hypotheses:
possible only for finite spaces H
unrealistic for large spaces H
ICDAR’99  Tutorial
Version Space: A compact representation
The version space can be represented by its most general and least general members (version space representation theorem)
[Figure: the version space VS_{H,D} inside H, delimited by the boundary sets G and S; the vertical axis runs from General (most general hypothesis) to Specific (training instances)]
ICDAR’99  Tutorial
General boundary
The general boundary G, with respect to hypothesis space H and training data D, is the set of maximally general members of H consistent with D
G ≡ { g ∈ H | Consistent(g,D) ∧ (¬∃g’ ∈ H) [(g’ >_g g) ∧ Consistent(g’,D)] }
ICDAR’99  Tutorial
Specific boundary
The specific boundary S, with respect to hypothesis space H and training data D, is the set of minimally general (i.e., maximally specific) members of H consistent with D
S ≡ { s ∈ H | Consistent(s,D) ∧ (¬∃s’ ∈ H) [(s >_g s’) ∧ Consistent(s’,D)] }
A Version Space
G: { ⟨Sunny, ?, ?, ?, ?, ?⟩, ⟨?, Warm, ?, ?, ?, ?⟩ }
   ⟨Sunny, ?, ?, Strong, ?, ?⟩   ⟨Sunny, Warm, ?, ?, ?, ?⟩   ⟨?, Warm, ?, Strong, ?, ?⟩
S: { ⟨Sunny, Warm, ?, Strong, ?, ?⟩ }
Candidate-Elimination algorithm
G ← maximally general hypotheses in H
S ← maximally specific hypotheses in H
For each training example d, do
if d is a positive example
  Remove from G any hypothesis inconsistent with d
  UPDATE-S routine: For each hypothesis s in S that is not consistent with d
  • Remove s from S
  • Add to S all minimal generalizations h of s such that
    – h is consistent with d, and
    – some member of G is more general than h
  • Remove from S any hypothesis that is more general than another hypothesis in S
if d is a negative example
  Remove from S any hypothesis inconsistent with d
  UPDATE-G routine: For each hypothesis g in G that is not consistent with d
  • Remove g from G
  • Add to G all minimal specializations h of g such that
    – h is consistent with d, and
    – some member of S is more specific than h
  • Remove from G any hypothesis that is less general than another hypothesis in G
Training set D:
Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
Sunny  Warm     Normal    Strong  Warm   Same      Yes
Sunny  Warm     High      Strong  Warm   Same      Yes
Rainy  Cold     High      Strong  Warm   Change    No
Sunny  Warm     High      Strong  Cool   Change    Yes
Resulting version space:
G: { ⟨Sunny, ?, ?, ?, ?, ?⟩, ⟨?, Warm, ?, ?, ?, ?⟩ }
   ⟨Sunny, ?, ?, Strong, ?, ?⟩   ⟨Sunny, Warm, ?, ?, ?, ?⟩   ⟨?, Warm, ?, Strong, ?, ?⟩
S: { ⟨Sunny, Warm, ?, Strong, ?, ?⟩ }
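The Candidate-Elimination algorithm can be sketched compactly for conjunctive hypotheses over nominal attributes. The sketch relies on two properties of this particular hypothesis language: a hypothesis has a unique minimal generalization that covers a given instance, and its minimal specializations are obtained by replacing one "?" with a value different from the one in the negative instance. The attribute domains and all function names are assumptions made for the illustration.

```python
ANY, NONE = "?", "Ø"
DOMAINS = [   # assumed attribute domains for the EnjoySport task
    ("Sunny", "Cloudy", "Rainy"), ("Warm", "Cold"), ("Normal", "High"),
    ("Strong", "Light"), ("Warm", "Cool"), ("Same", "Change"),
]


def covers(h, x):
    return all(c == ANY or c == v for c, v in zip(h, x))


def more_general_or_equal(hj, hk):
    if NONE in hk:
        return True
    return all(cj == ANY or cj == ck for cj, ck in zip(hj, hk))


def min_generalization(s, x):
    """The unique minimal generalization of s that covers x."""
    return tuple(v if c == NONE else (c if c == v else ANY)
                 for c, v in zip(s, x))


def min_specializations(g, x):
    """All minimal specializations of g that no longer cover x."""
    return [g[:i] + (value,) + g[i + 1:]
            for i, c in enumerate(g) if c == ANY
            for value in DOMAINS[i] if value != x[i]]


def candidate_elimination(examples):
    n = len(DOMAINS)
    G = {(ANY,) * n}                 # maximally general hypothesis
    S = {(NONE,) * n}                # maximally specific hypothesis
    for x, positive in examples:
        if positive:
            G = {g for g in G if covers(g, x)}
            S = {min_generalization(s, x) for s in S}
            S = {s for s in S if any(more_general_or_equal(g, s) for g in G)}
        else:
            S = {s for s in S if not covers(s, x)}
            new_G = set()
            for g in G:
                if not covers(g, x):
                    new_G.add(g)     # already consistent with the negative
                    continue
                new_G.update(h for h in min_specializations(g, x)
                             if any(more_general_or_equal(h, s) for s in S))
            # keep only the maximally general members
            G = {g for g in new_G
                 if not any(h != g and more_general_or_equal(h, g) for h in new_G)}
    return S, G


D = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]
S, G = candidate_elimination(D)
print("S =", S)
print("G =", G)
```

On the four training examples the sketch returns the S and G boundary sets shown above; the three intermediate hypotheses of the version space are implied by the two boundaries and are never stored.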
ICDAR’99  Tutorial
What does the Candidate-Elimination algorithm converge to?
The algorithm will converge towards the target concept provided that
there are no errors in the training examples
there is some hypothesis in H that correctly describes the target concept
The target concept is learned when the S and G boundary sets converge to a single, identical hypothesis.
ICDAR’99  Tutorial
Empty Version Space
The algorithm outputs an empty
version space when:
training data contain errors (a
positive example is presented as
negative).
the target concept cannot be
described in the hypothesis
representation.
ICDAR’99  Tutorial
Other characteristics
The Candidate-Elimination algorithm performs a bidirectional search.
G and S can grow exponentially in the number of training examples.
How can partially learned concepts be used?
A new instance is classified by every hypothesis in the version space:
G: { ⟨Sunny, ?, ?, ?, ?, ?⟩, ⟨?, Warm, ?, ?, ?, ?⟩ }
   ⟨Sunny, ?, ?, Strong, ?, ?⟩   ⟨Sunny, Warm, ?, ?, ?, ?⟩   ⟨?, Warm, ?, Strong, ?, ?⟩
S: { ⟨Sunny, Warm, ?, Strong, ?, ?⟩ }
⟨Sunny, Warm, Normal, Strong, Cool, Change⟩: POSITIVE, unanimous agreement
⟨Rainy, Cold, Normal, Light, Warm, Same⟩: NEGATIVE, unanimous agreement
⟨Sunny, Warm, Normal, Light, Warm, Same⟩: half positive, half negative
⟨Sunny, Cold, Normal, Strong, Warm, Same⟩: two positive, four negative
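The votes reported for the four new instances can be checked by letting each of the six hypotheses in the version space classify the instance. The sketch below is illustrative; `VERSION_SPACE` and `vote` are names introduced for the example.

```python
ANY = "?"

VERSION_SPACE = [
    ("Sunny", ANY, ANY, ANY, ANY, ANY),
    (ANY, "Warm", ANY, ANY, ANY, ANY),
    ("Sunny", ANY, ANY, "Strong", ANY, ANY),
    ("Sunny", "Warm", ANY, ANY, ANY, ANY),
    (ANY, "Warm", ANY, "Strong", ANY, ANY),
    ("Sunny", "Warm", ANY, "Strong", ANY, ANY),
]


def vote(x):
    """Return (positive votes, negative votes) over the version space."""
    positive = sum(all(c == ANY or c == v for c, v in zip(h, x))
                   for h in VERSION_SPACE)
    return positive, len(VERSION_SPACE) - positive


print(vote(("Sunny", "Warm", "Normal", "Strong", "Cool", "Change")))  # (6, 0)
print(vote(("Rainy", "Cold", "Normal", "Light", "Warm", "Same")))     # (0, 6)
print(vote(("Sunny", "Warm", "Normal", "Light", "Warm", "Same")))     # (3, 3)
print(vote(("Sunny", "Cold", "Normal", "Strong", "Warm", "Same")))    # (2, 4)
```

An instance can be classified with full confidence only when the vote is unanimous; in practice it suffices to test the members of S (for a positive answer) or of G (for a negative answer).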
ICDAR’99  Tutorial
An interactive learning algorithm
The learning algorithm is allowed to choose the next instance (query) and receive the correct classification from an external oracle (e.g., nature or a teacher).
If the algorithm always chooses a query that is satisfied by only half of the hypotheses in VS, then the correct target concept can be found in ⌈log2 |VS|⌉ steps
Generally, it’s impossible to adopt this optimal search strategy.
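As a small worked example of the bound, each ideal query halves the version space, so the number of queries is the base-2 logarithm of its size, rounded up. The helper name `queries_needed` is an assumption for this sketch.

```python
from math import ceil, log2


def queries_needed(version_space_size):
    """Number of halving queries needed to isolate a single hypothesis."""
    return ceil(log2(version_space_size))


# A version space with six hypotheses needs three ideal queries: 6 -> 3 -> 2 -> 1.
print(queries_needed(6))   # 3
```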
ICDAR’99  Tutorial
Dealing with noisy training
instances
Relax the condition that the concept descriptions be consistent with all of the training instances.
Solution for a bounded, predetermined number of misclassified examples: maintain several G and S sets of varying consistency.
[Figure: boundary sets G_0, G_1, G_2 and S_0, S_1, S_2 within H]
ICDAR’99  Tutorial
Dealing with noisy training
instances (cont.)
The set S_i is consistent with all but i of the positive training examples
The set G_i is consistent with all but i of the negative training examples
When G_0 crosses S_0 the algorithm can conclude that no concept in the rule space is consistent with all of the training instances.
The algorithm can recover and try to find a concept that is consistent with all but one of the training examples.
What if the concept is not contained in the hypothesis space?
Cannot represent disjunctive target concepts, such as “Sky=Sunny or Sky=Cloudy”
Sky     AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
Sunny   Warm     Normal    Strong  Cool   Change    Yes
Cloudy  Warm     Normal    Strong  Cool   Change    Yes
Rainy   Warm     Normal    Strong  Cool   Change    No
A more expressive hypothesis space is required.
ICDAR’99  Tutorial
A hypothesis space that includes
every possible hypothesis?
Choose H that expresses every teachable concept (i.e., H is the power set of X).
Consider H’ = disjunctions, conjunctions, negations over previous H.
⟨Sunny, Cold, Normal, ?, ?, ?⟩ ∨ ¬⟨?, ?, ?, ?, ?, Change⟩
What are S and G in this case?
S is the disjunction of the positive examples presented so far
G is the negation of the disjunction of the negative examples seen so far
Totally useless. No generalization at all!
ICDAR’99  Tutorial
A fundamental property of
inductive inference
“A learning algorithm that makes no a priori assumptions regarding the identity of the target concept has no rational basis for classifying any unseen instance.”
Prior assumption = inductive bias
Do not confuse with the estimation bias commonly used in statistics.
ICDAR’99  Tutorial
Linear Regression
[Figure: plot of Y against X]
The underlying prior assumption (i.e., inductive bias) is that the relationship between X and Y is linear.
A formal definition of inductive bias
Consider
concept learning algorithm L
instances X
target concept c
training examples D_c = {⟨x, c(x)⟩}
let L(x_i, D_c) denote the classification assigned to the instance x_i by L after training on data D_c.
The inductive bias of L is any minimal set of assertions B such that for any target concept c and corresponding training examples D_c
(∀x_i ∈ X) [(B ∧ D_c ∧ x_i) ⊢ L(x_i, D_c)]
Modeling inductive systems by equivalent deductive systems
Inductive System: Training examples + New instance → Learning Algorithm → Classification of new instance, or “don’t know”
Deductive System: Training examples + New instance + Inductive Bias → Theorem Prover → Classification of new instance, or “don’t know”
The inductive bias that is explicitly input to the theorem prover is only implicit in the code of the learning algorithm.
ICDAR’99  Tutorial
Bias of the Candidate-Elimination algorithm
The target concept c is contained in the given hypothesis space H.
From c ∈ H it deductively follows that c ∈ VS_{H,D}.
L outputs the classification L(x_i, D_c) if and only if every hypothesis in VS_{H,D} also produces this classification, including the hypothesis c ∈ VS_{H,D} (inductive bias). Therefore c(x_i) = L(x_i, D_c).
ICDAR’99  Tutorial
Comparing the inductive bias of
learning algorithms
The inductive bias is a nonprocedural means of characterizing a learning algorithm’s policy for generalizing beyond the observed data.
Rote learning: store examples; classify x iff it matches a previously observed example
No inductive bias for rote learning.
ICDAR’99  Tutorial
Comparing the inductive bias of
learning algorithms
Inductive bias of the Candidate-Elimination algorithm: the target concept can be represented in its hypothesis space.
Inductive bias of Find-S: the target concept can be represented in its hypothesis space + all instances are negative instances unless the opposite is entailed by its other knowledge (a kind of default or nonmonotonic reasoning)
ICDAR’99  Tutorial
Related work
Winston, P.H. (1970). Learning structural
descriptions from examples, PhD
Dissertation, MIT.
Concept learning can be cast as a search
involving generalization and specialization
operators
ICDAR’99  Tutorial
Related work
Plotkin, G.D. (1970). A note on inductive generalization. In Meltzer & Michie (Eds.), Machine Intelligence 5, Edinburgh University Press
Plotkin, G.D. (1971). A further note on inductive generalization. In Meltzer & Michie (Eds.), Machine Intelligence 6, Edinburgh University Press
Early formalization of the more_general_than relation and of the related notion of subsumption
ICDAR’99  Tutorial
Related work
Simon, H.A. & Lea, G. (1973). Problem
solving and rule induction: A unified view.
In Gregg (Ed.),
Knowledge and Cognition
,
Lawrence Erlbaum Associates
Early account of learning as search
through a hypothesis space
ICDAR’99  Tutorial
Related work
Mitchell, T.M. (1977). Version spaces: A candidate elimination approach to rule learning, Proc. 5th IJCAI, MIT Press
Mitchell, T.M. (1982). Generalization as search, Artificial Intelligence
Early formalization of version spaces and of the Candidate-Elimination algorithm
ICDAR’99  Tutorial
Related work
Haussler, D. (1988). Quantifying
inductive bias: AI learning algorithms and
Valiant’s learning framework.
Artificial
Intelligence
The size of the general boundary
G
can
grow exponentially in the number of
training examples, even when the
hypothesis space consists of simple
conjunctions of features.
ICDAR’99  Tutorial
Related work
Smith, B.D. & Rosenbloom, P. (1990). Incremental non-backtracking focusing: A polynomially bounded generalization algorithm for version spaces. Proc. 1990 AAAI
A simple change to the representation of
the
G
set can improve complexity in
certain cases
ICDAR’99  Tutorial
Related work
Hirsh, H. (1991). Theoretical
underpinnings of version spaces.
Proc.
12th IJCAI
Learning can be polynomial in the number
of examples in some cases when the G
set is not stored at all.
ICDAR’99  Tutorial
Related work
Hirsh, H. (1990). Incremental version space merging: A general framework for concept learning. Kluwer.
Hirsh, H. (1994). Generalizing version spaces. Machine Learning, 17(1), 5-46.
Extension for handling bounded noise in real-valued attributes that describe the training examples.
Generalize the Candidate-Elimination algorithm to handle situations in which training information can be different types of constraints represented using version spaces.
ICDAR’99  Tutorial
Related work
Sebag, M. (1994). Using constraints to build
version spaces.
Proc. 1994 ECML
.
Sebag, M. (1996). Delaying the choice of
bias: A disjunctive version space approach.
Proc. 13th ICML
.
A separate version space is learned for each
positive training example, then new instances
are classified by combining the votes of these
different version spaces.
In this way it is possible to handle noisy
training examples.
ICDAR’99  Tutorial
Learning disjunctive concepts: How?
Candidate-Elimination is a least-commitment algorithm: it generalizes only when it is forced to.
Disjunction provides a way of avoiding any generalization at all: the algorithm is never forced to generalize.
In order to learn disjunctive concepts, some method must be found for controlling the introduction of disjunctions, so as to prevent trivial disjunctions.
Sequential covering is a widespread approach to learning sets of rules.
Sequential Covering algorithms
Perform repeated Candidate-Elimination runs to find several conjunctive descriptions that together cover all of the training instances.
At each run, a conjunctive concept is found that is consistent with some of the positive training examples and all of the negative training examples.
The positive instances that have been accounted for are removed from further consideration, and the process is repeated until all positive examples have been covered.
[Figure: positive (+) and negative (−) training instances in the instance space]
Sequential Covering Algorithm
SequentialCovering(Target_attribute, Attributes, Examples, Threshold)
  Learned_rules ← Ø
  Rule ← LEARN-ONE-RULE(Target_attribute, Attributes, Examples)
  while PERFORMANCE(Rule, Examples) > Threshold do
    Learned_rules ← Learned_rules ∪ { Rule }
    Examples ← Examples − { examples correctly covered by Rule }
    Rule ← LEARN-ONE-RULE(Target_attribute, Attributes, Examples)
  Learned_rules ← sort Learned_rules according to PERFORMANCE over Examples
  return Learned_rules
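The sequential covering loop can be sketched in Python as follows. Several choices here are assumptions made only for the illustration: rules are dictionaries of attribute=value tests predicting the positive class, PERFORMANCE is the fraction of covered examples that are positive, LEARN-ONE-RULE is a simple greedy general-to-specific search, and the final sort uses the full training set.

```python
def covers(rule, x):
    return all(x[a] == v for a, v in rule.items())


def performance(rule, examples):
    """Fraction of the covered examples that are positive (0.0 if none is covered)."""
    labels = [label for x, label in examples if covers(rule, x)]
    return sum(labels) / len(labels) if labels else 0.0


def learn_one_rule(examples, attributes):
    """Greedy general-to-specific search: add tests until no negative is covered."""
    rule, covered = {}, list(examples)
    while any(not label for _, label in covered):
        best = None
        for a in attributes:
            if a in rule:
                continue
            for v in sorted({x[a] for x, label in covered if label}):
                subset = [(x, l) for x, l in covered if x[a] == v]
                positives = sum(l for _, l in subset)
                score = (positives / len(subset), positives)
                if best is None or score > best[0]:
                    best = (score, a, v, subset)
        if best is None:              # no test left that covers a positive
            break
        _, a, v, covered = best
        rule[a] = v
    return rule


def sequential_covering(examples, attributes, threshold):
    learned, remaining = [], list(examples)
    rule = learn_one_rule(remaining, attributes)
    while remaining and performance(rule, remaining) > threshold:
        learned.append(rule)
        # remove the positive examples correctly covered by the rule
        remaining = [(x, l) for x, l in remaining if not (l and covers(rule, x))]
        rule = learn_one_rule(remaining, attributes)
    learned.sort(key=lambda r: performance(r, examples), reverse=True)
    return learned


ATTRS = ["Sky", "AirTemp", "Humidity", "Wind", "Water", "Forecast"]
rest = {"AirTemp": "Warm", "Humidity": "Normal", "Wind": "Strong",
        "Water": "Cool", "Forecast": "Change"}
D = [({"Sky": "Sunny", **rest}, True),
     ({"Sky": "Cloudy", **rest}, True),
     ({"Sky": "Rainy", **rest}, False)]

rules = sequential_covering(D, ATTRS, threshold=0.9)
print(rules)   # [{'Sky': 'Cloudy'}, {'Sky': 'Sunny'}]
```

On the disjunctive example “Sky=Sunny or Sky=Cloudy”, which a single conjunctive hypothesis cannot represent, the loop learns one rule per disjunct.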
ICDAR’99  Tutorial
LEARN-ONE-RULE
LEARN-ONE-RULE accepts a set of positive and negative training examples as input, then outputs a single rule that covers many of the positive examples and no negative examples.
High accuracy, but not necessarily high coverage.
How to implement this procedure?
By applying the Candidate-Elimination algorithm
ICDAR’99  Tutorial
Sequential Covering + Candidate-Elimination
1. Initialize S to contain one positive training example (seed example).
2. For each negative training instance apply the Update-G routine to G.
3. Choose a description g from G as one conjunction for the solution set.
Since Update-G has been applied using all the negative examples, g covers no negative examples. However, g may cover several of the positive examples.
4. Remove from further consideration all positive examples covered by g.
5. Repeat steps 1 through 4 until all positive examples are covered.
ICDAR’99  Tutorial
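Sequential Covering + Candidate-Elimination
[Figure: positive (+) and negative (−) instances, with the sets S_1 and S_2 and the descriptions g_1 and g_2 chosen in two successive runs]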
ICDAR’99  Tutorial
General-to-specific search
A different approach to implementing LEARN-ONE-RULE is to organize the hypothesis space search in the same fashion as the ID3 algorithm, but to follow only the most promising branch in the tree at each step.
Top-down or general-to-specific greedy search through the space of possible rules
ICDAR’99  Tutorial
The search space for rule
preconditions
IF THEN EnjoySport=yes
  IF Sky=sunny THEN EnjoySport=yes
    IF Sky=sunny and AirTemp=warm THEN EnjoySport=yes
    IF Sky=sunny and Humidity=high THEN EnjoySport=yes
  IF AirTemp=warm THEN EnjoySport=yes
  IF Humidity=normal THEN EnjoySport=yes
The search starts by considering the most general rule precondition possible (the empty test that matches every example).
At each step, the test that most improves rule performance is added. Information gain can be used as a heuristic.
ICDAR’99  Tutorial
Beam search
The general-to-specific search is generally a greedy depth-first search with no backtracking.
Danger of a suboptimal choice at any step
To reduce this risk a beam search is sometimes performed.
A list of the k best candidates at each step is kept, rather than a single best candidate.
Both CN2 (Clark & Niblett, 1989) and AQ (Michalski, 1969, 1986) perform a top-down beam search. AQ also uses seed examples to guide the search for rules.
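A top-down beam search for a single rule can be sketched as follows. This is an illustration only and not the CN2 or AQ procedure: the scoring function (accuracy on the covered examples, then number of covered positives) and the names `accuracy` and `beam_search_rule` are assumptions made for the example.

```python
def covers(rule, x):
    return all(x[a] == v for a, v in rule)


def accuracy(rule, examples):
    """(fraction of covered examples that are positive, number of covered positives)."""
    labels = [label for x, label in examples if covers(rule, x)]
    return (sum(labels) / len(labels), sum(labels)) if labels else (0.0, 0)


def beam_search_rule(examples, attributes, k):
    """Return the best conjunctive precondition found, keeping k candidates per step."""
    best = ()                                   # empty precondition: matches everything
    beam = [best]
    while beam:
        # Specialize every rule in the beam with one additional attribute=value test.
        candidates = set()
        for rule in beam:
            used = {a for a, _ in rule}
            for a in attributes:
                if a in used:
                    continue
                for v in {x[a] for x, _ in examples}:
                    candidates.add(tuple(sorted(rule + ((a, v),))))
        # Discard rules that cover no positive example, then keep the k best.
        candidates = [c for c in candidates if accuracy(c, examples)[1] > 0]
        candidates.sort(key=lambda c: (accuracy(c, examples), c), reverse=True)
        beam = candidates[:k]
        if beam and accuracy(beam[0], examples) > accuracy(best, examples):
            best = beam[0]
    return best


ATTRS = ["Sky", "AirTemp", "Humidity", "Wind", "Water", "Forecast"]
rows = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]
D = [(dict(zip(ATTRS, values)), label) for values, label in rows]

rule = beam_search_rule(D, ATTRS, k=2)
print(rule)   # (('Sky', 'Sunny'),)
```

With k = 1 the procedure reduces to the greedy search with no backtracking; larger values of k trade computation for a lower risk of committing to a poor first test.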
ICDAR’99  Tutorial
Simultaneous vs. sequential covering algorithms
Sets of rules can be obtained by converting decision trees (Quinlan, 1987).
ID3 can be considered a simultaneous covering algorithm, in contrast to sequential covering algorithms.
Sequential covering algorithms are slower.
To learn a set of n rules, each containing k attribute-value tests in its precondition, a sequential covering algorithm will perform n·k primitive search steps (independent choices).
ICDAR’99  Tutorial
Simultaneous vs. sequential covering algorithms
Simultaneous covering algorithms are generally faster, since each test with m possible results contributes to choosing the preconditions for at least m rules.
Sequential covering algorithms make a large number of independent choices, while simultaneous covering algorithms make a smaller number of dependent choices.
ICDAR’99  Tutorial
Computational complexity
Worst-case analysis
A: number of attributes
V: max number of values per attribute
N: size of the data set
b: beam width
Approach        Algorithm  Training    Testing  Space
Version space   CEA        O(ANV)      O(AV)    O(AV)
Decision trees  C4.5       O(A²N)      O(lg A)  O(N lg A)
Rule induction  CN2        O(A²N b²)   O(Ab)    O(Ab)
Average-case behavior can be very different
ICDAR’99  Tutorial
Induce rules directly or convert a
decision tree to a set of rules?
The answer depends on how much training data is available.
If data is plentiful, then it may support the larger number of independent decisions required by sequential covering algorithms.
If data is scarce, the sharing of decisions regarding preconditions of different rules may be more effective.
ICDAR’99  Tutorial
Induce rules directly or convert a
decision tree to a set of rules?
The answer depends on the concept learning task.
If the concept description is highly disjunctive with many independent conditions, decision tree learning algorithms perform poorly when data is scarce.
Replication problem
Decision tree representing the Boolean concept AB ∨ CD
A
  True: B
    True: True
    False: C
      True: D
        True: True
        False: False
      False: False
  False: C
    True: D
      True: True
      False: False
    False: False
ICDAR’99  Tutorial
Single-concept rule learning
In single-concept learning, the learning element is presented with positive and negative instances of some concept.
The system has to find rules that effectively describe the concept under study.
Given a new case x, if it does not satisfy the preconditions of any rule, then it is considered as a negative instance (default classification).
ICDAR’99  Tutorial
Alternatively ...
The system might learn rules for both positive and negative instances of the concept.
IF <precondition1> THEN Positive
IF <precondition2> THEN Negative
This is a simple case of multiple-concept learning.
Preconditions are not necessarily mutually exclusive.
[Figure: positive (+) and negative (−) instances in the instance space]
ICDAR’99  Tutorial
Changes to Sequential Covering
algorithm
LEARN-ONE-RULE should be modified to accept an additional input argument specifying the target value of interest.
LEARN-ONE-RULE(Target_attribute, Target_Value, Attributes, Examples).
Similarly, the PERFORMANCE subroutine should be changed in order to prefer those hypotheses that cover a higher number of examples with respect to the target value of interest.
Information gain is no longer appropriate.
ICDAR’99  Tutorial
Classification of new cases
No partitioning of the instance space.
Given a new instance x, three different situations can occur:
No classification: the instance satisfies no precondition.
Single classification: the instance satisfies the preconditions of rules with the same conclusion (either positive or negative).
Multiple classification: the instance satisfies the preconditions of rules with different conclusions.
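The three situations can be distinguished by collecting the conclusions of all rules whose preconditions are satisfied by the new instance. The rule set below is hypothetical and serves only to exercise the three cases; `classify` is a name introduced for this sketch.

```python
def classify(rules, x):
    """Return (situation, set of conclusions) for instance x.

    rules is a list of (precondition, conclusion) pairs, where a
    precondition is a dict of attribute=value tests.
    """
    conclusions = {conclusion for precondition, conclusion in rules
                   if all(x.get(a) == v for a, v in precondition.items())}
    if not conclusions:
        return "no classification", conclusions
    if len(conclusions) == 1:
        return "single classification", conclusions
    return "multiple classification", conclusions


# Hypothetical rules for both the positive and the negative class.
RULES = [
    ({"Sky": "Sunny", "AirTemp": "Warm"}, "Positive"),
    ({"Sky": "Rainy"}, "Negative"),
    ({"Water": "Cool"}, "Negative"),
]

print(classify(RULES, {"Sky": "Cloudy", "AirTemp": "Cold", "Water": "Warm"}))
print(classify(RULES, {"Sky": "Sunny", "AirTemp": "Warm", "Water": "Warm"}))
print(classify(RULES, {"Sky": "Sunny", "AirTemp": "Warm", "Water": "Cool"}))
```

How to resolve the first and third cases (a default class, rule ordering, voting) is a separate design decision that this sketch deliberately leaves open.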
ICDAR’99  Tutorial
Default rule
The “default” classification might be desirable if one is attempting to learn a target concept such as “pregnant women who are likely to have twins”.
The fraction of positive examples in the entire population is small, so the rule set would be more compact and intelligible to humans if it identifies only the positive examples of the concept.
[Figure: instance space divided into a positive region and a negative region]
ICDAR’99  Tutorial
Learning multiple concepts
The learning system is presented with training examples that are instances of several concepts, and it must find several concept descriptions.
For each concept description, there is a corresponding region in the instance space.
[Figure: instance space with regions A, B and C]
ICDAR’99  Tutorial
Multiple-concept learning
When concepts are independent, a multiple-concept learning problem can be reformulated as a sequence of single-concept learning problems.
The union of the sets of rules learned for each concept is the output of the multiple-concept learning algorithm.
Approach followed in AQ (Michalski, 1986).
[Figure: the training instances relabelled as positive (+) or negative (−) for each single-concept learning problem]
ICDAR’99  Tutorial
Multiple classification
If concepts are mutually exclusive,
multiple classification can be an
undesirable result.
To avoid multiple classifications, the
addition of a new rule to the set of
learned rules may require the
modification of preconditions of existing
rules (
knowledge integration problem
).
When underlying concepts are
overlapping, multiple classification might
be a desirable feature.
ICDAR’99  Tutorial
Learning multiple independent
concepts
In multiple-concept learning problems, examples are typically described by feature vectors:
⟨a_1, a_2, …, a_n, c⟩
where c is the target attribute. Distinct values of c represent different concepts.
Concepts are intended as mutually exclusive, that is, independent.
ICDAR’99  Tutorial
Learning multiple dependent
concepts
In the general case, examples can be described by feature vectors:
⟨a_1, a_2, …, a_n, c_1, c_2, …, c_m⟩
where the c_i are target attributes.
Concepts are not necessarily independent.
⟨Sky, Temp, …, Forecast, EnjoySport, EnjoyWork, PreferredMusic⟩
Learned rules may take these concept dependencies into account
IF EnjoySport = yes and Forecast=change
THEN PreferredMusic = Jazz
ICDAR’99  Tutorial
Learning multiple dependent
concepts (cont.)
The instance space of a single-concept learning problem is defined by some target attributes as well.
⟨Sky, Temp, …, Forecast, EnjoySport, PreferredMusic⟩
Which target attributes should be considered?
Discover the dependencies between attributes before starting the learning process, then define instance spaces accordingly.
ICDAR’99  Tutorial
Learning multiple dependent
concepts (cont.)
Instance spaces for three single-concept learning problems:
⟨Sky, Temp, …, Forecast, EnjoySport⟩
⟨Sky, Temp, …, Forecast, EnjoySport, EnjoyWork⟩
⟨Sky, Temp, …, Forecast, EnjoySport, EnjoyWork, PreferredMusic⟩
[Figure: dependency graph over EnjoySport, EnjoyWork and PreferredMusic]
ICDAR’99  Tutorial
Learning multiple dependent
concepts (cont.)
Possible attribute dependencies can be detected by means of statistical techniques.
[Figure: dependency graph over EnjoySport, EnjoyWork and PreferredMusic]
ICDAR’99  Tutorial
Learning multiple dependent
concepts (cont.)
Concept dependencies can be discovered on-line, i.e. during the learning process, by simultaneously working with different instance spaces for each learning problem.
This is equivalent to exploring different search spaces for each concept to be learned.
Main issue: learning useless concept descriptions, e.g.
IF EnjoySport = yes THEN EnjoyWork = no
IF EnjoyWork = no THEN EnjoySport = yes
ICDAR’99  Tutorial
Related work
Michalski, R.S. (1969). On the quasi-minimal solution of the general covering problem. Proc. 1st Int. Symposium on Information Processing, Bled
Michalski, R.S., Mozetic, I., Hong, J., and Lavrac, N. (1986). The multipurpose incremental learning system AQ15 and its testing application to three medical domains. Proc. 5th AAAI
Early definition of the sequential covering strategy.
Use of seed examples to guide the general-to-specific search of single rules.
Unordered set-like list of rules.
ICDAR’99  Tutorial
Related work
Clark, P., and Niblett, T. (1989). The CN2 induction algorithm. Machine Learning, 3.
General-to-specific search of single rules performed à la ID3 (no seed example).
Ordered list of rules.
ICDAR’99  Tutorial
Related work
Quinlan, J.R. (1987). Generating
production rules from decision trees.
Proc.
10th IJCAI
.
Early work on a simultaneous covering
algorithm.
ICDAR’99  Tutorial
Related work
Malerba, D., Semeraro, G., and Esposito, F. (1997). A Multistrategy Approach to Learning Multiple Dependent Concepts. In C. Taylor & R. Nakhaeizadeh (Eds.), Machine Learning and Statistics: The Interface, 87-106, Wiley.
Early investigation of the problem of learning multiple dependent concepts.
Application to document understanding.
ICDAR’99  Tutorial
Propositional rules
The rule
IF Sky=sunny ∧ AirTemp=Warm ∧ Wind=strong
THEN EnjoySport=Yes
is variable-free: both the precondition and the conclusion are expressed by (conjunctions of) attribute=value pairs.
Rules like this are said to be propositional, since they can be expressed as formulas of the propositional calculus.
Sky_sunny ∧ AirTemp_Warm ∧ Wind_strong → EnjoySport
Propositional rules are induced from examples represented as feature vectors.
ICDAR’99  Tutorial
Overview
Motivations
What’s learning