Learning in Document Analysis and Understanding


Learning in Document Analysis and
Understanding
Tutorial - ICDAR’99
Prof. Floriana Esposito
Prof. Donato Malerba

Dipartimento di Informatica
University of Bari, Italy
http://lacam.di.uniba.it:8000/
ICDAR’99 - Tutorial
Overview

Motivations

What’s learning

Attribute-based learning

Statistical learning

Decision-tree learning

Learning sets of rules

First-order learning

Applications to intelligent document
processing

WISDOM++

Conclusions
ICDAR’99 - Tutorial
Objectives

Heighten
awareness

many machine learning tools can
be used for document processing
applications

Connect
: inspire/encourage
collaborations
ICDAR’99 - Tutorial
Learning in Document Processing:
some data

DOCBIB (http://documents.cfar.umd.edu/biblio/) is a collection of over 2750 references on topics such as preprocessing, representation, feature extraction, OCR, on-line recognition, text/graphics discrimination, signature verification and layout analysis.

Updated through 1997, the bibliography is maintained by the Laboratory for Language and Media Processing (LAMP) at the University of Maryland at College Park.
ICDAR’99 - Tutorial
Learning in document processing:
Some data

By querying DOCBIB on the word LEARNING, thirty-two references are found; twenty-six are papers published in the 1990s.

CLUSTERING yields six new references.
NEURAL NET yields ninety-three independent references.

The bibliography is certainly incomplete!
ICDAR’99 - Tutorial
Document processing requires a
large amount of knowledge

Flexible document processing systems require a large amount of knowledge:

- the segmentation of the document image can be based on the layout structure of specific classes;
- the separation of text from graphics requires knowledge of how text blocks can be discriminated from non-text blocks.

Document analysis and understanding as a branch of artificial intelligence (Tang et al., 1994).
ICDAR’99 - Tutorial
Hand-coding knowledge?

Problems related to knowledge representation and acquisition are relevant for the development of "intelligent" document processing systems.

A great effort is made to hand-code the necessary knowledge according to some formalism:

- block grammars (Nagy et al., 1992)
- geometric trees (Dengel & Barth, 1989)
- frames (Bayer et al., 1994)

Different machine learning algorithms can be applied in order to solve the knowledge acquisition problem.
ICDAR’99 - Tutorial
Typical machine learning applications

- Inputs are examples
- Reduce the problem to classification
- Generate rules from examples

An alternative to eliciting rules from experts:

- experts prefer (and excel in) describing expertise with examples, not rules
- advice of experts is costly
- expertise may be unavailable
- greater development speed
ICDAR’99 - Tutorial
Comparing development times
(Michie, 1989)

Name    Domain                          ML tool                # Rules   Develop (yrs)   Maintain (yrs)
MYCIN   medical diagnosis               none                   100       100             N/A
XCON    VAX computer configuration      none                   8000      180             30
GASOIL  gas-oil separator system        ExpertEase, Extran7    2800      1               0.1
        configuration
BMT     configuration of fire-          1st Class,             >30000    9               2
        protection equipment in         RuleMaster
        buildings
ICDAR’99 - Tutorial
Overview

Motivations

What’s learning

Attribute-based learning

Statistical learning

Decision-tree learning

Learning sets of rules

First-order learning

Applications to intelligent document
processing

WISDOM++

Conclusions
ICDAR’99 - Tutorial
What is learning?

A common view: a system learns if it makes changes in itself that enable it to better perform a given task.

Learning is a multi-faceted phenomenon comprising:

- acquisition of declarative knowledge
- development of motor and cognitive skills through instruction and practice
- organization of knowledge into new, more effective representations
- discovery of new facts & theories through observation and experimentation
ICDAR’99 - Tutorial
History
Three paradigms:

Neural modeling and Decision
Theoretic techniques

Symbolic Concept oriented
techniques

Knowledge intensive Learning
ICDAR’99 - Tutorial
Neural modeling (1955-1965) (1986-…)

Building general purpose learning systems with little or no initial structure and/or a priori knowledge

- Tabula rasa = learning without knowledge
- Neural modeling = self-organizing systems

Neural networks
[McCulloch & Pitts, 1943], [Rashevsky, 1948], [Rosenblatt, 1958], [Selfridge, 1959], [Widrow, 1962], [Minsky & Papert, 1969], [Rumelhart & McClelland, 1986], [Hinton, 1989]

Simulation of evolutionary processes
[Holland, 1980], [De Jong, 1989]
ICDAR’99 - Tutorial
Decision-Theoretic Techniques
(1955-1965)
Learning is estimating parameters from a
set of training examples

Discriminant functions

[Nilsson, 1965], [Uhr, 1966], [Samuel, 1959, 1963]

Statistical Decision Theory

[Fu, 1968], [Watanabe, 1960], [Duda & Hart, 1973],
[Kanal, 1974]

Adaptive Control Systems

[Truxal, 1955], [Tsypkin, 1968, 1971, 1973] [Fu,
1971, 1974]
ICDAR’99 - Tutorial
Symbolic Concept-oriented
Techniques (1962-1980)
Learning is obtaining high level concept
descriptions from a set of observations,
using logic or graph structures

Psychological studies

[Hunt & Hovland, 1963], [Feigenbaum, 1963], [Simon
& Lea, 1974]

Task oriented systems

[Buchanan, 1978]

General Purpose Inductive Systems

[Winston, 1975], [Michalski, 1972-1978], [Hayes-Roth & McDermott, 1978], [Vere, 1975], [Mitchell, 1978]
ICDAR’99 - Tutorial
Knowledge Intensive Learning
Systems and Multistrategy Learning
(1980 - today)
Learning uses models of deep knowledge and integrates different methods

Exploration and integration of different
strategies


Knowledge intensive techniques

Inductive Logic Programming

Successful applications
Machine Learning Workshops/Conferences ICML
( CMU 1980,… Bled 1999)
ICML Proceedings 1987-1999 by Morgan Kaufmann
Machine Learning Journal (1986,..) Kluwer Academic
Publishers

ICDAR’99 - Tutorial
The general model of a Learning System

Learning is the improvement of performance in some environment through the acquisition of knowledge resulting from experience in that environment.
Pat Langley, "Elements of Machine Learning", Morgan Kaufmann, 1996

[Diagram: the Environment provides experience to Learning, which produces Knowledge used by Performance]
ICDAR’99 - Tutorial

A
computer program
is said to learn from
experience E with respect to some class
of tasks T and performance measure P, if
its performance at tasks in T, as
measured by P, improves with experience
E.

A well-defined learning problem must identify the class of tasks, the measure of performance to be improved, and the source of experience.
Learning Systems
ICDAR’99 - Tutorial
Example: A handwriting
recognition learning problem

Task T: recognizing and classifying
handwritten words within images

Performance measure P: percent of
words correctly classified

training experience E: a database of
handwritten words with given
classification
Tom M. Mitchell “Machine Learning”
McGraw Hill, 1997
ICDAR’99 - Tutorial
Basic Questions

What do Machines learn?
(knowledge Representation)

What kind of problems can be
addressed?
(Tasks)

How do machines learn?
(Reasoning Mechanisms)
ICDAR’99 - Tutorial
What do Machines learn?

Numerical Parameters / Probability
Distributions

Grammars / Automata

Decision Trees

Production Rules / Logical Formulas

Taxonomies

Schemes / Frames / Graphs

Neural nets
ICDAR’99 - Tutorial
Subsymbolic and Symbolic Learning

A first criterion to distinguish among the methods is the cognitive status of the employed knowledge representation:

- the subsymbolic methods (neural networks, genetic algorithms, etc.) employ a general and uniform knowledge representation, use little background knowledge, and are easy to experiment with;
- the symbolic methods have the goal of acquiring human-type concepts, analyzing and transforming the knowledge already possessed into a more "operational" form.
ICDAR’99 - Tutorial
Both learning and performance
rely on the ability to represent
knowledge

Representing experience
(the INPUT to learning)

representing and organizing the
acquired knowledge
(the OUTPUT of learning)
ICDAR’99 - Tutorial
Representing experience

The simplest approach: boolean or binary features.
A slightly more sophisticated formalism: a set of nominal attributes.
Sometimes it is convenient to use also numeric attributes.

This makes it possible to define the instance space: given k numeric attributes, one can represent any given instance as a point in a k-dimensional space, with the attributes defining the axes.
(Boolean feature vectors can be viewed as a special case of a numeric encoding with values limited to 0 and 1.)
ICDAR’99 - Tutorial
Representing experience

Some problems require more sophisticated formalisms (not only attributes but also relations among objects).

In this case it is convenient to use more powerful notations that make it possible to describe objects and to express relationships among objects:

sets of relational literals
ICDAR’99 - Tutorial
Representing experience

LOGICAL ZERO-ORDER LANGUAGES

Feature vectors
s = (v1, v2, …, vn)

Pairs (attribute = value)
(shape = triangle)
(size = medium)
(color = red)
(area = 0.75)
ICDAR’99 - Tutorial
Representing experience

LOGICAL FIRST-ORDER LANGUAGES (PREDICATE CALCULUS)

Image world: two blocks, B stacked on A.
- Entities: the blocks A and B (symbols)
- Relations: the on-relation between B and A
- Predicates: ON(B, A)
- Functions: universal function ON, X = ON(B) yields X = A, i.e. A = ON(B)
ICDAR’99 - Tutorial
Representing the knowledge

In order to represent the acquired knowledge, many symbolic learning systems, sometimes called "conceptual", express the output as conjunctive concepts.

These systems generally perform categorizations and make predictions.

A "concept" is the intensional definition of a class (category), extensionally represented by all the instances of that class.
ICDAR’99 - Tutorial
Representing the knowledge

Unstructured Domains
Instance = vector of <Attribute, Value> pairs

e1 = (<color, red>, <shape, square>, <edge-length, 2>)

A propositional calculus is sufficient to express concepts, i.e. descriptions that are combinations of attribute values:

φ = [color = red] ∧ [shape = square ∨ triangle]
ICDAR’99 - Tutorial
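The propositional matching just described can be put in a few lines of code. A minimal sketch — the dictionary encoding of instances and concepts is an assumption of this sketch, not the tutorial's notation:

```python
# Match a propositional conjunctive concept against an attribute-value
# instance: phi = [color = red] AND [shape = square OR triangle]

def matches(instance, concept):
    """concept maps each constrained attribute to its set of allowed values."""
    return all(instance.get(attr) in allowed for attr, allowed in concept.items())

e1 = {"color": "red", "shape": "square", "edge-length": 2}
phi = {"color": {"red"}, "shape": {"square", "triangle"}}
```

With these definitions, `matches(e1, phi)` holds, while a green instance fails the first constraint.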
Representing the knowledge

Structured Domains
Instance = parts, attributes, relations

e = square(a) ∧ triangle(b) ∧ large(a) ∧ small(b) ∧ on-table(a) ∧ on(b,a) ∧ red(b) ∧ green(a)

A first-order predicate calculus is necessary to express concepts, i.e. descriptions that are combinations of attribute values and relations among parts:

φ = red(y) ∧ triangle(y) ∧ square(x) ∧ on(y,x)
ICDAR’99 - Tutorial
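First-order matching additionally needs a substitution for the variables. A small hypothetical sketch — the tuple encoding of literals is assumed here, and the brute-force search over substitutions is practical only for tiny examples:

```python
from itertools import permutations

# Does a substitution for the variables exist under which every literal of
# the concept appears in the instance description?

def covers(concept, example, variables, constants):
    for values in permutations(constants, len(variables)):
        theta = dict(zip(variables, values))
        ground = {(p, tuple(theta[a] for a in args)) for p, args in concept}
        if ground <= example:          # all ground literals hold in the example
            return True
    return False

# e = square(a) ∧ triangle(b) ∧ large(a) ∧ small(b) ∧ on-table(a) ∧ on(b,a) ∧ red(b) ∧ green(a)
e = {("square", ("a",)), ("triangle", ("b",)), ("large", ("a",)),
     ("small", ("b",)), ("on-table", ("a",)), ("on", ("b", "a")),
     ("red", ("b",)), ("green", ("a",))}

# phi = red(y) ∧ triangle(y) ∧ square(x) ∧ on(y,x)
phi = [("red", ("y",)), ("triangle", ("y",)), ("square", ("x",)), ("on", ("y", "x"))]
```

Here φ covers e via the substitution x = a, y = b.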
CONCEPTS
Levels of Concept Descriptions

LOGICAL DESCRIPTIONS

- ATTRIBUTIONAL CONJUNCTIVE DESCRIPTIONS, over nominal, linear and structured attribute domains, e.g.
  [xi = a ∨ b] & [xj = 2..5] & [xk = C]
- STRUCTURAL CONJUNCTIVE DESCRIPTIONS, e.g.
  ∃p [f(p) = a..b] & [ ] & …
  ∃p [g(p) ≥ 2] & [ ] & …
ICDAR’99 - Tutorial
The task
A WAY TO CLASSIFY THE SYSTEMS

One-step classification and prediction
- Goal: to increase the accuracy of the performance system

Multistep inference or problem solving
- Goal: to increase the efficiency of the problem-solving learner

Discovery (acquisition of objective knowledge & theory formation)
- Goal: to increase the performance when controlling the environment
ICDAR’99 - Tutorial
The degree of supervision

SUPERVISED LEARNING
The tutor of the learner gives direct feedback about the goodness of its performance

- for classification problems: each training instance includes an attribute specifying the class of that instance (learning from examples). The goal is to induce a concept description that predicts this attribute
- for problem solving tasks: the tutor suggests the correct step at each point in the reasoning process (learning apprentices)
ICDAR’99 - Tutorial
The degree of supervision

UNSUPERVISED LEARNING
The feedback by the tutor is absent

- for classification problems: the entire set of observations is supplied (learning from observations); the goal is to induce a categorization, i.e. clusters of observations
- for problem solving tasks: credit assignment, or determining the degree to which each move in a sequence deserves credit or blame for the final outcome
ICDAR’99 - Tutorial
How do Machines Learn?

Induction

Deduction

Abduction

Analogy
ICDAR’99 - Tutorial
How do Machines Learn?

INDUCTION: from facts, events and observations to theories, rules and models
DEDUCTION: from theories, rules and models to facts, events and observations
ICDAR’99 - Tutorial
Inferences

Deductive inference
  all men are mortal
  Socrates is a man
  ⇒ Socrates is mortal

Approximate deductive inference
  smokers are liable to cancer
  John is a smoker
  ⇒ it is possible that John will get cancer
ICDAR’99 - Tutorial
Inferences

Inductive inference
  x1 has property A
  x2 has property A
  …
  xn has property A
  ⇒ ∀x ∈ X, "x has property A"

Abductive inference
  a man who has drunk a lot loses his equilibrium
  I see Peter lose his equilibrium and fall down
  ⇒ perhaps Peter drank a lot
Inferences

Analogical inference
[Figure: the letters a, b, c paired with the numbers 1, 2, 3, 4]
A : C ≈ 1 : 4
ICDAR’99 - Tutorial
Abduction-deduction-induction cycle

[Cycle diagram: initial information → (Abduction) ranking hypotheses → (Deduction) expected consequences → (Induction) request new information → back to the start]
ICDAR’99 - Tutorial
The Inductive Paradigm

GIVEN
- a set of observed facts F
- a background knowledge B
- a set of hypotheses H

FIND
the hypotheses that, together with B, explain the observed facts F:

H ∧ B |= F
ICDAR’99 - Tutorial
Empirical Learning
(inductively learning from many data)

SUPERVISED (heuristic induction)

GIVEN
- a set of classes (concepts) {Ci}, i = 1..k, expressed in the language Lc
- a set of examples of each class, expressed in the language Lo: E = {Eci}, with Eci ∩ Ecj = ∅ for i ≠ j
- a background knowledge B

FIND
for each class Ci an inductive hypothesis Hi such that

Hi ∪ B ⊨ e for each e ∈ Eci
Hi ∪ B ⊭ e for each e ∈ Ecj with j ≠ i
ICDAR’99 - Tutorial
Empirical Learning
(inductively learning from many data)

UNSUPERVISED (learning from observation)

GIVEN
- a set of observations E
- the number N of concepts we want to discover
- a background knowledge B

FIND
the best partitioning of the set of observations into N concepts such that, for each class Ci, the inductive hypothesis Hi satisfies

Hi ∪ B ⊨ e for each e ∈ Ci, i = 1..N
Hi ∪ B ⊭ e for each e ∈ Cj with j ≠ i
ICDAR’99 - Tutorial
Example
The objects are, for example, Saturday mornings, and the classification task involves the weather
ATTRIBUTES:
outlook, with values {sunny, overcast, rain}
temperature, with values {cool, mild, hot}
humidity, with values {high, normal}
windy, with values {true, false}
ICDAR’99 - Tutorial
A small training set

No   Outlook   Temp.  Humidity  Windy  Class
D1   sunny     hot    high      F      N
D2   sunny     hot    high      T      N
D3   overcast  hot    high      F      P
D4   rain      mild   high      F      P
D5   rain      cool   normal    F      P
D6   rain      cool   normal    T      N
D7   overcast  cool   normal    T      P
D8   sunny     mild   high      F      N
D9   sunny     cool   normal    F      P
D10  rain      mild   normal    F      P
D11  sunny     mild   normal    T      P
D12  overcast  mild   high      T      P
D13  overcast  hot    normal    F      P
D14  rain      mild   high      T      N
ICDAR’99 - Tutorial
How many hypotheses?

- a universe containing u objects
- a training set containing m of them
- k classes
- a concept labels each element of the universe with a class

The number of concepts consistent with the training set is k^(u-m).

As to the previous example:
- 14 of 36 possible objects
- 2^22 possible completions (≈ 4.2 million!)
ICDAR’99 - Tutorial
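The counting argument above is easy to verify numerically; the figures below are the slide's own (36 objects, 14 examples, 2 classes):

```python
# With u objects, m training examples and k classes, k**(u - m) distinct
# concepts remain consistent with the training set.

def consistent_concepts(u, m, k):
    return k ** (u - m)

# Weather example: 3 outlooks x 3 temperatures x 2 humidities x 2 windy values
u = 3 * 3 * 2 * 2          # 36 possible objects
m = 14                     # labelled training examples
k = 2                      # classes P and N
```

Here `consistent_concepts(36, 14, 2)` is 2^22 = 4194304, the "≈ 4.2 million" of the slide.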
BIAS
Any reason for preferring one of several hypotheses consistent with the training set

Examples of biases:
- consider only hypotheses expressible in some language
- prefer the hypothesis with the most concise expression
- prefer the first discovered consistent hypothesis
- assume that the relative frequency of class C objects in the training set is the same as in the universe
ICDAR’99 - Tutorial
How many examples do we need?

Theorem (Blumer, Ehrenfeucht, Haussler & Warmuth):
Consider a training set of m examples classified in accordance with some theory L.
Suppose N hypotheses {Hi, i = 1..N} which include L.
Let Hk be any hypothesis that agrees with L on all m examples.
Provided

m ≥ (1/ε) ln(N/δ)

the probability that Hk and L differ by more than ε is less than or equal to δ.
ICDAR’99 - Tutorial
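The bound can be turned into a small calculator. A sketch, assuming we round up to a whole number of examples:

```python
import math

# Sample-size bound m >= (1/eps) * ln(N/delta): enough examples to make it
# unlikely (probability <= delta) that a hypothesis consistent with all of
# them differs from the true theory L by more than eps.

def sample_bound(n_hypotheses, eps, delta):
    return math.ceil((1.0 / eps) * math.log(n_hypotheses / delta))
```

For instance, for the 2^22 candidate completions of the weather example, with ε = 0.1 and δ = 0.05, the bound asks for 183 examples.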
The deductive paradigm
(explanation-based learning)

GIVEN
- a domain theory: a set of rules and facts representing the knowledge in the domain
- a single example or a set of examples of a concept
- a goal concept: an approximate definition describing the concept to be learned
- an operationality criterion: a predicate over concept definitions, specifying the form in which the learned concept must be expressed

FIND
- a new concept definition satisfying the operationality criterion
ICDAR’99 - Tutorial
A multicriteria classification of
machine learning methods

CLASSIFICATION CRITERIA
- Primary purpose: synthetic vs. analytic
- Type of input: learning from examples / learning from observation; example-guided vs. specification-guided
- Type of primary inference: inductive / deductive / analogy
- Role of prior knowledge: from empirical induction to axiomatic deduction

LEARNING PROCESSES include: empirical (pure) generalization, constructive induction, abduction, qualitative discovery, conceptual clustering, neural nets, genetic algorithms, automatic program synthesis, constructive generalization, abstraction, explanation-based learning, constructive deductive generalization, integrated empirical & explanation-based learning, multistrategy constructive learning.
ICDAR’99 - Tutorial
Overview

Motivations

What’s learning

Attribute-based learning

Statistical learning

Decision-tree learning

Learning sets of rules

First-order learning

Applications to intelligent document
processing

WISDOM++

Conclusions
ICDAR’99 - Tutorial
Statistical learning methods

Learning in pattern recognition, devoted to automatic pattern classification, is traditionally handled by trainable classifiers, definable as "devices that sort data into classes, capable of improving their performance in response to information they receive as a function of time".

Learning is then the measure of the classifier's effectiveness or performance, often associated with feedback and with the capability of adapting to a changing environment.
ICDAR’99 - Tutorial
Statistical learning methods

- Trainable classifiers are inductive methods devoted to classification and can be considered as a special case of empirical learning.
- Using methods of statistical decision theory, they improve their performance by adjusting internal parameters rather than by structural changes.
- They do not use symbolic descriptions to represent higher-level knowledge.
ICDAR’99 - Tutorial
Trainable classifiers

Any trainable pattern classifier can then be seen as a machine with "adjustable" discriminant functions, which define the behavior of the classifier.

Note: within this common framework, the various classification methods differ mainly in:
- the family of discriminant functions used and their properties;
- the training method used.
ICDAR’99 - Tutorial
The basic model for a trainable pattern
classifier

Key assumption: data to be classified must be transformed into n-dimensional vectors of real-number features, with finite n:

X = (x1, x2, …, xn)

In geometric terms…
Any pattern can be represented by a point in an n-dimensional Euclidean space E^n, called the "pattern space".
ICDAR’99 - Tutorial
The basic model for a trainable pattern
classifier

Aim: define regions Ω_i of the space, so that

X ∈ ω_i ⇔ X ∈ Ω_i,  i = 1, …, N

where
- N is the number of classes;
- ω_i is the i-th class.

[Figure: a pattern space partitioned into regions Ω_1, Ω_2, Ω_3, with a pattern point P]
ICDAR’99 - Tutorial
The basic model for a trainable pattern
classifier

In mathematical terms…
The surfaces separating the regions Ω_i can be implicitly defined by N scalar, single-valued functions:

g_1(X), g_2(X), …, g_N(X)

X ∈ ω_i ⇔ g_i(X) > g_j(X)  for all i, j = 1, …, N, i ≠ j

We call the g_i(X) "discriminant functions".
ICDAR’99 - Tutorial
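The discriminant-function model above amounts to evaluating all g_i and taking the argmax. A sketch with two made-up linear discriminants (the functions themselves are toy examples, not from the tutorial):

```python
# A bank of discriminant functions g_i(X) plus a maximum selector.

def classify(x, discriminants):
    """Return the index i of the largest discriminant value g_i(x)."""
    scores = [g(x) for g in discriminants]
    return max(range(len(scores)), key=scores.__getitem__)

g = [
    lambda x: x[0] - x[1],   # favours patterns below the diagonal
    lambda x: x[1] - x[0],   # favours patterns above the diagonal
]
```

Training would adjust the parameters inside the g_i; the maximum-selector stage stays fixed.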
The basic model for a trainable pattern
classifier

[Block diagram: the input pattern X = (x_1, …, x_i, …, x_d) feeds a bank of discriminators computing g_1(X), …, g_j(X), …, g_k(X); a maximum selector compares the discriminant values and emits the response, i.e. the class whose discriminant is largest]
ICDAR’99 - Tutorial
Discriminant analysis (Fisher 1936)

Discriminant functions used: linear functions established by Statistical Decision Theory. They have the following general formulation:

g_i(X) = p(X|ω_i) P(ω_i),  i = 1, …, N

where
- p(X|ω_i) is the probability density function of pattern X given ω_i
- P(ω_i) is the a priori probability of occurrence of category ω_i
ICDAR’99 - Tutorial
Discriminant analysis (Fisher 1936)

The training method is parametric. The following assumptions are made about the input population:

a. the p(X|ω_i) are multivariate normal probability density functions with unknown mean vectors M_i and unknown covariance matrices Σ_i:

p(X|ω_i) = ((2π)^(n/2) |Σ_i|^(1/2))^(-1) exp(-½ (X - M_i)^T Σ_i^(-1) (X - M_i))

b. all the covariance matrices Σ_i are identical (Σ_i = Σ)
ICDAR’99 - Tutorial
Fisher classification functions

g_i(X) = X^T Σ^(-1) M_i + log P(ω_i) - ½ M_i^T Σ^(-1) M_i,  i = 1, …, N

The training examples are used to estimate the parameters Σ and M_i and the a priori probabilities P(ω_i).

Note: the classification method used by discriminant analysis is a nearest-neighbour one: each pattern X is assigned to the class with the centroid closest to X in the Generalized Quadratic Distance

D_i^2 = log |Σ| + (X - M_i)^T Σ^(-1) (X - M_i) - 2 log P(ω_i)
ICDAR’99 - Tutorial
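The nearest-centroid rule with the Generalized Quadratic Distance can be sketched as follows; to keep the sketch dependency-free it assumes a shared *diagonal* covariance matrix, so Σ^(-1) reduces to one reciprocal variance per feature (the means, variances and priors below are toy values, not estimates from the tutorial's data):

```python
import math

# D_i^2 = log|Sigma| + (X - M_i)^T Sigma^{-1} (X - M_i) - 2 log P(w_i),
# specialised to a diagonal Sigma.

def quad_distance(x, mean, variances, prior):
    log_det = sum(math.log(v) for v in variances)
    maha = sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mean, variances))
    return log_det + maha - 2.0 * math.log(prior)

def classify(x, means, variances, priors):
    """Assign x to the class whose centroid is closest in D_i^2."""
    dists = [quad_distance(x, m, variances, p) for m, p in zip(means, priors)]
    return min(range(len(dists)), key=dists.__getitem__)

means = [(0.0, 0.0), (4.0, 4.0)]   # toy class centroids
variances = (1.0, 1.0)             # shared diagonal covariance (toy values)
priors = (0.5, 0.5)
```

With a full covariance matrix the Mahalanobis term would require a matrix inverse; the structure of the rule is unchanged.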
Overview

Motivations

What’s learning

Attribute-based learning

Statistical learning

Decision-tree learning

Learning sets of rules

First-order learning

Applications to intelligent document
processing

WISDOM++

Conclusions
ICDAR’99 - Tutorial
Decision tree learning

Decision trees classify instances by sorting them down the tree, from the root to a leaf node providing the classification.

The learning problem

GIVEN
- a set S of objects (instances, examples, observations)
- a set C of classes
- a set A of attributes
- for each attribute A ∈ A, the set {a_1, a_2, …, a_r} of discrete values that A can assume

FIND
- a decision tree T which correctly classifies the objects e ∈ S
ICDAR’99 - Tutorial
Decision tree learning

HAVING TWO CLASSES (positive, negative):

- the set S is associated to the root
- for each node, the best attribute A* is selected, according to a criterion defined by the user, to make the test in that node
- for each leaf node the name of a class is defined
Decision tree learning

A node holding the example counts (p, n) and testing an attribute A with values a_1, …, a_i, …, a_r has r children, associated with the subsets S_1, …, S_i, …, S_r and counts (p_1, n_1), …, (p_i, n_i), …, (p_r, n_r), where

S_i = { e ∈ S | A(e) = a_i }
S = S_1 ∪ … ∪ S_r
S_i ∩ S_j = ∅ for all i ≠ j
ICDAR’99 - Tutorial
Decision tree learning
The systems: CLS, IDR, C4, ASSISTANT,
ID5, etc…
Appropriate problems:

instances are represented by attribute
value pairs

the target function has discrete output
values

disjunctive concept descriptions may be
required

the training data may contain errors

the training data may contain missing
values
ICDAR’99 - Tutorial
Decision tree learning

The algorithms are generally based on the divide & conquer strategy. Since it is necessary to find a "good" test, i.e. the best classifier among the attributes, many different criteria are used to select a test.
ICDAR’99 - Tutorial
Decision tree learning

The divide & conquer strategy:
- handles both continuous and categorical attributes
- expresses concepts in a symbolic form
- is non-incremental
- is efficient
- works with noisy data, although it generates large trees (pruning is necessary)
- requires no parameters
ICDAR’99 - Tutorial
Decision tree learning

The heuristic criteria:

- based on information
  - entropy gain (minimal entropy)
  - gain ratio
  - normalized information gain
  - reduced description length
- error based
  - error reduction in the training set
  - dissimilarity
  - GINI index
- significance
  - χ²
  - various statistics
ICDAR’99 - Tutorial
Decision tree learning

The basic algorithm (ID3)

1. If all the examples in S belong to the same class C, then the result is a leaf labelled C.
2. Else:
   a. select the most discriminating attribute A, whose values are a_1, a_2, …, a_r
   b. partition S into S_1, S_2, …, S_r based on the values of A
   c. recursively apply the algorithm to each subset S_i
ICDAR’99 - Tutorial
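The basic algorithm can be sketched directly from steps 1-2 above, using information gain as the selection criterion. The dict-of-dicts tree encoding is an assumption of this sketch, not the tutorial's notation:

```python
import math
from collections import Counter

# A compact ID3 sketch: divide & conquer with information-gain test
# selection. Examples are dicts of attribute values; labels a parallel list.

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(examples, labels, attr):
    n = len(labels)
    remainder = 0.0
    for value in {e[attr] for e in examples}:
        subset = [l for e, l in zip(examples, labels) if e[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def id3(examples, labels, attrs):
    if len(set(labels)) == 1:                 # step 1: pure node -> leaf
        return labels[0]
    if not attrs:                             # no tests left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(examples, labels, a))   # step 2a
    tree = {}
    for value in {e[best] for e in examples}:                    # step 2b
        idx = [i for i, e in enumerate(examples) if e[best] == value]
        tree[(best, value)] = id3([examples[i] for i in idx],    # step 2c
                                  [labels[i] for i in idx],
                                  [a for a in attrs if a != best])
    return tree

# Tiny toy data: attribute "a" determines the class, "b" is irrelevant.
examples = [{"a": 0, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 0}, {"a": 1, "b": 1}]
labels = ["N", "N", "P", "P"]
```

On the toy data, ID3 picks "a" at the root (gain 1.0 against 0.0 for "b") and yields two pure leaves.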
Decision tree learning

As to the forecast problem (positive and negative conditions to play tennis), the learned tree is:

outlook = sunny (1,2,8,9,11):
    humidity = high (1,2,8): N
    humidity = normal (9,11): P
outlook = overcast (3,7,12,13): P
outlook = rain (4,5,6,10,14):
    windy = T (6,14): N
    windy = F (4,5,10): P
ICDAR’99 - Tutorial
Decision tree learning

Example for decision trees (ID3)
A similar training set, but with °F degrees for temperature and % for humidity

No   Outlook   °F  humid.%  Windy  Class
D1   rain      71  96       T      N
D2   rain      65  70       T      N
D3   overcast  72  90       T      P
D4   overcast  83  78       F      P
D5   rain      75  80       F      P
D6   overcast  64  65       T      P
D7   sunny     75  70       T      P
D8   sunny     80  90       T      N
D9   sunny     85  85       F      N
ICDAR’99 - Tutorial
Decision tree learning

No   Outlook   °F  humid.%  Windy  Class
D10  overcast  81  75       F      P
D11  rain      68  80       F      P
D12  rain      70  96       F      P
D13  sunny     72  95       F      N
D14  sunny     69  70       F      P

DECISION TREE
Outlook = sunny
    humidity > 77.5 : N
    humidity < 77.5 : P
Outlook = overcast : P
Outlook = rain
    windy = True : N
    windy = False : P
ICDAR’99 - Tutorial
Decision tree learning

ANOTHER DECISION TREE
temperature < 69.5 :
    Outlook = sunny : P
    Outlook = overcast : P
    Outlook = rain :
        windy = True : N
        windy = False : P
temperature > 69.5 :
    temperature < 79.5 :
        Outlook = sunny :
            windy = True : P
            windy = False : N
        Outlook = overcast : P
ICDAR’99 - Tutorial
Decision tree learning

ANOTHER DECISION TREE (cont.)
        Outlook = rain :
            humidity > 80.5 :
                windy = True : N
                windy = False : P
            humidity < 80.5 : P
    temperature > 79.5 :
        windy = True : N
        windy = False :
            humidity > 80.5 :
                outlook = sunny : N
                outlook = overcast : P
                outlook = rain : P
            humidity < 80.5 : P
ICDAR’99 - Tutorial
Decision tree learning

Which attribute is the best classifier?

The entropy can be used to measure the "impurity" of an arbitrary collection of examples. Given a collection S containing positive and negative examples of a certain concept, the entropy of S is

Entropy(S) = -p+ log2 p+ - p- log2 p-

where p+ is the proportion of positive examples in S and p- is the proportion of negative examples in S.
ICDAR’99 - Tutorial
Decision tree learning

[Figure (Mitchell 1997): the entropy function Entropy(S) for a boolean classification, as the proportion p+ varies between 0 and 1; Entropy(S) is 0.0 at p+ = 0 and p+ = 1, and reaches its maximum 1.0 at p+ = 0.5]
ICDAR’99 - Tutorial
Decision tree learning

Example
Consider the usual weather forecast problem (14 examples, 9 positive, 5 negative). The entropy of S is

Entropy([9+, 5-]) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Note: if all members are positive, Entropy(S) = 0; if S contains an equal number of positive and negative examples, Entropy(S) = 1.
ICDAR’99 - Tutorial
Decision tree learning

More generally, if the target concept can take on c different values, then the entropy of S relative to this c-wise classification is

Entropy(S) = Σ_{i=1..c} -p_i log2 p_i

where p_i is the proportion of S belonging to class i.

Note: the logarithm is still base 2 because entropy is a measure of expected encoding length measured in bits.
ICDAR’99 - Tutorial
Decision tree learning

Information gain
The information gain of an attribute A relative to a collection of examples S is defined as

Gain(S, A) = Entropy(S) - Σ_{a ∈ Values(A)} (|S_a| / |S|) Entropy(S_a)

where S_a is the subset of S in which the attribute A takes the value a.
Gain(S, A) is therefore the expected reduction in entropy caused by knowing the value of attribute A.
ICDAR’99 - Tutorial
Decision tree learning

Example
Suppose S is a collection of training example days described by attributes including wind (weak, strong).
Assume 14 examples [9+, 5-]. Of these, suppose 6 of the positive and 2 of the negative have wind = weak, and the remainder have wind = strong.

S = [9+, 5-]
S_weak = [6+, 2-]
S_strong = [3+, 3-]
ICDAR’99 - Tutorial
Decision tree learning

Gain(S, wind) =
= Entropy(S) - Σ_{v ∈ {weak, strong}} (|S_v| / |S|) Entropy(S_v) =
= Entropy(S) - (8/14) Entropy(S_weak) - (6/14) Entropy(S_strong) =
= 0.94 - (8/14) 0.811 - (6/14) 1.00 = 0.048
ICDAR’99 - Tutorial
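The arithmetic of the worked example is easy to check:

```python
import math

# S = [9+, 5-], S_weak = [6+, 2-], S_strong = [3+, 3-]

def entropy(pos, neg):
    total = pos + neg
    e = 0.0
    for c in (pos, neg):
        if c:                               # skip empty classes (0 log 0 = 0)
            e -= c / total * math.log2(c / total)
    return e

gain_wind = entropy(9, 5) - (8 / 14) * entropy(6, 2) - (6 / 14) * entropy(3, 3)
```

This reproduces the slide's figures: Entropy(S) ≈ 0.94, Entropy(S_weak) ≈ 0.811, Entropy(S_strong) = 1.00, Gain(S, wind) ≈ 0.048.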
Decision tree learning

The first split on outlook, starting from all 14 examples (D1, D2, …, D14) [9+, 5-]:

outlook = sunny: (D1,D2,D8,D9,D11) [2+, 3-] → ?
outlook = overcast: (D3,D7,D12,D13) [4+, 0-] → P
outlook = rain: (D4,D5,D6,D10,D14) [3+, 2-] → ?

Which attribute should be tested here?
ICDAR’99 - Tutorial
Decision tree learning

S_sunny = {D1, D2, D8, D9, D11}

Gain(S_sunny, humidity) = .970 - (3/5) 0.0 - (2/5) 0.0 = .970
Gain(S_sunny, temperature) = .970 - (2/5) 0.0 - (2/5) 1.0 - (1/5) 0.0 = .570
Gain(S_sunny, wind) = .970 - (2/5) 1.0 - (3/5) .918 = .019

In the partially learned decision tree resulting from the first step of ID3, the training examples are sorted to the corresponding descendant nodes. The Overcast descendant has only positive examples and therefore becomes a leaf node with classification Yes. The other two nodes will be further expanded, by selecting the attribute with the highest information gain relative to the new subsets of examples.
ICDAR’99 - Tutorial
Decision tree learning

[Figure: the hypothesis space searched by ID3 — a lattice of partially built trees over tests on the attributes A1, A2, A3, A4]

ID3 searches through the space of possible decision trees from simplest to increasingly complex, guided by the information gain heuristic.
ICDAR’99 - Tutorial
Decision tree learning

Overfitting in Decision Trees

The tree learned so far:
outlook = sunny:
    humidity = high: no
    humidity = normal: yes
outlook = overcast: yes
outlook = rain:
    wind = strong: no
    wind = weak: yes

Consider adding the noisy training example #15:
Sunny, Hot, Normal, Strong, PlayTennis = N
What effect on the earlier tree?
ICDAR’99 - Tutorial
Decision tree learning

Avoiding overfitting
How can we avoid overfitting?
- stop growing when the data split is not statistically significant
- grow the full tree, then post-prune

How to select the "best" tree:
- measure performance over the training data
- measure performance over a separate validation data set
- MDL: minimize size(tree) + size(misclassifications(tree))
Decision tree learning

Reduced-Error Pruning
Split the data into training and validation sets.
Do until further pruning is harmful:
1. evaluate the impact on the validation set of pruning each possible node (plus those below it)
2. greedily remove the one that most improves validation set accuracy

- produces the smallest version of the most accurate subtree
- what if data is limited?
ICDAR’99 - Tutorial
Decision tree learning

When observations are presented in a stream, and when responses to these observations are required in a timely manner, we need incrementality.

The incremental versions of this strategy (Fisher, Schlimmer, Utgoff) assume that:
- the instances are presented one at a time
- statistics are maintained relative to the different tests for each node of the tree
- when it is decided to change a test, it is possible to prune the sub-trees or to modify the sub-trees
ICDAR’99 - Tutorial
Decision tree learning

Other methods inspired by decision trees allow one to substitute the leaf with:
- the probability distribution of the class
- a linear function (perceptron, regression models, etc.)
ICDAR’99 - Tutorial
From Decision Trees to Decision
Rules

outlook = sunny:
    humidity = high: no
    humidity = normal: yes
outlook = overcast: yes
outlook = rain:
    wind = strong: no
    wind = weak: yes

IF Outlook = Sunny AND Humidity = High THEN Class = no
IF Outlook = Sunny AND Humidity = Normal THEN Class = yes
IF Outlook = Overcast THEN Class = yes
...
ICDAR’99 - Tutorial
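Reading the rules off the tree is a walk over root-to-leaf paths. A sketch, with a nested-dictionary tree encoding that is my assumption rather than the tutorial's:

```python
# One IF-THEN rule per root-to-leaf path of the decision tree.

def tree_to_rules(tree, conditions=()):
    if not isinstance(tree, dict):            # leaf: emit one rule
        cond = " AND ".join(f"{a} = {v}" for a, v in conditions)
        return [f"IF {cond} THEN Class = {tree}"]
    rules = []
    for (attr, value), subtree in tree.items():
        rules += tree_to_rules(subtree, conditions + ((attr, value),))
    return rules

tennis_tree = {
    ("Outlook", "Sunny"): {("Humidity", "High"): "no",
                           ("Humidity", "Normal"): "yes"},
    ("Outlook", "Overcast"): "yes",
    ("Outlook", "Rain"): {("Wind", "Strong"): "no",
                          ("Wind", "Weak"): "yes"},
}
```

Applied to the tennis tree it produces the five rules, three of which appear on the slide above.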
Overview

Motivations

What’s learning

Attribute-based learning

Statistical learning

Decision-tree learning

Learning sets of rules

First-order learning

Applications to intelligent document
processing

WISDOM++

Conclusions
ICDAR’99 - Tutorial
Learning Rules Directly

- A concept description is expressed in a logic-style form as a set of decision rules (if-then rules)
- decision rules are one of the most expressive and comprehensible knowledge representations

Two classes of methods:
- learning attributional rules (AQ family, CN2, etc.)
- learning relational descriptions (INDUCE, FOIL, PROGOL, etc.)
ICDAR’99 - Tutorial
Concept Learning
 Concept [from Latin concipere = to
seize (a thought)]
 Two opposing views:
 EXTENSIONAL: a specified set of
physical or abstract objects
 INTENSIONAL: a set of (necessary
and) sufficient conditions
(Wittgenstein, 1953).
 Induce a set of sufficient conditions
from a sample of positive and
negative examples of the concept
ICDAR’99 - Tutorial
A Concept Learning Task

Sky    AirTemp  Humid   Wind    Water  Forecst  EnjoySport
Sunny  Warm     Normal  Strong  Warm   Same     Yes
Sunny  Warm     High    Strong  Warm   Same     Yes
Rainy  Cold     High    Strong  Warm   Change   No
Sunny  Warm     High    Strong  Cool   Change   Yes

 Positive and negative training
examples (instances) for the target
concept EnjoySport
 Task: to learn to predict the value of
EnjoySport for an arbitrary day
ICDAR’99 - Tutorial
Representing Hypotheses
 A hypothesis h is a conjunction of
constraints on the instance attributes
(conjunctive concept)
 Each constraint can be:
 a specific value (e.g., “Water=Warm”)
 don’t care (e.g., “Water=?”)
 no value allowed (e.g., “Water=Ø”)

Sky  Temp  Humid  Wind  Water  Forecst
?    Cold  High   ?     ?      ?

h = ⟨Sunny, Warm, ?, Strong, ?, ?⟩
can be rewritten as:
IF Sky=Sunny ∧ AirTemp=Warm ∧ Wind=Strong
THEN EnjoySport=Yes
ICDAR’99 - Tutorial
A Formalization
 Given
 Instances X: possible days, each
described by the attributes Sky, Temp,
Humidity, Wind, Water, Forecast
 Training examples D: ⟨xi, c(xi)⟩
 Target concept c: EnjoySport : X → {0,1}
c(x)=1 if EnjoySport=yes
c(x)=0 if EnjoySport=no
 Hypotheses H: expressed as conjunctions
of constraints on the attributes
ICDAR’99 - Tutorial
A Formalization
 Determine
 A hypothesis h in H such that
h(x) = c(x) for all x in X
 The learning task is to determine a
hypothesis h identical to the target
concept c over the entire set of
instances X.
ICDAR’99 - Tutorial
What information is available?

Sky    AirTemp  Humid   Wind    Water  Forecst  EnjoySport
Sunny  Warm     Normal  Strong  Warm   Same     Yes
Sunny  Warm     High    Strong  Warm   Same     Yes
Rainy  Cold     High    Strong  Warm   Change   No
Sunny  Warm     High    Strong  Cool   Change   Yes

 Typically, D ⊂ X.
 If Sky has three possible values, and
AirTemp, Humidity, Wind, Water, Forecast
each have two possible values,
then X has 3·2·2·2·2·2 = 96 distinct
instances.
 |D| = 4
ICDAR’99 - Tutorial
The inductive learning hypothesis

Any hypothesis found to approximate
the target function well over a
sufficiently large set of training
examples will also approximate the
target function well over other
unobserved examples.
ICDAR’99 - Tutorial
Concept Learning as Search
 Concept learning can be viewed as
the task of searching through a large
space of hypotheses H.
 The goal of the search is to find the
hypothesis that best fits the training
examples.
 The space of hypotheses is implicitly
defined by the hypothesis
representation.
ICDAR’99 - Tutorial
An example
 A hypothesis h is a conjunction of
constraints on the instance attributes.
Each constraint can be:
 a specific value
 don’t care, ?
 no value allowed, Ø
 H has 5·4·4·4·4·4 = 5120
syntactically distinct hypotheses.
ICDAR’99 - Tutorial
Semantically distinct hypotheses

h  = ⟨?, Cold, Ø, ?, ?, ?⟩
h’ = ⟨Sunny, Warm, High, Ø, ?, ?⟩

 Both hypotheses represent the empty
set of instances ⇒ every instance is
classified as negative.
 Semantically distinct hypotheses:
1 + (4·3·3·3·3·3) = 973
 Typically the search space is much
larger, sometimes infinite.
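The three counts of the last slides can be verified directly. A small sketch, not from the tutorial; the attribute arities are those of the EnjoySport task (Sky has 3 values, the other five attributes have 2 each).

```python
# Verifying the slide's counts for the EnjoySport hypothesis space.
arities = [3, 2, 2, 2, 2, 2]            # values per attribute

instances = 1
for k in arities:
    instances *= k                       # 3·2·2·2·2·2 distinct instances

syntactic = 1
for k in arities:
    syntactic *= k + 2                   # each slot: k values, '?', or 'Ø'

semantic = 1
for k in arities:
    semantic *= k + 1                    # Ø-free hypotheses only, since any
semantic += 1                            # 'Ø' yields the same empty concept;
                                         # then add that one empty concept

print(instances, syntactic, semantic)    # → 96 5120 973
```

The `+ 1` term is exactly the slide's "1 + (4·3·3·3·3·3)": all hypotheses containing at least one Ø collapse into a single semantically distinct (empty) concept.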
ICDAR’99 - Tutorial
Efficient search: how?
 Naive search strategy: generate-and-test
all hypotheses in H.
 Impossible for infinite (or very large)
search spaces.
 The search can rely on the structure
defined by a general-to-specific ordering
of hypotheses.

h1 = ⟨Sunny, ?, ?, Strong, ?, ?⟩
h2 = ⟨Sunny, ?, ?, ?, ?, ?⟩

 h2 is more general than h1.
ICDAR’99 - Tutorial
General-to-specific ordering
 Given two hypotheses hk and hj:
 hj is more general than or equal to hk
(hj ≥g hk) if and only if any instance that
satisfies hk also satisfies hj.
 hj is strictly more general than hk
(hj >g hk) if and only if hj ≥g hk and
not hk ≥g hj.
 The inverse relation more_specific_than
can be defined as well.

h1 = ⟨Sunny, ?, ?, Strong, ?, ?⟩
h2 = ⟨Sunny, ?, ?, ?, ?, ?⟩
h3 = ⟨Sunny, ?, ?, ?, Cool, ?⟩

x1 = ⟨Sunny, Warm, High, Strong, Cool, Same⟩
x2 = ⟨Sunny, Warm, High, Light, Warm, Same⟩

(Figure: instances X mapped into the hypothesis
space H, ordered from specific to general —
h2 is more general than both h1 and h3.)
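For conjunctive hypotheses over value/"?" slots, the ≥g test reduces to a slot-wise check. A sketch, not from the tutorial (the Ø case is omitted, since an Ø-containing hypothesis covers nothing and is ≥g nothing non-empty):

```python
# hj ≥g hk: every instance satisfying hk also satisfies hj.
# Slot-wise, hj's constraint must be '?' or equal to hk's constraint.
def more_general_or_equal(hj, hk):
    return all(a == "?" or a == b for a, b in zip(hj, hk))

h1 = ("Sunny", "?", "?", "Strong", "?", "?")
h2 = ("Sunny", "?", "?", "?", "?", "?")
h3 = ("Sunny", "?", "?", "?", "Cool", "?")

print(more_general_or_equal(h2, h1))  # True:  h2 ≥g h1
print(more_general_or_equal(h2, h3))  # True:  h2 ≥g h3
print(more_general_or_equal(h1, h3))  # False: h1 and h3 are incomparable
```

Note that ≥g is only a partial order: h1 and h3 above are related to h2 but not to each other.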
ICDAR’99 - Tutorial
Terminology
 A hypothesis h covers a positive
example if it correctly classifies the
example as positive: h(x)=c(x)=1
 An example ⟨x, c(x)⟩ satisfies
hypothesis h when h(x)=1, regardless
of whether x is a positive or negative
example of the target concept.
 A hypothesis h is consistent with an
example ⟨x, c(x)⟩ when h(x)=c(x).
 A hypothesis h is consistent with a
training set D if it is consistent with
each example x ∈ D.
ICDAR’99 - Tutorial
Taking advantage of the general-
to-specific ordering
 One way is to begin with the most
specific possible hypothesis in H, then
generalize this hypothesis each time it
fails to cover an observed positive
training example (bottom-up search)
 Find a maximally specific hypothesis
ICDAR’99 - Tutorial
Example
 h ← ⟨Ø, Ø, Ø, Ø, Ø, Ø⟩
 Consider the first example in D:
⟨Sunny, Warm, Normal, Strong, Warm, Same⟩, +
 The maximally specific hypothesis is
h ← ⟨Sunny, Warm, Normal, Strong, Warm, Same⟩
 Consider the second example in D:
⟨Sunny, Warm, High, Strong, Warm, Same⟩, +
 The maximally specific hypothesis is
h ← ⟨Sunny, Warm, ?, Strong, Warm, Same⟩
 The third example is ignored since negative:
⟨Rainy, Cold, High, Strong, Warm, Change⟩, −
 Consider the fourth example in D:
⟨Sunny, Warm, High, Strong, Cool, Change⟩, +
 The maximally specific hypothesis is
h ← ⟨Sunny, Warm, ?, Strong, ?, ?⟩

(Figure: the corresponding search through the
hypothesis space, from the most specific
h0 = ⟨Ø, Ø, Ø, Ø, Ø, Ø⟩ through h1 and h2 = h3
to h4 = ⟨Sunny, Warm, ?, Strong, ?, ?⟩, as the
instances x1–x4 are processed.)
ICDAR’99 - Tutorial
FIND-S Algorithm
1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x
2.1 For each attribute constraint ai in h
if the constraint ai is satisfied by x
then do nothing
else replace ai in h by the next more
general constraint that is satisfied by x
3. Output hypothesis h
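The steps above can be sketched directly in code. A minimal illustration of FIND-S on the EnjoySport data, not from the tutorial; Ø is represented here by `None`:

```python
# FIND-S: maintain the maximally specific hypothesis consistent with
# the positive examples seen so far (negatives are ignored).
def find_s(examples):
    h = [None] * len(examples[0][0])     # most specific hypothesis ⟨Ø,…,Ø⟩
    for x, label in examples:
        if label != "Yes":               # step 2: positives only
            continue
        for i, value in enumerate(x):
            if h[i] is None:             # first positive: copy its values
                h[i] = value
            elif h[i] != value:          # constraint violated: generalize
                h[i] = "?"               # the next more general constraint
    return h

D = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), "Yes"),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"), "Yes"),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), "No"),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), "Yes"),
]
print(find_s(D))  # → ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```

On the tutorial's training set this reproduces h4 = ⟨Sunny, Warm, ?, Strong, ?, ?⟩ from the worked example.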
ICDAR’99 - Tutorial
No revision in case of negative
example: Why?
 Basic assumptions:
 the target concept c is in H
 no errors in training data
 h is the most specific hypothesis in H,
therefore c ≥g h
 but c will never be satisfied by a
negative example,
 thus neither will h.
ICDAR’99 - Tutorial
Limitations of
Find-S

Can’t tell whether the learner converged
to the correct target concept

Has it found the only hypothesis in
H

consistent with the data, or there are
many other consistent hypotheses as
well?

Picks a maximally specific hypothesis.

Why should we prefer this hypothesis
over, say, the most general?
ICDAR’99 - Tutorial
Limitations of Find-S (cont.)

Can’t tell when training data are
inconsistent

Inconsistency in training examples
can mislead Find-S, since it ignores
negative examples. Is it possible to
detect such inconsistency?

There might be several maximally
specific consistent hypotheses.

Find-S should backtrack on its
choices in order to explore a different
branch of the partial ordering than the
branch it has selected.
ICDAR’99 - Tutorial
Version Space
 Return a version space instead of a
single hypothesis.
 The version space, VS_{H,D}, with
respect to hypothesis space H and
training examples D, is the subset of
hypotheses from H consistent with the
training examples in D.

VS_{H,D} ≡ { h ∈ H | Consistent(h, D) }
ICDAR’99 - Tutorial
The List-Then-Eliminate algorithm
1. VS_{H,D} ← a list containing every
hypothesis in H
2. For each training example ⟨x, c(x)⟩,
remove from VS_{H,D} any hypothesis h
for which h(x) ≠ c(x)
3. Output the list of hypotheses in VS_{H,D}
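For the EnjoySport task the hypothesis space is small enough to enumerate, so List-Then-Eliminate is actually runnable. A sketch, not from the tutorial: the 973 semantically distinct conjunctive hypotheses are generated explicitly (the empty concept is represented by `None`), then every hypothesis inconsistent with some example is struck out.

```python
from itertools import product

def h_of(h, x):
    """h(x) for an Ø-free conjunctive hypothesis: 1 iff x satisfies h."""
    return int(all(a == "?" or a == v for a, v in zip(h, x)))

values = [("Sunny", "Cloudy", "Rainy"), ("Warm", "Cold"),
          ("Normal", "High"), ("Strong", "Light"),
          ("Warm", "Cool"), ("Same", "Change")]

# Step 1: 972 Ø-free hypotheses plus the single empty concept = 973
H = list(product(*[vs + ("?",) for vs in values])) + [None]

D = [(("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), 1),
     (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), 1),
     (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), 0),
     (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), 1)]

# Step 2: eliminate every hypothesis h with h(x) != c(x) on some example
VS = [h for h in H
      if all((0 if h is None else h_of(h, x)) == c for x, c in D)]

print(len(H), len(VS))  # → 973 6
```

The six survivors are exactly the version space of the tutorial's worked example: S = {⟨Sunny, Warm, ?, Strong, ?, ?⟩}, G = {⟨Sunny, ?, ?, ?, ?, ?⟩, ⟨?, Warm, ?, ?, ?, ?⟩}, and the three hypotheses between them.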
ICDAR’99 - Tutorial
Pros and cons

Guaranteed to output all hypotheses
consistent with the training data

Can detect inconsistencies in the
training data

Exhaustive enumeration of all
hypotheses:

possible only for finite spaces H

unrealistic for large spaces H
ICDAR’99 - Tutorial
Version Space: A compact
representation
 The version space can be represented by its
most general and least general members
(version space representation theorem)

(Figure: the version space VS_{H,D} within H,
bounded above by the general boundary G and
below by the specific boundary S, both
determined by the training instances.)
ICDAR’99 - Tutorial
General boundary
 The general boundary G, with respect to
hypothesis space H and training data D, is
the set of maximally general members of H
consistent with D:

G ≡ { g ∈ H | Consistent(g, D) ∧
¬(∃g’ ∈ H) [(g’ >g g) ∧ Consistent(g’, D)] }
ICDAR’99 - Tutorial
Specific boundary
 The specific boundary S, with respect to
hypothesis space H and training data D, is
the set of minimally general (i.e., maximally
specific) members of H consistent with D:

S ≡ { s ∈ H | Consistent(s, D) ∧
¬(∃s’ ∈ H) [(s >g s’) ∧ Consistent(s’, D)] }

A Version Space
G: { ⟨Sunny, ?, ?, ?, ?, ?⟩, ⟨?, Warm, ?, ?, ?, ?⟩ }
⟨Sunny, ?, ?, Strong, ?, ?⟩  ⟨Sunny, Warm, ?, ?, ?, ?⟩  ⟨?, Warm, ?, Strong, ?, ?⟩
S: { ⟨Sunny, Warm, ?, Strong, ?, ?⟩ }
Candidate-Elimination algorithm
G ← maximally general hypotheses in H
S ← maximally specific hypotheses in H
For each training example d, do
 if d is a positive example
 Remove from G any hypothesis
inconsistent with d
 UPDATE-S routine: For each hypothesis s
in S that is not consistent with d
 Remove s from S
 Add to S all minimal generalizations h of s
such that
 h is consistent with d, and
 some member of G is more general than h
 Remove from S any hypothesis that is more
general than another hypothesis in S

(Figure: the boundary S generalizing step by
step within the version space VS_{H,D}.)
Candidate-Elimination algorithm
(cont.)
 if d is a negative example
 Remove from S any hypothesis
inconsistent with d
 UPDATE-G routine: For each hypothesis g
in G that is not consistent with d
 Remove g from G
 Add to G all minimal specializations h of g
such that
 h is consistent with d, and
 some member of S is more specific than h
 Remove from G any hypothesis that is less
general than another hypothesis in G

(Figure: the boundary G specializing step by
step within the version space VS_{H,D}.)
Sky    AirTemp  Humid   Wind    Water  Forecst  EnjoySport
Sunny  Warm     Normal  Strong  Warm   Same     Yes
Sunny  Warm     High    Strong  Warm   Same     Yes
Rainy  Cold     High    Strong  Warm   Change   No
Sunny  Warm     High    Strong  Cool   Change   Yes

Running the algorithm on D yields:
G: { ⟨Sunny, ?, ?, ?, ?, ?⟩, ⟨?, Warm, ?, ?, ?, ?⟩ }
⟨Sunny, ?, ?, Strong, ?, ?⟩  ⟨Sunny, Warm, ?, ?, ?, ?⟩  ⟨?, Warm, ?, Strong, ?, ?⟩
S: { ⟨Sunny, Warm, ?, Strong, ?, ?⟩ }
ICDAR’99 - Tutorial
What does the Candidate-
Elimination algorithm converge to?
The algorithm will converge towards the
target concept provided that
 there are no errors in the training
examples
 there is some hypothesis in H that
correctly describes the target concept
The target concept is learned when
the S and G boundary sets converge
to a single, identical hypothesis.
ICDAR’99 - Tutorial
Empty Version Space

The algorithm outputs an empty
version space when:

training data contain errors (a
positive example is presented as
negative).

the target concept cannot be
described in the hypothesis
representation.
ICDAR’99 - Tutorial
Other characteristics

The Candidate-Elimination algorithm
performs a
bi-directional search
.

G
and
S
can grow exponentially in the
number of training examples.
How can partially learned
concepts be used?
With the version space learned above,
G: { ⟨Sunny, ?, ?, ?, ?, ?⟩, ⟨?, Warm, ?, ?, ?, ?⟩ }
⟨Sunny, ?, ?, Strong, ?, ?⟩  ⟨Sunny, Warm, ?, ?, ?, ?⟩  ⟨?, Warm, ?, Strong, ?, ?⟩
S: { ⟨Sunny, Warm, ?, Strong, ?, ?⟩ }
new instances can be classified by letting the
hypotheses vote:
 ⟨Sunny, Warm, Normal, Strong, Cool, Change⟩
POSITIVE: unanimous agreement
 ⟨Rainy, Cold, Normal, Light, Warm, Same⟩
NEGATIVE: unanimous agreement
 ⟨Sunny, Warm, Normal, Light, Warm, Same⟩
Half positive, half negative
 ⟨Sunny, Cold, Normal, Strong, Warm, Same⟩
Two positive, four negative
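The voting scheme is easy to check mechanically. A sketch, not from the tutorial, hard-coding the six hypotheses of the worked-out EnjoySport version space:

```python
# Classify new instances by voting among all members of the version space.
def covers(h, x):
    return all(a == "?" or a == v for a, v in zip(h, x))

VS = [("Sunny", "Warm", "?", "Strong", "?", "?"),   # S
      ("Sunny", "?", "?", "Strong", "?", "?"),
      ("Sunny", "Warm", "?", "?", "?", "?"),
      ("?", "Warm", "?", "Strong", "?", "?"),
      ("Sunny", "?", "?", "?", "?", "?"),           # G
      ("?", "Warm", "?", "?", "?", "?")]            # G

def vote(x):
    pos = sum(covers(h, x) for h in VS)
    return f"{pos} positive, {len(VS) - pos} negative"

print(vote(("Sunny", "Warm", "Normal", "Strong", "Cool", "Change")))  # 6 pos
print(vote(("Rainy", "Cold", "Normal", "Light", "Warm", "Same")))     # 0 pos
print(vote(("Sunny", "Warm", "Normal", "Light", "Warm", "Same")))     # 3 pos
print(vote(("Sunny", "Cold", "Normal", "Strong", "Warm", "Same")))    # 2 pos
```

The four queries reproduce the four verdicts on the slide: unanimous positive, unanimous negative, a 3–3 split, and a 2–4 split.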
ICDAR’99 - Tutorial
An interactive learning algorithm
 The learning algorithm is allowed to
choose the next instance (query) and
receive the correct classification from
an external oracle (e.g., nature or a
teacher).
 If the algorithm always chooses a
query that is satisfied by only half of the
hypotheses in VS, then the correct
target concept can be found in
⌈log2 |VS|⌉ steps.
 Generally, it’s impossible to adopt this
optimal search strategy.
ICDAR’99 - Tutorial
Dealing with noisy training
instances
 Relax the condition that the concept
descriptions be consistent with all of the
training instances.
 Solution for a bounded, predetermined
number of misclassified examples:
maintain several G and S sets of varying
consistency.

(Figure: nested boundary sets G0, G1, G2 and
S0, S1, S2 within H.)
ICDAR’99 - Tutorial
Dealing with noisy training
instances (cont.)
 The set Si is consistent with all but i of
the positive training examples
 The set Gi is consistent with all but i of
the negative training examples
 When G0 crosses S0 the algorithm can
conclude that no concept in the rule
space is consistent with all of the
training instances.
 The algorithm can recover and try to
find a concept that is consistent with all
but one of the training examples.
What if the concept is not contained
in the hypothesis space?
 Cannot represent disjunctive target concepts,
such as “Sky=Sunny or Sky=Cloudy”

Sky     AirTemp  Humid   Wind    Water  Forecst  EnjoySport
Sunny   Warm     Normal  Strong  Cool   Change   Yes
Cloudy  Warm     Normal  Strong  Cool   Change   Yes
Rainy   Warm     Normal  Strong  Cool   Change   No

 A more expressive hypothesis space is
required.

ICDAR’99 - Tutorial
A hypothesis space that includes
every possible hypothesis?

Choose H that expresses every teachable
concept (i.e., H is the power set of X).

Consider H’= disjunctions, conjunctions,
negations over previous H.

Sunny,Cold,Normal,?,?,?





?,?,?,?,?,Change



What are S and G in this case?

S is the disjunction of positive examples
presented so far

G is the negation of the disjunction of
negative examples seen so far
Totally useless. No generalization at all !
ICDAR’99 - Tutorial
A fundamental property of
inductive inference
“A learning algorithm that makes
no a priori assumptions regarding
the identity of the target
concept has no rational basis for
classifying any unseen instance.”
Prior assumption ≡ inductive bias
Do not confuse with the estimation
bias commonly used in statistics.
ICDAR’99 - Tutorial
Linear Regression
(Figure: a scatter plot of Y against X, with X
ranging roughly from 0 to 15 and Y from 0 to 25,
and a fitted straight line through the points.)
The underlying prior assumption (i.e.,
inductive bias) is that the relationship
between X and Y is linear.
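A tiny illustration, not from the tutorial: once we commit to the linear bias, fitting the least-squares line on a handful of points determines a prediction for every unseen X. The data below are an assumption of this sketch.

```python
# Least-squares line fit: the linear inductive bias lets the fitted
# slope/intercept generalize to points never observed in training.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]                     # generated by y = 2x + 1

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

print(slope, intercept)                  # → 2.0 1.0
print(slope * 10 + intercept)            # prediction at unseen X=10 → 21.0
```

Without the linearity assumption, nothing in the five training points would justify any particular value at X=10; the bias is what licenses the generalization.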
A formal definition of inductive bias
Consider
 a concept learning algorithm L
 instances X
 a target concept c
 training examples Dc = {⟨x, c(x)⟩}
 let L(xi, Dc) denote the classification
assigned to the instance xi by L after
training on data Dc.
The inductive bias of L is any minimal set of
assertions B such that for any target concept c
and corresponding training examples Dc:

(∀xi ∈ X) [(B ∧ Dc ∧ xi) ⊢ L(xi, Dc)]
Modeling inductive systems by
equivalent deductive systems
The inductive bias that is explicitly input to the
theorem prover is only implicit in the code of the
learning algorithm.

Inductive system: the training examples and a
new instance feed the learning algorithm, which
outputs the classification of the new instance,
or “don’t know”.
Deductive system: the training examples, the
new instance, and the inductive bias feed a
theorem prover, which outputs the classification
of the new instance, or “don’t know”.
ICDAR’99 - Tutorial
Bias of the Candidate-Elimination
algorithm
 The target concept c is contained in the
given hypothesis space H.
 From c ∈ H it deductively follows that
c ∈ VS_{H,D}.
 L outputs the classification L(xi, Dc) if
and only if every hypothesis in VS_{H,D}
also produces this classification,
including the hypothesis c ∈ VS_{H,D}
(inductive bias). Therefore
c(xi) = L(xi, Dc).
ICDAR’99 - Tutorial
Comparing the inductive bias of
learning algorithms
 The inductive bias is a nonprocedural means of
characterizing a learning algorithm’s policy for
generalizing beyond the observed data.
 Rote learning:
store examples;
classify x iff it matches a previously
observed example
 No inductive bias for rote learning.
ICDAR’99 - Tutorial
Comparing the inductive bias of
learning algorithms
 Inductive bias of the candidate
elimination algorithm: the target
concept can be represented in its
hypothesis space.
 Inductive bias of Find-S: the target
concept can be represented in its
hypothesis space + all instances are
negative instances unless the
opposite is entailed by its other
knowledge (a kind of default or
nonmonotonic reasoning)
ICDAR’99 - Tutorial
Related work
 Winston, P.H. (1970). Learning structural
descriptions from examples, PhD
Dissertation, MIT.
Concept learning can be cast as a search
involving generalization and specialization
operators
ICDAR’99 - Tutorial
Related work
 Plotkin, G.D. (1970). A note on inductive
generalization. In Meltzer & Michie (Eds.),
Machine Intelligence 5, Edinburgh
University Press
 Plotkin, G.D. (1971). A further note on
inductive generalization. In Meltzer &
Michie (Eds.), Machine Intelligence 6,
Edinburgh University Press
Early formalization of the
more_general_than relation and of the
related notion of θ-subsumption
ICDAR’99 - Tutorial
Related work


Simon, H.A. & Lea, G. (1973). Problem
solving and rule induction: A unified view.
In Gregg (Ed.),
Knowledge and Cognition
,
Lawrence Erlbaum Associates
Early account of learning as search
through a hypothesis space
ICDAR’99 - Tutorial
Related work


Mitchell, T.M. (1979). Version spaces: A
candidate elimination approach to rule
learning,
Proc. 5th IJCAI
, MIT Press


Mitchell, T.M. (1982). Generalization as
search,
Artificial Intelligence
Early formalization of version spaces
and the candidate-elimination algorithm
ICDAR’99 - Tutorial
Related work


Haussler, D. (1988). Quantifying
inductive bias: AI learning algorithms and
Valiant’s learning framework.
Artificial
Intelligence
The size of the general boundary
G
can
grow exponentially in the number of
training examples, even when the
hypothesis space consists of simple
conjunctions of features.
ICDAR’99 - Tutorial
Related work


Smith, B.D. & Rosenbloom, P. (1990).
Incremental non-backtracking focusing: A
polynomially bounded generalization
algorithm for version spaces.
Proc. 1990
AAAI
A simple change to the representation of
the
G
set can improve complexity in
certain cases
ICDAR’99 - Tutorial
Related work


Hirsh, H. (1991). Theoretical
underpinnings of version spaces.
Proc.
12th IJCAI
Learning can be polynomial in the number
of examples in some cases when the G
set is not stored at all.
ICDAR’99 - Tutorial
Related work


Hirsh, H. (1990).
Incremental version space
merging: A general framework for concept
learning.
Kluwer.

Hirsh, H. (1994). Generalizing version
spaces.
Machine learning
, 17(1), 5-46.

Extension for handling bounded noise in real-
valued attributes that describe the training
examples.

Generalize the Candidate-Elimination
algorithm to handle situations in which training
information can be different types of
constraints represented using version spaces.
ICDAR’99 - Tutorial
Related work


Sebag, M. (1994). Using constraints to build
version spaces.
Proc. 1994 ECML
.


Sebag, M. (1996). Delaying the choice of
bias: A disjunctive version space approach.
Proc. 13th ICML
.


A separate version space is learned for each
positive training example, then new instances
are classified by combining the votes of these
different version spaces.


In this way it is possible to handle noisy
training examples.
ICDAR’99 - Tutorial
Learning disjunctive concepts: How?
 Candidate-Elimination is a least-
commitment algorithm: it generalizes only
when it is forced to.
 Disjunction provides a way of avoiding
any generalization at all: the algorithm is
never forced to generalize.
 In order to learn disjunctive concepts,
some method must be found for
controlling the introduction of disjunctions,
so as to prevent trivial disjunctions.
 Sequential covering is a widespread
approach to learning sets of rules.
Sequential Covering algorithms
 Perform repeated candidate-elimination runs to find
several conjunctive descriptions that together cover
all of the training instances.
 At each run, a conjunctive concept is found that is
consistent with some of the positive training
examples and all of the negative training examples.
 The positive instances that have been accounted for
are removed from further consideration, and the
process is repeated until all positive examples have
been covered.

(Figure: positive and negative examples scattered
in the instance space.)
Sequential Covering Algorithm
Sequential-Covering(Target_attribute, Attributes,
Examples, Threshold)
 Learned_rules ← Ø
 Rule ← LEARN-ONE-RULE(Target_attribute,
Attributes, Examples)
 while PERFORMANCE(Rule, Examples) > Threshold do
 Learned_rules ← Learned_rules ∪ { Rule }
 Examples ← Examples − { examples correctly
covered by Rule }
 Rule ← LEARN-ONE-RULE(Target_attribute,
Attributes, Examples)
 Learned_rules ← sort Learned_rules according to
PERFORMANCE over Examples
 return Learned_rules
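The loop above can be sketched concretely. This is an illustrative assumption, not the tutorial's implementation: LEARN-ONE-RULE is realized as a greedy general-to-specific search over attribute=value tests, and PERFORMANCE is simply rule accuracy; the toy dataset encodes the disjunctive concept "Sky=Sunny or Sky=Cloudy" discussed earlier.

```python
# A sketch of sequential covering with a greedy LEARN-ONE-RULE.
def covers(rule, x):
    """A rule precondition is a list of (attribute, value) tests."""
    return all(x[a] == v for a, v in rule)

def accuracy(rule, examples):
    hit = [c for x, c in examples if covers(rule, x)]
    return sum(hit) / len(hit) if hit else 0.0

def learn_one_rule(attributes, examples):
    """Start from the empty precondition; add the best test until the
    rule covers no negatives. (No safeguards -- just a sketch.)"""
    rule, pool = [], list(examples)
    while any(c == 0 for _, c in pool):
        candidates = [(a, v) for a in attributes
                      for v in sorted({x[a] for x, _ in pool})]
        rule.append(max(candidates, key=lambda t: accuracy(rule + [t], pool)))
        pool = [(x, c) for x, c in pool if covers(rule, x)]
    return rule

def sequential_covering(attributes, examples):
    rules, remaining = [], list(examples)
    while any(c == 1 for _, c in remaining):    # positives still uncovered
        rule = learn_one_rule(attributes, remaining)
        rules.append(rule)
        remaining = [(x, c) for x, c in remaining
                     if not (c == 1 and covers(rule, x))]
    return rules

# Toy data (assumed): the disjunctive concept "Sky=Sunny or Sky=Cloudy"
D = [({"Sky": "Sunny",  "AirTemp": "Warm", "Wind": "Strong"}, 1),
     ({"Sky": "Cloudy", "AirTemp": "Warm", "Wind": "Strong"}, 1),
     ({"Sky": "Rainy",  "AirTemp": "Warm", "Wind": "Strong"}, 0)]

rules = sequential_covering(["Sky", "AirTemp", "Wind"], D)
print(rules)  # two single-test rules, one per disjunct of Sky
```

Each pass learns one conjunction (here a single Sky test), removes the positives it covers, and repeats — exactly the covering loop of the slide.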
ICDAR’99 - Tutorial
LEARN-ONE-RULE

LEARN-ONE-RULE accepts a set of
positive and negative training
examples as input, then outputs a
single rule that covers many of the
positive examples and no negative
examples.

High accuracy, but not necessarily
high coverage.

How to implement this procedure?
By applying the Candidate-Elimination
algorithm
ICDAR’99 - Tutorial
Sequential Covering + Candidate
Elimination
1. Initialize S to contain one positive training
example (seed example).
2. For each negative training instance apply
the Update-G routine to G.
3. Choose a description g from G as one
conjunction for the solution set.
 Since Update-G has been applied using all the
negative examples, g covers no negative
examples. However, g may cover several of the
positive examples.
4. Remove from further consideration all
positive examples covered by g.
5. Repeat steps 1 through 4 until all positive
examples are covered.
ICDAR’99 - Tutorial
Sequential Covering + Candidate
Elimination
(Figure: two seeds S1 and S2 and the two
conjunctions g1 and g2 grown from them, which
together cover the positive examples while
excluding the negatives.)
ICDAR’99 - Tutorial
General-to-specific search
 A different approach to implementing
LEARN-ONE-RULE is to organize the
hypothesis space search in the same
fashion as the ID3 algorithm, but to
follow only the most promising branch
in the tree at each step.
 Top-down or general-to-specific
greedy search through the space of
possible rules
ICDAR’99 - Tutorial
The search space for rule
preconditions
IF THEN EnjoySport=yes
IF Sky=sunny THEN EnjoySport=yes
IF AirTemp=warm THEN EnjoySport=yes
IF Humidity=normal THEN EnjoySport=yes
IF Sky=sunny and Humidity=high THEN EnjoySport=yes
IF Sky=sunny and AirTemp=warm THEN EnjoySport=yes
...
 The search starts by considering the most
general rule precondition possible (the
empty test that matches every example).
 At each step, the test that most improves
rule performance is added. Information
gain can be used as a heuristic.
ICDAR’99 - Tutorial
Beam search
 The general-to-specific search is generally
a greedy depth-first search with no
backtracking.
 Danger of a suboptimal choice at any step
 To reduce this risk a beam search is
sometimes performed.
 A list of the k best candidates at each step
is kept, rather than a single best candidate.
 Both CN2 (Clark & Niblett, 1989) and AQ
(Michalski, 1969, 1986) perform a top-down
beam search. AQ is also based on seed
examples to guide the search for rules.
ICDAR’99 - Tutorial
Simultaneous vs. sequential
covering algorithm
 Sets of rules can be obtained by converting
decision trees (Quinlan, 1987).
 ID3 can be considered a simultaneous
covering algorithm, in contrast to
sequential covering algorithms.
 Sequential covering algorithms are slower.
 To learn a set of n rules, each containing k
attribute-value tests in their preconditions,
a sequential covering algorithm will perform
n·k primitive search steps (independent
choices).
ICDAR’99 - Tutorial
Simultaneous vs. sequential
covering algorithm
 Simultaneous covering algorithms are
generally faster, since each test with m
possible outcomes contributes to choosing
the preconditions for at least m rules.
 Sequential covering algorithms make a
large number of independent choices,
while simultaneous covering algorithms
make a small number of dependent
choices.
ICDAR’99 - Tutorial
Computational complexity
 Worst-case analysis
 A: number of attributes
 V: max number of values per attribute
 N: size of the data set
 b: beam width
 Average-case behavior can be very different

Approach        Algorithm  Training    Testing   Space
Version space   CEA        O(ANV)      O(AV)     O(AV)
Decision trees  C4.5       O(A²N)      O(lg A)   O(N lg A)
Rule induction  CN2        O(A²N b²)   O(Ab)     O(Ab)
ICDAR’99 - Tutorial
Induce rules directly or convert a
decision tree to a set of rules?
 The answer depends on how much training
data is available.
 If data is plentiful, then it may support the
larger number of independent decisions
required by sequential covering algorithms.
 If data is scarce, the sharing of decisions
regarding the preconditions of different rules
may be more effective.
ICDAR’99 - Tutorial
Induce rules directly or convert a
decision tree to a set of rules?
 The answer depends on the concept
learning task.
 If the concept description is highly
disjunctive with many independent
conditions, decision tree learning
algorithms perform poorly when data is
scarce.
Replication problem
Decision tree representing the Boolean concept
A∧B ∨ C∧D:

(Figure: the root tests A; if A is true, B is tested,
and B=true yields True; on both the A=false and
the B=false branches the same subtree testing
C and then D must be replicated.)
ICDAR’99 - Tutorial
Single-concept rule learning
 In single-concept learning, the
learning element is presented with
positive and negative instances of
some concept.
 The system has to find rules that
effectively describe the concept under
study.
 Given a new case x, if it does not
satisfy the preconditions of any rule,
then it is considered as a negative
instance (default classification).
ICDAR’99 - Tutorial
Alternatively ...
 The system might learn rules for both
positive and negative instances of the
concept.
IF <precondition1> THEN Positive
IF <precondition2> THEN Negative
 This is a simple case of multiple-
concept learning.
 Preconditions are not necessarily
mutually exclusive.

(Figure: positive and negative examples in the
instance space, with overlapping rule regions.)
ICDAR’99 - Tutorial
Changes to Sequential Covering
algorithm
 LEARN-ONE-RULE should be modified
to accept an additional input argument
specifying the target value of interest:
LEARN-ONE-RULE(Target_attribute,
Target_Value, Attributes, Examples).
 Similarly, the PERFORMANCE
subroutine should be changed in order
to prefer those hypotheses that cover a
higher number of examples with respect
to the target value of interest.
Information gain is no longer appropriate.
ICDAR’99 - Tutorial
Classification of new cases
 No partitioning of the instance space.
 Given a new instance x, three different
situations can occur:
 No classification: the instance
satisfies no precondition.
 Single classification: the instance
satisfies the preconditions of rules with
the same conclusion (either positive or
negative).
 Multiple classification: the instance
satisfies the preconditions of rules
with different conclusions.
ICDAR’99 - Tutorial
Default rule
 The “default” classification might be
desirable if one is attempting to learn a
target concept such as “pregnant
women who are likely to have twins”.
 The fraction of positive examples in the
entire population is small, so the rule
set would be more compact and
intelligible to humans if it identifies only
the positive examples of the concept.

(Figure: a small positive region within a large
negative region of the instance space.)
ICDAR’99 - Tutorial
Learning multiple concepts
 The learning system is presented with
training examples that are instances
of several concepts, and it must find
several concept descriptions.
 For each concept description, there is
a corresponding region in the
instance space.

(Figure: three regions A, B, and C in the
instance space.)
ICDAR’99 - Tutorial
Multiple-concept learning
 When concepts are independent, a
multiple-concept learning problem can be
reformulated as a sequence of single-
concept learning problems.
 The union of the sets of rules learned for
each concept is the output of the multiple-
concept learning algorithm.
 Approach followed in AQ (Michalski, 1986).

(Figure: the same set of examples relabeled
once per concept, each relabeling defining a
separate single-concept problem.)
ICDAR’99 - Tutorial
Multiple classification

If concepts are mutually exclusive,
multiple classification can be an
undesirable result.

To avoid multiple classifications, the
addition of a new rule to the set of
learned rules may require the
modification of preconditions of existing
rules (
knowledge integration problem
).

When underlying concepts are
overlapping, multiple classification might
be a desirable feature.
ICDAR’99 - Tutorial
Learning multiple independent
concepts
In multiple-concept learning problems,
examples are typically described by
feature vectors:
⟨a1, a2, …, an, c⟩
where c is the target attribute. Distinct
values of c represent different concepts.
 Concepts are intended as mutually
exclusive, that is, independent.
ICDAR’99 - Tutorial
Learning multiple dependent
concepts
In the general case, examples can be
described by feature vectors:
⟨a1, a2, …, an, c1, c2, …, cm⟩
where the ci are target attributes.
 Concepts are not necessarily
independent.
⟨Sky, Temp, …, Forecast, EnjoySport,
EnjoyWork, PreferredMusic⟩
 Learned rules may take these concept
dependencies into account:
IF EnjoySport = yes and Forecast = change
THEN PreferredMusic = Jazz
ICDAR’99 - Tutorial
Learning multiple dependent
concepts (cont.)
 The instance space of a single-
concept learning problem is defined
by some target attributes as well:
⟨Sky, Temp, …, Forecast, EnjoySport,
PreferredMusic⟩
 Which target attributes should be
considered?
 Discover the dependencies between
attributes before starting the learning
process, then define instance spaces
accordingly.
ICDAR’99 - Tutorial
Learning multiple dependent
concepts (cont.)
 Instance spaces for three single-concept
learning problems:
 ⟨Sky, Temp, …, Forecast, EnjoySport⟩
 ⟨Sky, Temp, …, Forecast, EnjoySport,
EnjoyWork⟩
 ⟨Sky, Temp, …, Forecast, EnjoySport,
EnjoyWork, PreferredMusic⟩

(Figure: dependency graph over EnjoySport,
EnjoyWork, and PreferredMusic.)
ICDAR’99 - Tutorial
Learning multiple dependent
concepts (cont.)
 Possible attribute dependencies can be
detected by means of statistical
techniques.

(Figure: dependency graph over EnjoySport,
EnjoyWork, and PreferredMusic.)
ICDAR’99 - Tutorial
Learning multiple dependent
concepts (cont.)
 Concept dependencies can be discovered
on line, i.e. during the learning process, by
simultaneously working with different
instance spaces for each learning problem.
 This is equivalent to exploring different
search spaces for each concept to be
learned.
 Main issue: learning useless concept
descriptions, e.g.
IF EnjoySport = yes THEN EnjoyWork = no
IF EnjoyWork = no THEN EnjoySport = yes
ICDAR’99 - Tutorial
Related work


Michalski, R.S. (1969). On the quasi-
minimal solution of the general covering
problem.
Proc. 1st Int. Symposium on
Information Processing
, Bled


Michalski, R.S., Mozetic, I., Hong, J., and
Lavrac, N. (1986). The multi-purpose
incremental learning system AQ15 and its
testing application to three medical
domains.
Proc. 5th AAAI
Early definition of the sequential covering
strategy.
Use of seed examples to guide the
general-to-specific search of single rules.
Unordered set-like list of rules.
ICDAR’99 - Tutorial
Related work


Clark, P., and Niblett, T. (1989). The CN2
induction algorithm.
Machine Learning
,
3
.

General-to-specific search of single rules
performed
à la
ID3 (no seed example).
Ordered list of rules.
ICDAR’99 - Tutorial
Related work


Quinlan, J.R. (1987). Generating
production rules from decision trees.
Proc.
10th IJCAI
.

Early work on a simultaneous covering
algorithm.
ICDAR’99 - Tutorial
Related work


Malerba, D., Semeraro, G., and Esposito
F. (1997).
A Multistrategy Approach to
Learning Multiple Dependent Concepts. In
C., Taylor & R., Nakhaeizadeh (Eds.),
Machine Learning and

Statistics: The
Interface
, 87-106, Wiley
.

Early investigation of the problem of
learning multiple dependent concepts.
Application to document understanding.
ICDAR’99 - Tutorial
Propositional rules
 The rule
IF Sky=Sunny ∧ AirTemp=Warm ∧ Wind=Strong
THEN EnjoySport=Yes
is variable-free: both the precondition and
the conclusion are expressed by
(conjunctions of) attribute=value pairs.
 Rules like this are said to be propositional,
since they can be expressed as formulas
of the propositional calculus:
Sky_sunny ∧ AirTemp_Warm ∧ Wind_strong
→ EnjoySport
 Propositional rules are induced from
examples represented as feature vectors.
ICDAR’99 - Tutorial
Overview
 Motivations
 What’s learning