Machine Learning I - Outline


J. Kubalík,
Gerstner Laboratory for Intelligent Decision Making and Control

- Introduction to ML
  - Definitions of ML
  - ML as a multidisciplinary field
  - A framework for learning
- Inductive Learning
  - Version Space Search
  - Decision Tree Learning



Introduction to ML

- The ability to learn is one of the most important components of
  intelligent behaviour.
- A system that is merely good at a specific job
  - performs costly computations to solve the problem,
  - does not remember solutions,
  - and so performs the same sequence of computations again every
    time it solves the problem.
- A successful understanding of how to make computers learn would
  open up many new uses of computers.



Application Areas of ML

ML algorithms for:
- computers learning from medical records which treatments are
  best for new diseases
- speech recognition - recognition of spoken words
- data mining - discovering valuable knowledge in large databases
  of loan applications, financial transactions, medical records, etc.
- prediction and diagnostics - prediction of recovery rates of
  pneumonia patients, detection of fraudulent use of credit cards
- driving an autonomous vehicle - computer-controlled vehicles ...



Machine Learning: Definitions

- T. Mitchell (1997): A computer program learns if it improves its
  performance at some task through experience.
- T. Mitchell (a formal definition, 1997): A computer program is
  said to learn from experience E with respect to some class of
  tasks T and performance measure P, if its performance at tasks
  in T, as measured by P, improves with experience E.
- Definition of learning by H. Simon (1983): Any change in a
  system that allows it to perform better the second time on
  repetition of the same task or on a task drawn from the same
  population.


Issues Involved in Learning Programs

- Learning involves changes in the learner.
- Learning involves generalisation from experience:
  - Performance should improve not only on repetition of the same
    task but also on similar tasks in the domain.
  - The learner is given limited experience from which to acquire
    knowledge that will generalise correctly to unseen instances
    of the domain. This is the problem of induction.
- Learning algorithms must generalise heuristically - they must
  select the important aspects of their experience.
- Learning algorithms must detect and prevent the possibility that
  changes to the system actually degrade its performance.


ML as a Multidisciplinary Field

Key ideas from the fields that influence ML:
- AI - learning symbolic representations of concepts; using prior
  knowledge together with training data to guide learning
- Bayesian methods - estimating values of unobserved variables
- Computational complexity theory - theoretical bounds on the
  complexity of different learning tasks
- Control theory - procedures that learn to control processes and
  to predict the next state of the controlled process
- Information theory - measures of entropy and information content,
  minimum description length approaches, optimal codes
- Statistics - confidence intervals and statistical tests
- Philosophy - Occam's razor, suggesting that the simplest
  hypothesis is the best
- Psychology and neurobiology - motivation for artificial neural
  networks


A Framework for Learning

- A well-defined learning problem is identified by
  - a class of tasks,
  - a measure of performance to be improved, and
  - the source of experience.
- Example 1: A checkers learning problem
  - Task: playing checkers
  - Performance measure: percent of games won against opponents
  - Training experience: playing practice games against itself



A Framework for Learning

- Example 2: A handwriting recognition learning problem
  - Task: recognising and classifying handwritten words within images
  - Performance measure: percent of words correctly classified
  - Training experience: a database of classified handwritten words
- ML algorithms vary in their goals, in the representation of
  learned knowledge, in the available training data, and in their
  learning strategies
- All learn by searching through a space of possible concepts to
  find an acceptable generalisation



Inductive Concept Learning: Definitions

- What is induction?
  - Induction is reasoning from properties of individuals to
    properties of sets of individuals.
- What is a concept?
  - U - the universal set of objects (observations)
  - a concept C is a subset of objects in U, C ⊆ U
  - Examples:
    - C is the set of all black birds (if U is the set of all birds)
    - C is the set of mammals (if U is the set of all animals)
- Each concept can be thought of as a boolean-valued function
  defined over the set U.



Inductive Concept Learning: Definitions

- What is concept learning?
  - To learn a concept C means to be able to recognise which
    objects in U belong to C.
- What is inductive concept learning?
  - Given a sample of positive and negative training examples of
    the concept C,
  - find a procedure (a predictor, a classifier) able to tell, for
    each x ∈ U, whether x ∈ C.


Concept Learning and the General-To-Specific Ordering

- Concept learning can be cast as searching through a space of
  potential hypotheses for the hypothesis that best fits the
  training data.
- In many cases the search can be organised efficiently by taking
  advantage of a naturally occurring structure over the hypothesis
  space - a general-to-specific ordering of hypotheses.
- Version spaces and the Candidate-Elimination algorithm


Describing Objects and Concepts

- Formal description languages:
  - example space L_E - language describing instances
  - hypothesis space L_H - language describing concepts
- Terminology:
  - hypothesis H - a concept description
  - example e = (ObjectDescription, ClassLabel)
  - positive example e+ - description of a positive instance of C
  - negative example e- - description of a non-instance
  - example set E: E = E+ ∪ E- for learning a simple concept C
  - coverage: H covers e if e satisfies (fulfils, matches) the
    conditions stated in H



Prototypical Concept Learning Task

- Given:
  - Instances X: possible days, each described by the attributes
    Sky, AirTemp, Humidity, Wind, Water, Forecast
  - Target function c: EnjoySport: X → {0,1}
  - Hypotheses H: conjunctions of literals, e.g.
    ⟨?, Cold, High, ?, ?, ?⟩
  - Training examples E: positive and negative examples of the
    target function, ⟨x1, c(x1)⟩, …, ⟨xm, c(xm)⟩
- Determine:
  - A hypothesis h in H such that h(x) = c(x) for all x in E.



Representing Hypotheses

- Many possible representations
- Here, h is a conjunction of constraints on attributes
- Each constraint can be
  - a specific value (e.g. Water = Warm)
  - don't care (e.g. Water = ?)
  - no value allowed (e.g. Water = ∅)
- Example:
    Sky     AirTemp  Humid  Wind    Water  Forecast
  ⟨ Sunny,  ?,       ?,     Strong, ?,     Same ⟩
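This constraint semantics is easy to state in code. A minimal sketch (the function name and the encoding of constraints as a Python tuple with '?' / None markers are illustrative, not from the slides):

```python
# A hypothesis is a tuple of per-attribute constraints:
#   '?'   - don't care, matches any value
#   None  - no value allowed (the empty constraint)
#   other - must equal the instance's value exactly

def covers(h, x):
    """Return True if hypothesis h covers instance x."""
    return all(c == '?' or (c is not None and c == v)
               for c, v in zip(h, x))

h = ('Sunny', '?', '?', 'Strong', '?', 'Same')
print(covers(h, ('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same')))  # True
print(covers(h, ('Rainy', 'Warm', 'Normal', 'Strong', 'Warm', 'Same')))  # False
```

A hypothesis containing None in any position covers no instance at all, which matches the "no value allowed" constraint.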



Training Examples for EnjoySport

Sky    Temp  Humidity  Wind    Water  Forecast  EnjoySport
Sunny  Warm  Normal    Strong  Warm   Same      YES
Sunny  Warm  High      Strong  Warm   Same      YES
Rainy  Cold  High      Strong  Warm   Change    NO
Sunny  Warm  High      Strong  Cool   Change    YES

What is the general concept?



The more_general_than_or_equal_to Relation

- Definition of the more_general_than_or_equal_to relation:
  Let hj and hk be boolean-valued functions defined over X. Then
  hj is more_general_than_or_equal_to hk (hj ≥g hk) iff

      (∀x ∈ X) [(hk(x) = 1) → (hj(x) = 1)]

- In our case the most general hypothesis - that every day is a
  positive example - is represented by

      ⟨?, ?, ?, ?, ?, ?⟩,

  and the most specific possible hypothesis - that no day is a
  positive example - is represented by

      ⟨∅, ∅, ∅, ∅, ∅, ∅⟩.
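For the attribute-tuple representation, the relation can be checked symbolically rather than by quantifying over all instances of X. A sketch (names and the '?' / None encoding are illustrative): per attribute, hj's constraint must be '?' or equal hk's; a hypothesis containing the empty constraint covers nothing, so everything is more general than or equal to it.

```python
def more_general_or_equal(hj, hk):
    """True iff every instance covered by hk is also covered by hj."""
    # If any constraint of hk is None (the empty constraint), hk covers
    # no instance at all, so any hj is trivially >=g hk.
    if any(c is None for c in hk):
        return True
    return all(cj == '?' or cj == ck for cj, ck in zip(hj, hk))

most_general = ('?',) * 6
most_specific = (None,) * 6
h = ('Sunny', '?', '?', 'Strong', '?', '?')
print(more_general_or_equal(most_general, h))   # True
print(more_general_or_equal(h, most_general))   # False
print(more_general_or_equal(h, most_specific))  # True
```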


Example of the Ordering of Hypotheses



Find-S: Finding a Maximally Specific Hypothesis

Algorithm:
1. Initialise h to the most specific hypothesis in H
2. For each positive training instance x
     For each attribute constraint ai in h
       If the constraint ai is satisfied by x
       then do nothing
       else replace ai in h by the next more general constraint
            that is satisfied by x
3. Output hypothesis h
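A minimal sketch of Find-S for the conjunctive representation above, run on the four EnjoySport training examples from the earlier slide (function and variable names are illustrative; None is the empty constraint, '?' is don't-care):

```python
def find_s(examples):
    """examples: list of (instance_tuple, label) pairs."""
    n = len(examples[0][0])
    h = [None] * n                      # start from the most specific hypothesis
    for x, label in examples:
        if not label:                   # Find-S ignores negative examples
            continue
        for i, (c, v) in enumerate(zip(h, x)):
            if c is None:
                h[i] = v                # first positive example: copy its value
            elif c != v and c != '?':
                h[i] = '?'              # conflict: generalise to don't-care
    return tuple(h)

# The four EnjoySport training examples:
E = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), True),
    (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'), True),
    (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), False),
    (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), True),
]
print(find_s(E))  # ('Sunny', 'Warm', '?', 'Strong', '?', '?')
```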


Find-S Algorithm in Action



Conclusions on Find-S Algorithm

- Find-S is guaranteed to output the most specific hypothesis
  within H that is consistent with the positive training examples.
- Issues:
  - Has the learner converged to the only correct target concept
    consistent with the training data?
  - Why prefer the most specific hypothesis?
  - Are the training examples consistent?


Version Space and the Candidate-Elimination Algorithm

- A hypothesis h is consistent with a set of training examples E
  of target concept c iff h(x) = c(x) for each training example
  ⟨x, c(x)⟩ in E.

      Consistent(h, E) ≡ (∀⟨x, c(x)⟩ ∈ E) h(x) = c(x)

- The version space, VS_H,E, with respect to hypothesis space H
  and training examples E, is the subset of hypotheses from H
  consistent with all training examples in E.

      VS_H,E ≡ {h ∈ H | Consistent(h, E)}
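The Consistent(h, E) test is direct to implement for the conjunctive representation of the running example (a sketch; names are illustrative):

```python
def covers(h, x):
    # '?' is don't-care; otherwise the constraint must equal the value.
    return all(c == '?' or c == v for c, v in zip(h, x))

def consistent(h, E):
    """True iff h(x) = c(x) for every <x, c(x)> in E."""
    return all(covers(h, x) == label for x, label in E)

E = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), True),
    (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), False),
]
print(consistent(('Sunny', 'Warm', '?', 'Strong', '?', '?'), E))  # True
print(consistent(('?', '?', '?', '?', '?', '?'), E))              # False
```

The all-don't-care hypothesis fails because it also covers the negative example.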


Version Space Example



Representing the Version Space

- The General boundary, G, of version space VS_H,E is the set of
  its maximally general members.
- The Specific boundary, S, of version space VS_H,E is the set of
  its maximally specific members.
- Every member of the version space lies between these boundaries:

      VS_H,E = {h ∈ H | (∃s ∈ S) (∃g ∈ G) (g ≥ h ≥ s)}

  where x ≥ y means x is more general than or equal to y.


Candidate Elimination Algorithm

G ← the maximally general hypothesis in H
S ← the maximally specific hypothesis in H
For each training example e, do
- If e is a positive example
  - delete from G descriptions not covering e
  - replace S (by generalisation) by the set of least general
    (most specific) descriptions covering e
  - remove redundant elements from S


Candidate Elimination Algorithm

- If e is a negative example
  - delete from S descriptions covering e
  - replace G (by specialisation) by the set of most general
    descriptions not covering e
  - remove redundant elements from G

- The detailed implementation of the operations "compute minimal
  generalisations" and "compute minimal specialisations" of a given
  hypothesis depends on the specific representations chosen for
  instances and hypotheses.
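The incremental boundary updates are fiddly to implement in general, but for a hypothesis space this small the version space and its S and G boundaries can also be computed by brute force, which makes a useful sanity check. A sketch under the conjunctive representation (names illustrative; hypotheses with the empty ∅-constraint are omitted, since they cannot cover any positive example):

```python
from itertools import product

def covers(h, x):
    return all(c == '?' or c == v for c, v in zip(h, x))

def consistent(h, E):
    return all(covers(h, x) == label for x, label in E)

def more_general_or_equal(hj, hk):
    # No empty constraints occur here, so a per-attribute check suffices.
    return all(cj == '?' or cj == ck for cj, ck in zip(hj, hk))

E = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), True),
    (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'), True),
    (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), False),
    (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), True),
]

# Candidate constraints per attribute: every observed value plus '?'.
domains = [sorted({x[i] for x, _ in E} | {'?'}) for i in range(6)]
VS = [h for h in product(*domains) if consistent(h, E)]

# Boundary sets: S has no strictly more specific member, G no strictly
# more general member, within the version space.
S = [h for h in VS
     if not any(g != h and more_general_or_equal(h, g) for g in VS)]
G = [h for h in VS
     if not any(g != h and more_general_or_equal(g, h) for g in VS)]
print(len(VS), len(S), len(G))  # 6 1 2
```

For the four EnjoySport examples this reproduces the textbook result: a six-member version space with S = {⟨Sunny, Warm, ?, Strong, ?, ?⟩} and G = {⟨Sunny, ?, ?, ?, ?, ?⟩, ⟨?, Warm, ?, ?, ?, ?⟩}.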


Converging Boundaries of the G and S sets




Example Trace (1)




Example Trace (2)




Example Trace (3)




Example Trace (4)




How to Classify New Instances?

- A new instance i is classified as positive if every hypothesis
  in the current version space classifies it as positive.
  - Efficient test: i satisfies every member of S.
- A new instance i is classified as negative if every hypothesis
  in the current version space classifies it as negative.
  - Efficient test: i satisfies none of the members of G.
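Both tests can be sketched with the final S and G boundaries that candidate elimination produces for the EnjoySport data (the boundary sets are taken as given here; function names are illustrative):

```python
def covers(h, x):
    return all(c == '?' or c == v for c, v in zip(h, x))

S = [('Sunny', 'Warm', '?', 'Strong', '?', '?')]
G = [('Sunny', '?', '?', '?', '?', '?'),
     ('?', 'Warm', '?', '?', '?', '?')]

def classify(x):
    if all(covers(s, x) for s in S):
        return 'positive'    # every hypothesis in the version space says yes
    if not any(covers(g, x) for g in G):
        return 'negative'    # every hypothesis in the version space says no
    return 'unknown'         # the version space disagrees on x

print(classify(('Sunny', 'Warm', 'Normal', 'Strong', 'Cool', 'Change')))  # positive
print(classify(('Rainy', 'Cold', 'Normal', 'Light', 'Warm', 'Same')))     # negative
```

Instances on which the version space disagrees fall through to 'unknown'; a vote over the version space can still give a graded answer, as the next slide illustrates.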



New Instances to be Classified

A - ⟨Sunny, Warm, Normal, Strong, Cool, Change⟩   (YES)
B - ⟨Rainy, Cold, Normal, Light, Warm, Same⟩      (NO)
C - ⟨Sunny, Warm, Normal, Light, Warm, Same⟩      (Ppos(C) = 3/6)
D - ⟨Sunny, Cold, Normal, Strong, Warm, Same⟩     (Ppos(D) = 2/6)





Remarks on Version Space and Candidate-Elimination

- The algorithm outputs the set of all hypotheses consistent with
  the training examples
  - iff there are no errors in the training data
  - iff there is some hypothesis in H that correctly describes the
    target concept
- The target concept is exactly learned when the S and G boundary
  sets converge to a single identical hypothesis.
- Applications
  - learning regularities in chemical mass spectroscopy
  - learning control rules for heuristic search


Decision Tree Learning

A method for approximating discrete-valued target functions
- the learned function is represented by a decision tree

- Decision tree representation
- Appropriate problems for decision tree learning
- Decision tree learning algorithm
- Entropy, information gain
- Overfitting


Decision Tree Representation

- Representation:
  - an internal node tests some property (attribute)
  - a branch corresponds to an attribute value
  - a leaf node assigns a classification
- Decision trees represent a disjunction of conjunctions of
  constraints on the attribute values of instances:

  (Outlook = Sunny ∧ Humidity = Normal)
  ∨ (Outlook = Overcast)
  ∨ (Outlook = Rain ∧ Wind = Weak)
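Written directly as code, such a disjunction of conjunctions is just a boolean predicate (a sketch; the attribute names follow the PlayTennis example):

```python
# Each disjunct corresponds to one root-to-leaf path with a "Yes" leaf.
def play_tennis(outlook, humidity, wind):
    return ((outlook == 'Sunny' and humidity == 'Normal')
            or outlook == 'Overcast'
            or (outlook == 'Rain' and wind == 'Weak'))

print(play_tennis('Sunny', 'High', 'Weak'))       # False
print(play_tennis('Overcast', 'High', 'Strong'))  # True
```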



Decision Tree Example




Appropriate Problems for Decision Trees

- Instances are represented by attribute-value pairs
- The target function has discrete output values
- A disjunctive hypothesis may be required
- Possibly noisy training data
  - data may contain errors
  - data may contain missing attribute values


Play Tennis: Training Examples

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No


Learning of Decision Trees

Top-Down Induction of Decision Trees
- Algorithm: the ID3 learning algorithm (Quinlan, 1986)

  If all examples from E belong to the same class Cj
  then label the leaf with Cj
  else
    - select the "best" decision attribute A with values
      v1, v2, …, vn for the next node
    - divide the training set S into S1, …, Sn according to the
      values v1, …, vn
    - recursively build subtrees T1, …, Tn for S1, …, Sn
    - generate decision tree T

- Which attribute is best?
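The recursion above can be sketched compactly for discrete attributes, using the information-gain criterion defined on the following slides (function names are illustrative; attributes are referred to by column index, and ties or empty branches are not handled beyond a majority vote):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, a):
    by_value = {}
    for x, y in zip(rows, labels):
        by_value.setdefault(x[a], []).append(y)
    remainder = sum(len(ys) / len(labels) * entropy(ys)
                    for ys in by_value.values())
    return entropy(labels) - remainder

def id3(rows, labels, attrs):
    if len(set(labels)) == 1:              # pure node -> class leaf
        return labels[0]
    if not attrs:                          # no attribute left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, labels, a))
    branches = {}
    for v in sorted({x[best] for x in rows}):
        sub = [i for i, x in enumerate(rows) if x[best] == v]
        branches[v] = id3([rows[i] for i in sub], [labels[i] for i in sub],
                          [a for a in attrs if a != best])
    return (best, branches)

ATTRS = ['Outlook', 'Temperature', 'Humidity', 'Wind']
DATA = [
    (('Sunny',    'Hot',  'High',   'Weak'),   'No'),
    (('Sunny',    'Hot',  'High',   'Strong'), 'No'),
    (('Overcast', 'Hot',  'High',   'Weak'),   'Yes'),
    (('Rain',     'Mild', 'High',   'Weak'),   'Yes'),
    (('Rain',     'Cool', 'Normal', 'Weak'),   'Yes'),
    (('Rain',     'Cool', 'Normal', 'Strong'), 'No'),
    (('Overcast', 'Cool', 'Normal', 'Strong'), 'Yes'),
    (('Sunny',    'Mild', 'High',   'Weak'),   'No'),
    (('Sunny',    'Cool', 'Normal', 'Weak'),   'Yes'),
    (('Rain',     'Mild', 'Normal', 'Weak'),   'Yes'),
    (('Sunny',    'Mild', 'Normal', 'Strong'), 'Yes'),
    (('Overcast', 'Mild', 'High',   'Strong'), 'Yes'),
    (('Overcast', 'Hot',  'Normal', 'Weak'),   'Yes'),
    (('Rain',     'Mild', 'High',   'Strong'), 'No'),
]
rows = [r for r, _ in DATA]
labels = [y for _, y in DATA]
root, branches = id3(rows, labels, list(range(4)))
print(ATTRS[root])           # Outlook
print(branches['Overcast'])  # Yes
```

On the Play Tennis data this picks Outlook at the root (the highest-gain attribute) and immediately closes the all-positive Overcast branch.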


Entropy

- S - a sample of training examples;
  p+ (p-) is the proportion of positive (negative) examples in S
- Entropy(S) = the expected number of bits needed to encode the
  classification of an arbitrary member of S
- Information theory: an optimal-length code assigns -log2(p) bits
  to a message having probability p
- Expected number of bits to encode "+" or "-" of a random member
  of S:

      Entropy(S) ≡ - p+·log2(p+) - p-·log2(p-)

- Generally, for c different classes:

      Entropy(S) ≡ - Σ(i=1..c) pi·log2(pi)
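The formula translates into a few lines of code. A sketch taking a list of class counts (the function name and interface are illustrative; zero counts are skipped since 0·log2(0) is taken as 0):

```python
import math

def entropy(counts):
    """Entropy of a sample given per-class counts, e.g. [9, 5] for [9+, 5-]."""
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c)

print(f"{entropy([9, 5]):.3f}")  # 0.940 - the Play Tennis sample
print(entropy([7, 7]))           # 1.0   - maximally impure
print(entropy([14, 0]))          # 0.0   - pure sample
```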


Entropy

- The entropy function relative to a boolean classification, as
  the proportion of positive examples varies between 0 and 1
- Entropy is a measure of impurity in a collection of examples

Information Gain Search Heuristic

- Gain(S, A) - the expected reduction in entropy caused by
  partitioning the examples of S according to the attribute A
  - a measure of the effectiveness of an attribute in classifying
    the training data

      Gain(S, A) ≡ Entropy(S) - Σ(v ∈ Values(A)) (|Sv|/|S|)·Entropy(Sv)

  - Values(A) - the set of possible values of the attribute A
  - Sv - the subset of S for which attribute A has value v
- The best attribute has maximal Gain(S, A)
- The aim is to minimise the number of tests needed for
  classification.
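Gain can be computed from class counts alone. A sketch (names are illustrative; the Wind split counts used in the check come from the Play Tennis example):

```python
import math

def entropy(counts):
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c)

def gain(total, subsets):
    """total: class counts of S; subsets: class counts of each Sv."""
    n = sum(total)
    remainder = sum(sum(sv) / n * entropy(sv) for sv in subsets)
    return entropy(total) - remainder

# Wind splits S = [9+, 5-] into Sweak = [6+, 2-] and Sstrong = [3+, 3-]:
print(f"{gain([9, 5], [[6, 2], [3, 3]]):.3f}")  # 0.048
```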



Play Tennis: Information Gain

Values(Wind) = {Weak, Strong}

S       = [9+, 5-],  E(S)       = 0.940
Sweak   = [6+, 2-],  E(Sweak)   = 0.811
Sstrong = [3+, 3-],  E(Sstrong) = 1.0

Gain(S, Wind) = E(S) - (8/14)·E(Sweak) - (6/14)·E(Sstrong)
              = 0.940 - (8/14)·0.811 - (6/14)·1.0 = 0.048

Gain(S, Outlook)     = 0.246
Gain(S, Humidity)    = 0.151
Gain(S, Temperature) = 0.029


Remarks on ID3

- ID3 maintains only a single current hypothesis
- No backtracking in its search
  - convergence to a locally optimal solution
- The ID3 strategy prefers shorter trees over longer ones; high
  information gain attributes are placed close to the root
  - the simplest tree should be the least likely to include
    unnecessary constraints
- Overfitting in decision trees - pruning
- Statistically-based search choices
- Robust to noisy data