
Instance based and Bayesian learning

Kurt Driessens

with slide ideas from a.o. Hendrik Blockeel, Pedro Domingos, David Page, Tom Dietterich and Eamon Keogh

Overview

Nearest neighbor methods


Similarity


Problems: dimensionality of data, efficiency, etc.

Solutions: weighting, edited NN, kD-trees, etc.


Naïve Bayes


Including an introduction to Bayesian ML methods


Nearest Neighbor: A very simple idea

Imagine the world's music collection represented in some space

When you like a song, other songs residing close to it should also be interesting

Picture from Oracle

Nearest Neighbor Algorithm

1. Store all the examples <x_i, y_i>

2. Classify a new example x by finding the stored example x_k that most resembles it and predict that example's class y_k
(Figure: + and - training examples scattered in a 2D space, with an unlabelled query point marked "?")
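A minimal sketch of this 1-NN rule in Python (Euclidean distance over numeric attributes; function names and example data are illustrative, not from the slides):

import math

def nearest_neighbor_predict(examples, x_new):
    """examples: list of (x, y) pairs with x a tuple of numbers; returns the y of the closest x."""
    def euclidean(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    # "Store all the examples" is just keeping the list; prediction scans it.
    x_k, y_k = min(examples, key=lambda ex: euclidean(ex[0], x_new))
    return y_k

# Usage: two classes in 2D
train = [((1.0, 1.0), '+'), ((1.2, 0.8), '+'), ((4.0, 4.2), '-')]
print(nearest_neighbor_predict(train, (0.9, 1.1)))   # -> '+'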

Some properties


Learning is very fast

(although we come back to this later)


No information is lost


Hypothesis space


variable size


complexity of the hypothesis rises with the
number of stored examples

Decision Boundaries

(Figure: Voronoi diagram formed by the stored + and - examples; the decision boundary follows the cell borders)

Boundaries are not computed!

Keeping All Information

Advantage: no details lost

Disadvantage: "details" may be noise

(Figure: a few noisy examples inside the other class's region create small spurious decision regions)

k-Nearest-Neighbor: kNN

To improve robustness against noisy learning examples, use a set of nearest neighbors

For classification: use voting


k-Nearest-Neighbor: kNN (2)

For regression: use the mean

(Figure: examples labelled with numeric target values; the prediction averages the values of the k nearest ones)
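A small sketch of both variants under the same assumptions as above (Euclidean distance, numeric attributes; names are illustrative):

import math
from collections import Counter

def k_nearest(examples, x_new, k):
    """Return the k stored (x, y) pairs closest to x_new (Euclidean distance)."""
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return sorted(examples, key=lambda ex: dist(ex[0], x_new))[:k]

def knn_classify(examples, x_new, k=3):
    votes = Counter(y for _, y in k_nearest(examples, x_new, k))
    return votes.most_common(1)[0][0]               # majority vote

def knn_regress(examples, x_new, k=3):
    neighbors = k_nearest(examples, x_new, k)
    return sum(y for _, y in neighbors) / len(neighbors)   # mean of the target values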

Lazy vs Eager Learning

kNN doesn't do anything until it needs to make a prediction = lazy learner

Learning is fast!

Predictions require work and can be slow

Eager learners start computing as soon as they receive data

Decision tree algorithms, neural networks, …


Learning can be slow


Predictions are usually fast!

Similarity measures

Distance metrics: measure of dis-similarity

E.g. Manhattan, Euclidean or L_n-norm for numerical attributes:

$$d(x,y) = \left( \sum_{i} |x_i - y_i|^n \right)^{1/n}$$

Hamming distance for nominal attributes:

$$d(x,y) = \sum_{i=1}^{n} \delta(x_i, y_i) \quad \text{where}\ \ \delta(x_i,y_i) = 0\ \text{if}\ x_i = y_i,\quad \delta(x_i,y_i) = 1\ \text{if}\ x_i \neq y_i$$
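Each of these measures is a few lines of Python; a sketch (tuple inputs assumed, names illustrative):

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def hamming(x, y):
    # delta(x_i, y_i) = 0 if the values are equal, 1 otherwise
    return sum(1 for a, b in zip(x, y) if a != b)

print(hamming(("red", "small", "round"), ("red", "large", "round")))  # -> 1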
Distance definition = critical!

E.g. comparing humans:

1. 1.85m, 37yrs
2. 1.83m, 35yrs
3. 1.65m, 37yrs

d(1,2) = 2.0001
d(1,3) = 0.2
d(2,3) = 2.00808

1. 185cm, 37yrs
2. 183cm, 35yrs
3. 165cm, 37yrs

d(1,2) = 2.8284
d(1,3) = 20.0
d(2,3) = 18.1107

With heights in meters, person 3 is closest to person 1; with heights in centimeters, person 2 is.
Normalize attribute values

Rescale all dimensions such that the range is equal, e.g. [-1,1] or [0,1]

For the [0,1] range:

$$x_i' = \frac{x_i - m_i}{M_i - m_i}$$

with m_i the minimum and M_i the maximum value for attribute i
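A possible implementation of this rescaling (assuming a list of numeric vectors; constant attributes are mapped to 0 to avoid division by zero; names are illustrative):

def minmax_normalize(dataset):
    """Rescale each attribute of a list of numeric vectors to the [0, 1] range."""
    mins = [min(col) for col in zip(*dataset)]   # m_i per attribute
    maxs = [max(col) for col in zip(*dataset)]   # M_i per attribute
    return [[(x - m) / (M - m) if M > m else 0.0
             for x, m, M in zip(row, mins, maxs)]
            for row in dataset]

people = [[1.85, 37], [1.83, 35], [1.65, 37]]
print(minmax_normalize(people))   # heights and ages now share the same scale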
Curse of dimensionality

Assume a uniformly distributed set of 5000 examples

To capture the 5 nearest neighbors we need:

in 1 dim: 0.1% of the range

in 2 dim: (0.1%)^(1/2) = 3.1% of the range

in n dim: (0.1%)^(1/n) of the range
Curse of Dimensionality (2)

With 5000 points in 10 dimensions, each attribute range must be covered approx. 50% to find 5 neighbors …
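Checking the 50% figure with the formula from the previous slide:

$$(0.001)^{1/10} = 10^{-0.3} \approx 0.50 \approx 50\% \ \text{of each attribute's range}$$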

Curse of Noisy Features

Irrelevant features destroy the metric's meaningfulness

Consider a 1-dim problem where the query x is at the origin, the nearest neighbor x_1 is at 0.1 and the second neighbor x_2 at 0.5 (after normalization)

Now add a uniformly random feature. What is the probability that x_2 becomes the closest neighbor?

approx. 15% !!

Curse of Noisy Features (2)

(Figure: location of x_1 vs x_2 on the informative dimension)

Weighted Distances

Solution: give each attribute a different weight in the distance computation

(Original slide shows a weighting formula computed for each attribute, aggregated over each class and each example in that class)

Selecting attribute weights

Several options:

Experimentally find out which weights work well (cross-validation)

Other solutions, e.g. (Langley, 1996):

1. Normalize attributes (to scale 0-1)

2. Then select weights according to "average attribute similarity within class"

More distances

Strings

Levenshtein distance / edit distance = minimal number of changes needed to change one word into the other

Allowed edits/changes:

1. delete a character
2. insert a character
3. change a character (not used by some other edit-distances)
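A standard dynamic-programming sketch of this edit distance in Python (row-by-row table; names are illustrative):

def levenshtein(s, t):
    """Minimal number of character deletions, insertions and changes to turn s into t."""
    prev = list(range(len(t) + 1))            # distances from "" to prefixes of t
    for i, cs in enumerate(s, start=1):
        curr = [i]                            # deleting the first i characters of s
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                  # delete cs
                            curr[j - 1] + 1,              # insert ct
                            prev[j - 1] + (cs != ct)))    # change cs into ct (or keep)
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))   # -> 3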

Even more distances









Given two time series:

Q = q_1 … q_n
C = c_1 … c_n

Euclidean distance:

$$D(Q,C) = \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$$

Start and end times are critical!

(Figure: Q and C compared point by point; a shifted series R yields a large D(Q,R) even when its shape matches Q)

Sequence distances (2)

Dynamic Time Warping

Dimensionality reduction

Fixed Time Axis: sequences are aligned "one to one".

Warped Time Axis: nonlinear alignments are possible.
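A basic O(n*m) sketch of DTW in Python (squared point-wise cost, full warping window; names are illustrative):

import math

def dtw_distance(q, c):
    """Dynamic Time Warping distance between two numeric sequences."""
    n, m = len(q), len(c)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            # a point may match one point, or be stretched/compressed against several
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return math.sqrt(D[n][m])

# A shifted copy of the same shape: small DTW distance, large Euclidean distance
print(dtw_distance([0, 1, 2, 1, 0], [0, 0, 1, 2, 1, 0]))   # -> 0.0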

Distance-weighted kNN

k places an arbitrary border on example relevance

Idea: give higher weight to closer instances

Can now use all training instances instead of only k ("Shepard's method")






$$\hat{f}(x_q) = \frac{\sum_{i=1}^{k} w_i\, f(x_i)}{\sum_{i=1}^{k} w_i} \qquad \text{with} \quad w_i \equiv \frac{1}{d(x_q, x_i)^2}$$

In high-dimensional spaces, a function of d that "goes to zero fast enough" is needed. (Again the "curse of dimensionality".)
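A sketch of Shepard's method as given by the formula above (the dist argument is any distance function; an exact match is returned directly to avoid division by zero; names are illustrative):

def distance_weighted_prediction(examples, x_q, dist):
    """Weighted mean over all training instances with w_i = 1 / d(x_q, x_i)^2."""
    num, den = 0.0, 0.0
    for x_i, f_i in examples:
        d = dist(x_q, x_i)
        if d == 0:                 # exact match: return its target value directly
            return f_i
        w = 1.0 / d ** 2
        num += w * f_i
        den += w
    return num / den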

Fast Learning - Slow Predictions

Efficiency

For each prediction, kNN needs to compute the distance (i.e. compare all attributes) for ALL stored examples

Prediction time = linear in the size of the data-set

For large training sets and/or complex distances, this can be too slow to be practical

(1) Edited k-nearest neighbor

Use only part of the training data

Less storage

Order dependent

Sensitive to noisy data

More advanced alternatives exist (= IB3)

(2) Pipeline filters

Reduce time spent on far-away examples by using more efficient distance-estimates first

Eliminate most examples using rough distance approximations

Compute more precise distances for examples in the neighborhood

(3) kD-trees

Use a clever data-structure to eliminate the need to compute all distances

kD-trees are similar to decision trees except:

splits are made on the median/mean value of the dimension with highest variance

each node stores one data point, leaves can be empty

Example kD-tree

(Figure: example kD-tree over a 2D point set)

Use a form of A* search using the minimum distance to a node as an underestimate of the true closest distance

Finds the closest neighbor in logarithmic (depth of tree) time
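In practice one typically reuses a library implementation; a small usage sketch with SciPy's scipy.spatial.KDTree (assumes NumPy and SciPy are available; the data is random for illustration):

import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
X = rng.uniform(size=(5000, 3))        # 5000 stored examples in 3 dimensions
tree = KDTree(X)                       # build the kD-tree once (the "learning" step)

query = np.array([0.5, 0.5, 0.5])
dist, idx = tree.query(query, k=5)     # 5 nearest neighbors without scanning all points
print(idx, dist)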

kD-trees (cont.)

Building a good kD-tree may take some time

Learning time is no longer 0

Incremental learning is no longer trivial

kD-tree will no longer be balanced

re-building the tree is recommended when the max-depth becomes larger than 2x the minimal required depth (= log(N) with N training examples)

Cover trees are more advanced, more complex, and more efficient!!

(4) Using Prototypes

The rough decision surfaces of nearest neighbor can sometimes be considered a disadvantage

Solve two problems at once by using prototypes

= representative for a whole group of instances

(Figure: clusters of + and - examples, each summarized by a single prototype point)

Prototypes (cont.)

Prototypes can be:

Single instance, replacing a group

Other structure (e.g., rectangle, rule, ...) -> in this case: need to define distance

(Figure: a group of + examples replaced by a rectangle prototype)

Recommender Systems through instance based learning

Movie                  Alice (1)  Bob (2)  Carol (3)  Dave (4)  x1 (romance)  x2 (action)
Love at last               5         5        0          0          0.9           0
Romance forever            5         ?        ?          0          1.0           0.01
Cute puppies of love       ?         4        0          ?          0.99          0
Nonstop car chases         0         0        5          4          0.1           1.0
Swords vs. karate          0         0        5          ?          0             0.9

Predict ratings for films users have not yet seen (or rated).

Recommender Systems

Predict through instance based regression:
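One way to read this is per-user kNN regression over the movie feature vectors; a hedged sketch under that assumption (ratings and features taken from the table above, names illustrative):

import math

def predict_rating(rated, x_new, k=2):
    """Average the ratings of the k movies (among those the user has rated)
    whose feature vectors are closest to x_new."""
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbors = sorted(rated, key=lambda rv: dist(rv[0], x_new))[:k]
    return sum(r for _, r in neighbors) / len(neighbors)

# Alice's known ratings as (feature vector, rating) pairs, features = (romance, action)
alice = [((0.9, 0.0), 5), ((1.0, 0.01), 5), ((0.1, 1.0), 0), ((0.0, 0.9), 0)]
print(predict_rating(alice, (0.99, 0.0)))   # "Cute puppies of love" -> 5.0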

Some Comments on k-NN

Positive

Easy to implement

Good "baseline" algorithm / experimental control

Incremental learning easy

Psychologically plausible model of human memory

Negative

Led astray by irrelevant features

No insight into domain (no explicit model)

Choice of distance function is problematic

Doesn't exploit/notice structure in examples

Summary


Generalities of instance based learning


Basic idea, (dis)advantages, Voronoi diagrams, lazy vs. eager learning

Various instantiations

kNN, distance-weighted methods, ...


Rescaling attributes


Use of prototypes


Bayesian learning

This is going to be very introductory



Describing (results of) learning processes


MAP and ML hypotheses


Developing practical learning algorithms


Naïve Bayes learner


application: learning to classify texts


Learning Bayesian belief networks

Bayesian approaches

Several roles for probability theory in machine learning:

describing existing learners

e.g. compare them with an "optimal" probabilistic learner

developing practical learning algorithms

e.g. the "Naïve Bayes" learner

Bayes' theorem plays a central role


Basics of probability


P(A): probability that A happens

P(A|B): probability that A happens, given that B happens ("conditional probability")

Some rules:

complement: P(not A) = 1 - P(A)

disjunction: P(A or B) = P(A) + P(B) - P(A and B)

conjunction: P(A and B) = P(A) P(B|A) = P(A) P(B) if A and B independent

total probability: P(A) = Σ_i P(A|B_i) P(B_i), with the B_i mutually exclusive

Bayes’ Theorem




P(A|B) = P(B|A) P(A) / P(B)


Mainly 2 ways of using Bayes’ theorem:


Applied to learning a hypothesis h from data D:


P(h|D) = P(D|h) P(h) / P(D) ~ P(D|h)P(h)


P(h): a priori probability that h is correct


P(h|D): a posteriori probability that h is correct


P(D): probability of obtaining data D


P(D|h): probability of obtaining data D if h is correct


Applied to classification of a single example e:


P(class|e) = P(e|class)P(class)/P(e)

Bayes’ theorem: Example

Example:

assume some lab test for a disease has a 98% chance of giving a positive result if the disease is present, and a 97% chance of giving a negative result if the disease is absent

assume furthermore that 0.8% of the population has this disease

given a positive result, what is the probability that the disease is present?

P(Dis|Pos) = P(Pos|Dis) P(Dis) / P(Pos) = 0.98*0.008 / (0.98*0.008 + 0.03*0.992)
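Filling in the numbers (the small prior keeps the posterior low despite the accurate test):

$$P(Dis \mid Pos) = \frac{0.98 \cdot 0.008}{0.98 \cdot 0.008 + 0.03 \cdot 0.992} = \frac{0.00784}{0.0376} \approx 0.21$$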

MAP and ML hypotheses

Task: Given the current data D and some hypothesis space H, return the hypothesis h in H that is most likely to be correct.

Note: this h is optimal in a certain sense

no method can exist that finds the correct h with higher probability

MAP hypothesis

Given some data D and a hypothesis space H, find the hypothesis h ∈ H that has the highest probability of being correct; i.e., P(h|D) is maximal

This hypothesis is called the maximum a posteriori hypothesis h_MAP:

h_MAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h)P(h)/P(D) = argmax_{h ∈ H} P(D|h)P(h)

the last equality holds because P(D) is constant

So: we need P(D|h) and P(h) for all h ∈ H to compute h_MAP


ML hypothesis


P(h): a priori probability that h is correct

What if no preferences for one h over another?

Then assume P(h) = P(h') for all h, h' ∈ H

Under this assumption h_MAP is called the maximum likelihood hypothesis h_ML:

h_ML = argmax_{h ∈ H} P(D|h)   (because P(h) constant)

How to find h_MAP or h_ML?

brute force method: compute P(D|h), P(h) for all h ∈ H

usually not feasible

Naïve Bayes classifier

Simple & popular classification method


Based on Bayes’ rule + assumption of
conditional independence


assumption often violated in practice


even then, it usually works well


Example application: classification of text
documents

Classification using Bayes rule

Given attribute values, what is the most probable value of the target variable?

$$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n) = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)}{P(a_1, a_2, \ldots, a_n)} = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)$$

Problem: too much data needed to estimate P(a_1, …, a_n | v_j)
The Naïve Bayes classifier

Naïve Bayes assumption: attributes are independent, given the class

P(a_1,…,a_n|v_j) = P(a_1|v_j) P(a_2|v_j) … P(a_n|v_j)

also called conditional independence (given the class)

Under that assumption, v_MAP becomes

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$$
Learning a Naïve Bayes classifier

To learn such a classifier: just estimate P(v_j) and P(a_i|v_j) from data:

$$v_{NB} = \arg\max_{v_j \in V} \hat{P}(v_j) \prod_i \hat{P}(a_i \mid v_j)$$

How to estimate?

simplest: standard estimate from statistics

estimate probability from sample proportion

e.g., estimate P(A|B) as count(A and B) / count(B)

in practice, something more complicated is needed…
Estimating probabilities

Problem:

What if attribute value a_i is never observed for class v_j?

Estimate P(a_i|v_j) = 0 because count(a_i and v_j) = 0?

Effect is too strong: this 0 makes the whole product 0!

Solution: use the m-estimate

$$\hat{P}(a_i \mid v_j) = \frac{n_c + m\,p}{n + m}$$

interpolates between the observed value n_c/n and an a priori estimate p -> the estimate may get close to 0 but never 0

m is the weight given to the a priori estimate
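The m-estimate itself is a one-liner; a sketch with argument names following the formula above:

def m_estimate(n_c, n, p, m):
    """m-estimate of P(a_i | v_j): interpolate between the sample proportion n_c/n
    and a prior estimate p, where m is the weight given to the prior."""
    return (n_c + m * p) / (n + m)

print(m_estimate(n_c=0, n=20, p=0.1, m=5))   # -> 0.02, never exactly 0 despite zero counts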
Learning to classify text

Example application:

given the text of a newsgroup article, guess which newsgroup it is taken from

Naïve Bayes turns out to work well on this application

How to apply NB?

Key issue: how do we represent examples? What are the attributes?

Representation

Binary classification (+/-) or multiple classes possible

Attributes = word frequencies

Vocabulary = all words that occur in the learning task

# attributes = size of vocabulary

Attribute value = word count or frequency in the text (using m-estimate)

= "Bag of Words" representation

Algorithm

procedure learn_naïve_bayes_text(E: set of articles, V: set of classes)

  Voc = all words and tokens occurring in E

  estimate P(v_j) and P(w_k|v_j) for all w_k in E and v_j in V:

    N_j = number of articles of class j
    N = number of articles
    P(v_j) = N_j / N
    n_kj = number of times word w_k occurs in texts of class j
    n_j = number of words in class j (counting doubles)
    P(w_k|v_j) = (n_kj + 1) / (n_j + |Voc|)

procedure classify_naïve_bayes_text(A: article)

  remove from A all words/tokens that are not in Voc

  return argmax_{v_j ∈ V} P(v_j) Π_i P(a_i|v_j)
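A compact Python sketch of the two procedures above (bag-of-words counts, the (n_kj+1)/(n_j+|Voc|) estimate from the slide, log-probabilities to avoid underflow; the toy data is illustrative):

import math
from collections import Counter, defaultdict

def learn_naive_bayes_text(articles):
    """articles: list of (list_of_words, class_label). Returns (vocabulary, priors, cond_probs)."""
    voc = {w for words, _ in articles for w in words}
    n_articles = Counter(label for _, label in articles)
    word_counts = defaultdict(Counter)              # per class: word -> n_kj
    for words, label in articles:
        word_counts[label].update(words)
    priors = {v: n_articles[v] / len(articles) for v in n_articles}
    cond = {}
    for v in n_articles:
        n_j = sum(word_counts[v].values())          # words in class j, counting doubles
        cond[v] = {w: (word_counts[v][w] + 1) / (n_j + len(voc)) for w in voc}
    return voc, priors, cond

def classify_naive_bayes_text(article_words, voc, priors, cond):
    words = [w for w in article_words if w in voc]  # drop words/tokens not in Voc
    def log_score(v):
        return math.log(priors[v]) + sum(math.log(cond[v][w]) for w in words)
    return max(priors, key=log_score)

# Toy usage
docs = [("cheap pills buy now".split(), "spam"),
        ("meeting agenda for monday".split(), "ham")]
model = learn_naive_bayes_text(docs)
print(classify_naive_bayes_text("buy cheap pills".split(), *model))   # -> 'spam'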

Some (old) experimental results:


1000 articles taken from 20 newsgroups


guess correct newsgroup for unseen documents


89% classification accuracy with previous
approach



Note: more recent approaches based on SVMs,
… have been reported to work better


But Naïve Bayes still used in practice, e.g., for
spam detection

Bayesian Belief Networks

Consider two extremes of spectrum:


guessing joint probability distribution


would yield optimal classifier


but infeasible in practice (too much data needed)


Naïve Bayes


much more feasible


but strong assumptions of conditional independence


Is there something in between?


make some independence assumptions, but only
where reasonable

Bayesian belief networks

A Bayesian belief network consists of:

1: graph

intuitively: indicates which variables "directly influence" which other variables

arrow from A to B: A has a direct effect on B

parents(X) = set of all nodes directly influencing X

formally: each node is conditionally independent of each of its non-descendants, given its parents

conditional independence: cf. Naïve Bayes

X conditionally independent of Y given Z iff P(X|Y,Z) = P(X|Z)

2: conditional probability tables

for each node X: P(X|parents(X)) is given

Example


Burglary or earthquake may cause the alarm to go off

The alarm going off may cause one of the neighbours to call

(Network: Burglary -> Alarm, Earthquake -> Alarm, Alarm -> John calls, Alarm -> Mary calls)

Conditional probability tables:

P(B):   B 0.05   -B 0.95
P(E):   E 0.01   -E 0.99

P(A | B,E):
        B,E    B,-E   -B,E   -B,-E
 A      0.9    0.8    0.4    0.01
-A      0.1    0.2    0.6    0.99

P(J | A):
        A      -A
 J      0.8    0.1
-J      0.2    0.9

P(M | A):
        A      -A
 M      0.9    0.2
-M      0.1    0.8

Network topology usually reflects direct causal influences

other structure also possible

but may render the network more complex

(Figure: two alternative networks over the same variables Burglary, Earthquake, Alarm, John calls, Mary calls: one following the causal direction, one with a different structure)

The graph + conditional probability tables allow us to construct the joint probability distribution of all variables:

$$P(X_1, X_2, \ldots, X_n) = \prod_i P(X_i \mid parents(X_i))$$

In other words: a Bayesian belief network carries full information on the joint probability distribution
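A sketch of this factorisation for the burglary/earthquake example above, using the numbers from its conditional probability tables (the dictionary encoding is just one possible choice):

# P(B, E, A, J, M) = P(B) P(E) P(A|B,E) P(J|A) P(M|A)
P_B = {True: 0.05, False: 0.95}
P_E = {True: 0.01, False: 0.99}
P_A = {(True, True): 0.9, (True, False): 0.8, (False, True): 0.4, (False, False): 0.01}
P_J = {True: 0.8, False: 0.1}    # P(J=true | A)
P_M = {True: 0.9, False: 0.2}    # P(M=true | A)

def joint(b, e, a, j, m):
    p_a = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p_j = P_J[a] if j else 1 - P_J[a]
    p_m = P_M[a] if m else 1 - P_M[a]
    return P_B[b] * P_E[e] * p_a * p_j * p_m

# e.g. burglary, no earthquake, alarm rings, John calls, Mary doesn't:
print(joint(True, False, True, True, False))   # 0.05 * 0.99 * 0.8 * 0.8 * 0.1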

Inference

Given values for certain nodes, infer the probability distribution for the values of other nodes

The general algorithm is quite complicated

See, e.g., Russell & Norvig, 1995: Artificial Intelligence, a Modern Approach

General case

In general: inference is NP-complete

approximating methods exist, e.g. Monte-Carlo

(Figure legend: nodes to be predicted, evidence (observed) nodes, unobserved nodes)

Learning Bayesian networks

Assume the structure of the network is given:

only the conditional probability tables need to be learnt

training examples may include values for all variables, or just for some of them

when all variables are observable:

estimating probabilities is as easy as for Naïve Bayes

e.g. estimate P(A|B,C) as count(A,B,C)/count(B,C)

when not all variables are observable:

methods based on gradient descent or EM

When the structure of the network is not given:

search for structure + tables

e.g. propose a structure, learn the tables

propose a change to the structure, relearn, see whether the results are better

active research topic

To remember


Importance of Bayes’ theorem


MAP, ML, MDL

definitions, characterising learners from this perspective, relationship MDL-MAP

Bayes optimal classifier, Gibbs classifier

Naïve Bayes: how it works, assumptions made, application to text classification

Bayesian networks: representation, inference, learning