Machine Learning 21431


Instance Based Learning

Outline

- K-Nearest Neighbor
- Locally weighted learning
- Local linear models
- Radial basis functions


Literature & Software

- T. Mitchell, "Machine Learning", chapter 8, "Instance-Based Learning"
- "Locally Weighted Learning", Christopher Atkeson, Andrew Moore, Stefan Schaal
  ftp://ftp.cc.gatech.edu/pub/people/cga/air.html
- R. Duda et al., "Pattern Classification", chapter 4, "Non-Parametric Techniques"
- Netlab toolbox
  - k-nearest neighbor classification
  - Radial basis function networks



When to Consider Nearest Neighbors

- Instances map to points in R^N
- Fewer than 20 attributes per instance
- Lots of training data

Advantages:
- Training is very fast
- Can learn complex target functions
- Does not lose information

Disadvantages:
- Slow at query time
- Easily fooled by irrelevant attributes

Instance Based Learning

Key idea: just store all training examples <x_i, f(x_i)>.

Nearest neighbor:
- Given a query instance x_q, first locate the nearest training example x_n, then estimate f*(x_q) = f(x_n).

K-nearest neighbor:
- Given x_q, take a vote among its k nearest neighbors (if the target function is discrete-valued).
- Take the mean of the f values of the k nearest neighbors (if real-valued):

  f*(x_q) = (1/k) Σ_{i=1}^{k} f(x_i)
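
As a concrete illustration, here is a minimal NumPy sketch of k-nearest-neighbor prediction, covering both the majority vote for discrete targets and the mean for real-valued targets. The function name, the Euclidean distance choice, and the toy data are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def knn_predict(X_train, y_train, x_q, k=3, classification=True):
    """Predict f(x_q) from the k nearest training examples (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_q, axis=1)   # distance to every stored example
    nn_idx = np.argsort(dists)[:k]                  # indices of the k nearest neighbors
    if classification:
        # discrete-valued target: majority vote among the k neighbors
        values, counts = np.unique(y_train[nn_idx], return_counts=True)
        return values[np.argmax(counts)]
    # real-valued target: mean of the neighbors' f values
    return y_train[nn_idx].mean()

# Tiny usage example with made-up data
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.9, 1.1], [5.0, 5.0]])
y = np.array([0, 1, 1, 0])
print(knn_predict(X, y, np.array([1.0, 0.9]), k=3))   # -> 1
```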

Voronoi Diagram

[Figure: Voronoi partition of the instance space; query point q_f and its nearest neighbor q_i]

3-Nearest Neighbors

[Figure: query point q_f and its 3 nearest neighbors: 2 x, 1 o]

7-Nearest Neighbors

[Figure: query point q_f and its 7 nearest neighbors: 3 x, 4 o]

Behavior in the Limit

Consider p(x), the probability that instance x is classified as positive (1) versus negative (0).

Nearest neighbor:
- As the number of instances → ∞, it approaches the Gibbs algorithm.
- Gibbs algorithm: with probability p(x) predict 1, else 0.
- (In this limit its expected error is therefore at most twice that of the Bayes optimal classifier.)

K-nearest neighbors:
- As the number of instances → ∞ (and k grows large), it approaches the Bayes optimal classifier.
- Bayes optimal: if p(x) > 0.5 predict 1, else 0.

Nearest Neighbor (continuous)

[Figure: 3-nearest-neighbor fit to a one-dimensional continuous target]

Nearest Neighbor (continuous)

[Figure: 5-nearest-neighbor fit]

Nearest Neighbor (continuous)

[Figure: 1-nearest-neighbor fit]

Locally Weighted Regression

- Forms an explicit approximation f*(x) for the region surrounding the query point x_q
- Fit a linear function to the k nearest neighbors
- or fit a quadratic function
- Produces a piecewise approximation of f

Error criteria:

- Squared error over the k nearest neighbors:

  E(x_q) = Σ_{x_i ∈ k nearest neighbors} (f*(x_i) − f(x_i))²

- Distance-weighted error over all neighbors:

  E(x_q) = Σ_i (f*(x_i) − f(x_i))² K(d(x_i, x_q))

Locally Weighted Regression

- Regression means approximating a real-valued target function.
- Residual: the error f*(x) − f(x) in approximating the target function.
- Kernel function: the function of distance used to determine the weight of each training example; in other words, the kernel function is the function K such that w_i = K(d(x_i, x_q)).

Distance Weighted k-NN

Give more weight to neighbors closer to the query point:

  f*(x_q) = Σ_{i=1}^{k} w_i f(x_i) / Σ_{i=1}^{k} w_i

where w_i = K(d(x_q, x_i)) and d(x_q, x_i) is the distance between x_q and x_i.

Instead of using only the k nearest neighbors, all training examples can be used (Shepard's method).
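
A small NumPy sketch of this distance-weighted prediction follows; it uses the inverse-squared-distance kernel with a small offset d0 (so the weight stays finite when the query coincides with a training point). The function name, the offset value, and the toy data are illustrative assumptions.

```python
import numpy as np

def distance_weighted_knn(X_train, y_train, x_q, k=None, d0=1e-8):
    """f*(x_q) = sum_i w_i f(x_i) / sum_i w_i with w_i = 1 / (d0 + d(x_q, x_i))^2.
    If k is None, all training examples are used (Shepard's method);
    otherwise only the k nearest neighbors."""
    dists = np.linalg.norm(X_train - x_q, axis=1)
    targets = y_train
    if k is not None:
        idx = np.argsort(dists)[:k]
        dists, targets = dists[idx], y_train[idx]
    w = 1.0 / (d0 + dists) ** 2            # closer neighbors get larger weights
    return np.dot(w, targets) / w.sum()

# Usage: noisy samples of f(x) = sin(x), query at x = 1.0
rng = np.random.default_rng(0)
X = rng.uniform(0, np.pi, size=(50, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=50)
print(distance_weighted_knn(X, y, np.array([1.0])))   # close to sin(1) ~ 0.84
```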


Distance Weighted Average

Weighting the data:

  f*(x_q) = Σ_i f(x_i) K(d(x_i, x_q)) / Σ_i K(d(x_i, x_q))

The relevance of a data point (x_i, f(x_i)) is measured by the distance d(x_i, x_q) between the query x_q and the input vector x_i.

Weighting the error criterion:

  E(x_q) = Σ_i (f*(x_q) − f(x_i))² K(d(x_i, x_q))

The best estimate f*(x_q) will minimize the cost E(x_q), therefore ∂E(x_q)/∂f*(x_q) = 0.
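
Carrying out this minimization makes the connection between the two views explicit; the short derivation below simply expands the condition stated on the slide and recovers the weighted-average formula.

```latex
% Setting the derivative of the weighted error to zero recovers the weighted average
\frac{\partial E(x_q)}{\partial f^*(x_q)}
  = 2\sum_i \bigl(f^*(x_q) - f(x_i)\bigr)\, K\bigl(d(x_i, x_q)\bigr) = 0
\quad\Longrightarrow\quad
f^*(x_q) = \frac{\sum_i f(x_i)\, K\bigl(d(x_i, x_q)\bigr)}
                {\sum_i K\bigl(d(x_i, x_q)\bigr)}
```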

Kernel Functions

- Distance-weighted NN:  K(d(x_q, x_i)) = 1 / d(x_q, x_i)²
- Distance-weighted NN:  K(d(x_q, x_i)) = 1 / (d_0 + d(x_q, x_i))²
- Distance-weighted NN:  K(d(x_q, x_i)) = exp(−(d(x_q, x_i) / σ_0)²)
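
For reference, a direct NumPy transcription of these three kernels; the function names and default parameter values are mine, and the width symbol σ_0 is reconstructed from the slide.

```python
import numpy as np

def kernel_inverse(d):                 # K(d) = 1 / d^2   (undefined at d = 0)
    return 1.0 / d ** 2

def kernel_inverse_offset(d, d0=0.1):  # K(d) = 1 / (d0 + d)^2, finite at d = 0
    return 1.0 / (d0 + d) ** 2

def kernel_gaussian(d, sigma0=1.0):    # K(d) = exp(-(d / sigma0)^2)
    return np.exp(-(d / sigma0) ** 2)
```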

Example: Mexican Hat

  f(x_1, x_2) = sin(x_1) sin(x_2) / (x_1 x_2)

[Figure: distance-weighted approximation of the Mexican-hat function]

Example: Mexican Hat

[Figure: residual of the approximation]

Locally Weighted Linear Regression

Local linear function:

  f*(x) = w_0 + Σ_n w_n x_n

Error criterion:

  E = Σ_i (w_0 + Σ_n w_n x_in − f(x_i))² K(d(x_i, x_q))

Gradient descent (training rule):

  Δw_n = η Σ_i (f(x_i) − f*(x_i)) x_in K(d(x_i, x_q))

Least-squares solution:

  w = ((KX)^T KX)^{-1} (KX)^T f(X)

with KX the N×M matrix whose rows are the vectors K(d(x_i, x_q)) x_i, and f(X) the vector whose i-th element is f(x_i).
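
Below is a minimal NumPy sketch of locally weighted linear regression with a Gaussian kernel. It follows the square-root-weight formulation used later on the "Linear Local Models" slide (z_i = w_i x_i, v_i = w_i y_i, then an ordinary least-squares solve); the bandwidth value, function names, and toy data are illustrative assumptions.

```python
import numpy as np

def lwlr_predict(X, y, x_q, bandwidth=0.3):
    """Fit a linear model around x_q with weights K = exp(-(d/h)^2) and
    return the local prediction f*(x_q)."""
    Xb = np.hstack([np.ones((len(X), 1)), X])        # add bias column for w_0
    xq = np.concatenate([[1.0], np.atleast_1d(x_q)])
    d = np.linalg.norm(X - x_q, axis=1)
    K = np.exp(-(d / bandwidth) ** 2)                # kernel weight per training point
    sw = np.sqrt(K)[:, None]
    Z, v = sw * Xb, sw[:, 0] * y                     # weighted design matrix and targets
    w, *_ = np.linalg.lstsq(Z, v, rcond=None)        # solves min ||Z w - v||^2
    return xq @ w

# Usage: local fit of noisy sin data at x_q = 1.0
rng = np.random.default_rng(1)
X = rng.uniform(0, np.pi, size=(80, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=80)
print(lwlr_predict(X, y, np.array([1.0])))           # close to sin(1) ~ 0.84
```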





Curse of Dimensionality

Imagine instances described by 20 attributes, but only 2 are relevant to the target function.

Curse of dimensionality: nearest neighbor is easily misled when the instance space is high-dimensional.

One approach (sketched below):

- Stretch the j-th axis by a weight z_j, where z_1, …, z_n are chosen to minimize prediction error.
- Use cross-validation to automatically choose the weights z_1, …, z_n.
- Note that setting z_j to zero eliminates this dimension altogether (feature subset selection).
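
A minimal sketch of the axis-stretching idea, restricted to the feature-subset-selection special case (z_j ∈ {0, 1}) with a greedy leave-one-out search; the function names, the 1-NN predictor, and the greedy search are illustrative assumptions rather than the slide's exact procedure.

```python
import numpy as np

def weighted_nn_predict(X, y, x_q, z):
    """1-NN prediction with per-axis stretching weights z_j."""
    d = np.sqrt(((z * (X - x_q)) ** 2).sum(axis=1))   # stretched Euclidean distance
    return y[np.argmin(d)]

def loo_error(X, y, z):
    """Leave-one-out error rate of the stretched 1-NN predictor."""
    errs = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        errs += weighted_nn_predict(X[mask], y[mask], X[i], z) != y[i]
    return errs / len(X)

def select_axes(X, y):
    """Greedily set z_j to zero whenever dropping that axis lowers LOO error."""
    z = np.ones(X.shape[1])
    for j in range(X.shape[1]):
        trial = z.copy()
        trial[j] = 0.0
        if loo_error(X, y, trial) <= loo_error(X, y, z):
            z = trial
    return z
```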

Linear Global Models

The model is linear in the parameters β_k, which can be estimated using a least-squares algorithm:

  f^(x_i) = Σ_{k=1}^{D} β_k x_ki        or        F(X) = X β

where x_i = (x_1, …, x_D)_i, i = 1..N, with D the input dimension and N the number of data points.

Estimate the β_k by minimizing the error criterion

  E = Σ_{i=1}^{N} (f^(x_i) − y_i)²

Setting the gradient to zero gives the normal equations and their solution:

  (X^T X) β = X^T F(X)

  β = (X^T X)^{-1} X^T F(X)

or, written out componentwise,

  β_k = Σ_{m=1}^{D} Σ_{n=1}^{N} (Σ_l x^T_kl x_lm)^{-1} x^T_mn f(x_n)
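
A compact NumPy illustration of the global linear least-squares fit; np.linalg.lstsq is used instead of forming (X^T X)^{-1} explicitly, which is numerically safer but gives the same solution as the closed form on the slide. The data are made up for the example.

```python
import numpy as np

# Made-up data: N = 100 points, D = 3 inputs, linear target plus noise
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + 0.1 * rng.normal(size=100)

# Global linear model F(X) = X beta, fit by least squares
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)                                        # close to [2.0, -1.0, 0.5]

# Equivalent closed form from the slide: beta = (X^T X)^{-1} X^T y
beta_closed = np.linalg.solve(X.T @ X, X.T @ y)
```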


Linear Regression Example

Linear Local Models

Estimate the parameters β_k such that they locally (near the query point x_q) match the training data, either by

- weighting the data,

    w_i = K(d(x_i, x_q))^{1/2}

  and transforming

    z_i = w_i x_i
    v_i = w_i y_i

- or by weighting the error criterion,

    E = Σ_{i=1}^{N} (x_i^T β − y_i)² K(d(x_i, x_q))

The problem is still linear in β, with least-squares solution

    β = ((WX)^T WX)^{-1} (WX)^T F(X)



Linear Local Model Example

[Figure: query point x_q = 0.35; kernel K(x, x_q); local linear model f^(x) = b_1 x + b_0; prediction f^(x_q) = 0.266]

Linear Local Model Example

Design Issues in Local Regression

- Local model order (constant, linear, quadratic)
- Distance function d
  - feature scaling: d(x, q) = (Σ_{j=1}^{d} m_j (x_j − q_j)²)^{1/2}
  - irrelevant dimensions: m_j = 0
- Kernel function K
- Smoothing parameter (bandwidth) h in K(d(x, q)/h)
  - h = |m|: global bandwidth
  - h = distance to the k-th nearest neighbor point
  - h = h(q): bandwidth depending on the query point
  - h = h_i: bandwidth depending on the stored data points

See the paper by Atkeson [1996], "Locally Weighted Learning".

Radial Basis Function Network

- Global approximation to the target function in terms of a linear combination of local approximations
- Used, e.g., for image classification
- Similar to a back-propagation neural network, but the activation function is Gaussian rather than sigmoid
- Closely related to distance-weighted regression, but "eager" instead of "lazy"

Radial Basis Function Network

[Figure: network with an input layer x_i, a hidden layer of kernel functions, and linear output parameters w_n]

Kernel functions:  K_n(d(x_n, x)) = exp(−(1/2) d(x_n, x)² / σ_n²)

Output:  f(x) = w_0 + Σ_{n=1}^{k} w_n K_n(d(x_n, x))
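
A minimal NumPy sketch of this forward pass, assuming Euclidean distance and a single shared width sigma for all units; the function name and parameters are illustrative.

```python
import numpy as np

def rbf_forward(x, centers, weights, w0, sigma=1.0):
    """f(x) = w0 + sum_n w_n * exp(-1/2 * d(x_n, x)^2 / sigma^2)."""
    d2 = ((centers - x) ** 2).sum(axis=1)      # squared distance to each center x_n
    K = np.exp(-0.5 * d2 / sigma ** 2)         # Gaussian activation of each hidden unit
    return w0 + weights @ K
```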

Training Radial Basis Function Networks

How to choose the center x_n for each kernel function K_n?

- scatter them uniformly across the instance space
- use the distribution of training instances (clustering)

How to train the weights? (see the sketch below)

- Choose the mean x_n and variance σ_n for each K_n by non-linear optimization or EM
- Hold the K_n fixed and use linear regression to compute the optimal weights w_n
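
The second strategy (fixed Gaussian units, then a linear least-squares solve for the output weights) might look as follows; choosing the centers as a random subset of the training data and using a shared sigma are simplifying assumptions for the sketch.

```python
import numpy as np

def train_rbf(X, y, n_units=10, sigma=0.5, seed=0):
    """Fix the centers (random training points), then solve a linear
    least-squares problem for w_0 and the output weights w_n."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_units, replace=False)]
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-0.5 * d2 / sigma ** 2)                  # N x n_units activations
    Phi = np.hstack([np.ones((len(X), 1)), Phi])          # bias column for w_0
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return centers, w

# Usage: fit samples of sin(x) and predict at x = 1.0
X = np.linspace(0, np.pi, 60)[:, None]
y = np.sin(X[:, 0])
centers, w = train_rbf(X, y)
d2_q = ((centers - np.array([1.0])) ** 2).sum(axis=1)
pred = w[0] + w[1:] @ np.exp(-0.5 * d2_q / 0.5 ** 2)
print(pred)                                               # close to sin(1) ~ 0.84
```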


Radial Basis Network Example

[Figure: two Gaussian units, each gating a local linear model such as w_1 x + w_0]

  K_1(d(x_1, x)) = exp(−(1/2) d(x_1, x)² / σ²)

  f^(x) = K_1 (w_1 x + w_0) + K_2 (w_3 x + w_2)

Lazy and Eager Learning

- Lazy: wait for the query before generalizing
  - k-nearest neighbors, weighted linear regression
- Eager: generalize before seeing the query
  - radial basis function networks, decision trees, back-propagation, LOLIMOT
- An eager learner must create a global approximation
- A lazy learner can create local approximations
- If they use the same hypothesis space, a lazy learner can represent more complex functions (e.g. H = linear functions)

Laboration 3

- Distance-weighted average
- Cross-validation for the optimal kernel width
  - Leave-one-out cross-validation (see the sketch after this list):

    f*(x_q) = Σ_{i ≠ q} f(x_i) K(d(x_i, x_q)) / Σ_{i ≠ q} K(d(x_i, x_q))

- Cross-validation for feature subset selection
- Neural network
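
A minimal sketch of leave-one-out cross-validation for the kernel width, reusing a Gaussian-kernel distance-weighted average as the predictor; the candidate widths, function names, and toy data are assumptions for illustration.

```python
import numpy as np

def dwa_predict(X, y, x_q, h):
    """Distance-weighted average with a Gaussian kernel of width h."""
    d = np.linalg.norm(X - x_q, axis=1)
    K = np.exp(-(d / h) ** 2)
    return (K @ y) / K.sum()

def loo_mse(X, y, h):
    """Leave-one-out squared error: each point is predicted from all the others."""
    errs = []
    for q in range(len(X)):
        mask = np.arange(len(X)) != q
        errs.append((dwa_predict(X[mask], y[mask], X[q], h) - y[q]) ** 2)
    return np.mean(errs)

# Pick the kernel width with the lowest leave-one-out error
rng = np.random.default_rng(3)
X = rng.uniform(0, np.pi, size=(60, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=60)
widths = [0.05, 0.1, 0.2, 0.4, 0.8]
best_h = min(widths, key=lambda h: loo_mse(X, y, h))
print(best_h)
```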