Machine Learning 21431
Instance Based Learning
Outline
K-Nearest Neighbor
Locally weighted learning
Local linear models
Radial basis functions
Literature & Software
T. Mitchell, “Machine Learning”, chapter 8,
“Instance-Based Learning”
“Locally Weighted Learning”, Christopher
Atkeson, Andrew Moore, Stefan Schaal
ftp://ftp.cc.gatech.edu/pub/people/cga/air.html
R. Duda et al., “Pattern Recognition”, chapter 4,
“Non-Parametric Techniques”
Netlab toolbox
k-nearest neighbor classification
Radial basis function networks
When to Consider Nearest Neighbors
Instances map to points in R^N
Less than 20 attributes per instance
Lots of training data
Advantages:
Training is very fast
Can learn complex target functions
Does not lose information
Disadvantages:
Slow at query time
Easily fooled by irrelevant attributes
Instance Based Learning
Key idea: just store all training examples <x_i, f(x_i)>
Nearest neighbor:
Given query instance x_q, first locate the nearest
training example x_n, then estimate f(x_q) = f(x_n)
K-nearest neighbor:
Given x_q, take a vote among its k nearest neighbors
(if discrete-valued target function)
Take the mean of the f values of the k nearest neighbors
(if real-valued): f(x_q) = Σ_{i=1}^k f(x_i) / k
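Since training just stores the examples, all work happens at query time. Both variants above fit in a few lines; a minimal NumPy sketch (function name and toy data are illustrative, not from the slides):

```python
import numpy as np

def knn_predict(X_train, y_train, x_q, k=3, classify=True):
    """k-nearest-neighbor prediction for a single query point x_q.

    Vote among neighbors for discrete targets, average for real-valued ones.
    """
    # Euclidean distances from the query to every stored training example
    dists = np.linalg.norm(X_train - x_q, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k closest points
    if classify:
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        return labels[np.argmax(counts)]     # majority vote
    return y_train[nearest].mean()           # mean of the neighbors' f values

# toy data: 1 = positive, 0 = negative
X = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [1.1, 0.9]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.2, 0.2]), k=3))  # -> 0
```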
Voronoi Diagram
[Figure: Voronoi cells induced by the training set; query point q and its nearest neighbor]
3-Nearest Neighbors
[Figure: query point q and its 3 nearest neighbors: 2 x, 1 o]
7-Nearest Neighbors
[Figure: query point q and its 7 nearest neighbors: 3 x, 4 o]
Behavior in the Limit
Let p(x) be the probability that instance x is
classified as positive (1) versus negative (0)
Nearest neighbor:
As the number of instances grows to infinity,
approaches the Gibbs algorithm
Gibbs algorithm: with probability p(x) predict 1, else 0
K-nearest neighbors:
As the number of instances grows to infinity,
approaches the Bayes optimal classifier
Bayes optimal: if p(x) > 0.5 predict 1, else 0
Nearest Neighbor (continuous)
[Figures: 3-, 5-, and 1-nearest-neighbor fits to a real-valued target function]
Locally Weighted Regression
Forms an explicit approximation f*(x) for the region
surrounding the query point x_q
Fit a linear function to the k nearest neighbors
Or fit a quadratic function
Produces a piecewise approximation of f
Squared error over the k nearest neighbors:
E(x_q) = Σ_{x_i ∈ k nearest neighbors} (f*(x_i) - f(x_i))^2
Distance-weighted error over all neighbors:
E(x_q) = Σ_i (f*(x_i) - f(x_i))^2 K(d(x_i, x_q))
Locally Weighted Regression
Regression means approximating a real-valued
target function
Residual is the error f*(x) - f(x)
in approximating the target function
Kernel function is the function of distance that is
used to determine the weight of each training
example. In other words, the kernel function is the
function K such that w_i = K(d(x_i, x_q))
Distance Weighted k-NN
Give more weight to neighbors closer to the
query point
f*(x_q) = Σ_{i=1}^k w_i f(x_i) / Σ_{i=1}^k w_i
where w_i = K(d(x_q, x_i))
and d(x_q, x_i) is the distance between x_q and x_i
Instead of only the k nearest neighbors, use all
training examples (Shepard’s method)
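Shepard’s method as described above can be sketched directly (the inverse-square kernel anticipates the Kernel Functions slide; the epsilon guard is an added assumption to avoid division by zero at a stored point):

```python
import numpy as np

def shepard_predict(X_train, y_train, x_q,
                    kernel=lambda d: 1.0 / (d**2 + 1e-12)):
    """Distance-weighted average over ALL training points (Shepard's method)."""
    dists = np.linalg.norm(X_train - x_q, axis=1)
    w = kernel(dists)                       # w_i = K(d(x_q, x_i))
    return np.sum(w * y_train) / np.sum(w)  # f*(x_q) = sum w_i f(x_i) / sum w_i

X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 1.0, 4.0])
print(shepard_predict(X, y, np.array([1.0])))  # dominated by the point at x=1
```

With an inverse-square kernel, a query that coincides with a training point reproduces that point’s target value almost exactly.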
Distance Weighted Average
Weighting the data:
f*(x_q) = Σ_i f(x_i) K(d(x_i, x_q)) / Σ_i K(d(x_i, x_q))
Relevance of a data point (x_i, f(x_i)) is measured
by calculating the distance d(x_i, x_q) between
the query x_q and the input vector x_i
Weighting the error criterion:
E(x_q) = Σ_i (f*(x_q) - f(x_i))^2 K(d(x_i, x_q))
The best estimate f*(x_q) will minimize the cost
E(x_q), therefore ∂E(x_q)/∂f*(x_q) = 0
Kernel Functions
Inverse-square distance-weighted NN:
K(d(x_q, x_i)) = 1 / d(x_q, x_i)^2
Shifted inverse-square distance-weighted NN:
K(d(x_q, x_i)) = 1 / (d_0 + d(x_q, x_i))^2
Gaussian distance-weighted NN:
K(d(x_q, x_i)) = exp(-(d(x_q, x_i) / σ_0)^2)
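The three kernels can be written directly as weight functions of distance; the d_0 and σ_0 defaults below are illustrative placeholders, not values from the slides:

```python
import numpy as np

# The three kernel choices from the slides, as weight functions of distance.
def k_inverse(d):           return 1.0 / d**2             # 1 / d^2
def k_shifted(d, d0=0.1):   return 1.0 / (d0 + d)**2      # 1 / (d0 + d)^2
def k_gauss(d, sigma0=1.0): return np.exp(-(d / sigma0)**2)

d = np.array([0.1, 0.5, 1.0, 2.0])
for k in (k_inverse, k_shifted, k_gauss):
    print(k.__name__, np.round(k(d), 3))
```

Note the shifted form stays finite at d = 0, while the plain inverse-square kernel diverges there.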
Example: Mexican Hat
f(x_1, x_2) = sin(x_1) sin(x_2) / (x_1 x_2)
[Figure: locally weighted approximation]
[Figure: residual]
Locally Weighted Linear Regression
Local linear function:
f*(x) = w_0 + Σ_n w_n x_n
Error criterion:
E = Σ_i (w_0 + Σ_n w_n x_in - f(x_i))^2 K(d(x_i, x_q))
Gradient descent:
Δw_n = η Σ_i (f(x_i) - f*(x_i)) x_in K(d(x_i, x_q))
Least squares solution:
w = ((KX)^T KX)^-1 (KX)^T f(X)
with KX the NxM matrix whose rows are the weighted input
vectors K(d(x_i, x_q)) x_i, and f(X) the vector whose i-th
element is f(x_i)
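A sketch of the least squares solution above, applying the square root of the kernel to both inputs and targets so that an ordinary least squares solve minimizes the kernel-weighted error (function name, Gaussian kernel choice, and bandwidth are illustrative assumptions):

```python
import numpy as np

def lwlr_predict(X, y, x_q, sigma=0.5):
    """Locally weighted linear regression evaluated at one query point x_q."""
    n = X.shape[0]
    A = np.hstack([np.ones((n, 1)), X])      # prepend bias column for w_0
    a_q = np.concatenate([[1.0], x_q])
    d = np.linalg.norm(X - x_q, axis=1)
    k = np.exp(-(d / sigma) ** 2)            # kernel weights K(d(x_i, x_q))
    s = np.sqrt(k)                           # sqrt weights on rows and targets
    # solves w = ((KX)^T KX)^-1 (KX)^T Ky via a numerically stabler lstsq call
    w, *_ = np.linalg.lstsq(s[:, None] * A, s * y, rcond=None)
    return a_q @ w

# samples of the line y = 2x + 1 should be recovered locally
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (50, 1))
y = 2 * X[:, 0] + 1
print(lwlr_predict(X, y, np.array([0.5])))  # -> 2.0
```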
Curse of Dimensionality
Imagine instances described by 20 attributes, but only
a few are relevant to the target function
Curse of dimensionality: nearest neighbor is easily
misled when the instance space is high-dimensional
One approach:
Stretch the j-th axis by weight z_j, where z_1,...,z_n are
chosen to minimize prediction error
Use cross-validation to automatically choose the weights
z_1,...,z_n
Note that setting z_j to zero eliminates this dimension
altogether (feature subset selection)
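A small illustration of the axis-stretching idea (hypothetical data; setting z_3 = 0 performs feature subset selection on the irrelevant axis):

```python
import numpy as np

def weighted_dist(x, q, z):
    """Distance with per-axis stretch weights z_j; z_j = 0 drops a feature."""
    return np.sqrt(np.sum((z * (x - q)) ** 2))

x = np.array([1.0, 2.0, 100.0])   # third attribute is irrelevant noise
q = np.array([1.0, 2.5, -50.0])
print(weighted_dist(x, q, np.array([1.0, 1.0, 1.0])))  # dominated by noise axis
print(weighted_dist(x, q, np.array([1.0, 1.0, 0.0])))  # -> 0.5, noise ignored
```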
Linear Global Models
The model is linear in the parameters b_k, which
can be estimated using a least squares algorithm:
f^(x_i) = Σ_{k=1}^D b_k x_ki    or    F(X) = X b
where x_i = (x_1,...,x_D)_i, i = 1..N, with D the input
dimension and N the number of data points
Estimate the b_k by minimizing the error criterion
E = Σ_{i=1}^N (f^(x_i) - y_i)^2
(X^T X) b = X^T F(X)
b = (X^T X)^-1 X^T F(X)
In components: b_k = Σ_{m=1}^D [(X^T X)^-1]_km Σ_{n=1}^N x_nm f(x_n)
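The normal equations above, applied to data generated from a known linear map (variable names are illustrative):

```python
import numpy as np

# Global linear model F(X) = X b, fit via the normal equations
# (X^T X) b = X^T y.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (100, 2))       # N = 100 points, D = 2 inputs
y = X @ np.array([3.0, -2.0])          # target from a known linear map

b = np.linalg.solve(X.T @ X, X.T @ y)  # solve the normal equations
print(b)  # -> [ 3. -2.]
```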
Linear Regression Example
Linear Local Models
Estimate the parameters b_k such that they locally
(near the query point x_q) match the training data,
either by weighting the data:
w_i = K(d(x_i, x_q))^(1/2)
and transforming z_i = w_i x_i, v_i = w_i y_i
or by weighting the error criterion:
E = Σ_{i=1}^N (x_i^T b - y_i)^2 K(d(x_i, x_q))
This is still linear in b, with least squares solution
b = ((WX)^T WX)^-1 (WX)^T W F(X)
Linear Local Model Example
Query point x_q = 0.35
Kernel K(x, x_q)
Local linear model: f^(x) = b_1 x + b_0
f^(x_q) = 0.266
[Figure: kernel weighting and the local linear fit around x_q]
Design Issues in Local Regression
Local model order (constant, linear, quadratic)
Distance function d
feature scaling: d(x, q) = (Σ_{j=1}^d m_j (x_j - q_j)^2)^(1/2)
irrelevant dimensions: m_j = 0
Kernel function K
Smoothing parameter: bandwidth h in K(d(x, q)/h)
h constant: global bandwidth
h = distance to the k-th nearest neighbor point
h = h(q), depending on the query point
h = h_i, depending on the stored data points
See the paper by Atkeson et al. [1996], “Locally Weighted Learning”
Radial Basis Function Network
Global approximation to the target function in terms
of a linear combination of local approximations
Used, e.g., for image classification
Similar to a back-propagation neural network, but the
activation function is Gaussian rather than
sigmoid
Closely related to distance-weighted regression,
but “eager” instead of “lazy”
Radial Basis Function Network
input layer: x_i
kernel functions: K_n(d(x_n, x)) = exp(-1/2 d(x_n, x)^2 / σ^2)
linear parameters: w_n
output: f(x) = w_0 + Σ_{n=1}^k w_n K_n(d(x_n, x))
Training Radial Basis Function Networks
How to choose the center x_n for each kernel
function K_n?
scatter uniformly across the instance space
use the distribution of training instances (clustering)
How to train the weights?
Choose the mean x_n and variance σ_n for each K_n by
non-linear optimization or EM
Hold K_n fixed and use linear regression to
compute the optimal weights w_n
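A sketch of the second training strategy above: hold the kernels K_n fixed (centers taken from the training data, illustrative bandwidth) and solve for the weights w_n by linear least squares:

```python
import numpy as np

def rbf_fit(X, y, centers, sigma=0.2):
    """Fit the output weights of an RBF network with fixed Gaussian kernels."""
    # design matrix: bias column + one Gaussian activation per center
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    Phi = np.hstack([np.ones((X.shape[0], 1)),
                     np.exp(-0.5 * d**2 / sigma**2)])
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # linear in w_n
    return w

def rbf_predict(x, centers, w, sigma=0.2):
    d = np.linalg.norm(centers - x, axis=1)
    return w[0] + np.exp(-0.5 * d**2 / sigma**2) @ w[1:]

X = np.linspace(0, 1, 20).reshape(-1, 1)
y = np.sin(2 * np.pi * X[:, 0])
centers = X[::4]                  # every 4th training point as a center
w = rbf_fit(X, y, centers)
print(rbf_predict(np.array([0.25]), centers, w))  # close to sin(pi/2) = 1
```

With the kernels frozen, the weight fit is an ordinary least squares problem, which is what makes this strategy cheap compared with jointly optimizing centers and variances.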
Radial Basis Network Example
K_1(d(x_1, x)) = exp(-1/2 d(x_1, x)^2 / σ^2)
f^(x) = K_1 (w_1 x + w_0) + K_2 (w_3 x + w_2)
[Figure: two Gaussian kernels gating local linear models]
Lazy and Eager Learning
Lazy: wait for the query before generalizing
k-nearest neighbors, weighted linear regression
Eager: generalize before seeing the query
Radial basis function networks, decision trees,
back-propagation, LOLIMOT
An eager learner must create a global approximation
A lazy learner can create local approximations
If they use the same hypothesis space, lazy can
represent more complex functions (e.g., H = linear functions)
Laboration 3
Distance weighted average
Cross-validation for optimal kernel width
Leave-one-out cross-validation:
f*(x_q) = Σ_{i≠q} f(x_i) K(d(x_i, x_q)) / Σ_{i≠q} K(d(x_i, x_q))
Cross-validation for feature subset selection
Neural Network
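The leave-one-out criterion above can be used directly to select the kernel width; a sketch with a Gaussian kernel and an illustrative candidate grid (data and width values are made up for the example):

```python
import numpy as np

def loo_error(X, y, sigma):
    """Leave-one-out squared error of the distance-weighted average
    for a given Gaussian kernel width sigma."""
    err = 0.0
    for q in range(len(X)):
        d = np.linalg.norm(X - X[q], axis=1)
        w = np.exp(-(d / sigma) ** 2)
        w[q] = 0.0                      # leave the query point itself out
        err += (np.sum(w * y) / np.sum(w) - y[q]) ** 2
    return err / len(X)

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, (40, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=40)

widths = [0.05, 0.1, 0.2, 0.5, 1.0]
best = min(widths, key=lambda s: loo_error(X, y, s))
print(best)  # a small width wins; large widths oversmooth the sine
```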