Introduction to Machine Learning


14 Oct 2013 (4 years and 8 months ago)


Brown University CSCI 1950-F, Spring 2012
Prof. Erik Sudderth
Lecture 16: Kernels & Perceptrons; Gaussian Process Regression & Classification
Many figures courtesy of Kevin Murphy’s textbook, Machine Learning: A Probabilistic Perspective
Mercer Kernel Functions
$\mathcal{X}$: arbitrary input space (vectors, functions, strings, graphs, …)
• A kernel function maps pairs of inputs to real numbers:
  $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, $k(x_i, x_j) = k(x_j, x_i)$
  Intuition: Larger values indicate inputs are “more similar”
• A kernel function is positive semidefinite if and only if for any $n \ge 1$, and any $x = \{x_1, x_2, \ldots, x_n\}$, the Gram matrix $K \in \mathbb{R}^{n \times n}$ is positive semidefinite:
  $K_{ij} = k(x_i, x_j)$
• Mercer’s Theorem: Assuming certain technical conditions, every positive definite kernel function can be represented as
  $k(x_i, x_j) = \sum_{\ell=1}^{d} \phi_\ell(x_i)\, \phi_\ell(x_j)$
  for some feature mapping $\phi$ (but may need $d \to \infty$)
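The Gram-matrix definition can be checked numerically on a finite set of inputs. The sketch below is only an illustration: the helper names (`rbf_kernel`, `gram_matrix`, `is_psd`) and the tolerance are assumptions, not part of the lecture, and a finite check on one set of inputs can refute positive semidefiniteness but cannot prove it for all inputs.

```python
import numpy as np

def rbf_kernel(xi, xj, lengthscale=1.0):
    """Squared exponential kernel, used here only as a familiar valid kernel."""
    diff = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return float(np.exp(-0.5 * np.dot(diff, diff) / lengthscale**2))

def gram_matrix(kernel, xs):
    """Build K with K[i, j] = kernel(xs[i], xs[j])."""
    n = len(xs)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(xs[i], xs[j])
    return K

def is_psd(K, tol=1e-8):
    """True if K is symmetric and its smallest eigenvalue is >= -tol."""
    if not np.allclose(K, K.T):
        return False
    return bool(np.linalg.eigvalsh(K).min() >= -tol)

rng = np.random.default_rng(0)
xs = rng.normal(size=(20, 3))

K = gram_matrix(rbf_kernel, xs)
# A symmetric similarity that is NOT a valid kernel, for contrast:
# negative Euclidean distance has a zero diagonal, so its Gram matrix
# must have a negative eigenvalue.
K_bad = gram_matrix(lambda a, b: -float(np.linalg.norm(a - b)), xs)

print(is_psd(K), is_psd(K_bad))
```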
Exponential Kernels
$\mathcal{X}$: real vectors of some fixed dimension
$k(x_i, x_j) = \exp\left\{ -\left( \frac{|x_i - x_j|}{\sigma} \right)^{\gamma} \right\}$, $0 < \gamma \le 2$
We can construct a covariance matrix by evaluating the kernel at any set of inputs, and then sample from the zero-mean Gaussian distribution with that covariance. This is a Gaussian process.
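The sampling recipe just described can be sketched in a few lines. This is a minimal illustration, not the lecture's code: the function names, the default hyperparameters, and the small diagonal `jitter` added for numerical stability of the Cholesky factorization are all assumptions.

```python
import numpy as np

def exponential_kernel(xi, xj, sigma=1.0, gamma=2.0):
    """k(xi, xj) = exp(-(|xi - xj| / sigma)^gamma), valid for 0 < gamma <= 2."""
    dist = np.linalg.norm(np.atleast_1d(xi) - np.atleast_1d(xj))
    return np.exp(-(dist / sigma) ** gamma)

def sample_gp_prior(xs, n_samples=3, sigma=1.0, gamma=2.0, jitter=1e-6, seed=0):
    """Draw functions from a zero-mean GP prior, evaluated at the inputs xs."""
    n = len(xs)
    K = np.array([[exponential_kernel(a, b, sigma, gamma) for b in xs] for a in xs])
    # The jitter is a numerical safeguard (an assumption of this sketch):
    # smooth kernels give nearly singular covariance matrices.
    L = np.linalg.cholesky(K + jitter * np.eye(n))
    rng = np.random.default_rng(seed)
    # If z ~ N(0, I), then L z ~ N(0, L L^T) = N(0, K + jitter * I).
    samples = (L @ rng.standard_normal((n, n_samples))).T
    return samples, K
```

Calling `sample_gp_prior(np.linspace(-5, 5, 30), gamma=1.0)` gives rough sample paths, while `gamma=2.0` gives smooth ones.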
String Kernels
$\mathcal{X}$: strings of characters from some finite alphabet, of size $A$
• Feature vector: Count of number of times that every substring, of every possible length, occurs within string
  $D = A + A^2 + A^3 + A^4 + \cdots$
• Using suffix trees, the kernel can be evaluated in time linear in the length of the input strings
[Figure: two amino acid sequences, $x$ and $x'$]
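To make the feature vector concrete, here is a deliberately naive sketch of the substring-count kernel. It is not the linear-time suffix-tree method mentioned above: it enumerates all substrings explicitly, which costs quadratic memory in the string length. The function names are hypothetical.

```python
from collections import Counter

def substring_counts(s):
    """Count every contiguous, non-empty substring of s (the sparse feature vector)."""
    return Counter(s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1))

def string_kernel(s, t):
    """Inner product of the two substring-count feature vectors."""
    cs, ct = substring_counts(s), substring_counts(t)
    if len(cs) > len(ct):
        cs, ct = ct, cs  # iterate over the smaller dictionary
    # Counter returns 0 for substrings that never occur in the other string.
    return sum(count * ct[sub] for sub, count in cs.items())
```

Only substrings that occur in both strings contribute, so the infinite-dimensional feature space never has to be materialized.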
Kernelizing Learning Algorithms
• Start with any learning algorithm based on features $\phi(x)$
  (Don’t worry that computing features might be expensive or impossible.)
• Manipulate steps in algorithm so that it depends not directly on features, but only their inner products:
  $k(x_i, x_j) = \phi(x_i)^T \phi(x_j)$
• Write code that only uses calls to kernel function
• Basic identity: Squared distance between feature vectors
  $\|\phi(x_i) - \phi(x_j)\|_2^2 = k(x_i, x_i) + k(x_j, x_j) - 2k(x_i, x_j)$
• Feature-based nearest neighbor classification
• Feature-based clustering algorithms (later)
• Feature-based nearest centroid classification:
  $\hat{y}_{test} = \arg\min_c \|\phi(x_{test}) - \mu_c\|_2$
  $\mu_c = \frac{1}{N_c} \sum_{i \mid y_i = c} \phi(x_i)$, the mean of the $N_c$ training examples of class $c$
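Nearest centroid classification can be kernelized by expanding the squared distance to the class mean: $\|\phi(x) - \mu_c\|^2 = k(x, x) - \frac{2}{N_c}\sum_{i} k(x, x_i) + \frac{1}{N_c^2}\sum_{i,j} k(x_i, x_j)$, with sums over the examples of class $c$. The sketch below follows that expansion; the function name and argument layout are assumptions made for illustration.

```python
import numpy as np

def kernel_nearest_centroid(kernel, X_train, y_train, x_test):
    """Classify x_test by the nearest class mean in feature space,
    using only calls to the kernel function."""
    k_tt = kernel(x_test, x_test)
    best_class, best_dist2 = None, np.inf
    for c in np.unique(y_train):
        Xc = [x for x, y in zip(X_train, y_train) if y == c]
        n_c = len(Xc)
        # Cross term: (1 / N_c) * sum_i k(x_test, x_i)
        cross = sum(kernel(x_test, xi) for xi in Xc) / n_c
        # Within-class term: (1 / N_c^2) * sum_{i,j} k(x_i, x_j)
        within = sum(kernel(xi, xj) for xi in Xc for xj in Xc) / n_c**2
        dist2 = k_tt - 2.0 * cross + within
        if dist2 < best_dist2:
            best_class, best_dist2 = c, dist2
    return best_class
```

With a linear kernel `lambda a, b: float(np.dot(a, b))` this reproduces ordinary nearest centroid classification in input space.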
Perceptron MARK 1 Computer
Frank Rosenblatt, late 1950s
Decision Rule: $\hat{y}_i = I(\theta^T \phi(x_i) > 0)$
Learning Rule:
  If $\hat{y}_k = y_k$, $\theta_{k+1} = \theta_k$
  If $\hat{y}_k \ne y_k$, $\theta_{k+1} = \theta_k + \tilde{y}_k \phi(x_k)$
  $\tilde{y}_k = 2y_k - 1 \in \{+1, -1\}$
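The decision and learning rules translate almost line for line into code. A minimal sketch, assuming the features are precomputed into a matrix `Phi` with one row per example and that the data are cycled through for a fixed number of epochs (the epoch loop is an assumption; the slide only states the per-example update):

```python
import numpy as np

def train_perceptron(Phi, y, n_epochs=10):
    """Phi: (N, D) feature matrix; y: labels in {0, 1}."""
    theta = np.zeros(Phi.shape[1])
    for _ in range(n_epochs):
        for phi_k, y_k in zip(Phi, y):
            y_hat = int(theta @ phi_k > 0)      # decision rule
            if y_hat != y_k:                     # update only on mistakes
                theta = theta + (2 * y_k - 1) * phi_k
    return theta

def predict_perceptron(theta, Phi):
    """Apply the decision rule to every row of Phi."""
    return (Phi @ theta > 0).astype(int)
```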
Kernelized Perceptron Algorithm
Representation: D feature weights
Decision Rule: $\hat{y}_{test} = I(\theta^T \phi(x_{test}) > 0)$
Learning Rule:
  If $\hat{y}_k = y_k$, $\theta_{k+1} = \theta_k$
  If $\hat{y}_k \ne y_k$, $\theta_{k+1} = \theta_k + \tilde{y}_k \phi(x_k)$
  $\tilde{y}_k = 2y_k - 1 \in \{+1, -1\}$
Problem: May be intractable to compute/store $\phi(x_k)$, $\theta_k$
Initialize with $\theta_0 = 0$. By induction, for all $k$, $\theta_k = \sum_{i=1}^{N} s_{ik}\, \phi(x_i)$ for some integers $s_{ik}$
Representation: N training example weights
Decision Rule: $\hat{y}_{test} = I\left( \sum_{i=1}^{N} \hat{s}_i\, k(x_{test}, x_i) > 0 \right)$
Learning Rule:
  If $\hat{y}_k = y_k$, $s_{k,k+1} = s_{k,k}$
  If $\hat{y}_k \ne y_k$, $s_{k,k+1} = s_{k,k} + \tilde{y}_k$
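The kernelized rules can be sketched as follows. The code stores one integer weight per training example instead of a feature-space weight vector. Function names, the precomputed Gram matrix, and the fixed epoch count are assumptions made for illustration.

```python
import numpy as np

def train_kernel_perceptron(kernel, X, y, n_epochs=10):
    """Return integer example weights s, one per training point; y in {0, 1}."""
    N = len(X)
    K = np.array([[kernel(a, b) for b in X] for a in X])  # Gram matrix
    s = np.zeros(N, dtype=int)
    for _ in range(n_epochs):
        for k in range(N):
            # theta^T phi(x_k) = sum_i s_i k(x_i, x_k)
            y_hat = int(s @ K[:, k] > 0)
            if y_hat != y[k]:
                s[k] += 2 * y[k] - 1   # only example k's weight changes
    return s

def predict_kernel_perceptron(kernel, X, s, x_test):
    """Decision rule using only kernel evaluations against training points."""
    score = sum(s_i * kernel(x_test, x_i) for s_i, x_i in zip(s, X))
    return int(score > 0)
```

With a nonlinear kernel such as the squared exponential, this learns decision boundaries (for example XOR) that no linear perceptron on the raw inputs can represent.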
Gaussian Processes
• Linear regression models predict outputs by a linear function of fixed, usually non-linear features:
  $f(x) = w^T \phi(x)$, $\phi(x) \in \mathbb{R}^{m \times 1}$, $w \in \mathbb{R}^{m \times 1}$
• Consider Gaussian prior on weight vector for regularization:
  $p(w) = \mathcal{N}(w \mid 0, \alpha^{-1} I_m)$
• What is the joint distribution of the predictions for any inputs $x = \{x_1, x_2, \ldots, x_n\}$?
  $f = [f(x_1), \ldots, f(x_n)]^T = \Phi w$
  $p(f) = \mathcal{N}(f \mid 0, \alpha^{-1} \Phi \Phi^T) = \mathcal{N}(f \mid 0, K)$, $K_{ij} = \alpha^{-1} \phi(x_i)^T \phi(x_j)$
• This is a Gaussian process: Not a single Gaussian distribution, but a family of Gaussian distributions, one for each $n$ and $x$
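The covariance of $f$ follows in two lines from linearity of expectation. A short derivation, assuming $\Phi$ denotes the $n \times m$ matrix whose $i$-th row is $\phi(x_i)^T$:

```latex
\begin{aligned}
\mathbb{E}[f] &= \Phi\,\mathbb{E}[w] = 0, \\
\mathrm{Cov}[f] &= \mathbb{E}[f f^T]
  = \Phi\,\mathbb{E}[w w^T]\,\Phi^T
  = \alpha^{-1}\,\Phi\Phi^T = K .
\end{aligned}
```

Because $f$ is a linear function of the Gaussian vector $w$, it is itself Gaussian, so the mean and covariance determine its distribution completely.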
Gaussian Process Regression
• Feature-based regression estimates m-dim. weight vector $w$
• GP regression estimates n-dim. function at training data $x = \{x_1, x_2, \ldots, x_n\}$:
  $f = [f(x_1), \ldots, f(x_n)]^T = \Phi w$
  $p(f) = \mathcal{N}(f \mid 0, K)$, $K_{ij} = \alpha^{-1} \phi(x_i)^T \phi(x_j)$
  $p(y_i \mid f_i) = \mathcal{N}(y_i \mid f_i, \beta^{-1})$ (noisy observation of unobserved function)
  $p(y) = \mathcal{N}(y \mid 0, C)$, $C = K + \beta^{-1} I_n$
• To make a prediction for a test point, we don’t need to know the underlying weight vector, only the distribution
  $p(y_{n+1} \mid x_{n+1}, x, y) = \mathcal{N}(y_{n+1} \mid m(x_{n+1}), \sigma^2(x_{n+1}))$
• Mean and covariance computed by applying standard formulas for Gaussian conditionals to covariance matrix $C$
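Applying the Gaussian conditioning formulas gives $m(x_{n+1}) = \mathbf{k}^T C^{-1} y$ and $\sigma^2(x_{n+1}) = c - \mathbf{k}^T C^{-1} \mathbf{k}$, where $\mathbf{k}_i = k(x_{n+1}, x_i)$ and $c = k(x_{n+1}, x_{n+1}) + \beta^{-1}$. The symbols $\mathbf{k}$ and $c$, the function names, and the choice of a squared exponential kernel are assumptions of this sketch rather than notation from the slides.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    """Squared exponential kernel, chosen here only for illustration."""
    diff = np.atleast_1d(a) - np.atleast_1d(b)
    return float(np.exp(-0.5 * (diff @ diff) / lengthscale**2))

def gp_predict(kernel, X, y, x_new, beta):
    """Predictive mean and variance of y at x_new; beta is the noise precision."""
    n = len(X)
    K = np.array([[kernel(a, b) for b in X] for a in X])
    C = K + np.eye(n) / beta                        # C = K + beta^{-1} I_n
    k_star = np.array([kernel(x_new, a) for a in X])
    c = kernel(x_new, x_new) + 1.0 / beta
    mean = k_star @ np.linalg.solve(C, y)           # k^T C^{-1} y
    var = c - k_star @ np.linalg.solve(C, k_star)   # c - k^T C^{-1} k
    return mean, var
```

Near a training input the predictive variance shrinks toward the noise level; far from all training inputs the mean returns to zero and the variance returns to the prior value $c$.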
1D Gaussian Process Regression
[Figure: two panels over inputs from −5 to 5: “Samples from Prior” and “Posterior Given 5 Noise-Free Observations”]
Squared exponential kernel or radial basis function (RBF) kernel
2D Gaussian Processes
[Figure: three plots of output y against input x1 and input x2, with each input ranging from −2 to 2]
Gaussian Process Hyperparameters
[Figure: three plots, each with horizontal axis from −8 to 8 and vertical axis from −3 to 3]
How should we fit the hyperparameters to data?
• Cross-validation
• Maximize marginal likelihood (empirical Bayes, tractable for GP regression)
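For GP regression the marginal likelihood is available in closed form: $\log p(y) = -\tfrac{1}{2} y^T C^{-1} y - \tfrac{1}{2} \log |C| - \tfrac{n}{2} \log 2\pi$. A minimal sketch, assuming a unit-amplitude squared exponential kernel with hyperparameters `lengthscale` and `noise_std` (the function name and this particular parameterization are assumptions):

```python
import numpy as np

def log_marginal_likelihood(X, y, lengthscale, noise_std):
    """log N(y | 0, C) with C = K + noise_std^2 * I and an RBF kernel K."""
    X = np.asarray(X, dtype=float).reshape(len(X), -1)
    y = np.asarray(y, dtype=float)
    n = len(y)
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-0.5 * sq_dists / lengthscale**2)
    C = K + noise_std**2 * np.eye(n)
    L = np.linalg.cholesky(C)                       # C = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # C^{-1} y
    # log|C| = 2 * sum(log(diag(L))), so -0.5 * log|C| = -sum(log(diag(L)))
    return -0.5 * (y @ alpha) - np.log(np.diag(L)).sum() - 0.5 * n * np.log(2 * np.pi)
```

Evaluating this function on a grid of (`lengthscale`, `noise_std`) pairs, or passing its negative to a numerical optimizer, is one simple way to carry out the empirical Bayes fit. The surface can have several local optima, so the result may depend on the starting point.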
Hyperparameter Marginal Likelihoods
[Figure: plot over characteristic lengthscale (10^0 to 10^1) and noise standard deviation (10^−1 to 10^0), with two panels of output y against input x (inputs from −5 to 5), labeled “Global Minimum” and “Local Minimum”]
Example: CO₂ Concentration Over Time
Mauna Loa Observatory in Hawaii, analyzed by Rasmussen & Williams 2006
Mixing Kernels for CO₂ GP Regression
Smooth global trend
Seasonal periodicity
Medium term irregularities
Correlated Observation Noise
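Because a sum or product of valid kernels is again a valid kernel, the four components listed above can each get their own kernel and simply be added. The sketch below is only in the spirit of that construction: the specific functional forms (squared exponential, periodic times squared exponential, rational quadratic) are common choices assumed here, and every hyperparameter value is an illustrative placeholder, not a fitted value from the analysis.

```python
import numpy as np

def squared_exponential(d, amplitude, lengthscale):
    """Squared exponential kernel as a function of the input difference d."""
    return amplitude**2 * np.exp(-0.5 * d**2 / lengthscale**2)

def co2_style_kernel(t1, t2):
    """Sum of four kernels on time inputs measured in years.
    All numeric hyperparameters are placeholders chosen for illustration."""
    d = np.subtract.outer(np.asarray(t1, dtype=float), np.asarray(t2, dtype=float))
    # Smooth global trend: long lengthscale.
    trend = squared_exponential(d, amplitude=1.0, lengthscale=50.0)
    # Seasonal periodicity: period of one year, allowed to drift slowly.
    seasonal = squared_exponential(d, 0.5, 100.0) * np.exp(-2.0 * np.sin(np.pi * d) ** 2)
    # Medium term irregularities: rational quadratic with shape parameter 1.
    medium = 0.3**2 * (1.0 + d**2 / 2.0) ** (-1.0)
    # Correlated observation noise: short lengthscale.
    noise = squared_exponential(d, 0.1, 0.1)
    return trend + seasonal + medium + noise
```

The resulting matrix `co2_style_kernel(t, t)` can be used as the covariance $K$ in the GP regression formulas without any other change.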
Generalized Linear Models
• Recall parametric generalized linear models (GLMs):
  $p(y_i \mid x_i, w) = \exp\{y_i f_i - A(f_i)\}$ (any exponential family distribution), $f_i = w^T \phi(x_i)$
  $p(w) = \mathcal{N}(w \mid 0, \alpha^{-1} I_m)$, $w \in \mathbb{R}^{m \times 1}$
• Gaussian processes lead to nonparametric GLMs:
  $p(y_i \mid x_i, f_i) = \exp\{y_i f_i - A(f_i)\}$ (any exponential family distribution)
  $p(f) = \mathcal{N}(f \mid 0, K)$, $K_{ij} = k(x_i, x_j)$
• The Mercer kernel function corresponds to some set of underlying features, but we need not know or compute them
• The model is “nonparametric” because the number of underlying features, and hence parameters, can be infinite
Gaussian Process Classification
$p(y_i \mid x_i, f_i) = \exp\{y_i f_i - A(f_i)\}$ (Bernoulli distribution), $y_i \in \{0, 1\}$
$p(f) = \mathcal{N}(f \mid 0, K)$, $K_{ij} = k(x_i, x_j)$
$p(y_i \mid x_i, f_i) = \mathrm{Ber}(y_i \mid \mathrm{sigm}(f_i))$
[Figure: plot with horizontal axis from −10 to 10 and vertical axis from 0 to 1]
• Equivalent to logistic regression, but uses kernels rather than features
• Gaussian prior on weights replaced by Gaussian prior on training log-odds
• As in logistic regression, cannot exactly average over parameters to compute test data predictions
• Use Gaussian approximations instead
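The two forms of the likelihood agree once the log-normalizer $A$ is identified. A short derivation, using $\mathrm{sigm}(f) = e^{f} / (1 + e^{f})$:

```latex
\begin{aligned}
\mathrm{Ber}(y \mid \mathrm{sigm}(f))
  &= \mathrm{sigm}(f)^{y}\,\bigl(1 - \mathrm{sigm}(f)\bigr)^{1-y}
   = \frac{e^{y f}}{1 + e^{f}} \\
  &= \exp\{\, y f - \log(1 + e^{f}) \,\},
  \qquad\text{so}\qquad A(f) = \log(1 + e^{f}).
\end{aligned}
```

This is why $f_i$ is interpreted as the log-odds of $y_i = 1$.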
Laplace Approximations
[Figure: two panels, both axes from −8 to 8: “Log-Unnormalised Posterior” (Log Posterior Distribution) and “Laplace Approximation to Posterior” (Laplace (Gaussian) Approximation)]
• Logistic regression approximates M-dim. distribution of weights w
• GP classification approximates N-dim. distribution of training log-odds f
• Both require similar gradient descent algorithms
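For GP classification, the Laplace approximation first finds the mode $\hat{f}$ of $p(f \mid y) \propto p(y \mid f)\, \mathcal{N}(f \mid 0, K)$ and then uses the Gaussian $\mathcal{N}(\hat{f}, (K^{-1} + W)^{-1})$, where $W$ is the diagonal matrix of $\mathrm{sigm}(\hat{f}_i)(1 - \mathrm{sigm}(\hat{f}_i))$. The sketch below uses Newton's method for the mode search, which is a swapped-in choice: the slides only say gradient-based optimization. Function names and stopping rule are assumptions.

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def laplace_gp_classification(K, y, n_iter=50, tol=1e-8):
    """Return the posterior mode f_hat and the Laplace covariance
    (K^{-1} + W)^{-1}, for labels y in {0, 1} and Gram matrix K."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    f = np.zeros(n)
    for _ in range(n_iter):
        pi = sigmoid(f)
        W = np.diag(pi * (1.0 - pi))        # negative Hessian of log p(y | f)
        grad_lik = y - pi                    # gradient of log p(y | f)
        # Newton step: f_new = (K^{-1} + W)^{-1} (W f + grad_lik),
        # rewritten as K (I + W K)^{-1} (...) so K is never inverted.
        f_new = K @ np.linalg.solve(np.eye(n) + W @ K, W @ f + grad_lik)
        converged = np.max(np.abs(f_new - f)) < tol
        f = f_new
        if converged:
            break
    pi = sigmoid(f)
    W = np.diag(pi * (1.0 - pi))
    cov = K @ np.linalg.inv(np.eye(n) + W @ K)   # equals (K^{-1} + W)^{-1}
    return f, cov
```

At the mode the gradient vanishes, which gives the self-consistency condition $\hat{f} = K\,(y - \mathrm{sigm}(\hat{f}))$; this is a convenient check on any implementation.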
Kernels or Features?
$N$: number of training examples
$M$: number of features
$L$: cost of kernel function evaluation, at worst $O(M)$
$\Phi$: $N \times M$ matrix evaluating each feature for all training data
• Feature-based linear regression: $O(NM^2 + M^3)$
• Kernel-based GP regression: $O(LN^2 + N^3)$
• Roughly, the difference corresponds to using either $(\Phi^T \Phi)^{-1}$ or $(\Phi \Phi^T)^{-1}$
• Relative costs of logistic regression and GP classification are similar, per iteration of optimization-based learning
• What if N and M are both large??? Approximate!!! Endless options, none perfect!
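The correspondence between the two matrix inverses can be seen in regularized least squares, where the feature-space and kernel-space solutions give the same weights. A small sketch, assuming a ridge penalty `lam` (in the GP notation this plays the role of $\alpha / \beta$; the penalty and the function names are assumptions, since the slide writes the inverses without a regularizer):

```python
import numpy as np

def ridge_primal(Phi, y, lam):
    """Solve an M x M system: w = (Phi^T Phi + lam I_M)^{-1} Phi^T y."""
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)

def ridge_dual(Phi, y, lam):
    """Solve an N x N system: w = Phi^T (Phi Phi^T + lam I_N)^{-1} y."""
    N = Phi.shape[0]
    return Phi.T @ np.linalg.solve(Phi @ Phi.T + lam * np.eye(N), y)
```

Both functions return the same weight vector up to rounding error, so the choice is purely computational: the primal form is cheaper when $M < N$, and the dual (kernel) form is cheaper when $N < M$ or when the features are only available implicitly through $k$.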