EE734, Spring 2013
Midterm Exam
1. (5 pts) Given the prior information in the table below, determine whether an officer named "Younghyun" is more likely to be male or female. Use Bayes' rule to justify your answer.

Name        Sex
---------   ------
Younghyun   Male
Chansu      Female
Younghyun   Female
Younghyun   Female
Andy        Male
Karin       Female
Nina        Female
Sunju       Male
Sol)
p(male | Younghyun) = p(Younghyun | male) p(male) / p(Younghyun) = (1/3)(3/8) / (3/8) = 0.125 / (3/8) ≈ 0.33
p(female | Younghyun) = p(Younghyun | female) p(female) / p(Younghyun) = (2/5)(5/8) / (3/8) = 0.250 / (3/8) ≈ 0.67
So, Officer Younghyun is more likely to be female.
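The arithmetic above can be double-checked mechanically. Below is a minimal Python sketch (the roster list simply transcribes the table) that recomputes both posteriors with exact fractions.

```python
from fractions import Fraction

# Transcription of the table: (name, sex)
roster = [
    ("Younghyun", "Male"),   ("Chansu", "Female"),
    ("Younghyun", "Female"), ("Younghyun", "Female"),
    ("Andy", "Male"),        ("Karin", "Female"),
    ("Nina", "Female"),      ("Sunju", "Male"),
]

n = len(roster)
males = [name for name, sex in roster if sex == "Male"]
females = [name for name, sex in roster if sex == "Female"]

# Bayes' rule: p(sex | name) = p(name | sex) * p(sex) / p(name)
p_name = Fraction(sum(name == "Younghyun" for name, _ in roster), n)  # 3/8
p_male = (Fraction(males.count("Younghyun"), len(males))
          * Fraction(len(males), n)) / p_name        # (1/3)(3/8)/(3/8) = 1/3
p_female = (Fraction(females.count("Younghyun"), len(females))
            * Fraction(len(females), n)) / p_name    # (2/5)(5/8)/(3/8) = 2/3

print(p_male, p_female)  # 1/3 2/3
```

The two posteriors sum to 1, as they must, since every "Younghyun" in the roster is either male or female.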
2. (10 pts)
(a) Give a general expression for the quadratic approximation to a twice-differentiable function f(x) at x = k.
Sol) The second-order Taylor expansion about x = k:
f(x) ≈ f(k) + f'(k)(x − k) + (1/2) f''(k)(x − k)^2
(b) Use your answer from part (a) to give an approximate value for ln(1.1), where ln(x) is the natural log function.
Sol) Take f(x) = ln(x) and k = 1, so f(1) = 0, f'(1) = 1, f''(1) = −1:
ln(1.1) ≈ 0 + (0.1) − (1/2)(0.1)^2 = 0.1 − 0.005 = 0.095
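As a sanity check, the approximation can be compared against the exact value from `math.log`; this is a small Python sketch, not part of the original solution.

```python
import math

def quad_approx(f_k, df_k, d2f_k, k, x):
    """Second-order Taylor approximation of f around x = k."""
    return f_k + df_k * (x - k) + 0.5 * d2f_k * (x - k) ** 2

# f(x) = ln(x) about k = 1: f(1) = 0, f'(1) = 1, f''(1) = -1
approx = quad_approx(0.0, 1.0, -1.0, 1.0, 1.1)
exact = math.log(1.1)

print(approx)  # ~0.095
print(exact)   # ~0.0953
```

The quadratic approximation is within about 3 × 10⁻⁴ of the true value, which is what one expects from the cubic remainder term.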
3. (10 pts) Gaussian Naive Bayes
Suppose you are training Gaussian Naive Bayes (GNB) on the training set shown below. The dataset satisfies the Gaussian Naive Bayes assumptions. Assume that the variance is independent of instances but dependent on classes, i.e., σ_ik = σ_k, where i indexes instances X(i) and k ∈ {1, 2} indexes classes. Draw the decision boundaries when you train GNB
a. using the same variance for both classes, σ1 = σ2
b. using a separate variance for each class, σ1 ≠ σ2
sol)
The decision boundary for part a will be linear, and for part b it will be quadratic. With σ1 = σ2, the quadratic terms of the two class log-likelihoods cancel in the log-odds, leaving a function linear in x; with σ1 ≠ σ2 they do not cancel, so the boundary is quadratic.
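This cancellation can be verified numerically with a 1-D sketch (hypothetical means and variances, equal priors assumed): the log-odds between two Gaussian class likelihoods has zero curvature exactly when the variances match.

```python
import math

def log_gauss(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    return -math.log(sigma * math.sqrt(2 * math.pi)) - (x - mu) ** 2 / (2 * sigma ** 2)

def log_odds(x, mu1, s1, mu2, s2):
    """Decision function log p(x|c1) - log p(x|c2), assuming equal priors."""
    return log_gauss(x, mu1, s1) - log_gauss(x, mu2, s2)

def curvature(f, x, h=1.0):
    """Second finite difference: zero iff f is linear around x."""
    return f(x - h) - 2 * f(x) + f(x + h)

shared = lambda x: log_odds(x, -1.0, 2.0, 1.0, 2.0)    # sigma1 == sigma2
separate = lambda x: log_odds(x, -1.0, 1.0, 1.0, 3.0)  # sigma1 != sigma2

print(curvature(shared, 0.0))    # ~0: log-odds linear -> linear boundary
print(curvature(separate, 0.0))  # nonzero: log-odds quadratic -> quadratic boundary
```

The same argument applies per feature in the multivariate naive Bayes case, which is why the full decision boundary is a hyperplane when the class variances are shared.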
4. (5 pts) Consider the dataset toydata1 in the figure below.
• In the dataset there are two classes, '+' and 'o'.
• Each class has the same number of points.
• Each data point has two real-valued features, the X and Y coordinates.
Draw the decision boundary that a Gaussian Naive Bayes classifier will learn.
(Remember that a very important piece of information is that both classes have the same number of points, so we don't have to worry about the prior.)
Sol) For toydata1, GNB learns two Gaussians: one for the inner circle with small variance, and one for the outer circle with a much larger variance. The decision surface is roughly as shown in the figure.
5. (21 pts) (True or False)
(a) (SVM) When the data is not completely linearly separable, the linear SVM without slack variables returns w = 0. (w = weight vector)
(sol) False: the hard-margin optimization problem has no feasible solution.
(b) (SVM) After training an SVM, we can discard all examples which are not support vectors and can still classify new examples. (T)
(c) Overfitting is more likely when the set of training data is small. (T)
(sol) True. With a small training dataset, it is easier to find a hypothesis that fits the training data exactly, i.e., overfits.
(d) Logistic regression learns a non-linear decision boundary because it assumes that p(y = 1 | x) = 1 / (1 + exp(−wᵀx)), which is a nonlinear function of x. (F)
(sol) False: although the posterior is a nonlinear (sigmoid) function of x, the decision boundary p(y = 1 | x) = 0.5, i.e., wᵀx = 0, is linear.
(e) When learning a linear decision boundary with the perceptron algorithm, it is guaranteed to converge within a finite number of steps as long as the data is linearly separable. (T)
(f) Maximizing the log-likelihood of the logistic regression model may yield multiple local optima. (F)
(sol) See the 2nd paragraph, p. 137, by Prince: the log-likelihood for logistic regression has a special property, namely that it is a concave function of the parameters φ. Also refer to slides 22 and 25 of the lecture note on Classification.
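The concavity claim in (f) can be illustrated with a small Python sketch (hypothetical 1-D data, no bias term, chosen only for this example): the second finite difference of the log-likelihood stays non-positive across a sweep of parameter values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(w, data):
    """Log-likelihood of 1-D logistic regression p(y=1|x) = sigmoid(w*x), y in {0, 1}."""
    ll = 0.0
    for x, y in data:
        p = sigmoid(w * x)
        ll += y * math.log(p) + (1 - y) * math.log(1 - p)
    return ll

# Hypothetical training points (x, y); the data need not be separable.
data = [(-2.0, 0), (-1.0, 0), (-0.5, 1), (0.5, 1), (1.5, 1)]

# Curvature sweep along w: second difference <= 0 everywhere => concave,
# so gradient ascent cannot get stuck in a spurious local optimum.
h = 1e-3
curvatures = [
    log_likelihood(w - h, data) - 2 * log_likelihood(w, data) + log_likelihood(w + h, data)
    for w in [i * 0.5 for i in range(-10, 11)]
]
print(max(curvatures))  # negative: concave on the sampled range
```

This is only a numerical spot-check on a grid, of course; the analytic argument (the Hessian is −Σ xxᵀ p(1−p), which is negative semidefinite) holds everywhere.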
(g) The Newton method used in optimization only finds local extrema. (T)
6. (4 pts) Circle every method that belongs to nonparametric supervised learning. (c, e)
(a) Naïve Bayes (b) logistic regression (c) SVM (d) k-means clustering (e) k-nearest neighbor
7. (15 pts) (Regression)
a) Explain three problems related to linear regression and how to overcome each problem, in detail.
8. (20 pts) Lagrange method
We want to make a rectangular box without a lid from 12 m^2 of cardboard. To find the maximum volume of such a box, our goal is to maximize the function f(x, y, z) = xyz, s.t. g(x, y, z) = xy + 2yz + 2zx − 12 = 0, where x, y, and z are the length, width, and height of the box, respectively. In other words, find x, y, and z that maximize the function f(x, y, z).
Sol) Build the Lagrangian L(x, y, z, λ) = xyz − λ(xy + 2yz + 2zx − 12).
Calculate the partial derivatives with respect to x, y, z, and λ and set them equal to 0:
yz = λ(y + 2z), xz = λ(x + 2z), xy = λ(2x + 2y), xy + 2yz + 2zx = 12.
Solving gives x = y = 2, z = 1, with λ = 1/2, and f(2, 2, 1) = 4.
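The stationary point can be verified by substituting back into the constraint and the three stationarity conditions (a quick Python check, not required on the exam):

```python
# Candidate stationary point and multiplier from the solution above.
x, y, z = 2.0, 2.0, 1.0
lam = y * z / (y + 2 * z)  # from dL/dx = 0: yz = lam*(y + 2z)  -> 0.5

surface = x * y + 2 * y * z + 2 * z * x  # must equal 12 (the constraint)
volume = x * y * z                       # the objective f(x, y, z)

print(surface, volume, lam)  # 12.0 4.0 0.5
# Remaining stationarity conditions dL/dy = 0 and dL/dz = 0:
print(x * z - lam * (x + 2 * z))      # 0.0
print(x * y - lam * (2 * x + 2 * y))  # 0.0
```

All three stationarity conditions hold with the same multiplier λ = 1/2, and the constraint uses exactly the 12 m² of cardboard.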
9. (10 pts) SVM and the slack penalty C
The goal of this problem is to correctly classify test data points, given a training data set. You have been warned, however, that the training data comes from sensors which can be error-prone, so you should avoid trusting any specific point too much.
For this problem, assume that we are training an SVM with a quadratic kernel; that is, our kernel function is a polynomial kernel of degree 2. You are given the data set presented in Figure 1. The slack penalty C will determine the location of the separating hyperplane.
Please answer the following questions qualitatively. Give a one-sentence answer/justification for each and draw your solution in the appropriate part of the figure at the end of the problem.
Figure 1: Dataset for SVM slack penalty selection task.
a) (3 pts) Where would the decision boundary be for very large values of C (i.e., C → ∞)? (Remember that we are using an SVM with a quadratic kernel.) Draw on the figure below. Justify your answer.
⋆ SOLUTION: For large values of C, the penalty for misclassifying points is very high, so the decision boundary will perfectly separate the data if possible. See below for the boundary learned using libSVM and C = 100000.
_ COMMON MISTAKE 1: Some students drew straight lines, which would not be the result with a quadratic kernel.
_ COMMON MISTAKE 2: Some students confused the effect of C and thought that a large C meant that the algorithm would be more tolerant of misclassifications.
b) (3 pts) For C ≈ 0, indicate in the figure below where you would expect the decision boundary to be. Justify your answer.
⋆ SOLUTION: The classifier can maximize the margin between most of the points, while misclassifying a few points, because the penalty is so low. See below for the boundary learned by libSVM with C = 0.00005.
c) (2 pts) Which of the two cases above would you expect to work better in the classification task? Why?
⋆ SOLUTION: We were warned not to trust any specific data point too much, so we prefer the solution where C ≈ 0, because it maximizes the margin between the dominant clouds of points.
d) (1 pt) Draw a data point which will not change the decision boundary learned for very large values of C. Justify your answer.
⋆ SOLUTION: We add the point circled below, which is correctly classified by the original classifier and will not be a support vector.
e) (1 pt) Draw a data point which will significantly change the decision boundary learned for very large values of C. Justify your answer.
⋆ SOLUTION: Since C is very large, adding a point that would be incorrectly classified by the original boundary will force the boundary to move.
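The trade-off that C controls can be made concrete with a toy 1-D soft-margin objective (hypothetical data and hand-picked candidate boundaries; a linear score is used instead of the quadratic kernel for brevity). The objective 0.5·w² + C·Σhinge prefers the wide-margin candidate that sacrifices the noisy point when C is small, and the zero-error candidate when C is large.

```python
def objective(w, b, C, data):
    """Soft-margin SVM objective in 1-D: 0.5*w^2 + C * total hinge loss."""
    hinge = sum(max(0.0, 1.0 - y * (w * x + b)) for x, y in data)
    return 0.5 * w ** 2 + C * hinge

# Hypothetical data: two clean points and one noisy positive near the negative side.
data = [(-2.0, -1), (2.0, 1), (-0.5, 1)]

wide = (0.5, 0.0)   # wide margin, misclassifies the noisy point
tight = (2.0, 2.0)  # zero training error, but much larger ||w||

for C in (0.01, 100.0):
    winner = "wide" if objective(*wide, C, data) < objective(*tight, C, data) else "tight"
    print(f"C={C}: prefers the {winner} solution")
```

This mirrors parts a) through c): a large C forces the boundary to chase every point, while a small C lets the margin dominate, which is exactly why C ≈ 0 was preferred for the error-prone sensor data.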
Figure 2: Solutions for problem 9.
Figure for problem 3.
Figure for problem 4.
Figures for problem 9: (a) (b) (c) (d)