Homework 3 COSC 6342 Machine Learning
Remark: This is an individual homework!
Last Changed: April 21, 10p; due: May 1, 11p (no extensions); solutions will be posted on Zechun’s website on May 2…
1) Ensemble Methods [8]
a) One key problem of ensemble methods is to obtain diverse ensembles; what are the characteristics of a “diverse ensemble”? [2]
b) What is the key idea of boosting? How does the boosting approach encourage the creation of diverse ensembles? [3]
c) The AdaBoost algorithm restarts if the accuracy of classifiers drops below 50%. Why? [3]
2) Support Vector Machines [11]
a) Support vector machines maximize the width of the margin which separates the examples of 2 classes. What advantage does a classifier with a wide margin have over a classifier that has a much smaller margin? [2]
b) Non-linear support vector machines, which use kernels to map a dataset into a higher dimensional space, are quite popular. What advantages do you see in using non-linear support vector machines over linear support vector machines? [3]
c) How is the decision boundary of the one-class support vector machine different from the decision boundaries of a traditional support vector machine? [2]
d) The support vector regression approach minimizes the following objective function:
[objective function missing in source], subject to: [constraints missing in source], for t=1,..,n
What does the first term of the objective function measure; what does it mean if this term has a low value? What does the sum [… + …] measure? What does it mean if this sum is 0? Why are there two kinds of errors in support vector machine regression?
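For reference, the wording of question d) (a "first term", a sum of two error terms, constraints "for t=1,..,n") matches the standard ε-insensitive support vector regression formulation. The block below is a sketch of that textbook form, not necessarily the exact notation used on the original sheet; the symbols C, ε, w, w_0, x^t, r^t, ξ_+^t and ξ_-^t are assumptions taken from the usual textbook presentation.

```latex
\min_{\mathbf{w},\,w_0,\,\xi_+,\,\xi_-}\;
  \frac{1}{2}\lVert \mathbf{w} \rVert^{2}
  \;+\; C \sum_{t=1}^{n} \left( \xi_+^{t} + \xi_-^{t} \right)
\quad \text{subject to, for } t = 1,\dots,n:
\qquad
\begin{aligned}
  r^{t} - \left( \mathbf{w}^{\top}\mathbf{x}^{t} + w_0 \right) &\le \varepsilon + \xi_+^{t} \\
  \left( \mathbf{w}^{\top}\mathbf{x}^{t} + w_0 \right) - r^{t} &\le \varepsilon + \xi_-^{t} \\
  \xi_+^{t},\; \xi_-^{t} &\ge 0
\end{aligned}
```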
3) Bayesian Belief Networks [8]
a) What problems can be solved by Bayesian Belief Networks? How are they different from Naïve Bayes? Limit your answer to 4-5 sentences! [3]
b) Compute P(Earthquake|JohnCalls) for the Burglary/Earthquake/Alarm Network discussed in the Belief Network lecture! [5]
4) Kernels [6]
a) Kernel methods do their computations by accessing the Gram matrix. What does the Gram matrix contain? [2]
b) What machine learning algorithms can be kernelized, which cannot? What is the advantage of doing computations using the kernel function K(a,b) instead of using φ(a)·φ(b)? [4]
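As background for question 4, the relationship between a kernel function and an explicit feature map can be checked numerically. The sketch below is only an illustration under stated assumptions: the degree-2 polynomial kernel, the feature map `phi`, and the sample `points` are hypothetical choices that do not come from the lecture.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit feature map for the degree-2 homogeneous polynomial kernel
    on 2-D inputs (hypothetical example, not taken from the homework)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(a, b):
    """K(a, b) = (a . b)^2, computed entirely in the original 2-D space."""
    return dot(a, b) ** 2

def gram_matrix(points, kernel):
    """Matrix of kernel values for every pair of points."""
    return [[kernel(p, q) for q in points] for p in points]

# Made-up sample points, used only for illustration.
points = [(1.0, 2.0), (3.0, 0.5), (-1.0, 1.0)]

G_kernel = gram_matrix(points, poly_kernel)
G_explicit = gram_matrix(points, lambda a, b: dot(phi(a), phi(b)))
```

For this particular kernel, both matrices agree up to floating-point rounding, although `poly_kernel` never constructs the 3-D feature vectors.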
5) Reinforcement Learning [14]
a) What role does the learning rate α play in temporal difference learning? How is using high values different from using low values for α? [2]
b) Apply temporal difference learning to the DEF World, depicted above, relying on the following assumptions: [3]
The agent starts in state 7 and applies n-w-y (n first)
When the agent applies y in state 5, it moves back to state 4
Utilities of states are initialized with 0
What are the utilities of the visited states after n-w-y has been applied? Do not only give the final result but also show how you derived it, including the formulas used!
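The mechanics of a temporal difference update can be sketched in a few lines. This is a minimal illustration of the common TD(0) rule U(s) ← U(s) + α·(R(s) + γ·U(s') − U(s)), assuming that form matches the one used in the lecture. The states "A", "B", "C", their rewards, and the values of `alpha` and `gamma` are hypothetical and are not the DEF World's.

```python
def td_update(U, s, s_next, reward, alpha, gamma):
    """Apply one TD(0) step to the utility table U and return the new U[s].

    U(s) <- U(s) + alpha * (reward + gamma * U(s_next) - U(s))
    """
    U[s] = U[s] + alpha * (reward + gamma * U[s_next] - U[s])
    return U[s]

# Hypothetical trajectory: (state, reward received in that state, next state).
trajectory = [("A", 0.0, "B"), ("B", 2.0, "C"), ("C", -1.0, "A")]

# Utilities start at 0, as in the assignment; alpha and gamma are assumed values.
U = {state: 0.0 for state in ("A", "B", "C")}
alpha, gamma = 0.5, 0.9

for s, r, s_next in trajectory:
    td_update(U, s, s_next, r, alpha, gamma)
```

Each update only touches the state the agent is leaving, so states are updated in the order they are visited and later updates see the values written by earlier ones.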
c) Give the Bellman Equation for state 8! [2]
d) How is reinforcement learning different from supervised learning, such as classification and prediction? [3]
e) Assume you have a policy that always selects the action that leads to the state with the highest expected utility. Present arguments that this is usually not a good policy by describing scenarios in which this policy leads to suboptimal behavior of agents! [4]
6) Comparing Classifiers [4]
Assume you have 2 different support vector machine approaches m1 and m2 whose testing accuracy (measured using n-fold cross-validation) is the same but the area under the ROC curve for method m1 is larger than the area for method m2. In what aspect is m1 better than m2?¹
Might need to do some reading to answer this question!
[Figure: DEF-World]
¹ One way to answer this question is to construct a specific testing example for which both methods have the same accuracy but different areas under the curve, and then use this example to answer the question.
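The kind of example the footnote suggests can be set up with a short script. The sketch below is one possible construction under stated assumptions: the labels, the two score lists, and the 0.5 decision threshold are made up for illustration, and the area under the ROC curve is computed with the pairwise ranking formulation (fraction of positive/negative pairs ranked correctly, ties counted as one half).

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples classified correctly at the given threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and
    negative scores (ties count as half a correctly ranked pair)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical test set: three positives followed by three negatives.
labels = [1, 1, 1, 0, 0, 0]

# Made-up decision scores for the two methods (not real SVM outputs).
scores_m1 = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
scores_m2 = [0.9, 0.55, 0.1, 0.7, 0.45, 0.4]

acc_m1, acc_m2 = accuracy(labels, scores_m1), accuracy(labels, scores_m2)
auc_m1, auc_m2 = auc(labels, scores_m1), auc(labels, scores_m2)
```

Both methods make the same two mistakes at the 0.5 threshold, so their accuracies are equal, while their AUC values differ because the scores order the examples differently.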