ML_H3_2013 - Cs


Homework 3

COSC 6342 Machine Learning


Remark: This is an individual homework!

Last Changed: April 21, 10p; due: May 1, 11p (no extensions); solutions will be posted on Zechun’s website on May 2…



1) Ensemble Methods [8]

a) One key problem of ensemble methods is to obtain diverse ensembles; what are the characteristics of a “diverse ensemble”? [2]

b) What is the key idea of boosting? How does the boosting approach encourage the creation of diverse ensembles? [3]

c) The AdaBoost algorithm restarts if the accuracy of classifiers drops below 50%. Why? [3]
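
As background for part c), here is a minimal sketch of how a base classifier's vote weight depends on its weighted error. It assumes the common AdaBoost formulation alpha = 0.5 * ln((1 - err) / err); the lecture's variant may parameterize this differently, and the function name `classifier_weight` is hypothetical.

```python
import math

def classifier_weight(err):
    """Vote weight of a base classifier with weighted error `err` (0 < err < 1).

    ASSUMPTION: the common formulation alpha = 0.5 * ln((1 - err) / err).
    """
    return 0.5 * math.log((1.0 - err) / err)

# The weight shrinks to 0 at 50% error and turns negative beyond it.
for err in (0.1, 0.3, 0.5, 0.6):
    print(f"error={err:.1f}  weight={classifier_weight(err):+.3f}")
```

Under this formulation, a classifier at exactly 50% error contributes nothing to the vote, which is one way to see why that threshold matters.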

2) Support Vector Machines [11]

a) Support vector machines maximize the width of the margin which separates the examples of 2 classes. What advantage does a classifier with a wide margin have over a classifier that has a much smaller margin? [2]
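
To make "width of the margin" concrete: for a linear SVM in canonical form (the closest examples satisfy |w·x + b| = 1), the margin width is 2 / ||w||. The sketch below assumes that canonical form; the weight vectors are made-up values and `margin_width` is a hypothetical helper name.

```python
import math

def margin_width(w):
    """Margin width 2 / ||w|| of a canonical-form linear SVM (assumed setting)."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

wide = margin_width([0.5, 0.0])    # small ||w|| gives a wide margin
narrow = margin_width([3.0, 4.0])  # large ||w|| gives a narrow margin
print(wide, narrow)
```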

b) Non-linear support vector machines, which use kernels to map a dataset into a higher-dimensional space, are quite popular. What advantages do you see in using non-linear support vector machines over linear support vector machines? [3]

c) How is the decision boundary of the one-class support vector machine different from the decision boundaries of a traditional support vector machine? [2]

d) The support vector regression approach minimizes the following objective function:

[objective function not preserved in this copy] subject to: [constraints not preserved in this copy] for t=1,..,n

What does the first term of the objective function measure; what does it mean if this term has a low value? What does the sum of the two error terms measure? What does it mean if this sum is 0? Why are there two kinds of errors in support vector machine regression?
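
For reference, the standard ε-insensitive support vector regression formulation is consistent with the fragments of part d) that remain (a "first term", a sum of two error terms, "subject to", "for t=1,..,n"). Whether the homework used exactly this form is an assumption, as are the symbols $w$, $w_0$, $C$, $\epsilon$, $r^t$, $\xi_t^{+}$ and $\xi_t^{-}$.

```latex
\min_{w,\,w_0,\,\xi^{+},\,\xi^{-}} \;\;
  \frac{1}{2}\lVert w \rVert^{2}
  + C \sum_{t=1}^{n} \left( \xi_t^{+} + \xi_t^{-} \right)
\quad \text{subject to:} \quad
\begin{aligned}
r^{t} - (w^{\top} x^{t} + w_0) &\le \epsilon + \xi_t^{+} \\
(w^{\top} x^{t} + w_0) - r^{t} &\le \epsilon + \xi_t^{-} \\
\xi_t^{+},\ \xi_t^{-} &\ge 0 \qquad \text{for } t = 1,\dots,n
\end{aligned}
```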




3) Bayesian Belief Networks [8]

a) What problems can be solved by Bayesian Belief Networks? How are they different from Naïve Bayes? Limit your answer to 4-5 sentences! [3]

b) Compute P(Earthquake|JohnCalls) for the Burglary/Earthquake/Alarm Network discussed in the Belief Network lecture! [5]
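
The kind of computation part b) asks for can be sketched as inference by enumeration. The conditional probability values below are the usual textbook numbers for the burglary network (Russell and Norvig) and are an assumption; the values used in the lecture may differ. MaryCalls is left out because summing over it contributes a factor of 1.

```python
from itertools import product

# ASSUMED textbook CPT values; the lecture's numbers may differ.
P_B = 0.001                                          # P(Burglary)
P_E = 0.002                                          # P(Earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(Alarm | B, E)
P_J = {True: 0.90, False: 0.05}                      # P(JohnCalls | Alarm)

def joint(b, e, a, j):
    """P(B=b, E=e, A=a, J=j) from the network factorization."""
    p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p *= P_J[a] if j else 1 - P_J[a]
    return p

def p_earthquake_given_john():
    """P(Earthquake | JohnCalls), summing out Burglary and Alarm."""
    values = [True, False]
    num = sum(joint(b, True, a, True) for b, a in product(values, repeat=2))
    den = sum(joint(b, e, a, True) for b, e, a in product(values, repeat=3))
    return num / den

print(round(p_earthquake_given_john(), 4))
```

With these assumed values the result is roughly 0.011, only a little above the prior P(Earthquake) = 0.002 in absolute terms, because most of John's calls are false alarms.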

4) Kernels [6]

a) Kernel methods do their computations by accessing the Gram matrix. What does the Gram matrix contain? [2]
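
A minimal sketch of building a Gram matrix for a toy dataset, assuming a plain dot-product kernel; the data points and the helper names `linear_kernel` and `gram_matrix` are made up for illustration.

```python
def linear_kernel(a, b):
    """Plain dot product, the simplest kernel."""
    return sum(x * y for x, y in zip(a, b))

def gram_matrix(points, kernel):
    """G[i][j] = kernel(points[i], points[j]) for every pair of examples."""
    return [[kernel(p, q) for q in points] for p in points]

points = [(1.0, 0.0), (0.0, 2.0), (1.0, 1.0)]  # made-up toy data
G = gram_matrix(points, linear_kernel)
for row in G:
    print(row)
```

The matrix is symmetric and has one row and one column per training example, regardless of the dimensionality of the feature space the kernel corresponds to.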


b) What machine learning algorithms can be kernelized, which cannot? What is the advantage of doing computations using the kernel function K(a,b) instead of using φ(a)·φ(b)? [4]
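
The relationship between a kernel and an explicit feature map can be checked numerically. The sketch below assumes the degree-2 polynomial kernel K(a,b) = (a·b)^2 on 2-D inputs, whose explicit map goes to 3-D; the kernel choice and the input values are illustrative, not taken from the homework.

```python
import math

def poly2_kernel(a, b):
    """K(a, b) = (a . b)^2, computed entirely in the original 2-D space."""
    return (a[0] * b[0] + a[1] * b[1]) ** 2

def phi(x):
    """Explicit feature map for the same kernel: 2-D input to 3-D features."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

a, b = (1.0, 2.0), (3.0, 0.5)
k_direct = poly2_kernel(a, b)        # never builds the 3-D features
k_explicit = dot(phi(a), phi(b))     # builds them first, then takes the dot product
print(k_direct, k_explicit)
```

Both routes give the same value up to floating-point rounding, but the kernel route never materializes the higher-dimensional vectors.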

5) Reinforcement Learning [14]

a) What role does the learning rate α play in temporal difference learning? How is using high values different from using low values for α? [2]

b) Apply temporal difference learning to the DEF-World (see the DEF-World figure), relying on the following assumptions: [3]

- The agent starts in state 7 and applies n-w-y (n first)
- When the agent applies y in state 5, he moves back to state 4
- Utilities of states are initialized with 0

What are the utilities of the visited states after n-w-y has been applied? Do not only give the final result but also how you derived the final result, including formulas used!
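
The update rule involved can be sketched as follows, assuming the common TD(0) form U(s) ← U(s) + α·(reward + γ·U(s') − U(s)). The trajectory, rewards, α and γ below are hypothetical and are not the DEF-World values; whether the reward belongs to the state left or the state entered is also an assumption that depends on the lecture's convention.

```python
def td_update(U, s, s_next, reward, alpha=0.5, gamma=1.0):
    """One TD(0) step: U(s) <- U(s) + alpha * (reward + gamma * U(s_next) - U(s))."""
    U[s] += alpha * (reward + gamma * U[s_next] - U[s])
    return U[s]

# HYPOTHETICAL three-step trajectory and rewards (not the DEF-World values).
U = {"s0": 0.0, "s1": 0.0, "s2": 0.0, "s3": 0.0}   # utilities start at 0
trajectory = [("s0", "s1", 0.0), ("s1", "s2", 4.0), ("s2", "s3", -2.0)]
for s, s_next, reward in trajectory:
    td_update(U, s, s_next, reward)
print(U)
```

Because every utility starts at 0, the first pass only propagates the immediate rewards, each scaled by α.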


c) Give the Bellman Equation for state 8! [2]
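
For orientation, the general form of the Bellman equation is given below, assuming the utility-of-states convention with reward $R(s)$, discount factor $\gamma$ and transition model $P(s' \mid s, a)$; these symbols are assumptions about the lecture's notation. The instance for state 8 needs the actions, transitions and rewards from the DEF-World figure, which is not reproduced here.

```latex
U(s) = R(s) + \gamma \max_{a} \sum_{s'} P(s' \mid s, a)\, U(s')
```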

d) How is reinforcement learning different from supervised learning, such as classification and prediction? [3]

e) Assume you have a policy that always selects the action that leads to the state with the highest expected utility. Present arguments that this is usually not a good policy by describing scenarios in which this policy leads to suboptimal behavior of agents! [4]
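
One scenario of this kind can be simulated with a toy two-action problem. The payoffs are hypothetical, and a fixed exploration schedule stands in for random (ε-greedy style) exploration so that the run is deterministic; `run` and `greedy_action` are illustrative names.

```python
REWARDS = [1.0, 5.0]  # HYPOTHETICAL deterministic payoffs of actions 0 and 1

def greedy_action(estimates):
    """Action with the highest estimated utility; ties go to action 0."""
    return 0 if estimates[0] >= estimates[1] else 1

def run(explore_every=0, steps=200):
    """Total reward; explore_every=0 means purely greedy behavior."""
    estimates = [0.0, 0.0]  # utility estimates are initialized with 0
    total = 0.0
    for t in range(steps):
        action = greedy_action(estimates)
        if explore_every and t % explore_every == explore_every - 1:
            action = 1 - action      # occasionally try the action that looks worse
        reward = REWARDS[action]
        estimates[action] = reward   # payoffs are deterministic, one sample suffices
        total += reward
    return total

greedy_total = run()
exploring_total = run(explore_every=10)
print(greedy_total, exploring_total)
```

The purely greedy agent locks onto the first action it tries and never learns that the other one pays more, while the exploring agent gives up a few steps but discovers the better action.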

6) Comparing Classifiers [4]


Assume you have 2 different support vector machine approaches m1 and m2 whose testing accuracy (measured using n-fold cross-validation) is the same, but the area under the ROC curve for method m1 is larger than the area for method m2. In what aspect is m1 better than m2? (See footnote 1.) Might need to do some reading to answer this question!


[Figure: DEF-World (diagram not preserved in this copy)]
1 One way to answer this question is to construct a specific testing example for which both methods have the same accuracy but a different area under the curve, and then use this example to answer the question.
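
A sketch of the construction the footnote suggests: two made-up sets of classifier scores on the same six examples, with accuracy measured at an assumed threshold of 0.5 and the area under the ROC curve computed as the fraction of (positive, negative) pairs ranked correctly. All scores and helper names are hypothetical.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction classified correctly when score >= threshold means class 1."""
    hits = sum((s >= threshold) == bool(y) for y, s in zip(labels, scores))
    return hits / len(labels)

def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels    = [1, 1, 1, 0, 0, 0]
scores_m1 = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]   # made-up scores for m1
scores_m2 = [0.9, 0.8, 0.1, 0.6, 0.4, 0.2]   # made-up scores for m2

print(accuracy(labels, scores_m1), accuracy(labels, scores_m2))
print(auc(labels, scores_m1), auc(labels, scores_m2))
```

Both score sets make the same two mistakes at the 0.5 threshold, yet m1 ranks more positive examples above negative ones, which is the property the area under the curve captures.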