1
Final Exam
COSC
63
42
Machine Learning
Solution Sketches
May 8, 2013
Your Name:
Your
Student id:
Problem 1 [
4
]:
Information Gain
Problem 2 [
14
]:
Ensemble Methods
+ Other
Problem 3 [
1
4
]:
Reinfo
r
cement Learning
Problem 4 [
11
]:
Computations in Belief
Networks /D

separation
Problem
5
[
5
]:
Kernel Methods
Problem
6
[
10]
:
Support Vector Machines
Problem
7
[
10]
:
Write an Essay
Problem 8 [
7
]:
K

means
Problem
9
[
8
]:
Non

parametric Density Estimation
:
Grade:
The exam is “open books
and notes
” and y
ou have
120
minutes to complete
t
he exam.
The use of computers is not allowed!
The exam will count
about
33
% towards the course grade.
Curve: f(x)=round(84.1+ (x

47.2)*1.06)
2
1)
Information Gain [
4
]
Assume we have
a classification problem
involving
3 classes: professors, students, and
staff members. There are
800
students, 1
0
0 staff members and 100 professor
s
.
All
professors have blond hair, 50 staff members have blond hair,
and
400
students have
blond hair. Comp
ute the information gain of the test
“hair_color=’blond’”
that returns
true or false
.
Just giving the formula
that computes the information gain is fine; you do
not need to compute the exact value of the formula!
Use H as the entropy function
in
your formula
(e.g. H(1/3,
1/6,1/2
) is the entropy that 1/3 of the examples belong to class1
1/6
of the examples belong to class 2
, and half of the examples belong to class 3
)
.
[
4
]
H(0.8,0.1,0.1)

0.55*H(400/550,50/550,100/550)

0.45*H(400/450,50/450,0)
One error: 0.5

1.5 points
Two errors: 0 points
2) Ensemble Methods [
14
]
a)
Ensembles have been quite successful in generating supervised learning systems
which exhibit very high accuracies. What is the explanation for that? Why is it better
to use a team of
diverse base classifiers rather than a single classification algorithm?
[4]
Building ensembles of different classifiers that make different kind of errors, leads to
higher accurac
ies
(as long the base classifiers’ accuracy is above 50%), as classifiers
whi
ch make a wrong decision are usually outvoted by classifiers which make the correct
decision.
Other answers might get credit!
b) What is the key idea of AdaBoost and boosting in general to obtain an ensemble of
diverse
base classifiers? [3]
The key idea
is to use example weighting, assigning high weights to examples that were
misclassified by previously generated classifiers, encouraging the generation of
classifiers which classify those examples correctly obtaining a classifier that makes
different kind
of errors, obtaining a set of diverse classifier in the end.
c
)
In ensemble learning it is important to obtain a set of base classifiers that a
re
diverse in
the sense that the base classifiers are making different kind of errors. Propose a method
that me
asures the diversity of an ensemble; e.g. using your proposed method we can
quantify
that
the diversity of Ensemble1 is 20% higher than the diversity of Ensemble2.
Be specific! [7]
Let
c1,c2 be classifiers, which map feature vectors
of examples
into clas
s lab
els
T be a training sets
T is the cardinality of the training set
We suggest to use the following distance
d
to assess the similarity between c1 and c2
d(c1,c2)= {t
Tc1(t)
c2(t)}/T
3
3
)
Reinforcement Learning
[1
4
]
a)
What are the main
differences between supervised learning and reinforcement
learning? [4]
SL: static world[0.5], availability to learn from a teacher/correct answer[1]
RL: dynamic changing world[0.5]; needs to learn from indirect, sometimes delayed
feedback/rewards
[1]
; suit
able for
exploring
of unknown worlds
[1]
;
temporal
analysis
/worried about the future/interested in an agent’
s long term well
being
[
0.5
], needs
to
carry out actions to find out if they are good
—
which actions/states are good is
(usually) not know
n
in advance1[
0.5]
At most 4 points!
b)
Answer the following questions for the ABC world (given at a separate sheet)
.
Give
the Bellman equation for states 1 and 4 of the ABC world
!
[3
]
U(1)=
5
+
*
U(4)
[1
]
U(4)=
3
+
*max (U(
2
)*0.
3
+
U(3)*0.1+
U(
5
)*0.
6
, U(
1
)*0.
4
+U(
5
)*0.
6
)
[2
]
No partial credit!
c) Assume you use temporal difference learning in conjunction with a random policy
which choses actions randomly assuming a uniform distribution. Do you believe that the
estimations obtained are a good measurement of the
“goodness” of states, that tell an
intelligent agent (
assume the agent is smart!!
) what states he/she should/should not visit?
Give reasons for your answer!
[3]
Not really; as we assume an intelligent agent will take actions that lead to good states and
av
oids bad states,
an
agent
that uses the random policy
might not recognize that a state is
a good state if both good and bad state
s
are
successors of this state; for example,
S2: R=+100
S1:R=

1
S3: R=

100
Due to the agent’s policy the agent will fail to realize the S1 is a good state, as the agent’s
average reward for visiting the successor state
s of S1
is 0; an intelligent agent would
almost always go from S1 to S2, making S1 a high utility state with respect to TD

learning.
d) What role does the learning rate
play in temporal difference learning; how does
running
temporal difference learning with l
ow values of
differ from running it with
high values of
? [2]
It determines how quickly our current belief
s
/estimations
are updated based on new
evidence.
e) Assume you
run temporal difference learning with high values of
—
what are the
implications o
f doing that? [2]
If
is high the agent will more focus on its long term wellbeing, and will shy away from
taking actions
—
although they lead to immediate rewards
—
that
will lead to
the medium
and long term suffering of the agent.
4
4.) Computations in Belief Networks
/D

separation
[1
1
]
Assume that the following Belief Network is given that consist
s
of
nodes
A, B, C, D, and
E that can take values
of
true and false.
a)
Using the given probabilities of the probability tables of the
above
belief network
(D
C
,
E
;
C

A,B
; A; B;
E
) give a formula to compute P(
D

A
).
Justify
all nontrivial steps
you used to obtain the formula
! [
7
]
P(
D

A
)=
P(D,CA)+P(D,~C
A
)= P(CA)*P(DA,C)+ P(CA
)*P(DA,
~
C
)
As D and A are d

separable given evidence C
—
the path is blocked exhibiting
pattern
1
, this formula can be simplified to:
= P(CA)*P(DC)+ P(
~
CA)*P(D
~C)
P(CA)= P(C,BA)+P(C,~BA)=P(BA)*P(CB,A)+ P(~BA)*P(C~B,A)=
As B and A are independent
assuming no evidence
(pattern
3
)
, this can be
simplified to:
P(B)*P(CB,A)+ P(~B)*P(C~B,A)
Using exactly the same derivation P(~CA) and P(DC) and P(D~C)
can be
computed
; e.g
:
P(DC)=P(D,EC)+P(D,~EC)=…
follow the solution for computing P(CA)!
b)
Are C and
E
independent; i
s
C
and
E

d

separable?
Give
a reason
for your answer!
denotes “no evidence given”
[2]
Yes, the
only
path C

D

E is blocked (pattern 3, as D is not part of the evidence)
c) Is
ECD d

separable from ACD? Give
a reason
for your answer
!
[2]
Yes,
the
only
path A

C

D

E is
blocked
(pattern
1
, as C belongs to the evidence).
A
/
B
C
D
/
E
5
5) Kernel Methods
[
5
]
There has been
a l
ot work in designing new kernels
in machine learning. Why is this
the
case? How can using kernels
help
in enhancing
particular techniques such as PCA or K

mean
s?
Mapping examples into a higher dimensional space frequently makes it easier to find us
more sophistated decision
/cluster
boundaries which
are not available if the ML algorithm
is used
in the orig
inal space
[3] L
eads to the creation of new features/new coordinate
systems that lead to a more successful results
of the ML algorithms[2].
Other answers
and more detail
ed
discussions of examples how a particular technique might benefit from
kernel
s
might
also
deserve credit!
6)
Support Vector Machine [
10
]
a) Why do most support vector machine approaches
usually
map example
s
to a higher
dimensional space?
[2]
It is easier to successful separate the examples with a hyperplan in the mapped space.
b)
The
support vector regression approach minimizes the following objective function,
given below
. G
ive a verbal description what this objective function minimizes! W
hat
purpose does
serve? What purpose does C serve? [5]
It maxim
izes the width of the

envelope of the regression function while keeping the
error resulting from making predictions that fall outside

envelope the low
[2.5]
.
C
determines the importance of each of the two objectives in the multi

objective
optimi
zation p
rocedure
[1].
defines an upper bound on what kind of errors we are
willing to tolerate
, as predictions that fall within the envelop are treated as
‘
0 error
’
predictions by the objective function!
[1.5].
for
t=1,..,n
subject to:
6
c) Assume you apply support vector regression to a particular problem and for
the
obtained hyper plane
and
are all 0 for the n training example
s
(t=1,..,n); what does
this
mean? [3]
All example fall within the

envelope of the regression function
or
all examples
predictions have
that
are
bounded by
.
7)
Write an Essay involving Machine Learning
[10]
C
hoose
one
of the essay
topics below (please use complete sentences)
and write an essay
of 10

16 sentences
length!
Topic1
: Assume you are the owner of a company which sells music (e.g.
songs, concert
recordings, CDs,…
) online. How can machine learning help to make your business more
successful?
Topic2
: Our society currently faces a lot of environmental chall
enges
,
such as air and
water pollution, oil and other spills,
shortage of environmental resources such as oil and
water
to name a few. How can machine learning help to better deal with environmental
challenges of our society?
7
8)
k

M
eans
[7]
a)
K

means
only finds clusters that are a local (but not a global
) maximum
of the
objective function J it minimizes.
What
does this mean? What are the implications of this
fact for using k

means to cluster real world datasets? [3]
K

means will not necessary find
the best clustering with the tightest clusters and the
quality of its results strongly depend on initialization
Run K

means multiple times with different initialization
s/random seeds
and choose the
clustering which h
as the lowest value for J.
b)
K

means
has difficulties clustering datasets which contain a lot of outliers. Explain,
why this is the case! What could be done to alleviate the problem? [4
+up to 3 extra
points
]
As K

means requires that all objects need to be assigned
to a cluster
; therefore,
outliers
have to be assigned to a particular cluster
,
leading to non

descriptive centroids that no
longer capture the characteristics of
most
objects
,
belonging to a particular cluster
[2].
Not answer given
to the second question
but some techniques that
have some merit
include
[2]
:
Remove outliers, prior to applying k

means
Use the cluster medoid, instead
of
the
cluster
centroid as the cluster summary
…
9
)
Non

parametric
Density Estimation
[
8
]
a)
Assume we have a 2D dataset
D
containing
3
objects:
D
={(
0
,0)
, (
3
,2) (3,4)}; moreover,
we use the Gaussian kernel
density
function to measure the density of
D
.
Assume we
want to compute the density at point (
3,3
) and you can
also
assume h=1
(
=1)
.
It is
sufficient just to give the formula and not necessary to com
pute the actual density in
(3,3
)!
[
5
]
If one error, at most 3 points; if two errors at most 0.5 points
; exception
:
if the fo
r
mula is
correct instead the constant in front you can give 4 points.
b) What role does the parameter h
play in the above approach; how does it impact the
obtained density functions; how does the appearance of the obtained density function
change if we would use h=0.5 or h=2.0 instead? [3]
The obtained density functions will be more rugged
(h=0.50
)
/more smo
o
th
(h=2)
and
contain more
(h=0.5)
/less local maxima
/minima
(h=2)
!
Enter the password to open this PDF file:
File name:

File size:

Title:

Author:

Subject:

Keywords:

Creation Date:

Modification Date:

Creator:

PDF Producer:

PDF Version:

Page Count:

Preparing document for printing…
0%
Comments 0
Log in to post a comment