Homework 2

02-450/02-750 Automation of Biological Research

Homework 2

Fall 2013

Due September 19, 2013


This homework requires you to implement active learning solutions (for a dataset we have provided for you) in your favorite programming language that has easy-to-use packages for machine learning.

You are strongly encouraged to stick with Python, MATLAB or R. Java/C++ can be supported through OpenCV (http://opencv.org/) (also available for Python).


(82Mb, required):

http://www.cs.cmu.edu/afs/cs/user/awn/www/homework2.tar.gz

Please download this file to your machine, and open it. On unix-y machines:

tar xzvf homework2.tar.gz

will do the trick.


(680Mb, only for extra credit part II)

http://www.cs.cmu.edu/afs/cs/user/awn/www/homework2_full.tar.gz

Don’t download this file until you’re ready to actually do the extra credit.


Important: For this homework you may discuss solutions with other students (if you do so, as stated in class you must write their names on your submission), but all code must be your sole effort. Print out a copy of your code as part of your homework submission.


Homework Problems Setup: One of the most frequent uses of active learning is the two-class classification setting where there is a (minority) class that is highly undesirable, there are few labeled data, and the cost of manual annotation is very high. An example of this is in quality control [1]: here, we have microscopy images that vary in “quality” (e.g. noise levels, focus, etc.) and in what protein subcellular localization pattern was imaged (e.g. nuclear, mitochondrial, endoplasmic reticular, etc.). Images in the minority class (class 0) are of poor quality and must be discarded. Images in the majority class (class 1) are not uniformly better in terms of quality than those in the minority class. Both classes are heterogeneous in that many different patterns were imaged. The distribution of per-class patterns is not known ahead of time.


These images are often quantified by feature vectors. In this case, we have provided in the file training_data.csv a particular kind of feature vector representation of images called SURF [2]. Each image had up to 20 feature vectors computed (which together attempt to represent important patterns in the image) and each is 128 doubles wide (128 columns in the file). In the file file_mapping.csv, each feature vector from training_data.csv is associated with an image; for convenience the images are numbered 1, 2, … We also provide ground truth labels (class 1/0) in the file labels.csv.

[1] Think about the “bananas and kiwis for an eccentric billionaire” example we discussed in class.

[2] H. Bay, A. Ess, T. Tuytelaars and L. Van Gool (2008) SURF: Speeded Up Robust Features. Computer Vision and Image Understanding (CVIU) 110(3): 346-359
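As a rough illustration of the file layout described above, the three files can be combined into per-image records. This is only a sketch under assumptions the handout does not confirm: that the files have no header row, that file_mapping.csv and labels.csv each contain one value per row of training_data.csv, and that rows are in the same order in all three. The helper names `read_numeric_csv` and `group_by_image` are hypothetical.

```python
import csv
from collections import defaultdict

def read_numeric_csv(handle):
    """Parse every non-empty row of an open CSV file into a list of floats.
    Assumes there is no header row."""
    return [[float(v) for v in row] for row in csv.reader(handle) if row]

def group_by_image(features, image_ids, labels):
    """Collect the feature vectors of each image and the one label they share.

    features  : list of feature vectors (rows of training_data.csv)
    image_ids : image number for each row (assumed layout of file_mapping.csv)
    labels    : class 0/1 for each row (assumed layout of labels.csv)
    """
    if not (len(features) == len(image_ids) == len(labels)):
        raise ValueError("expected one image id and one label per feature vector")
    vectors = defaultdict(list)
    image_label = {}
    for vec, img, lab in zip(features, image_ids, labels):
        img, lab = int(img), int(lab)
        # All feature vectors of an image must carry the same label.
        if image_label.setdefault(img, lab) != lab:
            raise ValueError(f"image {img} has conflicting labels")
        vectors[img].append(vec)
    return dict(vectors), image_label
```

With real files, each one would be opened with `open(path, newline="")` and passed to `read_numeric_csv`; if the actual files turn out to have headers or a different column layout, the parsing step needs to be adjusted accordingly.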



1. Randomly choose 50 images (so up to 50x20 = 1000 feature vectors) each of class 0 and class 1. These will be your initial data. The remainder will be “unlabeled” until your active learner(s) (see below) request them; when a label is requested for an image, you provide all the feature vectors and all of the labels for those feature vectors at once (they all match).

a. Choose your favorite classifier (say, RandomForest or an SVM) and implement the CAL algorithm (streaming model active learner) we discussed in class. Additionally, implement the randomized CAL learner with p=0.01 and 0.1 (three learners in total). The order in which the data will arrive is fixed for all three learners (increasing image numbers as given in file_mapping.csv, which is randomized with respect to the image numbers!). Remember that you don’t have to explicitly build a hypothesis space: it is enough to ask if there are two classifiers (call them A and B; these are surrogates for the division of the hypothesis space) that are consistent with all previous data and that would support either the class 0 or the class 1 assignment of the newest (unlabeled) image.

b. Plot the cumulative distribution of label requests as a function of how many images were observed in sequence for all three learners. Is there a point where the label requests occur at a roughly fixed frequency? (Yes/No) Explain (in words) why you think this is so.

c. Plot the accuracy of the surrogate classifiers A and B whenever the active learner requested a label as a function of the cumulative distribution of label requests. Describe (and explain with words) any trend you see.
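The surrogate-classifier test in 1(a) can be illustrated on a toy problem. The sketch below is not the homework solution: it replaces images and RandomForest/SVM models with 1-D points and threshold classifiers, because for thresholds "consistent with all previous data" can be checked exactly. The names `fit_threshold`, `cal_stream` and `oracle` are hypothetical, and the `p` parameter encodes one assumed reading of "randomized CAL" (request a label with probability p even when A and B cannot both exist); check this against the definition given in class.

```python
import random

def fit_threshold(points):
    """Return t such that h(x) = 1 if x >= t else 0 agrees with every
    (x, label) pair in points, or None when no such threshold exists."""
    zeros = [x for x, y in points if y == 0]
    ones = [x for x, y in points if y == 1]
    if zeros and ones and max(zeros) >= min(ones):
        return None                      # no consistent hypothesis remains
    if ones:
        return min(ones)                 # smallest class-1 point is a valid cut
    return max(zeros) + 1.0 if zeros else 0.0

def cal_stream(initial, stream, oracle, p=0.0, seed=0):
    """Process points in the given order and record when a label is requested."""
    rng = random.Random(seed)
    labeled = list(initial)
    queried = []                         # 1 if a label was requested at this step
    for x in stream:
        a = fit_threshold(labeled + [(x, 0)])   # surrogate A: supports class 0
        b = fit_threshold(labeled + [(x, 1)])   # surrogate B: supports class 1
        disagree = a is not None and b is not None
        # Assumption: the randomized variant also queries with probability p.
        if disagree or rng.random() < p:
            labeled.append((x, oracle(x)))
            queried.append(1)
        else:
            # Only one assignment is consistent, so the label is inferred.
            labeled.append((x, 0 if a is not None else 1))
            queried.append(0)
    return labeled, queried
```

The running total of `queried` (for example via `itertools.accumulate`) gives the curve asked for in 1(b). With a real classifier the consistency check is only approximate, for instance by training A and B with the new image forced to class 0 and class 1 respectively and comparing their training error on the previously labeled data.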

2. Same as problem (1) above, except your starting data should be 100 images of class 0 and 50 images of class 1. (Repeat the same steps 1(a-c) with a different initial labeled dataset.)

a. At what point (if any) do the differences in the number of initial data for the minority class matter? Explain why you think this is so.
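Problems 1 and 2 differ only in the initial labeled set, so the selection step can be written once with the per-class counts as parameters. The following is a minimal sketch, assuming a dictionary `image_label` that maps image numbers to their class (0 or 1); that dictionary and the function name `initial_split` are hypothetical and not part of the provided data.

```python
import random

def initial_split(image_label, n_class0=50, n_class1=50, seed=0):
    """Pick the initial labeled images and return the remaining stream.

    image_label : dict mapping image number -> class (0 or 1)
    Returns (initial, stream): the sorted initial image numbers and the
    remaining image numbers in increasing order, which is the arrival
    order the handout prescribes for the streaming learners.
    """
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for img in sorted(image_label):
        by_class[image_label[img]].append(img)
    # random.sample raises ValueError if a class has too few images.
    initial = rng.sample(by_class[0], n_class0) + rng.sample(by_class[1], n_class1)
    chosen = set(initial)
    stream = [img for img in sorted(image_label) if img not in chosen]
    return sorted(initial), stream
```

Problem 1 would correspond to `initial_split(image_label, 50, 50)` and problem 2 to `initial_split(image_label, 100, 50)`. Fixing `seed` keeps the initial set identical across the three learners being compared.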

3. Let’s instead explore the more realistic case for these data: the membership query model.

a. Implement the hierarchical clustering (label propagation) methods from lecture 4: homogeneous splits as well as the “ignoring labels” (random) variant. Remember: your method chooses which image to label in this case, and each time gets up to 20 labeled feature vectors back.

b. Plot the accuracy of predictions as a function of the number of queries. Is there a point where the homogeneous splits method outperforms the randomized variant? Explain any trends you observe.
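The general shape of a cluster-then-propagate learner can be sketched as follows. This is a simplified illustration and not necessarily the method presented in lecture 4: it assumes a cluster tree has already been built over the images (a leaf is an image number, an internal node is a `(left, right)` tuple), it splits a cluster as soon as its queried labels are mixed, and its `use_labels=False` branch is only a guess at what the “ignoring labels” variant means (split with probability 0.5 regardless of labels). All names are hypothetical.

```python
import random
from collections import Counter

def leaves(node):
    """Return the image numbers under a node of the cluster tree."""
    if isinstance(node, int):
        return [node]
    return leaves(node[0]) + leaves(node[1])

def hierarchical_sampling(tree, oracle, budget, use_labels=True, seed=0):
    """Query up to `budget` images, then propagate labels within clusters."""
    rng = random.Random(seed)
    pruning = [tree]                 # current partition of the images
    known = {}                       # image number -> queried label
    for _ in range(budget):
        # Choose a cluster that still has unlabeled images, weighted by size.
        candidates = [n for n in pruning
                      if any(i not in known for i in leaves(n))]
        if not candidates:
            break
        weights = [len(leaves(n)) for n in candidates]
        node = rng.choices(candidates, weights=weights)[0]
        item = rng.choice([i for i in leaves(node) if i not in known])
        known[item] = oracle(item)
        seen = {known[i] for i in leaves(node) if i in known}
        if use_labels:
            split = len(seen) > 1            # queried labels are not homogeneous
        else:
            split = rng.random() < 0.5       # assumed "ignoring labels" rule
        if split and not isinstance(node, int):
            pruning.remove(node)
            pruning.extend(node)             # replace the node by its two children
    # Fallback for clusters that ended up with no queried image.
    default = Counter(known.values()).most_common(1)[0][0] if known else None
    predictions = {}
    for node in pruning:
        votes = Counter(known[i] for i in leaves(node) if i in known)
        guess = votes.most_common(1)[0][0] if votes else default
        for i in leaves(node):
            predictions[i] = known.get(i, guess)
    return predictions, len(known)
```

Calling the function with increasing `budget` values and comparing `predictions` against the ground truth gives one point per budget on the accuracy curve of 3(b). In the homework setting each query returns all of an image's feature vectors, so the tree would be built over images (or the query would label every feature vector belonging to the chosen image).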



Extra Credit

Part I: Choose a heuristic discussed in lecture 5 and repeat question (3) with it instead of the hierarchical method. Contrast the results you obtain.

Part II: (only do this if you have plenty of time) We have provided the images (as PNG files) in
http://www.cs.cmu.edu/afs/cs/user/awn/www/homework2_full.tar.gz

Compute ORB features [3] using the OpenCV package [4] for each image, and use these feature data instead of those provided in training_data.csv. Repeat problem (1) of this homework and discuss your findings.




[3] Described in E. Rublee, V. Rabaud, K. Konolige, G.R. Bradski (2011) ORB: An efficient alternative to SIFT or SURF. ICCV 2011: 2564-2571

[4] http://docs.opencv.org/modules/features2d/doc/feature_detection_and_description.html?highlight=orb#ORB