750 Automation of Biological Research
Due September 19
This homework requires you to implement active learning solutions (for a dataset we have provided for you) in your favorite programming language that has suitable packages for machine learning. You are strongly encouraged to stick with Python, MATLAB, or R. Java/C++ can be supported through OpenCV (http://opencv.org/), which is also available for Python.
Please download homework2.tar.gz to your machine and unpack it. On Unix,
tar xzvf homework2.tar.gz
will do the trick. The image archive (680 MB) is needed only for the extra credit Part II; do not download that file until you're ready to actually do the extra credit.
For this homework you may discuss solutions with other students (if you do so, as stated in class, you must write their names on your submission), but all code must be your sole effort. Print out a copy of your code as part of your homework submission.
Homework Problems Setup
One of the most frequent uses of active learning is the two-class classification setting where there is a (minority) class that is highly undesirable, there are few labeled data, and the cost of manual annotation is very high. An example of this is in quality control: here, we have microscopy images that vary in "quality" (e.g. noise levels, focus, etc.) and in what protein subcellular localization pattern was imaged (e.g. nuclear, mitochondrial, endoplasmic reticular, etc.). The minority class (class 0) images are of poor quality and must be discarded. The majority class (class 1) images are not uniformly better in terms of quality than the minority class. Both classes are heterogeneous in that many different patterns were imaged. The distribution of per-class patterns is not known ahead of time.
These images are often quantified by features. In this case, we have provided in the file training_data.csv a particular kind of feature vector representation of images called SURF (Bay, Ess, Tuytelaars, and Van Gool, "SURF: Speeded Up Robust Features," Computer Vision and Image Understanding (CVIU), 2008). Each image had up to 20 feature vectors computed (which together attempt to represent important patterns in the image), and each is 128 doubles wide (128 columns in the file). In the file file_mapping.csv, each feature vector from training_data.csv is associated to an image; for convenience, the images are numbered 1, 2, ... We also provide ground truth labels (class 1/0) in the file. To build intuition for this setting, think about the "bananas and kiwis for an eccentric billionaire" example we discussed in class.
1. Randomly choose 50 images (so up to 50x20 = 1000 feature vectors) each of class 0 and class 1. These will be your initial labeled data. The remainder will be "unlabeled" until your active learner(s) (see below) request them; when a label is requested for an image, you provide all the feature vectors and all of the labels for those feature vectors at once (they all match).
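The initial sampling step could be sketched as follows. This assumes the labels file has one (image, label) row per image and that file_mapping.csv has one image number per feature-vector row; the column names and the file name labels.csv are assumptions for illustration, not given in the handout.

```python
import numpy as np
import pandas as pd

def initial_sample(labels, n_per_class=50, seed=0):
    """Randomly choose n_per_class image numbers from each class.

    labels: DataFrame with columns 'image' and 'label' (0/1)."""
    rng = np.random.default_rng(seed)
    chosen = []
    for cls in (0, 1):
        imgs = labels.loc[labels["label"] == cls, "image"].to_numpy()
        chosen.extend(rng.choice(imgs, size=n_per_class, replace=False))
    return set(chosen)

# Usage with the provided files (file/column names are assumptions):
# labels = pd.read_csv("labels.csv", header=None, names=["image", "label"])
# init_images = initial_sample(labels)
# mapping = pd.read_csv("file_mapping.csv", header=None, names=["image"])
# features = pd.read_csv("training_data.csv", header=None)
# mask = mapping["image"].isin(init_images).to_numpy()
# X0 = features.to_numpy()[mask]   # all feature vectors of the chosen images
```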
Choose your favorite classifier (say, RandomForest or an SVM) and implement the CAL algorithm (the streaming-model active learner) we discussed in class. Also implement the randomized CAL learner with p=0.01 and p=0.1 (three learners in total). The order in which the data arrive is fixed for all three learners (use the order as given in file_mapping.csv, which is randomized with respect to the image numbers!). Remember that you don't have to explicitly build a hypothesis space; it is enough to ask if there are two classifiers (call them A and B; these are surrogates for the division of the hypothesis space) that are consistent with all previous data and that would support either class 0 or class 1 assignment of the newest (unlabeled) image.
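One possible shape for this learner is sketched below, operating on single feature vectors for simplicity (in the homework a query returns all of an image's vectors at once). The helper names are mine, and the p < 1 branch is just one reading of the randomized variant (query a disagreement point only with probability p); follow the definition from lecture. A restricted hypothesis class (linear SVM here) is what makes the consistency test informative; a very flexible model would fit everything and query every point.

```python
import numpy as np
from sklearn.svm import SVC

def consistent(X, y):
    """Can a linear classifier fit (X, y) with zero training error?"""
    clf = SVC(kernel="linear", C=1e6).fit(X, y)
    return np.array_equal(clf.predict(X), np.asarray(y))

def cal_stream(X_stream, oracle, X0, y0, p=1.0, rng=None):
    """CAL over a fixed stream. oracle(i) -> true label of stream item i."""
    rng = rng or np.random.default_rng(0)
    X, y = list(X0), list(y0)
    requests = []
    for i, x in enumerate(X_stream):
        Xc = np.vstack([X, [x]])
        a = consistent(Xc, np.append(y, 0))   # surrogate A: forced class 0
        b = consistent(Xc, np.append(y, 1))   # surrogate B: forced class 1
        if a and b:                           # disagreement region: ask
            if rng.random() < p:              # randomized variant skips some
                X.append(x)
                y.append(oracle(i))
                requests.append(i)
        elif a or b:                          # only one label is consistent:
            X.append(x)                       # the label is implied, no query
            y.append(0 if a else 1)
        # if neither fits (noise), the point is skipped in this sketch
    return requests, (np.array(X), np.array(y))
```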
(a) Plot the cumulative distribution of label requests as a function of how many images were observed in sequence, for all three learners. Is there a point where the label requests occur at roughly a fixed frequency? (Yes/No) Give an explanation (in words) of why you think this is so.
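For this plot, a small helper can turn the request positions into a cumulative count; the function name is mine. request_indices holds the 1-based positions in the stream at which a label was requested, and something like plt.step(range(1, n_seen + 1), cum) then draws one learner's curve.

```python
import numpy as np

def cumulative_requests(request_indices, n_seen):
    """Cumulative number of label requests after each of n_seen images."""
    seen = np.arange(1, n_seen + 1)
    # count how many requests happened at or before each observed position
    return np.searchsorted(np.sort(request_indices), seen, side="right")
```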
(b) Plot the accuracy of the classifiers A and B, whenever the active learner requested a label, as a function of the cumulative number of label requests. Describe (explain in words) any trend you see.
2. Same as problem (1) above, except your starting data should be 100 images of class 0 and 50 images of class 1. (Repeat the same steps 1(a)-(c) with this different initial labeled dataset.) At what point (if any) do the differences in the number of initial data for the minority class matter? Explain why you think this is so.
3. Let's instead explore the more realistic case for these data: the membership query model. Implement the hierarchical clustering (label propagation) method from lecture 4, with homogeneous splits as well as the "ignoring labels" (random) variant. Remember: your method chooses which image to label in this case, and each time gets up to 20 labeled feature vectors back.
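A loose sketch of cluster-based querying in the membership query model is below; it is not a faithful reimplementation of the lecture's method, so follow your notes for the details. Each image is summarized by one row (e.g. the mean of its up-to-20 SURF vectors), the hierarchy is cut at progressively finer levels, one member of each unlabeled cluster is queried, and the label is propagated while the cluster's sampled labels look homogeneous. All function names are mine.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_query(X_img, oracle, budget, rng=None):
    """X_img: one summary row per image; oracle(i) -> label of image i.

    Returns (predicted labels per image, dict of queried image -> label)."""
    rng = rng or np.random.default_rng(0)
    Z = linkage(X_img, method="ward")        # hierarchical clustering
    pred = np.full(len(X_img), -1)
    queried = {}
    k = 2
    while len(queried) < budget and k <= len(X_img):
        clusters = fcluster(Z, t=k, criterion="maxclust")
        for c in np.unique(clusters):
            members = np.where(clusters == c)[0]
            known = [queried[i] for i in members if i in queried]
            if not known and len(queried) < budget:
                i = rng.choice(members)       # query one member of the cluster
                queried[i] = oracle(i)
                known = [queried[i]]
            if known and len(set(known)) == 1:
                pred[members] = known[0]      # propagate the homogeneous label
        k *= 2                                # refine the pruning and repeat
    return pred, queried
```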
Plot the accuracy of predictions as a function of the number of labels requested. Is there a point where the homogeneous-splits method outperforms the randomized variant? Explain any trends.
Extra Credit
Part I: Choose a heuristic discussed in lecture 5 and repeat question (3) with it instead of the hierarchical method. Contrast the results you obtain.
Part II: (only do this if you have plenty of time) We have provided the images (as PNG files) in the separate 680 MB archive. Compute ORB features (Rublee, Rabaud, Konolige, and Bradski, "ORB: An efficient alternative to SIFT or SURF," ICCV 2011) for each image using the OpenCV package, and use these feature data instead of those provided in training_data.csv. Repeat problem (1) of this homework and discuss your findings.