Assignment 1: Using the WEKA Workbench

Artificial Intelligence and Robotics

15 Oct 2013


A. Become familiar with the use of the WEKA workbench to invoke several different machine
learning schemes.


Use the latest stable version. Use both the graphical interface (Explorer) and the command line interface (CLI).

See the Weka home page for Weka documentation.

B. Use the following learning schemes, with the default settings, to analyze the weather data (in weather.arff). For test options, first choose "Use training set", then choose "Percentage Split" with the default 66% split. Report each model's percent error rate.



ZeroR (majority class)



OneR



Naive Bayes Simple



J4.8

C. Which of these classifiers are you more likely to trust when determining whether to play?
Why?


D. What can you say about accuracy when testing on the training set versus testing on a held-out percentage split?


Assignment 2: Preparing the data and mining it




A. Take the file genes-leukemia.csv (here is the description of the data) and convert it to the Weka file genes-a.arff.

You can convert the file either using a text editor like emacs (the brute-force way) or by finding a Weka command that converts a .csv file to .arff (the smart way).
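The smart way: recent Weka versions include a CSV-to-ARFF converter, weka.core.converters.CSVLoader, invoked roughly as java weka.core.converters.CSVLoader genes-leukemia.csv > genes-leukemia.arff (with weka.jar on the classpath). For intuition about what the brute-force route has to produce, here is a toy sketch; the file name and the blanket string attribute type are invented, and a real conversion must infer proper attribute types (numeric, nominal):

```shell
# Brute-force sketch: turn a tiny made-up CSV into minimal ARFF text.
cat > tiny.csv <<'EOF'
ID,G1,CLASS
s1,120,ALL
s2,85,AML
EOF

{
  echo '@relation tiny'
  # One @attribute line per CSV header field. All are declared "string"
  # here for brevity; a real converter infers types such as {ALL,AML}.
  head -1 tiny.csv | tr ',' '\n' | while read -r name; do
      echo "@attribute $name string"
  done
  echo '@data'
  tail -n +2 tiny.csv       # the data rows follow the @data marker
} > tiny.arff

grep -c '@attribute' tiny.arff   # expect 3
```

The ARFF skeleton is just a header (@relation, one @attribute per column) followed by @data and the CSV rows.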



B. The target field is CLASS. Use J48 on genes-leukemia with the "Use training set" option.



C. Use genes-leukemia.arff to create two subsets:

genes-leukemia-train.arff, with the first 38 samples (s1 ... s38) of the data;

genes-leukemia-test.arff, with the remaining 34 samples (s39 ... s72).
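One way to make the two subsets without hand-editing every line is to keep the full ARFF header in both files and split only the rows after @data. A sketch on a toy file (names and row counts are illustrative; the real split is 38/34):

```shell
# Sketch: split an ARFF's @data section by row position.
cat > toy.arff <<'EOF'
@relation toy
@attribute ID string
@attribute CLASS {ALL,AML}
@data
s1,ALL
s2,ALL
s3,AML
s4,AML
EOF

# Line number of the @data marker:
data_line=$(grep -n '^@data' toy.arff | cut -d: -f1)

# First 2 data rows -> train, the rest -> test (38/34 in the assignment).
head -n "$data_line" toy.arff > toy-train.arff
sed -n "$((data_line+1)),$((data_line+2))p" toy.arff >> toy-train.arff
head -n "$data_line" toy.arff > toy-test.arff
sed -n "$((data_line+3)),\$p" toy.arff >> toy-test.arff
```

Both output files share the identical header, so Weka sees the same attribute declarations in train and test.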



D. Train J48 on genes-leukemia-train.arff and specify "Use training set" as the test option.

What decision tree do you get? What is its accuracy?



E. Now specify genes-leukemia-test.arff as the test set.

What decision tree do you get, and how does its accuracy compare to the one in the previous question?



F. Now remove the field "Source" from the classifier (unclick the checkmark next to Source and click Apply Filter in the top menu) and repeat steps D and E.

What do you observe? Does the accuracy on the test set improve, and if so, why do you think it does?



G. Extra credit: which classifier gives the highest accuracy on the test set?

Assignment 3: Data Cleaning and Preparing for Modeling



The previous assignment used a selected subset of the top 50 genes for a particular leukemia dataset. In this assignment you will do the work of a real data miner, working with an actual genetic dataset from the very beginning.

You will see that the process of data mining frequently has many small steps that all need to be done correctly to get good results. However tedious these steps may seem, the goal is a worthy one -- helping make an early diagnosis of leukemia, a common form of cancer. Making a correct diagnosis is literally a life-and-death decision, so we need to be careful to do the analysis correctly.

3A. Get data

Take the ALL_AML_original_data.zip file from the Data directory and extract from it:

Train file: data_set_ALL_AML_train.txt

Test file: data_set_ALL_AML_independent.txt

Sample and class data: table_ALL_AML_samples.txt

This data comes from pioneering work by Todd Golub et al. at the MIT Whitehead Institute (now the MIT Broad Institute).

1. Rename the train file to ALL_AML_grow.train.orig.txt and the test file to ALL_AML_grow.test.orig.txt.

Convention: we use the same file root for files of similar type and different extensions for different versions of these files. Here "orig" stands for the original input files and "grow" stands for genes in rows. We will use the extension .tmp for temporary files that are typically used for just one step in the process and can be deleted later.

Note: The pioneering analysis of the MIT biologists is described in their paper Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring (pdf).

Both train and test datasets are tab-delimited files with 7130 records.

The "train" file should have 78 fields and the "test" file 70 fields. The first two fields are:

Gene Description (a long description like GB DEF = PDGFRalpha protein) and

Gene Accession Number (a short name like X95095_at)

The remaining fields are pairs of a sample value (under sample numbers 1, 2, ... 38) and an Affymetrix "call" (P if the gene is present, A if absent, M if marginal).

Think of the training data as a very tall and narrow table with 7130 rows and 78 columns. Note that it is "sideways" from the machine learning point of view: the attributes (genes) are in rows, and the observations (samples) are in columns. This is the standard format for microarray data, but to use it with machine learning tools like WEKA, we will need to do a "matrix transpose" -- flip the matrix to make files with genes in columns and samples in rows. We will do that in step 3B.6 of this assignment.

Here is a small extract:

Gene Description                          Gene Accession Number   1     call   2     call   ...
GB DEF = GABAa receptor alpha-3 subunit   A28102_at               151   A      263   P      ...
...                                       AB000114_at             72    A      21    A      ...
...                                       AB000115_at             281   A      250   P      ...
...                                       AB000220_at             36    A      43    A      ...

3B: Clean the data

Perform the following cleaning steps on both the train and test sets. Use unix tools, scripts, or other tools for each task.

Document all the steps and create intermediate files for each step. After each step, report the number of fields and records in the train and test files. (Hint: use the unix command wc to find the number of records and awk or gawk to find the number of fields.)
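The hint can be sketched as follows on a made-up tab-delimited file:

```shell
# Count records (wc) and fields (awk) in a toy tab-delimited file.
printf 'ID\tg1\tg2\n'  >  toy.tsv
printf 's1\t10\t20\n'  >> toy.tsv
printf 's2\t30\t40\n'  >> toy.tsv

records=$(wc -l < toy.tsv)                       # number of records (lines)
fields=$(awk -F'\t' 'NR==1{print NF}' toy.tsv)   # fields in the first record

echo "$records records, $fields fields"
```

Running both checks after every cleaning step is a cheap sanity test that a field or record was not silently lost.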

Microarray Data Cleaning Steps

1. Remove the initial records with Gene Description containing "control". (Those are Affymetrix controls, not human genes.) Call the resulting files ALL_AML_grow.train.noaffy.tmp and ALL_AML_grow.test.noaffy.tmp.

Hint: you can use the unix command grep to remove the control records.

How many such control records are in each file?

2. Remove the first field (long description) and the "call" fields, i.e. keep fields numbered 2, 3, 5, 7, 9, ... Hint: use the unix cut command to do that.
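A sketch of the cut invocation on a single toy row; with the real 78-field file the list would run 2,3,5,...,77 (and 2,3,5,...,69 for the 70-field test file):

```shell
# Sketch: keep field 2 plus every odd-numbered field from 3 on,
# dropping field 1 (long description) and the "call" fields.
# Toy tab-separated row with 8 fields:
printf 'desc\tX1_at\t151\tA\t263\tP\t88\tA\n' > row.tmp

# Build the field list "2,3,5,7" for an 8-field file with seq.
fieldlist="2,$(seq 3 2 7 | paste -sd, -)"
cut -f"$fieldlist" row.tmp > kept.tmp

cat kept.tmp
```

Generating the list with seq avoids typing 39 field numbers by hand for the real file.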

3. Replace all tabs with commas.
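For this step, tr is enough; a one-line sketch on invented input:

```shell
# Sketch: convert tab-separated to comma-separated with tr.
printf 'ID\t10\t20\n' > in.tmp
tr '\t' ',' < in.tmp > out.tmp
cat out.tmp   # ID,10,20
```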

4. Change "Gene Accession Number" to "ID" in the first record. (You can use emacs here.)

(Note: that will prevent possible problems that some data mining tools have with blanks in field names.)
5.

Normalize the data: for each value, set the minimum field value to 20 and the maximum to
16,000. (Note: The expression values less than 20 or o
ver 16,000 were considered by
biologists unreliable for this experiment.)

Write a small Java program or Perl script to do that.

Call the generated files ALL_AML_grow.train.norm.tmp and ALL_AML_grow.test.norm.tmp
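The thresholding can also be sketched in awk instead of Java or Perl (toy rows below; note the real files also carry a header record, which should be passed through unchanged, e.g. with NR==1 {print; next}):

```shell
# Sketch: clip every expression value to [20, 16000] with awk.
# Toy comma-separated rows: first field is the gene ID, rest are values.
printf 'g1,5,100,20000\ng2,20,16000,7\n' > norm.in.tmp

awk -F, -v OFS=, '{
  for (i = 2; i <= NF; i++) {       # skip the ID field
    if ($i < 20)    $i = 20
    if ($i > 16000) $i = 16000
  }
  print
}' norm.in.tmp > norm.out.tmp

cat norm.out.tmp
```

Modifying a field forces awk to rebuild the record with OFS, so the output stays comma-separated.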

6. Write a short Java program or shell script to transpose the data to get ALL_AML_gcol.train.tmp and ALL_AML_gcol.test.tmp ("gcol" stands for genes in columns). These files should each have 7071 fields, with 39 records in the "train" and 35 records in the "test" dataset.

7. Extract from the file table_ALL_AML_samples.txt the tables ALL_AML_idclass.train.txt and ALL_AML_idclass.test.txt with sample ids and sample labels, space separated. Here you can use a combination of unix commands and manual editing in emacs.

Add a header row with "ID Class" to each of the files.

File ALL_AML_idclass.train.txt should have 39 records and two columns. The first record (header) has "ID Class", the next 27 records have class "ALL", and the last 11 records have class "AML". Be sure to remove all stray spaces and tabs from this file.

ALL_AML_idclass.test.txt should have 20 "ALL" samples and 14 "AML" samples, intermixed.

8. Note that the sample numbers in the ALL_AML_gcol*.csv files are in a different order than in the *idclass files. Use unix commands to create combined files ALL_AML_gcol_class.train.csv and ALL_AML_gcol_class.test.csv, which have ID as the first field, Class as the last field, and the gene expression fields in between.
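Because the sample orders differ, a positional paste will not work; one awk sketch that matches rows by ID instead (toy data, with Class appended as the last field):

```shell
# Sketch: append the Class column by matching on ID.
# Toy files: expr rows are comma-separated, idclass is space-separated.
printf 'ID,g1\ns2,50\ns1,10\n'      > expr.tmp
printf 'ID Class\ns1 ALL\ns2 AML\n' > idclass.tmp

# First pass (NR == FNR) builds an ID -> Class map from idclass.tmp;
# second pass appends the looked-up class to each expr row.
awk -F'[ ,]' 'NR == FNR { cls[$1] = $2; next }
              { print $0 "," cls[$1] }' idclass.tmp expr.tmp > combined.csv

cat combined.csv
```

The two-file NR == FNR idiom is the usual awk replacement for join when the inputs are not sorted the same way.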

3C: Build Models on a full dataset

1. As in assignment 2, convert

ALL_AML_gcol_class.train.csv to ALL_AML_allgenes.train.arff

ALL_AML_gcol_class.test.csv to ALL_AML_allgenes.test.arff

2. Using ALL_AML_allgenes.train.arff as the train file and ALL_AML_allgenes.test.arff as the test file, build a model using OneR. What accuracy do you get?

3. Now, excluding the field ID, build models using OneR, Naive Bayes Simple, and J4.8, using the training set only.

What models and error rates do you get with each method?

Warning: some of the methods may not finish or give you errors due to the large number of attributes
for this data.

4. If you got this far -- congratulations!

Based on your experience, what three things are important in the process of data mining?


Assignment 4: Feature Reduction


Start from the ALL_AML_train_processed data set (zipped file in the Data directory). This data was thresholded to >= 20 and <= 16,000, with genes in rows. You should have generated it as part of assignment 3. Name it ALL_AML_gr.thr.train.csv. (Note: almost all of this data preprocessing is much easier done with data in genes-in-rows format. We will convert the data to genes-in-columns when we are ready to do the modeling.)

1. Examining gene variation

A. Write a program (or a script) to compute the fold difference for each gene. The fold difference is the maximum value across samples divided by the minimum value. This quantity is frequently used by biologists to assess gene variability.
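A sketch of the fold-difference computation in awk (toy genes-in-rows CSV below; after the earlier thresholding every value is at least 20, so division by zero cannot occur):

```shell
# Sketch: fold difference = max/min across samples, per gene (row).
printf 'g1,20,40,160\ng2,100,100,100\n' > thr.tmp

awk -F, '{
  mx = mn = $2                       # initialize from the first value
  for (i = 3; i <= NF; i++) {
    if ($i > mx) mx = $i
    if ($i < mn) mn = $i
  }
  printf "%s,%g\n", $1, mx / mn      # gene ID, fold difference
}' thr.tmp > fold.tmp

cat fold.tmp
```

A gene whose values never change gets fold difference 1, which is why the lowest-fold question in part C is interesting.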

B. What is the largest fold difference, and how many genes have it?

C. What is the lowest fold difference, and how many genes have it?

D. Count how many genes have fold ratios in the following ranges:

Range             Count
Val <= 2          ..
2 < Val <= 4      ..
4 < Val <= 8      ..
8 < Val <= 16     ..
16 < Val <= 32    ..
32 < Val <= 64    ..
64 < Val <= 128   ..
128 < Val <= 256  ..
256 < Val <= 512  ..
512 < Val         ..

E. Extra credit: graph the fold ratio distribution appropriately.

2. Finding most significant genes

For the train set, samples 1-27 belong to class ALL, and samples 28-38 to class AML.

Let Avg1, Avg2 be the average expression values for the two classes.

Let Stdev1, Stdev2 be the sample standard deviations, which can be computed as

Stdev = sqrt((N*Sum_sq - Sum_val*Sum_val)/(N*(N-1)))

Here N is the number of observations, Sum_val is the sum of the values, and Sum_sq is the sum of the squares of the values.

The Signal to Noise (S2N) ratio is defined as (Avg1 - Avg2) / (Stdev1 + Stdev2)

The T-value is defined as (Avg1 - Avg2) / sqrt(Stdev1*Stdev1/N1 + Stdev2*Stdev2/N2)
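The definitions above, checked on one toy gene (the values are invented so that Stdev1 comes out to exactly 2, making the arithmetic easy to verify by hand):

```shell
# Sketch: Avg, Stdev (via the Sum_sq formula), S2N and T-value in awk
# for one toy gene with two classes of samples.
awk 'BEGIN {
  n1 = split("10 12 14", a1, " ")      # toy ALL values
  n2 = split("20 24", a2, " ")         # toy AML values
  for (i = 1; i <= n1; i++) { s1 += a1[i]; q1 += a1[i]^2 }
  for (i = 1; i <= n2; i++) { s2 += a2[i]; q2 += a2[i]^2 }
  avg1 = s1/n1; avg2 = s2/n2
  sd1 = sqrt((n1*q1 - s1*s1)/(n1*(n1-1)))   # the Stdev formula above
  sd2 = sqrt((n2*q2 - s2*s2)/(n2*(n2-1)))
  s2n = (avg1 - avg2)/(sd1 + sd2)
  t   = (avg1 - avg2)/sqrt(sd1*sd1/n1 + sd2*sd2/n2)
  printf "avg1=%g sd1=%g s2n=%g t=%g\n", avg1, sd1, s2n, t
}' > stats.tmp

cat stats.tmp
```

Your script for part A just repeats this per gene, accumulating Sum_val and Sum_sq separately over the ALL columns and the AML columns.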

A. Write a script that will compute, for each gene, the average and standard deviation for both classes ("ALL" and "AML"). Also compute for each gene the T-value and Signal to Noise ratio.

B. Select, for each class, the top 50 genes with the highest S2N ratio.




Which gene has the highest S2N for "ALL" (i.e. its high value is most correlated with class
"ALL")? 50th highest ? Give gene names and S2N values



Same question for "AML" class.



What is the relationship between S2N values for ALL and AML ?

C. Select, for each class, the top 50 genes with the highest T-value.

Which gene has the highest T-value for class "ALL"? The 50th highest? Give gene names and T-values.

For AML?

What is the relationship between T-values for ALL and AML? Will a similar relationship hold if there are more than 2 classes?

D. How many genes are in common between the top 50 genes for ALL selected using S2N and those selected using T-value? How many genes are in common among the top 3 genes in each list?

E. Same question for the top genes for "AML".


3. Lessons Learned

What have you learned so far about the process of feature selection and data preparation?


Predict Treatment Outcome

Note: For this assignment, we used CART from Salford Systems, which was available to us under an educational license. If CART is not available, another decision tree tool, such as J4.8 in Weka, can be used instead.


Start with the genes-leukemia.csv dataset used in assignment 2. (See the Dataset directory.)

As the field to predict, use TREATMENT_RESPONSE, which has values Success, Failure, or "?" (missing).
(missing)

Step 1. Examine the records where TREATMENT_RESPONSE is non-missing.

Q1: How many such records are there?

Q2: Can you describe these records using other sample fields (e.g. Year from XXXX to YYYY, or Gender = X, etc.)?

Q3: Why is it not correct to build predictive models for TREATMENT_RESPONSE using records where it is missing?

Step 2. Select only the records with non-missing TREATMENT_RESPONSE. Keep SNUM (sample number) but remove sample fields that are all the same or all missing. Call the reduced dataset genes-reduced.csv.

Q4: Which sample fields should you keep?

Step 3. Build a CART model using leave-one-out cross-validation.

Q5: What tree do you get, and what is the expected error rate?

Q6: What are the important variables and their relative importance, according to CART?

Q7: Remove the top predictor and re-run CART -- what do you get?

Step 4: Extra credit (10%):

Use Google to search the web for the name of the top gene that predicts the outcome, and briefly report relevant information that you find.

Step 5: Randomization test

Randomize the TREATMENT_RESPONSE variable 10 times and re-run CART for each randomized class.

Q8: Report the trees and error rates you get.

Q9: Based on the results in Q8, do you think the tree that you found with the original data is significant?
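The randomization itself can be sketched as a Fisher-Yates shuffle of the outcome labels, leaving the rest of each record fixed (toy CSV below with the label as the last field; each of the 10 runs would use a fresh shuffle):

```shell
# Sketch: permute the outcome labels among records for a null-model run.
printf 's1,Success\ns2,Failure\ns3,Success\n' > out.tmp

awk -F, -v OFS=, '
  { id[NR] = $1; lab[NR] = $2 }
  END {
    srand()
    for (i = NR; i > 1; i--) {          # Fisher-Yates shuffle of labels
      j = int(rand()*i) + 1
      tmp = lab[i]; lab[i] = lab[j]; lab[j] = tmp
    }
    for (i = 1; i <= NR; i++) print id[i], lab[i]
  }' out.tmp > randomized.csv

wc -l < randomized.csv   # same record count; labels permuted
```

Because only the labels move, any structure CART finds in the randomized data is noise, which is the baseline for judging the original tree.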



ID         Mid I/15   Mid II/15   Hws(1-3)/30   SubTotal/60   Final/40   Total/100   Grade
120060194  8          8           10            26
120070044  --         --          15            15
120070063  11         11          20            42
120070167  5          11          10            32
207210067  11         8           10            32
208110169  13         12          25            51
208110227  9          11          20            42