Data mining Lecture (doc) - Bioinformatics


20 Nov 2013


Data mining - knowledge discovery


--------------------------


Tutorial #1 - goal: To produce and interpret decision trees using the file breast-cancer.data.


Perform a classification run and look at the decision tree.


--------------------------


Tutorial #2 - goal: To perform data cleaning on the breast-cancer files.


Open the following files in TextPad:

breast-cancer.names

breast-cancer.data

breast-cancer.test


Note the following statement in the names file:

2 for benign, 4 for malignant


Using the search and replace function, find each instance of 2 at the end of a line and replace it with “benign” in all three open files. Do this again for the “4 -> malignant” conversion.


The $ sign is used in a regular expression to indicate the end of a line.
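The same end-of-line replacement can be sketched in Python for readers who want to script the cleaning step. This is only an illustration, not part of the TextPad procedure: the function name `relabel`, the sample rows, and the leading comma in the pattern are assumptions (the comma is an added precaution so that only a whole final field is matched).

```python
import re

# Mapping taken from the statement in the names file.
LABELS = {"2": "benign", "4": "malignant"}

# "$" anchors the match to the end of the line; the leading comma
# (an assumption about the comma-separated layout) ensures that only
# a complete last field equal to 2 or 4 is replaced.
CLASS_AT_END = re.compile(r",(2|4)$")

def relabel(line):
    """Replace a trailing 2 or 4 with its class name."""
    return CLASS_AT_END.sub(lambda m: "," + LABELS[m.group(1)], line)

# Made-up rows for illustration only; they are not from breast-cancer.data.
rows = ["1001,5,1,1,1,2,1,3,1,1,2", "1002,8,10,10,8,7,10,9,7,1,4"]
cleaned = [relabel(r) for r in rows]
for row in cleaned:
    print(row)
```

Values of 2 or 4 that appear earlier in a row are left alone, because the pattern only matches at the end of the line.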


Perform a classification run on these cleaned files. Notice that the decision trees are much easier to interpret now.


Close all files in TextPad.

--------------------------


Tutorial #3 - goal: To understand how to use SELDI data as input to See5.


Open daf-0248.names and daf-0248.test in TextPad.

Look at the bottom of the names file (use Control-End).

Notice the “result” line at the end of the file; it lists the two possible outcomes for classification.


result: Cancer, Control.


Open the file daf-0248.test. Notice that it has only one line of data, indicating that there is only one case in this test file.


Hit the End key to go to the end of the line. Is this one case from a cancer patient or
control?
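For a line this long, the last field can also be read programmatically. The sketch below assumes the case is a single comma-separated line with the outcome as its final field; the function name `outcome_of` and the shortened example line are invented for illustration and do not reveal the contents of daf-0248.test.

```python
def outcome_of(case_line):
    """Return the last comma-separated field of one case line."""
    fields = case_line.strip().split(",")
    return fields[-1].strip()

# Made-up, shortened case line; a real SELDI case has many more values.
example = "0.12,3.40,0.07,Control"
print(outcome_of(example))
```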


Perform a classification run on these files.

Starting with 2, how many boosting trials does it take before the test data is correctly classified?


What method do you think I used to produce these files from raw spectra?

Close all files in TextPad.


--------------------------


Tutorial #4 - goal: To understand how a costs file works.


Load the 400-hypothyroid.data file in See5.

Perform a classification run with 10 boosting trials and take note of the number and type of errors for the training data.


Rename the file 400-hypothyroid.cost to 400-hypothyroid.costs.

Open the 400-hypothyroid.costs file in TextPad.


Notice that a costs file is now shown for the 400-hypothyroid data set.

Again perform a classification run with 10 boosting trials and take note of the number and type of errors for the training data. What is the difference in error now?
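One way to think about the change: a costs file attaches a penalty to each kind of mistake, so the classifier is steered away from the expensive ones even if the raw error count rises. The sketch below assumes the "predicted class, true class: cost" line layout as I recall it from the See5 documentation, with unlisted errors costing 1. The class names, the cost of 5, and the error counts are invented for illustration and are not taken from 400-hypothyroid.

```python
def parse_costs(text):
    """Parse lines of the assumed form 'predicted, true: cost'."""
    costs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        pair, cost = line.rsplit(":", 1)
        predicted, true = (name.strip() for name in pair.split(","))
        costs[(predicted, true)] = float(cost)
    return costs

def total_cost(errors, costs):
    """Sum the cost of all misclassifications; unlisted errors cost 1."""
    return sum(n * costs.get(pair, 1.0) for pair, n in errors.items())

# Hypothetical costs file: calling a 'primary' case 'negative' costs 5.
costs = parse_costs("negative, primary: 5\n")

# Invented error counts keyed by (predicted class, true class).
errors = {("negative", "primary"): 3, ("primary", "negative"): 4}

print(total_cost(errors, {}))     # every error counts once
print(total_cost(errors, costs))  # missed 'primary' cases weigh more
```

Comparing the two totals for your own error counts, before and after the rename, shows which kind of mistake the costs file is penalising.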


Close all files in TextPad.

Rename 400-hypothyroid.costs to 400-hypothyroid.cost.


--------------------------


Tutorial #5 - goal: To demonstrate how to use a cases file.

Note that there is no listing for a cases file in the main See5 window.

Perform a classification run with 10 boosting trials on the 400-hypothyroid data set, but this time check the box for Rulesets. Now a rules file appears.


Open the file 400-hypothyroid.cases in TextPad.


Open a DOS prompt and cd to the directory with the data files.


Paste in the following command and run it:

See5Sam -f 400-hypothyroid



The results should all be negative for the disease.


The See5Sam.exe program reads the rules file and cases file into memory, applies the rules to the data and prints the results to screen.


Use search and replace in TextPad to change all instances of negative at the end of every
line to “?”.


Rerun See5Sam -f 400-hypothyroid


Bring the costs file back into play: perform a classification run, recall the See5Sam -f 400-hypothyroid command and run it, modify and save the costs file, run the classification again to make a new rules file, and run See5Sam again.


Use search and replace in TextPad to change all instances of negative at the end of every
line to “?”.


Rerun See5Sam -f 400-hypothyroid


--------------------------


Tutorial #6 - goal: To demonstrate that the result attribute can have several outcomes.


Load soybean.data into See5 and run it.


Notice that there are “result” attributes from (a) through (s).


--------------------------


Tutorial #7 - goal: To demonstrate boosting and how it affects classification error.


Part #1

Email me your error prediction count for evaluation on test data (199 cases) for 6 boosting trials. Use the simple format demonstrated in class for tracking errors and put your prediction in the subject line of the email. Example Subject: 5 trials number,number.


3 trials 6,8

4 trials 6,8

5 trials 4,8



Part #2

Perform classification on the breast-cancer data set with 6 boosting trials. Email your results and, if your prediction did not match the actual results, explain what may have caused the difference from what you expected.