Classification Methods (non-overkill)


[Title slide diagram: Source Data → Variables (encoding, e.g. 0.5, 0.7, …) → Classification Scheme → Models → Predictions, with variable insight as a by-product.]



Overview

- General features of learning problems
- Training, testing and quantifying accuracy
- Choosing a classifier
- Methods:
  - k-Nearest neighbours
  - Classification trees

Learning Problems

Map input variables to categories or quantities.

- CLASSIFICATION: predict qualitative traits (categories), e.g. binds? (Y/N)
- REGRESSION (and related methods): quantitative predictions, e.g. how strong?

Training

Goal: to 'learn' the rules (or fit functions) that distinguish between classes.

What are the properties of a good training set?

- A random sample from the population
- Sufficiently large / representative
- All classes represented

Types of Learning

- SUPERVISED
  - Labelled classes
  - Feedback: information about the labelling is used to train the classifier
- UNSUPERVISED
  - Classes may be labelled or unlabelled
  - The classifier develops the classification scheme independently of class labels

Goal of supervised learning

Minimize error (via, for instance, a loss function) on the training set.

e.g., squared-error loss:

    EPE(f) = E[(Y − f(X))²]

where EPE is the expected prediction error, E denotes expectation, and Y − f(X) is the difference between the actual value (Y) and the prediction f(X).

(Hastie, section 2.4)
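As a concrete illustration, the expectation can be estimated by the mean squared difference over a sample. A minimal sketch in Python, with made-up values:

```python
import numpy as np

# Hypothetical actual values Y and predictions f(X)
y   = np.array([1.0, 0.0, 1.0, 1.0])
f_x = np.array([0.8, 0.2, 0.6, 0.9])

# Empirical estimate of EPE(f) = E[(Y - f(X))^2]
print(np.mean((y - f_x) ** 2))  # 0.0625
```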


- Some methods have closed-form solutions that are globally optimal on the cost function
  - Many statistical methods, e.g. discriminant function analysis, linear regression
- Others must use heuristics (iterative training, greedy approaches)
  - Artificial neural networks
  - Classification trees
  - Support vector machines

Generalization

How well does the classifier handle cases that were not present in the training set?

Data set splitting

Use a fraction of the available cases as the training set, and reserve the remainder for a test set.

[Figure: train classifier on the training set, then test it on the held-out test set.]
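A minimal sketch of such a split, assuming scikit-learn; the feature matrix X and labels y are hypothetical:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 cases, 5 variables, two classes
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)

# Reserve 25% of the cases as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
```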

Cross-validation

Repeated training with different subsets. The cross-validation score is the average performance over all test sets.

[Figure: 5-fold cross-validation, etc.]

Sample the sets at random, but make sure every class is represented! In the two-class case, both the + and − classes must appear in each training set and each test set.
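A sketch of stratified 5-fold cross-validation in scikit-learn (the classifier choice is illustrative, and X, y are the hypothetical data from the splitting sketch above; stratification keeps every class represented in each fold):

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stratified folds preserve class proportions in every training/test split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=cv)

# The cross-validation score: average performance over all test sets
print(scores.mean())
```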

Classification Accuracy

CONFUSION MATRIX for a two-class (positive and negative set) problem:

                 Predicted +            Predicted −
    True +       true positive (tp)     false negative (fn)
    True −       false positive (fp)    true negative (tn)

(May require THRESHOLDING of continuous predictions.)

Quantifying Accuracy

- Sensitivity: accuracy on the positive set
      tp / (tp + fn)

- Specificity: accuracy of all positive predictions
      tp / (tp + fp)
  (this usage is common in bioinformatics; elsewhere this quantity is called precision or positive predictive value, and 'specificity' denotes tn / (tn + fp))

- 'Balanced' accuracy: overall performance on the set, weighted by class size, as computed in the sketch below
      [ tp / (tp + fn) + tn / (tn + fp) ] / 2
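These quantities follow directly from the confusion-matrix counts; a minimal sketch with hypothetical counts:

```python
# Hypothetical two-class confusion-matrix counts
tp, fn, fp, tn = 40, 10, 5, 45

sensitivity = tp / (tp + fn)              # accuracy on the positive set
specificity = tp / (tp + fp)              # as defined on this slide
balanced    = (tp / (tp + fn) + tn / (tn + fp)) / 2
print(sensitivity, specificity, balanced)  # 0.8 0.888... 0.85
```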

ROC Curves

[Figure: ROC curve (Wikimedia Commons).]

- Area under the ROC curve (AUC)

- Matthews Correlation Coefficient:
      MCC = (tp·tn − fp·fn) / √( (tp + fp)(tp + fn)(tn + fp)(tn + fn) )

- Others: see Baldi et al. (2000) in Bioinformatics
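Both statistics are available in scikit-learn; a minimal sketch with made-up labels and scores (AUC works on the raw scores, while MCC needs thresholded class predictions):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, matthews_corrcoef

# Hypothetical true labels and continuous classifier scores
y_true = np.array([1, 1, 0, 1, 0, 0])
scores = np.array([0.9, 0.7, 0.6, 0.4, 0.3, 0.1])

print(roc_auc_score(y_true, scores))            # area under the ROC curve
print(matthews_corrcoef(y_true, scores > 0.5))  # MCC after thresholding at 0.5
```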

Provocative Question

What does it mean if the accuracy on the test set is 60%?

Choosing a Classification Method

No one classifier is best for every classification problem.

What criteria should we consider?

Bias-Variance Tradeoff

Do we want a classifier that is as simple as possible, or one that can make complex decisions?

[Figure: high bias (underfitting) vs. high variance (overfitting); Burges 1997, "A Tutorial on Support Vector Machines for Pattern Recognition"; Hastie, p. 38.]

Interpretability

Some methods yield understandable (or almost understandable) rules, others do not.

[Figure: an interpretable decision tree (Cyclic? → Yes/No → Proline / Glycine) versus a black box: Input → Magic → Output.]

Tractability

If the training data are necessarily high-dimensional, then a simpler classifier may be necessary.

k-Nearest Neighbours

Given an n-dimensional space, map any point p to the class represented by a plurality of its k nearest neighbours from the training set.

(Hastie, section 13.3)

Procedure

[Figure: decision boundaries for 15-NN and 1-NN (a Voronoi tessellation); Hastie, pp. 15-16.]

Training

No training per se, since modelling the decision boundary could be quite complex. Instead, find the labels of the k closest (e.g., by Euclidean distance) training vectors, as in the sketch below.
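A minimal from-scratch sketch of this procedure (Euclidean distance, plurality vote; all names are illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, p, k=3):
    """Label point p by a plurality vote of the k nearest training vectors."""
    dists = np.linalg.norm(X_train - p, axis=1)  # Euclidean distances to p
    nearest = np.argsort(dists)[:k]              # indices of the k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny example: three 'A' points near the origin, one distant 'B'
X_train = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.]])
y_train = np.array(["A", "A", "A", "B"])
print(knn_predict(X_train, y_train, np.array([0.2, 0.2])))  # 'A'
```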

Example: Gene selection from microarrays

[Figure: expression matrix of g genes × s samples (diffuse large B-cell lymphoma patients of two types, with different prognoses); expression levels from low to high. Li et al. (2001) Bioinformatics.]

Approach

- Map all s training vectors (gene expression profiles) into d-dimensional space by subsampling genes
- A test vector is considered classified if its 3 nearest neighbours all represent the same lymphoma type, and otherwise unclassified (see the sketch below)
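A sketch of that unanimity rule, reusing the k-NN machinery above (a reconstruction of the rule as described on this slide, not Li et al.'s code):

```python
def classify_unanimous(X_train, y_train, p, k=3):
    """Assign a label only if all k nearest neighbours agree."""
    dists = np.linalg.norm(X_train - p, axis=1)
    labels = set(y_train[np.argsort(dists)[:k]])
    return labels.pop() if len(labels) == 1 else None  # None = unclassified
```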

Feature Selection

- When g = 4026 and p = 34, we have a few too many variables
- The authors used a genetic algorithm to select subsets of d genes for analysis
- Tried d = { 5, 10, 20, 30, 40, 50 }
- Replicated 250 times to determine which genes are most important

How often are different genes selected for classification by the genetic algorithm?

[Figure: per-gene # of times selected (out of 10,000 replicates), versus the expected # of times selected under random sampling, d/g × 10,000, plus/minus the standard deviation.]
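As a quick sanity check of that expectation, taking d = 50 (the largest value tried) as an example:

$$\frac{d}{g} \times 10{,}000 \;=\; \frac{50}{4026} \times 10{,}000 \;\approx\; 124 \text{ selections per gene}$$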

What is the best classification accuracy that can be achieved? What is the optimal value of d? How many genes do we need?

[Figure: accuracy vs. d; top line: accuracy on the training set, bottom line: accuracy on the test set.]

Modifications

- Try different distance definitions
- Treat different dimensions differently (e.g., normalize), as sketched below
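Both modifications are one-liners in scikit-learn; a sketch (the parameter choices are illustrative):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Manhattan distance instead of Euclidean, plus per-dimension
# standardization so that no single variable dominates the distance
clf = make_pipeline(StandardScaler(),
                    KNeighborsClassifier(n_neighbors=5, metric="manhattan"))
```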

Classification and Regression Trees

Use a series of conditions on predictor variables to split the training cases.

(Hastie, section 9.2)

Procedure

For a given case, follow the deterministic path along the decision tree.

[Figure: a humorous example flowchart (source: www.wanderings.net).]

Training

DecisionNode(cases c, consisting of variables x_ic):

    if (stopping criterion not reached):
        1. Find the variable x_i and threshold t such that optimal
           separation is achieved between cases with different labels
        2. DecisionNode(c : x_i ≤ t)
        3. DecisionNode(c : x_i > t)

A runnable version of this recursion is sketched below.
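A minimal sketch of the recursion (greedy, Gini-based splitting; it assumes X is a numpy feature matrix and y holds small non-negative integer class labels, and all names are illustrative):

```python
import numpy as np
from collections import Counter

def gini(y):
    """Gini impurity of a vector of integer class labels."""
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def build_tree(X, y, depth=0, max_depth=3):
    """Greedily pick the (variable, threshold) pair that minimizes the
    weighted Gini impurity of the two children, then recurse."""
    # Candidate splits: every variable, every threshold that actually
    # separates the cases (each column's maximum is excluded)
    splits = [(i, t) for i in range(X.shape[1])
              for t in np.unique(X[:, i])[:-1]]
    # Stopping criteria: depth limit, pure node, or nothing left to split on
    if depth == max_depth or gini(y) == 0.0 or not splits:
        return {"label": Counter(y).most_common(1)[0][0]}

    def cost(split):
        i, t = split
        le = X[:, i] <= t
        return (le.sum() * gini(y[le]) + (~le).sum() * gini(y[~le])) / len(y)

    i, t = min(splits, key=cost)  # optimal separation
    le = X[:, i] <= t
    return {"var": i, "thr": t,
            "le": build_tree(X[le], y[le], depth + 1, max_depth),
            "gt": build_tree(X[~le], y[~le], depth + 1, max_depth)}
```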

Optimal separation

Minimize node 'impurity':

- Misclassification error: % of cases not belonging to the majority class
- Gini index: Σ_k p_k (1 − p_k)
- Cross-entropy: − Σ_k p_k log p_k

(where p_k is the proportion of cases in class k at the node)

Example: 200 promoters / 200 genes at the root.
- TATA present: 150 promoters / 50 genes
- TATA absent: 50 promoters / 150 genes
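A worked check of the Gini index for this split (the p_k values come from the counts above):

$$
\begin{aligned}
\text{Gini(root)} &= 1 - (0.5^2 + 0.5^2) = 0.5 \\
\text{Gini(each child)} &= 1 - (0.75^2 + 0.25^2) = 0.375 \\
\text{Weighted impurity after split} &= \tfrac{200}{400}\cdot 0.375 + \tfrac{200}{400}\cdot 0.375 = 0.375
\end{aligned}
$$

So the TATA split reduces impurity from 0.5 to 0.375.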


A greedy approach is most common.

Is this guaranteed to find the best solution?
Yes? Of course not!!!

Stopping criteria

- There is no need to subdivide a pure class
- Other criteria (such as a minimum number of cases at a node) might be used as well

Pruning

Once the decision tree has been generated, we can prune away less useful nodes.

Cost-complexity criterion, for a particular tree T:

    C_α(T) = Σ_m N_m Q_m(T) + α |T|

where N_m is the number of cases in node m, Q_m(T) is the penalty term for node m (e.g., Gini score), |T| is the size of the tree, and α is the size penalty (a tuning parameter).

Weakest-link pruning: find the subtree T_α that minimizes the cost function.


[Figure: C_α(T) is the quantity we're trying to minimize; it trades off how wrong the tree is (the error term) against how big it is (the size term). A high α favours small trees with higher error; a low α favours big trees with low error.]


Greedy approach to tree reduction: always remove the weakest link in the tree until the target cost is reached (a pruning sketch follows).
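scikit-learn implements a minimal cost-complexity pruning of this kind; a sketch with hypothetical training data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))
y_train = rng.integers(0, 2, size=200)

# Effective alphas along the weakest-link pruning path of a full tree
tree = DecisionTreeClassifier(random_state=0)
path = tree.cost_complexity_pruning_path(X_train, y_train)
print(path.ccp_alphas)

# Refit with a chosen size penalty: larger alpha, smaller tree
pruned = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2],
                                random_state=0).fit(X_train, y_train)
```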

Rifampin sensitivity of bacterial RNA polymerase

- Rifampin interferes with transcription in Mycobacterium tuberculosis
- However, mutations can arise in certain parts of the rpoB gene that lead to rifampin resistance
- Can we classify rpoB variants based on the amino acids encoded by the gene?

Cummings et al. (2004) BMC Bioinformatics 5:137

- 88.4% correctly classified (10-fold cross-validation)
- Many nearby polymorphic sites (e.g. 511, 512, 515, 521 and 529) were rejected as potentially good classifiers
- 'Resistant': requires > 1 μg/mL for inhibition of M. tuberculosis

Considerations

CARTs are:
- Easy to interpret
- Often bad at generalizing (unstable)

Summary

- Keep bias and variance in mind when choosing a classifier
- Training set accuracy is a useful measure, but tells you nothing about predictions on unseen cases
- The choice of method often involves a trade-off between different desirable properties

START SIMPLE!!!

Implementations

- k-Nearest Neighbours:
  - The R statistical package (www.r-project.org)
  - Weka (http://www.cs.waikato.ac.nz/ml/weka/)
- Classification Trees:
  - As above
  - C4.5 / C5.0 (commercial!)

F I N