# Classification Methods (non-overkill) - cs


[Figure: classification pipeline. Source data → variables (encoding, e.g. (0.5, 0.7, …)) → classification scheme → models → predictions; variable insight falls out along the way]

Overview

General features of learning problems

Training, testing and quantifying accuracy

Choosing a classifier

Methods: k-nearest neighbours, classification trees

Learning Problems

Map input variables to categories or quantities

CLASSIFICATION: predict qualitative traits (categories), e.g. does it bind? (Y/N)

REGRESSION (and related methods): quantitative predictions, e.g. how strongly does it bind?

Training

Goal: to ‘learn’ the rules (or fit functions) that
distinguish between classes

What are the properties of a good training
set?

A random sample from the population

Sufficiently large and representative

All classes represented

Types of Learning

SUPERVISED

Classes are labelled

Feedback: information about the labelling is used to train the classifier

UNSUPERVISED

Classes may be labelled or unlabelled

The classifier develops the classification scheme independently of the class labels

Goal of supervised learning

Minimize error (via, for instance, a loss function) on the training set

e.g., squared error loss: EPE(f) = E[(Y − f(X))²]

Hastie section 2.4

EPE: expected prediction error; E: expectation; Y − f(X): difference between the actual value (Y) and the prediction
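As a numeric illustration (the function and data here are hypothetical, not from the lecture), the expected prediction error under squared error loss can be estimated by averaging (y − f(x))² over observed pairs:

```python
# Toy illustration of squared error loss (hypothetical data).
# EPE(f) is estimated by the mean of (y - f(x))^2 over (x, y) pairs.

def f(x):
    """A hypothetical fitted function: a simple linear rule."""
    return 2.0 * x + 1.0

pairs = [(0.0, 1.2), (1.0, 2.8), (2.0, 5.1), (3.0, 6.9)]  # (x, y) samples

squared_errors = [(y - f(x)) ** 2 for x, y in pairs]
epe_estimate = sum(squared_errors) / len(squared_errors)
print(round(epe_estimate, 4))  # 0.025
```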

Some methods have closed-form solutions that are globally optimal on the cost function

Many statistical methods, e.g. discriminant function analysis, linear regression

Others must use heuristics (iterative training, greedy approaches)

Artificial neural networks

Classification trees

Support vector machines

Generalization

How well does the classifier handle cases that were not present in the training set?

Data set splitting

Use a fraction of the available cases as the training set; reserve the remainder for a test set

Train classifier

Test classifier

Cross-validation

Repeated training with different subsets

The cross-validation score is the average performance over all test sets

5-fold cross-validation, etc.

Sample sets at random, but make sure every class is represented!

In the two-class case, each split has a positive training set, a negative training set, a positive test set, and a negative test set
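A minimal sketch of this stratified splitting (the function name and structure are my own, not from the lecture): each class is partitioned separately, so every fold contains both classes.

```python
import random

def stratified_kfold(labels, k, seed=0):
    """Split case indices into k folds, sampling at random within each
    class so that every class is represented in every fold (when possible)."""
    rng = random.Random(seed)
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)                # sample at random within the class
        for i, idx in enumerate(members):
            folds[i % k].append(idx)        # deal indices round-robin to folds
    return folds

labels = ['+'] * 10 + ['-'] * 10
folds = stratified_kfold(labels, k=5)
# Each fold serves once as the test set; the remaining folds form the training set.
for fold in folds:
    assert any(labels[i] == '+' for i in fold) and any(labels[i] == '-' for i in fold)
```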

Classification Accuracy

CONFUSION MATRIX for a two-class (positive and negative set) problem:

| True \ Predicted | +              | −              |
|------------------|----------------|----------------|
| +                | True positive  | False negative |
| −                | False positive | True negative  |

May require THRESHOLDING of continuous predictions

Quantifying Accuracy

Sensitivity: accuracy on the positive set

tp / (tp + fn)

Specificity: accuracy on the negative set

tn / (tn + fp)

(The accuracy of all positive predictions, tp / (tp + fp), is the precision, or positive predictive value, not the specificity)

'Balanced' accuracy: overall performance on the set, giving each class equal weight regardless of its size

[ tp / (tp + fn) + tn / (tn + fp) ] / 2
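These measures follow directly from the four confusion-matrix counts; a small sketch (the counts used below are made up for illustration):

```python
import math

def metrics(tp, fn, fp, tn):
    """Accuracy measures computed from two-class confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # accuracy on the positive set
    specificity = tn / (tn + fp)          # accuracy on the negative set
    balanced = (sensitivity + specificity) / 2
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # Matthews corr. coeff.
    return sensitivity, specificity, balanced, mcc

# Hypothetical counts: 40 TP, 10 FN, 20 FP, 30 TN
sens, spec, bal, mcc = metrics(tp=40, fn=10, fp=20, tn=30)
```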

ROC Curves

[Figure: ROC curve (Wikimedia Commons); the area under the ROC curve (AUC) summarizes the sensitivity/specificity trade-off across thresholds]
Matthews Correlation Coefficient:

MCC = (tp·tn − fp·fn) / √[(tp + fp)(tp + fn)(tn + fp)(tn + fn)]

Others: see Baldi et al. (2000) in Bioinformatics
Provocative Question

What does it mean if the accuracy on the test set is 60%?

Choosing a Classification Method

No one classifier is best for every classification problem

What criteria should we consider?

Bias vs. variance

Do we want a classifier that is as simple as possible, or one that can make complex decisions?

A simple classifier risks bias; a complex one risks high variance (overfitting)

Burges 1997, "A Tutorial on Support Vector Machines for Pattern Recognition"

Hastie, p. 38

Interpretability

Some methods yield understandable (or almost understandable) rules, others do not

[Figure: an interpretable decision tree (Cyclic? / Proline / Glycine, with Yes/No branches) versus a black box: Input → Magic → Output]

Tractability

If the training data are necessarily high-dimensional, then a simpler classifier may be necessary

k-Nearest Neighbours

Given an n-dimensional space, map any point p to the class represented by a plurality of its k nearest neighbours from the training set

Hastie, Section 13.3

Procedure

[Figure: decision boundaries for 15-NN and for 1-NN (a Voronoi tessellation); Hastie, pp. 15-16]

Training

There is no training per se, since modelling the decision boundary could be quite complex

Instead, find the labels of the k closest training vectors (e.g., by Euclidean distance)
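A minimal k-NN sketch under these assumptions (Euclidean distance, plurality vote; the function name and toy data are my own):

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Label the query point by a plurality vote of its k nearest
    training vectors, using Euclidean distance."""
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy 2-dimensional training set with two classes
train_X = [(0.0, 0.0), (0.1, 0.2), (0.9, 1.0), (1.0, 0.8)]
train_y = ['A', 'A', 'B', 'B']
print(knn_predict(train_X, train_y, (0.2, 0.1), k=3))  # A (wins 2 votes to 1)
```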

Example: Gene selection from microarrays

g genes × s samples (diffuse large B-cell lymphoma patients of two types, with different prognoses)

[Figure: expression-level heat map, from low to high]

Li et al. (2001) Bioinformatics

Approach

Map all s training vectors (gene expression profiles) into d-dimensional space by subsampling genes

A test vector is considered classified if its 3 nearest neighbours all represent the same lymphoma type; otherwise it is unclassified

Feature Selection

With g = 4026 genes and only 34 samples, we have a few too many variables

The authors used a genetic algorithm to select subsets of d genes for analysis

Tried d = { 5, 10, 20, 30, 40, 50 }

Replicated 250 times to determine which genes are most important

How often are different genes selected for classification by the genetic algorithm?

[Figure: number of times each gene is selected (out of 10,000 replicates), compared with the expected count, d/g × 10,000, and its standard deviation]

What is the best classification accuracy that can be achieved? What is the optimal value of d? How many genes do we need?

[Figure: accuracy vs. d; top line: accuracy on the training set; bottom line: accuracy on the test set]

Modifications

Try different distance definitions

Treat different dimensions differently (e.g.,
normalize)


Classification and Regression Trees

Use a series of conditions on predictor variables to split the training cases

Hastie section 9.2

Procedure

For a given case, follow the deterministic path along the decision tree

[Figure: an example flow chart (www.wanderings.net)]

Training

Decision Node (cases c, each consisting of variables x_ic):

If (stopping criterion not reached):

1. Find the variable x_i and threshold t such that optimal separation is achieved between cases with different labels

2. Decision Node (c : x_i ≤ t)

3. Decision Node (c : x_i > t)
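The recursion above can be sketched as follows (a simplified illustration using the Gini index and a pure-node stopping criterion; all names and the toy data are my own, not the lecture's code):

```python
from collections import Counter

def gini(labels):
    """Gini index of a set of class labels: sum_k p_k * (1 - p_k)."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def build_node(cases, labels):
    """Recursively split cases (lists of feature values), choosing the
    variable and threshold that minimize size-weighted child impurity."""
    if len(set(labels)) <= 1:                # stopping criterion: pure node
        return {'leaf': labels[0]}
    best = None
    for i in range(len(cases[0])):           # find variable x_i and threshold t
        for t in sorted({c[i] for c in cases}):
            left = [j for j, c in enumerate(cases) if c[i] <= t]
            right = [j for j, c in enumerate(cases) if c[i] > t]
            if not left or not right:
                continue
            score = (len(left) * gini([labels[j] for j in left])
                     + len(right) * gini([labels[j] for j in right]))
            if best is None or score < best[0]:
                best = (score, i, t, left, right)
    _, i, t, left, right = best
    return {'var': i, 'thr': t,
            'le': build_node([cases[j] for j in left], [labels[j] for j in left]),
            'gt': build_node([cases[j] for j in right], [labels[j] for j in right])}

# One feature separates the two classes cleanly at x_0 <= 2.0
tree = build_node([[1.0], [2.0], [8.0], [9.0]],
                  ['promoter', 'promoter', 'gene', 'gene'])
print(tree['thr'])  # 2.0
```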

Optimal separation

Minimize node 'impurity', where p̂_k is the proportion of cases of class k at the node:

Misclassification error: the fraction of cases not belonging to the majority class, 1 − max_k p̂_k

Gini index: Σ_k p̂_k (1 − p̂_k)

Cross-entropy: − Σ_k p̂_k log p̂_k

Example: a node with 200 promoters/200 genes, split on the TATA box, yields 150 promoters/50 genes (TATA present) and 50 promoters/150 genes (TATA absent)

A greedy approach is most common

Is this guaranteed to find the best solution?

Of course not! A greedy split is only locally optimal
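For the TATA example, the impurity measures can be evaluated before and after the split (a sketch; the 200/200 → 150/50, 50/150 counts are from the slide, the code is my own):

```python
import math

def impurities(counts):
    """Misclassification error, Gini index and cross-entropy for class counts."""
    n = sum(counts)
    ps = [c / n for c in counts]
    miscls = 1 - max(ps)
    gini = sum(p * (1 - p) for p in ps)
    entropy = -sum(p * math.log2(p) for p in ps if p > 0)
    return miscls, gini, entropy

parent = impurities([200, 200])     # 200 promoters / 200 genes
present = impurities([150, 50])     # TATA present
absent = impurities([50, 150])      # TATA absent

print(parent[1])                                    # parent Gini: 0.5
print((200 * present[1] + 200 * absent[1]) / 400)   # after split: 0.375 < 0.5
```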

Stopping criteria

There is no need to subdivide a pure node

Other criteria (such as a minimum number of cases at a node) might be used as well

Pruning

Once the decision tree has been generated, we can prune away the less useful nodes

Cost complexity criterion for a particular tree T:

C_α(T) = Σ_m N_m Q_m(T) + α |T|

where N_m is the number of cases in terminal node m, Q_m is the impurity penalty term (e.g., the Gini score), and α is the size penalty (a tuning parameter)

Find the subtree T_α that minimizes the cost function
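A toy computation of the cost-complexity criterion, reading it as a sum of N_m · Q_m over terminal nodes plus α times the number of leaves (the example trees and numbers below are hypothetical):

```python
def cost_complexity(nodes, alpha):
    """C_alpha(T) = sum over terminal nodes of N_m * Q_m + alpha * |T|,
    where N_m is the number of cases at node m and Q_m its impurity."""
    return sum(n_m * q_m for n_m, q_m in nodes) + alpha * len(nodes)

# Hypothetical terminal nodes as (N_m, Q_m) pairs for two candidate subtrees.
big_tree = [(50, 0.02), (30, 0.0), (40, 0.05), (80, 0.01)]   # low error, 4 leaves
small_tree = [(120, 0.10), (80, 0.08)]                        # high error, 2 leaves

for alpha in (0.5, 10.0):
    big = cost_complexity(big_tree, alpha)
    small = cost_complexity(small_tree, alpha)
    print(alpha, 'prefers', 'big' if big < small else 'small')
```

A small α keeps the big, accurate tree; a large α makes the size term dominate and favours the small tree.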


C_α is the quantity we're trying to minimize: how wrong the tree is (error), plus how big it is (size)

[Figure: cost vs. tree size; a high α favours small trees with high error, a low α favours big trees with low error]

Greedy approach to tree reduction: always remove the weakest link in the tree until the target cost is reached

Rifampin sensitivity of bacterial RNA polymerase

Rifampin interferes with transcription in Mycobacterium tuberculosis

However, mutations in certain parts of the rpoB gene can confer rifampin resistance

Can we classify rpoB variants based on the amino acids encoded by the gene?

Cummings et al. BMC Bioinformatics 2004, 5:137

88.4% correctly classified (10-fold cross-validation)

Many nearby polymorphic sites (e.g. 511, 512, 515, 521 and 529) were rejected as potentially good classifiers

"Resistant": requires > 1 μg/mL for inhibition of M. tuberculosis

Considerations

CARTs are easy to interpret

Summary

Keep bias and variance in mind when choosing
a classifier

Training set accuracy is a useful measure, but
tells you nothing about predictions on unseen
cases

The choice of method often involves a trade-off between different desirable properties

START SIMPLE!!!


Implementations

k-Nearest Neighbours:

The R statistical package (www.r-project.org)

Weka (http://www.cs.waikato.ac.nz/ml/weka/)

Classification Trees:

As above

C4.5 / C5.0 (commercial!)

FIN