Reducing false positives in molecular pattern recognition

Xijin Ge, Shuichi Tsutsumi, Hiroyuki Aburatani & Shuichi Iwata


Genome Science Division

Research Center for Advanced Science and Technology

The University of Tokyo


Towards bedside application of DNA microarrays to cancer treatment


[Diagram: factors on the road to bedside application: hardware (technologies, cost), software (availability, evaluation), knowledge accumulation, needs/market, and testing]

Algorithms for cancer classification

Support vector machine (SVM)
k-nearest neighbor (kNN)
Prototype matching (PM)
Artificial neural networks (ANN)
Weighted voting (WV)
Naive Bayes (NB)
……




Our objective

Evaluate the reliability of existing algorithms
Testing for practical applications
Suggestions for improvement
How to use them?

Major result: SVM & kNN with false positive error rates >50%!


[Figure: false positive and false negative error rates (%) vs. number of predictor genes (0-800) for KNN, SVM, and PM]

How are classifiers tested?

Training: Acute Myeloid Leukemia, AML (N=11); Acute Lymphoblastic Leukemia, ALL (N=27)
Classifier: KNN, SVM, PM, etc.
“Independent” test: 83.3% accuracy

Question: Is this a true measure of reliability?

Implicit assumption: samples must be accurately diagnosed (AML or ALL) prior to classification.
Golub et al., Science, 1999

What happens if we present the classifier with samples that are neither AML nor ALL?

Bedside reality

Diagnosis accuracy
Metastatic cancers
Novel subtypes absent from the training dataset

Is a “complete” training dataset possible?

Important for individual patients
Important for progress in cancer diagnosis





How should classifiers be tested?

Training: AML (N=11); ALL (N=27)
Classifier: KNN, SVM, PM, etc.
“Independent” test
False positives!

“Null” test

Training: Acute Myeloid Leukemia, AML (N=11); Acute Lymphoblastic Leukemia, ALL (N=27)
Classifier: KNN, SVM, PM, etc.
Present “stranger” samples that are neither AML nor ALL: the ideal answer is “no comment”; any AML or ALL call is a false positive.
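A minimal sketch of the null test (not from the original work), assuming a hypothetical classifier whose predict method returns a class label or None when it abstains; the false positive rate is then the fraction of stranger samples that receive any label:

def null_test_false_positive_rate(classifier, stranger_samples):
    """False positive rate on samples that belong to none of the trained
    classes: any class label returned for them is counted as an error."""
    false_positives = sum(
        1 for x in stranger_samples
        if classifier.predict(x) is not None  # hypothetical API: label or None
    )
    return false_positives / len(stranger_samples)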

A benchmark dataset

Training: 11 AML, 19 B-ALL, 8 T-ALL
Independent test (false negative): 14 AML, 19 B-ALL, 1 T-ALL
Null test (false positive): 239 samples (stomach, ovarian, liver, ……)


[Bar chart: number of samples in the training (38), independent test (34), and null test (239) sets, broken down by AML, B-ALL, T-ALL, and none]

CO

GA

LU_A

PA

BL

LU_S

LI

KI

PR

OV

BR

Training

Testing

Leave
-
one
-
class
-
out

cross validation

(LOCO
-
CV)

for the testing of

false positives

Dataset of Su et al.

Cancer Res., 2001
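A sketch of LOCO-CV (hypothetical train_fn/predict_fn callables, not the original implementation): each class is withheld from training in turn, and any label its samples receive at test time counts as a false positive.

import numpy as np

def loco_cv_false_positive_rate(X, y, train_fn, predict_fn):
    """Leave-one-class-out cross validation for the false positive rate.
    X: samples x genes array, y: class labels."""
    X, y = np.asarray(X), np.asarray(y)
    false_pos, total = 0, 0
    for held_out in np.unique(y):
        keep = y != held_out
        model = train_fn(X[keep], y[keep])        # train without the held-out class
        for x in X[~keep]:
            if predict_fn(model, x) is not None:  # a confident label = false positive
                false_pos += 1
        total += int((~keep).sum())
    return false_pos / total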

Algorithm evaluation

Positive test: leave-one-sample-out cross validation → false negative rate
Null test: leave-one-class-out cross validation → false positive rate

“Unsupervised” gene selection

One-vs-all: genes labelled (1,0,0), (0,1,0), (0,0,1)
Cluster-and-select: classification of genes for the classification of samples; data-structure dependent
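For the one-vs-all labelling, one standard gene score is the signal-to-noise statistic of Golub et al. (each class against all the others); whether this deck used exactly that statistic is an assumption, so treat the sketch below as illustrative only.

import numpy as np

def signal_to_noise(expr, y, target_class):
    """One-vs-all score per gene: (mean_in - mean_out) / (sd_in + sd_out).
    expr: samples x genes matrix, y: class labels."""
    y = np.asarray(y)
    inside, outside = expr[y == target_class], expr[y != target_class]
    return (inside.mean(axis=0) - outside.mean(axis=0)) / (
        inside.std(axis=0) + outside.std(axis=0) + 1e-12)

def one_vs_all_selection(expr, y, genes_per_class):
    """Top-ranked genes for each class under the (1,0,0)/(0,1,0)/(0,0,1) labelling."""
    return {c: np.argsort(signal_to_noise(expr, y, c))[::-1][:genes_per_class]
            for c in np.unique(y)}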




Cluster-and-select:
Variation filter
Kruskal-Wallis H test
Divide genes into M clusters (e.g. K-means clustering)
Select the S genes with the highest H value from each cluster

[Diagram: gene label patterns. One-vs-All yields (1,0,0), (0,1,0), (0,0,1) for AML, B-ALL, and T-ALL, while Cluster-and-select also admits genes shared across classes, e.g. (1,0,1), (1,1,0), (0,1,1)]
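A sketch of cluster-and-select, assuming the variation filter has already been applied and that K-means (named only as an example on the slide) is used for the clustering step; scipy's kruskal and scikit-learn's KMeans stand in for whatever the original code used.

import numpy as np
from scipy.stats import kruskal
from sklearn.cluster import KMeans

def cluster_and_select(expr, y, n_clusters, genes_per_cluster):
    """Cluster-and-select gene selection (sketch).
    expr: samples x genes matrix (variation-filtered), y: class labels."""
    y = np.asarray(y)
    classes = np.unique(y)
    # Kruskal-Wallis H statistic of each gene across the classes
    h = np.array([kruskal(*[expr[y == c, g] for c in classes]).statistic
                  for g in range(expr.shape[1])])
    # Cluster the genes (columns of expr) into M groups by expression profile
    membership = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(expr.T)
    # Keep the S genes with the highest H value from each cluster
    selected = []
    for m in range(n_clusters):
        members = np.where(membership == m)[0]
        selected.extend(members[np.argsort(h[members])[::-1][:genes_per_cluster]])
    return sorted(int(g) for g in selected)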

Brief description of algorithms

Support vector machine (SVM)
k-nearest neighbor (kNN)
Prototype matching (PM)
Artificial neural networks (ANN)
Weighted voting (WV)
Naive Bayes (NB)
……



Support Vector Machine (SVM)

Multiple binary classifiers
Code: SvmFu (Ryan Rifkin: www.ai.mit.edu/projects/cbcl/)
No Prediction Zone
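The original work used SvmFu; as a hypothetical illustration of "multiple binary classifiers" with a no-prediction zone, the same idea can be sketched with scikit-learn's LinearSVC: one binary SVM per class, and a sample is rejected when no class scores positive (the exact rejection rule in SvmFu may differ).

import numpy as np
from sklearn.svm import LinearSVC

class OneVsAllSVM:
    """One binary SVM per class with a simple no-prediction zone (sketch)."""

    def fit(self, X, y):
        y = np.asarray(y)
        self.models_ = {c: LinearSVC().fit(X, (y == c).astype(int))
                        for c in np.unique(y)}
        return self

    def predict(self, x):
        x = np.asarray(x).reshape(1, -1)
        scores = {c: m.decision_function(x)[0] for c, m in self.models_.items()}
        best = max(scores, key=scores.get)
        return best if scores[best] > 0 else None  # None = no prediction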

K-nearest neighbor (KNN)

Case-based reasoning
Simplicity


Prototype Matching

[Diagram: a sample and a class centroid in gene space (G1, G2, G3), separated by distance d1]
Nearest-centroid classification
“PAM”, Tibshirani et al., PNAS, 2002

Prototype Matching: modifications

[Diagram: distances d1 and d2 to the two closest centroids in gene space (G1, G2, G3)]
Confidence criterion: Pearson correlation > 0.2
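A sketch of the modified prototype matching: class prototypes are the training centroids, the nearest prototype wins, and the call is kept only when the Pearson correlation with the winning prototype exceeds 0.2 (the confidence criterion above); otherwise no prediction is made. The distance and margin refinements shown in the supplementary slides are not reproduced here.

import numpy as np
from scipy.stats import pearsonr

def fit_prototypes(expr, y):
    """Class prototype = mean expression profile (centroid) of that class."""
    y = np.asarray(y)
    return {c: expr[y == c].mean(axis=0) for c in np.unique(y)}

def pm_predict(prototypes, x, min_corr=0.2):
    """Nearest-centroid call, kept only if it passes the confidence criterion."""
    best = min(prototypes, key=lambda c: np.linalg.norm(x - prototypes[c]))
    r, _ = pearsonr(x, prototypes[best])
    return best if r > min_corr else None  # None = "no comment"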

Results: Comparison of classification algorithms

[Figure: false positive and false negative error rates (%) vs. number of predictor genes (0-800) for KNN, SVM, and PM]

Results: Comparison of gene selection methods

[Figure: false positive error (%) vs. number of predictor genes (0-800) under one-vs-all vs. cluster-and-select gene selection, shown for PM and for SVM]
Different feature sets for different algorithms.

Evaluation pipeline:
Feature selection (global filter): variation filter; Kruskal-Wallis H test
Feature selection (redundancy reduction): divide genes into M clusters (e.g. K-means clustering); select the S genes with the highest H value from each cluster
Pattern recognition: PM
Verification: positive test (leave-one-sample-out cross validation); null test / outlier detection (leave-one-class-out cross validation)

Results on other datasets (Cluster-and-select + PM)

Leave-one-class-out cross validation

Discussion (1)

Why do we see such a big difference?

Two strategies of classification:
A. Prototype matching (PM): uniqueness; multi-class problems (e.g. tumor origin)
B. Support vector machine (SVM): differences; binary problems (e.g. metastasis vs. non-metastasis)

Discussion (2)

How many genes should we use?

[Figure: false positive and false negative error rates (%) vs. number of predictor genes (0-800) for KNN, SVM, and PM]

Discussion (3)

Which algorithm should we use?

[Table: KNN, SVM, and PM compared on false positives (specificity) and false negatives (sensitivity)]

Don’t fall in love with SVM!
Focus on the problem and always try other methods!

Gene prediction vs. tumor classification

Gene prediction:
Algorithm Development: FGENES, GeneMark, Genie, MZEF, Morgan, Genescan, HMMgene
Algorithm Evaluation: Burset & Guigo, Genomics, 1996
“Meta-algorithm”: Murakami & Takagi, Bioinformatics, 1998; Rogic et al., Bioinformatics, 2002; Shah et al., Bioinformatics, 2003 (GeneComber)

Tumor classification:
Algorithm Development: SVM, PM, kNN, NB, WV
Algorithm Evaluation: Dudoit et al., JASA, 2002; Liu et al., GIW2002
“Meta-algorithm”


Conclusions

A benchmark dataset to evaluate algorithms: “null test” & “leave-one-class-out” cross validation.

High false positives (>50%) for KNN & SVM with small feature sets.

PM can be modified to achieve high specificity (~90%).

A “cluster-and-select” gene selection procedure.


Thanks to:

Hiroyuki Aburatani

Shuichi Tsutsumi

Shogo Yamamoto

Shingo Tsuji

Kunihiro Nishimura

Daisuke Komura

Makoto Kano

Shigeo Ihara

Naoko Nishikawa


Shuichi Iwata

Naohiro Shichijo

Jerome Piat


Todd R. Golub


Qing Guo


Jiang Fu


GIW reviewers

Yoshitaka Hippo

Shumpei Ishikawa

Akitake Mukasa

Yongxin Chen

Yingqiu Guo

Other lab members






Supplementary information

(Benchmark datasets, source code, …)
www2.genome.rcast.u-tokyo.ac.jp/pm/

Additional data
Xijin Ge


K-nearest neighbor (KNN)

Threshold: >80% consistency
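A sketch of kNN with the consistency threshold mentioned here: a label is returned only when more than 80% of the k nearest training neighbours agree; k=5 is an arbitrary illustrative choice, not a value taken from the slides.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5, consistency=0.8):
    """kNN that abstains unless the majority class covers more than
    `consistency` of the k nearest neighbours."""
    d = np.linalg.norm(np.asarray(X_train) - np.asarray(x), axis=1)
    nearest = np.asarray(y_train)[np.argsort(d)[:k]]
    label, count = Counter(nearest.tolist()).most_common(1)[0]
    return label if count / k > consistency else None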


Prototype Matching: modifications

[Diagram: distances d1 and d2 to the two closest centroids in gene space (G1, G2, G3)]
Confidence: Pearson correlation > 0.2

[Scatter plot: margin over the second closest match (%) vs. distance to the closest match (d/r) for training samples in LOOCV, positive test samples, and null test samples, with the confidence criterion marked]
Lymphoma dataset (Alizadeh et al., Nature, 2000): CLL, FL, and DLBCL samples

SRBCT dataset (Khan et al., Nature Medicine, 2001): Burkitt lymphomas (BL), neuroblastoma (NB), Ewing family of tumors (EWS), rhabdomyosarcoma (RMS)

[Figure: false positive and false negative results by tissue type (CO, GA, LU_A, PA, BL, LU_S, LI, KI, PR, OV, BR) for the Su et al., Cancer Res., 2001 dataset]