A Significance Test-Based Feature Selection Method for the Detection of Prostate Cancer from Proteomic Patterns
M.A.Sc. Candidate: Qianren (Tim) Xu
Supervisors: Dr. M. Kamel, Dr. M. M. A. Salama
2
Highlight
Significance Test-Based Feature Selection (STFS):
• STFS can be used generally for any supervised pattern recognition problem
• Very good performance has been obtained on several benchmark datasets, especially those with a large number of features
Proteomic Pattern Analysis for Prostate Cancer Detection (STFS, neural networks, ROC analysis):
• Sensitivity 97.1%, specificity 96.8%
• Suggestion of mistaken labelling by prostatic biopsy
3
Outline of Part I
Significance Test-Based Feature Selection (STFS) for Supervised Pattern Recognition
• Introduction
• Methodology
• Experimental Results on Benchmark Datasets
• Comparison with MIFS
4
Introduction
Problems with features:
• Large number
• Irrelevance
• Noise
• Correlation
These increase computational complexity and reduce the recognition rate.
5
Mutual Information Feature Selection (MIFS)
• One of the most important heuristic feature selection methods; it can be very useful in any classification system
• But estimation of the mutual information is difficult with:
• A large number of features and a large number of classes
• Continuous data
6
Problems with Feature Selection Methods
Two key issues:
• Computational complexity
• Suboptimality
7
Proposed Method
Criterion of feature selection:
Significance of feature = Significant difference x Independence
• Significant difference: pattern separability of the individual candidate feature
• Independence: non-correlation between the candidate feature and the already-selected features
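The criterion above can be written as a one-line product (a minimal sketch; the numeric scores are hypothetical placeholders for whichever test statistic and correlation measure the data type calls for):

```python
def feature_significance(sd, ind):
    """sf = sd * ind: separability of the candidate feature,
    discounted by its correlation with already-selected features."""
    return sd * ind

sf = feature_significance(4.2, 0.75)   # hypothetical scores
```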
8
Measurement of Pattern Separability of Individual Features
Statistical significant difference, by data type and number of classes:
• Continuous data with normal distribution: t-test (two classes), ANOVA (more than two classes)
• Continuous data with non-normal distribution, or rank data: Mann-Whitney test (two classes), Kruskal-Wallis test (more than two classes)
• Categorical data: Chi-square test
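The choice of test can be sketched as a small dispatcher over `scipy.stats` (an illustration, not the talk's code; the `dtype` labels are assumptions):

```python
from scipy import stats

def significant_difference(groups, dtype="normal"):
    """Return a p-value using the test from the slide's table.

    dtype is 'normal' (continuous, normally distributed), 'nonnormal'
    (continuous non-normal or rank data), or 'categorical' (here
    `groups` is a contingency table: one row of counts per class).
    """
    if dtype == "normal":
        if len(groups) == 2:
            _, p = stats.ttest_ind(*groups)            # t-test
        else:
            _, p = stats.f_oneway(*groups)             # ANOVA
    elif dtype == "nonnormal":
        if len(groups) == 2:
            _, p = stats.mannwhitneyu(*groups)         # Mann-Whitney test
        else:
            _, p = stats.kruskal(*groups)              # Kruskal-Wallis test
    else:
        _, p, _, _ = stats.chi2_contingency(groups)    # Chi-square test
    return p

# hypothetical two-class continuous samples
p_t = significant_difference([[1.0, 1.2, 0.9, 1.1], [2.0, 2.1, 1.9, 2.2]])
p_chi = significant_difference([[10, 20], [20, 10]], dtype="categorical")
```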
9
Independence
• Continuous data with normal distribution: Pearson correlation
• Continuous data with non-normal distribution, or rank data: Spearman rank correlation
• Categorical data: Pearson contingency coefficient
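The independence measures can likewise be sketched with `scipy.stats`; turning each association into "1 - |association|" is an assumption consistent with using independence as a multiplicative discount:

```python
from scipy import stats

def independence(x, y=None, dtype="normal"):
    """Return 1 - |association|, per the slide's table of measures.
    For 'categorical', pass a contingency table as `x` (y is unused)."""
    if dtype == "normal":
        r, _ = stats.pearsonr(x, y)          # Pearson correlation
    elif dtype == "nonnormal":
        r, _ = stats.spearmanr(x, y)         # Spearman rank correlation
    else:
        chi2, _, _, _ = stats.chi2_contingency(x)
        n = sum(sum(row) for row in x)
        r = (chi2 / (chi2 + n)) ** 0.5       # Pearson contingency coefficient
    return 1.0 - abs(r)

ind_dup = independence([1, 2, 3, 4], [2, 4, 6, 8])               # duplicate info
ind_rank = independence([1, 2, 3, 4], [1, 3, 2, 4], "nonnormal")
```

A perfectly correlated pair gets independence 0, so a duplicate feature contributes no feature significance.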
10
Selection Procedure
• MSDI: Maximum Significant Difference and Independence algorithm
• MIC: Monotonically Increasing Curve strategy
11
Maximum Significant Difference and Independence (MSDI) Algorithm
1. Compute the significant difference (sd) of every initial feature
2. Select the feature with maximum sd as the first feature
3. Compute the independence level (ind) between every candidate feature and the already-selected feature(s)
4. Select the feature with maximum feature significance (sf = sd x ind) as the next feature
5. Repeat steps 3-4 until the desired number of features is selected
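The MSDI steps can be sketched as a greedy loop (an illustration for the two-class, normally distributed case: sd from a two-sample t-test, ind from Pearson correlation; aggregating ind with `min` over the selected set is an assumption):

```python
import numpy as np
from scipy import stats

def msdi(X, y, k):
    """Greedy MSDI sketch for two classes: sd = |t| from a two-sample
    t-test, ind = 1 - |Pearson r| against the already-selected features
    (aggregated here with min, an assumption), sf = sd * ind."""
    c0, c1 = np.unique(y)
    sd = np.array([abs(stats.ttest_ind(X[y == c0, j], X[y == c1, j]).statistic)
                   for j in range(X.shape[1])])
    selected = [int(np.argmax(sd))]          # step 2: max sd first
    while len(selected) < k:
        best_j, best_sf = -1, -1.0
        for j in range(X.shape[1]):
            if j in selected:
                continue
            ind = min(1.0 - abs(stats.pearsonr(X[:, j], X[:, s])[0])
                      for s in selected)     # step 3: independence level
            if sd[j] * ind > best_sf:        # step 4: max sf = sd * ind
                best_j, best_sf = j, sd[j] * ind
        selected.append(best_j)
    return selected

# hypothetical data: f0 separates the classes, f1 duplicates f0, f2 is noise
rng = np.random.default_rng(1)
y = np.array([0] * 20 + [1] * 20)
f0 = np.concatenate([rng.normal(0, 1, 20), rng.normal(3, 1, 20)])
X = np.column_stack([f0, f0 + rng.normal(0, 0.01, 40), rng.normal(0, 1, 40)])
chosen = msdi(X, y, 2)
```

The near-duplicate feature is skipped in favour of the noise feature, because its independence level, and hence its feature significance, is close to zero.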
12
Monotonically Increasing Curve (MIC) Strategy
[Figure: performance curve of recognition rate vs. number of features for the feature subset selected by MSDI]
1. Plot the performance curve
2. Delete the features that make no useful contribution to increasing the recognition rate
3. Repeat until the curve is monotonically increasing
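A single pass of the deletion step can be sketched as follows (a simplification: the strategy as stated re-plots the curve after each deletion, whereas this walks the original curve once):

```python
def mic_prune(ranked_features, curve):
    """One-pass MIC sketch: walk the MSDI-ranked features with their
    cumulative recognition rates and drop every feature whose addition
    does not raise the rate, leaving a monotonically increasing curve."""
    kept, best = [], -1.0
    for feat, rate in zip(ranked_features, curve):
        if rate > best:          # the feature contributes: keep it
            kept.append(feat)
            best = rate
    return kept

# hypothetical ranking and cumulative recognition rates
kept = mic_prune([4, 7, 1, 9, 3], [0.60, 0.72, 0.71, 0.80, 0.80])
```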
13
Example I: Handwritten Digit Recognition
• 32-by-32 bitmaps are divided into 8 x 8 = 64 blocks
• The pixels in each block are counted
• Thus an 8 x 8 matrix is generated, i.e., 64 features
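The feature extraction above can be sketched in a few lines of NumPy (assuming a binary bitmap and 4x4-pixel blocks, which is what 32/8 implies):

```python
import numpy as np

def block_counts(bitmap):
    """Divide a 32x32 bitmap into 8x8 = 64 blocks of 4x4 pixels and
    count the on-pixels in each block, giving the 64 features."""
    b = np.asarray(bitmap).reshape(8, 4, 8, 4)   # (block_row, r, block_col, c)
    return b.sum(axis=(1, 3)).ravel()            # 64 counts, each in 0..16

features = block_counts(np.ones((32, 32), dtype=int))
```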
14
Performance Curve
[Figure: recognition rate vs. number of features for MSDI and Battiti's MIFS (β = 0.2, 0.4, 0.6, 0.8, 1.0), with random ranking as the baseline]
• With Battiti's MIFS, β still needs to be determined
• MSDI: Maximum Significant Difference and Independence
• MIFS: Mutual Information Feature Selector
15
Computational Complexity
Selecting 15 features from the 64-feature original set:
• MSDI: 24 seconds
• Battiti's MIFS: 1110 seconds (5 values of β searched in the range 0-1)
16
Example II: Handwritten Digit Recognition
The 649 features are distributed over the following six feature sets:
• 76 Fourier coefficients of the character shapes
• 216 profile correlations
• 64 Karhunen-Loève coefficients
• 240 pixel averages in 2 x 3 windows
• 47 Zernike moments
• 6 morphological features
17
Performance Curve
[Figure: recognition rate vs. number of features for MSDI + MIC, MSDI alone, and random ranking]
• MSDI: Maximum Significant Difference and Independence
• MIC: Monotonically Increasing Curve
18
Comparison with MIFS
[Figure: recognition rate vs. number of features for MSDI and MIFS (β = 0.2, 0.5)]
• MSDI is much better with a large number of features
• MIFS is better with a small number of features
• MSDI: Maximum Significant Difference and Independence
• MIFS: Mutual Information Feature Selector
19
Summary: Comparing MSDI with MIFS
• MSDI is much more computationally efficient:
• MIFS needs to calculate the pdfs, and even the computationally efficient criterion (Battiti's MIFS) still needs β to be determined
• MSDI involves only simple statistical calculations
• MSDI can select a more nearly optimal feature subset from a large number of features, because it is based on the relevant statistical models
• MIFS is more suitable for small volumes of data and small feature subsets
20
Outline of Part II
Mass Spectrometry-Based Proteomic Pattern Analysis for Detection of Prostate Cancer
• Problem Statement
• Methods
• Feature Selection
• Classification
• Optimization
• Results and Discussion
21
Problem Statement
1. Very large number of features (15154 points per mass spectrum)
2. Electronic and chemical noise
3. Biological variability of human disease
4. Little knowledge of the proteomic mass spectrum
22
The System of Proteomic Pattern Analysis
Training dataset (initial features > 10^4)
→ Most significant features selected by STFS
→ RBFNN / PNN learning
→ Trained neural classifier
→ Optimization of the feature-subset size and the classifier parameters by minimizing the ROC distance
→ Mature classifier
• STFS: Significance Test-Based Feature Selection
• PNN: Probabilistic Neural Network
• RBFNN: Radial Basis Function Neural Network
23
Feature Selection: STFS
Significance of feature = Significant difference x Independence
• Significant difference: Student's t-test
• Independence: Pearson correlation
• Selection: MSDI algorithm, then the MIC strategy
• STFS: Significance Test-Based Feature Selection
• MSDI: Maximum Significant Difference and Independence Algorithm
• MIC: Monotonically Increasing Curve Strategy
24
Classification: PNN / RBFNN
[Figure: network diagrams with inputs x1 ... xn, class pools with summation units S1 and S2, and outputs y(1), y(2), y]
• PNN is a standard structure with four layers
• RBFNN is a modified four-layer structure
• PNN: Probabilistic Neural Network
• RBFNN: Radial Basis Function Neural Network
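The PNN's pool-and-decide structure can be sketched in a few lines (a minimal illustration, not the talk's network: one Gaussian kernel per training pattern, one pool per class, decision by the larger pool; the data are hypothetical):

```python
import numpy as np

def pnn_predict(X_train, y_train, x, sigma=1.0):
    """Minimal PNN sketch: each class pool averages Gaussian kernel
    activations over its training patterns; the decision layer picks
    the class with the larger pool. sigma is the Gaussian spread."""
    pools = {}
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]
        d2 = ((Xc - x) ** 2).sum(axis=1)                 # squared distances
        pools[c] = np.exp(-d2 / (2.0 * sigma ** 2)).mean()
    return max(pools, key=pools.get)

# hypothetical 1-D two-class training set
Xtr = np.array([[0.0], [0.2], [1.8], [2.0]])
ytr = np.array([0, 0, 1, 1])
pred = pnn_predict(Xtr, ytr, np.array([0.1]), sigma=0.5)
```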
25
Optimization: ROC Distance
Minimizing the ROC distance d_ROC to optimize:
• Feature subset size m
• Gaussian spread σ
• RBFNN pattern decision weight λ
[Figure: ROC plot of true positive rate (sensitivity) vs. false positive rate (1 - specificity), with d_ROC measured from the ideal corner]
ROC: Receiver Operating Characteristic
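The minimized quantity can be sketched as the common corner-distance form of the ROC distance (an assumption: the talk does not spell out its exact formula):

```python
import math

def roc_distance(sensitivity, specificity):
    """Distance from the operating point to the ideal ROC corner
    (false positive rate 0, true positive rate 1); assumed here to be
    the quantity minimized during optimization."""
    return math.hypot(1.0 - sensitivity, 1.0 - specificity)

d = roc_distance(0.971, 0.968)   # the talk's reported operating point
```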
26
Results: Sensitivity and Specificity

                  Sensitivity   Specificity
Our results       97.1%         96.8%
Petricoin (2002)  94.7%         75.9%
DRE               55-68%        6-33%
PSA               29-80%        -
27
Pattern Distribution
[Figure: histograms of RBFNN output values for the samples labelled non-cancer and cancer by biopsies, with the cut-point marked]
• True positive 97.1%, false negative 2.9%
• True negative 96.8%, false positive 3.2%
28
Possible Causes of the Unrecognizable Samples
1. The classifier's algorithm is not able to recognize all the samples
2. The proteomics is not able to provide enough information
3. Prostatic biopsies mistakenly label the cancer
29
Possibility of Mistaken Diagnosis by Prostatic Biopsy
[Figure: the output-value histograms with the cut-point, partitioned into true non-cancer, false non-cancer, false cancer, and true cancer]
• Biopsy has limited sensitivity and specificity
• The proteomic classifier has very high sensitivity and specificity, correlated with biopsy
• The results of the proteomic classifier are not exactly the same as those of biopsy
• All unrecognizable samples are outliers
31
Summary (1)
Significance Test-Based Feature Selection (STFS):
• STFS selects features by maximum significant difference and independence (MSDI); it aims to determine the minimum possible feature subset that achieves the maximum recognition rate
• Feature significance (the selection criterion) is estimated from the optimal statistical models in accordance with the properties of the data
• Advantages: computational efficiency and optimality
32
Summary (2)
Proteomic Pattern Analysis for Detection of Prostate Cancer:
• The system consists of three parts: feature selection by STFS, classification by PNN/RBFNN, and optimization and evaluation by minimum ROC distance
• With sensitivity 97.1% and specificity 96.8%, it would be an asset for early and accurate detection of prostate cancer, and could spare a large number of aging men unnecessary prostatic biopsies
• The suggestion of mistaken labelling by prostatic biopsy through pattern analysis may lead to a novel direction in diagnostic research on prostate cancer
33
Thanks for your time
Questions?