Bios 760R, Lecture 1

Overview:
Overview of the course
Classification
The "curse of dimensionality" problem is exacerbated in bioinformatics applications
Reminder of some background knowledge

Tianwei Yu
Room 330
tyu8@sph.emory.edu
Friday 9 am ~ 11 pm
Room P51
Overview

Focus of the course:
Classification
Clustering
Dimension reduction

Schedule:
Jan 16   Overview
Jan 23   Bayesian decision theory
Jan 30   Density estimation
Feb 6    Linear machine
Feb 13   Support vector machine
Feb 20   Tree and forest
Feb 27   Bump hunting
Mar 6    Boosting
Mar 13   No class
Mar 20   Generalization of models
Mar 27   Similarity measures, hierarchical clustering, subspace clustering
Apr 3    Center-based & model-based clustering
Apr 10   PCA, SIR
Apr 17   ICA, PLSDA
Apr 24   In-class presentation
Overview

References:
Textbook:
The Elements of Statistical Learning. Hastie, Tibshirani & Friedman.
Other references:
Pattern Classification. Duda, Hart & Stork.
Data Clustering: Theory, Algorithms and Applications. Gan, Ma & Wu.
Applied Multivariate Statistical Analysis. Johnson & Wichern.

Evaluation:
In-class presentation of a research paper selected by the instructor.
Overview

Machine Learning / Data Mining:
Supervised learning ("direct data mining"): classification, estimation, prediction.
Unsupervised learning ("indirect data mining"): clustering, association rules, description, dimension reduction and visualization.
Semi-supervised learning falls between the two.

(Modified from Figure 1.1 of Data Clustering by Gan, Ma and Wu.)
Overview

In supervised learning, the problem is well-defined: given a set of observations {x_i, y_i}, estimate the density Pr(Y | X).

Usually the goal is to find the location parameter that minimizes the expected error at each x:

    f(x) = argmin_θ E_{Y|X} [ L(Y, θ) ]

Objective criteria exist to measure the success of a supervised learning mechanism: e.g., the error rate on testing (or cross-validation) data.
Overview

In unsupervised learning there is no output variable; all we observe is a set {x_i}. The goal is to infer Pr(X) and/or some of its properties.

When the dimension is low, nonparametric density estimation is possible.
When the dimension is high, we may need to find simple properties without density estimation, or apply strong assumptions to estimate the density.

There is no objective criterion to evaluate the outcome:
Heuristic arguments are used to motivate the methods;
A reasonable explanation of the outcome is expected from the subject-matter field of study.
Classification

The general scheme. An example.

In most cases, a single feature is not enough to generate a good classifier.

Two extremes: overly rigid and overly flexible classifiers.

Goal: an optimal trade-off between model simplicity and training set performance. This is similar to AIC / BIC model selection in regression.

An example of the overall scheme involving classification.

A classification project: a systematic view.
Curse of Dimensionality

Bellman R.E., 1961.

In p dimensions, to get a hypercube with volume r (a fraction of the unit cube), the edge length needed is r^(1/p).

In 10 dimensions, to capture 1% of the data to get a local average, we need 63% of the range of each input variable (0.01^(1/10) ≈ 0.63).
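The edge-length calculation above can be checked directly; this small sketch prints the edge length needed to capture 1% of a uniform sample in 1, 2, and 10 dimensions:

```python
# Edge length of a sub-cube capturing a fraction r of data distributed
# uniformly in the unit hypercube in p dimensions: r**(1/p).
def edge_length(r, p):
    return r ** (1.0 / p)

for p in (1, 2, 10):
    print(p, round(edge_length(0.01, p), 3))   # p = 10 gives 0.631
```

In 10 dimensions the "local" neighborhood spans 63% of each axis, so it is not local at all.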
Curse of Dimensionality

In other words:
To get a "dense" sample, if we need N = 100 samples in 1 dimension, then we need N = 100^10 samples in 10 dimensions.
In high dimensions the data are always sparse and do not support density estimation.
More data points are close to the boundary than to any other data point, so prediction is much harder near the edge of the training sample.
Curse of Dimensionality

Just a reminder: the expected prediction error contains variance and bias components.

Under the model Y = f(X) + ε, with E(ε) = 0 and Var(ε) = σ²:

    EPE(x_0) = E[(Y − f̂(x_0))² | X = x_0]
             = σ² + [E(f̂(x_0)) − f(x_0)]² + E[(f̂(x_0) − E(f̂(x_0)))²]
             = σ² + Bias²(f̂(x_0)) + Var(f̂(x_0))
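The decomposition can be seen numerically. The following sketch uses an illustrative setup (f(x) = x², a least-squares line as the biased estimator f̂; all choices are made up for the demonstration) and estimates the bias and variance at a point x_0 over many simulated training sets:

```python
# Monte Carlo illustration of EPE(x0) = sigma^2 + Bias^2 + Var
# under Y = f(X) + eps, with f(x) = x^2 and a misspecified linear fit.
import random

random.seed(0)
sigma = 0.5
x0 = 0.8
f = lambda x: x * x

def fit_line(n=30):
    """Fit a least-squares line to a fresh training set; predict at x0."""
    xs = [random.uniform(-1, 1) for _ in range(n)]
    ys = [f(x) + random.gauss(0, sigma) for x in xs]
    mx = sum(xs) / n
    my = sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a + b * x0

preds = [fit_line() for _ in range(5000)]
mean_pred = sum(preds) / len(preds)
var = sum((p - mean_pred) ** 2 for p in preds) / len(preds)
bias2 = (mean_pred - f(x0)) ** 2
epe = sigma ** 2 + bias2 + var   # the three components of the expected prediction error
print(bias2, var, epe)
```

Because the linear model cannot represent x², the bias term stays bounded away from zero no matter how many training sets are averaged.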
Curse of Dimensionality

We have talked about the curse of dimensionality in the sense of density estimation. In a classification problem, we do not necessarily need density estimation:

Generative models care about the class density functions.
Discriminative models care about the boundary.

Example: classifying belt fish and carp. Looking at the length/width ratio is enough. Why should we care how many teeth each kind of fish has, or what shape fins they have?
High-throughput biological data

We talk about the "curse of dimensionality" when N is not >>> p. In bioinformatics, usually N < 100 and p > 1000.

How to deal with this N << p issue? Dramatically reduce p before model-building. Filter genes based on:
variation;
normal/disease test statistics;
projection;
functional groups / network modules;
......

For the most part of this course, we will pretend the N << p issue doesn't exist. Some methods claim to be resistant to this.
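The first filtering option above, keeping only the most variable genes, can be sketched in a few lines. The expression matrix, the number of genes retained, and the planted "variable" genes are all made up for illustration:

```python
# A minimal sketch of variance-based gene filtering: keep the top-k
# most variable rows (genes) of an expression matrix before model-building.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))        # p = 1000 genes, N = 50 samples
X[:5] *= 10                            # plant 5 highly variable genes

k = 5
variances = X.var(axis=1)              # per-gene variance across samples
top = np.argsort(variances)[::-1][:k]  # indices of the k most variable genes
X_reduced = X[top]
print(sorted(int(i) for i in top), X_reduced.shape)
```

The same pattern works with any of the other filters listed: replace the per-gene variance with a test statistic or a projection score.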
Reminder of some results for random vectors

The covariance matrix (not limited to the Gaussian):

    Σ = E[(X − μ)(X − μ)']
      = E[ (X_1 − μ_1, X_2 − μ_2, ..., X_p − μ_p)' (X_1 − μ_1, X_2 − μ_2, ..., X_p − μ_p) ]
      = [ σ_11  σ_12  ...  σ_1p
          σ_21  σ_22  ...  σ_2p
          ...
          σ_p1  σ_p2  ...  σ_pp ]

The multivariate Gaussian distribution:

    f(x) = (2π)^(−p/2) |Σ|^(−1/2) exp( −(x − μ)' Σ⁻¹ (x − μ) / 2 )

* The Gaussian is fully defined by the mean vector and covariance matrix (first and second moments).
Reminder of some results for random vectors

The correlation matrix:

    ρ = [ ρ_11  ρ_12  ...  ρ_1p
          ρ_21  ρ_22  ...  ρ_2p
          ...
          ρ_p1  ρ_p2  ...  ρ_pp ],    ρ_ik = σ_ik / √(σ_ii σ_kk)

Relationship with the covariance matrix: let V^(1/2) = diag(√σ_ii). Then

    Σ = V^(1/2) ρ V^(1/2)
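Inverting the relationship above gives ρ = V^(−1/2) Σ V^(−1/2), which is how one converts a covariance matrix to a correlation matrix in practice. A small check with an illustrative 2×2 covariance matrix:

```python
# Covariance -> correlation via rho = V^(-1/2) Sigma V^(-1/2),
# where V^(1/2) = diag(sqrt(sigma_ii)).
import numpy as np

Sigma = np.array([[4.0, 2.0],
                  [2.0, 9.0]])
V_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(Sigma)))
rho = V_inv_sqrt @ Sigma @ V_inv_sqrt
print(rho)   # off-diagonal: 2 / sqrt(4 * 9) = 1/3
```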
Reminder of some results for random vectors

A linear combination of the elements of a random vector:

    c'X = c_1 X_1 + ... + c_p X_p
    E(c'X) = c'μ
    Var(c'X) = c'Σc

Quadratic form:

    x'Ax = Σ_{i=1}^{k} Σ_{j=1}^{k} a_ij x_i x_j

2-D example:

    Var(aX_1 + bX_2) = E[(aX_1 + bX_2) − (aμ_1 + bμ_2)]²
                     = E[a(X_1 − μ_1) + b(X_2 − μ_2)]²
                     = a²σ_11 + b²σ_22 + 2ab σ_12
                     = [a b] [σ_11 σ_12; σ_12 σ_22] [a b]'
                     = c'Σc
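The identity Var(c'X) = c'Σc is easy to verify by simulation. The distribution, μ, Σ and c below are illustrative choices:

```python
# Numerical check of Var(c'X) = c' Sigma c on a simulated Gaussian sample.
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
c = np.array([3.0, -1.0])

X = rng.multivariate_normal(mu, Sigma, size=200_000)
sample_var = (X @ c).var()       # empirical variance of c'X
exact_var = c @ Sigma @ c        # 9*2 + 1*1 + 2*3*(-1)*0.5 = 16
print(sample_var, exact_var)
```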
Reminder of some results for random vectors

A "new" random vector generated from linear combinations of a random vector:

    Z = CX,    C = [ c_11  c_12  ...  c_1p
                     c_21  c_22  ...  c_2p
                     ...
                     c_q1  c_q2  ...  c_qp ]

    μ_Z = E(Z) = E(CX) = C μ_X
    Σ_Z = Cov(CX) = C Σ_X C'
Reminder of some results for random vectors

Let A be a k×k square symmetric matrix. Then it has k pairs of eigenvalues and eigenvectors, and A can be decomposed as:

    A = λ_1 e_1 e_1' + λ_2 e_2 e_2' + ... + λ_k e_k e_k' = P Λ P'

Positive-definite matrix:

    x'Ax > 0 for all x ≠ 0   ⇔   λ_1 ≥ λ_2 ≥ ... ≥ λ_k > 0

Note:

    x'Ax = λ_1 (x'e_1)² + ... + λ_k (x'e_k)²
Reminder of some results for random vectors

Square root matrix of a positive-definite matrix:

    A^(1/2) = P Λ^(1/2) P' = Σ_{i=1}^{k} √λ_i e_i e_i'

Inverse of a positive-definite matrix:

    A⁻¹ = P Λ⁻¹ P' = Σ_{i=1}^{k} (1/λ_i) e_i e_i'
Reminder of some results for random vectors

Let A be a positive-definite matrix with eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_p > 0 and eigenvectors e_1, e_2, ..., e_p. Then

    max_{x ≠ 0} x'Ax / x'x = λ_1              (when x = e_1)
    min_{x ≠ 0} x'Ax / x'x = λ_p              (when x = e_p)
    max_{x ⊥ e_1,...,e_k} x'Ax / x'x = λ_{k+1}   (when x = e_{k+1})
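These Rayleigh-quotient extrema can be confirmed against NumPy's eigendecomposition; the symmetric positive-definite matrix below is an illustrative choice:

```python
# Checking max/min of x'Ax / x'x against the eigendecomposition of A.
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
vals, vecs = np.linalg.eigh(A)       # eigenvalues in ascending order
lam_min, lam_max = vals[0], vals[-1]

def rayleigh(x):
    return (x @ A @ x) / (x @ x)

# Extrema are attained at the corresponding eigenvectors.
print(rayleigh(vecs[:, -1]), lam_max)
print(rayleigh(vecs[:, 0]), lam_min)

# Any other direction gives a value between the two extremes.
rng = np.random.default_rng(2)
x = rng.normal(size=2)
assert lam_min - 1e-9 <= rayleigh(x) <= lam_max + 1e-9
```

This fact is the backbone of PCA, covered later in the course: the leading principal component is the direction maximizing the quotient with A the covariance matrix.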
Reminder of some results for random vectors

Proof of the first (and second) point of the previous slide. Write A^(1/2) = P Λ^(1/2) P' and let y = P'x. Then

    x'Ax / x'x = x' A^(1/2) A^(1/2) x / x' P P' x
               = x' P Λ^(1/2) P' P Λ^(1/2) P' x / y'y
               = y'Λy / y'y
               = Σ_{i=1}^{p} λ_i y_i² / Σ_{i=1}^{p} y_i²
               ≤ λ_1 Σ_{i=1}^{p} y_i² / Σ_{i=1}^{p} y_i² = λ_1

When x = e_1, y = P'e_1 = [1 0 ... 0]', so y'Λy / y'y = λ_1 = e_1'Ae_1 / e_1'e_1, and the maximum is attained.
Reminder of some results for random vectors

With a sample X_1, ..., X_n of the random vector:

    E(X̄) = E( (1/n)X_1 + (1/n)X_2 + ... + (1/n)X_n ) = μ

    (X̄ − μ)(X̄ − μ)' = ( (1/n) Σ_{i=1}^{n} X_i − μ )( (1/n) Σ_{j=1}^{n} X_j − μ )'
                       = (1/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} (X_i − μ)(X_j − μ)'

    Cov(X̄) = E[(X̄ − μ)(X̄ − μ)'] = (1/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} E[(X_i − μ)(X_j − μ)'] = (1/n) Σ
Reminder of some results for random vectors

To estimate the mean vector and covariance matrix:

    μ̂ = X̄
    Σ̂ = S = (1/(n−1)) Σ_{i=1}^{n} (X_i − X̄)(X_i − X̄)'
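These estimators correspond directly to NumPy's built-ins; a quick check on a simulated sample (μ and Σ below are illustrative):

```python
# Sample mean vector X_bar and sample covariance matrix
# S = (1/(n-1)) * sum_i (X_i - X_bar)(X_i - X_bar)'.
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([0.0, 5.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
X = rng.multivariate_normal(mu, Sigma, size=100_000)   # rows = observations

X_bar = X.mean(axis=0)
S = np.cov(X, rowvar=False)   # np.cov uses the n-1 denominator by default
print(X_bar, S)
```

With 100,000 observations both estimates land close to the true μ and Σ, as expected from Cov(X̄) = (1/n)Σ.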