Lecture 1: Overview

Bios 760R, Lecture 1


Overview

Overview of the course

Classification

The “curse of dimensionality” problem is exacerbated in Bioinformatics applications

Reminder of some background knowledge

Tianwei Yu
Room 330
tyu8@sph.emory.edu
Friday 9 am ~ 11 pm, Room P51

Overview

Focus of the course:

Classification
Clustering
Dimension reduction

Schedule:

Jan 16    Overview
Jan 23    Bayesian decision theory
Jan 30    Density estimation
Feb 6     Linear machine
Feb 13    Support vector machine
Feb 20    Tree and forest
Feb 27    Bump hunting
Mar 6     Boosting
Mar 13    No class.
Mar 20    Generalization of models
Mar 27    Similarity measures, hierarchical clustering, subspace clustering
Apr 3     Center-based & model-based clustering
Apr 10    PCA, SIR
Apr 17    ICA, PLSDA
Apr 24    In-class presentation

Overview

References:

Textbook:
The Elements of Statistical Learning. Hastie, Tibshirani & Friedman.

Other references:
Pattern Classification. Duda, Hart & Stork.
Data Clustering: Theory, Algorithms, and Applications. Gan, Ma & Wu.
Applied Multivariate Statistical Analysis. Johnson & Wichern.

Evaluation:
In-class presentation of a research paper selected by the instructor.

Overview

Machine Learning / Data Mining

Supervised learning (“direct data mining”):
- Classification
- Estimation
- Prediction

Unsupervised learning (“indirect data mining”):
- Clustering
- Association rules
- Description, dimension reduction and visualization

Semi-supervised learning

Modified from Figure 1.1 of Data Clustering by Gan, Ma and Wu.

Overview

In supervised learning, the problem is well-defined:

Given a set of observations {x_i, y_i}, estimate the density Pr(Y | X).

Usually the goal is to find the location parameter that minimizes the expected error at each x:

\[
f(x) = \arg\min_{\theta}\, E_{Y|X}\, L(Y, \theta)
\]

Objective criteria exist to measure the success of a supervised learning mechanism: the error rate on testing (or cross-validation) data.
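For the two most common loss functions the minimizer has a familiar form (standard results, stated here only as a reminder):

\[
L(Y, \theta) = (Y - \theta)^2 \;\Rightarrow\; f(x) = E(Y \mid X = x),
\qquad
L(Y, \theta) = I(Y \neq \theta) \;\Rightarrow\; f(x) = \arg\max_k \Pr(Y = k \mid X = x).
\]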

Overview

In unsupervised learning there is no output variable; all we observe is a set {x_i}. The goal is to infer Pr(X) and/or some of its properties.

When the dimension is low, nonparametric density estimation is possible. When the dimension is high, we may need to find simple properties without density estimation, or apply strong assumptions to estimate the density.

There are no objective criteria to evaluate the outcome. Heuristic arguments are used to motivate the methods, and a reasonable (if subjective) explanation of the outcome is expected from the field of study.

Classification

The general scheme. An example.

Classification

In most cases, a single feature is not enough to generate a good classifier.

Classification

Two extremes: overly rigid and overly flexible classifiers.

Classification

Goal: an optimal trade-off between model simplicity and training-set performance. This is similar to AIC/BIC model selection in regression.
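A minimal numpy sketch of this trade-off (my own illustration, not from the slides; the data-generating function, noise level, and degrees are arbitrary choices): fit polynomials of increasing flexibility and compare training error with error on held-out data.

    import numpy as np

    rng = np.random.default_rng(0)
    f = lambda x: np.sin(2 * np.pi * x)              # assumed "true" function

    x_train = rng.uniform(0, 1, 30)
    y_train = f(x_train) + rng.normal(0, 0.3, 30)
    x_test = rng.uniform(0, 1, 200)
    y_test = f(x_test) + rng.normal(0, 0.3, 200)

    for degree in [1, 3, 5, 9]:
        coef = np.polyfit(x_train, y_train, degree)  # more flexible as degree grows
        train_mse = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
        test_mse = np.mean((np.polyval(coef, x_test) - y_test) ** 2)
        print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
    # Training error keeps falling with flexibility; test error eventually rises again.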

Classification

An example of the overall scheme involving classification.

Classification

A classification project: a systematic view.

Curse of Dimensionality

Bellman, R.E., 1961.

In p dimensions, to get a hypercube with volume r, the edge length needed is r^{1/p}.

In 10 dimensions, to capture 1% of the data to get a local average, we need 63% of the range of each input variable.
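As a quick check of that number: the edge length needed to capture a fraction r of the volume in p dimensions is e_p(r) = r^{1/p}, so

\[
e_{10}(0.01) = 0.01^{1/10} \approx 0.63, \qquad e_{10}(0.10) = 0.10^{1/10} \approx 0.80 ,
\]

i.e. even a 10% “local” neighborhood must span 80% of the range of each input.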

Curse of Dimensionality

In other words:

To get a “dense” sample, if we need N = 100 samples in 1 dimension, then we need N = 100^10 samples in 10 dimensions.

In high dimensions the data are always sparse and do not support density estimation.

More data points are closer to the boundary than to any other data point, so prediction is much harder near the edge of the training sample.
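A small simulation sketch of the boundary effect (my own illustration; N, the dimensions tried, and the seed are arbitrary): sample N points uniformly in the unit cube and, for each point, compare its distance to the nearest other point with its distance to the nearest face of the cube.

    import numpy as np

    rng = np.random.default_rng(1)
    N = 500

    for p in [1, 2, 5, 10, 20]:
        X = rng.uniform(0, 1, size=(N, p))
        # pairwise distances between sample points
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        nn_dist = d.min(axis=1)                       # distance to nearest other point
        edge_dist = np.minimum(X, 1 - X).min(axis=1)  # distance to the closest face
        frac = np.mean(edge_dist < nn_dist)
        print(f"p = {p:2d}: fraction closer to the boundary than to any neighbor = {frac:.2f}")
    # The fraction grows toward 1 with p: in high dimensions most points sit near an edge.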

Curse of Dimensionality

Just a reminder: the expected prediction error contains variance and bias components.

Under the model Y = f(X) + ε,

\[
\begin{aligned}
EPE(x_0) &= E\big[(Y - \hat f(x_0))^2\big] \\
&= \sigma_\varepsilon^2 + \big[f(x_0) - E\hat f(x_0)\big]^2 + E\big[(\hat f(x_0) - E\hat f(x_0))^2\big] \\
&= \sigma_\varepsilon^2 + \mathrm{Bias}^2\big(\hat f(x_0)\big) + \mathrm{Var}\big(\hat f(x_0)\big).
\end{aligned}
\]
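A small Monte Carlo sketch of this decomposition (my own illustration; the true function, noise level, estimator, and x_0 are all arbitrary choices): repeatedly draw training sets from Y = f(X) + ε, fit a deliberately rigid estimator, and compare the simulated EPE at x_0 with σ² + Bias² + Var.

    import numpy as np

    rng = np.random.default_rng(2)
    f = lambda x: np.sin(2 * np.pi * x)     # assumed true regression function
    sigma, x0, n, reps = 0.3, 0.8, 50, 2000

    preds = np.empty(reps)
    for r in range(reps):
        x = rng.uniform(0, 1, n)
        y = f(x) + rng.normal(0, sigma, n)
        coef = np.polyfit(x, y, 2)          # a rigid (biased) quadratic fit
        preds[r] = np.polyval(coef, x0)

    bias2 = (preds.mean() - f(x0)) ** 2
    var = preds.var()
    y0 = f(x0) + rng.normal(0, sigma, reps)              # fresh test responses at x0
    epe = np.mean((y0 - preds) ** 2)

    print(f"simulated EPE:           {epe:.4f}")
    print(f"sigma^2 + Bias^2 + Var:  {sigma**2 + bias2 + var:.4f}")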

Curse of Dimensionality

We have talked about the curse of dimensionality in the sense of density estimation.

In a classification problem, we do not necessarily need density estimation:

Generative models --- care about the class density functions.

Discriminative models --- care about the boundary.

Example: classifying belt fish and carp. Looking at the length/width ratio is enough. Why should we care how many teeth each kind of fish has, or what shape their fins have?

High-throughput biological data

We talk about the “curse of dimensionality” when N is not >> p. In bioinformatics, usually N < 100 and p > 1000.

How to deal with this N << p issue? Dramatically reduce p before model-building, e.g. by filtering genes (a small code sketch follows at the end of this slide) based on:

- variation
- normal/disease test statistics
- projections
- functional groups / network modules
- ......

For most of this course, we will pretend the N << p issue doesn’t exist. Some methods claim to be resistant to it.
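A minimal sketch of variance- and t-statistic-based gene filtering (my own illustration; the array dimensions, labels, and cutoff of 200 genes are made up):

    import numpy as np

    rng = np.random.default_rng(3)
    n, p = 60, 5000                      # hypothetical: 60 samples, 5000 genes
    X = rng.normal(size=(n, p))          # expression matrix (samples x genes)
    y = np.repeat([0, 1], n // 2)        # normal vs. disease labels

    # Filter 1: keep the genes with the largest overall variation
    keep_var = np.argsort(X.var(axis=0))[::-1][:200]

    # Filter 2: keep the genes with the largest two-sample t statistics
    m1, m0 = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
    v1, v0 = X[y == 1].var(axis=0, ddof=1), X[y == 0].var(axis=0, ddof=1)
    t = (m1 - m0) / np.sqrt(v1 / (y == 1).sum() + v0 / (y == 0).sum())
    keep_t = np.argsort(np.abs(t))[::-1][:200]

    X_reduced = X[:, keep_t]             # model-building then proceeds with p = 200
    print(X_reduced.shape)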

Reminder of some results for random vectors

The covariance matrix (not limited to the Gaussian):

\[
\Sigma = E\big[(X - \mu)(X - \mu)'\big]
= E\begin{pmatrix} X_1 - \mu_1 \\ X_2 - \mu_2 \\ \vdots \\ X_p - \mu_p \end{pmatrix}
   \begin{pmatrix} X_1 - \mu_1 & X_2 - \mu_2 & \cdots & X_p - \mu_p \end{pmatrix}
= \begin{pmatrix}
  \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\
  \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\
  \vdots      &             & \ddots & \vdots      \\
  \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp}
  \end{pmatrix}
\]

The multivariate Gaussian distribution:

* The Gaussian is fully defined by its mean vector and covariance matrix (the first and second moments).
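As a reminder of the density being referred to (a standard formula, not reconstructed from the slide), the p-dimensional Gaussian N_p(μ, Σ) has

\[
f(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}}
       \exp\!\left\{ -\tfrac{1}{2}\,(x - \mu)'\,\Sigma^{-1}\,(x - \mu) \right\}.
\]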





Reminder of some results for random vectors

The correlation matrix:

\[
\rho = \begin{pmatrix}
  \rho_{11} & \rho_{12} & \cdots & \rho_{1p} \\
  \rho_{21} & \rho_{22} & \cdots & \rho_{2p} \\
  \vdots    &           & \ddots & \vdots    \\
  \rho_{p1} & \rho_{p2} & \cdots & \rho_{pp}
\end{pmatrix},
\qquad
\rho_{ik} = \frac{\sigma_{ik}}{\sqrt{\sigma_{ii}\,\sigma_{kk}}}
\]

Relationship with the covariance matrix: with V^{1/2} = diag(\sqrt{\sigma_{11}}, \ldots, \sqrt{\sigma_{pp}}),

\[
\Sigma = V^{1/2}\,\rho\,V^{1/2}.
\]
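A quick numpy check of this relationship (my own sketch; the particular Σ is arbitrary):

    import numpy as np

    Sigma = np.array([[4.0, 1.2, 0.5],
                      [1.2, 2.0, 0.3],
                      [0.5, 0.3, 1.0]])               # an arbitrary covariance matrix

    V_half = np.diag(np.sqrt(np.diag(Sigma)))          # V^{1/2} = diag(sqrt(sigma_ii))
    V_half_inv = np.linalg.inv(V_half)
    rho = V_half_inv @ Sigma @ V_half_inv              # correlation matrix

    print(rho)                                         # unit diagonal, rho_ik = sigma_ik / sqrt(sigma_ii sigma_kk)
    print(np.allclose(V_half @ rho @ V_half, Sigma))   # True: Sigma = V^{1/2} rho V^{1/2}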



Reminder of some results for random vectors

A linear combination of the elements of a random vector:

\[
c'X = c_1 X_1 + \cdots + c_p X_p, \qquad
E(c'X) = c'\mu, \qquad
\mathrm{Var}(c'X) = c'\Sigma c
\]

Quadratic form:

\[
x'Ax = \sum_{i=1}^{k}\sum_{j=1}^{k} a_{ij}\, x_i x_j
\]

2-D example:

\[
\begin{aligned}
\mathrm{Var}(aX_1 + bX_2)
&= E\big[(aX_1 + bX_2) - (a\mu_1 + b\mu_2)\big]^2
 = E\big[a(X_1 - \mu_1) + b(X_2 - \mu_2)\big]^2 \\
&= a^2\sigma_{11} + b^2\sigma_{22} + 2ab\,\sigma_{12}
 = \begin{pmatrix} a & b \end{pmatrix}
   \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix}
   \begin{pmatrix} a \\ b \end{pmatrix}
 = c'\Sigma c
\end{aligned}
\]
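A Monte Carlo sanity check of Var(c'X) = c'Σc (my own sketch; μ, Σ, and c are arbitrary):

    import numpy as np

    rng = np.random.default_rng(4)
    mu = np.array([1.0, -2.0])
    Sigma = np.array([[2.0, 0.8],
                      [0.8, 1.5]])
    c = np.array([3.0, -1.0])                      # a and b in the 2-D example

    X = rng.multivariate_normal(mu, Sigma, size=200_000)
    lin = X @ c                                    # c'X for every simulated observation

    print(lin.var())                               # simulated Var(aX1 + bX2)
    print(c @ Sigma @ c)                           # c' Sigma c  (should be close)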


Reminder of some results for random vectors

A “new” random vector generated from linear combinations of a random vector:

\[
Z = \begin{pmatrix}
  c_{11} & c_{12} & \cdots & c_{1p} \\
  c_{21} & c_{22} & \cdots & c_{2p} \\
  \vdots &        & \ddots & \vdots \\
  c_{q1} & c_{q2} & \cdots & c_{qp}
\end{pmatrix} X = CX
\]

\[
\mu_Z = E(Z) = E(CX) = C\mu_X, \qquad
\Sigma_Z = \mathrm{Cov}(CX) = C\,\Sigma_X\,C'
\]
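The same kind of Monte Carlo check for the vector case (my own sketch; C, μ_X, and Σ_X are arbitrary):

    import numpy as np

    rng = np.random.default_rng(5)
    mu_X = np.array([0.0, 1.0, -1.0])
    Sigma_X = np.array([[1.0, 0.3, 0.0],
                        [0.3, 2.0, 0.5],
                        [0.0, 0.5, 1.5]])
    C = np.array([[1.0, 2.0, 0.0],
                  [0.0, 1.0, -1.0]])               # q x p = 2 x 3

    X = rng.multivariate_normal(mu_X, Sigma_X, size=200_000)
    Z = X @ C.T                                    # each row is CX for one observation

    print(Z.mean(axis=0), C @ mu_X)                # mu_Z vs. C mu_X
    print(np.cov(Z, rowvar=False))                 # simulated Sigma_Z
    print(C @ Sigma_X @ C.T)                       # C Sigma_X C'  (should be close)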

Reminder of some results for random vectors

Let A be a k x k square symmetric matrix. Then it has k pairs of eigenvalues and eigenvectors, and A can be decomposed as

\[
A = \lambda_1 e_1 e_1' + \lambda_2 e_2 e_2' + \cdots + \lambda_k e_k e_k' = P \Lambda P'.
\]

Positive-definite matrix:

\[
x'Ax > 0 \quad \text{for all } x \neq 0
\qquad\Longleftrightarrow\qquad
\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k > 0.
\]

Note:

\[
x'Ax = \lambda_1 (x'e_1)^2 + \cdots + \lambda_k (x'e_k)^2.
\]
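A numpy illustration of the decomposition (my own sketch; the symmetric positive-definite A is arbitrary):

    import numpy as np

    A = np.array([[4.0, 1.0, 0.5],
                  [1.0, 3.0, 0.2],
                  [0.5, 0.2, 2.0]])                # symmetric and positive definite

    lam, P = np.linalg.eigh(A)                     # columns of P are the eigenvectors e_i
    print(np.allclose(P @ np.diag(lam) @ P.T, A))  # True: A = P Lambda P'

    rng = np.random.default_rng(6)
    x = rng.normal(size=3)
    print(x @ A @ x)                               # x'Ax
    print(np.sum(lam * (x @ P) ** 2))              # sum_i lambda_i (x'e_i)^2 -- same value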
Reminder of some results for random vectors

Square root matrix of a positive-definite matrix:

\[
A^{1/2} = P \Lambda^{1/2} P' = \sum_{i=1}^{k} \sqrt{\lambda_i}\; e_i e_i'
\]

Inverse of a positive-definite matrix:

\[
A^{-1} = P \Lambda^{-1} P' = \sum_{i=1}^{k} \frac{1}{\lambda_i}\; e_i e_i'
\]
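A numpy check of both constructions (my own sketch, reusing an arbitrary positive-definite A):

    import numpy as np

    A = np.array([[4.0, 1.0, 0.5],
                  [1.0, 3.0, 0.2],
                  [0.5, 0.2, 2.0]])                # arbitrary positive-definite matrix

    lam, P = np.linalg.eigh(A)
    A_half = P @ np.diag(np.sqrt(lam)) @ P.T       # A^{1/2} = P Lambda^{1/2} P'
    A_inv = P @ np.diag(1.0 / lam) @ P.T           # A^{-1}  = P Lambda^{-1} P'

    print(np.allclose(A_half @ A_half, A))         # True
    print(np.allclose(A_inv, np.linalg.inv(A)))    # True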


Reminder of some results for random vectors

Let A be a positive-definite matrix with eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_p > 0 and eigenvectors e_1, e_2, ..., e_p. Then

\[
\max_{x \neq 0} \frac{x'Ax}{x'x} = \lambda_1 \;\; (\text{when } x = e_1),
\qquad
\min_{x \neq 0} \frac{x'Ax}{x'x} = \lambda_p \;\; (\text{when } x = e_p),
\]

\[
\max_{x \perp e_1, \ldots, e_k} \frac{x'Ax}{x'x} = \lambda_{k+1} \;\; (\text{when } x = e_{k+1}).
\]
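A quick numerical check of these extremal properties (my own sketch; A and the random directions are arbitrary):

    import numpy as np

    rng = np.random.default_rng(7)
    A = np.array([[4.0, 1.0, 0.5],
                  [1.0, 3.0, 0.2],
                  [0.5, 0.2, 2.0]])                # arbitrary positive-definite matrix

    lam, P = np.linalg.eigh(A)                     # eigh returns eigenvalues in ascending order
    X = rng.normal(size=(100_000, 3))              # many random directions x
    ratios = np.einsum('ij,jk,ik->i', X, A, X) / np.einsum('ij,ij->i', X, X)

    print(ratios.max(), lam[-1])                   # close to, and never above, lambda_1
    print(ratios.min(), lam[0])                    # close to, and never below, lambda_p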


Reminder of some results for random vectors

Proof of the first (and second) point of the previous slide.

Write A^{1/2} = P Λ^{1/2} P' and let y = P'x. Then

\[
\frac{x'Ax}{x'x}
= \frac{x'A^{1/2}A^{1/2}x}{x'PP'x}
= \frac{x'P\Lambda^{1/2}P'P\Lambda^{1/2}P'x}{y'y}
= \frac{y'\Lambda y}{y'y}
= \frac{\sum_{i=1}^{p} \lambda_i y_i^2}{\sum_{i=1}^{p} y_i^2}
\le \lambda_1 \frac{\sum_{i=1}^{p} y_i^2}{\sum_{i=1}^{p} y_i^2}
= \lambda_1 .
\]

When x = e_1, y = P'e_1 = (1, 0, \ldots, 0)', so

\[
\frac{y'\Lambda y}{y'y} = \lambda_1 = \frac{e_1'Ae_1}{e_1'e_1},
\]

i.e. the bound is attained. The argument for the minimum (with λ_p and e_p) is identical.


Reminder of some results for random vectors

With a sample X_1, ..., X_n of the random vector:

\[
E(\bar{X}) = E\!\left(\tfrac{1}{n}X_1 + \tfrac{1}{n}X_2 + \cdots + \tfrac{1}{n}X_n\right) = \mu
\]

\[
(\bar{X} - \mu)(\bar{X} - \mu)'
= \left(\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)\right)\left(\frac{1}{n}\sum_{j=1}^{n}(X_j - \mu)\right)'
= \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}(X_i - \mu)(X_j - \mu)'
\]

\[
\mathrm{Cov}(\bar{X}) = E\big[(\bar{X} - \mu)(\bar{X} - \mu)'\big]
= \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} E\big[(X_i - \mu)(X_j - \mu)'\big]
= \frac{1}{n}\Sigma,
\]

since the cross terms vanish for i ≠ j when the observations are independent.
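A simulation sketch of Cov(X̄) = Σ/n (my own check; μ, Σ, n, and the number of replicates are arbitrary):

    import numpy as np

    rng = np.random.default_rng(8)
    mu = np.array([0.0, 2.0])
    Sigma = np.array([[1.0, 0.4],
                      [0.4, 2.0]])
    n, reps = 25, 100_000

    # draw `reps` samples of size n and record each sample mean vector
    samples = rng.multivariate_normal(mu, Sigma, size=(reps, n))
    xbar = samples.mean(axis=1)                    # reps x 2 matrix of sample means

    print(np.cov(xbar, rowvar=False))              # simulated Cov(X_bar)
    print(Sigma / n)                               # Sigma / n  (should be close)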




Reminder of some results for random vectors

To estimate the mean vector and covariance matrix:

\[
\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad
S = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})'
\]
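In numpy (a small sketch; any data matrix with observations in rows would do):

    import numpy as np

    rng = np.random.default_rng(9)
    X = rng.multivariate_normal([0.0, 2.0], [[1.0, 0.4], [0.4, 2.0]], size=50)

    xbar = X.mean(axis=0)                          # sample mean vector
    S = (X - xbar).T @ (X - xbar) / (len(X) - 1)   # sample covariance matrix, 1/(n-1)

    print(xbar)
    print(S)
    print(np.allclose(S, np.cov(X, rowvar=False))) # matches numpy's built-in estimator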
