Sean's Class Lecture 1


Expression Profiling

Microarrays vs. RNA-seq

Question:

What’s a microarray?

Answer:

A microarray is a high-density array of “molecules”
attached to a solid support.


1) What does high-density mean?
- millimeters, microns, sub-micron?
- Affymetrix patent: 1,000 probes/cm²


2) What kind of molecules?
- nucleic acids, proteins, organics, cells, tissues


3) How are they attached?
- specifically, non-specifically, covalently, non-covalently


4) What kind of support?
- glass, nylon, other polymers

Array Options

- 70mers
- Up to 390,000 features per array

Experimental Design 1

Key Point: We know beforehand what sequence is in each position on the array.

- Synergy with genome sequencing projects

What kind of experiments can I do on a DNA chip?

- Any assay whose readout is the enrichment of nucleic acid sequences

What is measured?

- The fluorescent intensity, or ratio of intensities, at a particular location on the array

Experimental Design 2

[Diagram: mRNA 1 → cDNA 1 and mRNA 2 → cDNA 2]

Scaling Signal Intensities

- Total fluorescence is scaled to be equal in both experiments
- Every spot on the array is multiplied by the same scaling factor

SF = T / Σ_{i=1}^{N} I_i

Where N = number of genes on array
      I = intensity on the array
      T = some set arbitrary number

Assuming total RNA is constant between experiments
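The scaling step above can be sketched in a few lines: one scaling factor per array so that every array sums to the same arbitrary target T. The intensity values below are invented for illustration.

```python
# Total-intensity scaling: SF = T / sum_i(I_i), applied to every spot.
# Intensities here are hypothetical.

def scale_intensities(intensities, target_total=1_000_000):
    """Multiply every spot by one scaling factor so the total equals T."""
    sf = target_total / sum(intensities)
    return [i * sf for i in intensities]

chip1 = [120.0, 340.0, 55.0, 900.0]
chip2 = [60.0, 150.0, 30.0, 410.0]   # dimmer overall (e.g. weaker labeling)

scaled1 = scale_intensities(chip1)
scaled2 = scale_intensities(chip2)
print(sum(scaled1), sum(scaled2))  # both totals now equal T
```

After scaling, spot-to-spot ratios between the two chips are comparable, under the stated assumption that total RNA is constant between experiments.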

Fold-Changes

Why use the log?

Answer: it provides symmetry around zero.

10 copies mRNA / 20 copies mRNA = 0.5
20 copies mRNA / 10 copies mRNA = 2.0

But,

log2(10 copies mRNA / 20 copies mRNA) = -1
log2(20 copies mRNA / 10 copies mRNA) = 1
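The symmetry above is easy to check directly:

```python
import math

# Log2 fold-changes are symmetric around zero: a 2-fold decrease and a
# 2-fold increase get equal and opposite values.

def log2_fold_change(a, b):
    return math.log2(a / b)

print(log2_fold_change(10, 20))  # -1.0
print(log2_fold_change(20, 10))  #  1.0
```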


RNA-seq

- Rapidly replacing array-based methods

mRNA → cDNA → sequence → count

Counting can be non-trivial:

- 3’ poly-T priming bias
- 5’ and 3’ UTR boundaries
- Alternate splicing
- Cryptic exons
- Cryptic start sites
- Paired-end reads can help

RPKM: Reads Per Kilobase per Million reads

Assuming total RNA is constant between experiments
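RPKM, as expanded above, normalizes a gene's read count by transcript length (in kilobases) and by sequencing depth (in millions of mapped reads). A minimal sketch, with made-up numbers:

```python
# RPKM = reads mapping to a gene, per kilobase of transcript, per million
# mapped reads in the sample. The counts below are invented.

def rpkm(gene_reads, gene_length_bp, total_mapped_reads):
    return gene_reads / (gene_length_bp / 1_000) / (total_mapped_reads / 1_000_000)

# 500 reads on a 2 kb transcript, out of 10 million mapped reads:
print(rpkm(500, 2_000, 10_000_000))  # 25.0
```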

RNA-seq continued

- No genome sequence necessary (de novo transcriptome assembly)
- Dynamic range and sensitivity limited only by sequencing capacity
- Specificity an issue with short reads
  - Splice sites
  - 5’ and 3’ UTR mapping
- Multiplexing!

Multiplexing with sample barcodes

[Diagram: for each sample (Sample 1, Sample 2), mRNA → cDNA → ligate barcoded sequencing primers → pool samples and sequence]

Finding significant fold changes

[Diagram: Sample 1 with X1 gene-X reads out of C1 total; Sample 2 with X2 gene-X reads out of C2 total]

Is X1/C1 less than X2/C2?

C1 = total number of mapped reads in sample 1
X1 = number of reads in sample 1 that map to gene X
C2 = total number of mapped reads in sample 2
X2 = number of reads in sample 2 that map to gene X

Significance testing with Hypergeometric (Fisher's Exact Test)

[Diagram: Sample 1 (X1 of C1 reads) compared against the pooled sample (X1+X2 of C1+C2 reads)]

H0: X1/C1 = (X1+X2)/(C1+C2)
Ha: X1/C1 < (X1+X2)/(C1+C2)

Significance testing with Hypergeometric (Fisher's Exact Test)

P(x ≤ X1) = Σ_{i=0}^{X1} C(X1+X2, i) · C(C1+C2−X1−X2, C1−i) / C(C1+C2, C1)

Remember that: C(C, X) = C! / (X! (C−X)!)
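The one-sided test above can be sketched directly from binomial coefficients: draw sample 1's C1 reads from the pooled C1+C2, and ask how often gene X gets X1 or fewer of its pooled X1+X2 reads by chance. The read counts below are hypothetical.

```python
from math import comb

# One-sided Fisher's exact test on read counts, under the hypergeometric null.
# Counts here are invented for illustration.

def fisher_one_sided(x1, c1, x2, c2):
    """P(x <= X1): probability sample 1 gets this few gene-X reads by chance."""
    k, n = x1 + x2, c1 + c2                  # pooled gene-X reads, pooled total
    denom = comb(n, c1)
    return sum(comb(k, i) * comb(n - k, c1 - i) for i in range(0, x1 + 1)) / denom

p = fisher_one_sided(x1=2, c1=1000, x2=12, c2=1000)
print(p)  # small p: gene X looks underrepresented in sample 1
```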

…but with replication we just revert to t-tests (more or less)

[Diagram: three replicates (A, B, C) per sample, giving counts X1A/C1A, X1B/C1B, X1C/C1C for sample 1 and X2A/C2A, X2B/C2B, X2C/C2C for sample 2]

…but with replication we just revert to t-tests (more or less)

Is (X1A/C1A, X1B/C1B, X1C/C1C) different from (X2A/C2A, X2B/C2B, X2C/C2C)?

Gaussian (Normal) Distributions I

Mean: x̄ = (1/n) Σ_{i=1}^{n} x_i

Standard Deviation: s = √[ Σ_{i=1}^{n} (x_i − x̄)² / (n−1) ]

[Figure: bell-shaped histogram, # of people vs. height]
What is a significant change?

- Arbitrary
  - 2-fold
  - Top 20
- Two-sample t-test

  t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2),  H0: μ1 = μ2

- P-values, multiple hypotheses, and the Bonferroni correction
- SAM, Tusher et al. (2001) PNAS 98, 5116-5121
- ANOVA
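The two-sample t statistic can be computed with the standard library alone; the replicate ratios below are invented for illustration.

```python
import math
from statistics import mean, variance

# Two-sample t statistic: t = (mean1 - mean2) / sqrt(s1^2/n1 + s2^2/n2).
# statistics.variance is the sample (n-1) variance, matching s^2 above.

def t_statistic(sample1, sample2):
    n1, n2 = len(sample1), len(sample2)
    return (mean(sample1) - mean(sample2)) / math.sqrt(
        variance(sample1) / n1 + variance(sample2) / n2
    )

cond1 = [5.1, 4.8, 5.3]   # replicate measurements, sample 1
cond2 = [7.9, 8.4, 8.1]   # replicate measurements, sample 2
print(t_statistic(cond1, cond2))  # large negative t: a strong difference
```

With replication, comparing the replicate ratios via t (or a variant like SAM) replaces the single-count Fisher test.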
Microarray vs. RNA-seq

- Cost, Time, Throughput
- Serial vs. Parallel
- Sensitivity
- Specificity
- Signal to Noise
- Dynamic Range

Gene Clustering

- Metrics for determining coexpression
- Unsupervised Clustering
- Supervised Clustering

Gene Clustering

[Figure: expression of Gene 1 across Condition 1, Condition 2, Condition 3]

Clustering Gene Expression Data

- Choose a distance metric
  - Pearson Correlation
  - Spearman Correlation
  - Euclidean Distance
  - Mutual Information
- Choose clustering algorithm
  - Hierarchical
  - Agglomerative
  - Principal Component Analysis
  - Super-paramagnetic and others

[Figure: expression vectors v1–v5 plotted in Condition 1 vs. Condition 2 space]

Pearson Correlation Coefficient

- Compares scaled profiles!
- Can detect inverse relationships
- Most commonly used
- Spearman rank correlation technically more correct

r = (1/(n−1)) Σ_{i=1}^{n} [(x_i − x̄)/s_x] · [(y_i − ȳ)/s_y]

n = number of conditions
x̄ = average expression of gene x in all n conditions
ȳ = average expression of gene y in all n conditions
s_x = standard deviation of x
s_y = standard deviation of y
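The definition above translates directly into code: average the product of the z-scored profiles, with the 1/(n−1) normalization. Profiles below are invented.

```python
from statistics import mean, stdev

# Pearson correlation as defined above; stdev is the sample (n-1) version.

def pearson(x, y):
    n = len(x)
    mx, my, sx, sy = mean(x), mean(y), stdev(x), stdev(y)
    return sum((xi - mx) / sx * (yi - my) / sy for xi, yi in zip(x, y)) / (n - 1)

a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 8.0]    # scaled copy of a
c = [8.0, 6.0, 4.0, 2.0]    # inverted profile
print(pearson(a, b), pearson(a, c))  # approximately 1.0 and -1.0
```

Because profiles are z-scored first, a scaled copy correlates perfectly, and an inverted profile shows up as a strong negative correlation.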

Correlation Examples

[Charts: raw vs. normalized profiles for two gene pairs]

Correlation = 0.94
Correlation = -0.087

Correlation Pitfalls 1

[Charts: Gene A and Gene B across chips 1–7, raw data and normalized data]

Correlation = 0.97

Correlation Pitfalls 2

[Charts: Gene C and Gene D across chips 1–7, raw data and normalized data]

Correlation = -0.02
Avoid Pitfalls By Filtering The Data

- Remove genes that do not reach some threshold level in at least one (or more) conditions
- Remove genes whose stdev/mean ratio does not reach some threshold
- For spotted arrays, remove genes whose stdev does not reach some threshold

Euclidean Distance

- Based on Pythagoras (a² + b² = c²)
- Scaled versus unscaled
- Cannot detect inverse relationships

For Gene X = (x1, x2, …, xn) and Gene Y = (y1, y2, …, yn):

d(X, Y) = √[(x1 − y1)² + (x2 − y2)² + … + (xn − yn)²]
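The formula above in code, with a small example showing the "cannot detect inverse relationships" caveat:

```python
import math

# Euclidean distance between expression profiles. Note that an inverted
# profile is "far away" even though it is perfectly anticorrelated.

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

gene_x = [1.0, 2.0, 3.0]
gene_y = [1.0, 2.0, 3.0]
gene_z = [3.0, 2.0, 1.0]   # inverse of gene_x
print(euclidean(gene_x, gene_y))  # 0.0
print(euclidean(gene_x, gene_z))  # large, despite perfect anticorrelation
```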

Clustering: Example 1, Step 1

Algorithm: Hierarchical, Distance Metric: Correlation

[Chart: expression of genes A–E over time (0–360 minutes)]

      A      B      C      D      E
A     -    0.23   0.00   0.95  -0.63
B     -      -    0.91   0.56   0.56
C     -      -      -    0.32   0.77
D     -      -      -      -   -0.36
E     -      -      -      -      -

Highest correlation: A and D (0.95), so merge A with D.

Clustering: Example 1, Step 2

Algorithm: Hierarchical, Distance Metric: Correlation

[Chart: expression of AD (merged), B, C, E over time]

      AD     B      C      E
AD    -    0.37   0.16  -0.52
B     -      -    0.91   0.56
C     -      -      -    0.77
E     -      -      -      -

Highest correlation: B and C (0.91), so merge B with C.

Clustering: Example 1, Step 3

Algorithm: Hierarchical, Distance Metric: Correlation

[Chart: expression of AD, BC, E over time]

      AD     BC     E
AD    -    0.27  -0.52
BC    -      -    0.68
E     -      -      -

Highest correlation: BC and E (0.68), so E joins BC.
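The merge loop of the example can be sketched as follows. One assumption to flag: scoring a merged cluster by the average pairwise correlation of its members (average linkage) is my simplification; the slides recompute correlations against the merged profile, so intermediate numbers differ slightly, but the merge order comes out the same. The similarities below are the step-1 correlations from the example.

```python
from itertools import combinations

# Agglomerative clustering sketch: repeatedly merge the most-correlated
# pair of clusters, using average linkage over the step-1 correlations.

sim = {("A", "B"): 0.23, ("A", "C"): 0.00, ("A", "D"): 0.95, ("A", "E"): -0.63,
       ("B", "C"): 0.91, ("B", "D"): 0.56, ("B", "E"): 0.56,
       ("C", "D"): 0.32, ("C", "E"): 0.77, ("D", "E"): -0.36}

def s(g1, g2):
    return sim[(g1, g2)] if (g1, g2) in sim else sim[(g2, g1)]

def average_link(c1, c2):
    return sum(s(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

clusters = [("A",), ("B",), ("C",), ("D",), ("E",)]
merges = []
while len(clusters) > 1:
    c1, c2 = max(combinations(clusters, 2), key=lambda p: average_link(*p))
    clusters.remove(c1)
    clusters.remove(c2)
    clusters.append(c1 + c2)
    merges.append(c1 + c2)
print(merges)  # AD merges first, then BC, then E joins BC
```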

Tree View

Eisen et al. (1998) PNAS 95: 14863-14868

[Figure: clustered heat map, genes (rows) × conditions (columns)]

Hierarchical Clustering Summary

Advantages
- Easy
- Very Visual
- Flexible (mean, median, etc.)

Disadvantages
- Unrelated Genes Are Eventually Joined
- Hard To Define Clusters
- Manual Interpretation Often Required

[Dendrogram: A and D join first, then B and C, then E]

Clustering: Example 2, Step 1

Algorithm: k-means, Distance Metric: Euclidean Distance

[Scatter plot: expression in condition 1 vs. condition 2, with cluster centers k1, k2, k3]

Clustering: Example 2, Step 2

Algorithm: k-means, Distance Metric: Euclidean Distance

[Scatter plot: genes assigned to the nearest of the cluster centers k1, k2, k3]

Clustering: Example 2, Step 3

Algorithm: k-means, Distance Metric: Euclidean Distance

[Scatter plot: cluster centers k1, k2, k3 moved to the means of their assigned genes]

Clustering: Example 2, Step 4

Algorithm: k-means, Distance Metric: Euclidean Distance

[Scatter plot: genes reassigned to the updated cluster centers k1, k2, k3]

Clustering: Example 2, Step 5

Algorithm: k-means, Distance Metric: Euclidean Distance

[Scatter plot: converged cluster centers k1, k2, k3]

K-means algorithm

1) Pick a number (k) of cluster centers
2) Assign every gene to its nearest cluster center
3) Move each cluster center to the mean of its assigned genes
4) Repeat 2-3 until convergence
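The four steps above can be sketched as a short loop. The 2D points are invented; one simplification to note is that step 1 here takes the first k points as centers for reproducibility, where the algorithm would normally pick them at random.

```python
# Minimal k-means in 2D (expression in condition 1 vs. condition 2).
# Data and initialization are illustrative, not from the slides.

def kmeans(points, k, iters=100):
    centers = list(points[:k])                       # 1) pick k cluster centers
    for _ in range(iters):
        # 2) assign every gene to its nearest cluster center
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[j].append(p)
        # 3) move each cluster center to the mean of its assigned genes
        new = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[i]
               for i, cl in enumerate(clusters)]
        if new == centers:                           # 4) repeat 2-3 until convergence
            break
        centers = new
    return centers, clusters

genes = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.0), (4.2, 3.9), (3.8, 4.1)]
centers, clusters = kmeans(genes, k=2)
print(sorted(centers))
```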

K-means clustering summary

Advantages
- Genes automatically assigned to clusters
- Can vary starting locations of cluster centers to determine initial condition dependence

Disadvantages
- Must pick number of clusters beforehand
- All genes forced into a cluster

Keep in Mind.


Clustering is NOT an analysis in itself.


Clustering cannot NOT work: it will always produce clusters, whether or not they are meaningful.

Evaluating/Analyzing Clusters 1

- Measure spread within and between clusters

[Scatter plots: expression in condition 1 vs. condition 2 for two clusterings]

Evaluating/Analyzing Clusters 2

Enrichment of genes with similar functions:

- MIPS (Munich Information Center For Protein Sequences) http://mips.gsf.de/
- GO (Gene Ontology) Annotations http://www.geneontology.org/
- KEGG (Kyoto Encyclopedia of Genes and Genomes) http://www.genome.ad.jp/kegg/kegg2.html

Example

A particular cluster has 25 coexpressed
genes in it. 15 of these genes are
annotated as being involved in rRNA
transcription.

Is 15/25 significant?

Hypergeometric Probability Distribution

the “overlap problem,” or sampling without replacement

- N = number of genes in the genome (6000 for yeast)
- n = number of genes in the cluster (25)
- m = number of rRNA transcription genes (109 from MIPS)
- s = number of rRNA transcription genes in the cluster (15)

[Venn diagram: genome N containing cluster n and annotation set m, overlapping in s genes]

Hypergeometric Probability Distribution

P(X ≥ s) = Σ_{i=s}^{n} C(m, i) · C(N−m, n−i) / C(N, n)

where C(a, b) = a! / (b! (a−b)!)
Hypergeometric Probability Distribution

With N = 6000, n = 25, m = 109, s = 15:

P(X ≥ 15) = Σ_{i=15}^{25} C(109, i) · C(6000−109, 25−i) / C(6000, 25) ≈ 1×10⁻¹⁶
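The tail sum above can be evaluated exactly with integer binomial coefficients; this computes the enrichment p-value for the worked example (15 of 25 cluster genes annotated, out of 109 annotated genes in a 6000-gene genome).

```python
from math import comb

# Hypergeometric enrichment: P(X >= s) when drawing n genes from a genome
# of N genes, m of which carry the annotation.

def enrichment_p(N, n, m, s):
    return sum(comb(m, i) * comb(N - m, n - i) for i in range(s, n + 1)) / comb(N, n)

p = enrichment_p(N=6000, n=25, m=109, s=15)
print(p)  # vanishingly small: the overlap is highly significant
```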
Therefore in our example…

15/25 rRNA transcription genes in the cluster is significant.

BUT…

1) 10 out of 25 genes in the cluster are not rRNA transcription genes.
2) 94 rRNA transcription genes are not in the cluster.
3) What about the other genes in the cluster?

Class Prediction/Supervised Clustering

- Fisher's Linear Discriminant Analysis
- Perceptrons (Support Vector Machines)

Questions:

Given a set of samples that are known to be derived from two different classes, can we

1) find the genes that best discriminate between the samples,
2) classify a new sample?

Fisher’s Linear Discriminant

Given gene X measured in conditions A and B:

X = x_A,1, x_A,2, …, x_A,6 ; x_B,1, x_B,2, …, x_B,6

FLD(X) = (x̄_A − x̄_B)² / (s_A² + s_B²)

Where:
x̄_A = average expression of X in condition A
x̄_B = average expression of X in condition B
s_A = standard deviation of x_A
s_B = standard deviation of x_B
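The FLD score for a single gene is a one-liner: the squared difference of class means over the summed class variances. Genes with higher scores discriminate better. The expression values below are invented.

```python
from statistics import mean, stdev

# Per-gene FLD score: (mean_A - mean_B)^2 / (s_A^2 + s_B^2).
# Uses the sample (n-1) standard deviation.

def fld(class_a, class_b):
    return (mean(class_a) - mean(class_b)) ** 2 / (stdev(class_a) ** 2 + stdev(class_b) ** 2)

gene1_a, gene1_b = [5.0, 5.2, 4.9], [8.1, 7.9, 8.0]   # well separated classes
gene2_a, gene2_b = [5.0, 8.0, 6.0], [6.5, 5.5, 8.5]   # overlapping classes
print(fld(gene1_a, gene1_b), fld(gene2_a, gene2_b))
# gene 1 scores far higher: a better discriminator
```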

In one dimension the classifier is a point, b

[Plot: expression of Gene X along a line, with threshold b]

For a new sample ( ):
If X < b then red
If X > b then blue
If X = b then ????

In two dimensions…

[Plot: expression of Gene X vs. Gene Y, classes separated by the line y = mx + b]

For a new sample ( ):
If y < mx + b then red
If y > mx + b then blue
If y = mx + b then ????

In three dimensions…

[Plot: expression of Gene X vs. Gene Y; a plane is defined by two quantities (m, b)]

For a new sample ( ):
If m·x > b then blue
If m·x < b then red
If m·x = b then ????


In many dimensions…

[Plot: expression of Gene X vs. Gene Y; a hyperplane is defined by two quantities (m, b)]

For a new sample ( ):
If m·x > b then blue
If m·x < b then red
If m·x = b then ????


What you need to know in practice

- How good is the separating hyperplane?
- How sensitive is the hyperplane to any one condition?
- How good is any particular classification?
- Is the perceptron doing better than randomly guessing?

Kernel-Induced Feature Spaces (Support Vector Machines)

[Diagram: a mapping φ sends points x and o from input space S to feature space S′, where the classes become linearly separable]

Example Kernel Functions:

1) Polynomial: K(x_i, x_j) = (x_i · x_j + 1)^d

2) Gaussian: K(x_i, x_j) = e^(−‖x_i − x_j‖² / σ²)
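The two example kernels translate directly to code. One caveat: the extracted slide leaves the Gaussian exponent ambiguous, so the σ² form below follows the reconstruction above (a common variant divides by 2σ²); d and sigma are free parameters.

```python
import math

# Example kernel functions for vectors of expression values.

def polynomial_kernel(x, y, d=2):
    """K(x_i, x_j) = (x_i . x_j + 1)^d"""
    return (sum(a * b for a, b in zip(x, y)) + 1) ** d

def gaussian_kernel(x, y, sigma=1.0):
    """K(x_i, x_j) = exp(-||x_i - x_j||^2 / sigma^2)"""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / sigma ** 2)

u, v = [1.0, 0.0], [0.0, 1.0]
print(polynomial_kernel(u, v))   # (0 + 1)^2 = 1.0
print(gaussian_kernel(u, u))     # identical points -> 1.0
print(gaussian_kernel(u, v))     # farther apart -> smaller
```

A kernel gives the dot product in the feature space S′ without ever computing φ explicitly, which is why the mapping in the diagram is computationally cheap.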

How many genes should I use?

- Number of genes used can affect accuracy of classification

[Plot: number of misclassified samples vs. number of genes used in the model]

From Khan et al. (2001) Nature Medicine 7, 673-679

Which Genes Should I Use?

1) Genes with highest discriminative power? (maybe)

2) Some combinations of genes have good discriminative power even though the individual genes do not

3) See Xiong et al. (2001) Genome Research 11, 1878-1887 for the “sequential floating forward selection” algorithm