Nov 7, 2013

1

Machine Learning for Functional Genomics I

Matt Hibbs


http://cbfg.jax.org

2

Central Dogma

Gene Expression

DNA

Proteins

Phenotypes

3

Functional Genomics


Identify the roles played by genes/proteins

Sealfon et al., 2006.

4

Gene Expression Microarrays

Simultaneous measurements of mRNA abundance levels for every gene in a genome




Genes

Conditions

5

Gene Expression Microarrays

Simultaneous measurements of mRNA abundance levels for every gene in a genome, in thousands of conditions

These data hold rich functional information, but how can we utilize the entire compendium?

6

Biological Data Explosion

Huge repositories of biological data…

…are not directly translating into knowledge

[Figure: three plots over the years 2003 to 2007. "Mouse genes with known process association" (# of genes); "Publicly available microarrays in GEO" (# of measurements); "S. cerevisiae genes with known function" (# of genes).]

7

Why is there a Data-Knowledge Gap?


Many datasets are analyzed only once


Initial publication looks for hypothesis


Need standards for naming, formats, collection


Data should be aggregated and integrated


Modestly significant clues seen repeatedly can become convincing


“a preponderance of circumstantial evidence”


Scale of this problem overwhelms traditional biology

8

Scalable Artificial Intelligence


Computer science is really a study in scalability


Use machine learning and data mining techniques to quickly identify important patterns

9

Amazon Recommendations

10

Amazon Recommendations

Purchase History

Item Rankings

Recommendations

Machine Learning

(Bayesian networks)


Compare your purchase history to all other customers

Find commonalities between profiles

Predict potential purchases

Observe Browsing Patterns and Account Activity

11

Gene Function Prediction

Purchase History

Item Rankings

Recommendations

Observe Browsing Patterns and Account Activity

Machine Learning

(Bayesian networks)

Genome Scale Data

MGI Annotations

Predictions

Laboratory Experiments

Machine Learning

(Bayesian networks)



12

Challenges for AI from Biology


Input data is noisy, heterogeneous, and constantly evolving

Current knowledge is incomplete and biased

Can be difficult to determine accuracy

13

Promise of Computational Functional Genomics

Data & Existing Knowledge

Computational Approaches

Predictions

Laboratory Experiments

14

Reality of Computational Functional Genomics

Data & Existing Knowledge

Computational Approaches

Predictions

Laboratory Experiments

15

Computational Solutions


Machine learning & data mining


Use existing data to make new predictions


Similarity search algorithms


Bayesian networks


Support vector machines


etc.


Validate predictions with follow-up lab work



Visualization & exploratory analysis


Seeing and interacting with data important


Show data so that questions can be answered


Scalability, incorporate statistics, etc.

17

Similarity Search Approach


Re-frame analysis as exploratory search

Data Collection

Query Genes

Search Algorithm (SPELL)

Relevant Datasets

Related Genes

18


Key Insights

Context-Sensitive Search Process

Signal Balancing

$X = U \Sigma V^{t}$

Correlation Comparability

20

Dataset relevance weighting

Calculate a correlation measure among the query genes within each dataset; this is each dataset's weight

Dataset weights: 0.15, 0.82, 0.05, 0.55

Query Genes: Q = {YQG1, YQG2, YQG3}
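
The weighting step can be sketched as follows. This is a minimal illustration, not the published implementation: it assumes each dataset is a genes × conditions NumPy array with rows in the same gene order, and it assumes the "correlation measure" is the mean pairwise Pearson correlation among the query genes, clipped at zero. The function name `dataset_weights` and the toy data are hypothetical.

```python
from itertools import combinations

import numpy as np


def dataset_weights(datasets, query_idx):
    """Weight each dataset by how strongly the query genes co-vary in it.

    datasets  : list of 2-D arrays (genes x conditions), same gene order
    query_idx : row indices of the query genes (at least two)
    Assumption: weight = mean pairwise Pearson correlation among the
    query genes, with negative means clipped to zero.
    """
    weights = []
    for data in datasets:
        corr = np.corrcoef(data[query_idx])  # query x query correlations
        pairs = [corr[i, j] for i, j in combinations(range(len(query_idx)), 2)]
        weights.append(max(0.0, float(np.mean(pairs))))
    return weights


rng = np.random.default_rng(0)
shared = rng.normal(size=20)

# Dataset 0: the query genes (rows 0-2) follow a shared pattern.
coherent = rng.normal(size=(6, 20)) * 0.1
coherent[:3] += shared
# Dataset 1: pure noise, so the query genes are not coherent.
noisy = rng.normal(size=(6, 20))

w = dataset_weights([coherent, noisy], [0, 1, 2])
print(w)
```

A dataset in which the query genes move together receives a weight near 1, while a dataset in which they are unrelated receives a weight near 0, which mirrors the spread of weights shown on the slide.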

22

Identify Novel Partners

Dataset weights: 0.15, 0.82, 0.05, 0.55

Query Genes: Q = {YQG1, YQG2, YQG3}

Calculate weighted distance score for all other genes to the query set

Other genes (geneA, geneB, geneC) are ordered from best score to worst score

+ Takes advantage of functional diversity

+ Addresses statistical concerns

+ Fast running times [O(GDQ^2)] (ms per query)


+ Top results are candidates for investigation

+ Search process is iterative to refine results
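
A rough sketch of the ranking step, under stated assumptions: the slide's "weighted distance score" is taken here to be a weighted average, over datasets, of each gene's mean Pearson correlation to the query genes, and the weights are assumed to be positive and precomputed. The function `rank_genes` and the toy data are hypothetical and stand in for the real scoring.

```python
import numpy as np


def rank_genes(datasets, weights, query_idx):
    """Rank non-query genes by weighted average correlation to the query set."""
    n_genes = datasets[0].shape[0]
    total = np.zeros(n_genes)
    for data, w in zip(datasets, weights):
        corr = np.corrcoef(data)                      # gene x gene correlations
        total += w * corr[:, query_idx].mean(axis=1)  # mean correlation to the query genes
    scores = total / sum(weights)
    query = set(query_idx)
    candidates = [g for g in range(n_genes) if g not in query]
    return sorted(candidates, key=lambda g: scores[g], reverse=True)


t = np.linspace(0, 2 * np.pi, 12)
informative = np.vstack([
    np.sin(t),                         # gene 0 (query)
    np.sin(t) + 0.1 * np.cos(3 * t),   # gene 1 (query)
    np.sin(t) + 0.2 * np.cos(5 * t),   # gene 2: follows the query
    np.cos(t),                         # gene 3: unrelated
    -np.sin(t),                        # gene 4: anti-correlated
])
uninformative = informative[::-1].copy()  # same profiles assigned to the wrong genes

ranking = rank_genes([informative, uninformative], [0.82, 0.05], [0, 1])
print(ranking)
```

Because the second dataset carries a small weight, its misleading correlations barely move the final order; the top of the list is the candidate for follow-up, and adding it to the query gives the iterative refinement mentioned above.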

24


Signal Balancing Data - SVD

Singular Value Decomposition (SVD)

Projects data into another orthonormal basis

Correlations in U (rather than X) often contain better biological signals

25

Signal Balancing

SVD

Signal Balancing

26

Signal Balancing


Use correlations among left singular vectors

Downweights dominant patterns, amplifies subtle patterns

Top eigengenes dominate data

Sometimes correspond to systematic bias

Often correspond to common biological processes

e.g., ribosome biogenesis

Accuracy of signal balancing improved over re-projection
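
As an illustration of the idea, the sketch below factors a genes × conditions matrix as X = U Σ V^t with NumPy and computes gene-gene correlations from the rows of U instead of the rows of X. Dropping Σ removes the scaling that lets the top eigengenes dominate. The function name `signal_balanced_correlations` and the simulated data are assumptions made for the example; missing-value handling and other preprocessing are omitted.

```python
import numpy as np


def signal_balanced_correlations(X):
    """Gene-gene correlations computed in the left singular vector space.

    X is genes x conditions. The thin SVD gives X = U @ diag(s) @ Vt;
    each row of U holds one gene's coordinates with the singular values
    removed, so dominant and subtle patterns are on an equal footing.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return np.corrcoef(U), (U, s, Vt)


rng = np.random.default_rng(1)
dominant = rng.normal(size=10)                    # one pattern shared by every gene
X = rng.normal(size=(30, 10)) + 5.0 * dominant    # 30 genes, 10 conditions

balanced, (U, s, Vt) = signal_balanced_correlations(X)
raw = np.corrcoef(X)

print("largest singular value share:", s[0] / s.sum())
print("mean raw correlation:", raw[np.triu_indices(30, k=1)].mean())
print("mean balanced correlation:", balanced[np.triu_indices(30, k=1)].mean())
```

In the raw data nearly every gene pair looks correlated because of the shared pattern; comparing the two printed means gives a feel for how much of that is removed once the singular values are set aside.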

28

Between-dataset normalization


Commonly used Pearson correlation yields greatly different distributions of correlation

These differences complicate comparisons



DeRisi et al., 97

Primig et al., 00

Histograms of Pearson correlations between all pairs of genes

29


Between-dataset normalization

Fisher Z-transform, Z-score equalizes distributions

Increases comparability between datasets

Histograms of Z-scores between all pairs of genes
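
A small sketch of this normalization, assuming it is applied per dataset to the Pearson correlation of every gene pair: the Fisher transform z = arctanh(r) is followed by standardizing to mean 0 and standard deviation 1 within the dataset. The function name `fisher_z_scores`, the clipping threshold, and the simulated datasets are illustrative choices, not taken from the slides.

```python
import numpy as np


def fisher_z_scores(data):
    """Pearson correlations -> Fisher z -> z-scores within one dataset."""
    corr = np.corrcoef(data)
    upper = np.triu_indices_from(corr, k=1)        # each gene pair once
    r = np.clip(corr[upper], -0.999999, 0.999999)  # avoid infinities at |r| = 1
    z = np.arctanh(r)                              # Fisher transform: 0.5 * ln((1 + r) / (1 - r))
    return (z - z.mean()) / z.std()


rng = np.random.default_rng(2)
few_conditions = rng.normal(size=(8, 5))     # short dataset: wide spread of correlations
many_conditions = rng.normal(size=(8, 60))   # long dataset: correlations cluster near zero

z_few = fisher_z_scores(few_conditions)
z_many = fisher_z_scores(many_conditions)
print(z_few.std(), z_many.std())
```

After the transform both datasets have the same center and spread, so a score of 2 means "two standard deviations above the typical pair" in either one.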

30

SPELL Algorithm Overview

Hibbs MA, Hess DC, Myers CL, Huttenhower C, Li K, Troyanskaya OG. Exploring the functional landscape of gene expression: directed search of large microarray compendia. Bioinformatics, 2007.

31

Web Interface

http://spell.princeton.edu

32

Evaluation of Performance


Leave-k-in cross validation / bootstrapping

Results averaged across 125 diverse GO biological process terms (defined in the GRIFn system, Myers et al., 2006)

Many predictions also verified through experimental validations in other studies

Hibbs et al., Bioinf, 2007

Hess et al., PLoS Gen, 2009

Hibbs*, Myers*, Huttenhower*, et al., PLoS Comp Biol, 2009

33

Search Accuracy

Perform “leave-k-in” cross-validation

Genes with common function

For all pairs

Order Genome

Rank Average

Master List
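
One way to read this pipeline is sketched below, with k = 2: every pair of genes from a functional group is used as a query, each query orders the whole genome, and the per-query ranks are averaged into one master list. This is an interpretation of the diagram, not a confirmed description of the evaluation; `leave_k_in_master_list`, `toy_search`, and the toy profiles are hypothetical, and the sketch does not exclude query genes from their own result lists.

```python
from itertools import combinations


def leave_k_in_master_list(genome, term_genes, search, k=2):
    """Rank-average search results over every k-gene query drawn from a term.

    search(query) must return every gene in `genome`, ordered best to worst;
    it is a placeholder for whatever search method is being evaluated.
    """
    rank_sums = {gene: 0 for gene in genome}
    n_queries = 0
    for query in combinations(term_genes, k):
        for rank, gene in enumerate(search(list(query))):
            rank_sums[gene] += rank
        n_queries += 1
    return sorted(genome, key=lambda gene: rank_sums[gene] / n_queries)


# Toy stand-in for expression data: one number per gene.
profile = {"a": 1.0, "b": 1.1, "c": 0.9, "x": 5.0, "y": -4.0}


def toy_search(query):
    """Order all genes by distance to the mean profile of the query genes."""
    centre = sum(profile[g] for g in query) / len(query)
    return sorted(profile, key=lambda g: abs(profile[g] - centre))


master = leave_k_in_master_list(list(profile), ["a", "b", "c"], toy_search, k=2)
print(master)
```

The master list can then be scored against the known members of the functional group to measure how well the search recovers them.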

34

Search Accuracy


Precision-Recall Curve

Master List

$\text{Precision} = \frac{TP}{TP + FP}$

$\text{Recall} = \frac{TP}{TP + FN}$

[Plot: precision against recall, both axes running from 0 to 1]
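
To make the two ratios concrete, the sketch below walks down a ranked list and records precision and recall after each position. The gene names are made up, and the function assumes a non-empty set of known positives.

```python
def precision_recall_curve(ranked, positives):
    """Return (precision, recall) after each position of a ranked list."""
    positives = set(positives)
    tp = 0
    points = []
    for n, gene in enumerate(ranked, start=1):
        if gene in positives:
            tp += 1
        fp = n - tp               # retrieved so far but not a known positive
        fn = len(positives) - tp  # known positives not yet retrieved
        points.append((tp / (tp + fp), tp / (tp + fn)))
    return points


points = precision_recall_curve(["g1", "g7", "g2", "g9"], {"g1", "g2"})
for precision, recall in points:
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision starts at 1.0 when the first hit is correct and falls as false positives accumulate, while recall only ever rises; plotting the pairs gives the curve.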

35

Accuracy of Context-Sensitive Search

36

Sample & Query Size Effects

Even relatively small sample sizes produce similar results (1000 samples used for all other tests)

Significant performance gain between 2 and 3 query genes, little change beyond (5 query genes used for all other tests)

37

Effect of Signal Balancing

Signal balancing further improves context-specific search performance

Improvement is robust to missing value imputation method

38

Effects of Signal Balancing

n% re-projection

n% balanced

signal balanced

39

Effects of Signal Balancing

n% re-projection

n% balanced

40

Specific Performance

42


Function Prediction Evaluation

Cross-validation based on known biology

Most often used method in literature

Results are useful, but can be biased

Laboratory evaluation

More accurate, more difficult

Ultimate goal of functional genomics

Identify novel biology

Publish biological corpus

Huttenhower C*, Hibbs MA*, Myers CL* et al. The impact of incomplete knowledge on gene function prediction. Bioinformatics, 2009.

45

Petite Frequency Assay

46

Petite Frequency Phenotypes for Predictions

47

Overall Result Summary

48

Double mutant petite freq.

49

Mitochondrial Motility

50

Respiratory Growth Rate

51

Biological Benefits of Computational Direction


Effective candidate prioritization

6 months of work vs. 8 years for whole genome screen

“Unbiased” (actually, just less biased)

Both uncharacterized genes and genes with known function predicted and verified

40 of 75 (53%) for genes with known function

60 of 118 (51%) for uncharacterized genes

Testing only mitochondrial localized proteins would miss 43% of our discoveries

59% accuracy among mitochondria localized

44% accuracy among non-mitochondria localized


52

Computational Expectations

[Figure: two panels, "Original Gold Standard" and "Experimental Results", each plotting Precision [TP/(TP+FP)] (0 to 1) against # of TPs (1 to 1000) for Combined, MEFIT, bioPIXIE, and SPELL]

53

Complementary Computational Approaches

54

Computational Reality

[Figure: two panels, "Original Gold Standard" and "Experimental Results", each plotting Precision [TP/(TP+FP)] (0 to 1) against # of TPs (1 to 1000) for Combined, MEFIT, bioPIXIE, and SPELL]

55

Method Comparison

Input Data: Microarrays Only | Microarrays Only | Diverse Data

Algorithmic Approach: Context-specific search | Bayesian integration | Bayesian integration

Details: heavily cross-validated, only pos. correlation, uses signal balanced data | naïve Bayes inference after training, pairwise correlations binned | naïve Bayes inference after training, each data type converted to pairwise scores
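
The "naïve Bayes inference after training, pairwise correlations binned" entry can be illustrated with a generic sketch. It assumes each gene pair has already been assigned a correlation bin in every dataset and that a gold standard labels training pairs as related (1) or unrelated (0). The function names, the add-one smoothing, and the toy training data are assumptions for the example and are not taken from either Bayesian system compared here.

```python
import numpy as np


def train_binned_naive_bayes(binned, labels, n_bins):
    """Learn P(bin | class) for each dataset from labeled gene pairs.

    binned : (pairs x datasets) integer bin index of each pair's correlation
    labels : 1 for functionally related pairs, 0 for unrelated pairs
    """
    binned, labels = np.asarray(binned), np.asarray(labels)
    prior = labels.mean()
    tables = {}
    for cls in (0, 1):
        rows = binned[labels == cls]
        counts = np.array([np.bincount(rows[:, d], minlength=n_bins)
                           for d in range(binned.shape[1])])
        # tables[cls][d, b] = P(bin b in dataset d | class), add-one smoothed
        tables[cls] = (counts + 1) / (counts.sum(axis=1, keepdims=True) + n_bins)
    return prior, tables


def posterior_related(pair_bins, prior, tables):
    """Posterior probability that a pair is related, datasets treated as independent."""
    like_related = prior * np.prod([tables[1][d, b] for d, b in enumerate(pair_bins)])
    like_unrelated = (1 - prior) * np.prod([tables[0][d, b] for d, b in enumerate(pair_bins)])
    return like_related / (like_related + like_unrelated)


# Four training pairs, two datasets, three bins (0 = low, 2 = high correlation).
prior, tables = train_binned_naive_bayes(
    [[2, 2], [2, 1], [0, 0], [0, 1]], [1, 1, 0, 0], n_bins=3)
p_high = posterior_related([2, 2], prior, tables)
p_low = posterior_related([0, 0], prior, tables)
print(p_high, p_low)
```

A pair that is highly correlated in both datasets receives a high posterior, and a pair with low correlation in both receives a low one; the independence assumption across datasets is what makes the model "naïve".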

56

Method Accuracy is Biologically Diverse

57

Underlying Data Changes Predictions

58

Methods Converge During Iteration

59

Computational Lessons


Underlying data and choice of algorithm are both important

Data affects which biological areas can be studied

Algorithm affects biological context, nature of results

Possible for many combinations to be accurate

Utilizing an ensemble of methods broadens scope and reliability

Iteration in an ensemble can lead to converging predictions

Evaluating the results of computational prediction methods is not as simple as recapitulating GO

60

Conclusions


Microarray search system (& Bayesian data integration) produces good predictions of gene function

Experimental verification of predictions is important

109 novel gene functions discovered

Subtle phenotypes important to consider

Big challenge: Make this work in mammals

61

Acknowledgements


Hibbs Lab


Karen Dowell


Tongjun Gu


Al Simons


Olga Troyanskaya Lab


Patrick Bradley


Maria Chikina


Yuanfang Guan


Chad Myers


David Hess


Florian Markowetz


Edo Airoldi


Curtis Huttenhower




Kai Li Lab


Grant Wallace



Amy Caudy


Maitreya Dunham


Botstein, Kruglyak, Broach, Rose labs


Kyuson Yun


Carol Bult