Large scale genomic data integration for functional genomics and metagenomics

Curtis Huttenhower

05-21-10

Harvard School of Public Health
Department of Biostatistics

Greatest Biological Discoveries?


Are We There Yet?

How much biology is out there?
How much have we found?
How fast are we finding it?

[Figure panels: Human Proteins with Annotated Biological Roles; Age-Adjusted Citation Rates for Major Sequencing Projects; Species Diversity of Environmental Samples. Axis label: # Distinct Roles. Credits: Fierer 2008; Matt Hibbs.]

Are We There Yet?

How much biology is out there? Lots!
How much have we found? Not nearly all
How fast are we finding it? Not fast enough

[Figure panels: Human Proteins with Annotated Biological Roles; Age-Adjusted Cost per Citation for Major Sequencing Projects; Species Diversity of Environmental Samples (Fierer 2008)]

Our job is to create computational microscopes: to ask and answer specific biomedical questions using millions of experimental results.

Outline

1. Data mining: Algorithms for integrating very large data compendia
2. Metagenomics: Network models of microbial communities

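A framework for functional genomics

[Figure: integration schematic. Gene pairs with known (+/-) or unknown (?) functional relationships are scored in each dataset; columns are listed in slide order.]

Gene pair   Relationship   Score 1   Score 2   Score 3
G1-G2       +              0.9       +         0.8
G4-G9       +              0.7       -         0.5
G3-G6       -              0.1       -         0.05
G7-G8       -              0.2       -         0.1
G2-G5       ?              0.8       +         0.6

[Figure: per-dataset frequency histograms: High Correlation / Low Correlation; Coloc. / Not coloc.; Similar / Dissim. Legend: High Similarity, Low Similarity, High Correlation, Low Correlation.]

P(G2-G5 | Data) = 0.85

Scale: 100Ms of gene pairs by 1Ks of datasets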
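As a rough illustration of how per-dataset evidence could be combined into a probability such as P(G2-G5 | Data), here is a minimal sketch that assumes a naive Bayes classifier over binarized evidence. The slide does not state the exact model or its parameters, so the 0.5 cutoffs, the add-one smoothing, and the function names below are all assumptions made for illustration; only the toy scores come from the figure.

```python
def train_naive_bayes(examples, labels):
    """Estimate class priors and add-one-smoothed likelihoods for binary features."""
    classes = sorted(set(labels))
    n_features = len(examples[0])
    priors = {c: labels.count(c) / len(labels) for c in classes}
    likelihoods = {}
    for c in classes:
        rows = [x for x, y in zip(examples, labels) if y == c]
        # P(feature j is "on" | class c), smoothed so no probability is exactly 0 or 1.
        likelihoods[c] = [
            (sum(r[j] for r in rows) + 1) / (len(rows) + 2)
            for j in range(n_features)
        ]
    return priors, likelihoods


def posterior(priors, likelihoods, x, target=True):
    """P(target class | binary evidence x) under the naive independence assumption."""
    joint = {}
    for c, prior in priors.items():
        p = prior
        for j, on in enumerate(x):
            p_on = likelihoods[c][j]
            p *= p_on if on else 1.0 - p_on
        joint[c] = p
    return joint[target] / sum(joint.values())


# Toy evidence from the figure, binarized with an ASSUMED cutoff of >= 0.5:
# (high correlation?, colocalized?, similar?) -> known functional relationship?
training = [
    ((True, True, True), True),      # G1-G2: 0.9, +, 0.8
    ((True, False, True), True),     # G4-G9: 0.7, -, 0.5
    ((False, False, False), False),  # G3-G6: 0.1, -, 0.05
    ((False, False, False), False),  # G7-G8: 0.2, -, 0.1
]
examples = [x for x, _ in training]
labels = [y for _, y in training]
priors, likelihoods = train_naive_bayes(examples, labels)

# G2-G5: 0.8, +, 0.6 -> all three lines of evidence are "on".
p_g2_g5 = posterior(priors, likelihoods, (True, True, True))
print(round(p_g2_g5, 3))
```

With only four training pairs the posterior comes out near 0.95 rather than the 0.85 shown on the slide, which is expected: the real system learns its parameters from far more gene pairs and datasets than this toy example.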

Functional network prediction and analysis

[Figure: example networks. Global interaction network; Metabolism network; Signaling network; Gut community network]

HEFalMp currently includes data from 30,000 human experimental results (15,000 expression conditions + 15,000 diverse others), analyzed for 200 biological functions and 150 diseases.

HEFalMp: Predicting human gene function

HEFalMp: Predicting human genetic interactions

HEFalMp: Analyzing human genomic data

HEFalMp: Understanding human disease

Meta-analysis for unsupervised functional data integration

Evangelou 2007; Huttenhower 2006; Hibbs 2007

$$z' = \frac{1}{2}\log\frac{1+\rho'}{1-\rho'}$$

$$w^*_{e,i} = \frac{1}{s^2_{e,i} + \hat{\Delta}^2_e}$$

[Equations: the per-dataset values $y_{e,i}$ are combined across datasets $i$ using the weights $w^*_{e,i}$.]

Simple regression: All datasets are equally accurate
Random effects: Variation within and among datasets and interactions
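The contrast between treating all datasets as equally accurate and modeling variation within and among datasets can be sketched as follows. This is a minimal example that assumes the standard DerSimonian-Laird random-effects estimator and the usual 1/(n - 3) variance for a Fisher-transformed correlation; the slide does not confirm that exactly this estimator is used, and the function names and sample sizes are hypothetical.

```python
import math


def fisher_z(rho):
    """Fisher z-transform of a correlation coefficient (requires -1 < rho < 1)."""
    return 0.5 * math.log((1.0 + rho) / (1.0 - rho))


def random_effects(effects, variances):
    """Pool per-dataset effects with DerSimonian-Laird random-effects weights.

    Returns (pooled_effect, among_dataset_variance).
    """
    k = len(effects)
    w = [1.0 / v for v in variances]  # fixed-effect (within-dataset) weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    # Cochran's Q measures how much the datasets disagree with the fixed estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi * wi for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random-effects weights add the among-dataset variance to each dataset's own variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    return pooled, tau2


# One gene pair observed in three datasets (correlations and sample sizes are made up).
correlations = [0.9, 0.7, 0.8]
sample_sizes = [20, 50, 30]
zs = [fisher_z(r) for r in correlations]
variances = [1.0 / (n - 3) for n in sample_sizes]
pooled_z, tau2 = random_effects(zs, variances)
print(pooled_z, tau2)
```

When the datasets agree, the among-dataset variance is estimated as zero and the result reduces to ordinary inverse-variance weighting; when they disagree, large datasets are down-weighted relative to the fixed-effect case.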

Meta-analysis for unsupervised functional data integration

Evangelou 2007; Huttenhower 2006; Hibbs 2007

$$z' = \frac{1}{2}\log\frac{1+\rho'}{1-\rho'}$$

Following up with semi-supervised approach

Functional mapping: mining integrated networks

[Figure: predicted relationships between genes in Chemotaxis; edge color scale from High Confidence to Low Confidence]

The strength of these relationships indicates how cohesive a process is.


Functional mapping: mining integrated networks

[Figure: predicted relationships between genes in Chemotaxis and Flagellar assembly; edge color scale from High Confidence to Low Confidence]

The strength of these relationships indicates how associated two processes are.
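One simple way to turn predicted gene-gene relationships into process-level scores is to compare average edge confidence against the genome-wide background. The sketch below assumes cohesiveness is the mean within-process edge weight divided by the background mean, and association is the mean between-process edge weight divided by the same background. The slides do not give the actual statistic, so these ratios, the function names, and the gene names and weights are all hypothetical.

```python
from itertools import combinations


def mean_weight(pairs, weights):
    """Average confidence over the given gene pairs that have a predicted edge."""
    vals = [weights[frozenset(p)] for p in pairs if frozenset(p) in weights]
    return sum(vals) / len(vals) if vals else 0.0


def cohesiveness(process, weights, baseline):
    """Within-process mean edge weight relative to the background (assumed ratio)."""
    return mean_weight(combinations(sorted(process), 2), weights) / baseline


def association(proc_a, proc_b, weights, baseline):
    """Between-process mean edge weight relative to the background (assumed ratio)."""
    between = [(a, b) for a in proc_a for b in proc_b if a != b]
    return mean_weight(between, weights) / baseline


# Hypothetical network: g1, g2 in chemotaxis; g3, g4 in flagellar assembly; g5 unrelated.
edges = {
    ("g1", "g2"): 0.9, ("g3", "g4"): 0.8,
    ("g1", "g3"): 0.6, ("g1", "g4"): 0.5, ("g2", "g3"): 0.7, ("g2", "g4"): 0.6,
    ("g1", "g5"): 0.1, ("g2", "g5"): 0.1, ("g3", "g5"): 0.1, ("g4", "g5"): 0.1,
}
weights = {frozenset(pair): w for pair, w in edges.items()}
baseline = sum(weights.values()) / len(weights)  # genomic background

chemotaxis = {"g1", "g2"}
flagellar = {"g3", "g4"}
coh = cohesiveness(chemotaxis, weights, baseline)
assoc = association(chemotaxis, flagellar, weights, baseline)
print(coh, assoc)
```

Values above 1 indicate a process (or pair of processes) that is more tightly connected than the background, which matches the baseline-relative legend used on the following slides.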

Functional Mapping: Functional Associations Between Processes

Legend:
Edges: Associations between processes (Very Strong to Moderately Strong)

Processes shown: Hydrogen Transport; Electron Transport; Cellular Respiration; Protein Processing; Peptide Metabolism; Cell Redox Homeostasis; Aldehyde Metabolism; Energy Reserve Metabolism; Vacuolar Protein Catabolism; Negative Regulation of Protein Metabolism; Organelle Fusion; Protein Depolymerization; Organelle Inheritance

Functional Mapping: Functional Associations Between Processes

Legend:
Edges: Associations between processes (Very Strong to Moderately Strong)
Borders: Data coverage of processes (Well Covered to Sparsely Covered)

Functional Mapping: Functional Associations Between Processes

Legend:
Edges: Associations between processes (Very Strong to Moderately Strong)
Nodes: Cohesiveness of processes (Below Baseline; Baseline (genomic background); Very Cohesive)
Borders: Data coverage of processes (Well Covered to Sparsely Covered)


Functional Maps: Focused Data Summarization

[Figure: raw genomic sequence data]

Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?

Functional Maps: Focused Data Summarization

[Figure: raw genomic sequence data]

How can a biologist take advantage of all this data to study his/her favorite gene/pathway/disease without losing information?

Functional mapping
Very large collections of genomic data
Specific predicted molecular interactions
Pathway, process, or disease associations
Underlying experimental results and functional activities in data

Outline

1. Data mining: Algorithms for integrating very large data compendia
2. Metagenomics: Network models of microbial communities

Microbial Communities and Functional Metagenomics

Metagenomics: data analysis from environmental samples
Microflora: environment includes us!
Pathogen collections of “single” organisms form similar communities
Another data integration problem
Must include datasets from multiple organisms
What questions can we answer?
What pathways/processes are present/over-/under-enriched in a newly sequenced microbe/community?
What’s shared within community X? What’s different? What’s unique?
How do human microflora interact with diabetes, obesity, oral health, antibiotics, aging, …
Current functional methods annotate ~50% of synthetic data, <5% of environmental data

With Jacques Izard, Wendy Garrett
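The question of whether a pathway is over- or under-enriched in a newly sequenced microbe or community is commonly posed as a hypergeometric test. The sketch below assumes that framing; the slides do not say which test is actually used, and the function names and gene counts are hypothetical.

```python
from math import comb


def enrichment_pvalue(k, N, K, n):
    """P(X >= k): chance of seeing at least k pathway genes among n sampled genes.

    N: genes in the background, K: background genes in the pathway,
    n: genes observed in the sample, k: observed genes that fall in the pathway.
    """
    total = comb(N, n)
    # math.comb returns 0 when the lower argument exceeds the upper, so
    # impossible overlaps contribute nothing to the sum.
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1)) / total


def depletion_pvalue(k, N, K, n):
    """P(X <= k): chance of seeing at most k pathway genes (under-enrichment)."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(0, k + 1)) / total


# Hypothetical counts: 4000 background genes, 40 in the pathway,
# 200 genes detected in the sample, 8 of which belong to the pathway.
p_over = enrichment_pvalue(8, 4000, 40, 200)
p_under = depletion_pvalue(8, 4000, 40, 200)
print(p_over, p_under)
```

In practice one such test is run per pathway, so the resulting p-values would need multiple-testing correction before being interpreted.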

Data Integration for Microbial Communities

~300 available expression datasets, ~30 species

Weskamp et al 2004; Flannick et al 2006; Kanehisa et al 2008; Tatusov et al 1997

Data integration works just as well in microbes as it does in yeast and humans
We know an awful lot about some microorganisms and almost nothing about others
Sequence-based and network-based tools for function transfer both work in isolation
We can use data integration to leverage both and mine out additional biology

Functional network prediction from diverse microbial data

486 bacterial expression experiments → 876 raw datasets → 310 postprocessed datasets → 304 normalized coexpression networks in 27 species
307 bacterial interaction experiments → 154796 raw interactions → 114786 postprocessed interactions
Result: Integrated functional interaction networks in 15 species

[Figure: E. coli integration; axes labeled Precision and Recall]
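The precision and recall axes summarize how well predicted interactions recover a set of known ones as the confidence cutoff varies. A minimal sketch of that evaluation follows; the gold standard, scores, and function name are hypothetical, and the slides do not specify how the evaluation was actually run.

```python
def precision_recall(scores, positives, threshold):
    """Precision and recall of the edges scoring at or above the threshold."""
    predicted = {pair for pair, s in scores.items() if s >= threshold}
    true_pos = len(predicted & positives)
    precision = true_pos / len(predicted) if predicted else 1.0
    recall = true_pos / len(positives) if positives else 0.0
    return precision, recall


# Hypothetical predicted confidences and known functional pairs.
scores = {
    frozenset(("a", "b")): 0.9,
    frozenset(("a", "c")): 0.7,
    frozenset(("b", "c")): 0.4,
    frozenset(("c", "d")): 0.2,
}
gold_standard = {frozenset(("a", "b")), frozenset(("b", "c"))}

# Sweeping the threshold downward trades precision for recall.
curve = [(t, *precision_recall(scores, gold_standard, t)) for t in (0.8, 0.5, 0.3)]
for threshold, precision, recall in curve:
    print(threshold, precision, recall)
```

A strict cutoff keeps only the most confident edges (high precision, low recall), while a permissive cutoff recovers more known pairs at the cost of more false positives.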

Functional maps for cross-species knowledge transfer

[Figure: genes G1-G17 mapped into groups O1-O9 (O1: G1, G2, G3; O2: G4; O3: G6; ...). Groups contain genes from multiple species, labeled ECG1, ECG2; BSG1; ECG3, BSG2.]
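One way to picture cross-species transfer is to project predicted gene-gene edges from a source species onto a target species through shared gene groups. The sketch below assumes that simple projection; it is not the method from the talk, whose details are not given on the slide. The group identifiers OA and OB, the function name, and the edge score are hypothetical, while the gene labels reuse those in the figure.

```python
def transfer_edges(source_edges, gene_to_group, target_genes_by_group):
    """Project scored source-species edges onto target-species genes via shared groups."""
    transferred = {}
    for (g1, g2), score in source_edges.items():
        group1, group2 = gene_to_group.get(g1), gene_to_group.get(g2)
        # Skip genes without a group, and pairs that fall into the same group.
        if group1 is None or group2 is None or group1 == group2:
            continue
        for t1 in target_genes_by_group.get(group1, []):
            for t2 in target_genes_by_group.get(group2, []):
                key = frozenset((t1, t2))
                # Keep the strongest supporting edge when several map to the same pair.
                transferred[key] = max(score, transferred.get(key, 0.0))
    return transferred


# Hypothetical grouping: ECG1 and BSG1 share group OA; ECG3 and BSG2 share group OB.
gene_to_group = {"ECG1": "OA", "ECG3": "OB"}
target_genes_by_group = {"OA": ["BSG1"], "OB": ["BSG2"]}
source_edges = {("ECG1", "ECG3"): 0.8, ("ECG1", "ECG2"): 0.6}

predicted = transfer_edges(source_edges, gene_to_group, target_genes_by_group)
print(predicted)
```

Edges involving genes with no group in the target species, such as ECG2 here, are dropped, which is one reason transferred networks tend to favor precision over recall.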



Functional maps for cross-species knowledge transfer

[Figure: axes labeled Precision and Recall]

Following up with unsupervised and partially anchored network alignment

Functional maps for functional metagenomics

[Figure: GOS 4441599.3, Hypersaline Lagoon, Ecuador. Labels: KEGG Pathways; Organisms; Pathogens; Env.]

Mapping genes into pathways
Mapping pathways into organisms
+ Integrated functional interaction networks in 27 species
Mapping organisms into phyla

Functional maps for functional metagenomics

Legend:
Nodes: Process cohesiveness in obesity (Very Downregulated; Baseline (no change); Very Upregulated)
Edges: Process association in obesity (More Coregulated; Less Coregulated; Baseline (no change))


Efficient Computation For Biological Discovery

Massive datasets and genomes require efficient algorithms and implementations.

Sleipnir C++ library for computational functional genomics
Data types for biological entities: microarray data, interaction data, genes and gene sets, functional catalogs, etc.
Network communication, parallelization
Efficient machine learning algorithms: generative (Bayesian) and discriminative (SVM)
And it’s fully documented!

It’s also speedy: microbial data integration computation takes <3 hrs.

Outline

1. Data mining: Algorithms for integrating very large data compendia
2. Metagenomics: Network models of microbial communities

Bayesian and unsupervised methods for data integration
HEFalMp system for human data analysis and integration
Functional mapping to statistically summarize large data collections
Integration for microbial communities and metagenomics
Accurate cross-species interactome transfer
Sleipnir software for efficient large scale data mining

Thanks!

http://function.princeton.edu/hefalmp

http://huttenhower.sph.harvard.edu/sleipnir

Olga Troyanskaya
Chris Park
David Hess
Matt Hibbs
Chad Myers
Ana Pop
Aaron Wong
Hilary Coller
Erin Haley
Jacques Izard
Wendy Garrett
Sarah Fortune
Tracy Rosebrock