
Nov 7, 2013


SAMSI 2014-2015 Program

Beyond Bioinformatics: Statistical and
Mathematical Challenges


Topic: Data Integration

Katerina Kechris, PhD

Associate Professor

Biostatistics and Informatics

Colorado School of Public Health

University of Colorado Denver

Omics


Large-scale analyses for studying a population of molecules or molecular mechanisms

High-throughput data


Examples


Genomics (entire genome: DNA)

Proteomics (study of the protein repertoire)

Epigenomics (study of DNA and histone modifications)


Omics

[Figure: omics overview; labels include Epigenome and Phenome. Adapted from http://www.sciencebasedmedicine.org]

Image sources:
http://www.scientificpsychic.com/fitness/transcription.gif
http://themedicalbiochemistrypage.org/images/hemoglobin.jpg
http://upload.wikimedia.org/wikipedia/commons/c/c6/Clopidogrel_active_metabolite.png
http://creatia2013.files.wordpress.com/2013/03/dna.gif


Large-scale Projects & Databases






NCI 60 Database

Integration of Omics Data


Each type of data gives a different snapshot of the biological or disease system


Why integrate data?


Reduce false positives/negatives

Identify interactions between different molecules

Explore functional mechanisms


Challenges

1. When to integrate?

2. Dimensionality

3. Resolution

4. Heterogeneity

5. Interactions and Pathways

Challenge 1: When to integrate?


Early


Merging data to increase sample size


Intermediate


Convert different data sources into a common format (e.g., ranks, correlation matrices), kernel-based analysis

Late

Meta-analysis (combine effect sizes or p-values), aggregate voting for classifiers, genomic enrichment and overlap of significant results
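As an illustration of late integration by combining p-values, here is a minimal sketch of Fisher's method in plain Python. It assumes the per-study p-values for a gene are independent; the example values in `study_pvalues` are made up for illustration and do not come from the talk.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with even df (closed form)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term = 1.0
    total = 1.0
    # sum_{i=0}^{df/2 - 1} (x/2)^i / i!
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method.

    Returns the chi-square statistic and the combined p-value.
    """
    if not pvalues or any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    statistic = -2.0 * sum(math.log(p) for p in pvalues)
    return statistic, chi2_sf_even_df(statistic, 2 * len(pvalues))

# Hypothetical p-values for one gene from three independent studies.
study_pvalues = [0.04, 0.10, 0.03]
statistic, combined = fisher_combined_pvalue(study_pvalues)
print(f"chi-square = {statistic:.3f}, combined p = {combined:.4f}")
```

Fisher's method is only one option; it is sensitive to a single very small p-value, so other combination rules may suit some study designs better.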

Genomic Meta-analysis: Combining Multiple Transcriptomic Studies

Tseng Lab, U. of Pitt.
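One standard way to combine effect sizes across studies is an inverse-variance (fixed-effect) meta-analysis. The sketch below is a generic textbook version, not necessarily the approach used by the Tseng Lab; the effect sizes and standard errors are hypothetical.

```python
import math

def fixed_effect_meta(effects, std_errors):
    """Inverse-variance weighted (fixed-effect) combination of effect sizes.

    Returns pooled effect, its standard error, z-score, and two-sided p-value.
    """
    if not effects or len(effects) != len(std_errors):
        raise ValueError("need one standard error per effect size")
    weights = [1.0 / (se * se) for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled / pooled_se
    # Two-sided p-value under a standard normal null.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled, pooled_se, z, p_value

# Hypothetical log fold changes for one gene in three expression studies.
log_fold_changes = [0.8, 0.5, 1.1]
standard_errors = [0.3, 0.4, 0.5]
pooled, pooled_se, z, p_value = fixed_effect_meta(log_fold_changes, standard_errors)
print(f"pooled = {pooled:.3f} (SE {pooled_se:.3f}), z = {z:.2f}, p = {p_value:.4f}")
```

A fixed-effect model assumes all studies estimate the same underlying effect; when studies differ in platform or population, a random-effects model is usually the safer choice.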

Assessing Genomic Overlap: Permutation-based Strategies

Bickel Lab, Berkeley & ENCODE

Ann. Appl. Stat. (2010) 4:4 1660-1697.
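To make the idea of a permutation-based overlap test concrete, here is a deliberately naive sketch: it re-places one set of intervals uniformly at random along a single chromosome and counts overlaps with a fixed reference set. This is an assumption-laden simplification and not the method of the cited paper, which uses subsampling strategies designed to respect the structure of the genome. All coordinates below are hypothetical.

```python
import random

def count_overlaps(query, reference):
    """Count query intervals (start, end), half-open, that overlap any reference interval."""
    return sum(
        any(q_start < r_end and r_start < q_end for r_start, r_end in reference)
        for q_start, q_end in query
    )

def permutation_overlap_test(query, reference, genome_length, n_perm=1000, seed=0):
    """One-sided permutation p-value for enrichment of overlap.

    Null model (an assumption): each query interval keeps its length but is
    placed uniformly at random on a chromosome of length genome_length.
    """
    rng = random.Random(seed)
    observed = count_overlaps(query, reference)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        shuffled = []
        for start, end in query:
            length = end - start
            new_start = rng.randrange(0, genome_length - length + 1)
            shuffled.append((new_start, new_start + length))
        if count_overlaps(shuffled, reference) >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the p-value is never exactly zero.
    p_value = (at_least_as_extreme + 1) / (n_perm + 1)
    return observed, p_value

# Hypothetical binding peaks (reference) and significant regions (query).
reference_peaks = [(100, 200), (1000, 1100), (5000, 5100)]
query_regions = [(150, 180), (1050, 1120), (5050, 5060), (8000, 8050)]
observed, p_value = permutation_overlap_test(query_regions, reference_peaks, 100_000)
print(f"observed overlaps = {observed}, permutation p = {p_value:.4f}")
```

Uniform placement ignores clustering of genomic features, which tends to make this null too optimistic on real data.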

Challenge 2: Dimensionality


Most technologies produce tens to hundreds of thousands of measurements per sample


Exponential increase with 2+ data types


Dimension reduction


Process data type separately (filtering)


Combine with model fitting


Multivariate analysis
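Filtering each data type separately can be as simple as keeping the most variable features before any joint analysis. The sketch below assumes a small samples-by-features matrix stored as nested lists; the gene names and values are hypothetical.

```python
from statistics import variance

def top_variance_features(matrix, feature_names, keep):
    """Keep the `keep` features with the largest sample variance.

    matrix: list of samples, each a list of feature values (same order as feature_names).
    Returns the kept feature names and the reduced matrix, in original column order.
    """
    columns = list(zip(*matrix))
    variances = [variance(column) for column in columns]
    ranked = sorted(range(len(feature_names)), key=lambda j: variances[j], reverse=True)
    kept = sorted(ranked[:keep])
    names = [feature_names[j] for j in kept]
    reduced = [[row[j] for j in kept] for row in matrix]
    return names, reduced

# Hypothetical expression matrix: 3 samples x 3 genes.
expression = [
    [5.0, 1.0, 3.0],
    [5.1, 4.0, 3.0],
    [4.9, 7.0, 3.1],
]
genes = ["geneA", "geneB", "geneC"]
kept_genes, reduced = top_variance_features(expression, genes, keep=2)
print(kept_genes)
```

Variance filtering is unsupervised, so it does not use the outcome; that avoids selection bias but can discard low-variance features that matter biologically.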


Sparse Multivariate Methods



Variable Selection, Discriminant Analysis, Visualization

Penalties (or regularization) reduce the parameter space so that only a few entries are non-zero (sparsity)

Sparse Canonical Correlation Analysis (CCA) and Partial Least Squares Regression (PLS)

Le Cao, U. of Queensland; Besse, U. of Toulouse; Witten, U. of Wash; Tibshirani, Stanford

Stat Appl Genet Mol Biol. 2009 January 1; 8(1): Article 28; Stat Appl Genet Mol Biol. 2008;7(1):Article 35
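The way an L1-type penalty produces sparsity can be seen in the soft-thresholding operator, which shrinks every entry toward zero and sets small entries exactly to zero. The sketch below shows only this building block, not a full sparse CCA or sparse PLS fit; the loading values and the penalty are hypothetical.

```python
import math

def soft_threshold(values, penalty):
    """Shrink each value toward zero by `penalty`; values within the penalty become 0."""
    shrunk = []
    for v in values:
        magnitude = abs(v) - penalty
        if magnitude <= 0.0:
            shrunk.append(0.0)
        else:
            shrunk.append(magnitude if v > 0 else -magnitude)
    return shrunk

def sparse_unit_vector(values, penalty):
    """Soft-threshold a loading vector, then rescale it to unit length (if non-zero)."""
    shrunk = soft_threshold(values, penalty)
    norm = math.sqrt(sum(v * v for v in shrunk))
    if norm == 0.0:
        return shrunk
    return [v / norm for v in shrunk]

# Hypothetical loadings for five features.
loadings = [0.9, -0.05, 0.1, -0.7, 0.02]
sparse_loadings = sparse_unit_vector(loadings, penalty=0.2)
n_selected = sum(1 for v in sparse_loadings if v != 0.0)
print(sparse_loadings, n_selected)
```

In penalized multivariate methods a step of this kind is typically applied repeatedly inside an iterative algorithm, with the penalty chosen by cross-validation or a similar criterion.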

Challenge 3: Genomic Resolution


Base level (conservation, motif scores)

Regular intervals (expression/binding from tiling arrays)

Irregular intervals

Gene/ncRNA level data (expression)

Individual positions (SNP, methylation sites)
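One common way to reconcile resolutions is to summarize the finer-grained data over the coarser units, for example averaging base-level scores over gene intervals. The sketch below assumes 0-based, half-open coordinates on a single sequence; the scores and gene intervals are hypothetical.

```python
def mean_score_per_interval(base_scores, intervals):
    """Average base-level scores over named intervals.

    base_scores: list indexed by 0-based position.
    intervals: dict mapping a name to a half-open (start, end) pair.
    """
    summary = {}
    for name, (start, end) in intervals.items():
        window = base_scores[start:end]
        summary[name] = sum(window) / len(window) if window else float("nan")
    return summary

# Hypothetical conservation scores for 8 bases and two gene intervals.
conservation = [0.1, 0.9, 0.8, 1.0, 0.2, 0.0, 0.4, 0.6]
gene_intervals = {"geneA": (1, 4), "geneB": (5, 8)}
gene_level = mean_score_per_interval(conservation, gene_intervals)
print(gene_level)
```

The mean is only one possible summary; a maximum or a count of positions above a cutoff may better preserve short, highly conserved elements.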


Challenge 4: Heterogeneity


Technology-specific sources of error

Different pre-processing, normalization

Different amounts of missing values

Data matching

Different identifiers

Not always one-to-one (microarrays)


Imputation
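When identifiers do not map one-to-one, a simple convention is to average all probes that map to the same gene and to track probes with no mapping. The sketch below shows that convention as one assumption among several possible ones; the probe names, values, and mapping are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def collapse_probes_to_genes(probe_values, probe_to_genes):
    """Average probe-level values per gene.

    probe_values: dict probe id -> measurement.
    probe_to_genes: dict probe id -> list of gene ids (may be empty or have several).
    Returns (gene-level means, list of probes with no gene mapping).
    """
    per_gene = defaultdict(list)
    unmapped = []
    for probe, value in probe_values.items():
        genes = probe_to_genes.get(probe, [])
        if not genes:
            unmapped.append(probe)
        for gene in genes:
            per_gene[gene].append(value)
    gene_means = {gene: mean(values) for gene, values in per_gene.items()}
    return gene_means, unmapped

# Hypothetical microarray probes: two probes for geneA, one probe shared by two genes.
probe_values = {"p1": 2.0, "p2": 4.0, "p3": 1.0, "p4": 7.0}
probe_to_genes = {"p1": ["geneA"], "p2": ["geneA"], "p3": ["geneB", "geneC"]}
gene_means, unmapped = collapse_probes_to_genes(probe_values, probe_to_genes)
print(gene_means, unmapped)
```

Assigning a shared probe to every gene it matches is a judgment call; dropping ambiguous probes is an equally defensible alternative.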


Challenge 4: Heterogeneity


Continuous: expression and binding data from microarrays, motif scores, protein/metabolite abundance

Counts: expression data from sequencing

0-1: conservation (UCSC), DNA methylation

Binary/Categorical: thresholding (e.g., motif scores), genotype


Case Study: Development

Ci

important for differentiation of appendages during development

transcription factor

binds to DNA near target genes

http://www.biology.ualberta.ca/locke.hp/research.htm

http://howardhughes.trinity.duke.edu

Kechris Lab, CU Denver

Hierarchical Mixture Model


Data

- Transcriptome: Ci pathway mutants (expr), irregular interval

- Genome: DNA binding data of Ci (bind), regular interval; DNA conservation across 14 insect species (cons), base level

Goal: Predict gene targets of Ci

Hidden variable is gene target (hierarchical mixture model)

Dvorkin et al., 2013 (under review)
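The idea of treating target status as a hidden variable can be illustrated with a much simpler model than the one on this slide: a two-component Gaussian mixture fitted by EM to a single per-gene score. This is not the hierarchical model of Dvorkin et al., which integrates the expr, bind, and cons data; the single combined score and its values below are purely hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def target_posteriors(scores, means, sds, weight):
    """E-step: posterior probability that each gene is in the target component."""
    posteriors = []
    for x in scores:
        background = (1.0 - weight) * normal_pdf(x, means[0], sds[0])
        target = weight * normal_pdf(x, means[1], sds[1])
        total = background + target
        posteriors.append(target / total if total > 0.0 else 0.5)
    return posteriors

def fit_two_component_mixture(scores, n_iter=100, min_sd=1e-3):
    """EM for a background (index 0) vs. target (index 1) Gaussian mixture."""
    low, high = min(scores), max(scores)
    means = [low, high]
    start_sd = max((high - low) / 4.0, min_sd)
    sds = [start_sd, start_sd]
    weight = 0.5  # prior probability of being a target
    for _ in range(n_iter):
        post = target_posteriors(scores, means, sds, weight)
        n_target = max(sum(post), 1e-12)
        n_background = max(len(scores) - sum(post), 1e-12)
        # M-step: weighted means, standard deviations, and mixing weight.
        means[1] = sum(p * x for p, x in zip(post, scores)) / n_target
        means[0] = sum((1.0 - p) * x for p, x in zip(post, scores)) / n_background
        var_target = sum(p * (x - means[1]) ** 2 for p, x in zip(post, scores)) / n_target
        var_background = sum(
            (1.0 - p) * (x - means[0]) ** 2 for p, x in zip(post, scores)
        ) / n_background
        sds = [max(math.sqrt(var_background), min_sd), max(math.sqrt(var_target), min_sd)]
        weight = min(max(n_target / len(scores), 1e-6), 1.0 - 1e-6)
    # Final E-step so the posteriors match the returned parameters.
    return target_posteriors(scores, means, sds, weight), means, weight

# Hypothetical per-gene evidence scores; the last three genes look like targets.
gene_scores = [0.1, -0.2, 0.3, 0.0, -0.1, 0.2, 2.9, 3.1, 3.3]
posteriors, means, weight = fit_two_component_mixture(gene_scores)
print([round(p, 3) for p in posteriors])
```

Genes can then be ranked by posterior probability; a hierarchical version would replace the single score with layered models for each data type.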

Challenge 5: Interactions and Pathways


Known Pathways


Incorporate information in databases (curated but sparse)

e.g., KEGG pathways have metabolite-protein interactions (directed graphs)

De novo Pathways

Discover novel interactions

Known Pathways

Jornsten, Chalmers & Michailidis, U. Michigan

Biostatistics (2012) 13:4 748-761

Joint modeling of metabolite and transcript data to identify active pathways

[Figure labels: metabolite, gene]

de novo Interactions

Single data

INTEGRATION

Pair-wise

Correlations (e.g., eQTL)

Bayesian networks

Multiple

Kernel-based methods

Probabilistic graphical models

Network analysis

[Figure labels: gene, SNP, protein, metabolite, gene, methylation site, PHENOTYPE]
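The simplest pair-wise integration is a correlation screen between two data types, in the spirit of an eQTL scan. The sketch below correlates genotype codes (0/1/2) with expression values across the same samples and keeps strongly correlated pairs. Real eQTL analyses usually use regression with covariates and multiple-testing correction, which this sketch omits; all names and values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)

def screen_pairs(genotypes, expression, min_abs_r):
    """Return (snp, gene, r) for all pairs with |r| >= min_abs_r, strongest first."""
    hits = []
    for snp, genotype in genotypes.items():
        for gene, values in expression.items():
            r = pearson(genotype, values)
            if abs(r) >= min_abs_r:
                hits.append((snp, gene, r))
    return sorted(hits, key=lambda hit: -abs(hit[2]))

# Hypothetical data for six samples.
genotypes = {"snp1": [0, 1, 2, 0, 1, 2], "snp2": [0, 0, 1, 1, 2, 2]}
expression = {
    "geneA": [1.0, 2.1, 2.9, 1.1, 2.0, 3.2],
    "geneB": [5.0, 4.8, 5.1, 4.9, 5.2, 5.0],
}
for snp, gene, r in screen_pairs(genotypes, expression, min_abs_r=0.9):
    print(f"{snp} - {gene}: r = {r:.3f}")
```

Pair-wise screens scale with the product of the feature counts, which is one motivation for the network and graphical-model approaches listed on this slide.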

de novo Interactions

Shojaie Lab, U. Washington

Biometrika (2010) 97 (3): 519-538.

Summary Methodology

1. Meta-analysis

2. Permutation-based Methods

3. Sparse Multivariate Methods

4. Graphical Models

5. Network Analysis