# Beyond Bioinformatics: Statistical and Mathematical Challenges

7 Nov 2013

SAMSI 2014-2015 Program

Beyond Bioinformatics: Statistical and Mathematical Challenges

Topic: Data Integration

Katerina Kechris, PhD

Associate Professor

Biostatistics and Informatics

Omics

Large-scale analyses for studying a population of molecules or molecular mechanisms

High-throughput data

Examples

Genomics (entire genome, DNA)

Proteomics (study of protein repertoire)

Epigenomics (study of DNA and histone modifications)

Omics

Epigenome

Phenome

http://www.sciencebasedmedicine.org

http://www.scientificpsychic.com/fitness/transcription.gif

http://themedicalbiochemistrypage.org/images/hemoglobin.jpg

http://creatia2013.files.wordpress.com/2013/03/dna.gif

Large-scale Projects & Databases

NCI 60 Database

Integration of Omics Data

Each type of data gives a different snapshot of the biological or disease system

Why integrate data?

Reduce false positives/negatives

Identify interactions between different molecules

Explore functional mechanisms

Challenges

1. When to integrate?

2. Dimensionality

3. Resolution

4. Heterogeneity

5. Interactions and Pathways

Challenge 1: When to integrate?

Early

Merging data to increase sample size

Intermediate

Convert different data sources into a common format (e.g., ranks, correlation matrices), kernel-based analysis

Late

Meta-analysis (combine effect sizes or p-values), aggregate voting for classifiers, genomic enrichment and overlap of significant results
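As an illustration of late integration by combining p-values, the sketch below uses Fisher's method. It is one common choice and not necessarily the approach used in the talk; the studies are assumed independent and the p-values are made up.

```python
import math

def fisher_combined_pvalue(pvalues):
    """Combine independent p-values for one feature with Fisher's method."""
    if not pvalues:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(pvalues)
    # Under the null, -2 * sum(log p) is chi-square with 2k degrees of freedom.
    statistic = -2.0 * sum(math.log(p) for p in pvalues)
    # For even degrees of freedom the chi-square tail has a closed form:
    # sf(x; 2k) = exp(-x/2) * sum_{i<k} (x/2)^i / i!
    half = statistic / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return statistic, math.exp(-half) * total

# Hypothetical p-values for the same gene in three separate studies
statistic, p_combined = fisher_combined_pvalue([0.04, 0.10, 0.03])
print(round(p_combined, 4))
```

With a single study the combined p-value equals the input p-value, which is a quick sanity check on the formula.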

Genomic Meta-analysis: Combining Multiple Transcriptomic Studies

Tseng Lab, U. of Pitt.
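Combining effect sizes across transcriptomic studies can be sketched with a fixed-effect, inverse-variance weighted average. This is a textbook estimator shown only for orientation, not the method of the lab credited above; the effect sizes and standard errors are invented.

```python
import math

def fixed_effect_meta(effects, std_errors):
    """Inverse-variance weighted pooled effect, its standard error and a two-sided p-value."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    z = pooled / pooled_se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled, pooled_se, p_value

# Hypothetical log fold changes for one gene in three studies
pooled, pooled_se, p_value = fixed_effect_meta([0.8, 0.5, 1.1], [0.4, 0.3, 0.5])
print(round(pooled, 3), round(pooled_se, 3), round(p_value, 4))
```

A fixed-effect model assumes every study estimates the same underlying effect; when studies differ systematically, a random-effects model is the usual alternative.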

Assessing Genomic Overlap: Permutation-based Strategies

Bickel Lab, Berkeley & ENCODE

Ann. Appl. Stat. (2010) 4:4 1660-1697.
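The idea of judging an observed overlap against randomized placements can be sketched as below. This is a deliberately naive circular-shift scheme on a toy genome, and it is not the procedure of the paper cited above; the interval coordinates and genome length are invented.

```python
import random

def covered_positions(intervals, genome_length, shift=0):
    """Bases covered by half-open (start, end) intervals after a circular shift."""
    covered = set()
    for start, end in intervals:
        for pos in range(start, end):
            covered.add((pos + shift) % genome_length)
    return covered

def overlap_permutation_test(intervals_a, intervals_b, genome_length, n_perm=1000, seed=0):
    """Compare the observed base overlap with overlaps after random circular shifts of set A."""
    rng = random.Random(seed)
    fixed = covered_positions(intervals_b, genome_length)
    observed = len(covered_positions(intervals_a, genome_length) & fixed)
    hits = 0
    for _ in range(n_perm):
        shift = rng.randrange(genome_length)
        shifted = covered_positions(intervals_a, genome_length, shift)
        if len(shifted & fixed) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value away from exactly zero.
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical feature sets on a toy genome of 1000 bases
set_a = [(100, 120), (400, 430), (700, 710)]
set_b = [(105, 125), (410, 440), (705, 720)]
observed, p_value = overlap_permutation_test(set_a, set_b, genome_length=1000)
print(observed, p_value)
```

Shifting all intervals together preserves their spacing, which matters because genomic features tend to cluster; shuffling positions independently would usually understate the null variability.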

Challenge 2: Dimensionality

Most technologies produce tens to hundreds of thousands of measurements per sample

Exponential increase with 2+ data types

Dimension reduction

Process data type separately (filtering)

Combine with model fitting

Multivariate analysis
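Filtering each data type separately can be as simple as keeping its most variable features before any joint analysis. The sketch below assumes variance is a sensible filter criterion, which depends on the platform and normalization; the feature names and values are hypothetical.

```python
import statistics

def filter_by_variance(features, keep):
    """Return the names of the `keep` most variable features.

    features: dict mapping a feature name to its values across samples.
    """
    ranked = sorted(features, key=lambda name: statistics.pvariance(features[name]), reverse=True)
    return ranked[:keep]

# Hypothetical expression values for three genes across four samples
expression = {
    "geneA": [1.0, 1.0, 1.0, 1.0],
    "geneB": [1.0, 5.0, 2.0, 8.0],
    "geneC": [2.0, 3.0, 2.0, 3.0],
}
selected = filter_by_variance(expression, keep=2)
print(selected)
```

The same function would be applied to each data type on its own, so that one high-dimensional platform does not dominate the combined feature set.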

Sparse Multivariate Methods

Variable Selection, Discriminant Analysis, Visualization

Penalties (or regularization) reduce the parameter space so that only a few entries are non-zero (sparsity)

Sparse Canonical Correlation Analysis (CCA) and Partial Least Squares Regression (PLS)
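To show how a penalty produces sparsity, here is a simplified rank-one sparse CCA sketch based on alternating soft-thresholding of the cross-product matrix. It assumes fixed thresholds `lam_u` and `lam_v` (hypothetical tuning parameters) and treats the within-data covariances as identity, so it should be read as an illustration of the idea rather than any of the published algorithms cited on this slide.

```python
import math

def soft_threshold(vector, lam):
    """Shrink entries toward zero by lam; entries with magnitude below lam become zero."""
    return [math.copysign(max(abs(x) - lam, 0.0), x) for x in vector]

def normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else vector

def center_columns(matrix):
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[x - m for x, m in zip(row, means)] for row in matrix]

def sparse_cca_rank1(X, Y, lam_u, lam_v, n_iter=50):
    """Return sparse weight vectors (u, v) for the columns of X and Y (rows are samples)."""
    X, Y = center_columns(X), center_columns(Y)
    p, q = len(X[0]), len(Y[0])
    # Cross-product matrix C = X^T Y, of size p x q
    C = [[sum(xr[i] * yr[j] for xr, yr in zip(X, Y)) for j in range(q)] for i in range(p)]
    v = normalize([1.0] * q)
    u = [0.0] * p
    for _ in range(n_iter):
        u = normalize(soft_threshold([sum(C[i][j] * v[j] for j in range(q)) for i in range(p)], lam_u))
        v = normalize(soft_threshold([sum(C[i][j] * u[i] for i in range(p)) for j in range(q)], lam_v))
    return u, v

# Toy data: the first column of X and the first column of Y share a signal
signal = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
noise_a = [0.1, -0.1, 0.1, -0.1, 0.1, -0.1]
noise_b = [0.1, 0.1, -0.2, -0.2, 0.1, 0.1]
noise_c = [-0.1, 0.1, 0.1, -0.1, -0.1, 0.1]
X = [[s, a, b] for s, a, b in zip(signal, noise_a, noise_b)]
Y = [[s, c] for s, c in zip(signal, noise_c)]
u, v = sparse_cca_rank1(X, Y, lam_u=1.0, lam_v=1.0)
print(u, v)
```

Only the shared-signal columns keep non-zero weights here; the thresholds are scale dependent, so in practice they are usually chosen by cross-validation or permutation.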

Le Cao, U. of Queensland; Besse, U. of Toulouse; Witten, U. of Wash; Tibshirani, Stanford

Stat Appl Genet Mol Biol. 2009 January 1; 8(1): Article 28; Stat Appl Genet Mol Biol. 2008; 7(1): Article 35

Challenge 3: Genomic Resolution

Base level (conservation, motif scores)

Regular intervals (expression/binding from tiling arrays)

Irregular intervals

Gene/ncRNA level data (expression)

Individual positions (SNP, methylation sites)
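One way to reconcile resolutions is to summarize finer-grained data onto the coarser unit, for example averaging base-level scores within each gene interval. The sketch below assumes half-open coordinates and a simple mean; the positions, scores and gene names are hypothetical.

```python
def mean_score_per_interval(scores, intervals):
    """Average base-level scores inside each interval.

    scores: dict mapping a genomic position to a score.
    intervals: dict mapping a name to a half-open (start, end) pair.
    Intervals with no scored bases get None.
    """
    summary = {}
    for name, (start, end) in intervals.items():
        inside = [scores[pos] for pos in range(start, end) if pos in scores]
        summary[name] = sum(inside) / len(inside) if inside else None
    return summary

# Hypothetical conservation scores and gene coordinates
conservation = {10: 0.9, 11: 0.8, 12: 1.0, 50: 0.2, 51: 0.4}
genes = {"geneA": (10, 13), "geneB": (50, 52), "geneC": (100, 110)}
gene_level = mean_score_per_interval(conservation, genes)
print(gene_level)
```

The choice of summary (mean, maximum, fraction above a cutoff) changes what signal survives the aggregation, so it is worth treating as a modeling decision.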

Challenge 4: Heterogeneity

Technology-specific sources of error

Different pre-processing, normalization

Different amounts of missing values

Data matching

Different identifiers

Not always one-to-one (microarrays)

Imputation
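The matching and imputation steps can be sketched as follows: several probes may map to one gene, so their values are averaged, and missing values are filled with the mean of the observed ones. Both rules are simple defaults assumed for illustration; the probe and gene identifiers are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def collapse_probes_to_genes(probe_values, probe_to_gene):
    """Average all probes that map to the same gene; skip unmapped probes and missing values."""
    grouped = defaultdict(list)
    for probe, value in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is not None and value is not None:
            grouped[gene].append(value)
    return {gene: mean(values) for gene, values in grouped.items()}

def impute_with_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical microarray probes: two probes for geneA, one unmapped probe
probe_values = {"p1": 2.0, "p2": 4.0, "p3": 5.0, "p4": 1.0}
probe_to_gene = {"p1": "geneA", "p2": "geneA", "p3": "geneB"}
gene_values = collapse_probes_to_genes(probe_values, probe_to_gene)
filled = impute_with_mean([1.0, None, 3.0])
print(gene_values, filled)
```

Mean imputation ignores correlation between features; neighbor-based or model-based imputation is often preferred when the fraction of missing values is large.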

Challenge 4: Heterogeneity

Continuous: expression and binding data from microarrays, motif scores, protein/metabolite abundance

Counts: expression data from sequencing

0-1: conservation (UCSC), DNA methylation

Binary/Categorical: thresholding (e.g., motif scores), genotype
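A few elementary transformations are often used to bring these data types closer to a common scale. The sketch below shows a log transform for counts, thresholding to binary, and ranks with averaged ties; the pseudocount of 1 and the cutoff are assumptions chosen for the example.

```python
import math

def log_counts(counts):
    """log2(count + 1), a common variance-stabilizing step for sequencing counts."""
    return [math.log2(c + 1) for c in counts]

def to_binary(scores, cutoff):
    """1 if the score reaches the cutoff, otherwise 0."""
    return [1 if s >= cutoff else 0 for s in scores]

def to_ranks(values):
    """Ranks from 1 (smallest) to n, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks

print(log_counts([0, 1, 3]))
print(to_binary([0.2, 0.9, 0.5], cutoff=0.5))
print(to_ranks([10, 30, 20, 30]))
```

Ranks discard scale entirely, which makes different platforms comparable at the cost of losing effect magnitudes.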

Case Study: Development

Ci: important for differentiation of appendages during development

Transcription factor: binds to DNA near target genes

http://www.biology.ualberta.ca/locke.hp/research.htm

http://howardhughes.trinity.duke.edu

Kechris Lab, CU Denver

Hierarchical Mixture Model

Data

- Transcriptome: Ci pathway mutants (expr), irregular interval

- Genome: DNA binding data of Ci (bind), regular interval; DNA conservation across 14 insect species (cons), base level

Goal: Predict gene targets of Ci

Hidden variable is gene target: hierarchical mixture model

Dvorkin et al., 2013 (under review)
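The role of a hidden target indicator can be illustrated with a much simpler model than the one on this slide: a two-component Gaussian mixture fitted by EM to a single score per gene, returning the posterior probability that each gene is a target. This is not the hierarchical model of Dvorkin et al., which integrates expr, bind and cons; the scores below are invented.

```python
import math

def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def weighted_mean_sd(values, weights):
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, values)) / total
    var = sum(w * (x - mean) ** 2 for w, x in zip(weights, values)) / total
    return mean, max(math.sqrt(var), 1e-3)  # floor avoids a degenerate component

def posterior_target_probability(scores, n_iter=100):
    """EM for a two-component Gaussian mixture; returns P(target | score) per gene."""
    lo, hi = min(scores), max(scores)
    # Crude start: non-targets near the minimum, targets near the maximum.
    mu0, mu1 = lo, hi
    sd0 = sd1 = max((hi - lo) / 4.0, 1e-3)
    pi1 = 0.5
    post = [0.5] * len(scores)
    for _ in range(n_iter):
        # E-step: posterior probability of the hidden target indicator
        post = []
        for x in scores:
            a = pi1 * normal_pdf(x, mu1, sd1)
            b = (1.0 - pi1) * normal_pdf(x, mu0, sd0)
            post.append(a / (a + b) if a + b > 0.0 else 0.5)
        # M-step: update component parameters and mixing proportion
        w1 = sum(post)
        if w1 < 1e-9 or len(scores) - w1 < 1e-9:
            break
        mu1, sd1 = weighted_mean_sd(scores, post)
        mu0, sd0 = weighted_mean_sd(scores, [1.0 - p for p in post])
        pi1 = w1 / len(scores)
    return post

# Hypothetical per-gene scores: five background genes and three likely targets
scores = [0.1, -0.2, 0.3, 0.0, -0.1, 3.1, 2.9, 3.3]
probabilities = posterior_target_probability(scores)
print([round(p, 3) for p in probabilities])
```

In a hierarchical version, each data type would contribute its own likelihood conditional on the shared hidden indicator, so that evidence from all three sources updates the same posterior.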

Challenge 5: Interactions and Pathways

Known Pathways

Incorporate information in databases (curated but sparse)

e.g., KEGG pathways have metabolite-protein interactions (directed graphs)
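A known pathway can be held in memory as a directed graph and queried for everything downstream of a node. The sketch uses a plain adjacency dictionary; the node names are placeholders and not taken from KEGG.

```python
from collections import deque

def downstream(graph, start):
    """All nodes reachable from `start` in a directed graph stored as an adjacency dict."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

# Hypothetical pathway: metabolites and enzymes alternate along directed edges
pathway = {
    "metabolite_A": ["enzyme_1"],
    "enzyme_1": ["metabolite_B"],
    "metabolite_B": ["enzyme_2"],
    "enzyme_2": ["metabolite_C"],
}
affected = downstream(pathway, "metabolite_B")
print(sorted(affected))
```

Such a query gives the set of molecules whose measurements might be modeled jointly when a given node is perturbed.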

De novo Pathways

Discover novel interactions

Known Pathways

Jornsten, Chalmers & Michailidis, U. Michigan

Biostatistics (2012) 13:4 748-761

Joint modeling of metabolite and transcript data to identify active pathways

metabolite

gene

de novo Interactions

Single data

INTEGRATION

Pair-wise

Correlations (e.g., eQTL)

Bayesian networks

Multiple

Kernel-based methods

Probabilistic graphical models

Network analysis
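The pair-wise correlation approach (as in eQTL screening) can be sketched by correlating every SNP genotype with every gene's expression and keeping strong pairs. Real eQTL analyses typically use regression with covariates and multiple-testing control; the cutoff `min_abs_r`, the identifiers and the data here are all hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pairwise_associations(genotypes, expression, min_abs_r):
    """Return (snp, gene, r) for every pair whose absolute correlation reaches min_abs_r."""
    hits = []
    for snp, g in genotypes.items():
        for gene, e in expression.items():
            r = pearson(g, e)
            if abs(r) >= min_abs_r:
                hits.append((snp, gene, r))
    return hits

# Hypothetical genotypes (0, 1 or 2 minor alleles) and expression for six samples
genotypes = {"snp1": [0, 1, 2, 0, 1, 2], "snp2": [0, 2, 0, 2, 1, 1]}
expression = {
    "geneA": [1.0, 2.1, 2.9, 1.2, 1.9, 3.1],
    "geneB": [5.0, 5.1, 4.9, 5.0, 5.2, 4.8],
}
hits = pairwise_associations(genotypes, expression, min_abs_r=0.8)
print([(snp, gene, round(r, 3)) for snp, gene, r in hits])
```

The number of pairs grows as the product of the two feature counts, which is why pair-wise screens are usually followed by multivariate or network methods.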

[Figure labels: gene, SNP, protein, metabolite, methylation site, PHENOTYPE]

de novo Interactions

Shojaie Lab, U. Washington

Biometrika (2010) 97 (3): 519-538.

Summary Methodology

1. Meta-analysis

2. Permutation-based Methods

3. Sparse Multivariate Methods

4. Graphical Models

5. Network Analysis