ppt - University of Pennsylvania

signtruculentΒιοτεχνολογία

2 Οκτ 2013 (πριν από 3 χρόνια και 10 μήνες)

79 εμφανίσεις

Bioinformatics: Making sense of
functional

genomics data

Chris Stoeckert, Ph.D.

Dept of Genetics and Penn Center for Bioinformatics,
University of Pennsylvania School of Medicine


GoldenHelix

Symposia, Athens, Greece

December
3
,
2010

What is Functional Genomics?

From http://
en.wikipedia.org/wiki/Functional_genomics


Functional genomics is a field of molecular biology that attempts to
make use of the vast wealth of data produced by genomic projects (such
as genome sequencing projects) to describe gene (and protein) functions
and interactions. Unlike genomics and proteomics,
functional genomics
focuses on the dynamic aspects such as gene transcription, translation,
and protein
-
protein interactions,
as opposed to the static aspects of the
genomic information such as DNA sequence or structures. Functional
genomics attempts to answer questions about the function of DNA at
the levels of genes, RNA transcripts, and protein products. A key
characteristic of functional genomics studies is their
genome
-
wide
approach to these questions, generally involving high
-
throughput
methods
rather than a more traditional “gene
-
by
-
gene” approach.

The Impact of Functional Genomics


More data than can be fit in a publication, or a
lab book, and often not generated by a single
lab.


Microarray
-
based datasets are an example.


High throughput sequencing (HTS)
-
based
datasets are becoming the “new” microarrays.

What is Needed for Reproducible
Research


Public archives of data


Minimum information about experiments so
that the steps to generate the results can be
followed


An example of what can go wrong:

Potti

A, Dressman HK, Bild A, Riedel RF, Chan G, Sayer R, Cragun J,
Cottrill H, Kelley MJ, Petersen R, Harpole D, Marks J, Berchuck A,
Ginsburg GS, Febbo P, Lancaster J, Nevins JR. Genomic signatures
to guide the use of chemotherapeutics. Nat Med.
2006
Nov;
12
(
11
):
1294
-
300
. Epub
2006
Oct
22
. Erratum in: Nat Med.
2008
Aug;
14
(
8
):
889
. Nat Med.
2007
Nov;
13
(
11
):
1388
.

Keith A. Baggerly and Kevin R. Coombes

Deriving chemosensitivity from cell lines:
Forensic bioinformatics and reproducible
research in high
-
throughput biology

Ann. Appl. Stat. Volume
3
, Number
4
(
2009
),
1309
-
1334
.

From July
21
,
2010
N.Y. Times:

“Last year, two biostatisticians at the University
of Texas MD Anderson Cancer Center
published an article in the scientific journal
Annals of Applied Statistics in which they
identified errors in Duke’s data analysis and
said they had not been able to reproduce
Duke’s results
.


Recent examples of discoveries from integrating
usable functional genomics datasets


A global map of human gene
expression.
Lukk

et al. Nature
Biotech.
2010






Differentially Expressed RNA
from Public Microarray Data
Identifies Serum Protein
Biomarkers for Cross
-
Organ
Transplant Rejection and Other
Conditions.
R. Chen et al.
PLoS

Comp. Biol.
2010

In the beginning there were
microarrays…


And they were developed for gene expression profiling.


Microarray studies were published but only gene lists were
provided.


No details, no place to get the data


In
1999
, the Microarray Gene Expression Data Society

(MGED) began to address the lack

of verifiable and reproducible

datasets by developing and

promoting standards.

MGED Standards


What information is needed for a microarray
experiment?


MIAME
: Minimal Information About a Microarray
Experiment.
Brazma et al., Nature Genetics
2001


How do you “code up” microarray data?


MAGE
-
OM: MicroArray Gene Expression Object Model.
Spellman et al., Genome Biology
2002



MAGE
-
TAB
Rayner et al., BMC Bioinformatics
2006


What words do you use to describe a microarray
experiment?


MO
: MGED Ontology.
Whetzel et al. Bioinformatics
2006

hybridisation

labelled

nucleic acid

array

RNA extract

Sample

Array design

hybridisation

labelled

nucleic acid

array

RNA extract

Sample

hybridisation

labelled

nucleic acid

array

RNA extract

Sample

hybridisation

labelled

nucleic acid

array

RNA extract

Sample

hybridisation

labelled

nucleic acid

Microarray

RNA extract

Sample

Experiment

Gene expression
data matrix

normalization

integration

Protocol

Protocol

Protocol

Protocol

Protocol

Protocol

genes

MIAME in a nutshell (ala
Alvis

Brazma
)

Stoeckert et al.

Drug Discovery Today TARGETS
2004

hybridisation

labelled

nucleic acid

array

RNA extract

Sample

Array design

hybridisation

labelled

nucleic acid

array

RNA extract

Sample

hybridisation

labelled

nucleic acid

array

RNA extract

Sample

hybridisation

labelled

nucleic acid

array

RNA extract

Sample

hybridisation

labelled

nucleic acid

Microarray

RNA extract

Sample

Experiment

Gene expression
data matrix

normalization

integration

Protocol

Protocol

Protocol

Protocol

Protocol

Protocol

genes

Sequencing is replacing array technology

@HWI
-
EAS
266
_
0011
:
8
:
1
:
6
:
969
#
0
/
1

GTTTGCCNGTGTGTACGCTACCCCCTTCTTGTGTGTGTGTG
TCT

+HWI
-
EAS
266
_
0011
:
8
:
1
:
6
:
969
#
0
/
1

_
abb`a[DZ`aabaa_a`b]___^^aa_`aa_a^a[
\
\
aZTZVY

@HWI
-
EAS
266
_
0011
:
8
:
1
:
7
:
1688
#
0
/
1

AAGATGANGGCAGGGTGCAAGATGGCAGGATGCAAGATGG
CAGG

+HWI
-
EAS
266
_
0011
:
8
:
1
:
7
:
1688
#
0
/
1

a`^ab`^D
\
a]a`b``b_bbbaabb^abaa``^a_^_aa
\
]_VR

@HWI
-
EAS
266
_
0011
:
8
:
1
:
7
:
593
#
0
/
1

CAGTTCANTTCTCAGCACCACACTGGGATGCTCACACATGC
CTG

+HWI
-
EAS
266
_
0011
:
8
:
1
:
7
:
593
#
0
/
1

abbbb_VD[bbbba_`bbbbbbbbbbbaa_`bbaabaabb_aa
_

@HWI
-
EAS
266
_
0011
:
8
:
1
:
7
:
139
#
0
/
1

CATGGGGNATAATTGCAATCCCCGATCCCCATCACGAATGG
GGT

+HWI
-
EAS
266
_
0011
:
8
:
1
:
7
:
139
#
0
/
1

aab`[^YDY]Z
\
baa`aabaaaa`aa`a]aa```
\
aY]^
\
]ZVX

@HWI
-
EAS
266
_
0011
:
8
:
1
:
7
:
1390
#
0
/
1

GAATAATNGAATAGGACCGCGGTTCTATTTTGTTGGTTTTCG
GA

+HWI
-
EAS
266
_
0011
:
8
:
1
:
7
:
1390
#
0
/
1

_
U^b_`]D
\
__a_a`S```Y[a__]a
\
aa_`]`aTVZ__
\
HYVX

@HWI
-
EAS
266
_
0011
:
8
:
1
:
7
:
1663
#
0
/
1

TGATGTTNGTGGCAATAATGGGGGTAGCGGCAATGGTGGC
GGGG

+HWI
-
EAS
266
_
0011
:
8
:
1
:
7
:
1663
#
0
/
1

a`[_X]
\
DQTZ[^YYa[[aXV[PZUUYSYBBBBBBBBBBBBBBB

hybridisation

labelled

nucleic acid

array

RNA extract

Sample

Array design

hybridisation

labelled

nucleic acid

array

RNA extract

Sample

hybridisation

labelled

nucleic acid

array

RNA extract

Sample

hybridisation

labelled

nucleic acid

array

RNA extract

Sample

hybridisation


nucleic
acid

Microarray

Chromatin,
DNA
extract

Sample

Experiment

ChiP
-
Seq

MeDIP
-
Seq

Etc.

normalization

integration

Protocol

Protocol

Protocol

Protocol

Protocol

Protocol

genes

Sequencing is replacing array technology

@HWI
-
EAS
266
_
0011
:
8
:
1
:
6
:
969
#
0
/
1

GTTTGCCNGTGTGTACGCTACCCCCTTCTTGTGTGTGTGTG
TCT

+HWI
-
EAS
266
_
0011
:
8
:
1
:
6
:
969
#
0
/
1

_
abb`a[DZ`aabaa_a`b]___^^aa_`aa_a^a[
\
\
aZTZVY

@HWI
-
EAS
266
_
0011
:
8
:
1
:
7
:
1688
#
0
/
1

AAGATGANGGCAGGGTGCAAGATGGCAGGATGCAAGATGG
CAGG

+HWI
-
EAS
266
_
0011
:
8
:
1
:
7
:
1688
#
0
/
1

a`^ab`^D
\
a]a`b``b_bbbaabb^abaa``^a_^_aa
\
]_VR

@HWI
-
EAS
266
_
0011
:
8
:
1
:
7
:
593
#
0
/
1

CAGTTCANTTCTCAGCACCACACTGGGATGCTCACACATGC
CTG

+HWI
-
EAS
266
_
0011
:
8
:
1
:
7
:
593
#
0
/
1

abbbb_VD[bbbba_`bbbbbbbbbbbaa_`bbaabaabb_aa
_

@HWI
-
EAS
266
_
0011
:
8
:
1
:
7
:
139
#
0
/
1

CATGGGGNATAATTGCAATCCCCGATCCCCATCACGAATGG
GGT

+HWI
-
EAS
266
_
0011
:
8
:
1
:
7
:
139
#
0
/
1

aab`[^YDY]Z
\
baa`aabaaaa`aa`a]aa```
\
aY]^
\
]ZVX

@HWI
-
EAS
266
_
0011
:
8
:
1
:
7
:
1390
#
0
/
1

GAATAATNGAATAGGACCGCGGTTCTATTTTGTTGGTTTTCG
GA

+HWI
-
EAS
266
_
0011
:
8
:
1
:
7
:
1390
#
0
/
1

_
U^b_`]D
\
__a_a`S```Y[a__]a
\
aa_`]`aTVZ__
\
HYVX

@HWI
-
EAS
266
_
0011
:
8
:
1
:
7
:
1663
#
0
/
1

TGATGTTNGTGGCAATAATGGGGGTAGCGGCAATGGTGGC
GGGG

+HWI
-
EAS
266
_
0011
:
8
:
1
:
7
:
1663
#
0
/
1

a`[_X]
\
DQTZ[^YYa[[aXV[PZUUYSYBBBBBBBBBBBBBBB

MGED is now the

Functional Genomics Data Society


The Functional Genomics Data Society
-

FGED (
eff



jed
) Society, founded
in
1999
as the MGED Society, advocates for open access to genomic data
sets and works towards providing concrete solutions to achieve this. Our
goal is to assure that investment in functional genomics data generates
the maximum public benefit. Our work on defining minimum information
specifications for reporting data in functional genomics papers have
already enabled large data sets to be used and reused to their greater
potential in biological and medical research.


We work with other organizations to develop standards for biological
research data quality, annotation and exchange.


We facilitate the creation and use of software tools that build on these
standards and allow researchers to annotate and share their data easily.


We promote scientific discovery that is driven by genome wide and other
biological research data integration and meta
-
analysis.

The Functional Genomics Data Society

For more information see http://mged.org or http://fged.org

Look here for information on a new site in the coming year …

New Directions for FGED Standards


What information is needed for a
UHTS

experiment?


MINSEQE
: Minimal Information about a high throughput
SEQuencing Experiment.


http://www.mged.org/minseqe/


How do you annotate functional genomics data?


Annotare
: Tool to create MAGE
-
TAB.


http://code.google.com/p/annotare/


What words do you use to describe an
investigation
?


OBI
: Ontology for Biomedical Investigations.


http://obi
-
ontology.org/

Minimum Information about a high
-
throughput
Nucleotide
SeQuencing

Experiment


MINSEQE

(April,
2008
)


The description of the biological system and the particular
states that are studied


The sequence read data for each assay


The 'final' processed (or summary) data for the set of
assays in the study


The experiment design including sample data
relationships


General information about the experiment


Essential experimental and data processing protocols

Minimal Information for
Biological and Biomedical
Investigations


33
Projects registered as
of November,
2010

Related standards for exchange of
phenotype data for reuse in meta analysis.


statement of informed consent that the data
may (not) be used in additional analysis.


enrollment criteria / exclusion criteria


Population stratification: race/ethnicity/
strain, gender, age


tissue or cellular makeup of specimen


non
-
omics

phenotype for each participant
(subject or group)



Jennifer
Fostel

(NIEHS): Leader Phenotypes Working group

FGED

design
-
specific criteria

Different study or trial designs would have additional reporting requirements:



(A) study involving a disease


(A
1
) type and stage of disease


(A
2
) pertinent genotype / genomic conditions



(B) study involving chemical exposure


(B
1
) dose administered, dosing frequency and duration, route of administration


(B
2
) ID to permit unambiguous identification of chemical, i.e. Cas No



(C) study involving surgical intervention


(C
1
) details of intervention


(C
2
) timing with respect to pertinent events



(D) Observational or epidemiological study


(D
1
) demographic factors


(D
2
) pertinent medical history

Jennifer
Fostel

(NIEHS): Leader Phenotypes Working group

Annotare
-

An open source
standalone MAGE
-
TAB editor

Shankar R, Parkinson H, Burdett T, Hastings E, Liu J, Miller M,
Srinivasa R, White J, Brazma A, Sherlock G, Stoeckert CJ Jr, Ball CA.

Annotare

-

a tool for annotating high
-
throughput biomedical
investigations and resulting data.

Bioinformatics.
2010
Aug
23
.

MAGE
-
TAB Format

What’s MAGE
-
TAB?



MAGE
-
TAB is a simple spreadsheet view which has two files




IDF

-

describing the experiment design, contact details, variables and
protocols




SDRF

-

a spreadsheet with columns that describe samples, annotations,
protocol references, hybridizations and data


Linked data files, e.g. CEL files, these are referenced by the SDRF



For single channel data one row in the SDRF =
1
hybridization, for two channel
data one row =
1
channel


MAGE
-
TAB can also be used to annotate Next Gen Sequencing data



Where can I get MAGE
-
TAB from?



~
10
,
000
MAGE
-
TAB files are available for download from
ArrayExpress

(includes GEO
derived and
ArrayExpress

data)


caArray

also provides MAGE
-
TAB files for download
.


Who’s using MAGE
-
TAB?



BioConductor


GenePattern


MeV



IDF file for E
-
TABM
-
34

IDF = Investigation Description Format

SDRF file for E
-
TABM
-
34

SDRF = Sample and Data Relationship Format

Annotare

-

an open source MAGE
-
TAB Editor


Annotare

is an annotation tool for high throughput gene
expression experiments in MAGE
-
TAB format.

Researchers
can
describe their investigations with the investigators’ contact
details, experimental design, protocols that were employed,
references to publications, details of biological samples,
arrays, and experimental data produced in the investigation.




Annotare

Features



Intuitive graphical user interface forms for
editing


Ontology support, an inbuilt ontology and
web services connectivity to
bioportal


Searchable standard templates


Design wizard


Validation module



Mac
and Windows Support



http://
code.google.com/p/annotare
/

http://
genomics.betacell.org

supporting the Beta Cell Biology Consortium

MAGE
-
TAB is used to load datasets into Beta Cell Genomics


University of Pennsylvania

Chris Stoeckert

Elisabetta

Manduchi

Chiao
-
Feng

Lin

Emily Allen

Junmin

Liu




V
anderbilt

University

Mark Magnuson

Jean
-
Philippe
Cartailler

Thomas Houfek

Josh Norman

Chris Howard



http://
genomics.betacell.org

supporting the Beta Cell Biology Consortium

OBI


Ontology for Biomedical
Investigations

Brinkman RR, Courtot M, Derom D, Fostel JM, He Y, Lord P, Malone
J, Parkinson H, Peters B, Rocca
-
Serra P, Ruttenberg A, Sansone SA,
Soldatova LN, Stoeckert CJ Jr, Turner JA, Zheng J; OBI consortium.

Modeling biomedical experimental processes with OBI.

J Biomed Semantics.
2010
Jun
22
;
1
Suppl
1
:S
7
.

Ontology


A set of concepts within a domain and the
relationships between those concepts


Unambiguous description of a domain


Controlled vocabulary


Relations (logical constraints among terms)


Human readable and machine
interpretable


Uses


Annotation


Text mining


Semantic web


Reasoning



Gene Ontology (GO)

(generated by NCBO BioPortal)


The Gene ontology has been
developed to provide terms to
consistently describe the gene
and gene product attributes
across species and databases


Standardized annotation
facilitate data integration cross
various databases and enable
the ability to query and retrieve
genes or proteins based on their
shared biology

Explosion of biomedical
ontologies

available

http://
bioportal.bioontology.org
/

OBI


Ontology for Biomedical
Investigations


FGED is one of many communities contributing to OBI


Whereas the MGED Ontology is primarily a controlled
vocabulary for use with MAGE, OBI is a well
-
founded
ontology with logical definitions and restrictions to be
used for multiple purposes (e.g., database models, text
mining, file annotation)


OBI intends to be part of the OBO Foundry


Interoperable with Gene Ontology,
CheBI
, Phenotypic qualities
(PATO), Cell Type (CL)…


OBI is available through browsers like the NCBO
BioPortal


OBI and IAO (Information Artifact Ontology) classes are shown in blue. Classes
imported from other external ontologies are shown in red. Some example
subclasses, such as
PCR product

and
cell culture

are included to illustrate the use
of the class
processed material
.

Partial high level structure of OBI classes

Measuring the glucose concentration in blood

From The OBI Consortium, The Ontology for Biomedical Investigations, under revision

Using OBI and other OBO Foundry
ontologies

to capture phenotype data

Jie

Zheng
, Omar
Harb

Enhance data mining in functional genomics
resources

EuPathDB
: a portal to eukaryotic pathogen
databases.Aurrecoechea

C,
Brestelli

J,
Brunk

BP, Fischer S,
Gajria

B,
Gao

X,
Gingle

A, Grant G,
Harb

OS,
Heiges

M,
Innamorato

F,
Iodice

J, Kissinger JC, Kraemer ET, Li W, Miller JA,
Nayak

V, Pennington C,
Pinney

DF,
Roos

DS, Ross C,
Srinivasamoorthy

G, Stoeckert CJ
Jr
,
Thibodeau

R,
Treatman

C, Wang
H.Nucleic

Acids Res.
2010

Find T.
brucei

genes by phenotype based on
RNAi

knockdown and insufficiency studies.

TriTrypDB
: a functional genomic resource for the
Trypanosomatidae.Aslett

M,
Aurrecoechea

C,
Berriman

M,
Brestelli

J,
Brunk

BP, Carrington
M,
Depledge

DP, Fischer S,
Gajria

B,
Gao

X, Gardner MJ,
Gingle

A, Grant G,
Harb

OS,
Heiges

M, Hertz
-
Fowler C, Houston R,
Innamorato

F,
Iodice

J, Kissinger JC, Kraemer E, Li W, Logan FJ, Miller JA,
Mitra

S,
Myler

PJ,
Nayak

V, Pennington C,
Phan

I,
Pinney

DF,
Ramasamy

G, Rogers
MB,
Roos

DS, Ross C,
Sivam

D, Smith DF,
Srinivasamoorthy

G, Stoeckert CJ
Jr
, Subramanian S,
Thibodeau

R,
Tivey

A,
Treatman

C,
Velarde

G,
Wang
H.Nucleic

Acids Res.
2010

Making sense of
functional

genomics
data: Standards &
Ontologies


Enhance sharing data and understanding what
is shared by computers and humans.


Applications include:


Data mining of databases


Meta
-
analysis of multiple datasets