warwick - European Bioinformatics Institute (EBI) Home Page


Oct 2, 2013 (3 years and 9 months ago)


EMBL Outstation


The European Bioinformatics Institute

MIAME and ArrayExpress - a standard for microarray data
annotation and a database to store it

Helen Parkinson

Microarray Informatics Team

European Bioinformatics Institute
Hinxton


Three parts of my talk


Microarray data standards


Ontologies for gene expression data


ArrayExpress - a public database for
microarray data


Analysis tools at the EBI


The size of the datasets


Experiments:


~100 000 different transcripts in human


~320 cell types


2000 compounds


3 time points


2 concentrations


2 replicates


Data


8 x 10^11 data points

1 x 10^15 bytes = 1 petabyte for Affymetrix

(data from Jerry Lanfear)
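
The data-point estimate above is simply the product of the experimental factors listed on the slide. A minimal Python sketch of that arithmetic follows; the variable names are illustrative, and the bytes-per-data-point figure is only what the slide's two totals imply, not a number stated in the talk.

```python
# Experimental factors as listed on the slide.
transcripts = 100_000
cell_types = 320
compounds = 2_000
time_points = 3
concentrations = 2
replicates = 2

# Total number of measurements if every combination is assayed.
data_points = (transcripts * cell_types * compounds
               * time_points * concentrations * replicates)
print(f"{data_points:.2e} data points")  # 7.68e+11, i.e. roughly 8 x 10^11

# The slide quotes 1 x 10^15 bytes (1 petabyte) for Affymetrix data.
# Dividing gives the storage per data point that this implies (an inference).
total_bytes = 1e15
bytes_per_point = total_bytes / data_points
print(f"about {bytes_per_point:.0f} bytes per data point")
```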


Microarray data


Microarrays are widely used in experiments
and are already producing massive amounts of
data


These data have to be stored in a well
organised and standard way, if they are to
be accessed and analysed by the wide
research community


There is a general consensus that there is a
need for a public repository for microarray
data


It is much less clear what exactly should be
stored in such a repository



A gene expression database from
the data analyst’s point of view

[Diagram: a gene expression matrix of gene expression levels, indexed by genes and samples, with attached gene annotations and sample annotations]
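
The analyst's view above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class name, the field names, the gene and sample identifiers, and the choice of genes as rows and samples as columns are all hypothetical and are not taken from ArrayExpress.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ExpressionDataset:
    """Gene expression matrix plus the two annotation tables (hypothetical layout)."""
    genes: List[str]                 # row labels (assumed orientation)
    samples: List[str]               # column labels (assumed orientation)
    levels: List[List[float]]        # levels[i][j]: gene i measured in sample j
    gene_annotations: Dict[str, Dict[str, str]] = field(default_factory=dict)
    sample_annotations: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def level(self, gene: str, sample: str) -> float:
        """Look up one expression level by gene and sample name."""
        return self.levels[self.genes.index(gene)][self.samples.index(sample)]


# Made-up values, purely to show how the three parts fit together.
dataset = ExpressionDataset(
    genes=["geneA", "geneB"],
    samples=["sample1", "sample2"],
    levels=[[1.5, 2.0], [0.5, 4.0]],
    gene_annotations={"geneA": {"process": "example process"}},
    sample_annotations={"sample1": {"organism": "Mus musculus"}},
)
print(dataset.level("geneB", "sample2"))  # 4.0
```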


Three parts of a gene
expression database


Gene annotation: can be given by links to
gene sequence databases and GO
(function, process, cell compartment).
Not perfect, but let's not worry about it.


Sample annotation: we do not have any
external databases for sample description
(except species taxonomy). This is problem 1.


Gene expression matrix: what are the
measurement units for gene expression
levels? This is problem 2.



Problem/consideration 1: sample annotation


Gene expression data only have meaning in
the context of detailed sample descriptions


If the data is going to be interpreted by
independent parties, sample information has
to be searchable and in the database


Controlled vocabularies and ontologies
(species, cell types, compound
nomenclature, treatments, etc) are needed
for unambiguous sample description
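
To illustrate why controlled vocabularies matter for searchable sample descriptions, here is a minimal Python sketch. The vocabulary contents and the function name are illustrative assumptions; a real submission system would draw its terms from curated sources such as a species taxonomy.

```python
# Illustrative controlled vocabulary for the organism field (assumed contents).
ORGANISM_CV = {"Homo sapiens", "Mus musculus", "Saccharomyces cerevisiae"}


def check_term(value, vocabulary):
    """Return True when value is an exact member of the controlled vocabulary."""
    return value in vocabulary


# Three ways a submitter might describe the same species in free text.
free_text = ["mouse", "Mus musculus", "M. musculus"]

# Only the exact controlled term matches; the others would need curation.
unmatched = [term for term in free_text if not check_term(term, ORGANISM_CV)]
print(unmatched)  # ['mouse', 'M. musculus']
```

The two unmatched strings refer to the same organism as the controlled term, which is exactly the ambiguity a controlled vocabulary removes from database queries.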


Sample annotation - what can be done?


Few CVs and ontologies for sample
description are available (species taxonomy,
model organisms)


Some use of free text descriptions is
unavoidable (curation workload)


Existing efforts to create such ontologies
should be coordinated (MGED ontology
working group)


Use existing ontologies and CVs wherever
possible



Problem 2: the lack of gene
expression measurement units


What we would like to have


gene expression levels expressed in
some standard units (e.g. molecules per
cell)


reliability measure associated with each
value (e.g. standard deviation)


What have we got


each experiment using different units


no reliability information
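
As a small illustration of the kind of reliability measure wished for above, the following Python sketch derives a mean and a sample standard deviation per gene from replicate measurements. The gene names and values are made up, and the units are arbitrary; this is not a method prescribed in the talk.

```python
import statistics

# Hypothetical replicate measurements for two genes (arbitrary units).
replicates = {
    "geneA": [10.0, 12.0, 11.0],
    "geneB": [5.0, 9.0, 7.0],
}

# For each gene, keep the mean level together with its sample standard deviation.
summary = {
    gene: (statistics.mean(values), statistics.stdev(values))
    for gene, values in replicates.items()
}

for gene, (mean, sd) in summary.items():
    print(f"{gene}: mean={mean:.1f}, sd={sd:.1f}")
```

Without replicates there is nothing to compute the standard deviation from, which is why the slide pairs "no reliability information" with the current state of the data.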


Comparing expression data

[Three figure slides: the first carries the labels "cm" and "inc", the second the labels "?" and "?"; the third has no recoverable labels]


What to do in the absence of
standard measurement units?


Record raw, intermediate and final
analysis data together with the detailed
annotation of how the analysis has
been performed


This effectively passes on the
responsibility for interpreting the
final analysis data to the user


Three levels of microarray data processing

[Diagram: raw data (array scans); quantitation matrices (spots by quantitations, holding spot quantitations); gene expression data (genes by samples, holding gene expression levels)]


Measurement units


In the longer term:


standard controls for experiments (on chips
and in the samples) should be introduced


replicate measurements will become the norm


Temporary solution:


storing intermediate analysis results (including
the images) and annotations of how they were
obtained


Standards within experiments themselves
(standard controls and protocols)



Standards for microarray
data


Standards are needed to build a well
organised microarray database


Standards for annotation


Standards for data exchange


Standards for controls in the experiment
and data normalisation

www.dnachip.org/mged/normalization.html


How to create microarray
data standards

1. To understand thoroughly what is the
minimum information about a microarray
experiment that is needed to interpret it
unambiguously and what is the structure of
this information (objects and relationships)

2. To create the technical data format able to
capture this information

3. To find appropriate controlled vocabularies


Standardisation of microarray
data and annotations - MGED group

The goal of the group is to facilitate the adoption
of standards for DNA-array experiment
annotation and data representation, as well as
the introduction of standard experimental
controls and data normalisation methods.
The group includes most of the world's largest microarray
laboratories and companies (TIGR, Affymetrix,
Stanford, Sanger, Agilent, etc.)




www.mged.org




MGED


MGED 2 meeting in Heidelberg in 2000,
MGED 3 in Stanford in 2001, both ~ 300
participants


Minimum Information About a Microarray
Experiment


MIAME version 1.0 posted


Collaboration with OMG on data formats:
MAML + GEML = MAGE-ML and MAGE-OM


MGED 4 meeting in February 2002, Boston


MGED will become an ISCB Special Interest
Group



MIAME: Minimum Information About a Microarray Experiment

www.mged.org

6 parts of a microarray experiment: Experiment, Array, Sample, Hybridisation, Data, Normalisation

External links: Publication, Gene (e.g., EMBL), Source (e.g., Taxonomy)


sample source and treatment ID as used in section 1

organism (NCBI taxonomy)

additional "qualifier, value, source" list; the list includes:

cell source - provider

type (if derived from primary source(s))

sex

age

growth conditions

development stage

organism part (tissue)

animal/plant strain or line

genetic variation (e.g., gene knockout, transgenic variation)

individual

individual genetic characteristics (e.g., disease alleles, polymorphisms)

disease state or normal

target cell type

cell line and source (if applicable)

in vivo treatments (organism or individual treatments)

in vitro treatments (cell culture conditions)

treatment type (e.g., small molecule, heat shock, cold shock, food deprivation)

compound

is additional clinical information available (link)

separation technique (e.g., none, trimming, microdissection, FACS)

laboratory protocol for sample treatment……


MIAME Section on Sample Source and Treatment


What is an ontology?


An ontology is a specification of
concepts that includes the relationships
between those concepts.


Provides semantics and constraints


Allows for computational inferences and
reliable comparisons



MGED Biomaterial Ontology


Under construction by Chris Stoeckert


Using OILed (may use others)


Motivated by MIAME and coordinated
with the database model


Extend classes, provide constraints,
define terms, provide terms to
use, develop CVs for submissions (EBI)


Use case scenario


Ontology Example


Concept=Age def=in standard units
referenced to an identifiable time point
from (class) developmental stage


Age=6 {units=days}, {dev_stage}=dauer


Hierarchy=Dev_stage->larva->dauer
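
The hierarchy in this example is what makes computational inference possible: a sample annotated as "dauer" can also be found by a query for "larva". A minimal Python sketch, assuming a simple child-to-parent mapping (the function names and the dictionary layout are illustrative, not the MGED ontology's actual representation):

```python
# Child -> parent links taken from the slide's hierarchy Dev_stage -> larva -> dauer.
PARENT = {"larva": "Dev_stage", "dauer": "larva"}


def ancestors(term):
    """Walk up the hierarchy and return all ancestors, nearest first."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


def is_a(term, ancestor):
    """True when term equals ancestor or sits below it in the hierarchy."""
    return term == ancestor or ancestor in ancestors(term)


# The annotated sample from the slide: Age=6 {units=days}, {dev_stage}=dauer.
sample = {"age": {"value": 6, "units": "days"}, "dev_stage": "dauer"}

print(ancestors("dauer"))                  # ['larva', 'Dev_stage']
print(is_a(sample["dev_stage"], "larva"))  # True
```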


Excerpts from a Sample Description

courtesy of M. Hoffman, S. Schmidtke, Lion BioSciences

Organism: Mus musculus [NCBI taxonomy browser]

Cell source: in-house bred mice (contact: person@somewhere.ac.uk)

Sex: female [MGED]

Age: 3-4 weeks after birth [MGED]

Growth conditions: normal
controlled environment
20-22 °C average temperature
housed in cages according to EU legislation
specified pathogen free conditions (SPF)
14 hours light cycle
10 hours dark cycle

Developmental stage: stage 28 (juvenile (young) mice) [GXD "Mouse Anatomical Dictionary"]

Organism part: thymus [GXD "Mouse Anatomical Dictionary"]

Strain or line: C57BL/6 [International Committee on Standardized Genetic Nomenclature for Mice]

Genetic Variation: Inbr (J) 150. Origin: substrains 6 and 10 were separated prior to 1937. This substrain is now probably the most widely used of all inbred strains. Substrain 6 and 10 differ at the H9, Igh2 and Lv loci. Maint. by J,N, Ola. [International Committee on Standardized Genetic Nomenclature for Mice]

Treatment: in vivo [MGED]
intraperitoneal injection of Dexamethasone into mice, 10 microgram per 25 g bodyweight of the mouse

Compound: drug [MGED]
synthetic glucocorticoid Dexamethasone, dissolved in PBS
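
The excerpt above follows the MIAME "qualifier, value, source" pattern. A minimal Python sketch of how a few of these entries might be held in a program follows; the `Annotation` type and its field names are assumptions made for illustration, and entries shown without a bracketed source on the slide are recorded with `None`.

```python
from typing import NamedTuple, Optional


class Annotation(NamedTuple):
    qualifier: str
    value: str
    source: Optional[str]  # None marks an entry with no cited vocabulary


# A subset of the sample description from the slide.
sample = [
    Annotation("Organism", "Mus musculus", "NCBI taxonomy browser"),
    Annotation("Sex", "female", "MGED"),
    Annotation("Growth conditions", "normal", None),
    Annotation("Organism part", "thymus", 'GXD "Mouse Anatomical Dictionary"'),
    Annotation("Strain or line", "C57BL/6",
               "International Committee on Standardized Genetic Nomenclature for Mice"),
]

# Entries without a source are the ones that rely on free text.
free_text = [a.qualifier for a in sample if a.source is None]
print(free_text)  # ['Growth conditions']
```

Keeping the source alongside each value makes it possible to tell which fields can be queried against an external vocabulary and which will need curation.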


ArrayExpress conceptual model

[Diagram: Experiment, Array, Sample, Hybridisation, Data and Normalisation, with external links to Publication, Gene (e.g., EMBL) and Source (e.g., Taxonomy)]


ArrayExpress object model


ArrayExpress: the state of the art


ArrayExpress object model supporting
MIAME requirements developed


Data model implemented in Oracle


Data loader from MAML file format


Expression Profiler: data analysis tool
already available



ArrayExpress: plans and schedule


EU grant - new staff being recruited


A web based query interface - under development


A web based submission tool - under test


Participation in OMG: MAGE-OM & MAGE-ML


MAGE-ML will replace MAML in October


Full scale database operation expected to
start at the beginning of 2002


Expression Profiler to link to ArrayExpress


Microarray data analysis


Expression Profiler - a web based gene expression data analysis tool:
http://www.ebi.ac.uk/microarray/

[Diagram: Expression Profiler components and their inputs]

Expression data

EPCLUST (cluster expression profiles)

GENOMES: sequence, function, annotation

SPEXS (Sequence Pattern Exhaustive Search): novel patterns

PATMATCH: known patterns

URLMAP: provide links

External data, tools: pathways, function, etc.


Conclusions


Microarray standardisation is a challenge
and an imperative


Join MGED to contribute to this process
www.mged.org


Participate in the development of ontologies
and controlled vocabularies


Send me your protocols


Make your data available


Give feedback on MIAME; it’s up for discussion




Acknowledgments


Microarray Informatics Team, EBI


Alvis Brazma, Katja Kivinen, Helen Parkinson, Olga Perez,
Johan Rung, Ugis Sarkans, Thomas Schlitt, Mohammad Shojatalab,
Lev Soinov, Koichi Tazaki, Jaak Vilo


Industry Support team, EBI


Alan Robinson


MGED steering committee


MIAME working group


Chris Stoeckert, U. Penn. and MGED


Useful URLs


www.mged.org


www.tigr.org


www.ebi.ac.uk/array


www.geneontology.org


www.hgmp.mrc.ac.uk


www.dnachip.org/mged/normalization.html


parkinson@ebi.ac.uk